• Title/Summary/Keyword: latent feature model

Optimal supervised LSA method using selective feature dimension reduction (선택적 자질 차원 축소를 이용한 최적의 지도적 LSA 방법)

  • Kim, Jung-Ho;Kim, Myung-Kyu;Cha, Myung-Hoon;In, Joo-Ho;Chae, Soo-Hoan
    • Science of Emotion and Sensibility / v.13 no.1 / pp.47-60 / 2010
  • Most classification research has relied on kNN (k-Nearest Neighbor) and SVM (Support Vector Machine), known as learning-based models, and on Bayesian classifiers and NNA (Neural Network Algorithm), known as statistics-based methods. However, these approaches face space and time limitations when classifying the very large number of web pages on today's Internet. Moreover, most classification studies use a uni-gram feature representation, which does not capture the real meaning of words well. Korean web page classification poses an additional problem because many Korean words have multiple meanings (polysemy). For these reasons, LSA (Latent Semantic Analysis) has been proposed for classification in this setting (large data sets and polysemous words). LSA uses SVD (Singular Value Decomposition), which decomposes the original term-document matrix into three matrices and reduces their dimensions. This makes it possible to build a new low-dimensional semantic space for representing vectors, which makes classification efficient and exposes the latent meaning of words and documents (or web pages). Although LSA is good at classification, it has a drawback: when SVD reduces the dimensions of the matrix and creates the new semantic space, it considers which dimensions represent the vectors well, not which dimensions discriminate between them well. This is why LSA does not improve classification performance as much as expected. In this paper, we propose a new LSA that selects the optimal dimensions for both discriminating and representing vectors, minimizing this drawback and improving performance. The proposed method shows better and more stable performance than other LSA variants in low-dimensional spaces. In addition, we obtain further improvement in classification by creating and selecting features through stopword removal and statistical weighting of specific values.

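The decomposition described in the abstract above can be written out as a worked equation. This is the standard truncated-SVD form of LSA, not the paper's own selection procedure, which the abstract does not specify. The symbols $A$, $U$, $\Sigma$, $V$, $k$, $d$ and the index set $S$ are notation assumed here for illustration, and the last line is only a hypothetical way to express "select dimensions by discrimination rather than by singular value".

```latex
\begin{align*}
% Standard LSA: SVD of the m x n term-document matrix A (rank r)
A &= U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \dots \ge \sigma_r > 0 \\
% Rank-k truncation keeps the k largest singular values
A_k &= U_k \Sigma_k V_k^{\top} \\
% A document vector d (length m) folded into the k-dimensional space
\hat{d} &= \Sigma_k^{-1} U_k^{\top} d \\
% Hypothetical selective variant: keep an index set S chosen by a
% class-discrimination score instead of the top-k singular values
\hat{d}_S &= \Sigma_S^{-1} U_S^{\top} d, \qquad S \subseteq \{1, \dots, r\}
\end{align*}
```

The contrast between the third and fourth lines is the point of the abstract: the top-$k$ singular values minimize reconstruction error, which is a representation criterion and not a classification criterion.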

A Study on Classification of Variant Malware Family Based on ResNet-Variational AutoEncoder (ResNet-Variational AutoEncoder기반 변종 악성코드 패밀리 분류 연구)

  • Lee, Young-jeon;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.22 no.2 / pp.1-9 / 2021
  • Traditionally, most malicious code has been analyzed using feature information extracted by domain experts. However, this feature-based analysis depends on the analyst's skill and is limited in detecting variant malware, that is, existing malware that has been modified. In this study, we propose a ResNet-Variational AutoEncoder-based classification method that can classify variant malware into families without domain expert intervention. A Variational AutoEncoder network can generate new data within a normal distribution and learns the characteristics of the training data it receives as input well. In this study, important features of malicious code were extracted by taking the latent variables learned by the Variational AutoEncoder. In addition, transfer learning was applied to learn the characteristics of the training data better and to make learning more efficient: the parameters of a ResNet-152 model pre-trained on the ImageNet dataset were transferred to the Encoder Network. The ResNet-Variational AutoEncoder with transfer learning outperformed the existing Variational AutoEncoder and learned more efficiently. An ensemble model, a Stacking Classifier, was used to classify the variant malware. Training the Stacking Classifier on the feature data extracted by the Encoder Network of the ResNet-VAE model yielded an accuracy of 98.66% and an F1-Score of 98.68.
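As a rough illustration of what "latent variables within a normal distribution" means for a Variational AutoEncoder, the sketch below shows the two generic building blocks: sampling a latent vector with the reparameterization trick, and the KL term that pulls the encoder output toward a standard normal. It is a minimal pure-Python sketch and not the paper's ResNet-VAE; the encoder outputs `mu` and `log_var` are made-up values standing in for what an Encoder Network would produce.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over the latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Hypothetical encoder output for one sample (2 latent dimensions).
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

In a setup like the one described above, a vector such as `z` (or `mu` itself) would be the feature passed on to the downstream classifier.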

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one way to handle big data in text mining. When reducing dimensionality, we should consider the density of the data, which strongly influences sentence classification performance. Higher-dimensional data requires more computation, which can lead to high computational cost and overfitting. A dimension reduction step is therefore needed to improve model performance. Diverse methods have been proposed, ranging from simply reducing noise in the data, such as misspellings and informal text, to incorporating semantic and syntactic information. In addition, how text features are expressed and selected affects classifier performance in sentence classification, one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. Word embeddings, which learn low-dimensional vector representations of words that capture semantic and syntactic information, are also used. To improve performance, recent studies have suggested modifying the word dictionary according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm identifies words as unimportant, we assume that words similar to them also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification, which perform selective word elimination under specific rules and construct word embeddings based on Word2Vec.
To select words of low importance from the text, we use information gain to measure importance and cosine similarity to search for similar words. First, we eliminate words with comparatively low information gain from the raw text and build the word embedding. Second, we additionally select, for elimination, words that are similar to the words with low information gain and build the word embedding. Finally, the filtered text and word embeddings are fed to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset with the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes was over 70% were classified as helpful reviews. Yelp, however, shows only the number of helpful votes. We extracted 100,000 reviews with more than five helpful votes by random sampling from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared them against Word2Vec and GloVe word embeddings that used all the words. One of the proposed methods performed better than the embeddings with all the words: removing unimportant words improves performance, but removing too many words lowers it. Future research should consider diverse preprocessing methods and an in-depth analysis of word co-occurrence for measuring similarity between words. Also, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed methods, making it possible to identify suitable combinations of word embedding and elimination methods.
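The two elimination steps in the abstract above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the four toy documents, the 2-dimensional `vectors` (standing in for Word2Vec embeddings), the function name `select_words_to_remove`, and both thresholds are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(word, docs):
    """Information gain of the binary feature 'word occurs in the document'."""
    n = len(docs)
    labels = [y for _, y in docs]
    with_w = [y for toks, y in docs if word in toks]
    without_w = [y for toks, y in docs if word not in toks]
    conditional = ((len(with_w) / n) * entropy(with_w)
                   + (len(without_w) / n) * entropy(without_w))
    return entropy(labels) - conditional


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_words_to_remove(docs, vectors, ig_threshold, sim_threshold):
    """Step 1: low-information-gain words. Step 2: words similar to them."""
    vocab = {w for toks, _ in docs for w in toks}
    low_ig = {w for w in vocab if information_gain(w, docs) < ig_threshold}
    similar = set()
    for w in vocab - low_ig:
        if any(cosine(vectors[w], vectors[u]) >= sim_threshold for u in low_ig):
            similar.add(w)
    return low_ig, similar


# Toy labelled corpus (1 = helpful, 0 = not helpful); purely illustrative.
docs = [
    (["great", "this", "book"], 1),
    (["great", "that", "story"], 1),
    (["boring", "this", "story"], 0),
    (["boring", "book"], 0),
]
# Hypothetical 2-d embeddings; "this" and "that" point in similar directions.
vectors = {
    "great": [1.0, 0.0], "boring": [-1.0, 0.0],
    "this": [0.0, 1.0], "that": [0.1, 0.99],
    "book": [0.7, 0.7], "story": [-0.7, 0.7],
}

low_ig, similar = select_words_to_remove(docs, vectors, 0.2, 0.9)
removed = low_ig | similar
filtered = [[w for w in toks if w not in removed] for toks, _ in docs]
```

Here "that" has enough information gain to survive step 1 only because it happens to occur in a single positive document; step 2 removes it because its vector is close to the low-gain word "this". This is the situation the second proposed method targets.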

An Algorithm of Fingerprint Image Restoration Based on an Artificial Neural Network (인공 신경망 기반의 지문 영상 복원 알고리즘)

  • Jang, Seok-Woo;Lee, Samuel;Kim, Gye-Young
    • Journal of the Korea Academia-Industrial cooperation Society / v.21 no.8 / pp.530-536 / 2020
  • Fingerprint readers that use minutiae are robust against presentation attacks, but their weakness is a high mismatch rate. Minutiae therefore tend to be used together with skeleton images. Security vulnerabilities in minutiae features have been studied extensively, but few studies address vulnerabilities of the skeleton, so this study analyzes the vulnerability of the skeleton to presentation attacks. To this end, we propose a method that recovers the original fingerprint from the skeleton using a learning algorithm. The proposed method includes a new learning model that adds a latent vector to the existing Pix2Pix model, thereby generating a natural-looking fingerprint. In the experiments, the original fingerprint is restored with the proposed model, and the restored fingerprint is then fed to the fingerprint reader, which achieves a good recognition rate. This study thus verifies that fingerprint readers using the skeleton are vulnerable to presentation attacks. The approach presented in this paper is expected to be useful in a variety of applications involving fingerprint restoration, video security, and biometrics.

The Research Features Analysis of Leisure and Recreation based on Co-authors Network and Topic Model (공저자 네트워크 및 토픽 모델링 기반 여가레크리에이션 학술 연구 특징 분석)

  • Park, SungGeon;Park, Kwang-Won;Kang, Hyun-Wook
    • The Korean Journal of Physical Education: Humanities and Social Sciences (한국체육학회지인문사회과학편) / v.57 no.2 / pp.279-289 / 2018
  • The purpose of this study is to investigate the features of leisure and recreation research in The Korean Journal of Physical Education using co-author network analysis and topic modeling, with Word Cloud and LDA (Latent Dirichlet Allocation) topic modeling. The data collected for this study are 2,697 papers published online in The Korean Journal of Physical Education from January 2008 to March 2017. To explore research trends, 369 related documents were analyzed, with the analysis targets ordered as first author, corresponding author, co-author 1, co-author 2, ..., co-author n. The co-author network analysis found that 451 researchers were linked in the research network; on average, a researcher had 1.52 relationships, and the average distance between researchers was 2.33. Degree centrality was highest, in order, for Lee. K. M., Hwang. S. H., H., Lee. C. S., and closeness centrality was highest for Seo K. B., Han. J. H., Kim. K. J. Finally, betweenness centrality was highest for Lee. C. W. followed by Seo. K. B., who most actively connected the researchers of leisure-related academic papers. Future research calls for discussion among scholars about the trend and direction of leisure research.
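The two network statistics quoted in the abstract above (average number of relationships and average distance between researchers) can be computed from author lists as in the sketch below. It assumes that "relationships" means node degree and "distance" means shortest-path length between connected researchers; the abstract does not state the exact definitions, and the `papers` list with authors A to D is toy data, not the study's data.

```python
from collections import deque
from itertools import combinations


def build_coauthor_graph(papers):
    """Undirected graph: authors are nodes, sharing a paper creates an edge."""
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            graph.setdefault(a, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of distinct co-authors per researcher."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def average_distance(graph):
    """Mean shortest-path length over all connected pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1      # reachable nodes other than the source
    return total / pairs if pairs else 0.0


# Toy author lists: a simple chain A - B - C - D.
papers = [["A", "B"], ["B", "C"], ["C", "D"]]
graph = build_coauthor_graph(papers)
avg_degree = average_degree(graph)
avg_distance = average_distance(graph)
```

For the chain above, the end authors have one co-author each and the middle authors have two, so the average degree is 1.5.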

Analysis of Interactions in Multiple Genes using IFSA(Independent Feature Subspace Analysis) (IFSA 알고리즘을 이용한 유전자 상호 관계 분석)

  • Kim, Hye-Jin;Choi, Seung-Jin;Bang, Sung-Yang
    • Journal of KIISE:Computer Systems and Theory / v.33 no.3 / pp.157-165 / 2006
  • Changes in the external and internal factors of a cell require specific biological functions to maintain life. Such functions cause particular genes to interact with and regulate each other in multiple ways. Accordingly, we applied a linear decomposition model, IFSA, which derives hidden variables called 'expression modes' that correspond to these functions. To interpret gene interaction and regulation, we used a cross-correlation method given an expression mode. Linear decomposition models such as principal component analysis (PCA) and independent component analysis (ICA) have been shown to be useful for analyzing high-dimensional DNA microarray data, compared with clustering methods. These methods assume that gene expression is controlled by a linear combination of uncorrelated or independent latent variables. However, they have difficulty grouping similar patterns that are slightly time-delayed or asymmetric, since only exactly matched patterns are considered. To overcome this, we employ the IFSA method of [1] to locate phase- and shift-invariant features. Membership scoring functions play an important role in classifying genes, since linear decomposition models basically aim at data reduction, not at grouping data. We introduce a new function essential to the IFSA method. In this paper we stress that IFSA is useful for grouping functionally related genes in the presence of time shifts and expression phase variance. Ultimately, we propose a new approach to investigating the multiple-interaction information of genes.
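The time-delay problem mentioned in the abstract above can be illustrated with a plain lagged cross-correlation. The sketch below is not IFSA and not the paper's membership scoring function; it only shows why a zero-lag comparison misses a delayed pattern and how scanning over lags recovers it. The two expression profiles `gene_a` and `gene_b` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] over the overlapping window."""
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    return pearson(xs, ys)


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] with the highest correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: lagged_correlation(x, y, lag))


# Hypothetical profiles: gene_b repeats gene_a two time points later.
gene_a = [0, 1, 3, 1, 0, 0, 0, 0]
gene_b = [0, 0, 0, 1, 3, 1, 0, 0]

zero_lag = lagged_correlation(gene_a, gene_b, 0)
lag = best_lag(gene_a, gene_b, 3)
at_best = lagged_correlation(gene_a, gene_b, lag)
```

At lag 0 the two profiles look anti-correlated even though one is a delayed copy of the other, which is the kind of mismatch that motivates shift-invariant features.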