• Title/Summary/Keyword: sparse principal component analysis (희소주성분분석)

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics, v.36 no.6, pp.501-514, 2023
  • When analyzing high-dimensional data such as text data, using all the variables as explanatory variables can cause statistical learning procedures to over-fit. Computational efficiency can also deteriorate when the number of variables is large. Dimensionality reduction techniques such as feature selection and feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least squares method that employs an elastic net-type objective function. SPCA can be used to remove insignificant principal components and to identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on SPCA. Applying the proposed procedure to real data, we find that the reduced feature set retains sufficient information from the text data while its size shrinks through the removal of redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
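
For readers unfamiliar with the "elastic net-type objective function" mentioned in the abstract, the widely used SPCA criterion of Zou, Hastie, and Tibshirani (2006) is sketched below. Whether the paper uses exactly this form and notation is an assumption; here x_i is the i-th observation (a p-dimensional term vector), A and B are p × k matrices, β_j is the j-th column of B, and λ and λ_{1,j} are the ridge and lasso penalty weights.

```latex
(\hat{A}, \hat{B}) = \arg\min_{A,\,B}
    \sum_{i=1}^{n} \lVert x_i - A B^{\top} x_i \rVert_2^2
  + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert_2^2
  + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_1
\quad \text{subject to } A^{\top} A = I_k
```

The normalized columns of B serve as sparse loadings. Under this reading, a word whose loading is zero in every retained component contributes nothing to the reduced representation and is a natural candidate for removal.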

Sparse Web Data Analysis Using MCMC Missing Value Imputation and PCA Plot-based SOM (MCMC 결측치 대체와 주성분 산점도 기반의 SOM을 이용한 희소한 웹 데이터 분석)

  • Jun, Sung-Hae;Oh, Kyung-Whan
    • The KIPS Transactions: Part D, v.10D no.2, pp.277-282, 2003
  • Knowledge discovery from the web has been studied extensively. However, web log data are difficult to use as training data for efficient predictive models. In this paper, we study a method that removes sparseness from web log data and then clusters web users. The sparseness is removed by missing value imputation using Bayesian inference with MCMC. Web user clustering is then performed with self-organizing maps based on a 3-D plot of the principal components. Finally, experiments on KDD Cup data illustrate the problem-solving process and evaluate performance.

Study on Principal Sentiment Analysis of Social Data (소셜 데이터의 주된 감성분석에 대한 연구)

  • Jang, Phil-Sik
    • Journal of the Korea Society of Computer and Information, v.19 no.12, pp.49-56, 2014
  • In this paper, we propose a method for identifying hidden principal sentiments in large-scale text from documents, social data, the internet, and blogs by analyzing the standard language, slang, argot, abbreviations, and emoticons in those texts. IRLBA (Implicitly Restarted Lanczos Bidiagonalization Algorithm) is used for principal component analysis of the large-scale sparse matrix. The proposed system consists of modules for data acquisition, message analysis, sentiment evaluation, sentiment analysis and integration, and result visualization. The suggested approach would help to improve the accuracy and expand the application scope of sentiment analysis of social data.
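
To illustrate why a sparse matrix suits this kind of analysis, the sketch below finds the leading singular triplet of a sparse matrix with plain power iteration. This is a much simpler stand-in for IRLBA, not IRLBA itself, and it skips the mean-centering that a full PCA would apply. The function name `top_singular_triplet` and the list-of-dicts storage format are assumptions made for this example.

```python
import math
import random


def top_singular_triplet(rows, n_cols, n_iter=200, seed=0):
    """Leading singular value and vectors of a sparse matrix A.

    rows: list of {column_index: value} dicts, one dict per row of A.
    Uses power iteration on A^T A; only stored (non-zero) entries are touched.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(n_iter):
        # u = A v
        u = [sum(val * v[j] for j, val in row.items()) for row in rows]
        # w = A^T u
        w = [0.0] * n_cols
        for ui, row in zip(u, rows):
            for j, val in row.items():
                w[j] += val * ui
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # A v is zero; nothing more to refine
            break
        v = [x / norm for x in w]

    # Recover sigma and the left singular vector from the converged v.
    u = [sum(val * v[j] for j, val in row.items()) for row in rows]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


if __name__ == "__main__":
    # Tiny document-term matrix: 3 documents, 4 terms, mostly zeros.
    docs = [{0: 2.0, 1: 1.0}, {1: 1.0, 2: 3.0}, {3: 1.0}]
    sigma, u, v = top_singular_triplet(docs, n_cols=4)
    print(sigma, v)
```

Each iteration costs time proportional to the number of non-zero entries, which is the property that makes Lanczos-type methods such as IRLBA practical for large document-term matrices.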

Comparison of Fish Distribution Characteristics by Substrate Structure in the 4 Streams (하상구조에 따른 4개 하천의 어류 분포 특성 비교)

  • Yoon, Seok-Jin;Choi, Jun-Kil;Lee, Hwang-Goo
    • Korean Journal of Environment and Ecology, v.28 no.3, pp.302-313, 2014
  • This study compared fish distribution characteristics between sand-type and cobble-type streams by surveying four selected streams every season. Twenty-four Korean endemic species, including Acheilognathus gracilis, were collected during the survey period. The dominant species in Hongcheon stream and Muju Namdae stream was Zacco koreanus, accounting for 39.9% and 28.4%, respectively; the dominant species in Yanghwa stream was Rhodeus notatus (13.6%), and that in Gap stream was Z. platypus (26.0%). In the community analysis, the dominance index was 0.27~0.63, the diversity index 1.92~2.67, the evenness index 0.6~0.79, and the richness index 3.09~3.53; the dominance index was highest in Hongcheon stream, while the diversity, evenness, and richness indices were highest in Yanghwa stream. In the tolerance guild analysis, sensitive species made up a relatively high share, 50.1% and 46.4%, in Hongcheon stream and Muju Namdae stream, which have a variety of substrates, whereas they were scarce, at 0.5% and 5.3%, in Yanghwa stream and Gap stream, which have more sand substrate. In the similarity analysis based on the species, populations, and substrate structures of the fish found in each stream, the cobble-type streams, Hongcheon stream and Muju Namdae stream, were the most similar: 50.4% in species and population and 95.2% in bed structure. In the IBI analysis, Hongcheon stream and Muju Namdae stream were rated 'Class A' and Yanghwa stream and Gap stream 'Class B', and principal components analysis separated the cobble-type and sand-type streams into two groups.
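
The abstract reports dominance, diversity, evenness, and richness indices without naming the formulas. The sketch below computes one common set of definitions from per-species catch counts: Shannon diversity, Pielou evenness, Margalef richness, and McNaughton dominance (the share of the two most abundant species). That the study used these particular definitions is an assumption, and the function name `community_indices` is hypothetical.

```python
import math


def community_indices(counts):
    """Community indices from a list of per-species individual counts.

    Assumed definitions (not confirmed by the abstract):
      diversity  H' = -sum(p_i * ln p_i)        (Shannon)
      evenness   J' = H' / ln S                 (Pielou)
      richness   R  = (S - 1) / ln N            (Margalef)
      dominance  DI = (n1 + n2) / N             (McNaughton)
    """
    counts = sorted((c for c in counts if c > 0), reverse=True)
    if not counts:
        raise ValueError("at least one species with a positive count is required")

    total = sum(counts)      # N: total number of individuals
    n_species = len(counts)  # S: number of species observed
    props = [c / total for c in counts]

    diversity = -sum(p * math.log(p) for p in props)
    evenness = diversity / math.log(n_species) if n_species > 1 else 0.0
    richness = (n_species - 1) / math.log(total) if total > 1 else 0.0
    dominance = sum(counts[:2]) / total

    return {
        "dominance": dominance,
        "diversity": diversity,
        "evenness": evenness,
        "richness": richness,
    }


if __name__ == "__main__":
    # Hypothetical catch counts for three species in one stream.
    print(community_indices([50, 30, 20]))
```

With these definitions, a stream dominated by one or two species scores high on dominance and low on diversity and evenness, which matches the opposite rankings the abstract reports for Hongcheon and Yanghwa streams.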

Feature Selection for Anomaly Detection Based on Genetic Algorithm (유전 알고리즘 기반의 비정상 행위 탐지를 위한 특징선택)

  • Seo, Jae-Hyun
    • Journal of the Korea Convergence Society, v.9 no.7, pp.1-7, 2018
  • Feature selection, a data preprocessing technique, is a major research area in many applications that deal with large datasets. It has been used in pattern recognition, machine learning, and data mining, and is now widely applied in fields such as text classification, image retrieval, intrusion detection, and genome analysis. The proposed method is based on a genetic algorithm, a meta-heuristic algorithm. There are two approaches to finding feature subsets: filter methods and wrapper methods. In this study, we use a wrapper method, which evaluates feature subsets with an actual classifier, to find an optimal feature subset. The training dataset used in the experiment has a severe class imbalance, which makes it difficult to improve classification performance for rare classes. After preprocessing the training dataset with SMOTE, we select features and evaluate them with various machine learning algorithms.
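
The wrapper idea in the abstract can be sketched as a small genetic algorithm in which each individual is a boolean feature mask and fitness is the accuracy of a classifier trained on the masked features. Everything below is an illustrative assumption rather than the paper's setup: the classifier (leave-one-out 1-nearest-neighbor), the GA operators and parameters, the toy data, and all function names. The SMOTE preprocessing step is not shown.

```python
import random


def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out accuracy of 1-nearest-neighbor on the selected columns."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_d, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Search for a feature mask with a simple elitist GA (needs >= 2 features)."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return loo_1nn_accuracy(X, y, mask)

    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated crossover children.
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def make_toy_data(n_per_class=20, n_noise=3, seed=0):
    """Two classes; column 0 is informative, the remaining columns are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = rng.gauss(3.0 * label, 0.5)
            noise = [rng.gauss(0.0, 2.0) for _ in range(n_noise)]
            X.append([informative] + noise)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, acc = ga_select(X, y)
    print("selected columns:", [j for j, keep in enumerate(mask) if keep], "accuracy:", acc)
```

Because the classifier is rerun for every candidate mask, a wrapper like this is far more expensive than a filter method, which is the usual trade-off for fitting the subset to the classifier that will actually be used.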