Search | Korea Science

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

Won Son
- The Korean Journal of Applied Statistics / v.36 no.6 / pp.501-514 / 2023
When analyzing high-dimensional data such as text data, using all variables as explanatory variables can cause statistical learning procedures to over-fit. Furthermore, computational efficiency can deteriorate with a large number of variables. Dimensionality reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least squares method that employs an elastic net-type objective function. SPCA can be used to remove insignificant principal components and identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on SPCA. Applying the proposed procedure to real data, we find that the reduced feature set retains sufficient information from the text data while its size is reduced by removing redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
https://doi.org/10.5351/KJAS.2023.36.6.501
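To make the idea of selecting variables through sparse loadings concrete, here is a minimal sketch. It is not the elastic net-type SPCA described in the abstract: it swaps in a simpler soft-thresholded power iteration on the sample covariance matrix, and the function names and the penalty `lam` are hypothetical choices made for illustration. Columns of `X` stand for terms in a document-term matrix.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_leading_loading(X, lam, n_iter=200):
    """Sparse approximation of the first loading vector.

    Assumption: soft-thresholded power iteration, a stand-in for
    (not an implementation of) the elastic net-type SPCA objective.
    """
    Xc = X - X.mean(axis=0)                # center each column
    C = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
    w = np.linalg.eigh(C)[1][:, -1]        # start from the ordinary leading eigenvector
    for _ in range(n_iter):
        w_new = soft_threshold(C @ w, lam)
        norm = np.linalg.norm(w_new)
        if norm == 0.0:                    # penalty removed every variable
            return np.zeros_like(w)
        w = w_new / norm
    return w

def select_features(X, lam):
    """Indices of the columns that keep a nonzero loading."""
    return np.flatnonzero(sparse_leading_loading(X, lam))
```

Columns whose loading is shrunk exactly to zero are dropped, so a larger `lam` yields a smaller feature set. A suitable value depends on the scale of the data and would normally be tuned.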

A Criterion for the Selection of Principal Components in the Robust Principal Component Regression (로버스트주성분회귀에서 최적의 주성분선정을 위한 기준)

Kim, Bu-Yong
- Communications for Statistical Applications and Methods / v.18 no.6 / pp.761-770 / 2011
Robust principal components regression is suggested to deal with both multicollinearity and outliers. A main aspect of robust principal components regression is the selection of an optimal set of principal components. Instead of the eigenvalues of the sample covariance matrix, the selection criterion developed here is based on the condition index of the minimum volume ellipsoid estimator, which is highly robust against leverage points. In addition, least trimmed squares estimation is employed to cope with regression outliers. Monte Carlo simulation results indicate that the proposed criterion is superior to existing ones.
https://doi.org/10.5351/CKSS.2011.18.6.761
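As a rough illustration of what a condition index is, the sketch below computes the usual quantity, the square root of the largest eigenvalue divided by each eigenvalue, for any scatter matrix passed in. The abstract does not state the paper's actual selection rule, so the `cutoff` of 30 (a common rule of thumb for severe collinearity) and the function names are assumptions. The minimum volume ellipsoid estimator is not implemented here; a robust scatter estimate would have to be supplied by the caller.

```python
import numpy as np

def condition_indices(scatter):
    """Condition index of each component: sqrt(largest eigenvalue / eigenvalue).

    `scatter` is any symmetric positive definite scatter matrix; the abstract
    uses a minimum volume ellipsoid estimate, which this sketch does not compute.
    """
    eigvals = np.linalg.eigvalsh(scatter)[::-1]   # eigenvalues in descending order
    return np.sqrt(eigvals[0] / eigvals)

def keep_components(scatter, cutoff=30.0):
    """Indices of components whose condition index is below a hypothetical cutoff."""
    return np.flatnonzero(condition_indices(scatter) < cutoff)
```

For a scatter matrix with eigenvalues 100, 1, and 0.01, the condition indices are 1, 10, and 100, so only the first two components fall under the cutoff.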

Input Variables Selection by Principal Component Analysis and Mutual Information Estimation (주요성분분석과 상호정보 추정에 의한 입력변수선택)

Cho, Yong-Hyun;Hong, Seong-Jun
- Journal of the Korean Institute of Intelligent Systems / v.17 no.2 / pp.220-225 / 2007
This paper presents an efficient input variable selection method using both principal component analysis (PCA) and adaptive partition mutual information (AP-MI) estimation. PCA, which is based on second-order statistics, is applied to prevent overestimation by quickly removing the dependence between input variables. AP-MI estimation is then applied to estimate dependence information accurately by equally partitioning the samples of each input variable when calculating the probability density function. The proposed method has been applied to two input variable selection problems: 7 artificial signals of 500 samples and 24 environmental pollution signals of 55 samples. The experimental results show that the proposed method selects variables quickly and accurately. It also outperforms both AP-MI estimation without PCA and regular partition MI estimation.
https://doi.org/10.5391/JKIIS.2007.17.2.220
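For readers unfamiliar with partition-based mutual information estimates, the following sketch shows a plain histogram estimator. It assumes that "regular partition" means equal-width bins, which the abstract does not confirm, and it is not the AP-MI estimator. The function name and the default of 8 bins are hypothetical.

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from an equal-width 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                    # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)          # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)          # marginal of y, shape (1, bins)
    nz = pxy > 0                                 # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

An input variable that is strongly related to the target should score well above an unrelated one, which is the property a selection procedure relies on. The estimate is biased upward for small samples and many bins.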

HisCoM-PCA: software for hierarchical structural component analysis for pathway analysis based using principal component analysis

Jiang, Nan;Lee, Sungyoung;Park, Taesung
- Genomics & Informatics / v.18 no.1 / pp.11.1-11.3 / 2020
In genome-wide association studies, pathway-based analysis has been widely performed to enhance interpretation of single-nucleotide polymorphism association results. We proposed a novel hierarchical structural component model (HisCoM) for pathway analysis of common variants (HisCoM-PCA), which was used to identify pathways associated with traits. HisCoM-PCA is based on principal component analysis (PCA) for dimension reduction of single-nucleotide polymorphisms in each gene, and on HisCoM for pathway analysis. In this study, we developed HisCoM-PCA software for the hierarchical pathway analysis of common variants. The software has several features. Users who want to summarize common variants at the gene level with different threshold values can specify various principal component score selection criteria in the PCA step. In addition, multiple public pathway databases and customized pathway information can be used to perform pathway analysis. We expect that HisCoM-PCA software will be useful for users to perform powerful pathway analysis.
https://doi.org/10.5808/GI.2020.18.1.e11

Bayesian Typhoon Track Prediction Using Wind Vector Data

Han, Minkyu;Lee, Jaeyong
- Communications for Statistical Applications and Methods / v.22 no.3 / pp.241-253 / 2015
In this paper, we predict the track of typhoons using a Bayesian principal component regression model based on wind field data. Data are obtained at each time point, and we apply the Bayesian principal component regression model to predict the track at that time point. Within the regression model, we apply a variable selection prior and two kinds of prior distributions: normal and Laplace. We show prediction results based on the Bayesian Model Averaging (BMA) estimator and the Median Probability Model (MPM) estimator. We analyze 8 typhoons in 2006 using data obtained from the previous 6 years (2000-2005). We compare our prediction results with a moving-nest typhoon model (MTM) proposed by the Korea Meteorological Administration. We posit that it is possible to predict the track of a typhoon accurately using only a statistical model, without a dynamical model.
https://doi.org/10.5351/CSAM.2015.22.3.241

A Fuzzy Neural Network Combining Wavelet Denoising and PCA for Sensor Signal Estimation

Na, Man-Gyun
- Nuclear Engineering and Technology / v.32 no.5 / pp.485-494 / 2000
In this work, a fuzzy neural network is used to estimate the relevant sensor signal from other sensor signals. Noise components in the input signals to the fuzzy neural network are removed through the wavelet denoising technique. Principal component analysis (PCA) is used to reduce the dimension of the input space without losing a significant amount of information. A lower-dimensional input space will also usually reduce the time needed to train a fuzzy neural network. In addition, PCA makes it easier to select the input signals for the fuzzy neural network. The fuzzy neural network parameters are optimized by two learning methods: a genetic algorithm optimizes the antecedent parameters, and a least-squares algorithm solves for the consequent parameters. The proposed algorithm was verified by applying it to the pressurizer water level and hot-leg flowrate measurements in pressurized water reactors.

Sensor array optimization techniques for exhaled breath analysis to discriminate diabetics using an electronic nose

Jeon, Jin-Young;Choi, Jang-Sik;Yu, Joon-Boo;Lee, Hae-Ryong;Jang, Byoung Kuk;Byun, Hyung-Gi
- ETRI Journal / v.40 no.6 / pp.802-812 / 2018
Disease discrimination using an electronic nose is achieved by measuring the presence of a specific gas contained in the exhaled breath of patients. Many studies have reported the presence of acetone in the breath of diabetic patients. These studies suggest that acetone can be used as a biomarker of diabetes, enabling diagnoses to be made by measuring acetone levels in exhaled breath. In this study, we perform a chemical sensor array optimization to improve the performance of an electronic nose system using Wilks' lambda, sensor selection based on a principal component (B4), and a stepwise elimination (SE) technique to detect the presence of acetone gas in human breath. By applying five different temperatures to four sensors fabricated from different synthetic materials, a total of 20 sensing combinations are created, and three sensing combinations are selected for the sensor array using optimization techniques. The measurements and analyses of the exhaled breath using the electronic nose system together with the optimized sensor array show that diabetic patients and control groups can be easily differentiated. The results are confirmed using principal component analysis (PCA).
https://doi.org/10.4218/etrij.2017-0018

Prediction of Melting Point for Drug-like Compounds Using Principal Component-Genetic Algorithm-Artificial Neural Network

Habibi-Yangjeh, Aziz;Pourbasheer, Eslam;Danandeh-Jenagharad, Mohammad
- Bulletin of the Korean Chemical Society / v.29 no.4 / pp.833-841 / 2008
Principal component-genetic algorithm-multiparameter linear regression (PC-GA-MLR) and principal component-genetic algorithm-artificial neural network (PC-GA-ANN) models were applied to predict the melting points of 323 drug-like compounds. A large number of theoretical descriptors were calculated for each compound. The first 234 principal components (PCs) were found to explain more than 99.9% of the variance in the original data matrix. From this pool of PCs, the genetic algorithm was employed to select the best set of extracted PCs for the PC-MLR and PC-ANN models. The models were generated using fifteen PCs as variables. To evaluate the predictive power of the models, melting points of 64 compounds in the prediction set were calculated. Root-mean-square errors (RMSE) for the PC-GA-MLR and PC-GA-ANN models are 48.18 and 12.77 °C, respectively. Comparison of the results reveals the superiority of PC-GA-ANN over PC-GA-MLR and the recently proposed models (RMSE = 40.7 °C). The improvements are due to the fact that the melting points of the compounds show non-linear correlations with the principal components.
https://doi.org/10.5012/bkcs.2008.29.4.833
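The statement that a leading block of principal components explains more than 99.9% of the variance corresponds to a cumulative explained-variance calculation. The sketch below shows one way to count how many components are needed for a given target. The function name and the use of a singular value decomposition on centered data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def n_components_for_variance(X, target=0.999):
    """Smallest number of leading PCs whose cumulative explained variance reaches target."""
    Xc = X - X.mean(axis=0)                       # center each descriptor column
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    ratio = np.cumsum(s**2) / np.sum(s**2)        # cumulative explained-variance ratio
    k = int(np.searchsorted(ratio, target)) + 1   # first index where ratio >= target
    return min(k, len(s))                         # guard against rounding at target = 1.0
```

The retained components would then form the candidate pool from which a further selection step, such as the genetic algorithm mentioned in the abstract, picks a subset.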

The Pattern Recognition Methods for Emotion Recognition with Speech Signal (음성신호를 이용한 감성인식에서의 패턴인식 방법)

Park Chang-Hyun;Sim Kwee-Bo
- Journal of Institute of Control, Robotics and Systems / v.12 no.3 / pp.284-288 / 2006
In this paper, we apply several pattern recognition algorithms to an emotion recognition system based on speech signals and compare the results. First, emotional speech databases are needed, and the speech features for emotion recognition are determined in the database analysis step. Second, recognition algorithms are applied to these speech features. The algorithms we try are artificial neural networks, Bayesian learning, principal component analysis, and the LBG algorithm. The performance gap between these methods is then presented in the experimental results section. Emotion recognition techniques are not yet mature: emotion feature selection and the choice of a suitable classification method both remain open to debate. We hope this paper can serve as a reference for that debate.
https://doi.org/10.5302/J.ICROS.2006.12.3.284

The Pattern Recognition Methods for Emotion Recognition with Speech Signal (음성신호를 이용한 감성인식에서의 패턴인식 방법)

Park Chang-Hyeon;Sim Gwi-Bo
- Proceedings of the Korean Institute of Intelligent Systems Conference / 2006.05a / pp.347-350 / 2006
In this paper, we apply several pattern recognition algorithms to an emotion recognition system based on speech signals and compare the results. First, emotional speech databases are needed, and the speech features for emotion recognition are determined in the database analysis step. Second, recognition algorithms are applied to these speech features. The algorithms we try are artificial neural networks, Bayesian learning, principal component analysis, and the LBG algorithm. The performance gap between these methods is then presented in the experimental results section. Emotion recognition techniques are not yet mature: emotion feature selection and the choice of a suitable classification method both remain open to debate. We hope this paper can serve as a reference for that debate.