• Title/Summary/Keyword: Data Principal

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics / v.36 no.6 / pp.501-514 / 2023
  • When analyzing high-dimensional data such as text data, using all variables as explanatory variables can cause statistical learning procedures to overfit. Furthermore, computational efficiency can deteriorate as the number of variables grows. Dimensionality reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least-squares method that employs an elastic net-type objective function. SPCA can be used to remove insignificant principal components and identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on SPCA. Applying the proposed procedure to real data, we find that the reduced feature set retains sufficient information from the text data while the feature set shrinks through the removal of redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
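
For reference, the elastic net-type objective mentioned above is commonly written in the form given by Zou, Hastie and Tibshirani (2006). Whether the paper uses exactly this formulation is an assumption. Here $x_i$ is the $i$-th observation, $A$ and $B = [\beta_1, \dots, \beta_k]$ are $p \times k$ matrices, $\lambda$ is the ridge penalty, and $\lambda_{1,j}$ are the lasso penalties:

$$
(\hat{A}, \hat{B}) = \arg\min_{A, B} \sum_{i=1}^{n} \lVert x_i - A B^{\top} x_i \rVert_2^2
  + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert_2^2
  + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_1
\quad \text{subject to } A^{\top} A = I_k
$$

The $\ell_1$ terms can drive individual loadings in $\beta_j$ to exactly zero. One natural reading of the feature-selection step is that a word whose loading is zero in every retained component can be dropped.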

Comparison of Methods for Reducing the Dimension of Compositional Data with Zero Values

  • Song, Taeg-Youn;Choi, Byung-Jin
    • Communications for Statistical Applications and Methods / v.19 no.4 / pp.559-569 / 2012
  • Compositional data consist of compositions, which are non-negative vectors of proportions subject to the unit-sum constraint. In disciplines such as petrology and archaeometry, statistical analysis of this type of data is fundamental. Aitchison (1983) introduced log-contrast principal component analysis, which operates on log-ratio transformed data, as a dimension-reduction technique for understanding and interpreting the structure of compositional data. However, the analysis cannot be used when zero values are present in the data. In this paper, we introduce four possible methods for reducing the dimension of compositional data with zero values. Two real data sets are analyzed using these methods, and the results are compared.
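
To illustrate why zeros break a log-ratio analysis, here is a minimal Python sketch of the centered log-ratio (clr) transform together with a simple multiplicative zero replacement. The replacement rule and the value of `delta` are assumptions for illustration only. The abstract does not name its four methods, and this is not claimed to be one of them.

```python
import math

def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    if any(part <= 0 for part in composition):
        raise ValueError("clr is undefined when a part is zero or negative")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def replace_zeros(composition, delta=1e-3):
    """Multiplicative zero replacement (illustrative; delta is an assumed value).

    Zeros become delta; non-zero parts are scaled by (1 - total imputed mass)
    so the result still sums to 1.
    """
    n_zeros = sum(1 for part in composition if part == 0)
    scale = 1.0 - n_zeros * delta
    return [delta if part == 0 else part * scale for part in composition]

sample = [0.6, 0.4, 0.0]            # hypothetical three-part composition with a zero
adjusted = replace_zeros(sample)    # zero replaced, unit sum preserved
transformed = clr(adjusted)         # now well defined; components sum to zero
```

Calling `clr(sample)` directly raises `ValueError`, which is the failure mode the abstract describes. After replacement, the transformed vector can be passed to an ordinary PCA.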

A Classification Method Using Data Reduction

  • Uhm, Daiho;Jun, Sung-Hae;Lee, Seung-Joo
    • International Journal of Fuzzy Logic and Intelligent Systems / v.12 no.1 / pp.1-5 / 2012
  • Data reduction is widely used in data mining to make analysis more convenient. Principal component analysis (PCA) and factor analysis (FA) are popular techniques. PCA and FA reduce the number of variables to avoid the curse of dimensionality, in which computing time increases exponentially with the number of variables. Many methods have therefore been published for dimension reduction. Data augmentation is another approach to analyzing data efficiently. The support vector machine (SVM) algorithm is a representative technique for dimension augmentation: it maps the original data to a high-dimensional feature space to obtain the optimal decision plane. Both data reduction and augmentation have been used to solve diverse problems in data analysis. In this paper, we compare the strengths and weaknesses of dimension reduction and augmentation for classification and propose a classification method using data reduction. We carry out comparative experiments to verify the performance of the proposed method.

A Study on Determination of Initial Principal Dimension for High-Speed Boat using Existing Boat DB (실적선 DB를 이용한 고속보트 초기 주요치수 결정에 관한 연구)

  • Lee, Dae-Hak;Kim, Dong-Joon;Song, Yeun-Hee
    • Journal of Navigation and Port Research / v.42 no.3 / pp.177-186 / 2018
  • Designers need a lot of information to determine the principal dimensions in the initial stage of boat design, and most of it can be obtained by investigating and analyzing data on similar existing boats. In addition, the principal dimensions, once determined, have an impact throughout the design process (basic/detailed design), which in turn directly affects the stability and performance of the boat. Therefore, in this study, an initial design system for boats (design support platform) was developed using a correlation analysis of existing data for more than 700 boats. It was confirmed that, for small, high-speed boats of the 50 ft class, the designer could conveniently and reasonably derive and determine the principal dimensions in the initial design stage.

Local Projective Display of Multivariate Numerical Data

  • Huh, Myung-Hoe;Lee, Yong-Goo
    • The Korean Journal of Applied Statistics / v.25 no.4 / pp.661-668 / 2012
  • For displaying multivariate numerical data on a 2D plane by projection, the principal components biplot and GGobi are the two main data visualization tools. The biplot is very useful for capturing the global shape of the dataset, representing $n$ observations and $p$ variables simultaneously on a single graph. GGobi shows a dynamic movie of the images of $n$ observations projected onto a sequence of unit vectors floating on the $p$-dimensional sphere. Although both methods are valuable, they have drawbacks. The biplot is too condensed to describe the detailed parts of the data, and GGobi is too burdensome for ordinary data analyses. In this paper, the "local projective display" (LPD) is proposed for visualizing multivariate numerical data. The main steps of the LPD are 1) $k$-means clustering of the data into $k$ subsets, 2) drawing $k$ principal components biplots of the individual subsets, and 3) sequencing the $k$ plots by Hurley's (2004) endlink algorithm for cognitive continuity.

County-Based Vulnerability Evaluation to Agricultural Drought Using Principal Component Analysis - The case of Gyeonggi-do - (주성분 분석법을 이용한 시군단위별 농업가뭄에 대한 취약성 분석에 관한 연구 - 경기도를 중심으로 -)

  • Jang, Min-Won
    • Journal of Korean Society of Rural Planning / v.12 no.1 s.30 / pp.37-48 / 2006
  • The objectives of this study were to develop a method for evaluating regional vulnerability to agricultural drought and to classify the vulnerability patterns. To test the method, 24 city or county areas of Gyeonggi-do were chosen. First, statistical data and digital maps relevant to agricultural drought were defined, and input data of 31 items were set up from 5 categories: land use factor, water resource factor, climate factor, topographic and soil factor, and agricultural production foundation factor. Second, to simplify the factors, principal component analysis was carried out, and 4 principal components that explain about 80.8% of the total variance were extracted. The principal components were interpreted as the vulnerability components of scale factor, geographical factor, weather factor, and agricultural production foundation factor. Next, the DVIP (Drought Vulnerability Index for Paddy) was calculated using factor scores from the principal components. Last, by means of statistical cluster analysis on the DVIP, the study area was classified into 5 patterns from A to E. Cluster A corresponds to areas where the agricultural industry is insignificant and the agricultural foundation is poorly equipped, and cluster B includes typical agricultural areas where the cultivation areas are large but irrigation facilities are still insufficient. Cluster C covers areas that are vulnerable to climate change, and cluster D applies to areas with extensive forests and high-elevation farmlands. The last cluster, E, indicates areas where the farmlands are small but most of them are irrigated.
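
The abstract reports that 4 components explain about 80.8% of the total variance but does not say how that number of components was chosen. A minimal sketch of one common rule, a cumulative explained-variance threshold, is given below. The rule, the function name `components_needed`, and the eigenvalues are all assumptions for illustration, not the study's values.

```python
def components_needed(eigenvalues, target_ratio):
    """Return (count, cumulative_ratio) for the fewest leading components
    whose cumulative share of total variance reaches target_ratio."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target_ratio:
            return count, cumulative / total
    # Target not reachable (e.g. target_ratio > 1): keep every component.
    return len(ordered), cumulative / total

# Hypothetical eigenvalues of a correlation matrix, NOT the study's values.
example_eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.8, 0.7]
count, ratio = components_needed(example_eigenvalues, target_ratio=0.8)
```

With these toy eigenvalues, the first four components are the smallest set that reaches the 80% target.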

Comparison of Dietary Externalization in Korea and Japan -by Principal Component Analysis- (식생활 외부화에 관한 한일 비교 연구 -주성분 분석을 이용하여-)

  • Choi Hyun-Sook
    • Journal of the East Asian Society of Dietary Life / v.16 no.1 / pp.23-28 / 2006
  • The purpose of this paper was to clarify the actual state of 'Dietary externalization' that has accompanied economic development in Korea and Japan, mainly using economic and nutrition-related data. 'Modernization of food style' and other forms of modernization have taken place, among which 'Dietary externalization' in particular has recently drawn interest. This paper also used econometric analysis to clarify whether the two countries differ in terms of the modernization of food style and the dietary externalization trend. The trends of dietary externalization in Korea and Japan were studied using principal component analysis. Food subgroups were investigated based on the annual report on the household income and expenditure survey of Korea and the annual report on the family income and expenditure survey of Japan. The statistical data from both countries were analyzed with SAS. The results are as follows: 1. In Korea, the ratio of carbohydrates in the total calorie intake is quite high and that of animal protein is rather low compared to Japan. 2. Traditional foods such as grains and vegetables are consumed much more in Korea than in Japan. 3. Principal Components 1 and 2 were extracted in both countries over the whole analysis period, which suggested 'Dietary externalization'. 4. Principal Component 1 has positive factor loadings on all food items, including meals outside the home and processed food. In other words, the 'Dietary externalization' trend in Korea has a simple pattern in which all externalization-related items are on the rise. 5. Principal Components 1 and 2, which indicated dietary externalization, were detected in Japan.

On-line Nonlinear Principal Component Analysis for Nonlinear Feature Extraction (비선형 특징 추출을 위한 온라인 비선형 주성분분석 기법)

  • 김병주;심주용;황창하;김일곤
    • Journal of KIISE:Software and Applications / v.31 no.3 / pp.361-368 / 2004
  • The purpose of this study is to propose a new on-line nonlinear PCA (OL-NPCA) method for nonlinear feature extraction from incremental data. Kernel PCA (KPCA) is widely used for nonlinear feature extraction; however, it has been pointed out that KPCA has the following problems. First, applying KPCA to $N$ patterns requires storing an $N \times N$ kernel matrix and finding its eigenvectors, which is infeasible for large $N$. Second, updating the eigenvectors with a new data point requires recomputing the whole eigenspace. OL-NPCA overcomes these problems with an incremental eigenspace update method and a feature mapping function. According to experimental results from applying OL-NPCA to a toy problem and a large data problem, OL-NPCA shows the following advantages. First, OL-NPCA is more memory-efficient than KPCA. Second, OL-NPCA is comparable in performance to KPCA. Furthermore, the performance of OL-NPCA can easily be improved by re-learning the data.
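
The storage problem described for batch KPCA can be made concrete with a short Python sketch that builds the full kernel (Gram) matrix and estimates its memory footprint. The RBF kernel, the `gamma` value, the 8-byte entry size, and the toy patterns are assumptions for illustration. This shows the batch cost only and is not the paper's OL-NPCA update.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(patterns, gamma=1.0):
    """Full N x N kernel matrix that batch KPCA must store and eigendecompose."""
    return [[rbf_kernel(p, q, gamma) for q in patterns] for p in patterns]

def gram_matrix_bytes(n_patterns, bytes_per_entry=8):
    """Memory needed for a dense N x N matrix with the given entry size."""
    return n_patterns * n_patterns * bytes_per_entry

patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # toy data, hypothetical
K = gram_matrix(patterns)                          # 3 x 3, symmetric, unit diagonal
```

Because storage grows with the square of $N$, `gram_matrix_bytes(100_000)` is already 8 × 10^10 bytes (about 80 GB) for dense 8-byte entries, which is why an incremental update that avoids the full matrix is attractive.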

A dimensional reduction method in cluster analysis for multidimensional data: principal component analysis and factor analysis comparison (다차원 데이터의 군집분석을 위한 차원축소 방법: 주성분분석 및 요인분석 비교)

  • Hong, Jun-Ho;Oh, Min-Ji;Cho, Yong-Been;Lee, Kyung-Hee;Cho, Wan-Sup
    • The Journal of Bigdata / v.5 no.2 / pp.135-143 / 2020
  • This paper proposes a pre-processing method and a dimensional reduction method for shopping cart analysis, where many variables are correlated, when segmenting consumer types in agri-food consumer panel data. Cluster analysis is a widely used method for dividing observations into several clusters in multivariate data. However, cluster analysis after dimensional reduction may be more effective when several variables are related. In this paper, food consumption survey data from 1,987 households were clustered using the K-means method, and 17 variables were re-selected to divide the data into clusters. Principal component analysis and factor analysis were compared as solutions to the multicollinearity problem and as ways to reduce dimensions for clustering. In this study, both principal component analysis and factor analysis reduced the dataset to two dimensions. Although principal component analysis divided the dataset into three clusters, the differences among the cluster characteristics were not clearly apparent. Under factor analysis, however, the characteristics of the clusters in the consumption pattern were well distinguished.

Reduced Order Modeling of Marine Engine Status by Principal Component Analysis (주성분 분석을 통한 선박 기관 상태의 차수 축소 모델링)

  • Seungbeom Lee;Jeonghwa Seo;Dong-Hwan Kim;Sangmin Han;Kwanwoo Kim;Sungwook Chung;Byeongwoo Yoo
    • Journal of the Society of Naval Architects of Korea / v.61 no.1 / pp.8-18 / 2024
  • The present study concerns reduced order modeling of a marine diesel engine, which can be used for outlier detection in status monitoring and for carbon intensity index calculation. Principal component analysis (PCA) is introduced for the reduced order modeling, focusing on the feasibility of detecting and treating nonlinear variables. Cross-correlation shows that seven of the 23 data channels are nonlinear, i.e., fuel mode, exhaust gas temperature after the turbocharger, and cylinder coolant temperatures. The dataset is handled so that the mean is located at the nominal continuous rating. A polynomial representation of the dataset is also applied to reflect the linearity between the engine speed and other channels. The first principal mode shows strong linear effects across most data channels, reflecting the linearity of the system. The nonlinear variables are effectively explained by the other modes. The second mode concerns the temperature of the cylinder cooling water, which shows small correlation with other variables. The third and fourth modes correlate the fuel mode and the turbocharger exhaust gas temperature, which have inferior linearity to other channels. PCA is shown to be applicable to binary data such as fuel mode selection, as well as to numerical data.
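
The abstract does not give the paper's procedure, so the following is only a minimal sketch of how a PCA-based reduced order model could flag outliers. It assumes a sample is scored by its distance from a rank-1 reconstruction about a chosen nominal point. The function names, the nominal point, and the two-channel toy data are hypothetical, and power iteration stands in for a full eigendecomposition.

```python
import math

def first_mode(rows, center, iterations=200):
    """Leading principal direction of rows about a chosen center (power iteration)."""
    dim = len(center)
    shifted = [[value - c for value, c in zip(row, center)] for row in rows]
    # Scatter matrix about the chosen center (not necessarily the sample mean).
    scatter = [[sum(r[i] * r[j] for r in shifted) for j in range(dim)]
               for i in range(dim)]
    vector = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = [sum(scatter[i][j] * vector[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    return vector

def residual(row, center, mode):
    """Distance between a sample and its rank-1 reconstruction."""
    shifted = [value - c for value, c in zip(row, center)]
    score = sum(s * m for s, m in zip(shifted, mode))
    return math.sqrt(sum((s - score * m) ** 2 for s, m in zip(shifted, mode)))

# Toy two-channel data lying on the line y = 2x, NOT engine measurements.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
nominal = [2.0, 4.0]                      # hypothetical nominal operating point
mode = first_mode(samples, nominal)
on_line = residual([5.0, 10.0], nominal, mode)   # consistent with the linear mode
off_line = residual([5.0, 0.0], nominal, mode)   # violates the linear relation
```

A sample that follows the linear relation captured by the first mode has a residual near zero, while one that departs from it has a large residual. In practice more modes would be retained, and the threshold for calling a sample an outlier would have to be chosen from the data.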