http://dx.doi.org/10.5916/jkosme.2016.40.8.726

Performance evaluation of principal component analysis for clustering problems  

Kim, Jae-Hwan (Department of Data Information, Korea Maritime and Ocean University)
Yang, Tae-Min (Department of Data Information, Korea Maritime and Ocean University)
Kim, Jung-Tae (Department of Data Information, Korea Maritime and Ocean University)
Abstract
Clustering analysis is widely used in data mining to group data into categories on the basis of their similarity. Over the decades, many clustering techniques have been developed, including hierarchical and non-hierarchical algorithms. In gene profiling problems, because of the large number of genes and the complexity of biological networks, dimensionality reduction techniques are critical exploratory tools for clustering analysis of gene expression data. Recently, clustering approaches that first apply a dimensionality reduction technique have also been proposed. PCA (principal component analysis) is a popular dimensionality reduction technique for clustering problems. However, previous studies analyzed the performance of PCA only on full data sets. In this paper, to evaluate the performance of PCA for clustering analysis more specifically and robustly, we apply an improved FCBF (fast correlation-based filter), a feature selection method, to supervised (labeled) clustering data sets, and employ two well-known clustering algorithms: k-means and k-medoids. Computational results on the supervised data sets show that PCA performs very poorly when the number of features is large.
Keywords
Clustering algorithm; Dimensionality reduction; PCA; Feature selection
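
The kind of evaluation the abstract describes (cluster labeled data with and without PCA, then compare the result against the known classes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' code or data: the synthetic data set, the helper names `pca_project`, `kmeans`, and `adjusted_rand_index`, the choice of two principal components, and the use of the adjusted Rand index as the agreement score. The improved FCBF feature selection step and k-medoids are omitted.

```python
import numpy as np
from math import comb

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, then nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two labelings of the same items."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)  # contingency table of label pairs
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(ai), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical data: 3 classes separated in 2 features, plus 50 noise features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 50)
signal = np.array([[0, 0], [6, 0], [0, 6]])[y] + rng.normal(size=(150, 2))
X = np.hstack([signal, rng.normal(size=(150, 50))])

for name, data in [("all features", X), ("2 principal components", pca_project(X, 2))]:
    score = adjusted_rand_index(y, kmeans(data, 3))
    print(f"{name}: ARI = {score:.3f}")
```

A score of 1 means the clusters match the known classes exactly, and a score near 0 means agreement no better than chance. Repeating the comparison on feature subsets chosen by a filter method, rather than only on the full data set, is the setting the abstract focuses on.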