Performance evaluation of principal component analysis for clustering problems

  • Kim, Jae-Hwan (Department of Data Information, Korea Maritime and Ocean University) ;
  • Yang, Tae-Min (Department of Data Information, Korea Maritime and Ocean University) ;
  • Kim, Jung-Tae (Department of Data Information, Korea Maritime and Ocean University)
  • Received : 2016.09.26
  • Accepted : 2016.10.17
  • Published : 2016.10.31

Abstract

Clustering analysis is widely used in data mining to classify data into categories on the basis of their similarity. Over the decades, many clustering techniques have been developed, including hierarchical and non-hierarchical algorithms. In gene profiling problems, because of the large number of genes and the complexity of biological networks, dimensionality reduction techniques are critical exploratory tools for clustering analysis of gene expression data. Recently, clustering analysis that applies dimensionality reduction techniques has also been proposed. PCA (principal component analysis) is a popular dimensionality reduction technique for clustering problems. However, previous studies evaluated the performance of PCA only on full data sets. In this paper, to evaluate the performance of PCA for clustering analysis more specifically and robustly, we exploit an improved FCBF (fast correlation-based filter) feature selection method for supervised clustering data sets, and we employ two well-known clustering algorithms: k-means and k-medoids. Computational results on the supervised data sets show that the performance of PCA is very poor when the number of features is large.
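
The evaluation described in the abstract can be illustrated with a minimal sketch: cluster a labeled (supervised) data set with k-means on the full feature set and on PCA-reduced features, then score each partition against the ground-truth classes using the adjusted Rand index. This is an assumption-laden illustration, not the authors' exact pipeline; the scikit-learn API and the Wine data set are stand-ins for the paper's benchmark sets and implementation.

```python
# Minimal sketch (assumes scikit-learn): compare k-means clustering quality
# on the full feature set versus PCA-reduced features, scored by the
# adjusted Rand index against the known class labels.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_wine(return_X_y=True)        # features and ground-truth classes
k = len(np.unique(y))                    # number of clusters = number of classes
X = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

def cluster_ari(features, seed=0):
    """Partition the data with k-means and score it with the adjusted Rand index."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
    return adjusted_rand_score(y, labels)

print("k-means on full feature set: ARI = %.3f" % cluster_ari(X))

# Repeat after projecting the data onto the first q principal components.
for q in (2, 5, 10):
    X_pca = PCA(n_components=q).fit_transform(X)
    print("k-means on %2d principal components: ARI = %.3f" % (q, cluster_ari(X_pca)))

# k-medoids (e.g., KMedoids from scikit-learn-extra) could be substituted for
# KMeans to mirror the paper's second clustering algorithm.
```

A feature-selection filter such as FCBF would be applied analogously, replacing the PCA projection with a reduced subset of the original features before clustering.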

Keywords
