Unsupervised Feature Selection Method Based on Principal Component Loading Vectors

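  • 박영준 (Department of Industrial Management Engineering, Korea University)
  • 김성범 (Department of Industrial Management Engineering, Korea University)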
  • Received : 2013.12.26
  • Accepted : 2014.05.17
  • Published : 2014.06.15

Abstract

One of the most widely used methods for dimensionality reduction is principal component analysis (PCA). However, the reduced dimensions produced by PCA are difficult to interpret in terms of the original features because each one is a linear combination of a large number of those features. This interpretation problem can be overcome by feature selection approaches that identify the best subset of the given features. In this study, we propose an unsupervised feature selection method based on the geometrical information of PCA loading vectors. Experimental results from a simulation study demonstrated the efficiency and usefulness of the proposed method.
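
To make the interpretation problem concrete, the following is a minimal sketch of how PCA loading vectors relate principal components to the original features. It is not the method proposed in this study, whose details are not given in the abstract. The scoring rule (the Euclidean norm of each feature's row in the loading matrix), the function names, and the synthetic data are all assumptions chosen only for illustration.

```python
import numpy as np


def pca_loadings(X, n_components):
    """Return PCA loading vectors (features x components) and component variances."""
    Xc = X - X.mean(axis=0)  # center each feature
    # SVD of the centered data: the rows of Vt are the principal axes
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T  # shape (n_features, n_components)
    variances = (s[:n_components] ** 2) / (X.shape[0] - 1)
    return loadings, variances


def rank_features_by_loading_norm(loadings):
    """Illustrative score (an assumption): norm of each feature's row of loadings."""
    scores = np.linalg.norm(loadings, axis=1)
    order = np.argsort(scores)[::-1]  # highest score first
    return order, scores


# Hypothetical synthetic data: features 0 and 1 carry most of the variance,
# features 2 to 4 are low-variance noise.
rng = np.random.default_rng(0)
n_samples = 200
signal = rng.normal(size=(n_samples, 2)) * np.array([5.0, 3.0])
noise = rng.normal(size=(n_samples, 3)) * 0.1
X = np.hstack([signal, noise])

loadings, variances = pca_loadings(X, n_components=2)
order, scores = rank_features_by_loading_norm(loadings)

# Each column of `loadings` holds the weights of one principal component,
# so every component is a linear combination of all five original features.
print("loadings:\n", np.round(loadings, 3))
print("features ranked by loading norm:", order)
```

Each principal component places a nonzero weight on every original feature, which is why the components themselves are hard to interpret. Ranking features by the size of their loadings is one simple way to map the components back to a subset of the original features.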
