Sparse Document Data Clustering Using Factor Score and Self Organizing Maps

인자점수와 자기조직화지도를 이용한 희소한 문서데이터의 군집화

  • Received : 2011.11.25
  • Accepted : 2012.02.04
  • Published : 2012.04.25

Abstract

Before documents can be clustered with statistical or machine learning algorithms, the retrieved documents must be transformed into a suitable data structure. A popular structure for document clustering is the document-term matrix, whose entries are the frequencies with which each term occurs in each document. This matrix suffers from a sparsity problem because most of its entries are 0, and this sparseness degrades clustering performance. To address the sparsity problem, this research uses factor scores obtained by factor analysis: the document-term matrix is transformed into a document-factor score matrix, which is then used as the input data for document clustering. To compare clustering performance between the document-term matrix and the document-factor score matrix, both matrices are applied to self-organizing map (SOM) clustering.
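The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the use of Python with NumPy, the principal-component extraction of factors, and the small online SOM are all assumptions made for illustration, since the abstract does not specify the extraction method or the SOM settings.

```python
import numpy as np

# Toy corpus (hypothetical; not the data used in the paper).
docs = [
    "neural network learning model",
    "neural network training model",
    "network learning training data",
    "patent claim legal document",
    "patent legal claim filing",
    "legal document filing claim",
]

def document_term_matrix(texts):
    """Return (matrix, vocabulary); rows are documents, columns are term counts."""
    vocab = sorted({w for t in texts for w in t.split()})
    index = {w: j for j, w in enumerate(vocab)}
    dtm = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.split():
            dtm[i, index[w]] += 1
    return dtm, vocab

def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(matrix == 0))

def factor_scores(dtm, n_factors):
    """Factor scores under principal-component extraction (an assumed choice).

    With loadings L = V * sqrt(lambda), the regression-method scores
    reduce to Z V lambda^(-1/2), where Z is the standardized matrix.
    """
    std = dtm.std(axis=0)
    z = (dtm - dtm.mean(axis=0)) / np.where(std > 0, std, 1.0)
    corr = (z.T @ z) / len(z)                  # term-by-term correlation matrix
    eigval, eigvec = np.linalg.eigh(corr)
    top = np.argsort(eigval)[::-1][:n_factors]  # indices of the largest eigenvalues
    return z @ eigvec[:, top] / np.sqrt(eigval[top])

def train_som(data, rows, cols, n_iter=500, seed=0):
    """Online SOM on a rows x cols grid; returns one weight vector per node."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                              # decaying learning rate
        radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # shrinking neighbourhood
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best matching unit
        dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2.0 * radius ** 2))             # Gaussian neighbourhood
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_clusters(data, weights):
    """Label each row of data with the index of its nearest SOM node."""
    d = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

dtm, vocab = document_term_matrix(docs)
scores = factor_scores(dtm, n_factors=2)

# Cluster both representations with the same SOM settings for comparison.
labels_dtm = assign_clusters(dtm, train_som(dtm, rows=1, cols=2))
labels_fs = assign_clusters(scores, train_som(scores, rows=1, cols=2))

print("DTM shape:", dtm.shape, "sparsity:", round(sparsity(dtm), 3))
print("Factor-score shape:", scores.shape, "sparsity:", round(sparsity(scores), 3))
print("SOM labels (DTM):          ", labels_dtm)
print("SOM labels (factor scores):", labels_fs)
```

The factor-score matrix has one dense column per factor instead of one mostly-zero column per term, which is the property the abstract relies on. How many factors to keep and how to evaluate the resulting clusters are left open here.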
