http://dx.doi.org/10.5391/JKIIS.2012.22.2.205

Sparse Document Data Clustering Using Factor Score and Self Organizing Maps  

Jun, Sung-Hae (Department of Statistics, Cheongju University)
Publication Information
Journal of the Korean Institute of Intelligent Systems, v.22, no.2, 2012, pp. 205-211
Abstract
Retrieved documents must be transformed into a suitable data structure before the clustering algorithms of statistics and machine learning can be applied. A popular data structure for document clustering is the document-term matrix, whose entries are the frequencies with which each term occurs in each document. This matrix suffers from a sparsity problem because most of its entries are zero, and this sparseness degrades clustering performance. To address the problem, this research uses factor scores obtained by factor analysis: the document-term matrix is transformed into a document-factor score matrix, which is then used as the input data for document clustering. To compare clustering performance between the document-term matrix and the document-factor score matrix, both types of matrix are clustered with a self organizing map (SOM).
Keywords
Sparse Document Clustering; Document-term Matrix; Document-factor Score Matrix; Factor Analysis; Self Organizing Map
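
The pipeline described in the abstract (document-term matrix, then factor scores, then SOM clustering) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the toy documents, the function names, the principal-component factor extraction with least-squares scores, and the small online SOM are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np

def document_term_matrix(docs):
    """Count term occurrences (rows: documents, columns: terms)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            dtm[i, index[w]] += 1
    return dtm, vocab

def factor_scores(dtm, n_factors):
    """Least-squares factor scores for a principal-component factor solution.

    Assumption: n_factors does not exceed the rank of the correlation matrix,
    so the retained eigenvalues are strictly positive.
    """
    x = dtm[:, dtm.std(axis=0) > 0]          # drop terms with constant counts
    z = (x - x.mean(axis=0)) / x.std(axis=0)  # standardize each term
    eigval, eigvec = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    top = np.argsort(eigval)[::-1][:n_factors]
    # Loadings are eigvec * sqrt(eigval); the least-squares scores reduce to:
    return z @ eigvec[:, top] / np.sqrt(eigval[top])

def train_som(data, rows, cols, n_iter=500, lr0=0.5, seed=0):
    """Online SOM on a rows x cols grid with a shrinking Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = 1.0 - t / n_iter                      # linear decay schedule
        lr, sigma = lr0 * frac, sigma0 * frac + 1e-3
        x = data[rng.integers(len(data))]            # pick one sample at random
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best matching unit
        dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2 * sigma ** 2))        # neighborhood weights
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_clusters(data, weights):
    """Label each row of data with the index of its nearest SOM unit."""
    d = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Hypothetical toy corpus, for illustration only.
docs = [
    "neural network training data",
    "neural network learning model",
    "patent claim legal document",
    "patent legal claim filing",
]
dtm, vocab = document_term_matrix(docs)
sparsity = (dtm == 0).mean()                 # share of zero entries
scores = factor_scores(dtm, n_factors=2)     # document-factor score matrix
labels_dtm = assign_clusters(dtm, train_som(dtm, rows=1, cols=2))
labels_fs = assign_clusters(scores, train_som(scores, rows=1, cols=2))
print(f"sparsity={sparsity:.2f}", dtm.shape, scores.shape, labels_dtm, labels_fs)
```

The document-factor score matrix has one dense column per factor instead of one mostly-zero column per term, which is the property the abstract relies on. On a real corpus, the number of factors and the SOM grid size would need to be chosen from the data.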