Enhancing Text Document Clustering Using Non-negative Matrix Factorization and WordNet

  • Kim, Chul-Won (Department of Computer Engineering, Honam University)
  • Park, Sun (Networked Computing System Lab., Gwangju Institute of Science and Technology)
  • Received: 2012.12.28
  • Reviewed: 2013.03.22
  • Published: 2013.12.31

Abstract

Classic document clustering techniques may assign documents to different clusters even when they belong together, because such documents may not share any terms. To overcome this problem, internal and external knowledge-based approaches have recently been applied to text document clustering. However, the clustering results of these approaches are influenced by the inherent structure and topical composition of the documents, and organizing knowledge into an ontology is expensive. In this paper, we propose an enhanced text document clustering method that uses non-negative matrix factorization (NMF) and WordNet. The semantic terms that NMF extracts as cluster labels represent the inherent structure of a document cluster well. The proposed method also improves clustering quality by using these cluster labels together with term weights based on the term mutual information of WordNet. The experimental results show that the proposed method outperforms the other text clustering methods it was compared with.
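As a rough illustration of the idea that NMF factors can serve as cluster labels, the following minimal sketch factors a tiny term-document matrix with the multiplicative update rules of Lee and Seung and reads the highest-weighted terms of each factor as a label. It is not the authors' implementation: the toy matrix, the term list, and the helper names (`nmf`, `top_terms`, `assign_clusters`) are assumptions made for illustration, and the WordNet-based term weighting described in the abstract is not reproduced here.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def nmf(a, k, iters=500, seed=0, eps=1e-9):
    """Approximate a (terms x docs) as w @ h with non-negative factors.

    Uses multiplicative updates for the squared-error objective.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iters):
        # h <- h * (w^T a) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, a)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_docs)]
             for i in range(k)]
        # w <- w * (a h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(a, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_terms)]
    return w, h


def top_terms(w, terms, top_n=3):
    """For each factor (column of w), list the top_n highest-weighted terms."""
    labels = []
    for j in range(len(w[0])):
        ranked = sorted(range(len(terms)), key=lambda i: w[i][j], reverse=True)
        labels.append([terms[i] for i in ranked[:top_n]])
    return labels


def assign_clusters(h):
    """Assign each document (column of h) to its highest-weighted factor."""
    k = len(h)
    return [max(range(k), key=lambda j: h[j][d]) for d in range(len(h[0]))]


# Hypothetical toy data: rows are terms, columns are four documents.
TERMS = ["car", "engine", "wheel", "apple", "fruit", "juice"]
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]

W, H = nmf(A, k=2)
labels = top_terms(W, TERMS)
clusters = assign_clusters(H)
print(labels)
print(clusters)
```

Which factor index ends up describing which topic depends on the random initialization, so labels should be looked up through the cluster index assigned to a document rather than by a fixed position.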
