Document Clustering Using Semantic Features and Fuzzy Relations

  • Kim, Chul-Won (Department of Computer Engineering, Honam University)
  • Park, Sun (Institute of Information Science and Engineering Research, Mokpo National University)
  • Received : 2013.03.04
  • Accepted : 2013.04.23
  • Published : 2013.09.30

Abstract

Traditional clustering methods are usually based on the bag-of-words (BOW) model, which ignores the semantic relationships among terms in the data set. Ontology-based or matrix factorization approaches are commonly used to address this problem. However, a major problem of the ontology approach is that it is usually difficult to find a comprehensive ontology that covers all the concepts mentioned in a collection. This paper proposes a new document clustering method that uses semantic features and fuzzy relations to address the problems of the ontology and matrix factorization approaches. The proposed method can improve the quality of document clustering because it uses fuzzy relation values between semantic features and terms to distinguish clearly among dissimilar documents in clusters. The selected cluster label terms can better represent the inherent structure of a document set because they are chosen using semantic features based on non-negative matrix factorization (NMF), which is used in document clustering. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.
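The general idea of clustering with semantic features from non-negative matrix factorization can be illustrated with a minimal sketch. The abstract does not define the fuzzy relation or the label selection rule, so everything below is an assumption made for illustration: the toy term-document matrix, the function names, the max-normalized membership used as a stand-in for a fuzzy relation value, and the choice of the highest-membership term as a cluster label. Only the factorization step follows a known algorithm, the multiplicative updates of Lee and Seung for the squared-error objective.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

def nmf(a, rank, iterations=300, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix A into W (m x rank) and H (rank x n)
    with Lee and Seung's multiplicative updates (squared-error objective)."""
    rng = random.Random(seed)
    m, n = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return w, h

def fuzzy_membership(w, eps=1e-9):
    """ASSUMPTION: scale each semantic feature (column of W) by its largest
    entry so term weights lie in [0, 1]. This is a stand-in, not the paper's
    fuzzy relation, which the abstract does not specify."""
    peaks = [max(col) + eps for col in zip(*w)]
    return [[value / peak for value, peak in zip(row, peaks)] for row in w]

def assign_clusters(h):
    """Assign each document (column of H) to its strongest semantic feature."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]

def label_terms(membership, terms):
    """ASSUMPTION: label each feature with its highest-membership term."""
    return [terms[max(range(len(terms)), key=lambda i: membership[i][k])]
            for k in range(len(membership[0]))]

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["matrix", "factor", "fuzzy", "relation"]
A = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]

w, h = nmf(A, rank=2)
membership = fuzzy_membership(w)
labels = assign_clusters(h)
cluster_labels = label_terms(membership, terms)
print(labels, cluster_labels)
```

In this toy setting the first two documents share terms only with each other, as do the last two, so the factorization is expected to place them in two separate clusters. Which cluster receives index 0 depends on the random initialization.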
