Document Clustering Method using PCA and Fuzzy Association

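  • Sun Park (Electrical, Electronics and Information Human Resource Development Project Group, Jeonbuk National University) ;
  • Dong-Un An (Division of Electrical, Electronics and Computer Engineering, Jeonbuk National University)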
  • Received : 2009.12.07
  • Accepted : 2010.01.07
  • Published : 2010.04.30

Abstract

This paper proposes a new document clustering method that uses principal component analysis (PCA) and fuzzy association. Because the method selects each cluster's label and representative terms from semantic features derived by PCA, it represents the inherent structure of document clusters better. It also improves clustering quality, because clustering with fuzzy association values separates dissimilar documents from a cluster more effectively. Experimental results show that the proposed method achieves better performance than other document clustering methods.
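
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the use of only the first principal component to form two clusters, the function names, and the term-correlation formula (the one commonly used in the fuzzy set retrieval model) are all assumptions made for illustration. The sketch picks representative terms from PCA loadings, then assigns each document to the cluster with which it has the highest fuzzy association.

```python
import math

# Hypothetical toy corpus: three documents about cars, three about astronomy.
DOCS = [
    ["car", "car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["car", "road", "wheel"],
    ["star", "orbit", "planet"],
    ["star", "star", "galaxy", "orbit"],
    ["star", "galaxy", "planet"],
]
VOCAB = sorted({t for d in DOCS for t in d})


def first_principal_component(docs, vocab, iters=200):
    """Leading eigenvector of the term covariance matrix, via power iteration."""
    n = len(docs)
    counts = [[d.count(t) for t in vocab] for d in docs]
    means = [sum(col) / n for col in zip(*counts)]
    centered = [[x - m for x, m in zip(row, means)] for row in counts]
    v = [float(j + 1) for j in range(len(vocab))]  # arbitrary non-uniform start
    for _ in range(iters):
        # Multiply v by (centered^T centered) without forming the matrix.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(len(vocab))]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def representative_terms(loadings, vocab, top_n=3):
    """Split terms by loading sign; keep the top_n strongest on each side."""
    ranked = sorted(zip(loadings, vocab))
    negative = [t for w, t in ranked[:top_n] if w < 0]
    positive = [t for w, t in ranked[-top_n:] if w > 0]
    return [positive, negative]


def term_correlation(docs, a, b):
    """Assumed formula: c_ab = n_ab / (n_a + n_b - n_ab) over document counts."""
    n_a = sum(a in d for d in docs)
    n_b = sum(b in d for d in docs)
    n_ab = sum(a in d and b in d for d in docs)
    return n_ab / (n_a + n_b - n_ab)


def membership(docs, doc, cluster_terms):
    """Mean fuzzy association of a document with a cluster's terms."""
    total = 0.0
    for t in cluster_terms:
        miss = 1.0
        for k in set(doc):
            miss *= 1.0 - term_correlation(docs, t, k)
        total += 1.0 - miss  # 1 - prod(1 - c_tk) over the document's terms
    return total / len(cluster_terms)


def cluster(docs, vocab):
    """Return the representative terms per cluster and a label per document."""
    terms = representative_terms(first_principal_component(docs, vocab), vocab)
    labels = []
    for d in docs:
        degrees = [membership(docs, d, ts) for ts in terms]
        labels.append(degrees.index(max(degrees)))
    return terms, labels


terms, labels = cluster(DOCS, VOCAB)
print(terms)
print(labels)
```

The sign of a principal component is arbitrary, so which topic becomes cluster 0 depends on the starting vector. Separating more than two clusters would need further components, which this sketch does not attempt.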
