Document Clustering with Relational Graph of Common Phrase and Suffix Tree Document Model

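  • 조윤호 (Division of Computer and Communication Engineering, College of Information and Communications, Korea University) ;
  • 이상근 (Division of Computer and Communication Engineering, College of Information and Communications, Korea University)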
  • Published : 2009.02.28

Abstract

The existing document clustering method NSTC measures the similarity between pairs of documents using TF-IDF when clustering web documents. In this paper, we propose a new similarity measure based on a relational graph of common phrases rather than on TF-IDF. The method assigns a weight to each common phrase using a relational graph that represents the relationships among the common phrases in the document collection. Experimental results indicate that the proposed measure clusters document collections more effectively than NSTC.
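
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares two ways of weighting the phrases that documents share before computing a cosine similarity: standard TF-IDF, and a weight read off a phrase graph. The abstract does not say how the relational graph is built or how weights are derived from it, so the graph here is an assumption: two phrases are linked when they occur in the same document, and a phrase's weight is one plus its number of neighbours. The function names `tfidf_weights`, `graph_weights`, and `cosine` are hypothetical, and phrase extraction with a suffix tree is skipped (each document is given as a list of phrases).

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_weights(docs):
    """Return one {phrase: tf * idf} dict per document (idf = log(N / df))."""
    n = len(docs)
    df = Counter(p for d in docs for p in set(d))
    return [{p: tf * math.log(n / df[p]) for p, tf in Counter(d).items()}
            for d in docs]


def graph_weights(docs):
    """Assumed scheme, not from the paper: link phrases that co-occur in a
    document and weight each phrase by 1 + its degree in that graph."""
    neighbours = {}
    for d in docs:
        phrases = sorted(set(d))
        for p in phrases:
            neighbours.setdefault(p, set())
        for a, b in combinations(phrases, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    global_w = {p: 1 + len(ns) for p, ns in neighbours.items()}
    return [{p: tf * global_w[p] for p, tf in Counter(d).items()}
            for d in docs]


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = [
        ["suffix tree", "document clustering"],
        ["suffix tree", "web search"],
        ["image retrieval"],
    ]
    t, g = tfidf_weights(docs), graph_weights(docs)
    print("tf-idf similarity:", cosine(t[0], t[1]))
    print("graph similarity: ", cosine(g[0], g[1]))
```

In this toy collection the shared phrase "suffix tree" is down-weighted by TF-IDF because it appears in two of the three documents, while the assumed graph scheme up-weights it because it is connected to more phrases. Which behaviour helps clustering is an empirical question; the abstract reports that the paper's graph-based weighting outperformed NSTC in its experiments.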
