Text Categorization Using TextRank Algorithm

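  • Wonsik Bae (Department of Computer Engineering, Changwon National University);
  • Jeongwon Cha (Department of Computer Engineering, Changwon National University)
  • Published: 2010.01.15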

Abstract

This paper describes a text categorization method based on the TextRank algorithm. Text categorization is the task of assigning one or more predefined categories to a text document. TextRank is a graph-based ranking algorithm. Treating each word that appears in a document as a vertex, and creating an edge from the co-occurrence of two adjacent words, yields a graph for the document. We apply TextRank to this graph to select the words with the highest importance, then pair each important word with a word adjacent to it and use each pair as a single feature for classification. Our hypothesis is that such co-occurrence features (pairs of adjacent words) make the meaning of a single word more specific and should therefore serve as good features for text categorization. As classifiers we use a support vector machine (SVM), a naïve Bayesian classifier, a maximum entropy model, and a k-NN classifier. In experiments on the non-cross-posted version of the 20 Newsgroups data set, the proposed method improved categorization performance with every classifier, which suggests that the TextRank algorithm is a promising tool for text categorization.
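The feature extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`build_graph`, `textrank`, `cooccurrence_features`), the adjacency window, the damping factor of 0.85, the fixed iteration count, and the `top_k` cutoff are all assumptions made here, and the abstract does not specify tokenization or any other preprocessing.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Build an undirected co-occurrence graph (hypothetical helper).

    Two distinct words are linked when they appear fewer than `window`
    positions apart; window=2 links only directly adjacent words.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Iterate the unweighted TextRank update on an undirected graph:

    S(v) = (1 - d) + d * sum over neighbors u of S(u) / degree(u)
    """
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


def cooccurrence_features(tokens, top_k=2, window=2):
    """Return TextRank scores and the adjacent word pairs that contain
    at least one of the `top_k` highest-ranked words."""
    graph = build_graph(tokens, window)
    scores = textrank(graph)
    # Ties are broken alphabetically so the selection is deterministic.
    important = set(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    features = set()
    for left, right in zip(tokens, tokens[1:]):
        if left != right and (left in important or right in important):
            features.add((left, right))
    return scores, features


if __name__ == "__main__":
    text = "the graphics card driver crashed after the graphics card update"
    scores, features = cooccurrence_features(text.split(), top_k=2)
    for pair in sorted(features):
        print(pair)
```

Each resulting pair would then be treated as one feature in whatever document representation is fed to the classifier. Ordered pairs are kept here for simplicity; whether the pairs should be ordered, and how many important words to keep per document, are design choices the abstract leaves open.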
