
Text Categorization Using TextRank Algorithm  

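Bae, Won-Sik (Department of Computer Engineering, Changwon National University)
Cha, Jeong-Won (Department of Computer Engineering, Changwon National University)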
Abstract
We describe a new method for text categorization that uses the TextRank algorithm. Text categorization is the task of assigning one or more predefined categories to a text document. TextRank is a graph-based ranking algorithm. If we treat each word as a vertex and the co-occurrence of two adjacent words as an edge, we can build a graph from a document. We then use TextRank to find the important words in the graph and construct features as word pairs, each consisting of an important word and a word adjacent to it. We evaluate four classifiers: SVM, a Naïve Bayesian classifier, a Maximum Entropy Model, and a k-NN classifier. The experiments use the non-cross-posted version of the 20 Newsgroups data set. Performance improved for all classifiers, which suggests that the TextRank algorithm is promising for text categorization.
Keywords
TextRank algorithm; Text Categorization; Co-occurrence Word; SVM; Naïve Bayesian classifier; Maximum Entropy Model; 20 Newsgroups data set