[KSCI] Korea Science Citation Index Service

Text Categorization Using TextRank Algorithm

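Bae, Won-Sik (Department of Computer Engineering, Changwon National University)
Cha, Jeong-Won (Department of Computer Engineering, Changwon National University)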

Publication Information

Journal of KIISE: Computing Practices and Letters, v.16, no.1, 2010, pp. 110-114

Abstract

We describe a new method for text categorization using the TextRank algorithm. Text categorization is the problem of assigning one or more predefined categories to a text document. TextRank is a graph-based ranking algorithm. If we treat each word as a vertex and the co-occurrence of two adjacent words as an edge, we can build a graph from a document. We then use TextRank to find the important words in the graph and construct features that are pairs of words, each consisting of an important word and a word adjacent to it. We use four classifiers: SVM, Naïve Bayesian classifier, Maximum Entropy Model, and k-NN classifier. We use the non-cross-posted version of the 20 Newsgroups data set. As a result, performance improved with all classifiers, which suggests that the TextRank algorithm has potential for text categorization.
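The feature construction outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the damping factor of 0.85, the fixed iteration count, the `top_k` cutoff, and the (important word, neighbor) ordering of each pair are all assumptions made for illustration, and tokenization and the classifiers are left out.

```python
from collections import defaultdict


def build_graph(tokens):
    """Each distinct word is a vertex; two adjacent words share an undirected edge."""
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style scoring on an unweighted, undirected graph.

    damping and iterations are assumed values, not taken from the paper.
    """
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[word])
            for word in graph
        }
    return scores


def pair_features(tokens, top_k=2):
    """Pair each of the top_k highest-scoring words with each adjacent word."""
    graph = build_graph(tokens)
    scores = textrank(graph)
    important = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return {(word, neighbor) for word in important for neighbor in graph[word]}
```

Calling `pair_features(tokens, top_k=1)` on a tokenized document returns the set of pairs built around the single highest-scoring word; such pairs could then be fed to any of the classifiers named above as features.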

Keywords

TextRank algorithm; Text Categorization; Co-occurrence Word; SVM; Naïve Bayesian classifier; Maximum Entropy Model; 20 Newsgroups data set

References

1	D. D. Lewis, "Naive (bayes) at forty: The independence assumption in information retrieval," Proc. of the 10th European Conference on Machine Learning, pp.4-15, 1998.
2	T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. of the 10th European Conference on Machine Learning, pp.137-142, 1998.
3	Y. Yang, "Expert network: Effective and efficient learning from human decisions in text categorization and retrieval," Proc. of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.13-22, 1994.
4	W. Bae, Y. Han, and J. Cha, "Text Categorization using Topic Signature and Co-occurrence Features," Proc. of the KIISE Korea Computer Congress 2008, vol.35, no.1, pp.262-267, 2008. (in Korean)
5	Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol.1, no.1-2, pp.69-90, 1996.
6	Y. Yoon, C. Lee, and G. G. Lee, "Hierarchical text categorization using support vector machine," Proc. of the 15th Human and Cognitive Language Technology, pp.1-8, 2003. (in Korean)
7	R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "On feature distributional clustering for text categorization," Proc. of 24th Annual International ACM SIGIR Conference, pp.146-153, 2001.
8	S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol.30, pp.107-117, 1998.
9	S. Tan, "Using Error-Correcting Output Codes with Model-Refinement to Boost Centroid Text Classifier," Proc. of the ACL 2007 Demo and Poster Sessions, pp.81-84, 2007.
10	K. Pearson, "On the theory of contingency and its relation to association and normal correlation," In Karl Pearson's early statistical papers, Cambridge: Cambridge University Press, pp.443-475, 1904/1948.
11	R. Mihalcea and P. Tarau, "TextRank: Bringing Order into Texts," Proc. of the Conference on Empirical Methods in Natural Language Processing 2004, pp.404-411, 2004.
12	A. K. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," Proc. of the AAAI-98 Workshop on Learning for Text Categorization, pp.41-48, 1998.
13	A. K. McCallum, "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," http://www.cs.cmu.edu/~mccallum/bow/, 1996.
14	Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," Proc. of the 14th International Conference on Machine Learning, pp.412-420, 1997.
15	K. Nigam, J. Lafferty, and A. K. McCallum, "Using Maximum Entropy for Text Categorization," Proc. of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pp.61-67, 1999.
16	Y. Yoon and G. G. Lee, "Efficient implementation of associative classifiers for document classification," Information Processing and Management, vol.43, pp.393-405, 2007.
17	C. Y. Lin and E. Hovy, "The Automated Acquisition of Topic Signatures for Text Summarization," Proc. of the 18th International Conference on Computational Linguistics, pp.495-500, 2000.
18	K. Lang, "The 20 Newsgroups data set," http://people.csail.mit.edu/~jrennie/20Newsgroups
19	A. Gliozzo and C. Strapparava, "Domain Kernels for Text Categorization," Proc. of the 9th Conference on Computational Natural Language Learning, pp.56-63, 2005.
20	D. D. Lewis, "The Reuters-21578 data set," http://www.daviddlewis.com/resources/testcollections/reuters21578
