http://dx.doi.org/10.5391/JKIIS.2015.25.4.386

Comparison Between Optimal Features of Korean and Chinese for Text Classification  

Ren, Mei-Ying (Dept. of Computer & Information Engineering, Daegu University)
Kang, Sinjae (School of Computer & Information Technology, Daegu University)
Publication Information
Journal of the Korean Institute of Intelligent Systems / v.25, no.4, 2015, pp. 386-391
Abstract
This paper proposes optimal features for text classification based on the linguistic characteristics of Korean and Chinese. The experiments compare n-grams, which are known to be language independent, morphemes, which are language dependent, and combined feature sets consisting of n-grams and morphemes, to determine which yields the best results. An SVM classifier and Internet news articles were used for text classification. For Korean text categorization, the bi-gram was the best feature, with the highest F1-measure of 87.07%; for Chinese document classification, the combined feature set 'uni-gram+noun+verb+adjective+idiom' performed best, with the highest F1-measure of 82.79%.
Keywords
Chinese Text Classification; Korean Text Classification; Information Gain; SVM Classifier; Feature Selection;
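As a rough illustration of the feature types named in the abstract and keywords, the sketch below extracts character n-grams and scores one of them with information gain. It is not the authors' implementation: the function names, the choice to ignore whitespace, the binary presence/absence feature model, and the four toy documents with their labels are all assumptions made for this example, and the SVM training step is left out.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams of text (whitespace ignored, an assumption)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def entropy(counts):
    """Shannon entropy in bits of a collection of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, feature):
    """H(C) - H(C | feature present or absent), with each doc given as a set of features."""
    base = entropy(Counter(labels).values())
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    conditional = 0.0
    for part in (present, absent):
        if part:
            conditional += len(part) / len(labels) * entropy(Counter(part).values())
    return base - conditional


# Hypothetical toy corpus, not data from the paper.
texts = ["주가 상승", "주가 하락", "축구 경기", "야구 경기"]
labels = ["economy", "economy", "sports", "sports"]
docs = [set(char_ngrams(t, 2)) for t in texts]

# Rank every bi-gram by information gain, highest first.
vocabulary = sorted(set().union(*docs))
ranked = sorted(vocabulary, key=lambda f: information_gain(docs, labels, f), reverse=True)
```

In this toy corpus, bi-grams that occur in every document of one class and in no document of the other receive the maximum score of 1 bit, so they sort to the front of `ranked`. A feature-selection step of this kind would keep only the top-ranked features before training a classifier.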
Citations & Related Records
Times Cited By KSCI: 9