http://dx.doi.org/10.7232/JKIIE.2014.40.1.018

KR-WordRank : An Unsupervised Korean Word Extraction Method Based on WordRank  

Kim, Hyun-Joong (Dept. of Industrial Engineering, Seoul National University)
Cho, Sungzoon (Dept. of Industrial Engineering, Seoul National University)
Kang, Pilsung (Dept. of Industrial and Information Systems Engineering, Seoul National University of Science and Technology)
Publication Information
Journal of Korean Institute of Industrial Engineers / v.40, no.1, 2014, pp. 18-33
Abstract
A word is the smallest unit for text analysis, and the premise behind most text-mining algorithms is that the words in given documents can be perfectly recognized. However, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining enough training data to cover every situation is both unrealistic and inefficient. Therefore, an automatic word extraction method that does not require a training process is strongly needed. WordRank, the most widely used unsupervised word extraction algorithm for Chinese and Japanese, performs poorly on Korean because of differences in language structure. In this paper, we first discuss why WordRank performs poorly on Korean and then propose KR-WordRank, a WordRank algorithm customized for Korean that takes the language's linguistic characteristics into account and is more robust to noise in text documents. Experimental results show that KR-WordRank performs significantly better than the original WordRank on Korean. In addition, the proposed algorithm not only extracts proper words but also identifies candidate keywords for effective document summarization.
Keywords
Word Extraction; Keyword Extraction; Text Mining; Unsupervised Learning; WordRank