KR-WordRank : An Unsupervised Korean Word Extraction Method Based on WordRank

  • Kim, Hyun-Joong (Dept. of Industrial Engineering, Seoul National University) ;
  • Cho, Sungzoon (Dept. of Industrial Engineering, Seoul National University) ;
  • Kang, Pilsung (Dept. of Industrial and Information Systems Engineering, Seoul National University of Science and Technology)
  • Received : 2013.11.18
  • Accepted : 2014.01.09
  • Published : 2014.02.15

Abstract

A word is the smallest unit of text analysis, and most text-mining algorithms assume that the words in the given documents can be recognized perfectly. In practice, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining enough training data to cover every situation is both unrealistic and inefficient. An automatic word extraction method that requires no training process is therefore needed. WordRank, the most widely used unsupervised word extraction algorithm for Chinese and Japanese, performs poorly on Korean because of differences in language structure. In this paper, we first discuss why WordRank performs poorly on Korean and then propose KR-WordRank, a WordRank variant customized for Korean that takes the language's linguistic characteristics into account and is more robust to noise in text documents. Experimental results show that KR-WordRank performs significantly better than the original WordRank on Korean. In addition, the proposed algorithm not only extracts proper words but also identifies candidate keywords for effective document summarization.
