Korean Named Entity Recognition and Classification using Word Embedding Features

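  • 최윤수 (Department of Eco-Friendly Offshore Plant FEED Engineering, Changwon National University);
  • 차정원 (Department of Computer Engineering, Changwon National University)
  • Received: 2016.02.15
  • Reviewed: 2016.04.06
  • Published: 2016.06.15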

Abstract

Named Entity Recognition and Classification (NERC) is the task of recognizing named entities, such as person names, locations, and organizations, and classifying them by type. Korean NERC has been studied extensively, but it suffers from a shortage of features compared with English NERC. To address this feature shortage, we propose a method that uses word embeddings as features for Korean NERC. We train a Continuous Bag-of-Words (CBOW) model on a POS-tagged corpus to generate word vectors, and apply the K-means algorithm to those vectors to generate word cluster symbols. The word vectors and cluster symbols are then used as word embedding features in Conditional Random Fields (CRFs). In our experiments, performance improved over the baseline system by 1.17%, 0.61%, and 1.19% in the TV, Sports, and IT domains, respectively. The proposed method also outperforms other NERC systems, which demonstrates its effectiveness.
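The feature pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The word vectors are made-up two-dimensional toy values standing in for CBOW output, the romanized tokens and the K-means initialization (first k vectors as centroids) are assumptions, and the feature names (`vec_0`, `cluster`, `prev_cluster`) are hypothetical. The sketch only shows how word vectors and K-means cluster IDs might be turned into per-token feature dictionaries of the kind a CRF toolkit could consume.

```python
# Toy 2-dimensional "word vectors" keyed by token/POS-tag. In the paper these
# would come from a CBOW model trained on a POS-tagged corpus; the values and
# romanized tokens here are invented for illustration only.
WORD_VECTORS = {
    "Seoul/NNP": [0.9, 0.1],
    "Samsung/NNP": [0.1, 0.9],
    "Busan/NNP": [0.8, 0.2],
    "Naver/NNP": [0.2, 0.8],
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain K-means; the first k vectors serve as initial centroids (assumption)."""
    words = list(vectors)
    centroids = [list(vectors[w]) for w in words[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each word to its nearest centroid.
        for w in words:
            assignment[w] = min(
                range(k), key=lambda c: squared_distance(vectors[w], centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def token_features(tokens, i, vectors, clusters):
    """Build a feature dict for token i: vector dimensions plus cluster symbols."""
    token = tokens[i]
    features = {"bias": 1.0, "token": token}
    if token in vectors:
        for d, value in enumerate(vectors[token]):
            features[f"vec_{d}"] = value  # real-valued word vector feature
        features["cluster"] = f"C{clusters[token]}"  # discrete cluster symbol
    else:
        features["cluster"] = "C_UNK"  # out-of-vocabulary token
    if i > 0:
        prev = tokens[i - 1]
        features["prev_cluster"] = f"C{clusters[prev]}" if prev in clusters else "C_UNK"
    return features


clusters = kmeans(WORD_VECTORS, k=2)
sentence = ["Seoul/NNP", "eseo/JKB", "Samsung/NNP"]
features = [
    token_features(sentence, i, WORD_VECTORS, clusters) for i in range(len(sentence))
]
for f in features:
    print(f)
```

The cluster symbol gives the CRF a coarse, discrete view of the embedding space, while the raw vector dimensions keep the finer-grained information. Whether and how the two were combined or scaled in the actual system is not stated in the abstract.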

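Project Information

Research project: WiseKB: Development of Self-Learning Knowledge Base and Inference Technology Based on Big Data Understanding

Lead institution: Institute for Information & communications Technology Promotion (IITP)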
