Feature Generation of Dictionary for Named-Entity Recognition based on Machine Learning

  • Kim, Jae-Hoon (Dept. of Computer Engineering, Korea Maritime University) ;
  • Kim, Hyung-Chul (Dept. of Computer Engineering, Korea Maritime University) ;
  • Choi, Yun-Soo (Dept. of Information Technology Research, KISTI)
  • Submitted : 2010.03.08
  • Reviewed : 2010.03.26
  • Published : 2010.04.30

Abstract

Named-entity recognition (NER), a step in information extraction, is widely used in information retrieval as well as in question answering and summarization. Unlike ordinary words, named entities (NEs) are continually created and changed in documents such as Web pages and newspapers. This causes an unknown-word problem in many application systems that rely on NER. To alleviate the problem, this paper proposes a new feature generation method for machine learning-based NER. Features in machine learning-based NER are generally defined over words, whereas entries in named-entity dictionaries are often phrases, so dictionary entries cannot be used directly as features. The proposed encoding scheme converts phrase-level entries into word-level features. With this scheme, both named-entity dictionaries and the semantic information in WordNet can be used as features for NER. In our experiments on English NER, the F1 score improved by about 6% and errors were reduced by about 38%.
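
The abstract does not spell out the encoding scheme, so the following is only a minimal sketch of one common way to turn phrase-level dictionary entries into word-level features: a greedy longest-match lookup that emits a B-/I-/O tag per token. The function name `dictionary_features`, the `max_len` parameter, the tag format, and the toy dictionary are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def dictionary_features(
    tokens: List[str],
    dictionary: Dict[Phrase, str],
    max_len: int = 5,
) -> List[str]:
    """Return one dictionary feature per token (hypothetical B-/I-/O encoding).

    A phrase found in the dictionary marks its first token as "B-<label>"
    and the remaining tokens as "I-<label>"; unmatched tokens get "O".
    """
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(t.lower() for t in tokens[i:i + n])
            label = dictionary.get(phrase)
            if label is not None:
                feats[i] = "B-" + label
                for j in range(i + 1, i + n):
                    feats[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return feats


# Toy dictionary (assumed for illustration): lowercase phrase -> entity type.
toy_dictionary = {
    ("korea", "maritime", "university"): "ORG",
    ("busan",): "LOC",
}
sentence = ["He", "studies", "at", "Korea", "Maritime", "University", "in", "Busan"]
for token, feat in zip(sentence, dictionary_features(sentence, toy_dictionary)):
    print(token, feat)
```

Each token ends up with exactly one feature value, so the output can be fed to a word-level sequence labeler alongside other per-token features. The same lookup could in principle be driven by WordNet-derived phrase lists instead of a named-entity dictionary.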
