Machine-Learning Based Biomedical Term Recognition

Automatic Recognition of Biomedical Technical Terms Based on Machine Learning

  • Published : 2006.08.01

Abstract

There has been increasing interest in automatic term recognition (ATR), which recognizes technical terms in domain-specific texts. ATR is composed of 'term extraction', which extracts candidate technical terms, and 'term selection', which decides whether each term in the list produced by 'term extraction' is a technical term or not. 'Term selection' ranks the term list according to features of technical terms and then finds the boundary between technical terms and general terms. Previous work has used only statistical features of terms for 'term selection'. However, statistical features alone are of limited use in selecting technical terms effectively from a term list. The objective of this paper is to find effective features for 'term selection' by considering various aspects of technical terms. To solve the ranking problem, we derive various features of technical terms and combine them using machine-learning algorithms. To solve the boundary-finding problem, we define it as a binary classification problem that classifies each term in the term list as either a technical term or a general term. Experiments show that our method achieves 78%-86% precision and 87%-90% recall in boundary finding, and 89%-92% 11-point precision in ranking. Moreover, our method outperforms previous work by up to about 26%.

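There is growing interest in research on automatically recognizing the technical terms that reflect the characteristics of a given domain in documents from that domain. 'Term recognition' consists of a 'term extraction' step, which identifies the linguistic units in a document that could be technical terms, and a 'term selection' step, which picks the technical terms of the domain from the term list obtained in 'term extraction'. 'Term selection' is defined as ranking the term list according to the features of technical terms and then identifying the valid technical terms. The term selection problem is therefore defined as two tasks: ranking the term list, and recognizing the boundary between technical and non-technical terms in the ranked list. Existing term selection methods have mainly used only statistical features such as term frequency. However, it is difficult to select technical terms effectively with statistical features alone. The aim of this paper is to consider a variety of technical-term features in term selection and to find those that are effective. The ranking problem is solved by deriving various technical-term features and integrating them with machine-learning methods. The boundary recognition problem is defined as a binary classification between technical and non-technical terms and is solved with machine-learning methods. Our method achieved 78%-86% precision and 87%-90% recall in boundary recognition, and 89%-92% 11-point average precision in ranking. It also showed a performance improvement of up to 26% over previous work.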
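
The two kinds of evaluation figures reported in the abstract (precision and recall for boundary finding, 11-point precision for ranking) can be illustrated with a minimal sketch. It assumes that "11-point precision" means the standard 11-point interpolated average precision from information retrieval; the abstract does not spell out the formula. The function names `precision_recall` and `eleven_point_precision` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def precision_recall(predicted: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of a binary technical-term / general-term decision.

    predicted[i] is the classifier's decision for term i (True = technical term),
    gold[i] is the reference label for the same term.
    """
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def eleven_point_precision(ranked_gold: Sequence[bool]) -> float:
    """11-point interpolated average precision of a ranked term list.

    ranked_gold[i] is True if the term at rank i + 1 is a technical term.
    Assumption: this is the standard IR definition, which may differ in
    detail from the measure used in the paper.
    """
    total_relevant = sum(ranked_gold)
    if total_relevant == 0:
        return 0.0

    # (recall, precision) measured after each rank position.
    points: List[Tuple[float, float]] = []
    hits = 0
    for rank, is_term in enumerate(ranked_gold, start=1):
        if is_term:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at recall level r is the highest precision
    # observed at any recall >= r; average it over r = 0.0, 0.1, ..., 1.0.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


if __name__ == "__main__":
    # Toy ranked list: technical terms at ranks 1 and 3.
    print(eleven_point_precision([True, False, True, False]))
    # Toy boundary decision over four terms.
    print(precision_recall([True, True, False, False], [True, False, True, False]))
```

In this sketch the ranking measure depends only on the order of the term list, while precision and recall depend only on where the technical/general boundary is drawn, which mirrors the abstract's split of 'term selection' into a ranking problem and a boundary-finding problem.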
