Machine-Learning Based Biomedical Term Recognition

Oh Jong-Hoon; Choi Key-Sun

Journal of KIISE: Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 33, Issue 8, pp. 718-729, 2006, pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

Oh Jong-Hoon (Expert Researcher, Computational Linguistics Group, NICT);
Choi Key-Sun (Department of Computer Science, KAIST)

Published: 2006.08.01

Abstract

Interest in automatic term recognition (ATR), which identifies the technical terms of a domain in domain-specific texts, has been growing. ATR consists of two steps: 'term extraction', which extracts the linguistic units that are candidates for technical terms, and 'term selection', which decides which entries in the resulting term list are technical terms of the domain. Term selection in turn comprises two tasks: ranking the term list according to features of technical terms, and finding the boundary between technical terms and general terms in the ranked list. Previous work on term selection has relied mainly on statistical features of terms, such as term frequency. Statistical features alone, however, are of limited use for selecting technical terms effectively. The objective of this paper is to find effective features for term selection by considering various aspects of technical terms. To solve the ranking problem, we derive various features of technical terms and combine them using machine-learning algorithms. We define the boundary-finding problem as a binary classification problem, in which each term in the term list is labeled as either a technical term or a general term, and solve it with machine learning as well. In experiments, our method achieves 78-86% precision and 87-90% recall in boundary finding, and 89-92% 11-point average precision in ranking. It also outperforms previous work by up to about 26%.
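The two evaluation settings named in the abstract can be illustrated with a minimal Python sketch. The abstract does not define its metrics, so the code assumes the standard information-retrieval definitions: precision and recall over binary labels for boundary finding, and the mean of interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for 11-point average precision. The function names and the toy input format are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Precision and recall of a boundary decision (True = technical term)."""
    # A true positive is a term labeled technical by both prediction and gold.
    tp = sum(p and g for p, g in zip(predicted, gold))
    n_pred = sum(predicted)
    n_gold = sum(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


def eleven_point_average_precision(ranked_gold: List[bool]) -> float:
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_gold[i] is True when the term at rank i (best first) is a
    technical term according to the gold standard (assumed input format).
    """
    total = sum(ranked_gold)
    if total == 0:
        return 0.0
    # Record (recall, precision) after each position in the ranking.
    points = []
    hits = 0
    for rank, is_term in enumerate(ranked_gold, start=1):
        hits += is_term
        points.append((hits / total, hits / rank))
    # Interpolated precision at a recall level is the best precision
    # observed at that recall or any higher recall.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)
```

Under these assumptions, a ranking that places every technical term above every general term scores 1.0, and the score drops as general terms move up the list. The boundary metric is independent of the ranking metric: it only compares the predicted technical/general labels with the gold labels.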
