http://dx.doi.org/10.3745/KIPSTD.2011.18D.5.329

Terminology Recognition System based on Machine Learning for Scientific Document Analysis  

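Choi, Yun-Soo (Korea Institute of Science and Technology Information)
Song, Sa-Kwang (Korea Institute of Science and Technology Information)
Chun, Hong-Woo (Korea Institute of Science and Technology Information)
Jeong, Chang-Hoo (Korea Institute of Science and Technology Information)
Choi, Sung-Pil (Korea Institute of Science and Technology Information)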
Abstract
Terminology recognition, a prerequisite for text mining, information extraction, information retrieval, the semantic web, and question answering, has been studied intensively in only a limited range of domains, most notably the biomedical domain. Earlier approaches are difficult to apply to general domains because the resources they rely on are domain-specific. We therefore propose a domain-independent terminology recognition system based on machine learning that uses a dictionary, syntactic features, and Web search results. The proposed approach achieved an F-score of 80.8, a 6.5% improvement over C-value, a widely used related approach based on local domain frequencies. In a second experiment with various combinations of unithood features, the combination that included NGD (Normalized Google Distance) performed best, with an F-score of 81.8. Of the three machine learning methods we applied (logistic regression, C4.5, and SVMs), the decision tree method, C4.5, gave the best score.
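As an illustration of the NGD unithood feature named in the abstract, the sketch below computes the Normalized Google Distance from search hit counts, using the standard definition by Cilibrasi and Vitanyi. The function name, the argument names, and the handling of zero counts are assumptions made for this example. The abstract does not say how the system obtains or normalizes its counts, so this is not a description of the authors' implementation.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts.

    fx, fy : hit counts for each term searched alone
    fxy    : hit count for both terms searched together
    n      : assumed total number of indexed pages (a normalizing constant)

    Returns 0.0 when the terms always co-occur and grows as they co-occur
    less often. Returning infinity for zero counts is a choice made for
    this sketch.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

A low distance between the words of a candidate phrase suggests that they form a stable unit, which is the sense in which NGD can serve as a unithood feature.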
Keywords
Terminology Recognition; Text Mining; Machine Learning; Information Extraction;