[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.1633/JISTaP.2014.2.1.1

TAKES: Two-step Approach for Knowledge Extraction in Biomedical Digital Libraries

Song, Min (Department of Library and Information Science, Yonsei University)

Publication Information

Journal of Information Science Theory and Practice / v.2, no.1, 2014, pp. 6-21

Abstract

This paper proposes a novel knowledge extraction system, TAKES (Two-step Approach for Knowledge Extraction System), which integrates advanced techniques from Information Retrieval (IR), Information Extraction (IE), and Natural Language Processing (NLP). In particular, TAKES adopts a novel keyphrase extraction-based query expansion technique to collect promising documents, and it uses a Conditional Random Field (CRF)-based machine learning technique to extract important biological entities and relations. TAKES is applied to biological knowledge extraction, specifically to retrieving promising documents that contain Protein-Protein Interactions (PPIs) and to extracting PPI pairs from them. TAKES consists of two major components: DocSpotter, which queries and retrieves promising documents for extraction, and FCRF, a CRF-based entity extraction component. The paper investigates research problems that arise in building a knowledge extraction system and reports a series of experiments conducted to test the author's hypotheses. The findings are as follows. First, using three different test collections to measure the performance of the query expansion technique, the author verified that DocSpotter is robust and highly accurate compared to Okapi BM25 and SLIPPER. Second, the author verified that the relation extraction algorithm, FCRF, is highly accurate in terms of F-measure compared to four other competitive extraction algorithms: Support Vector Machine, Maximum Entropy, Single POS HMM, and Rapier.
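The abstract reports extraction accuracy in terms of F-measure. As a minimal sketch of how precision, recall, and the balanced F-measure (F1) could be computed over extracted PPI pairs, the Python below compares a predicted set against a gold set. The function names, the example protein pairs, and the choice to treat a pair as unordered and case-insensitive are assumptions made for illustration; the abstract does not describe the paper's actual evaluation procedure or data.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(pairs: Iterable[Pair]) -> Set[FrozenSet[str]]:
    """Treat each interaction as an unordered, case-insensitive pair (an assumption)."""
    return {frozenset((a.lower(), b.lower())) for a, b in pairs}


def precision_recall_f1(predicted: Iterable[Pair],
                        gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted pairs against gold pairs."""
    pred = normalize(predicted)
    ref = normalize(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data, not taken from the paper.
gold_pairs = [("RAD51", "BRCA2"), ("TP53", "MDM2"), ("AKT1", "MTOR")]
predicted_pairs = [("BRCA2", "RAD51"), ("TP53", "MDM2"),
                   ("TP53", "EGFR"), ("MYC", "MAX")]

p, r, f = precision_recall_f1(predicted_pairs, gold_pairs)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this made-up example, two of the four predicted pairs match the gold set of three, so precision is 0.5, recall is about 0.667, and F1 is about 0.571.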

Keywords

Semantic Query Expansion; Information Extraction; Information Retrieval; Text Mining
