http://dx.doi.org/10.16981/kliss.47.4.201612.289

A Study on the Semiautomatic Construction of Domain-Specific Relation Extraction Datasets from Biomedical Abstracts - Mainly Focusing on a Genic Interaction Dataset in Alzheimer's Disease Domain -  

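Choi, Sung-Pil (Department of Library and Information Science, Kyonggi University)
Yoo, Suk-Jong (Biomedical Convergence Technology Research Laboratory, Korea Institute of Science and Technology Information)
Cho, Hyun-Yang (Department of Library and Information Science, Kyonggi University)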
Publication Information
Journal of Korean Library and Information Science Society / v.47, no.4, 2016, pp. 289-307
Abstract
This paper introduces a software system and process model for semi-automatically constructing domain-specific relation extraction datasets. The system takes a set of terms, such as genes, proteins, and diseases, as input and, by exploiting a large-scale biological interaction database, generates a set of term pairs. These pairs are then used as queries to retrieve sentences containing both terms from scientific databases. To assess the usefulness of the proposed system, the paper applies it to the construction of a genic interaction dataset for the Alzheimer's disease domain, extracting 3,510 interaction-related sentences from 140 gene names in that area. The results of this case study indicate that the system and process could substantially improve the efficiency of dataset construction in various subfields of biomedical research.
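The pipeline summarized in the abstract (seed terms, then term pairs drawn from an interaction database, then sentence retrieval) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' system: the function names, the placeholder gene symbols (`GENE1` and so on), the in-memory interaction list, and the toy sentences are all hypothetical. It also assumes that a pair is kept only when both members appear among the seed terms, which the abstract does not specify.

```python
import re

def candidate_pairs(seed_terms, known_interactions):
    """Return sorted, de-duplicated pairs whose members are both seed terms.

    Assumption: requiring BOTH members to be seeds is our choice for this
    sketch; the abstract does not say how pairs are filtered.
    """
    seeds = {t.upper() for t in seed_terms}
    pairs = set()
    for a, b in known_interactions:
        a, b = a.upper(), b.upper()
        if a != b and a in seeds and b in seeds:
            pairs.add(tuple(sorted((a, b))))  # treat (a, b) and (b, a) as one pair
    return sorted(pairs)

def mentions(term, sentence):
    """True if `term` occurs as a whole token (so GENE1 does not match GENE10)."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])"
    return re.search(pattern, sentence, re.IGNORECASE) is not None

def retrieve_sentences(pairs, sentences):
    """Collect (term_a, term_b, sentence) for sentences mentioning both terms."""
    return [(a, b, s) for a, b in pairs for s in sentences
            if mentions(a, s) and mentions(b, s)]

# Hypothetical toy inputs standing in for a gene list, an interaction
# database, and a sentence-split abstract collection.
seeds = ["GENE1", "GENE2", "GENE3"]
interactions = [("GENE1", "GENE2"), ("GENE2", "GENE1"),
                ("GENE3", "GENE4"), ("GENE1", "GENE3")]
sentences = [
    "GENE1 binds GENE2 in cultured neurons.",
    "GENE10 was co-expressed with GENE2.",
    "Levels of gene3 rose after GENE1 knockdown.",
    "GENE3 and GENE4 were unchanged.",
]

pairs = candidate_pairs(seeds, interactions)
hits = retrieve_sentences(pairs, sentences)
for a, b, s in hits:
    print(f"{a} / {b}: {s}")
```

Sentences retrieved this way are only candidates: co-occurrence of two terms does not by itself show that the sentence describes an interaction, which is presumably why the construction process is semi-automatic rather than fully automatic.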
Keywords
Relation extraction; Dataset construction; Genic interactions; Machine learning; Text mining