[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5391/IJFIS.2015.15.2.111

Protein Named Entity Identification Based on Probabilistic Features Derived from GENIA Corpus and Medical Text on the Web

Sumathipala, Sagara (Graduate School of Engineering, Nagaoka University of Technology)
Yamada, Koichi (Graduate School of Engineering, Nagaoka University of Technology)
Unehara, Muneyuki (Graduate School of Engineering, Nagaoka University of Technology)
Suzuki, Izumi (Graduate School of Engineering, Nagaoka University of Technology)

Publication Information

International Journal of Fuzzy Logic and Intelligent Systems / v.15, no.2, 2015, pp. 111-120

Abstract

Protein named entity identification is an essential prerequisite for extracting information about protein-protein interactions from biomedical literature. In this paper, we explore the use of abstracts of biomedical literature in MEDLINE for protein name identification and present the results of our experiments. We present a robust and effective approach that classifies biomedical named entities into protein and non-protein classes, based on a rich set of features: orthographic, keyword, morphological, and newly introduced Protein-Score features. Our procedure performs strongly in experiments on the GENIA corpus using Random Forest, achieving highest values of precision 92.7%, recall 91.7%, and F-measure 92.2% for protein identification, while significantly reducing training and testing time.
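
The reported F-measure can be checked against the reported precision and recall, assuming it is the balanced F1 score (the harmonic mean of the two), which the abstract does not state explicitly. The sketch below also shows what simple orthographic features for a candidate entity might look like. The function `orthographic_features` and its feature names are hypothetical illustrations, not the feature set used in the paper.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def orthographic_features(token):
    """Hypothetical orthographic features for one candidate entity string.

    These names are assumptions for illustration; the abstract does not
    list the paper's actual orthographic features.
    """
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_upper": any(c.isupper() for c in token),
        "has_hyphen": "-" in token,
        "length": len(token),
    }


# Precision and recall as reported in the abstract.
f1 = f_measure(0.927, 0.917)
print(round(f1 * 100, 1))  # 92.2, matching the reported F-measure

print(orthographic_features("IL-2"))
```

Under the F1 assumption, the three reported figures are mutually consistent. A feature dictionary like the one above could be turned into a numeric vector and passed to any Random Forest implementation.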

Keywords

biomedical text mining; named entity recognition; protein named entity; random forest
