An Active Co-Training Algorithm for Biomedical Named-Entity Recognition

  • Munkhdalai, Tsendsuren (Database/Bioinformatics Laboratory, Chungbuk National University)
  • Li, Meijing (Database/Bioinformatics Laboratory, Chungbuk National University)
  • Yun, Unil (Dept. of Computer Science, Chungbuk National University)
  • Namsrai, Oyun-Erdene (Dept. of Information Technology, Mongolian National University)
  • Ryu, Keun Ho (Database/Bioinformatics Laboratory, Chungbuk National University)
  • Received : 2012.02.13
  • Accepted : 2012.09.20
  • Published : 2012.12.31

Abstract

Exploiting unlabeled text together with a relatively small labeled corpus has become an active and challenging research topic in text mining, driven by the recent growth of the biomedical literature. Biomedical named-entity recognition is an essential prerequisite for effective text mining of that literature. This paper proposes an Active Co-Training (ACT) algorithm for biomedical named-entity recognition. ACT is a semi-supervised learning method in which two classifiers, each based on a different feature set, iteratively learn from informative examples queried from the unlabeled data. To measure the informativeness of an unlabeled example, we design a new classification problem in which examples are classified, based on a joint view of the feature sets, as informative or non-informative to both classifiers. To form the training data for this classification problem, we adopt a query-by-committee method: the two classifiers are treated as one committee, which is applied to the labeled data to assign an informativeness label to each example. ACT outperforms the traditional co-training algorithm in terms of both F-measure and the number of training iterations needed to build a good classification model. The proposed method tends to exploit a large amount of unlabeled data efficiently by selecting a small number of examples that carry not only useful information but also a comprehensive pattern.
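
The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions the abstract does not settle: the nearest-centroid learner `Centroid` stands in for whatever classifiers the paper uses, a labeled example is marked informative when the two committee members disagree or the committee is wrong, and queried examples receive their labels from a hypothetical `label_fn` callback (which could wrap an annotator or the classifiers' own predictions). The names `active_co_training`, `rounds`, and `batch` are likewise illustrative.

```python
from collections import defaultdict


class Centroid:
    """Tiny nearest-centroid learner; a stand-in for any real classifier."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] += 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def active_co_training(L1, L2, y, U1, U2, label_fn, rounds=3, batch=1):
    """L1/L2 are the two feature views of the labeled data (tuples), y their
    labels, U1/U2 the same two views of the unlabeled pool.
    label_fn(a, b) is a hypothetical callback that labels a queried example."""
    L1, L2, y = list(L1), list(L2), list(y)
    pool = list(zip(U1, U2))
    queried = []
    for _ in range(rounds):
        # One classifier per view, trained on the current labeled set.
        h1, h2 = Centroid().fit(L1, y), Centroid().fit(L2, y)
        # Committee step (assumed criterion): a labeled example is informative
        # if the two members disagree or the committee's answer is wrong.
        info = []
        for a, b, truth in zip(L1, L2, y):
            p1, p2 = h1.predict(a), h2.predict(b)
            info.append(p1 != p2 or p1 != truth)
        if len(set(info)) < 2:
            break  # only one informativeness class, nothing to learn from
        # Informativeness classifier on the joint view (both views concatenated).
        g = Centroid().fit([a + b for a, b in zip(L1, L2)], info)
        picked = [(a, b) for a, b in pool if g.predict(a + b)][:batch]
        if not picked:
            break  # no unlabeled example is predicted to be informative
        for a, b in picked:
            pool.remove((a, b))
            L1.append(a)
            L2.append(b)
            y.append(label_fn(a, b))
            queried.append((a, b))
    # Refit both view classifiers on the enlarged labeled set.
    return Centroid().fit(L1, y), Centroid().fit(L2, y), queried
```

Calling `active_co_training` with two lists of feature tuples per example returns the two refitted view classifiers and the list of examples that were queried. A real system would replace `Centroid` with stronger sequence labelers and would add a stopping criterion suited to the corpus.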
