SVM-based Protein Name Recognition using Edit-Distance Features Boosted by Virtual Examples

가상 예제와 Edit-distance 자질을 이용한 SVM 기반의 단백질명 인식

  • Yi, Eun-Ji (Department of Computer Science and Engineering, POSTECH)
  • Lee, Gary-Geunbae (Department of Computer Science and Engineering, POSTECH)
  • Park, Soo-Jun (Bioinformatics Research Team, Computer and Software Research Lab, ETRI)
  • Published : 2003.10.31

Abstract

In this paper, we address two of the main difficulties in named entity recognition in the biomedical domain: the large number of spelling variants and the lack of annotated corpora for training. To handle spelling variants, we propose using edit distance as a feature for an SVM. To address the shortage of annotated data, we propose using virtual examples to expand the annotated corpus automatically, which makes the expansion fast, efficient, and easy. The experimental results show that introducing the edit-distance feature yields some improvement in protein name recognition performance, and that the model trained on the corpus expanded with virtual examples outperforms the model trained on the original corpus. With the proposed methods, we achieve an F-measure of 75.80 (71.89% precision, 80.15% recall) in protein name recognition on the GENIA corpus (ver. 3.0).
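
As an illustration of how edit distance can absorb spelling variants, the following minimal sketch computes the Levenshtein distance between a candidate token and the entries of a small name list, then returns the smallest length-normalized distance as a numeric feature. The abstract does not say how the feature is encoded for the SVM, so the function names, the lowercasing, the normalization by the longer string, and the example lexicon are all assumptions made for illustration only.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def min_normalized_distance(token: str, lexicon: Iterable[str]) -> float:
    """Hypothetical feature: smallest edit distance from token to any
    lexicon entry, divided by the longer length (0.0 = exact match)."""
    token = token.lower()  # assumption: case-insensitive comparison
    best = 1.0
    for entry in lexicon:
        entry = entry.lower()
        longest = max(len(token), len(entry))
        if longest == 0:
            continue  # skip empty-vs-empty comparisons
        best = min(best, edit_distance(token, entry) / longest)
    return best


# Hypothetical lexicon of known protein names (not from the paper).
lexicon = ["IL-2", "NF-kappaB"]
feature = min_normalized_distance("IL2", lexicon)
```

Under these assumptions, a variant such as "IL2" lies one edit away from "IL-2" and so receives a small feature value, whereas an unrelated token stays close to 1.0. A value like this could be appended to the other token features given to the classifier.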

Keywords