A Study on the Identification and Classification of Relations Between Biotechnology Terms Using a Semantic Parse Tree Kernel

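  • 최성필 (Information Technology Research Lab, Korea Institute of Science and Technology Information) ;
  • 정창후 (Information Technology Research Lab, Korea Institute of Science and Technology Information) ;
  • 전홍우 (Information Technology Research Lab, Korea Institute of Science and Technology Information) ;
  • 조현양 (Department of Library and Information Science, Kyonggi University)
  • Received : 2011.04.13
  • Reviewed : 2011.05.13
  • Published : 2011.05.30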

Abstract

This paper proposes a semantic parse tree kernel for automatically extracting protein-protein interactions (PPIs). It extends the parse tree kernel, which has shown strong performance on this task in earlier work. The existing parse tree kernel compares the words at the leaf nodes of two parse trees only by their surface form, so two trees that are semantically similar can receive a relatively low kernel value, which can in turn degrade overall PPI extraction performance. The new kernel computes both the syntactic similarity and the lexical semantic similarity of two parse trees and combines them. To compute lexical semantic similarity, it uses a context- and WordNet-based word sense disambiguation system and abstracts the lexical concepts (WordNet synsets) that the system outputs into more general ones. In the experiments, to optimize PPI identification (PPII) and PPI classification (PPIC) performance more thoroughly, we introduced two parameters in addition to the conventional SVM regularization parameter: the decay factor of the parse tree kernel and the lexical abstraction factor of the semantic parse tree kernel. The results confirm that the decay factor plays an important role when applying the parse tree kernel and show experimentally that the semantic parse tree kernel can improve on the performance of the existing system. The kernel is particularly effective for interaction classification, which is more difficult than interaction identification.
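The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: it assumes a Collins and Duffy style convolution tree kernel with a decay factor, and it stands in for WordNet-based word sense disambiguation with a small hypothetical table, `HYPERNYMS`, that maps a word to a chain of increasingly general concepts. The names `abstract_word`, `tree_kernel`, `decay`, and `level` are illustrative only; `decay` and `level` play the roles of the decay factor and the lexical abstraction factor mentioned above.

```python
from itertools import product

# A tree is a nested tuple (label, child, ...); a leaf word is a plain string.
# HYPERNYMS is a hypothetical stand-in for WordNet hypernym chains obtained
# after word sense disambiguation (assumption, not taken from the paper).
HYPERNYMS = {
    "binds": ["attach", "touch"],
    "attaches": ["attach", "touch"],
    "inhibits": ["suppress", "control"],
}

def abstract_word(word, level):
    """Replace a word by the concept `level` steps up its chain (0 keeps it)."""
    chain = HYPERNYMS.get(word, [])
    if level == 0 or not chain:
        return word
    return chain[min(level, len(chain)) - 1]

def nodes(tree):
    """Yield every non-leaf node of a tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def production(node, level):
    """Node label plus its children; leaf words are abstracted before comparison."""
    kids = tuple(
        ("word", abstract_word(c, level)) if isinstance(c, str) else ("node", c[0])
        for c in node[1:]
    )
    return (node[0], kids)

def delta(n1, n2, decay, level):
    """Decay-weighted count of common subtrees rooted at n1 and n2."""
    if production(n1, level) != production(n2, level):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):  # equal productions imply c2 is a node too
            score *= 1.0 + delta(c1, c2, decay, level)
    return score

def tree_kernel(t1, t2, decay=0.5, level=0):
    """Sum delta over all pairs of nodes from the two trees."""
    return sum(delta(a, b, decay, level) for a, b in product(nodes(t1), nodes(t2)))

def normalized_kernel(t1, t2, decay=0.5, level=0):
    """Scale the kernel by the self-similarity of each tree."""
    k11 = tree_kernel(t1, t1, decay, level)
    k22 = tree_kernel(t2, t2, decay, level)
    return tree_kernel(t1, t2, decay, level) / (k11 * k22) ** 0.5

# Two sentences that differ only in a verb with a similar meaning.
s1 = ("S", ("NP", ("NN", "ProteinA")),
      ("VP", ("VBZ", "binds"), ("NP", ("NN", "ProteinB"))))
s2 = ("S", ("NP", ("NN", "ProteinA")),
      ("VP", ("VBZ", "attaches"), ("NP", ("NN", "ProteinB"))))

surface = tree_kernel(s1, s2, decay=0.5, level=0)   # surface-form comparison
semantic = tree_kernel(s1, s2, decay=0.5, level=1)  # one abstraction step
```

With `level=0` the two verbs are compared by surface form and the subtrees containing them do not match; with `level=1` both verbs map to the same hypothetical concept, so `semantic` is larger than `surface`. In a real system the table would be replaced by synsets from a disambiguation step, and `decay` and `level` would be tuned together with the SVM regularization parameter.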
