Using the PubAnnotation ecosystem to perform agile text mining on Genomics & Informatics: a tutorial review

  • Nam, Hee-Jo (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University) ;
  • Yamada, Ryota (Fuku Corporation) ;
  • Park, Hyun-Seok (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Received : 2020.03.25
  • Reviewed : 2020.06.01
  • Published : 2020.05.28

Abstract

The prototype version of the full-text corpus of Genomics & Informatics has recently been archived in a GitHub repository. The full-text publications of volumes 10 through 17 are also directly downloadable from PubMed Central (PMC) as XML files. During the Biomedical Linked Annotation Hackathon 6 (BLAH6), we experimented with converting, annotating, and updating 301 PMC full-text articles of Genomics & Informatics using PubAnnotation, a system that provides a convenient way to add PMC publications by PMCID. This review therefore provides a tutorial overview of practicing the iterative task of named entity recognition with the PubAnnotation/PubDictionaries/TextAE ecosystem. We also describe a conversion tool, developed during the hackathon, between the Genia tagger output and the JSON format of PubAnnotation.
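To make the conversion step concrete, the following is a minimal Python sketch of how Genia tagger output might be mapped to PubAnnotation-style JSON. It is not the tool written during the hackathon. It assumes the tagger emits one token per line as five tab-separated columns (word, base form, part-of-speech tag, chunk tag, named entity tag in IOB2 notation), that sentences are separated by blank lines, and that each token appears verbatim in the original text so character offsets can be recovered by searching forward. The function name genia_to_pubannotation and the sample sentence are hypothetical.

```python
import json


def genia_to_pubannotation(text, tagged):
    """Build a PubAnnotation-style dict from Genia tagger output for `text`.

    Assumption: every token in `tagged` occurs verbatim in `text`, in order.
    """
    denotations = []
    cursor = 0      # position in `text` from which the next token is searched
    current = None  # entity under construction: [begin, end, label]

    def flush():
        # Close the entity under construction, if any, and record it.
        nonlocal current
        if current is not None:
            begin, end, label = current
            denotations.append({
                "id": f"T{len(denotations) + 1}",
                "span": {"begin": begin, "end": end},
                "obj": label,
            })
            current = None

    for line in tagged.splitlines():
        if not line.strip():  # a blank line marks a sentence boundary
            flush()
            continue
        word, _base, _pos, _chunk, ne = line.split("\t")
        begin = text.index(word, cursor)  # ValueError if the token is absent
        end = begin + len(word)
        cursor = end
        if ne.startswith("B-"):
            flush()
            current = [begin, end, ne[2:]]
        elif ne.startswith("I-") and current is not None and current[2] == ne[2:]:
            current[1] = end  # extend the open entity to cover this token
        else:
            flush()  # "O", or an "I-" tag with no matching open entity
    flush()
    return {"text": text, "denotations": denotations}


# Hypothetical sample input, written by hand in the assumed column layout.
sample_text = "IL-2 gene expression requires NF-kappa B."
sample_tagged = "\n".join([
    "IL-2\tIL-2\tNN\tB-NP\tB-DNA",
    "gene\tgene\tNN\tI-NP\tI-DNA",
    "expression\texpression\tNN\tI-NP\tO",
    "requires\trequire\tVBZ\tB-VP\tO",
    "NF-kappa\tNF-kappa\tNN\tB-NP\tB-protein",
    "B\tB\tNN\tI-NP\tI-protein",
    ".\t.\t.\tO\tO",
])

doc = genia_to_pubannotation(sample_text, sample_tagged)
print(json.dumps(doc, indent=2))
```

The sketch keeps only the named entity column. The forward search with a moving cursor is a simplification: if a tagger rewrites characters during tokenization (quotation marks, for example), the lookup fails with an error rather than producing misaligned spans.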
