http://dx.doi.org/10.5808/GI.2020.18.2.e13

Using the PubAnnotation ecosystem to perform agile text mining on Genomics & Informatics: a tutorial review  

Nam, Hee-Jo (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
Yamada, Ryota (Fuku Corporation)
Park, Hyun-Seok (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
Abstract
The prototype version of the full-text corpus of Genomics & Informatics has recently been archived in a GitHub repository. The full-text publications of volumes 10 through 17 are also directly downloadable from PubMed Central (PMC) as XML files. During the Biomedical Linked Annotation Hackathon 6 (BLAH6), we experimented with converting, annotating, and updating 301 PMC full-text articles of Genomics & Informatics using PubAnnotation, a system that provides a convenient way to add PMC publications by PMCID. This review therefore provides a tutorial overview of performing named entity recognition as an iterative task with the PubAnnotation/PubDictionaries/TextAE ecosystem. We also describe a conversion tool, developed during the hackathon, that maps GENIA tagger output to the JSON format of PubAnnotation.
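To illustrate the kind of conversion mentioned above, the following is a minimal sketch and not the tool developed at the hackathon. It assumes that the tagger output has one token per line with five tab-separated columns (word, base form, part-of-speech tag, chunk tag, named entity tag in B-/I-/O notation) and a blank line between sentences, and that the target is a JSON object with a "text" field and a "denotations" list of id/span/obj entries. The function name genia_to_pubannotation and the sample sentence are hypothetical, and the sketch further assumes that every token appears verbatim in the source text, which may not hold when a tokenizer rewrites characters such as quotation marks.

```python
import json


def genia_to_pubannotation(text, tagger_output):
    """Map B-/I-/O tagged tokens onto character spans of `text`.

    Hypothetical helper: assumes five tab-separated columns per token line
    and that each token occurs verbatim in `text`, in order.
    """
    denotations = []
    cursor = 0       # offset in `text` up to which tokens have been aligned
    current = None   # entity being built, as [begin, end, label]

    def close_current():
        nonlocal current
        if current is not None:
            begin, end, label = current
            denotations.append({
                "id": f"T{len(denotations) + 1}",
                "span": {"begin": begin, "end": end},
                "obj": label,
            })
            current = None

    for line in tagger_output.splitlines():
        if not line.strip():
            close_current()  # a blank line ends the sentence and any entity
            continue
        word, _base, _pos, _chunk, ne = line.split("\t")
        begin = text.find(word, cursor)
        if begin < 0:
            raise ValueError(f"token {word!r} not found after offset {cursor}")
        end = begin + len(word)
        cursor = end

        if ne.startswith("I-") and current is not None and current[2] == ne[2:]:
            current[1] = end  # extend the open entity to cover this token
        else:
            close_current()
            if ne != "O":
                # B- tag, or a stray I- tag that is treated as an entity start
                current = [begin, end, ne[2:]]
    close_current()
    return {"text": text, "denotations": denotations}


# Hypothetical sample input, written by hand for this sketch.
SAMPLE_TEXT = "IL-2 gene expression requires NF-kappa B."
SAMPLE_TAGGED = "\n".join([
    "IL-2\tIL-2\tNN\tB-NP\tB-DNA",
    "gene\tgene\tNN\tI-NP\tI-DNA",
    "expression\texpression\tNN\tI-NP\tO",
    "requires\trequire\tVBZ\tB-VP\tO",
    "NF-kappa\tNF-kappa\tNN\tB-NP\tB-protein",
    "B\tB\tNN\tI-NP\tI-protein",
    ".\t.\t.\tO\tO",
])

if __name__ == "__main__":
    print(json.dumps(genia_to_pubannotation(SAMPLE_TEXT, SAMPLE_TAGGED), indent=2))
```

Aligning tokens by searching forward from a moving cursor keeps the spans tied to the original text rather than to the tokenized form, which matters because stand-off annotations are only valid against the exact text they were computed on.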
Keywords
named entity recognition; natural language processing; text mining