Opinion: Strategy of Semi-Automatically Annotating a Full-Text Corpus of Genomics & Informatics

  • Park, Hyun-Seok (Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University)
  • Received: 2018.12.13
  • Accepted: 2018.12.20
  • Published: 2018.12.31

Abstract

There is a communal need for an annotated corpus consisting of the full texts of biomedical journal articles. In response, a prototype of the full-text corpus of Genomics & Informatics, called GNI version 1.0, has recently been published, with 499 annotated full-text articles available as a corpus resource. However, GNI needs to be updated, because the texts were only shallow-parsed and annotated with several existing parsers. I list issues associated with upgrading the annotations and give an opinion on the methodology for developing the next version of the GNI corpus, based on a semi-automatic strategy for more linguistically rich corpus annotation.
