OryzaGP: rice gene and protein dataset for named-entity recognition

Larmande, Pierre; Do, Huy; Wang, Yue

doi:10.5808/GI.2019.17.2.e17

Genomics & Informatics

Volume 17, Issue 2 / Pages 17.1-17.3 / 2019 / 1598-866X (pISSN) / 2234-0742 (eISSN)

Korea Genome Organization

Larmande, Pierre (UMR DIADE, Institute of Research for Sustainable Development (IRD));
Do, Huy (ICT Lab, University of Science and Technology of Hanoi (USTH));
Wang, Yue (Database Center for Life Science (DBCLS))

Received: 2018.12.14
Reviewed: 2019.05.30
Published: 2019.06.30

https://doi.org/10.5808/GI.2019.17.2.e17

Abstract

Text mining has become an important research method in biology; its original purpose was to extract biological entities, such as genes, proteins, and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies on text mining and application development have been performed for plant molecular biology data, especially for rice, resulting in a lack of datasets for named-entity recognition tasks in this species. Because few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information about gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts downloaded from PubMed and extracted from scientific papers focusing on rice. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to use the dataset to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework, to facilitate open comparison and evaluation of different approaches to the task.
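To make the task concrete, the following is a minimal sketch of what a gene/protein annotation over a piece of text could look like, using the text-plus-denotations JSON layout that PubAnnotation works with. It is not the authors' annotation pipeline: the three-name lexicon, the example sentence, the `annotate` helper, and the `gene` label are all assumptions made for illustration, and simple dictionary matching stands in for whichever recognition approaches are actually being compared.

```python
import json
import re

# Hypothetical mini-lexicon of rice gene names, for illustration only;
# it is not taken from the OryzaGP dataset.
GENE_LEXICON = ["OsMADS1", "Hd1", "SD1"]


def annotate(text, lexicon=GENE_LEXICON, label="gene"):
    """Return a dict with the text and character-offset denotations.

    Uses plain dictionary matching as a stand-in for a real
    named-entity recognizer.
    """
    # Try longer names first so they win over shorter embedded names.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    denotations = [
        {"id": f"T{i}", "span": {"begin": m.start(), "end": m.end()}, "obj": label}
        for i, m in enumerate(pattern.finditer(text), start=1)
    ]
    return {"text": text, "denotations": denotations}


# Made-up example sentence (an assumption, not drawn from the corpus).
doc = annotate("Hd1 and SD1 affect heading date and plant height in rice.")
print(json.dumps(doc, indent=2))
```

Each denotation records where an entity mention begins and ends in the text, which is what allows different recognizers to be scored against the same gold-standard spans.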

