Improving classification of low-resource COVID-19 literature by using Named Entity Recognition

Lithgow-Serrano, Oscar;Cornelius, Joseph;Kanjirangat, Vani;Mendez-Cruz, Carlos-Francisco;Rinaldi, Fabio;

doi:10.5808/gi.21018

Genomics & Informatics

Volume 19 Issue 3 / Pages 22.1-22.5 / 2021 / 1598-866X (pISSN) / 2234-0742 (eISSN)

Korea Genome Organization (한국유전체학회)

Lithgow-Serrano, Oscar (Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI) ;
Cornelius, Joseph (Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI) ;
Kanjirangat, Vani (Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI) ;
Mendez-Cruz, Carlos-Francisco (Centro de Ciencias Genomicas, Universidad Nacional Autonoma de Mexico) ;
Rinaldi, Fabio (Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI)

Received : 2021.03.17
Accepted : 2021.08.12
Published : 2021.09.30

https://doi.org/10.5808/gi.21018

Abstract

Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when little labeled data is available for training. Such is the case of the coronavirus disease 2019 (COVID-19) clinical repository, a repository of classified and translated academic articles related to COVID-19 and relevant to clinical practice, where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7), we performed experiments to explore the use of named entity recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the identified named entities (NEs) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER-extracted features. In these proof-of-concept experiments, we observed a clear gain in COVID-19 literature classification. In particular, the origin of the NEs was useful for classifying document types, and the type of the NEs for classifying clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggest that NER would benefit this classification task. To estimate this benefit accurately, further experiments with a larger dataset would be needed.
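
The idea of adding NER output as extra classifier input can be illustrated with a minimal Python sketch. It is not the authors' implementation: the entity dictionaries with `type` and `origin` keys, the function names, and the example document are all hypothetical, and OGER's real output format is not reproduced here. The sketch only shows how baseline text features might be extended with counts of entity types and of the source databases the entities link to.

```python
from collections import Counter


def text_features(text):
    """Baseline bag-of-words counts, keyed with a 'tok=' prefix."""
    return Counter(f"tok={word}" for word in text.lower().split())


def entity_features(entities):
    """Count entity types and origins (source databases).

    Each entity is a hypothetical dict with 'type' and 'origin' keys;
    this is an assumed shape, not the actual OGER output format.
    """
    feats = Counter()
    for ent in entities:
        feats[f"ne_type={ent['type']}"] += 1
        feats[f"ne_origin={ent['origin']}"] += 1
    return feats


def combine(text, entities):
    """Baseline text features extended with the NER-derived features."""
    return text_features(text) + entity_features(entities)


# Made-up example document and annotations, for illustration only.
doc = "Remdesivir for hospitalized COVID-19 patients"
ents = [
    {"type": "chemical", "origin": "ChEBI"},
    {"type": "disease", "origin": "MeSH"},
]
features = combine(doc, ents)
```

A feature dictionary of this kind could then be vectorized and passed to any standard classifier; comparing a model trained on `text_features` alone against one trained on `combine` mirrors the baseline comparison described above.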

Acknowledgement

We are grateful to the organizer of the Biomedical Linked Annotation Hackathon 2021 for the opportunity to work collaboratively on this project and share it with the other participants.

