http://dx.doi.org/10.5808/gi.21011

A biomedically oriented automatically annotated Twitter COVID-19 dataset  

Hernandez, Luis Alberto Robles (Department of Computer Science, Georgia State University)
Callahan, Tiffany J. (Computational Bioscience Program, University of Colorado Anschutz Medical Campus)
Banda, Juan M. (Department of Computer Science, Georgia State University)
Abstract
The use of social media data, such as Twitter data, for biomedical research has gradually increased over the years. During the coronavirus disease 2019 (COVID-19) pandemic, researchers have turned to non-traditional sources of clinical data to characterize the disease in near-real time and to study both the societal implications of interventions and the sequelae that recovered COVID-19 cases present. However, manually curated social media datasets are difficult to come by because manual annotation is expensive and identifying the correct texts takes considerable effort. When datasets are available, they are usually very small, and their annotations do not generalize well over time or to larger sets of documents. As part of the 2021 Biomedical Linked Annotation Hackathon, we release our dataset of over 120 million automatically annotated tweets for biomedical research purposes. Incorporating best practices, we identify tweets with potentially high clinical relevance. We evaluated our work by comparing several spaCy-based annotation frameworks against a manually annotated gold-standard dataset. After selecting the best-performing method for automatic annotation, we annotated 120 million tweets and released them publicly for downstream use within the biomedical domain.
Keywords
biomedical annotations; COVID-19; datasets; social media data