http://dx.doi.org/10.9718/JBER.2022.43.2.109

Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification  

Yoo, Sung Lim (Department of Medical Device Management and Research, SAIHST, Sungkyunkwan University)
Publication Information
Journal of Biomedical Engineering Research / v.43, no.2, 2022, pp. 109-115
Abstract
Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to identify suitable vectorization techniques for electronic medical records classification. Materials and methods: 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three different vectorization techniques (TF-IDF, latent semantic analysis [LSA], and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three vectorization techniques. Results: The 403 medical documents were relevant to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation = 3.46). The classification precisions of the three vectorization techniques were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
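The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy corpus, the labels, the helper names (`nearest_neighbor_labels`, `precision`), the choice of three LSA components, and the nearest-neighbor reading of "classified using cosine similarity" are all assumptions made for illustration. Word2Vec is left out because it requires a library beyond Scikit-learn.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the study's 403 medical documents are not reproduced here.
docs = [
    "chest pain and shortness of breath on exertion",
    "shortness of breath with chest tightness",
    "wheezing cough and asthma attack at night",
    "asthma with wheezing and chronic cough",
    "elevated blood glucose and frequent urination",
    "frequent urination thirst and high blood glucose",
]
labels = ["cardiac", "cardiac", "asthma", "asthma", "diabetes", "diabetes"]


def nearest_neighbor_labels(vectors, labels):
    """Give each document the label of its most cosine-similar other document."""
    sim = cosine_similarity(vectors)   # dense (n_docs, n_docs) matrix
    np.fill_diagonal(sim, -np.inf)     # a document must not match itself
    return [labels[i] for i in sim.argmax(axis=1)]


def precision(predicted, labels):
    """Fraction of documents whose predicted label equals the true label
    (an assumed stand-in for the abstract's 'classification precision')."""
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)


# TF-IDF: sparse term-weight vectors, one row per document.
tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: truncated SVD of the TF-IDF matrix, keeping a few latent dimensions.
lsa = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

tfidf_pred = nearest_neighbor_labels(tfidf, labels)
lsa_pred = nearest_neighbor_labels(lsa, labels)

tfidf_precision = precision(tfidf_pred, labels)
lsa_precision = precision(lsa_pred, labels)
print("TF-IDF precision:", tfidf_precision)
print("LSA precision:", lsa_precision)
```

On a corpus this small both representations separate the groups easily, so the sketch shows only the mechanics of the pipeline and does not reproduce the reported precisions.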
Keywords
Natural language processing; Medical records classification; Vectorization techniques; Machine learning; Latent semantic analysis;
Citations & Related Records
Times Cited By KSCI: 4