Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification

  • Yoo, Sung Lim (Department of Medical Device Management and Research, SAIHST, Sungkyunkwan University)
  • Received: 2022.02.22
  • Reviewed: 2022.04.15
  • Published: 2022.04.30

Abstract

Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to identify suitable vectorization techniques for electronic medical records classification. Materials and methods: A total of 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three vectorization techniques (TF-IDF, latent semantic analysis (LSA), and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three vectorization techniques. Results: The 403 medical documents were relevant to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation = 3.46). The classification precisions were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
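
To make the classification step in the abstract more concrete, the following is a minimal, dependency-free Python sketch of TF-IDF vectorization followed by nearest-neighbor classification on cosine similarity. It is an illustration only, not the study's code: the study reports using Scikit-learn, whereas this sketch uses the standard library alone. The toy documents, the labels, the whitespace tokenizer, the smoothed IDF variant, and all function names (`fit_idf`, `tfidf_vector`, `classify`) are assumptions introduced here.

```python
import math
from collections import Counter

# Toy, invented documents and labels (NOT data from the study).
train_docs = [
    ("chest pain radiating to left arm with shortness of breath", "cardiac"),
    ("elevated troponin and chest pain on exertion", "cardiac"),
    ("wheezing cough and shortness of breath at night", "asthma"),
    ("recurrent wheezing relieved by inhaler", "asthma"),
]


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for each term."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector (dict); unseen terms are dropped."""
    term_freq = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in term_freq.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, train_vectors, idf):
    """Assign the label of the most similar training document."""
    query = tfidf_vector(tokenize(text), idf)
    best_label, best_score = None, -1.0
    for vector, label in train_vectors:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


token_lists = [tokenize(doc) for doc, _ in train_docs]
idf = fit_idf(token_lists)
train_vectors = [
    (tfidf_vector(tokens, idf), label)
    for tokens, (_, label) in zip(token_lists, train_docs)
]

label, score = classify("patient reports wheezing and night cough", train_vectors, idf)
print(label, round(score, 3))
```

In this sketch the query shares the most weighted terms with the third toy document, so it is labeled `asthma`. An LSA variant would additionally project the TF-IDF vectors onto a small number of latent dimensions before computing cosine similarity, which is the "removing irrelevant information" step the abstract refers to.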
