[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.4332/KJHPA.2020.30.1.15

A Study on the Application of Natural Language Processing in Health Care Big Data: Focusing on Word Embedding Methods

Kim, Hansang (Review and Assessment Research Department, Health Insurance Review & Assessment Service)
Chung, Yeojin (Department of Data Science, Kookmin University)

Publication Information

Health Policy and Management / v.30, no.1, 2020, pp. 15-25

Abstract

Healthcare data sets contain extensive information about patients, but their intrinsic characteristics, such as heterogeneity, longitudinal irregularity, and noise, make them difficult for many researchers to analyze. In particular, because most medical history information is recorded as text codes, its use has been limited by the high dimensionality of the explanatory variables. To address this problem, recent studies have applied word embedding techniques, originally developed for natural language processing, and have reported positive results in terms of dimensionality reduction and prediction accuracy. This paper reviews deep learning-based natural language processing techniques (word embedding) and summarizes research cases that have used them in the health care field. Finally, we propose a research framework for applying deep learning-based natural language processing to the analysis of domestic health insurance data.
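
The dimensionality argument in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the diagnosis codes, the `histories` data, the helper names, and `EMBED_DIM` are all hypothetical, and the embedding table is filled with random placeholder values rather than vectors trained by a model such as Word2vec. The sketch only contrasts the length of a sparse code-indicator vector (one dimension per distinct code) with the length of a dense averaged embedding, and shows the (target, context) pairs a skip-gram style model would typically be trained on.

```python
import random

# Hypothetical toy data: each patient history is a sequence of diagnosis codes.
histories = [
    ["I10", "E11", "I50"],
    ["E11", "N18", "I10"],
    ["J45", "J44"],
]

# Vocabulary of distinct codes; real claims data would contain thousands.
vocab = sorted({code for history in histories for code in history})
index = {code: i for i, code in enumerate(vocab)}


def multi_hot(history):
    """Sparse representation: one dimension per distinct code in the vocabulary."""
    vec = [0] * len(vocab)
    for code in history:
        vec[index[code]] = 1
    return vec


def skipgram_pairs(history, window=1):
    """(target, context) pairs that a skip-gram style model would train on."""
    pairs = []
    for i, target in enumerate(history):
        lo, hi = max(0, i - window), min(len(history), i + window + 1)
        pairs.extend((target, history[j]) for j in range(lo, hi) if j != i)
    return pairs


# Placeholder embedding table: random values, NOT trained vectors (assumption).
EMBED_DIM = 2
rng = random.Random(0)
embedding = {code: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for code in vocab}


def patient_vector(history):
    """Dense representation: element-wise average of the code vectors."""
    return [
        sum(embedding[code][k] for code in history) / len(history)
        for k in range(EMBED_DIM)
    ]


print(len(multi_hot(histories[0])))       # grows with the vocabulary size
print(len(patient_vector(histories[0])))  # fixed at EMBED_DIM
print(skipgram_pairs(histories[0]))
```

With a realistic code vocabulary the sparse vector grows with the number of distinct codes, while the dense vector stays at the chosen embedding size, which is the dimensionality reduction the abstract refers to.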

Keywords

Health care big data; High dimensionality; Deep learning; Natural language processing; Word embedding; Word2vec

