
A Study on Applying Novel Reverse N-Gram for Construction of Natural Language Processing Dictionary for Healthcare Big Data Analysis

A Study on Applying a New Reverse N-Gram to Named-Entity Dictionary Construction for Big Data Analysis in the Healthcare Field

  • KyungHyun Lee (Department of IT Semiconductor Convergence Engineering, Tech University of Korea) ;
  • RackJune Baek (Catholic Kwandong University) ;
  • WooSu Kim (Graduate School of Convergence Technology and Energy, Tech University of Korea)
  • Received : 2024.03.12
  • Accepted : 2024.04.30
  • Published : 2024.05.31

Abstract

This study proposes a novel reverse N-Gram approach that addresses the limitations of conventional N-Gram methods when building an entity dictionary specialized for the healthcare sector. The proposed technique is intended to analyze and process the complex linguistic features of healthcare-related big data more precisely. To verify the method, news on healthcare and digital health published around the Consumer Electronics Show (CES), held each January, was collected. A total of 2,185 news titles and summaries dated January 1 to 31, 2010 and January 1 to 31, 2024 were preprocessed with a Python implementation of the reverse N-Gram method. The results indicate that a dictionary for natural language processing in the healthcare field was constructed stably.

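To build a named-entity dictionary specialized for the healthcare field, this study proposes a new reverse N-Gram method that overcomes the limitations of the existing N-Gram method and improves performance. The proposed reverse N-Gram method can analyze and process the complex linguistic characteristics of healthcare-related big data more precisely. To verify its efficiency, news items were used as the source for big data on healthcare and digital healthcare announced during the Consumer Electronics Show (CES), held every January: 2,185 news titles and summaries mentioned from January 1 to 31, 2010 and from January 1 to 31, 2024 were preprocessed with the new reverse N-Gram method implemented in the Python programming language. As a result, it was confirmed that a dictionary for natural language processing in the healthcare field was constructed stably.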
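The abstract does not spell out how the reverse N-Gram is computed, so the following is only a minimal sketch under a stated assumption: that "reverse" means growing n-grams backward from the last token of each title, rather than sliding a window left to right. The function names, the thresholds `max_n` and `min_count`, and the sample titles are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def forward_ngrams(tokens, n):
    """Conventional n-grams: fixed-size windows read left to right."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def reverse_ngrams(tokens, max_n):
    """Assumed 'reverse' variant: suffixes grown backward from the last token.

    This is an illustrative interpretation, not the authors' definition.
    """
    longest = min(max_n, len(tokens))
    return [tuple(tokens[-n:]) for n in range(1, longest + 1)]


def build_dictionary(titles, max_n=3, min_count=2):
    """Count reverse n-grams over all titles and keep the frequent ones."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()  # naive whitespace tokenization
        counts.update(reverse_ngrams(tokens, max_n))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}


# Hypothetical news titles, invented for this example only.
titles = [
    "CES 2024 spotlights digital health care",
    "Startups unveil wearable digital health care",
    "Samsung expands digital health care",
]
dictionary = build_dictionary(titles)
print(dictionary)
```

Under this reading, multi-word terms that recur at the end of titles (here "digital health care") surface as dictionary candidates together with their shorter suffixes. A real pipeline would still need a proper tokenizer or morphological analyzer for Korean text, which this sketch leaves out.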

Acknowledgement

This research was supported by the 2024 Industrial Innovation Infrastructure Building Program of the Ministry of Trade, Industry and Energy (P0025775).
