Optimized Feature Selection using Feature Subset IG-MLP Evaluation based Machine Learning Model for Disease Prediction

  • Received : 2019.09.06
  • Accepted : 2019.12.20
  • Published : 2020.03.31

Abstract

Cardio-cerebrovascular diseases (CCD) account for 24% of deaths among Koreans, the largest share of any cause except cancer. Currently, cardiovascular risk for domestic patients is estimated with the Framingham risk score (FRS), but its accuracy tends to be lower for Koreans because it is a foreign guideline, and it cannot score the risk of cerebrovascular disease. CCD is hard to predict because the features of its early symptoms are difficult to analyze for prevention. A prediction method suited to Koreans is therefore needed. The purpose of this paper is to validate, through simulation on CCD data, a feature selection method based on IG-MLP (Information Gain - Multilayer Perceptron) evaluation of feature subsets. The proposed method uses raw data from the 4th through 7th Korea National Health and Nutrition Examination Survey (KNHANES). To select the important features of CCD, the attributes are analyzed using IG-MLP, and finally an ANN model for CCD prediction built on the optimized feature set is provided. The proposed method can find the features that are important for predicting CCD in Koreans, and the resulting ANN model can predict CCD for Koreans more accurately.
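
The abstract does not spell out the IG-MLP procedure, so the following is only a minimal sketch of the general idea under stated assumptions: features are ranked by information gain with respect to the class label, and candidate subsets taken from that ranking are then scored by an evaluator. All function names, the toy columns, and the `evaluate` callback (which would stand in for training and validating an MLP) are hypothetical and are not taken from the paper; the toy data is made up and is not KNHANES data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_by_information_gain(columns, labels):
    """Return feature names sorted from highest to lowest information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


def best_prefix(ranked, evaluate):
    """Score each top-k prefix of the ranking and keep the best-scoring one.

    `evaluate` is a caller-supplied function mapping a list of feature names
    to a score (assumption: e.g. validation accuracy of an MLP trained on them).
    """
    candidates = [ranked[:k] for k in range(1, len(ranked) + 1)]
    return max(candidates, key=evaluate)


# Made-up toy data: four records, binary class label (1 = disease present).
labels = [1, 1, 0, 0]
columns = {
    "smoker": [1, 0, 1, 0],        # independent of the label here
    "age_group": [2, 1, 1, 0],     # partially informative
    "hypertension": [1, 1, 0, 0],  # matches the label exactly
}

ranked = rank_by_information_gain(columns, labels)
print(ranked)
```

In a real pipeline the `evaluate` argument of `best_prefix` would train a classifier on the chosen columns and return a held-out score; here it is left as a parameter so the sketch stays independent of any particular library.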
