A Prediction Model for the Development of Cataract Using Random Forests

A Cataract Prediction Model Using the Random Forests Technique: Health Screening Data from a Single University Hospital

  • Han, Eun-Jeong (Research Center, National Health Insurance Corporation) ;
  • Song, Ki-Jun (Department of Biostatistics, Yonsei University) ;
  • Kim, Dong-Geon (Department of Statistics and Information Science, Dongduk Women's University)
  • 한은정 (국민건강보험공단 건강보험정책연구원) ;
  • 송기준 (연세대학교 의학전산통계학과) ;
  • 김동건 (동덕여자대학교 정보통계학과)
  • Published : 2009.08.31

Abstract

Cataract is the main cause of blindness and visual impairment; age-related cataract in particular accounts for about half of the 32 million cases of blindness worldwide. As life expectancy rises and the elderly population grows, the number of cataract cases is increasing as well, creating a serious economic and social problem nationwide. However, the incidence of cataract can be reduced dramatically through early diagnosis and prevention. In this study, we developed a prediction model for the early diagnosis of cataract using hospital data on 3,237 subjects who first received a screening test and later visited the medical center for a cataract check-up between 1994 and 2005. To develop the prediction model, we used random forests and compared its predictive performance with that of other common classification models: logistic regression, discriminant analysis, decision tree, naive Bayes, and two popular ensemble models, bagging and arcing. The random forests model achieved an accuracy of 67.16% and a sensitivity of 72.28%, and the main factors in the model included age, diabetes, WBC, platelet, triglyceride, and BMI. The results show that the model could predict the presence of cataract about 70% of the time from screening test data alone, without any information from a direct eye examination by an ophthalmologist. We expect that our model may contribute to the diagnosis of cataract and help prevent it at an early stage.
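The two performance figures quoted in the abstract, accuracy and sensitivity, follow from the counts of correct and incorrect predictions. The sketch below shows how each is computed for a binary classifier. The labels are hypothetical values chosen for illustration, not data from the study, and the helper names (`ConfusionCounts`, `confusion_counts`) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # cataract present, predicted present
    fn: int  # cataract present, predicted absent
    fp: int  # cataract absent, predicted present
    tn: int  # cataract absent, predicted absent


def confusion_counts(y_true, y_pred):
    """Tally the four outcomes for binary labels (1 = cataract, 0 = none)."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fn, fp, tn)


def accuracy(c):
    """Share of all subjects classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


def sensitivity(c):
    """Share of true cataract cases that the model flags."""
    return c.tp / (c.tp + c.fn)


# Hypothetical labels for eight subjects; not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(counts):.2%}, sensitivity = {sensitivity(counts):.2%}")
```

For a screening tool, sensitivity is often the more relevant figure, since a missed case (a false negative) means a subject who is not referred for an eye examination.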

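Cataract is emerging as a serious social and economic problem as the elderly population grows, and its incidence can be greatly reduced if it is diagnosed early. To build a prediction model for the early diagnosis of cataract, this study developed a cataract risk prediction model using health screening data on 3,237 men and women aged 30 or older who received two or more health screenings at Yonsei University Hospital between 1994 and 2001 and whose cataract status could be confirmed through a physician's diagnosis. The model was developed with Random Forests, a data mining technique, and its performance was compared with that of logistic regression, discriminant analysis, decision tree, naive Bayes, and the ensemble models bagging and arcing. The cataract prediction model developed with Random Forests had an accuracy of 67.16% and a sensitivity of 72.28%, and the main influencing factors were age, blood glucose, white blood cell count (WBC), platelet count, triglyceride, and BMI. These results show that about 70% of the information on the presence or absence of cataract can be predicted from health screening data alone, without information from an ophthalmologist's eye examination, and the model is expected to contribute substantially to the early diagnosis of cataract.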
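The core idea behind Random Forests is to fit many trees, each on a bootstrap resample of the data and a random subset of the predictors, and then combine them by majority vote. The following is a minimal sketch of that idea, not the authors' implementation. To stay short it replaces full decision trees with single-split stumps, and it runs on a tiny synthetic data set (age, fasting glucose, BMI) invented for illustration. All function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(rows, labels, features):
    """Find the best single-threshold split among the candidate features."""
    best = None  # (weighted impurity, feature, threshold, left label, right label)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split: fall back to a constant prediction
        m = majority(labels)
        return (None, None, m, m)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees, n_features, seed=0):
    """Bootstrap the rows and draw a random feature subset for each stump."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return majority([predict_stump(s, row) for s in forest])


# Synthetic rows (age, fasting glucose in mg/dL, BMI); not data from the study.
rows = [(35, 85, 21.0), (42, 90, 22.5), (38, 88, 20.5), (45, 92, 23.0),
        (63, 130, 26.5), (70, 145, 28.0), (66, 138, 27.2), (74, 150, 29.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = cataract, 0 = none

forest = fit_forest(rows, labels, n_trees=25, n_features=2)
low_risk = predict_forest(forest, (30, 80, 19.0))
high_risk = predict_forest(forest, (80, 160, 31.0))
print(low_risk, high_risk)
```

The toy data are deliberately separable so that the vote is easy to follow. Real screening data overlap heavily between classes, which is consistent with the roughly 70% performance reported above.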
