Bayesian logit models with auxiliary mixture sampling for analyzing diabetes diagnosis data

Bayesian logistic regression models using auxiliary mixture sampling: application to diabetes data and comparison of classification performance

  • Rhee, Eun Hee (Department of Applied Statistics, Chung-Ang University) ;
  • Hwang, Beom Seuk (Department of Applied Statistics, Chung-Ang University)
  • 이은희 (중앙대학교 응용통계학과) ;
  • 황범석 (중앙대학교 응용통계학과)
  • Received : 2021.11.01
  • Accepted : 2021.12.08
  • Published : 2022.02.28

Abstract

Logit models are commonly used to predict and classify categorical response variables. Most Bayesian approaches to logit models are based on the Metropolis-Hastings algorithm. However, this algorithm can converge slowly, and it is difficult to ensure that the proposal distribution is adequate. We therefore use the auxiliary mixture sampler proposed by Frühwirth-Schnatter and Frühwirth (2007) to estimate logit models. This method introduces two sequences of auxiliary latent variables so that the logit model satisfies linearity and normality. As a result, the logit model can be estimated easily by Gibbs sampling. We applied the method to diabetes data from the 2020 Community Health Survey of the Korea Disease Control and Prevention Agency and compared its performance with that of the Metropolis-Hastings algorithm. In addition, we showed that the logit model with auxiliary mixture sampling achieves classification performance comparable to that of machine learning models.
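
The two-step augmentation mentioned in the abstract can be sketched as follows. This is an illustrative outline based on the general random-utility form of the logit model, not a reproduction of the paper's derivation. The symbols u_i, u_{0i}, r_i, w_r, m_r, s_r^2, b_0, and B_0 are notation assumed here, and the number of mixture components R and the mixture constants are left unspecified because they are not given in this text.

```latex
% Step 1 (first auxiliary sequence): latent utilities with
% type I extreme value (EV) errors give a linear model.
\begin{align*}
u_i &= x_i^\top \beta + \varepsilon_i, \qquad
u_{0i} = \varepsilon_{0i}, \qquad
y_i = \mathbb{1}\{u_i > u_{0i}\}, \\
p(\varepsilon) &= \exp\!\left(-\varepsilon - e^{-\varepsilon}\right)
\quad\Longrightarrow\quad
\Pr(y_i = 1 \mid \beta) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}.
\end{align*}

% Step 2 (second auxiliary sequence): approximate the EV density by a
% finite normal mixture and introduce a component indicator r_i.
\begin{align*}
p(\varepsilon) &\approx \sum_{r=1}^{R} w_r \, \mathcal{N}\!\left(\varepsilon;\, m_r,\, s_r^2\right), \\
u_i \mid r_i, \beta &\sim \mathcal{N}\!\left(x_i^\top \beta + m_{r_i},\; s_{r_i}^2\right).
\end{align*}

% Conditional on (u_i, r_i), a normal prior beta ~ N(b_0, B_0) is conjugate,
% so beta can be drawn in a Gibbs step from N(b_n, B_n):
\begin{align*}
B_n &= \Bigl(B_0^{-1} + \sum_{i} x_i x_i^\top / s_{r_i}^2\Bigr)^{-1}, \\
b_n &= B_n \Bigl(B_0^{-1} b_0 + \sum_{i} x_i \,(u_i - m_{r_i}) / s_{r_i}^2\Bigr).
\end{align*}
```

Under these assumptions, each sweep would alternate between drawing the utilities, drawing the component indicators, and drawing the coefficients from the normal conditional above, with no proposal distribution to tune.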

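Logistic regression models are widely used in many fields to predict or classify categorical dependent variables. The Metropolis-Hastings algorithm has been the traditional Bayesian inference technique for logistic regression models, but it converges slowly and it is difficult to guarantee that the proposal distribution is adequate. In this paper we therefore use the auxiliary mixture sampling technique proposed by Frühwirth-Schnatter and Frühwirth (2007) for Bayesian inference. This method introduces latent variables in two stages so that the model satisfies linearity and normality, which in turn makes inference by Gibbs sampling possible. To verify the effectiveness of the proposed model, we applied it to diabetes data from the 2020 Community Health Survey and compared the inference results with those of a model fitted with Metropolis-Hastings. In addition, a comparison of classification performance between the proposed model and various classification models confirmed that the proposed model also performs well in classification.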
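
For context on the baseline that the abstract compares against, here is a minimal sketch of a random-walk Metropolis-Hastings sampler for a Bayesian logit model. It is not the authors' implementation. The function names, the N(0, 100) prior, the step size, and the synthetic data are all assumptions made for illustration, and the data are simulated, not taken from the Community Health Survey.

```python
import math
import random


def log_posterior(beta, X, y, prior_var=100.0):
    """Log posterior (up to a constant) of a logit model.

    Assumes independent N(0, prior_var) priors on each coefficient.
    """
    log_lik = 0.0
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # log(1 + exp(eta)) computed in a numerically stable way
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        log_lik += yi * eta - softplus
    log_prior = -0.5 * sum(b * b for b in beta) / prior_var
    return log_lik + log_prior


def random_walk_mh(X, y, n_iter=3000, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    p = len(X[0])
    beta = [0.0] * p
    current = log_posterior(beta, X, y)
    draws, accepted = [], 0
    for _ in range(n_iter):
        proposal = [b + step * rng.gauss(0.0, 1.0) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        # 1 - random() lies in (0, 1], which keeps log() well defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
            accepted += 1
        draws.append(list(beta))
    return draws, accepted / n_iter


# Synthetic data (hypothetical): intercept plus one standard normal covariate.
data_rng = random.Random(1)
true_beta = [-0.5, 1.0]
X = [[1.0, data_rng.gauss(0.0, 1.0)] for _ in range(300)]
y = [
    1.0 if data_rng.random() < 1.0 / (1.0 + math.exp(-(true_beta[0] + true_beta[1] * row[1]))) else 0.0
    for row in X
]

draws, acc_rate = random_walk_mh(X, y)
kept = draws[500:]  # discard the first 500 draws as burn-in
posterior_mean = [sum(d[j] for d in kept) / len(kept) for j in range(2)]
print("posterior mean:", posterior_mean, "acceptance rate:", acc_rate)
```

The `step` argument is the tuning parameter that the abstract's concern about proposal adequacy refers to: too small and the chain moves slowly, too large and most proposals are rejected.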

Acknowledgement

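This paper was written with support from the Chung-Ang University CAU GRS program in 2020, and is part of a Basic Research Program supported by the National Research Foundation of Korea with funding from the Korean government (Ministry of Science and ICT) in 2019 (NRF-2019R1C1C1011710).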

References

  1. Albert JH and Chib S (1993). Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts, Journal of Business and Economic Statistics, 11, 1-15. https://doi.org/10.2307/1391303
  2. Chen MH, Dey DK, and Shao QM (1999). A new skewed link model for dichotomous quantal response data, Journal of the American Statistical Association, 94, 1172-1186. https://doi.org/10.1080/01621459.1999.10473872
  3. Chib S and Greenberg E (1995). Understanding the Metropolis-Hastings algorithm, The American Statistician, 49, 327-335. https://doi.org/10.2307/2684568
  4. Chib S, Greenberg E, and Winkelmann R (1998). Posterior simulation and Bayes factors in panel count data models, Journal of Econometrics, 86, 33-54. https://doi.org/10.1016/S0304-4076(97)00108-5
  5. Chib S, Nardari F, and Shephard N (2002). Markov chain Monte Carlo methods for stochastic volatility models, Journal of Econometrics, 108, 281-316. https://doi.org/10.1016/S0304-4076(01)00137-3
  6. Frühwirth-Schnatter S and Frühwirth R (2007). Auxiliary mixture sampling with applications to logistic models, Computational Statistics and Data Analysis, 51(7), 3509-3528. https://doi.org/10.1016/j.csda.2006.10.006
  7. Frühwirth-Schnatter S, Frühwirth R, Held L, and Rue H (2009). Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data, Statistics and Computing, 19, 479-492. https://doi.org/10.1007/s11222-008-9109-4
  8. Gamerman D (1997). Sampling from the posterior distribution in generalized linear mixed models, Statistics and Computing, 7(1), 57-68. https://doi.org/10.1023/A:1018509429360
  9. Gelman A, Gilks WR, and Roberts GO (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability, 7, 110-120. https://doi.org/10.1214/aoap/1034625254
  10. Geweke J and Keane M (1999). Mixture of normals probit models, In Hsiao C, Pesaran MH, Lahiri KL, and Lee LF (Eds.), Analysis of Panels and Limited Dependent Variable Models (pp. 49-78), Cambridge University Press, Cambridge.
  11. Held L and Holmes CC (2006). Bayesian auxiliary variable models for binary and multinomial regression, Bayesian Analysis, 1, 145-168. https://doi.org/10.1214/06-BA105
  12. International Diabetes Federation (2019). IDF Diabetes Atlas (9th ed.), retrieved from: https://www.diabetesatlas.org
  13. Kim SB and Hwang BS (2019). A Bayesian skewed logit model for high-risk drinking data, The Korean Data and Information Science Society, 30, 335-348. https://doi.org/10.7465/jkdi.2019.30.2.335
  14. Kim S, Shephard N, and Chib S (1998). Stochastic volatility: likelihood inference and comparison with ARCH models, The Review of Economic Studies, 65(3), 361-393. https://doi.org/10.1111/1467-937X.00050
  15. Kim YH and Hwang BS (2020). Joint analysis of binary and continuous data using skewed logit model in developmental toxicity studies, The Korean Journal of Applied Statistics, 33, 123-136. https://doi.org/10.5351/KJAS.2020.33.2.123
  16. Kim YM, Cho DG, and Kang SH (2014). An empirical analysis on geographic variations in the prevalence of diabetes, Health and Social Welfare Review, 34, 82-105. https://doi.org/10.15709/hswr.2014.34.3.82
  17. King G and Zeng L (2001). Logistic regression in rare events data, Political Analysis, 9, 137-163. https://doi.org/10.1093/oxfordjournals.pan.a004868
  18. Lenk PJ and DeSarbo WS (2000). Bayesian inference for finite mixtures of generalized linear models with random effects, Psychometrika, 65, 93-119. https://doi.org/10.1007/BF02294188
  19. McFadden D (1973). Conditional logit analysis of qualitative choice behavior, Frontiers in Econometrics, Academic Press, New York, 105-142.
  20. Nanayakkara N, Curtis AJ, Heritier S, et al. (2020). Impact of age at type 2 diabetes mellitus diagnosis on mortality and vascular complications: systematic review and meta-analyses, Diabetologia, 64(2), 275-287. https://doi.org/10.1007/s00125-020-05319-w
  21. Omori Y, Chib S, Shephard N, and Nakajima J (2007). Stochastic volatility with leverage: Fast and efficient likelihood inference, Journal of Econometrics, 140, 425-449. https://doi.org/10.1016/j.jeconom.2006.07.008
  22. R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  23. Scott SL (2011). Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial logit models, Statistical Papers, 52, 87-109. https://doi.org/10.1007/s00362-009-0205-0
  24. Shephard N (1994). Partial non-Gaussian state space, Biometrika, 81, 115-131. https://doi.org/10.1093/biomet/81.1.115
  25. Song KE, Kim DJ, Park JW, Cho HK, Lee KW, and Huh KB (2007). Clinical characteristics of Korean type 2 diabetic patients according to insulin secretion and insulin resistance, Diabetes and Metabolism Journal, 31, 123-129.
  26. Titterington DM, Smith AFM, and Makov UE (1985). Statistical Analysis of Finite Mixture Distributions (Vol. 198), John Wiley and Sons Incorporated.
  27. Theodoridis S (2015). Machine Learning: A Bayesian and Optimization Perspective, Academic Press.
  28. World Health Organization Regional Office for the Western Pacific (2000). The Asia-Pacific perspective: redefining obesity and its treatment, Sydney: Health Communications Australia, retrieved from: https://apps.who.int/iris/handle/10665/206936
  29. Zellner A and Rossi PE (1984). Bayesian analysis of dichotomous quantal response models, Journal of Econometrics, 25, 365-393. https://doi.org/10.1016/0304-4076(84)90007-1