Effect of zero imputation methods for log-transformation of independent variables in logistic regression

  • Seo Young Park (Department of Statistics and Data Science, Korea National Open University)
  • Received : 2023.12.12
  • Accepted : 2024.02.06
  • Published : 2024.07.31

Abstract

Logistic regression models are commonly used in medical science and public health research to explain a binary health outcome using independent variables such as patient characteristics. Although logistic regression requires no distributional assumption for the independent variables, variables with severely right-skewed distributions, such as lab values, are often log-transformed to achieve symmetry or approximate normality. However, lab values often contain zeros because of limits of detection, which makes the log-transformation impossible to apply directly. Zeros in the observations must therefore be handled in a preprocessing step before the log-transformation. In this study, five methods that remove zeros (shift by 1, shift by half of the smallest nonzero value, shift by the square root of the smallest nonzero value, replace zeros with half of the smallest nonzero value, and replace zeros with the square root of the smallest nonzero value) are investigated in the logistic regression setting. To evaluate the performance of these methods, we performed a simulation study based on data randomly generated from a log-normal distribution and a logistic regression model. The shift-by-1 method performed worst; overall, the shift by half of the smallest nonzero value, replace zeros with half of the smallest nonzero value, and replace zeros with the square root of the smallest nonzero value methods showed comparable and stable performance.
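The five zero-handling methods named in the abstract can be illustrated with a minimal Python sketch. The function name `log_with_zero_handling` and the method labels are hypothetical, chosen here for illustration; the abstract does not specify an implementation, so this is one plausible reading of each method rather than the author's code. It assumes the input is non-negative with at least one positive value.

```python
import numpy as np

# Hypothetical labels for the five methods listed in the abstract.
METHODS = (
    "shift_1",           # log(x + 1)
    "shift_half_min",    # log(x + m/2), m = smallest nonzero value
    "shift_sqrt_min",    # log(x + sqrt(m))
    "replace_half_min",  # zeros -> m/2, other values unchanged
    "replace_sqrt_min",  # zeros -> sqrt(m), other values unchanged
)


def log_with_zero_handling(x, method):
    """Log-transform a non-negative variable after handling its zeros.

    Illustrative sketch only: `m` is taken to be the smallest strictly
    positive observation in `x`.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        raise ValueError("x must be non-negative")
    positive = x[x > 0]
    if positive.size == 0:
        raise ValueError("x must contain at least one positive value")
    m = positive.min()  # smallest nonzero observation

    if method == "shift_1":
        y = x + 1.0
    elif method == "shift_half_min":
        y = x + m / 2.0
    elif method == "shift_sqrt_min":
        y = x + np.sqrt(m)
    elif method == "replace_half_min":
        y = np.where(x == 0, m / 2.0, x)
    elif method == "replace_sqrt_min":
        y = np.where(x == 0, np.sqrt(m), x)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return np.log(y)


if __name__ == "__main__":
    # Made-up lab values with one zero, purely for demonstration.
    lab_values = np.array([0.0, 0.04, 0.5, 3.0])
    for name in METHODS:
        print(name, log_with_zero_handling(lab_values, name))
```

The shift methods add a constant to every observation, whereas the replace methods change only the zeros and leave the positive values untouched before taking the logarithm.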

Keywords

Acknowledgments

This research was supported by the Korea National Open University Research Fund.
