• Title/Abstract/Keywords: Missing variables

Search results: 194 items (processing time: 0.029 s)

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods / Vol. 30, No. 6 / pp. 577-587 / 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often suffer from overestimation. If the variable has too many categories, multinomial logistic regression imputation may be infeasible because of computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify significant variables for the target categorical variable. In the second stage, we use the selected variables as predictors: logistic regression imputes missing data in binary variables, polytomous regression imputes missing data in categorical variables, and predictive mean matching imputes missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. In the analysis of real survey data, we also demonstrate that the proposed two-stage imputation method is more accurate than the current imputation approach.
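
The select-then-impute structure described in this abstract can be sketched in a few lines. This is only an illustration under loud assumptions: a simple accuracy-gain screen stands in for Boruta, and a conditional mode stands in for polytomous regression; the column names, the `min_gain` threshold, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict

def mode(values):
    """Most frequent value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]

def accuracy_gain(rows, predictor, target):
    """Share of complete cases gained by predicting the target with its
    per-level mode instead of its overall mode (stand-in for Boruta)."""
    overall_hits = Counter(r[target] for r in rows).most_common(1)[0][1]
    by_level = defaultdict(Counter)
    for r in rows:
        by_level[r[predictor]][r[target]] += 1
    conditional_hits = sum(c.most_common(1)[0][1] for c in by_level.values())
    return (conditional_hits - overall_hits) / len(rows)

def two_stage_impute(rows, predictors, target, min_gain=0.1):
    complete = [r for r in rows if r[target] is not None]
    # Stage 1: keep only predictors that look informative on complete cases.
    selected = [p for p in predictors
                if accuracy_gain(complete, p, target) >= min_gain]
    # Stage 2: impute with the target's mode among complete cases that share
    # the same values on the selected predictors (stand-in for regression).
    groups = defaultdict(list)
    for r in complete:
        groups[tuple(r[p] for p in selected)].append(r[target])
    fallback = mode([r[target] for r in complete])
    completed = []
    for r in rows:
        r = dict(r)
        if r[target] is None:
            key = tuple(r[p] for p in selected)
            r[target] = mode(groups[key]) if key in groups else fallback
        completed.append(r)
    return selected, completed

# Hypothetical toy data: "region" is informative, "noise" is not.
rows = [
    {"region": "n", "noise": "x", "job": "farm"},
    {"region": "n", "noise": "y", "job": "farm"},
    {"region": "s", "noise": "x", "job": "office"},
    {"region": "s", "noise": "y", "job": "office"},
    {"region": "n", "noise": "x", "job": None},
    {"region": "s", "noise": "y", "job": None},
]
selected, completed = two_stage_impute(rows, ["region", "noise"], "job")
print(selected, [r["job"] for r in completed])
```

The screen used here favors predictors with many levels, which is one reason a permutation-based selector such as Boruta would be preferred in practice.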

An Approach to Survey Data with Nonresponse: Evaluation of KEPEC Data with BMI

  • 백지은;강위창;이영조;박병주
    • Journal of Preventive Medicine and Public Health / Vol. 35, No. 2 / pp. 136-140 / 2002
  • Objectives: A common problem in analyzing survey data is incompleteness caused by nonresponse or missing data. The mail questionnaire survey conducted in 1996 to collect lifestyle variables from members of the Korean Elderly Pharmacoepidemiologic Cohort (KEPEC) contains some nonresponse or missing data. An appropriate statistical method was applied to evaluate the missing-data pattern of a specific KEPEC dataset, which had no missing data in the independent variables and missing data in the response variable, BMI. Methods: The study subjects were 8,689 elderly people. Initially, BMI and the significant variables that influenced BMI were categorized. After fitting the log-linear model, the probability of a subject falling in each category was estimated. The EM algorithm was implemented using a log-linear model to determine the missing mechanism causing the nonresponse. Results: Age, smoking status, and a preference for spicy hot food were chosen as variables that influenced BMI. When the nonignorable and ignorable nonresponse log-linear models were fitted with these variables, the difference in deviance between the two models was 0.0034 (df = 1). Conclusion: There is considerable risk in making inferences about the variables and large samples without considering the pattern of missing data. On the basis of these results, the missing data in BMI are ignorable nonresponse. Therefore, when analyzing BMI in the KEPEC data, inference can be made without considering the missing data.
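
To see why a deviance difference of 0.0034 on 1 degree of freedom supports the ignorable model, the difference can be referred to a chi-square distribution. The sketch below assumes the usual likelihood-ratio reasoning for nested models, which the abstract does not spell out; the function name is illustrative.

```python
import math

def chi2_upper_tail_df1(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    For df = 1 the upper tail has the closed form erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

deviance_difference = 0.0034  # value reported in the abstract
p_value = chi2_upper_tail_df1(deviance_difference)

# 3.841 is the 5% critical value of the chi-square distribution with df = 1.
print(f"p-value = {p_value:.2f}")
print("nonignorable term needed:", deviance_difference > 3.841)
```

The p-value is roughly 0.95, so the extra nonignorable term does not improve the fit at any conventional significance level.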

Reject Inference of Incomplete Data Using a Normal Mixture Model

  • Song, Ju-Won
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 24, No. 2 / pp. 425-433 / 2011
  • Reject inference in credit scoring is a statistical approach to adjust for nonrandom sample bias due to rejected applicants. Function estimation approaches are based on the assumption that rejected applicants need not be included in the estimation when the missing data mechanism is missing at random. On the other hand, the density estimation approach using mixture models indicates that reject inference should include rejected applicants in the model. When mixture models are chosen for reject inference, it is often assumed that the data follow a normal distribution. If the data include missing values, applying the normal mixture model only to fully observed cases may cause another sample bias due to missing values. We extend reject inference by a multivariate normal mixture model to handle incomplete characteristic variables. A simulation study shows that inclusion of incomplete characteristic variables outperforms the function estimation approaches.

Imputation Procedures in Weibull Regression Analysis in the presence of missing values

  • 김순귀;정동빈
    • Korean Statistical Society: Conference Proceedings / Proceedings of the Korean Statistical Society 2001 Autumn Conference / pp. 143-148 / 2001
  • A dataset with missing observations is often completed using imputed values. In this paper, the performance and accuracy of the complete-case method and four imputation procedures are evaluated when missing values occur only in the response variable of the Weibull regression model. Our simulation results show that, compared with the other imputation procedures, the hot-deck and Weibull regression imputation procedures in particular compensate well for missing data. In addition, an illustrative real-data example is given.
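
As a rough illustration of hot-deck imputation for a response with missing values, the sketch below fills each gap with a randomly drawn observed value (a "donor") from the same covariate class. The abstract does not say which hot-deck variant the authors used, so the class-based donor pools, the function name, and the toy data are assumptions.

```python
import random

def hot_deck_impute(values, classes, seed=0):
    """Replace None entries with a donor value drawn from the same class."""
    rng = random.Random(seed)
    donors_by_class = {}
    for value, cls in zip(values, classes):
        if value is not None:
            donors_by_class.setdefault(cls, []).append(value)
    all_donors = [v for v in values if v is not None]

    completed = []
    for value, cls in zip(values, classes):
        if value is None:
            # Fall back to every observed value if the class has no donor.
            pool = donors_by_class.get(cls) or all_donors
            value = rng.choice(pool)
        completed.append(value)
    return completed

# Hypothetical survival times with two missing responses.
times = [12.1, None, 8.4, 15.0, None, 9.7]
group = ["a", "a", "b", "a", "b", "b"]
completed = hot_deck_impute(times, group)
print(completed)
```

Because donors are real observed values, the imputed responses always stay within the observed range, which is one reason hot-deck methods are attractive for skewed responses such as survival times.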

A case study of competing risk analysis in the presence of missing data

  • Limei Zhou;Peter C. Austin;Husam Abdel-Qadir
    • Communications for Statistical Applications and Methods / Vol. 30, No. 1 / pp. 1-19 / 2023
  • Observational data with missing or incomplete values are common in biomedical research. Multiple imputation is an effective approach to handling missing data, with the ability to decrease bias while increasing statistical power and efficiency. In recent years, propensity score (PS) matching has been increasingly used in observational studies to estimate treatment effects, as it can reduce confounding due to measured baseline covariates. In this paper, we describe in detail approaches to competing risk analysis in the setting of incomplete observational data when using PS matching. First, we used multiple imputation to impute several missing variables simultaneously, then conducted PS matching to match statin-exposed patients with unexposed patients. Afterwards, we assessed the effect of statin exposure on the risk of heart failure-related hospitalizations or emergency visits by estimating both relative and absolute effects. Collectively, we provided a general methodological framework to assess treatment effects in incomplete observational data. In addition, we presented a practical approach to producing an overall cumulative incidence function (CIF) based on estimates from multiply imputed and PS-matched samples.
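
Estimates obtained from several imputed datasets have to be combined into one. A standard way to do this is Rubin's rules, sketched below; the abstract does not state which pooling rule the authors applied, and the per-imputation log hazard ratios and variances are hypothetical numbers chosen only for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed and matched samples.
log_hazard_ratios = [-0.21, -0.18, -0.25, -0.20, -0.22]
squared_std_errors = [0.010, 0.011, 0.009, 0.010, 0.012]

pooled, total_variance = pool_rubin(log_hazard_ratios, squared_std_errors)
print(f"pooled log HR = {pooled:.3f}, total variance = {total_variance:.6f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing data; ignoring it would understate the standard error.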

Imputation Method Using Local Linear Regression Based on Bidirectional k-nearest-components

  • Yonggeol, Lee
    • Journal of information and communication convergence engineering / Vol. 21, No. 1 / pp. 62-67 / 2023
  • This paper proposes an imputation method that uses local linear regression based on a bidirectional k-nearest-components search. The bidirectional k-nearest-components search selects components in a dynamic range around the missing points. Unlike existing methods, which use a fixed-size window, the proposed method can flexibly select adjacent components in an imputation problem. The weights assigned to the components around the missing points are calculated using local linear regression. The local linear regression method is free from the rank problem in a matrix of dependent variables. In addition, it can calculate weights that reflect the data flow in a specific environment, such as a blackout. The original missing values are estimated as a linear combination of the components and their weights, and the estimated values then replace the missing values. In the experiments, the proposed method outperformed the existing methods when the error between the original data and the imputed data was measured using MAE and RMSE.
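
One simplified reading of this idea, for a single time series, is sketched below: for each missing point, scan left and right past any other gaps until k observed points are found on each side, fit a least-squares line to them, and evaluate it at the missing index. This is an assumption-laden illustration, not the authors' algorithm; the function names and the toy series are hypothetical.

```python
def nearest_observed(series, i, k):
    """Indices of up to k observed points on each side of i, skipping gaps."""
    left, j = [], i - 1
    while j >= 0 and len(left) < k:
        if series[j] is not None:
            left.append(j)
        j -= 1
    right, j = [], i + 1
    while j < len(series) and len(right) < k:
        if series[j] is not None:
            right.append(j)
        j += 1
    return sorted(left + right)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # a single neighbour: fall back to a flat line
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def impute_series(series, k=2):
    filled = list(series)
    for i, value in enumerate(series):
        if value is None:
            idx = nearest_observed(series, i, k)
            if not idx:
                continue  # nothing observed at all; leave the gap
            slope, intercept = fit_line(idx, [series[j] for j in idx])
            filled[i] = slope * i + intercept
    return filled

# Hypothetical load readings with a two-sample outage.
load = [10.0, 12.0, None, None, 18.0, 20.0]
filled = impute_series(load)
print(filled)
```

Because the search skips over missing neighbours instead of using a fixed window, a long gap still gets k observed points on each side whenever they exist.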

A comparison of imputation methods using machine learning models

  • Heajung Suh;Jongwoo Song
    • Communications for Statistical Applications and Methods / Vol. 30, No. 3 / pp. 331-341 / 2023
  • Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares imputation techniques for datasets with mixed data types in various situations, such as data size, missing ratios, and missing mechanisms. To evaluate the performance of each method in mixed datasets, we propose a new imputation performance measure (IPM) that is a unified measurement applicable to numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results with imputation performances and computational times.
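
Of the conventional methods named in this abstract, K-nearest-neighbor imputation is the simplest to sketch. The version below handles numeric columns only and measures distance over the columns that two rows both have observed; that distance choice, the function name, and the toy rows are assumptions, and mixed data types would need a different distance.

```python
import math

def knn_impute(rows, k=2):
    """Fill None cells with the column mean over the k nearest rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no common columns: treat as infinitely far
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell is not None:
                continue
            # Candidate donors: other rows with column j observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            neighbours = [value for _, value in candidates[:k]]
            if neighbours:
                completed[i][j] = sum(neighbours) / len(neighbours)
    return completed

# Hypothetical numeric data with one missing cell in the second row.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [0.9, 1.9, 2.8],
]
completed = knn_impute(rows, k=2)
print(completed[1])
```

The distant third row is ignored, so the missing cell is filled from the two rows that resemble the incomplete one.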

PhysioCover: Recovering the Missing Values in Physiological Data of Intensive Care Units

  • Kim, Sun-Hee;Yang, Hyung-Jeong;Kim, Soo-Hyung;Lee, Guee-Sang
    • International Journal of Contents / Vol. 10, No. 2 / pp. 47-58 / 2014
  • Physiological signals provide important clues in the diagnosis and prediction of disease. Analyzing these signals is important in health and medicine. In particular, data preprocessing for physiological signal analysis is a vital issue because missing values, noise, and outliers may degrade the analysis performance. In this paper, we propose PhysioCover, a system that can recover missing values of physiological signals monitored in real time. PhysioCover integrates a gradual method and EM-based Principal Component Analysis (PCA). This approach can (1) more readily recover long- and short-term missing data than existing methods, such as traditional EM-based PCA, linear interpolation, 5-average, and Missing Value Singular Value Decomposition (MSVD), (2) more effectively detect hidden variables than PCA and Independent Component Analysis (ICA), and (3) offer fast computation through real-time processing. Experimental results with the physiological data of an intensive care unit show that the proposed method recovers missing values more accurately than previous methods.

Household, personal, and financial determinants of surrender in Korean health insurance

  • Shim, Hyunoo;Min, Jung Yeun;Choi, Yang Ho
    • Communications for Statistical Applications and Methods / Vol. 28, No. 5 / pp. 447-462 / 2021
  • In insurance, the surrender rate is an important variable that threatens the sustainability of insurers and determines the profitability of the contract. Unlike other actuarial assumptions that determine the cash flow of an insurance contract, however, it is characterized by endogenous variables such as people's economic, social, and subjective decisions. Therefore, a microscopic approach is required to identify and analyze the factors that determine the lapse rate. Specifically, micro-level characteristics including the individual, demographic, microeconomic, and household characteristics of policyholders are necessary for the analysis. In this study, we select panel survey data from the Korean Retirement Income Study (KReIS), which cover many diverse dimensions, to determine which variables have a decisive effect on lapse, and we apply the lasso regularized regression model to analyze them empirically. As the data contain many missing values, they are imputed using the random forest method. Among the household variables, we find that the absence of old dependents, the presence of young dependents, and employed family members increase the surrender rate. Among the individual variables, divorce, non-urban residential areas, apartment type of housing, non-ownership of homes, and bad relationships with siblings increase the lapse rate. Finally, among the financial variables, low income, low expenditure, the existence of children who incur child care expenditure, not expecting a bequest from a spouse, not holding public health insurance, and expecting to benefit from a retirement pension increase the lapse rate. Some of these findings are consistent with those in the literature.

Imputation Method using the Space-Time Model in Sample Survey

  • 이진희;신기일
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 20, No. 3 / pp. 499-514 / 2007
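  • When item nonresponse occurs in a sample survey, the usual approach to imputation is to use auxiliary variables related to the variable with missing values. Recently, 이진희 et al. (2006) compared imputation methods based on spatial statistics in a sample survey, using 2002 farm household economy data from the Gangwon region, and showed that imputation exploiting regional correlation is efficient when such correlation exists among the data. As an extension of that idea, this paper confirms that monthly data for the Gangwon region from 2000 to 2002 exhibit both spatial correlation and time-series correlation, and uses this relationship for imputation. A simulation study further confirms that, when both spatial and time-series correlation are present, imputation using a space-time model is more efficient than imputation using a spatial model.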