• Title/Summary/Keyword: Categorical Missing Data

Search Result 16, Processing Time 0.018 seconds

Two-stage imputation method to handle missing data for categorical response variable

  • Jong-Min Kim;Kee-Jae Lee;Seung-Joo Lee
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.6
    • /
    • pp.577-587
    • /
    • 2023
  • Conventional categorical data imputation techniques, such as mode imputation, often encounter issues related to overestimation. If the variable has too many categories, multinomial logistic regression imputation method may be impossible due to computational limitations. To rectify these limitations, we propose a two-stage imputation method. During the first stage, we utilize the Boruta variable selection method on the complete dataset to identify significant variables for the target categorical variable. Then, in the second stage, we use the important variables for the target categorical variable for logistic regression to impute missing data in binary variables, polytomous regression to impute missing data in categorical variables, and predictive mean matching to impute missing data in quantitative variables. Through analysis of both asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods lacking variable selection, as evidenced by accuracy measures. During the analysis of real survey data, we also demonstrate that our suggested two-stage imputation method surpasses the current imputation approach in terms of accuracy.

Comparing Accuracy of Imputation Methods for Incomplete Categorical Data

  • Shin, Hyung-Won;Sohn, So-Young
    • Proceedings of the Korean Statistical Society Conference
    • /
    • 2003.05a
    • /
    • pp.237-242
    • /
    • 2003
  • Various kinds of estimation methods have been developed for imputation of categorical missing data. They include modal category method, logistic regression, and association rule. In this study, we propose two imputation methods (neural network fusion and voting fusion) that combine the results of individual imputation methods. A Monte-Carlo simulation is used to compare the performance of these methods. Five factors used to simulate the missing data are (1) true model for the data, (2) data size, (3) noise size (4) percentage of missing data, and (5) missing pattern. Overall, neural network fusion performed the best while voting fusion is better than the individual imputation methods, although it was inferior to the neural network fusion. Result of an additional real data analysis confirms the simulation result.

  • PDF

Comparing Accuracy of Imputation Methods for Categorical Incomplete Data (범주형 자료의 결측치 추정방법 성능 비교)

  • 신형원;손소영
    • The Korean Journal of Applied Statistics
    • /
    • v.15 no.1
    • /
    • pp.33-43
    • /
    • 2002
  • Various kinds of estimation methods have been developed for imputation of categorical missing data. They include category method, logistic regression, and association rule. In this study, we propose two fusions algorithms based on both neural network and voting scheme that combine the results of individual imputation methods. A Mont-Carlo simulation is used to compare the performance of these methods. Five factors used to simulate the missing data pattern are (1) input-output function, (2) data size, (3) noise of input-output function (4) proportion of missing data, and (5) pattern of missing data. Experimental study results indicate the following: when the data size is small and missing data proportion is large, modal category method, association rule, and neural network based fusion have better performances than the other methods. However, when the data size is small and correlation between input and missing output is strong, logistic regression and neural network barred fusion algorithm appear better than the others. When data size is large with low missing data proportion, a large noise, and strong correlation between input and missing output, neural networks based fusion algorithm turns out to be the best choice.

Bayesian Analysis for Categorical Data with Missing Traits Under a Multivariate Threshold Animal Model (다형질 Threshold 개체모형에서 Missing 기록을 포함한 이산형 자료에 대한 Bayesian 분석)

  • Lee, Deuk-Hwan
    • Journal of Animal Science and Technology
    • /
    • v.44 no.2
    • /
    • pp.151-164
    • /
    • 2002
  • Genetic variance and covariance components of the linear traits and the ordered categorical traits, that are usually observed as dichotomous or polychotomous outcomes, were simultaneously estimated in a multivariate threshold animal model with concepts of arbitrary underlying liability scales with Bayesian inference via Gibbs sampling algorithms. A multivariate threshold animal model in this study can be allowed in any combination of missing traits with assuming correlation among the traits considered. Gibbs sampling algorithms as a hierarchical Bayesian inference were used to get reliable point estimates to which marginal posterior means of parameters were assumed. Main point of this study is that the underlying values for the observations on the categorical traits sampled at previous round of iteration and the observations on the continuous traits can be considered to sample the underlying values for categorical data and continuous data with missing at current cycle (see appendix). This study also showed that the underlying variables for missing categorical data should be generated with taking into account for the correlated traits to satisfy the fully conditional posterior distributions of parameters although some of papers (Wang et al., 1997; VanTassell et al., 1998) presented that only the residual effects of missing traits were generated in same situation. In present study, Gibbs samplers for making the fully Bayesian inferences for unknown parameters of interests are played rolls with methodologies to enable the any combinations of the linear and categorical traits with missing observations. Moreover, two kinds of constraints to guarantee identifiability for the arbitrary underlying variables are shown with keeping the fully conditional posterior distributions of those parameters. Numerical example for a threshold animal model included the maternal and permanent environmental effects on a multiple ordered categorical trait as calving ease, a binary trait as non-return rate, and the other normally distributed trait, birth weight, is provided with simulation study.

A Bayesian model for two-way contingency tables with nonignorable nonresponse from small areas

  • Woo, Namkyo;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.27 no.1
    • /
    • pp.245-254
    • /
    • 2016
  • Many surveys provide categorical data and there may be one or more missing categories. We describe a nonignorable nonresponse model for the analysis of two-way contingency tables from small areas. There are both item and unit nonresponse. One approach to analyze these data is to construct several tables corresponding to missing categories. We describe a hierarchical Bayesian model to analyze two-way categorical data from different areas. This allows a "borrowing of strength" of the data from larger areas to improve the reliability in the estimates of the model parameters corresponding to the small areas. Also we use a nonignorable nonresponse model with Bayesian uncertainty analysis by placing priors in nonidentifiable parameters instead of a sensitivity analysis for nonidentifiable parameters. We use the griddy Gibbs sampler to fit our models and compute DIC and BPP for model diagnostics. We illustrate our method using data from NHANES III data on thirteen states to obtain the finite population proportions.

Simultaneous Approach to Fuzzy Clustering and Quantification of Categorical Data with Missing Values

  • Honda, Katsuhiro;Nakamura, Yoshihito;Ichihashi, Hidetomo
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 2003.09a
    • /
    • pp.36-39
    • /
    • 2003
  • This paper proposes a simultaneous application of homogeneity analysis and fuzzy clustering with in complete data. Taking the similarity between the loss of homogeneity in homogeneity analysis and the least squares criterion in principal component analysis into account, the new objective function is defined in a similar formulation to the linear fuzzy clustering with missing values. Numerical experiment shows the characteristic properties of the proposed method.

  • PDF

A comparison of imputation methods using machine learning models

  • Heajung Suh;Jongwoo Song
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.3
    • /
    • pp.331-341
    • /
    • 2023
  • Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputations. Recent methods include missForest, missRanger, and mixgb ,all which use machine learning algorithms. This paper compares the imputation techniques for datasets with mixed datatypes in various situations, such as data size, missing ratios, and missing mechanisms. To evaluate the performance of each method in mixed datasets, we propose a new imputation performance measure (IPM) that is a unified measurement applicable to numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results with imputation performances and computational times.

Analysis of categorical data with nonresponses (무응답을 포함하는 범주형 자료의 분석)

  • 박태성;이승연
    • The Korean Journal of Applied Statistics
    • /
    • v.11 no.1
    • /
    • pp.83-95
    • /
    • 1998
  • Statistical models are proposed for analyzing categorical data in the presence of missing observations or nonresponses which might occur in the sampling surveys and polls. As an illustration, we analyzed real polling data of the pre-presidential election in the USA, 1948, It had been predicted that Dewey would win the election. However, Truman won in the actual election.

  • PDF

A Bayesian uncertainty analysis for nonignorable nonresponse in two-way contingency table

  • Woo, Namkyo;Kim, Dal Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.6
    • /
    • pp.1547-1555
    • /
    • 2015
  • We study the problem of nonignorable nonresponse in a two-way contingency table and there may be one or two missing categories. We describe a nonignorable nonresponse model for the analysis of two-way categorical table. One approach to analyze these data is to construct several tables (one complete and the others incomplete). There are nonidentifiable parameters in incomplete tables. We describe a hierarchical Bayesian model to analyze two-way categorical data. We use a nonignorable nonresponse model with Bayesian uncertainty analysis by placing priors in nonidentifiable parameters instead of a sensitivity analysis for nonidentifiable parameters. To reduce the effects of nonidentifiable parameters, we project the parameters to a lower dimensional space and we allow the reduced set of parameters to share a common distribution. We use the griddy Gibbs sampler to fit our models and compute DIC and BPP for model diagnostics. We illustrate our method using data from NHANES III data to obtain the finite population proportions.

Empirical Bayesian Misclassification Analysis on Categorical Data (범주형 자료에서 경험적 베이지안 오분류 분석)

  • 임한승;홍종선;서문섭
    • The Korean Journal of Applied Statistics
    • /
    • v.14 no.1
    • /
    • pp.39-57
    • /
    • 2001
  • Categorical data has sometimes misclassification errors. If this data will be analyzed, then estimated cell probabilities could be biased and the standard Pearson X2 tests may have inflated true type I error rates. On the other hand, if we regard wellclassified data with misclassified one, then we might spend lots of cost and time on adjustment of misclassification. It is a necessary and important step to ask whether categorical data is misclassified before analyzing data. In this paper, when data is misclassified at one of two variables for two-dimensional contingency table and marginal sums of a well-classified variable are fixed. We explore to partition marginal sums into each cells via the concepts of Bound and Collapse of Sebastiani and Ramoni (1997). The double sampling scheme (Tenenbein 1970) is used to obtain informations of misclassification. We propose test statistics in order to solve misclassification problems and examine behaviors of the statistics by simulation studies.

  • PDF