• Title/Summary/Keyword: Missing data

Search Result 1,286, Processing Time 0.028 seconds

Comparision of Missing Imputaion Methods In fine dust data (미세먼지 자료에서의 결측치 대체 방법 비교)

  • Kim, YeonJin;Park, HeonJin
    • The Journal of Bigdata
    • /
    • v.4 no.2
    • /
    • pp.105-114
    • /
    • 2019
  • Missing value replacement is one of the big issues in data analysis. If you ignore the occurrence of the missing value and proceed with the analysis, a bias can occur and give incorrect results for the estimate. In this paper, we need to find and apply an appropriate alternative to missing data from weather data. Through this, we attempted to clarify and compare the simulations for various situations using existing methods such as MICE and MissForest based on R and time series-based models. When comparing these results with each variable, it was determined that the kalman filter of the auto arima model using the ImputeTS package and the MissForest model gave good results in the weather data.

  • PDF

Large tests of independence in incomplete two-way contingency tables using fractional imputation

  • Kang, Shin-Soo;Larsen, Michael D.
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.4
    • /
    • pp.971-984
    • /
    • 2015
  • Imputation procedures fill-in missing values, thereby enabling complete data analyses. Fully efficient fractional imputation (FEFI) and multiple imputation (MI) create multiple versions of the missing observations, thereby reflecting uncertainty about their true values. Methods have been described for hypothesis testing with multiple imputation. Fractional imputation assigns weights to the observed data to compensate for missing values. The focus of this article is the development of tests of independence using FEFI for partially classified two-way contingency tables. Wald and deviance tests of independence under FEFI are proposed. Simulations are used to compare type I error rates and Power. The partially observed marginal information is useful for estimating the joint distribution of cell probabilities, but it is not useful for testing association. FEFI compares favorably to other methods in simulations.

Reject Inference of Incomplete Data Using a Normal Mixture Model

  • Song, Ju-Won
    • The Korean Journal of Applied Statistics
    • /
    • v.24 no.2
    • /
    • pp.425-433
    • /
    • 2011
  • Reject inference in credit scoring is a statistical approach to adjust for nonrandom sample bias due to rejected applicants. Function estimation approaches are based on the assumption that rejected applicants are not necessary to be included in the estimation, when the missing data mechanism is missing at random. On the other hand, the density estimation approach by using mixture models indicates that reject inference should include rejected applicants in the model. When mixture models are chosen for reject inference, it is often assumed that data follow a normal distribution. If data include missing values, an application of the normal mixture model to fully observed cases may cause another sample bias due to missing values. We extend reject inference by a multivariate normal mixture model to handle incomplete characteristic variables. A simulation study shows that inclusion of incomplete characteristic variables outperforms the function estimation approaches.

Recovering missing data transmitted from a wireless sensor node for vibration-based bridge health monitoring

  • Kim, C.W.;Kawatani, M.;Ozaki, R.;Makihata, N.
    • Structural Engineering and Mechanics
    • /
    • v.38 no.4
    • /
    • pp.417-428
    • /
    • 2011
  • This paper presents recovering of missing vibration data of a bridge transmitted from wireless sensors. Kalman filter algorithm is adopted to reconstruct the missing data analytically. Validity of the analytical approach is examined through a field experiment of a bridge. Observations demonstrate that, even a part of recovered acceleration responses is underestimated in comparison with those responses taken from cabled sensors, dominant frequencies taken from the reconstructed data are comparable with those from cabled sensors.

Estimate method of missing data using Similarity in AMI system (AMI시스템에서 유사도를 활용한 누락데이터 보정 방법)

  • Kwon, Hyuk-Rok;Hong, Taek-Eun;Kim, Pan-Koo
    • Smart Media Journal
    • /
    • v.8 no.4
    • /
    • pp.80-84
    • /
    • 2019
  • As a result of AMI rapidly expanding and distributing its products, variety of services that utilize data on the use of electricity are increasing. In order to make these services more effective, missing metric data needs to be corrected, compensating for which Euclidean similarity is used to find customers with similar usage patterns. Throughout such a process, we propose a method for correcting missing data and provide comparison with the preceding methods.

Handling Incomplete Data Problem in Collaborative Filtering System

  • Noh, Hyun-ju;Kwak, Min-jung;Han, In-goo
    • Proceedings of the KAIS Fall Conference
    • /
    • 2003.11a
    • /
    • pp.105-110
    • /
    • 2003
  • Collaborative filtering is one of the methodologies that are most widely used for recommendation system. It is based on a data matrix of each customer's preferences of products. There could be a lot of missing values in such preference. data matrix. This incomplete data is one of the reasons to deteriorate the accuracy of recommendation system. Multiple imputation method imputes m values for each missing value. It overcomes flaws of single imputation approaches through considering the uncertainty of missing values.. The objective of this paper is to suggest multiple imputation-based collaborative filtering approach for recommendation system to improve the accuracy in prediction performance. The experimental works show that the proposed approach provides better performance than the traditional Collaborative filtering approach, especially in case that there are a lot of missing values in dataset used for recommendation system.

  • PDF

Effect of missing values in detecting differentially expressed genes in a cDNA microarray experiment

  • Kim, Byung-Soo;Rha, Sun-Young
    • Bioinformatics and Biosystems
    • /
    • v.1 no.1
    • /
    • pp.67-72
    • /
    • 2006
  • The aim of this paper is to discuss the effect of missing values in detecting differentially expressed genes in a cDNA microarray experiment in the context of a one sample problem. We conducted a cDNA micro array experiment to detect differentially expressed genes for the metastasis of colorectal cancer based on twenty patients who underwent liver resection due to liver metastasis from colorectal cancer. Total RNAs from metastatic liver tumor and adjacent normal liver tissue from a single patient were labeled with cy5 and cy3, respectively, and competitively hybridized to a cDNA microarray with 7775 human genes. We used $M=log_2(R/G)$ for the signal evaluation, where Rand G denoted the fluorescent intensities of Cy5 and Cy3 dyes, respectively. The statistical problem comprises a one sample test of testing E(M)=0 for each gene and involves multiple tests. The twenty cDNA microarray data would comprise a matrix of dimension 7775 by 20, if there were no missing values. However, missing values occur for various reasons. For each gene, the no missing proportion (NMP) was defined to be the proportion of non-missing values out of twenty. In detecting differentially expressed (DE) genes, we used the genes whose NMP is greater than or equal to 0.4 and then sequentially increased NMP by 0.1 for investigating its effect on the detection of DE genes. For each fixed NMP, we imputed the missing values with K-nearest neighbor method (K=10) and applied the nonparametric t-test of Dudoit et al. (2002), SAM by Tusher et al. (2001) and empirical Bayes procedure by $L\ddot{o}nnstedt$ and Speed (2002) to find out the effect of missing values in the final outcome. These three procedures yielded substantially agreeable result in detecting DE genes. Of these three procedures we used SAM for exploring the acceptable NMP level. The result showed that the optimum no missing proportion (NMP) found in this data set turned out to be 80%. It is more desirable to find the optimum level of NMP for each data set by applying the method described in this note, when the plot of (NMP, Number of overlapping genes) shows a turning point.

  • PDF

Comparison of GEE Estimators Using Imputation Methods (대체방법별 GEE추정량 비교)

  • 김동욱;노영화
    • The Korean Journal of Applied Statistics
    • /
    • v.16 no.2
    • /
    • pp.407-426
    • /
    • 2003
  • We consider the missing covariates problem in generalized estimating equations(GEE) model. If the covariate is partially missing, GEE can not be calculated. In this paper, we study the performance of 7 imputation methods to handle missing covariates in GEE models, and the properties of GEE estimators are investigated after missing covariates are imputed for ordinal data of repeated measurements. The 7 imputation methods include i) Naive Deletion ii) Sample Average Imputation iii) Row Average Imputation iv) Cross-wave Regression Imputation v) Carry-over Imputation vi) Bayesian Bootstrap vii) Approximate Bayesian Bootstrap. A Monte-Carlo simulation is used to compare the performance of these methods. For the missing mechanism generating the missing data, we assume ignorable nonresponse. Furthermore, we generate missing covariates with or without considering wave nonresp onse patterns.

Missing Value Imputation Technique for Water Quality Dataset

  • Jin-Young Jun;Youn-A Min
    • Journal of the Korea Society of Computer and Information
    • /
    • v.29 no.4
    • /
    • pp.39-46
    • /
    • 2024
  • Many researchers make efforts to evaluate water quality using various models. Such models require a dataset without missing values, but in real world, most datasets include missing values for various reasons. Simple deletion of samples having missing value(s) could distort distribution of the underlying data and pose a significant risk of biasing the model's inference when the missing mechanism is not MCAR. In this study, to explore the most appropriate technique for handing missing values in water quality data, several imputation techniques were experimented based on existing KNN and MICE imputation with/without the generative neural network model, Autoencoder(AE) and Denoising Autoencoder(DAE). The results shows that KNN and MICE combined imputation without generative networks provides the closest estimated values to the true values. When evaluating binary classification models based on support vector machine and ensemble algorithms after applying the combined imputation technique to the observed water quality dataset with missing values, it shows better performance in terms of Accuracy, F1 score, RoC-AuC score and MCC compared to those evaluated after deleting samples having missing values.

Imputation method for missing data based on clustering and measure of property (군집화 및 특성도를 이용한 결측치 대체 방법)

  • Kim, Sunghyun;Kim, Dongjae
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.1
    • /
    • pp.29-40
    • /
    • 2018
  • There are various reasons for missing values when collecting data. Missing values have some influence on the analysis and results; consequently, various methods of processing missing values have been studied to solve the problem. It is thought that the later point of view may be affected by the initial time point value in the repeated measurement data. However, in the existing method, there was no method for the imputation of missing values using this concept. Therefore, we proposed a new missing value imputation method in this study using clustering in initial time point of the repeated measurement data and the measure of property proposed by Kim and Kim (The Korean Communications in Statistics, 30, 463-473, 2017). We also applied the Monte Carlo simulations to compare the performance of the established method and suggested methods in repeated measurement data.