• Title/Summary/Keyword: Imputation accuracy

Search Result 47, Processing Time 0.027 seconds

Imputation Using Factor Score Regression

  • Lee, Sang-Eun;Hwang, Hee-Jin;Shin, Key-Il
    • Communications for Statistical Applications and Methods
    • /
    • v.16 no.2
    • /
    • pp.317-323
    • /
    • 2009
  • Recently not even government polices but small town decisions are based on the survey data/information, so the most of government agencies/organizations demand various sample surveys in each fields for more detail information. However in conducting the sample survey, nonresponse problem rises very often and it becomes a major issue on judging the accuracy of survey. For that matters, one solution ran be using the administration data. However unfortunately most of administration data are restricted to the common users. The other solution can be the imputation. Therefore several method, of imputation are studied in various fields. In this study, in stead of the simple regression imputation method which is commonly used, factor score regression method is applied specially to the incomplete data which have the unit and item misting values in survey data. Here for simulation study, Consumer Expenditure Surveys in Korea are used.

Treatment of Missing Data by Decomposition and Voting with Ordinal Data

  • Chun, Young-M.;Son, Hong-K.;Chung, Sung-S.
    • Journal of the Korean Data and Information Science Society
    • /
    • v.18 no.3
    • /
    • pp.585-598
    • /
    • 2007
  • It is so difficult to get complete data when we conduct a questionaire in actuality. And we get inefficient results if we analyze statistical tests with ignoring missing values. Therefore, we use imputation methods which evaluate quality of data. This study proposes a imputation method by decomposition and voting with ordinal data. First, data are sorted by each variable. After that, imputation methods are used by each decomposition level. And the last step is selection of values with voting. The proposed method is evaluated by accuracy and RMSE. In conclusion, missing values are related to each variable, median imputation method using decomposition and voting is powerful.

  • PDF

A comparison of imputation methods using nonlinear models (비선형 모델을 이용한 결측 대체 방법 비교)

  • Kim, Hyein;Song, Juwon
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.4
    • /
    • pp.543-559
    • /
    • 2019
  • Data often include missing values due to various reasons. If the missing data mechanism is not MCAR, analysis based on fully observed cases may an estimation cause bias and decrease the precision of the estimate since partially observed cases are excluded. Especially when data include many variables, missing values cause more serious problems. Many imputation techniques are suggested to overcome this difficulty. However, imputation methods using parametric models may not fit well with real data which do not satisfy model assumptions. In this study, we review imputation methods using nonlinear models such as kernel, resampling, and spline methods which are robust on model assumptions. In addition, we suggest utilizing imputation classes to improve imputation accuracy or adding random errors to correctly estimate the variance of the estimates in nonlinear imputation models. Performances of imputation methods using nonlinear models are compared under various simulated data settings. Simulation results indicate that the performances of imputation methods are different as data settings change. However, imputation based on the kernel regression or the penalized spline performs better in most situations. Utilizing imputation classes or adding random errors improves the performance of imputation methods using nonlinear models.

Survival Analysis of Gastric Cancer Patients with Incomplete Data

  • Moghimbeigi, Abbas;Tapak, Lily;Roshanaei, Ghodaratolla;Mahjub, Hossein
    • Journal of Gastric Cancer
    • /
    • v.14 no.4
    • /
    • pp.259-265
    • /
    • 2014
  • Purpose: Survival analysis of gastric cancer patients requires knowledge about factors that affect survival time. This paper attempted to analyze the survival of patients with incomplete registered data by using imputation methods. Materials and Methods: Three missing data imputation methods, including regression, expectation maximization algorithm, and multiple imputation (MI) using Monte Carlo Markov Chain methods, were applied to the data of cancer patients referred to the cancer institute at Imam Khomeini Hospital in Tehran in 2003 to 2008. The data included demographic variables, survival times, and censored variable of 471 patients with gastric cancer. After using imputation methods to account for missing covariate data, the data were analyzed using a Cox regression model and the results were compared. Results: The mean patient survival time after diagnosis was $49.1{\pm}4.4$ months. In the complete case analysis, which used information from 100 of the 471 patients, very wide and uninformative confidence intervals were obtained for the chemotherapy and surgery hazard ratios (HRs). However, after imputation, the maximum confidence interval widths for the chemotherapy and surgery HRs were 8.470 and 0.806, respectively. The minimum width corresponded with MI. Furthermore, the minimum Bayesian and Akaike information criteria values correlated with MI (-821.236 and -827.866, respectively). Conclusions: Missing value imputation increased the estimate precision and accuracy. In addition, MI yielded better results when compared with the expectation maximization algorithm and regression simple imputation methods.

Imputation Model for Link Travel Speed Measurement Using UTIS (UTIS 구간통행속도 결측치 보정모델)

  • Ki, Yong-Kul;Ahn, Gye-Hyeong;Kim, Eun-Jeong;Bae, Kwang-Soo
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.10 no.6
    • /
    • pp.63-73
    • /
    • 2011
  • Travel speed is an important parameter for measuring road traffic. UTIS(Urban Traffic Information System) was developed as a mobile detector for measuring link travel speeds in South Korea. After investigation, we founded that UTIS includes some missing data caused by the lack of probe vehicles on road segments, system failures and etc. Imputation is the practice of filling in missing data with estimated values. In this paper, we suggests a new model for imputing missing data to provide accurate link travel speeds to the public. In the field test, new model showed the travel speed measuring accuracy of 93.6%. Therefore, it can be concluded that the proposed model significantly improves travel speed measuring accuracy.

A Novel on Auto Imputation and Analysis Prediction Model of Data Missing Scope based on Machine Learning (머신러닝기반의 데이터 결측 구간의 자동 보정 및 분석 예측 모델에 대한 연구)

  • Jung, Se-Hoon;Lee, Han-Sung;Kim, Jun-Yeong;Sim, Chun-Bo
    • Journal of Korea Multimedia Society
    • /
    • v.25 no.2
    • /
    • pp.257-268
    • /
    • 2022
  • When there is a missing value in the raw data, if ignore the missing values and proceed with the analysis, the accuracy decrease due to the decrease in the number of sample. The method of imputation and analyzing patterns and significant values can compensate for the problem of lower analysis quality and analysis accuracy as a result of bias rather than simply removing missing values. In this study, we proposed to study irregular data patterns and missing processing methods of data using machine learning techniques for the study of correction of missing values. we would like to propose a plan to replace the missing with data from a similar past point in time by finding the situation at the time when the missing data occurred. Unlike previous studies, data correction techniques present new algorithms using DNN and KNN-MLE techniques. As a result of the performance evaluation, the ANAE measurement value compared to the existing missing section correction algorithm confirmed a performance improvement of about 0.041 to 0.321.

Veri cation of Improving a Clustering Algorith for Microarray Data with Missing Values

  • Kim, Su-Young
    • The Korean Journal of Applied Statistics
    • /
    • v.24 no.2
    • /
    • pp.315-321
    • /
    • 2011
  • Gene expression microarray data often include multiple missing values. Most gene expression analysis (including gene clustering analysis); however, require a complete data matric as an input. In ordinary clustering methods, just a single missing value makes one abandon the whole data of a gene even if the rest of data for that gene was intact. The quality of analysis may decrease seriously as the missing rate is increased. In the opposite aspect, the imputation of missing value may result in an artifact that reduces the reliability of the analysis. To clarify this contradiction in microarray clustering analysis, this paper compared the accuracy of clustering with and without imputation over several microarray data having different missing rates. This paper also tested the clustering efficiency of several imputation methods including our propose algorithm. The results showed it is worthwhile to check the clustering result in this alternative way without any imputed data for the imperfect microarray data.

A Generation and Accuracy Evaluation of Common Metadata Prediction Model Using Public Bicycle Data and Imputation Method

  • Kim, Jong-Chan;Jung, Se-Hoon
    • Journal of Korea Multimedia Society
    • /
    • v.25 no.2
    • /
    • pp.287-296
    • /
    • 2022
  • Today, air pollution is becoming a severe issue worldwide and various policies are being implemented to solve environmental pollution. In major cities, public bicycles are installed and operated to reduce pollution and solve transportation problems, and operational information is collected in real time. However, research using public bicycle operation information data has not been processed. This study uses the daily weather data of Korea Meteorological Agency and real-time air pollution data of Korea Environment Corporation to predict the amount of daily rental bicycles. Cross- validation, principal component analysis and multiple regression analysis were used to determine the independent variables of the predictive model. Then, the study selected the elements that satisfy the significance level, constructed a model, predicted the amount of daily rental bicycles, and measured the accuracy.

Considering of the Rainfall Effect in Missing Traffic Volume Data Imputation Method (누락교통량자료 보정방법에서 강우의 영향 고려)

  • Kim, Min-Heon;Oh, Ju-Sam
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.14 no.2
    • /
    • pp.1-13
    • /
    • 2015
  • Traffic volume data is basic information that is used in a wide variety of fields. Existing missing traffic volume data imputation method did not take the effect on the rainfall. This research analyzed considering of the rainfall effect in missing traffic volume data imputation method. In order to consider the effect of rainfall, established the following assumption. When missing of traffic volume data generated in rainy days it would be more accurate to use only the traffic volume data of the past rainy days. To confirm this assumption, compared for accuracy of imputed results at three kinds of imputation method(Unconditional Mean, Auto Regression, Expectation-Maximization Algorithm). The analysis results, the case on consideration of the rainfall effect was more low error occurred.

Imputation of Missing SST Observation Data Using Multivariate Bidirectional RNN (다변수 Bidirectional RNN을 이용한 표층수온 결측 데이터 보간)

  • Shin, YongTak;Kim, Dong-Hoon;Kim, Hyeon-Jae;Lim, Chaewook;Woo, Seung-Buhm
    • Journal of Korean Society of Coastal and Ocean Engineers
    • /
    • v.34 no.4
    • /
    • pp.109-118
    • /
    • 2022
  • The data of the missing section among the vertex surface sea temperature observation data was imputed using the Bidirectional Recurrent Neural Network(BiRNN). Among artificial intelligence techniques, Recurrent Neural Networks (RNNs), which are commonly used for time series data, only estimate in the direction of time flow or in the reverse direction to the missing estimation position, so the estimation performance is poor in the long-term missing section. On the other hand, in this study, estimation performance can be improved even for long-term missing data by estimating in both directions before and after the missing section. Also, by using all available data around the observation point (sea surface temperature, temperature, wind field, atmospheric pressure, humidity), the imputation performance was further improved by estimating the imputation data from these correlations together. For performance verification, a statistical model, Multivariate Imputation by Chained Equations (MICE), a machine learning-based Random Forest model, and an RNN model using Long Short-Term Memory (LSTM) were compared. For imputation of long-term missing for 7 days, the average accuracy of the BiRNN/statistical models is 70.8%/61.2%, respectively, and the average error is 0.28 degrees/0.44 degrees, respectively, so the BiRNN model performs better than other models. By applying a temporal decay factor representing the missing pattern, it is judged that the BiRNN technique has better imputation performance than the existing method as the missing section becomes longer.