• Title/Summary/Keyword: Imputation method

Filling in Hydrological Missing Data Using Imputation Methods

  • Kang, Tae-Ho;Hong, Il-Pyo;Kim, Young-Oh
    • Proceedings of the Korea Water Resources Association Conference / 2009.05a / pp.1254-1259 / 2009
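  • Historical hydrological observations are analyzed to evaluate and forecast with various hydrological models and to support water resources policy decisions. However, the collected data contain missing values because of instrument malfunctions and limits on the observable range. Vectors containing missing values have simply been excluded, or averages have been used under the assumption that the data are linear across the missing interval, but these practices can distort the statistical properties of the data. This study seeks a way of filling in missing values that minimizes the loss and distortion of the information the data hold. Missingness is broadly classified as missing completely at random (MCAR), missing at random (MAR), and nonrandom missingness. Hydrological data generally fall under MAR, in which the period containing missing values is not statistically identical to the rest of the record but the missing values can still be estimated, so the missing values were filled in under this assumption. Local Least Squares Imputation (LLSimpute) was used to estimate the missing values and was compared with linear interpolation, which has been widely used because it is simple. To assess applicability, 1-5% missing data were randomly generated in the daily inflow record of Soyanggang Dam. For each amount of missing data, 100 sets were used to compare and evaluate the uncertainty range of the imputation for each method, and the change in imputation performance as the missing fraction increased was examined. Evaluating the two methods with the Normalized Root Mean Squared Error (NRMSE) showed that (1) the lower the fraction of missing data, the more effective simple linear interpolation was; (2) as the missing fraction increased, however, linear interpolation showed progressively larger uncertainty and poorer performance, whereas (3) LLSimpute showed consistent imputation performance and a consistent uncertainty range regardless of the increase in missing data.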
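
The evaluation loop described in this abstract (randomly hide a fraction of a series, fill it in, score with NRMSE over 100 sets) can be sketched roughly as follows. This is only an illustration under stated assumptions: the inflow series is synthetic, the names `nrmse` and `mask_and_interpolate` are hypothetical, NRMSE is assumed to be normalized by the standard deviation of the hidden true values (the abstract does not state the normalization), and only the linear interpolation baseline is shown, not LLSimpute.

```python
import numpy as np

def nrmse(truth, estimate):
    """RMSE divided by the standard deviation of the true values (assumed normalization)."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    rmse = np.sqrt(np.mean((truth - estimate) ** 2))
    return rmse / np.std(truth)

def mask_and_interpolate(series, missing_fraction, rng):
    """Hide a random fraction of interior points, fill them by linear
    interpolation, and return the NRMSE over the hidden points."""
    series = np.asarray(series, dtype=float)
    n = series.size
    n_missing = max(1, int(round(missing_fraction * n)))
    # Keep both endpoints observed so every gap has neighbours on each side.
    hidden = rng.choice(np.arange(1, n - 1), size=n_missing, replace=False)
    observed = np.setdiff1d(np.arange(n), hidden)
    filled = np.interp(hidden, observed, series[observed])
    return nrmse(series[hidden], filled)

rng = np.random.default_rng(0)
t = np.arange(1000)
# Synthetic stand-in for a daily inflow record (not the Soyanggang Dam data).
inflow = 50 + 30 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 5, t.size)

for frac in (0.01, 0.03, 0.05):
    scores = [mask_and_interpolate(inflow, frac, rng) for _ in range(100)]
    print(f"{frac:.0%} missing: median NRMSE {np.median(scores):.3f}, "
          f"range {min(scores):.3f} to {max(scores):.3f}")
```

The spread of the 100 scores at each missing fraction plays the role of the uncertainty range mentioned in the abstract; a second imputation method could be compared by swapping out the `np.interp` line.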

A two-sample test with interval censored competing risk data using multiple imputation

  • Kim, Yuwon;Kim, Yang-Jin
    • The Korean Journal of Applied Statistics / v.30 no.2 / pp.233-241 / 2017
  • Interval censored data frequently occur in observational studies where subjects are followed periodically. In this paper, we propose a test statistic to compare the cumulative incidence functions (CIFs) of two groups with interval censored failure time data in the presence of competing risks. Gray (1988) suggested a test statistic for right censored data that motivated the well-known Fine and Gray subdistribution hazard model. A multiple imputation technique is used to adapt Gray's test statistic to interval censored data. The power and size of the proposed method are investigated through diverse simulation schemes. The main merit of the proposed method is that it is simple to implement with existing software for right censored data. The method is illustrated by analyzing the Bangkok HIV cohort dataset.

Imputation of Missing SST Observation Data Using Multivariate Bidirectional RNN

  • Shin, YongTak;Kim, Dong-Hoon;Kim, Hyeon-Jae;Lim, Chaewook;Woo, Seung-Buhm
    • Journal of Korean Society of Coastal and Ocean Engineers / v.34 no.4 / pp.109-118 / 2022
  • Missing sections of fixed-point sea surface temperature (SST) observation data were imputed using a Bidirectional Recurrent Neural Network (BiRNN). Recurrent Neural Networks (RNNs), the artificial intelligence technique commonly used for time series data, estimate only in the direction of time flow or only in the reverse direction toward the position being estimated, so their performance is poor over long missing sections. In this study, by contrast, estimating in both directions, from before and after the missing section, improves performance even for long-term missing data. In addition, all available data around the observation point (sea surface temperature, air temperature, wind field, atmospheric pressure, humidity) were used together, and exploiting the correlations among them further improved imputation performance. For performance verification, the BiRNN was compared with a statistical model, Multivariate Imputation by Chained Equations (MICE), a machine learning-based Random Forest model, and an RNN model using Long Short-Term Memory (LSTM). For imputation of a long-term missing section of 7 days, the average accuracy of the BiRNN and statistical models is 70.8% and 61.2%, respectively, and the average error is 0.28 degrees and 0.44 degrees, respectively, so the BiRNN model performs better than the other models. Because it applies a temporal decay factor representing the missing pattern, the BiRNN technique is judged to impute better than the existing methods as the missing section becomes longer.

Extended KNN Imputation Based LOF Prediction Algorithm for Real-time Business Process Monitoring Method

  • Kang, Bok-Young;Kim, Dong-Soo;Kang, Suk-Ho
    • The Journal of Society for e-Business Studies / v.15 no.4 / pp.303-317 / 2010
  • In this paper, we propose a novel approach to fault prediction for real-time business process monitoring, based on extended KNN imputation and Local Outlier Factor (LOF) prediction. Existing rule-based approaches to process monitoring have limitations, such as late alarms for fault occurrences and a lack of indicators of real-time progress, because some attributes remain unobserved at a given monitoring phase during process execution. To overcome these limitations, we propose an algorithm for LOF prediction that uses imputation to estimate the unobserved attributes. The LOF of an ongoing instance is calculated by assuming its next probable progress after the current monitoring phase; this is repeated throughout the monitoring phases so that the abnormal termination of the ongoing instance can be predicted. By visualizing the real-time progress in terms of the probability of abnormal termination, we can support more proactive responses to opportunities or risks during real-time monitoring.
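
To illustrate the general idea of estimating the not-yet-observed attributes of an ongoing instance from similar completed instances, here is a minimal sketch of plain KNN imputation. It is not the extended variant proposed in the paper and it does not compute LOF; the function name `knn_impute`, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_impute(row, complete_rows, k=3):
    """Fill NaN entries of `row` with the mean of the k nearest complete rows,
    where distance is measured only on the attributes observed in `row`."""
    row = np.asarray(row, dtype=float)
    complete_rows = np.asarray(complete_rows, dtype=float)
    observed = ~np.isnan(row)
    # Euclidean distance restricted to the observed attributes.
    diffs = complete_rows[:, observed] - row[observed]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    filled = row.copy()
    filled[~observed] = complete_rows[nearest][:, ~observed].mean(axis=0)
    return filled

# Toy history of completed process instances (hypothetical attribute values).
history = [[1.0, 1.0, 10.0],
           [1.0, 2.0, 11.0],
           [9.0, 9.0, 50.0],
           [1.5, 1.0, 12.0]]
ongoing = [1.0, 1.0, np.nan]   # third attribute not yet observed
print(knn_impute(ongoing, history, k=2))
```

Once the unobserved attributes are filled in this way, an outlier score could be computed on the completed vector; that step is left out here because the abstract does not specify how it is done.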

Sparse Data Cleaning using Multiple Imputations

  • Jun, Sung-Hae;Lee, Seung-Joo;Oh, Kyung-Whan
    • International Journal of Fuzzy Logic and Intelligent Systems / v.4 no.1 / pp.119-124 / 2004
  • Real data such as web log files tend to be incomplete, yet we have to extract useful knowledge from them for optimal decision making. Web log data contain much useful information, such as hyperlink information and the web usage of connected users. However, web data are too large to use for effective knowledge discovery and, to make matters worse, they are very sparse. We overcome this sparseness problem by using the Markov Chain Monte Carlo (MCMC) method for multiple imputation. This missing value imputation turns sparse web data into complete data. Our study may be a useful tool for discovering knowledge from sparse data sets. The sparser the data, the better MCMC imputation performs. We verified our work with experiments on data from the UCI machine learning repository.

Estimating a Binomial Proportion with Bayes Estimated Imputed Conditional Means

  • Shin, Min-Woong;Lee, Sang-Eun
    • Communications for Statistical Applications and Methods / v.9 no.1 / pp.63-73 / 2002
  • One analytic imputation technique involving conditional means was described by Schafer and Schenker (2000). Their derivations are based on asymptotic expansions of point estimators and their associated variance estimators, and the result of the imputation can be thought of as a first-order approximation to the estimators. In this paper, we present a method of estimating a binomial proportion using a Bayesian approach to imputed conditional means. That is, instead of using the maximum likelihood (ML) estimator to estimate a binomial proportion, we use Bayesian estimators and show the resulting estimated imputed conditional means.

Multiple imputation inference for stratified random sample with nonignorable nonresponse

  • Shin Minwoong;Lee Sangeun;Lee Sungchul;Lee Juyoung
    • Proceedings of the Korean Statistical Society Conference / 2001.11a / pp.191-194 / 2001
  • In general, imputation problems caused by survey nonresponse have been studied under the assumption that the nonresponse is ignorable. However, a model-based approach can be applied to surveys in which the nonresponse is suspected of being nonignorable. In this study, we use the adjustment cell approach to turn nonignorable nonresponse into ignorable nonresponse within cells, so that methods for ignorable nonresponse can then be applied. The data sets for each nonresponse cell are simulated from a normal distribution.

Logistic Regression Method in Interval-Censored Data

  • Yun, Eun-Young;Kim, Jin-Mi;Ki, Choong-Rak
    • The Korean Journal of Applied Statistics / v.24 no.5 / pp.871-881 / 2011
  • In this paper we propose a logistic regression method to estimate the survival function and the median survival time in interval-censored data. The proposed method is motivated by the data augmentation technique with no sacrifice in augmenting data. In addition, we develop a cross validation criterion to determine the size of data augmentation. We compare the proposed estimator with other existing methods such as the parametric method, the single point imputation method, and the nonparametric maximum likelihood estimator through extensive numerical studies to show that the proposed estimator performs better than others in the sense of the mean squared error. An illustrative example based on a real data set is given.

A Study on Imputation using Adjusted Cohen Method

  • Chung, Sung-Suk;Chun, Young-Min;Lee, Sun-Kyung
    • Journal of the Korean Data and Information Science Society / v.17 no.3 / pp.871-888 / 2006
  • Many studies have been done to develop procedures for dealing with missing values. The most common method is to assign other values to the missing data. The purpose of our study is to propose adjusted Cohen methods and to compare their efficiency with that of other methods through a simulation study. The adjusted Cohen methods use an auxiliary variable to rank the variable with missing values. This leads to a reduced mean square error (MSE) compared with the Cohen method.

KARE Genomewide Association Study of Blood Pressure Using Imputed SNPs

  • Hong, Kyung-Won;Lim, Ji-Eun;Kim, Young-Jin;Cho, Nam-H.;Shin, Chol;Oh, Berm-Seok
    • Genomics & Informatics / v.8 no.3 / pp.103-107 / 2010
  • The imputation of untyped SNPs enables researchers to validate association findings across SNP arrays and also enables them to test a large number of SNPs to reveal the fine structure of the association peak, facilitating interpretation of the results and the location of causal polymorphisms. In this study, we applied the imputation method to a genomewide association study and recapitulated the previously associated gene loci of blood pressure traits in Korean cohorts. A total of 1,827,004 SNPs were imputed by the IMPUTE program, and we conducted a genomewide association study for systolic and diastolic blood pressure. While no SNPs passed the Bonferroni-corrected p-value threshold (p=$2.74\times10^{-8}$ for 1,827,004 SNPs), 12 novel loci for systolic blood pressure and 16 novel loci for diastolic blood pressure were detected by imputed SNPs, with $10^{-5}$ < p-value < $10^{-4}$. Moreover, in 7 regions (ATP2B1, 10p15.1, ARHGEF12, ALX4, LIPC, 7q31.1, and TCF7L2) out of the 14 previously reported genetic loci, the imputed SNPs had lower p-values than the genotyped SNPs. In addition, a nonsynonymous SNP in the CSMD1 gene, one of the 14 genes, was found to be associated with systolic blood pressure (p<0.05). These results suggest that the imputation method can facilitate the discovery of novel SNPs as well as enhance the fine structure of the association peak in the loci.
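
The Bonferroni threshold quoted in this abstract can be reproduced with a short calculation. The SNP count comes from the abstract; the family-wise significance level of 0.05 is an assumption, since the abstract does not state it, though it is the value that matches the reported threshold.

```python
alpha = 0.05          # assumed family-wise significance level (not stated in the abstract)
n_tests = 1_827_004   # number of imputed SNPs tested, from the abstract

# Bonferroni correction: divide the significance level by the number of tests.
threshold = alpha / n_tests
print(f"{threshold:.2e}")  # prints 2.74e-08
```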