• Title/Abstract/Keywords: Imputation method

A two-sample test with interval censored competing risk data using multiple imputation

  • 김유원;김양진
    • 응용통계연구 / Vol. 30, No. 2 / pp.233-241 / 2017
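  • Interval-censored data are a type of survival data that often arise in observational studies: the time of the event of interest cannot be observed exactly and is instead recorded as two observation times that bracket it. The purpose of this study is to propose a test statistic for comparing the cumulative incidence functions of two groups when competing risks occur in interval-censored data. In particular, the power and significance level of the test are obtained from data generated by multiple imputation. A simulation study examined whether the proposed method gives adequate results in a variety of settings, and, as a real-data example, the method was applied to compare the HIV incidence functions of male and female groups.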
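
As a rough illustration of the multiple-imputation idea in this abstract, the sketch below replaces each interval-censored observation with a time drawn from its interval, repeats this several times, and averages a simple summary across the imputed data sets. The uniform draw, the function names, and the toy intervals are assumptions for illustration only; the abstract does not specify the paper's imputation scheme or its test statistic.

```python
import random

def impute_once(intervals, rng):
    """Draw one exact event time inside each (left, right) interval.
    The uniform draw is an assumption, not the paper's scheme."""
    return [rng.uniform(left, right) for left, right in intervals]

def multiply_impute(intervals, m=10, seed=0):
    """Return m imputed data sets of exact event times."""
    rng = random.Random(seed)
    return [impute_once(intervals, rng) for _ in range(m)]

def mean(values):
    return sum(values) / len(values)

# Hypothetical interval-censored observations: the event happened
# somewhere between two consecutive observation times.
intervals = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0), (0.5, 1.5)]
imputed_sets = multiply_impute(intervals, m=20)

# Any complete-data statistic can be computed on each imputed set and
# then combined; here the statistic is simply the mean event time.
per_set_means = [mean(s) for s in imputed_sets]
pooled_mean = mean(per_set_means)
```

In a two-sample setting, the per-set statistic would be the chosen test statistic computed on each completed data set rather than a mean.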

Imputation of Missing SST Observation Data Using Multivariate Bidirectional RNN

  • 신용탁;김동훈;김현재;임채욱;우승범
    • 한국해안·해양공학회논문집 / Vol. 34, No. 4 / pp.109-118 / 2022
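  • Missing segments in sea surface temperature (SST) data observed at a fixed station were imputed using a bidirectional recurrent neural network (BiRNN). Recurrent neural networks (RNNs), which are commonly applied to time series, estimate only in the forward or the backward time direction up to the missing position, so their performance degrades over long gaps. In contrast, this study estimates from both directions, before and after the gap, which can improve performance even for long gaps. In addition, all available data around the station (water temperature, air temperature, wind field, air pressure, humidity) were used so that the imputed values are estimated jointly from the correlations among these variables, with the aim of further improving performance. For validation, the model was compared with Multivariate Imputation by Chained Equations (MICE), a statistics-based model; a machine-learning-based Random Forest model; and an RNN model using Long Short-Term Memory (LSTM). For imputation over a long 7-day gap, the BiRNN and the statistical models achieved average accuracies of 70.8% and 61.2% and average errors of 0.28 and 0.44 degrees, respectively, so the BiRNN model outperforms the other models. With a temporal decay factor that represents the missing pattern, the BiRNN is judged to outperform existing methods by a growing margin as the gap becomes longer.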
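
To make the temporal decay factor more concrete, the sketch below computes, for each time step, how long it has been since the last observation in the forward and in the backward direction, and turns each gap into a weight with the form exp(-max(0, w * gap + b)) used in decay-based recurrent imputation models. This is an assumption about the general technique, not the paper's exact formulation; the scalars `w` and `b` are hypothetical stand-ins for learned parameters.

```python
import math

def time_gaps(observed):
    """Steps since the most recent observed value at each position
    (0 at the first step, where there is no history)."""
    gaps = []
    for t in range(len(observed)):
        if t == 0:
            gaps.append(0)
        elif observed[t - 1]:
            gaps.append(1)
        else:
            gaps.append(gaps[-1] + 1)
    return gaps

def decay(gap, w=0.5, b=0.0):
    """Decay weight in (0, 1]; w and b are hypothetical values that a
    real model would learn from data."""
    return math.exp(-max(0.0, w * gap + b))

# True = observed, False = missing (a three-step gap in the middle).
mask = [True, True, False, False, False, True, True]

forward_gaps = time_gaps(mask)
backward_gaps = time_gaps(mask[::-1])[::-1]

forward_weights = [decay(g) for g in forward_gaps]
backward_weights = [decay(g) for g in backward_gaps]
```

Inside the gap, the forward weights shrink toward its end while the backward weights shrink toward its start, which is one way to see why information from both directions helps on long gaps.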

Extended KNN Imputation Based LOF Prediction Algorithm for Real-time Business Process Monitoring Method

  • 강복영;김동수;강석호
    • 한국전자거래학회지 / Vol. 15, No. 4 / pp.303-317 / 2010
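  • This paper extends a model that combines KNN imputation with the LOF algorithm and proposes a method for predicting abnormal termination in real-time business process monitoring. Existing rule-based monitoring methods are limited in that early warning and real-time response are difficult, because some information remains unobserved depending on how far the process has progressed. To address this, we propose an algorithm that estimates the LOF expected at the termination point by making assumptions about the unobserved information and predicting the future path of the running process. Applying this algorithm during real-time monitoring predicts the outcome at termination at every observation point, so the trend across all time points can be examined to predict the termination pattern. Visualizing the real-time progress of a business process in this way makes it possible to respond to opportunities and threats in advance, which is expected to improve the level of process management.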

Sparse Data Cleaning using Multiple Imputations

  • Jun, Sung-Hae;Lee, Seung-Joo;Oh, Kyung-Whan
    • International Journal of Fuzzy Logic and Intelligent Systems / Vol. 4, No. 1 / pp.119-124 / 2004
  • Real data such as web log files tend to be incomplete, but we still have to extract useful knowledge from them for optimal decision making. Web log data contain much useful information, such as hyperlink information and the web usage of connected users. However, web data are too large to use for effective knowledge discovery and, to make matters worse, they are very sparse. We overcome this sparsity problem by using the Markov chain Monte Carlo (MCMC) method for multiple imputation. This missing-value imputation turns sparse web data into complete data. Our study may be a useful tool for discovering knowledge from sparse data sets. The sparser the data, the better MCMC imputation performs. We verified our work through experiments using data from the UCI machine learning repository.

Estimating a Binomial Proportion with Bayes Estimated Imputed Conditional Means

  • Shin, Min-Woong;Lee, Sang-Eun
    • Communications for Statistical Applications and Methods / Vol. 9, No. 1 / pp.63-73 / 2002
  • One analytic imputation technique involving conditional means was described by Schafer and Schenker (2000). Their derivations are based on asymptotic expansions of point estimators and their associated variance estimators, and the result of imputation can be thought of as a first-order approximation to the estimators. In this paper, we present a method for estimating a binomial proportion using a Bayesian approach to imputed conditional means. That is, instead of the maximum likelihood (ML) estimator that is generally used to estimate a binomial proportion, we use Bayesian estimators and show the resulting estimated imputed conditional means.
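
The contrast between the ML estimator and a Bayesian estimator of a binomial proportion can be sketched as follows. The Beta(a, b) prior and the use of the posterior mean are assumptions chosen for illustration; the abstract does not state which prior or which Bayesian estimator the authors use, and the counts are made up.

```python
def ml_estimate(successes, n):
    """Maximum likelihood estimate of a binomial proportion."""
    return successes / n

def bayes_estimate(successes, n, a=1.0, b=1.0):
    """Posterior mean under a Beta(a, b) prior.
    The prior (uniform by default) is an assumption, not the paper's."""
    return (successes + a) / (n + a + b)

# Hypothetical data: 3 successes among 10 responding units.
x, n = 3, 10
p_ml = ml_estimate(x, n)
p_bayes = bayes_estimate(x, n)
```

With a uniform prior the Bayesian estimate is pulled slightly toward 0.5, and it stays defined and away from 0 or 1 even when the observed count is 0 or n.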

Multiple imputation inference for stratified random sample with nonignorable nonresponse

  • 신민웅;이상은;이성철;이주영
    • 한국통계학회:학술대회논문집 / Proceedings of the Korean Statistical Society 2001 Autumn Conference / pp.191-194 / 2001
  • In general, imputation problems caused by survey nonresponse have been studied under the assumption that the nonresponse is ignorable. However, a model-based approach can be applied to surveys in which the nonresponse is suspected of being nonignorable. In this study, we use the adjustment cell approach to turn nonignorable nonresponse into ignorable nonresponse within cells, so that methods for ignorable nonresponse can be applied. The data for each nonresponse cell are simulated from a normal distribution.

Logistic Regression Method in Interval-Censored Data

  • Yun, Eun-Young;Kim, Jin-Mi;Ki, Choong-Rak
    • 응용통계연구 / Vol. 24, No. 5 / pp.871-881 / 2011
  • In this paper we propose a logistic regression method to estimate the survival function and the median survival time in interval-censored data. The proposed method is motivated by the data augmentation technique with no sacrifice in augmenting data. In addition, we develop a cross validation criterion to determine the size of data augmentation. We compare the proposed estimator with other existing methods such as the parametric method, the single point imputation method, and the nonparametric maximum likelihood estimator through extensive numerical studies to show that the proposed estimator performs better than others in the sense of the mean squared error. An illustrative example based on a real data set is given.

A Study on Imputation using Adjusted Cohen Method

  • Chung, Sung-Suk;Chun, Young-Min;Lee, Sun-Kyung
    • Journal of the Korean Data and Information Science Society / Vol. 17, No. 3 / pp.871-888 / 2006
  • Many studies have developed procedures for dealing with missing values. The most common approach is to assign substitute values to the missing data. The purpose of our study is to propose adjusted Cohen methods and to compare their efficiency with that of other methods through a simulation study. The adjusted Cohen methods use an auxiliary variable to rank the variable with missing values, which leads to a smaller mean squared error (MSE) than the Cohen method.

KARE Genomewide Association Study of Blood Pressure Using Imputed SNPs

  • Hong, Kyung-Won;Lim, Ji-Eun;Kim, Young-Jin;Cho, Nam-H.;Shin, Chol;Oh, Berm-Seok
    • Genomics & Informatics / Vol. 8, No. 3 / pp.103-107 / 2010
  • The imputation of untyped SNPs enables researchers to validate association findings across SNP arrays and also enables them to test a large number of SNPs to reveal the fine structure of the association peak, facilitating interpretation of the results and the location of causal polymorphisms. In this study, we applied the imputation method to a genomewide association study and recapitulated the previously associated gene loci of blood pressure traits in Korean cohorts. A total of 1,827,004 SNPs were imputed by the IMPUTE program, and we conducted a genomewide association study for systolic and diastolic blood pressure. While no SNPs passed the Bonferroni correction p-value (p=$2.74{\times}10^{-8}$ for 1,827,004 SNPs), 12 novel loci for systolic blood pressure and 16 novel loci for diastolic blood pressure were detected by imputed SNPs, with $10^{-5}$ < p-value < $10^{-4}$. Moreover, 7 regions (ATP2B1, 10p15.1, ARHGEF12, ALX4, LIPC, 7q31.1, and TCF7L2) out of 14 genetic loci that were previously reported revealed that the imputed SNPs had lower p-values than those of genotyped SNPs. Moreover, a nonsynonymous SNP in the CSMD1 gene, one of the 14 genes, was found to be associated with systolic blood pressure (p<0.05). These results suggest that the imputation method can facilitate the discovery of novel SNPs as well as enhance the fine structure of the association peak in the loci.
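
The Bonferroni threshold quoted in this abstract can be reproduced with a one-line calculation. The family-wise significance level of 0.05 is an assumption (the abstract gives only the resulting threshold and the number of SNPs), and the function name is illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# 1,827,004 imputed SNPs are reported in the abstract; alpha = 0.05 is assumed.
threshold = bonferroni_threshold(0.05, 1_827_004)
formatted = f"{threshold:.2e}"
```

Under that assumption the result rounds to the 2.74 x 10^-8 reported in the abstract.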

Breast Cancer and Modifiable Lifestyle Factors in Argentinean Women: Addressing Missing Data in a Case-Control Study

  • Coquet, Julia Becaria;Tumas, Natalia;Osella, Alberto Ruben;Tanzi, Matteo;Franco, Isabella;Diaz, Maria Del Pilar
    • Asian Pacific Journal of Cancer Prevention / Vol. 17, No. 10 / pp.4567-4575 / 2016
  • A number of studies have evidenced the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data is a frequent problem in breast cancer studies and epidemiological settings in general. Estimates of effect obtained from these studies may be biased, if no appropriate method for handling missing data is applied. We performed Multiple Imputation for missing values on covariates in a breast cancer case-control study of Córdoba (Argentina) to optimize risk estimates. Data was obtained from a breast cancer case control study from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were performed. When complete case analysis was performed only 31% of women were considered. Although a positive association of Traditional dietary pattern and breast cancer was observed from both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple Imputation is suggested in Latin America's epidemiologic studies to optimize effect estimates in the future.
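
After a logistic regression has been fitted to each imputed data set, the per-data-set estimates are usually combined with Rubin's rules. The sketch below pools log odds ratios this way; the five estimates and variances are made-up numbers for illustration, not values from the study, and the abstract does not state how many imputations were used.

```python
import math

def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                     # pooled point estimate
    within = sum(variances) / m                    # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between         # total variance
    return q_bar, total

# Hypothetical log odds ratios and squared standard errors from m = 5
# imputed data sets (illustrative values only).
log_ors = [0.30, 0.35, 0.32, 0.38, 0.33]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]

pooled_log_or, total_var = pool_rubin(log_ors, variances)
pooled_or = math.exp(pooled_log_or)
```

The total variance exceeds the average within-imputation variance because it also carries the uncertainty due to the missing values, which complete case analysis does not account for.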