• Title/Summary/Keyword: Imputation method

Performance Evaluation of an Imputation Method based on Generative Adversarial Networks for Electric Medical Record (전자의무기록 데이터에서의 적대적 생성 알고리즘 기반 결측값 대치 알고리즘 성능분석)

  • Jo, Yong-Yeon;Jeong, Min-Yeong;Hwangbo, Yul
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.879-881 / 2019
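  • Large-scale data collected in clinical settings, such as electronic medical records (EMR), have great potential value for clinical interpretation and can be used in many ways, but they are difficult to analyze because their many missing values make them highly sparse. In particular, missing values that arise during EMR data collection are random and arbitrary, and they are a major cause of reduced analysis accuracy and degraded predictive model performance, so missing-value imputation is indispensable. Recently, algorithms based on deep learning have been emerging in place of the conventionally used statistical imputation algorithms. In this paper, we apply Generative Adversarial Imputation Nets (GAIN), a recent missing-value imputation algorithm based on the Generative Adversarial Network, and analyze its performance on EMR data.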

Association measure of doubly interval censored data using a Kendall's 𝜏 estimator

  • Kang, Seo-Hyun;Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / v.28 no.2 / pp.151-159 / 2021
  • In this article, our interest is in estimating the association between consecutive gap times that are subject to interval censoring. Such data are referred to as doubly interval censored data (Sun, 2006). In the context of serial events, induced dependent censoring frequently occurs, resulting in biased estimates. In this study, our goal is to propose a Kendall's 𝜏 based association measure for doubly interval censored data. To adjust for the impact of induced dependent censoring, the inverse probability censoring weighting (IPCW) technique is implemented. Furthermore, a multiple imputation technique is applied to recover failure times that are unknown owing to interval censoring. Simulation studies demonstrate that the suggested association estimator performs well with moderate sample sizes. The proposed method is applied to a dataset of children's dental records.
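
As a point of reference for the abstract above, the following is a minimal sketch of the ordinary Kendall's 𝜏 (the tau-a variant) for fully observed pairs. It is not the estimator proposed in the paper: the IPCW weights and the multiple imputation step for interval-censored times are not reproduced, and the function name `kendall_tau_a` is a hypothetical one chosen for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Kendall's tau-a for complete (uncensored) paired data.

    Illustrative only: no censoring weights and no tie correction.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    score = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)
```

With censored gap times, each pair would additionally need a weight and an imputed value, which is where the method described in the abstract departs from this sketch.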

A study on multiple imputation modeling for Korean EAPS (경제활동인구조사 자료를 위한 다중대체 방식 연구)

  • Park, Min-Jeong;Bae, Yoonjong;Kim, Joungyoun
    • The Korean Journal of Applied Statistics / v.34 no.5 / pp.685-696 / 2021
  • The Korean Economically Active Population Survey (KEAPS) is a national survey that produces employment-related statistics. The main purpose of the survey is to determine the economic activity status (employed/unemployed/non-employed) of the population. KEAPS has unique characteristics that stem from its survey method. In this study, we present an improved imputation model based on an understanding of structural non-response and the use of past data. The performance of the proposed model is compared with that of the existing model through simulation. The imputation models are evaluated on their matching/nonmatching rates. For this, we use the KEAPS data from November 2019. For randomly selected respondents among the total of 59,996, the six explanatory variables that are critical in determining the economic activity status are treated as non-response. The proposed model includes an industry variable and a job status variable in addition to the explanatory variables used in the preceding research. This is made possible by linking and using past data. The simulation results confirm that the proposed model with the additional variables outperforms the existing model from the preceding research. In addition, we consider various scenarios for the number of non-respondents by economic activity status.

Enhancement of durability of tall buildings by using deep-learning-based predictions of wind-induced pressure

  • K.R. Sri Preethaa;N. Yuvaraj;Gitanjali Wadhwa;Sujeen Song;Se-Woon Choi;Bubryur Kim
    • Wind and Structures / v.36 no.4 / pp.237-247 / 2023
  • The emergence of high-rise buildings has necessitated frequent structural health monitoring and maintenance for safety reasons. Wind causes damage and structural changes in tall structures; thus, structures should be designed to be safe. The pressure developed on tall buildings has been used in previous research to assess the impact of wind on structures. The wind tunnel test is a primary research method commonly used to quantify the aerodynamic characteristics of high-rise buildings. Wind pressure is measured by placing pressure sensor taps at different locations on tall buildings, and the collected data are used for analysis. However, sensors may malfunction and produce erroneous data; these data losses make it difficult to analyze aerodynamic properties. Therefore, it is essential to generate the missing data from the original data obtained from neighboring pressure sensor taps at various intervals. This study proposes a deep convolutional generative adversarial network (DCGAN) to restore missing data associated with faulty pressure sensors installed on high-rise buildings. The performance of the proposed DCGAN is validated against a standard imputation model, the generative adversarial imputation network (GAIN). The average mean-square error (AMSE) and average R-squared (ARSE) are used as performance metrics. The ARSE values obtained by DCGAN on the front, back, left, and right sides of the building model are 0.970, 0.972, 0.984, and 0.978, respectively. The AMSE values produced by DCGAN on the four sides of the building model are 0.008, 0.010, 0.015, and 0.014. The average standard deviations of the actual pressure sensor measurements on the four sides of the model were 0.1738, 0.1758, 0.2234, and 0.2278. The average standard deviations of the pressure values generated by the proposed DCGAN imputation model were closer to those of the actual measurements, with values of 0.1736, 0.1746, 0.2191, and 0.2239 on the four sides, respectively. In comparison, the standard deviations of the values predicted by GAIN are 0.1726, 0.1735, 0.2161, and 0.2209, which are farther from the actual values. The results demonstrate that the DCGAN model is better suited to data imputation than the GAIN model, with improved accuracy and lower error. Additionally, the DCGAN is used to estimate the wind pressure in regions of buildings where no pressure sensor taps are available; the model yielded greater prediction accuracy than GAIN.

A Research for Imputation Method of Photovoltaic Power Missing Data to Apply Time Series Models (태양광 발전량 데이터의 시계열 모델 적용을 위한 결측치 보간 방법 연구)

  • Jeong, Ha-Young;Hong, Seok-Hoon;Jeon, Jae-Sung;Lim, Su-Chang;Kim, Jong-Chan;Park, Chul-Young
    • Journal of Korea Multimedia Society / v.24 no.9 / pp.1251-1260 / 2021
  • This paper discusses missing-data processing using a simple moving average (SMA) and a Kalman filter, and compares the values predicted by the two methods. Time series analysis is a common way to handle time series data in the photovoltaic field. A photovoltaic system records data irregularly, whenever the power value changes. Irregularly recorded data must be converted into a consistent format to obtain accurate results. Missing data arise when the records are converted to equal intervals. For this reason, the missing values were imputed using the SMA and the Kalman filter. The Kalman filter fits the observed data better than the SMA. The SMA graph is a stepped line, whereas the Kalman filter graph is a smooth line. The MAPE of the SMA prediction is 0.00737%, and the MAPE of the Kalman prediction is 0.00078%. However, the time complexity of the SMA is O(N), whereas that of the Kalman filter is O(D^2) for a D-dimensional object. Accordingly, we suggest choosing between the two methods with the available computational power in mind.
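
The SMA side of the comparison above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the series has already been resampled to equal intervals with gaps marked as `None`, it uses a trailing window over the preceding values, and the names `sma_impute` and `mape` are hypothetical. The Kalman filter is not shown.

```python
from typing import List, Optional, Sequence


def sma_impute(values: Sequence[Optional[float]], window: int = 3) -> List[float]:
    """Fill each gap (None) with the mean of up to `window` preceding values.

    Earlier imputed values are reused as history (an assumption of this sketch).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    filled: List[float] = []
    for value in values:
        if value is None:
            history = filled[-window:]
            if not history:
                raise ValueError("the series must start with an observed value")
            value = sum(history) / len(history)
        filled.append(value)
    return filled


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

Each imputed point costs one pass over a fixed-size window, so filling a series of N points is linear in N, which matches the O(N) complexity quoted for the SMA.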

Comparison of GEE Estimation Methods for Repeated Binary Data with Time-Varying Covariates on Different Missing Mechanisms (시간-종속적 공변량이 포함된 이분형 반복측정자료의 GEE를 이용한 분석에서 결측 체계에 따른 회귀계수 추정방법 비교)

  • Park, Boram;Jung, Inkyung
    • The Korean Journal of Applied Statistics / v.26 no.5 / pp.697-712 / 2013
  • When analyzing repeated binary data, the generalized estimating equations (GEE) approach produces consistent estimates of regression parameters even if an incorrect working correlation matrix is used. However, in finite samples, the coefficients of time-varying covariates change more across working correlation structures than those of time-invariant covariates. In addition, the GEE approach may give biased estimates under missing at random (MAR). Weighted estimating equations and multiple imputation methods have been proposed to reduce biases in parameter estimates under MAR. This article studies whether the two methods produce robust estimates across various working correlation structures for longitudinal binary data with time-varying covariates under different missing mechanisms. Through simulation, we observe that time-varying covariates show greater differences in parameter estimates across working correlation structures than time-invariant covariates. The multiple imputation method produces more robust estimates under any working correlation structure and smaller biases compared to the other two methods.

An estimation method for non-response model using Monte-Carlo expectation-maximization algorithm (Monte-Carlo expectation-maximization 방법을 이용한 무응답 모형 추정방법)

  • Choi, Boseung;You, Hyeon Sang;Yoon, Yong Hwa
    • Journal of the Korean Data and Information Science Society / v.27 no.3 / pp.587-598 / 2016
  • Non-response is one of the major issues when the outcome of an election is predicted ahead of the vote. A variety of non-response imputation methods may be employed to address it, but the forecasting results tend to vary by method. In this study, in order to improve electoral forecasts, we studied a model-based method of non-response imputation that applies the Monte Carlo Expectation Maximization (MCEM) algorithm introduced by Wei and Tanner (1990). The MCEM algorithm using maximum likelihood estimates (MLEs) is applied to solve the boundary solution problem under the non-ignorable non-response mechanism. We performed simulation studies to compare the estimation performance of MCEM, maximum likelihood estimation, and a Bayesian estimation method. The simulation results showed that the MCEM method can be a reasonable candidate for non-response model estimation. We also applied the MCEM method to the Korean presidential election exit poll data of 2012 and investigated prediction performance using the modified within precinct error (MWPE) criterion (Bautista et al., 2007).

A Study on One Factorial Longitudinal Data Analysis with Informative Drop-out

  • Lee, Ki-Hoon
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1053-1065 / 2006
  • This paper proposes a method for one-way layouts with longitudinal data subject to informative drop-out. When drop-outs are informative, that is, correlated with the unobserved data and/or the previously observed data, simple imputation methods such as 'last observation carried forward' (LOCF) can bias the tests. A maximum likelihood procedure combined with a logit model for the drop-out process is proposed to test treatment effects in one-factor designs, and it is compared with the LOCF method in two examples.
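
For readers unfamiliar with LOCF, the simple imputation rule that the abstract above uses as its baseline can be sketched as follows. This is a minimal illustration, assuming one subject's repeated measurements are held in a list with `None` marking visits after drop-out; the function name `locf` is hypothetical, and the paper's maximum likelihood procedure is not shown.

```python
from typing import List, Optional, Sequence


def locf(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Last observation carried forward for one subject's measurements.

    Leading gaps stay None because there is nothing to carry forward.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in values:
        if value is not None:
            last = value  # remember the most recent observed value
        filled.append(last)
    return filled
```

Because every missing visit is frozen at the last observed level, any trend that continues after drop-out is ignored, which is one way the bias described in the abstract can arise.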

Analysis of Missing Data Using an Empirical Bayesian Method (경험적 베이지안 방법을 이용한 결측자료 연구)

  • Yoon, Yong Hwa;Choi, Boseung
    • The Korean Journal of Applied Statistics / v.27 no.6 / pp.1003-1016 / 2014
  • Proper imputation of missing data is an important step toward sound analysis of survey data. This paper deals with both a model-based imputation method and a model estimation method. We used a Bayesian method to solve the boundary solution problem that arises when maximum likelihood estimation is applied. We also address the problem of selecting a missing mechanism model by using forecasting results and a comparison of model accuracies. We used the modified within precinct error (MWPE) (Bautista et al., 2007) to measure prediction correctness. We applied the proposed ML and Bayesian methods to the Korean presidential election exit poll data of 2012. In this analysis, predictions under the missing at random mechanism were better than those under the missing not at random mechanism.

A comparison study for accuracy of exit poll based on nonresponse model (무응답모형에 기반한 출구조사의 예측 정확성 비교 연구)

  • Kwak, Jeongae;Choi, Boseung
    • Journal of the Korean Data and Information Science Society / v.25 no.1 / pp.53-64 / 2014
  • One of the major problems in forecasting elections, especially from surveys, is nonresponse. Forecasting results may differ depending on the imputation method. Handling nonresponse is even more important in a survey on a sensitive subject, such as a presidential election. In this research, we consider a model-based method of nonresponse imputation. A model-based imputation method must be built on an assumption about the nonresponse mechanism and may produce different results depending on that mechanism. The assumed nonresponse mechanism is therefore a very important precondition for accurate forecasts. However, there is no exact way to verify the assumption about the nonresponse mechanism. In this paper, we compare prediction accuracy across assumed nonresponse mechanisms, based on the results of a presidential election exit poll. We use maximum likelihood estimation based on the EM algorithm to handle the assumed nonresponse model. We also use the modified within precinct error proposed by Bautista (2007) to compare the prediction results.