• Title/Summary/Keyword: missing data imputation

Search results: 144

Study on Imputation Methods of Missing Real-Time Traffic Data (실시간 누락 교통자료의 대체기법에 관한 연구)

  • Jang Jin-hwan;Ryu Seung-ki;Moon Hak-yong;Byun Sang-cheal
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.3 no.1 s.4 / pp.45-52 / 2004
  • Many cities are installing ITS (Intelligent Transportation Systems) and running TMCs (Traffic Management Centers) to improve the mobility and safety of roadway transportation by providing roadway information to drivers. ITS includes many devices that collect real-time traffic data, from which much valuable information can be obtained. However, missing traffic data are unavoidable for many reasons, such as roadway conditions, adverse weather, communication shutdowns, and problems with the devices themselves. Missing data prevent secondary processing such as travel-time forecasting and other transportation-related research. If such traffic data are used to produce AADT and DHV, essential inputs to roadway planning and design, the skewed results could cause significant losses. Therefore, this study explored several imputation techniques for missing traffic volume data, including heuristic methods, regression models, the EM algorithm, and time-series analysis, evaluated with indices such as MAPE, RMSE, and the inequality coefficient. The time-series model produced the best results, with a MAPE of 5.0%, an inequality coefficient of 0.03, and an RMSE of 110. The other techniques produced somewhat different, but still very encouraging, results.

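The evaluation indices named in the abstract above (MAPE, RMSE, and the inequality coefficient, commonly Theil's U) can be sketched in a few lines. This is a generic illustration with invented hourly traffic counts, not the study's data:

```python
import math

def mape(actual, imputed):
    """Mean absolute percentage error (%); assumes all actual values are non-zero."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, imputed)) / len(actual)

def rmse(actual, imputed):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / len(actual))

def theil_u(actual, imputed):
    """Theil's inequality coefficient: 0 = perfect agreement, 1 = worst."""
    n = len(actual)
    num = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n)
    den = (math.sqrt(sum(a * a for a in actual) / n)
           + math.sqrt(sum(p * p for p in imputed) / n))
    return num / den

# Invented hourly traffic volumes and their imputed replacements
observed = [520, 480, 610, 550]
imputed = [500, 490, 600, 560]
print(round(mape(observed, imputed), 2))
```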

Modified BLS Weight Adjustment (수정된 BLS 가중치보정법)

  • Park, Jung-Joon;Cho, Ki-Jong;Lee, Sang-Eun;Shin, Key-Il
    • Communications for Statistical Applications and Methods / v.18 no.3 / pp.367-376 / 2011
  • BLS weight adjustment is a widely used method for business surveys with non-response and outliers. Recent studies show that the non-response weight adjustment of the BLS method is equivalent to ratio imputation. In this paper, we suggest a modified BLS weight adjustment method that imputes missing values instead of adjusting weights for non-response. Monthly labor survey data are used for a small Monte Carlo simulation, and we conclude that the suggested method is superior to the original BLS weight adjustment method.
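The ratio imputation that the abstract equates with the BLS non-response adjustment can be sketched minimally: each missing response is replaced by the respondent-based ratio times the unit's auxiliary value. The variable names and toy figures below are hypothetical, not from the monthly labor survey:

```python
def ratio_impute(x, y):
    """Ratio imputation: a missing y_i is replaced by R * x_i, where
    R = sum(y_j) / sum(x_j) over responding units (y_j observed)."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    r = sum(yi for _, yi in pairs) / sum(xi for xi, _ in pairs)
    return [yi if yi is not None else r * xi for xi, yi in zip(x, y)]

# x: auxiliary variable known for every unit (e.g. firm size);
# y: survey response with non-response marked as None
x = [10, 20, 30, 40]
y = [15, 30, None, 60]
print(ratio_impute(x, y))
```

Here R = 105 / 70 = 1.5, so the missing third value is imputed as 1.5 × 30 = 45.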

Probability Estimation Method for Imputing Missing Values in Data Expansion Technique (데이터 확장 기법에서 손실값을 대치하는 확률 추정 방법)

  • Lee, Jong Chan
    • Journal of the Korea Convergence Society / v.12 no.11 / pp.91-97 / 2021
  • This paper applies a data expansion technique, originally designed for the rule refinement problem, to handling incomplete data. The technique is characterized in that each event can carry a weight indicating its importance and each variable can be expressed as a probability value. Since the key problem is to find the probability closest to the missing value and replace the missing value with that probability, three different algorithms are used to estimate the probability for each missing value and store it in this data structure format. For evaluation, each information area is classified with the SVM algorithm after learning on each probability structure, and the results are compared with the original information to measure how well they match. The three algorithms for estimating the imputation probability of a missing value share the same data structure but differ in their approach, so they are expected to be useful for various purposes depending on the application field.
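As a loose illustration of the general idea of expressing a missing variable as probability values (not the paper's three specific algorithms), a missing categorical entry can be replaced by the empirical distribution of the observed values, so downstream learners work with soft, weighted events instead of a single hard guess:

```python
from collections import Counter

def impute_as_distribution(column):
    """Replace each missing entry (None) with the empirical probability
    distribution of the observed values; observed entries become
    point-mass distributions with probability 1."""
    observed = [v for v in column if v is not None]
    total = len(observed)
    dist = {v: c / total for v, c in Counter(observed).items()}
    return [{v: 1.0} if v is not None else dist for v in column]

col = ['a', 'b', 'a', None, 'a']
print(impute_as_distribution(col)[3])  # distribution over observed values
```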

A Study on Missing Data Imputation for Water Demand in 112 Block of Yoengjong Island, Korea (영종도 112블록 AMI 물 수요량 결측 자료 보정기법 연구)

  • Koo, Kang Min;Han, Kuk Heon;Yum, Kyung Taek;Jun, Kyung Soo
    • Proceedings of the Korea Water Resources Association Conference / 2019.05a / pp.3-3 / 2019
  • Unpredictable events driven by recent climate change, such as localized heavy rainfall and droughts, have highlighted the need for clean and stable water supply technology. The launch of the smart water grid, which combines IoT with existing water management systems, makes it possible to acquire real-time information on demand and supply and thereby improve the efficiency of water management. Real-time demand data can be used to forecast water demand and determine the optimal water supply, and the core technology of the smart water grid is therefore quality control of the data acquired in real time. In the study area, Block 112 of Yeongjong Island, hourly water demand data are read remotely from 528 AMI smart meters. Data collected through the AMI sensors installed at each customer can contain errors caused by communication failures and meter breakdowns or replacements. Missing demand data, which serve as basic input to hydraulic analysis of the water distribution network, increase non-sampling error and thereby reduce statistical power and accuracy. In this study, the collected data are cleaned into usable form, and hourly demand patterns by pipe diameter, usage type, and day of the week are estimated using only completely observed data. The resulting imputations are validated against conventional methods such as mean imputation and hot deck imputation.

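The two baseline methods the study compares against, mean imputation and hot-deck imputation, can be sketched as follows. The hourly figures and the donor-day pairing are invented for illustration:

```python
def mean_impute(series):
    """Fill each gap (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in series]

def hot_deck_impute(series, donors):
    """Hot-deck: fill each gap with the value from a 'donor' record that
    matches on key attributes -- here, the same hour on a comparable day."""
    return [d if v is None else v for v, d in zip(series, donors)]

day_a = [12.0, None, 9.5, None]   # hourly demand with two gaps
day_b = [11.5, 10.2, 9.8, 8.9]    # donor day with a similar usage pattern
print(mean_impute(day_a))
print(hot_deck_impute(day_a, day_b))
```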

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research / v.61 no.4 / pp.523-541 / 2023
  • As accessibility to 3D printers increases, exposure to chemicals associated with 3D printing is becoming more frequent. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited by missing molecular structure data. In this study, a quantitative structure-activity relationship (QSAR) model based on a data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, the MissForest algorithm was used to impute missing values in the molecular descriptors of hazardous 3D printing materials. Then, based on four different machine learning models (decision tree, random forest, XGBoost, SVM), a machine learning (ML)-based QSAR model was developed to predict the bioconcentration factor (Log BCF), the octanol-air partition coefficient (Log Koa), and the partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated through Tree-SHAP (SHapley Additive exPlanations), one of the explainable artificial intelligence (XAI) techniques. The proposed MissForest-based imputation enlarged the molecular structure data approximately 2.5-fold compared with the existing data. Based on the imputed molecular descriptor dataset, the developed data-centric QSAR model achieved prediction performances of approximately 73%, 76%, and 92% for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric QSAR model achieved high prediction performance for toxicity information by identifying the key molecular descriptors most highly correlated with the toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants from emerging 3D printing chemicals and in chemical, semiconductor, or display processes.
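The MissForest procedure the study relies on can be sketched schematically: initialize gaps with column means, then repeatedly re-predict each originally missing cell from the other columns until the fills stabilize. This is a simplified skeleton with a toy nearest-neighbor predictor standing in for the random forest, not the MissForest implementation itself:

```python
def iterative_impute(rows, fit_predict, n_iter=3):
    """MissForest-style loop (sketch). rows: list of numeric feature lists
    with None for missing cells. fit_predict(X, y, X_new) is any regressor;
    MissForest plugs in a random forest here."""
    n = len(rows[0])
    missing = [(i, j) for i, r in enumerate(rows) for j in range(n) if r[j] is None]
    filled = [r[:] for r in rows]
    # Initial fill: column means over observed values
    for j in range(n):
        obs = [r[j] for r in rows if r[j] is not None]
        mean_j = sum(obs) / len(obs)
        for r in filled:
            if r[j] is None:
                r[j] = mean_j
    # Iteratively re-predict each originally missing cell from the others
    for _ in range(n_iter):
        for j in range(n):
            rows_j = [i for i, jj in missing if jj == j]
            if not rows_j:
                continue
            train_idx = [i for i in range(len(rows)) if rows[i][j] is not None]
            X = [[filled[i][k] for k in range(n) if k != j] for i in train_idx]
            y = [filled[i][j] for i in train_idx]
            X_new = [[filled[i][k] for k in range(n) if k != j] for i in rows_j]
            for i, p in zip(rows_j, fit_predict(X, y, X_new)):
                filled[i][j] = p
    return filled

# Toy predictor standing in for the random forest: mean of the 2 nearest rows
def knn2(X, y, X_new):
    out = []
    for q in X_new:
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], q)))
        out.append((y[order[0]] + y[order[1]]) / 2)
    return out

data = [[1.0, 2.0], [2.0, 4.1], [3.0, None], [4.0, 8.0]]
print(iterative_impute(data, knn2))
```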

Variational Mode Decomposition with Missing Data (결측치가 있는 자료에서의 변동모드분해법)

  • Choi, Guebin;Oh, Hee-Seok;Lee, Youngjo;Kim, Donghoh;Yu, Kyungsang
    • The Korean Journal of Applied Statistics / v.28 no.2 / pp.159-174 / 2015
  • Dragomiretskiy and Zosso (2014) developed a new decomposition method, termed variational mode decomposition (VMD), which is efficient for the tone detection and separation of signals. However, VMD may be inefficient in the presence of missing data since it is based on a fast Fourier transform (FFT) algorithm. To overcome this problem, we propose a new approach based on a novel combination of VMD and the hierarchical (or h-) likelihood method. The h-likelihood provides an effective imputation methodology for missing data when VMD decomposes the signal into several meaningful modes. A simulation study and real data analysis demonstrate that the proposed method produces substantially effective results.

Imputation Accuracy from 770K SNP Chips to Next Generation Sequencing Data in a Hanwoo (Korean Native Cattle) Population using Minimac3 and Beagle (Minimac3와 Beagle 프로그램을 이용한 한우 770K chip 데이터에서 차세대 염기서열분석 데이터로의 결측치 대치의 정확도 분석)

  • An, Na-Rae;Son, Ju-Hwan;Park, Jong-Eun;Chai, Han-Ha;Jang, Gul-Won;Lim, Dajeong
    • Journal of Life Science / v.28 no.11 / pp.1255-1261 / 2018
  • Whole-genome analysis has been made possible by the development of DNA sequencing technologies and the discovery of many single nucleotide polymorphisms (SNPs). Large numbers of SNPs can be analyzed with SNP chips, since SNPs of human as well as livestock genomes are available. Among the various missing-genotype imputation programs, the Minimac3 software is reported to be highly accurate, with a simplified workflow and relatively fast runtime. In the present study, we used the Minimac3 program to impute missing genotypes in 770K SNP chip data from 1,226 animals, using next-generation sequencing data from 311 animals. The accuracy on each chromosome was about 94~96%, and per-sample accuracy was about 92~98%. After imputation of the genotypes, the percentages of SNPs with R-square ($R^2$) values above the three thresholds of 0.4, 0.6, and 0.8 were 91%, 84%, and 70%, respectively. Across seven minor allele frequency intervals, (0, 0.025), (0.025, 0.05), (0.05, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.4), and (0.4, 0.5), the $R^2$ values ranged from 64% to 88%. The total analysis time was about 12 hr. As the size and complexity of genomic datasets increase in future SNP chip studies, we expect that genomic imputation using Minimac3 can improve the reliability of chip data for Hanwoo discrimination.
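The per-chromosome and per-sample accuracies quoted above are essentially concordance rates between imputed genotype calls and the sequenced truth, which can be sketched as follows (the genotype strings are invented for illustration):

```python
def concordance(true_gt, imputed_gt):
    """Fraction of imputed genotype calls that match the sequenced truth."""
    matches = sum(t == p for t, p in zip(true_gt, imputed_gt))
    return matches / len(true_gt)

# Toy example: 5 SNP genotypes for one animal, one mismatch at the third site
truth = ['AA', 'AG', 'GG', 'AG', 'AA']
imputed = ['AA', 'AG', 'AG', 'AG', 'AA']
print(concordance(truth, imputed))
```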

Household, personal, and financial determinants of surrender in Korean health insurance

  • Shim, Hyunoo;Min, Jung Yeun;Choi, Yang Ho
    • Communications for Statistical Applications and Methods / v.28 no.5 / pp.447-462 / 2021
  • In insurance, the surrender rate is an important variable that threatens the sustainability of insurers and determines the profitability of a contract. Unlike other actuarial assumptions that determine the cash flow of an insurance contract, however, it is driven by endogenous variables such as people's economic, social, and subjective decisions. Therefore, a microscopic approach is required to identify and analyze the factors that determine the lapse rate; specifically, micro-level characteristics of policyholders, including individual, demographic, microeconomic, and household characteristics, are necessary for the analysis. In this study, we select panel survey data from the Korean Retirement Income Study (KReIS), with its many diverse dimensions, to determine which variables have a decisive effect on lapse, and we apply the lasso regularized regression model for the empirical analysis. As the data contain many missing values, they are imputed using the random forest method. Among the household variables, we find that the absence of elderly dependents, the presence of young dependents, and employed family members increase the surrender rate. Among the individual variables, divorce, non-urban residential areas, apartment-type housing, non-ownership of a home, and a bad relationship with siblings increase the lapse rate. Finally, among the financial variables, low income, low expenditure, having children who incur child-care expenditure, not expecting a bequest from a spouse, not holding public health insurance, and expecting to benefit from a retirement pension increase the lapse rate. Some of these findings are consistent with those in the literature.

Variance Estimation for Imputed Survey Data using Balanced Repeated Replication Method

  • Lee, Jun-Suk;Hong, Tae-Kyong;Namkung, Pyong
    • Communications for Statistical Applications and Methods / v.12 no.2 / pp.365-379 / 2005
  • Balanced repeated replication (BRR) is widely used to estimate the variance of linear or nonlinear estimators from complex sampling surveys. Most survey data sets include imputed missing values and treat the imputed values as observed data, but applying the standard BRR variance estimation formula to imputed data does not produce valid variance estimators. Shao, Chen and Chen (1998) proposed an adjusted BRR method that adjusts the imputed data to produce more accurate variance estimators. In this paper, another adjusted BRR method is proposed, with examples using real data.
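The standard BRR estimator that the adjusted methods build on can be sketched for the simplest design, two PSUs per stratum: each replicate keeps one PSU per stratum (implicitly doubling its weight), and the variance is the mean squared deviation of the replicate estimates from the full-sample estimate. The data and the ±1 design matrix below are illustrative only:

```python
def brr_variance(strata, estimator, hadamard):
    """Balanced repeated replication (sketch). strata: one (psu1, psu2)
    value pair per stratum. hadamard: rows of a +-1 design matrix, each row
    selecting one PSU per stratum for a half-sample replicate."""
    full = estimator([v for pair in strata for v in pair])
    reps = []
    for row in hadamard:
        half = [pair[0] if sign == 1 else pair[1]
                for pair, sign in zip(strata, row)]
        reps.append(estimator(half))
    return sum((r - full) ** 2 for r in reps) / len(reps)

mean = lambda xs: sum(xs) / len(xs)
strata = [(3.0, 5.0), (2.0, 6.0)]          # 2 strata, 2 PSUs each
H = [[1, 1], [1, -1], [-1, 1], [-1, -1]]   # balanced +-1 half-sample design
print(brr_variance(strata, mean, H))
```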

Prediction of Dissolved Oxygen in Jindong Bay Using Time Series Analysis (시계열 분석을 이용한 진동만의 용존산소량 예측)

  • Han, Myeong-Soo;Park, Sung-Eun;Choi, Youngjin;Kim, Youngmin;Hwang, Jae-Dong
    • Journal of the Korean Society of Marine Environment & Safety / v.26 no.4 / pp.382-391 / 2020
  • In this study, we used artificial intelligence algorithms to predict dissolved oxygen in Jindong Bay. Missing values in the observational data were filled using the Bidirectional Recurrent Imputation for Time Series (BRITS) deep learning algorithm, and the dissolved oxygen was then predicted with the Auto-Regressive Integrated Moving Average (ARIMA) model, a widely used time-series analysis method, and the Long Short-Term Memory (LSTM) deep learning method, whose accuracies we compared. BRITS determined the missing values with high accuracy in the surface layer, but its accuracy was low in the lower layers and unstable in the middle layer due to the experimental conditions. In the middle and bottom layers, the LSTM model showed higher accuracy than the ARIMA model, whereas the ARIMA model performed better in the surface layer.
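As a minimal illustration of the autoregressive side of the comparison, an AR(1) model, the simplest autoregressive member of the ARIMA family, can be fit by ordinary least squares on lagged pairs. The dissolved-oxygen figures are invented, and this sketch omits the integrated and moving-average components of a full ARIMA model:

```python
def fit_ar1(series):
    """Least-squares fit of an AR(1) model x_t = a + b * x_{t-1}."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

do = [8.0, 7.6, 7.3, 7.1, 7.0, 6.8]   # toy dissolved-oxygen series (mg/L)
a, b = fit_ar1(do)
next_val = a + b * do[-1]              # one-step-ahead forecast
print(round(next_val, 2))
```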