• Title/Summary/Keyword: Missing data

A Study on the cleansing of water data using LSTM algorithm (LSTM 알고리즘을 이용한 수도데이터 정제기법)

  • Yoo, Gi Hyun;Kim, Jong Rib;Shin, Gang Wook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.10a / pp.501-503 / 2017
  • In the water sector, data such as flow rate, pressure, water quality, and water level are collected throughout the water purification plants and piping systems. The collected data are stored in each water treatment plant's DB, combined in the regional DB, and finally stored in the database server at the head office of the Korea Water Resources Corporation. Abnormal data can arise when an instrument takes a measurement or when data are transmitted between these stages, and they can be classified into missing data and wrong data. Because the cause of each type of abnormal data is different, the methods for detecting wrong data and missing data differ, but the method for cleansing the data is the same. This study examines a program that automatically cleanses missing or wrong data by applying the deep learning LSTM (Long Short-Term Memory) algorithm.

A Study on the Index Estimation of Missing Real Estate Transaction Cases Using Machine Learning (머신러닝을 활용한 결측 부동산 매매 지수의 추정에 대한 연구)

  • Kim, Kyung-Min;Kim, Kyuseok;Nam, Daisik
    • Journal of the Economic Geographical Society of Korea / v.25 no.1 / pp.171-181 / 2022
  • The real estate price index plays a key role as quantitative data in real estate market analysis. International organizations including the OECD publish real estate price indexes by country, and the Korea Real Estate Board announces metropolitan-level and municipal-level indexes. However, when the index is computed for a spatial unit smaller than the metropolitan or municipal level, missing values become a problem. As the spatial scope narrows, some unit periods have few or no transactions, which makes index calculation difficult or even impossible. This study proposes a supervised machine learning model to compensate for missing values that arise when no transaction occurs in a specific area and period. The accuracy of the proposed models is verified for predicting both existing values and missing values.

Compressive sensing-based two-dimensional scattering-center extraction for incomplete RCS data

  • Bae, Ji-Hoon;Kim, Kyung-Tae
    • ETRI Journal / v.42 no.6 / pp.815-826 / 2020
  • We propose a two-dimensional (2D) scattering-center-extraction (SCE) method using sparse recovery based on compressive-sensing theory, even with data missing from the received radar cross-section (RCS) dataset. First, using the proposed method, we generate a 2D grid via adaptive discretization that is considerably smaller than a fully sampled fine grid. Subsequently, the coarse estimation of 2D scattering centers is performed using both the method of iteratively reweighted least squares and a general peak-finding algorithm. Finally, the fine estimation of 2D scattering centers is performed using the orthogonal matching pursuit (OMP) procedure from an adaptively sampled Fourier dictionary. The measured RCS data, as well as simulation data using the point-scatterer model, are used to evaluate the 2D SCE accuracy of the proposed method. The results indicate that the proposed method can achieve higher SCE accuracy for an incomplete RCS dataset with missing data than that achieved by the conventional OMP, basis pursuit, smoothed L0, and existing discrete spectral estimation techniques.
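
For readers unfamiliar with OMP, the following is a minimal sketch of plain orthogonal matching pursuit on a Fourier dictionary with missing samples. It is not the authors' method: it is a simplified 1D case on a fixed (not adaptively sampled) grid, and the sizes, scatterer positions, amplitudes, and the function name `omp` are all hypothetical values chosen for illustration.

```python
import numpy as np

def omp(dictionary, y, n_atoms):
    """Greedy orthogonal matching pursuit: select n_atoms columns that explain y."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(0, dtype=complex)
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(dictionary.conj().T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms jointly by least squares.
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, y, rcond=None)
        residual = y - sub @ coeffs
    return support, coeffs

# Hypothetical 1D setup: 64 frequency samples, of which 8 are missing.
n_samples, n_grid = 64, 64
rng = np.random.default_rng(0)
kept = np.sort(rng.choice(n_samples, size=56, replace=False))

n = np.arange(n_samples)[:, None]
k = np.arange(n_grid)[None, :]
full_dictionary = np.exp(-2j * np.pi * n * k / n_grid)  # one Fourier atom per grid cell

# Two point scatterers (assumed positions and amplitudes).
true_positions, true_amplitudes = [5, 20], [1.0, 0.6]
signal = full_dictionary[:, true_positions] @ np.array(true_amplitudes)

# Recover the scatterers from the incomplete data only.
support, coeffs = omp(full_dictionary[kept], signal[kept], n_atoms=2)
```

In this noise-free toy case the recovered `support` holds the two grid positions and `coeffs` their amplitudes. With many more missing samples, or scatterers that lie off the grid, plain OMP can pick wrong atoms, which is the kind of limitation the coarse-to-fine procedure in the abstract is meant to address.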

The Comparison of Estimation Methods for the Missing Rainfall Data with spatio-temporal Variability (시공간적 변동성을 고려한 강우의 결측치 추정 방법의 비교)

  • Kim, Byung-Sik;Noh, Hui-Seong;Kim, Hung-Soo
    • Journal of Wetlands Research / v.13 no.2 / pp.189-197 / 2011
  • This paper reviews the application of data-driven and distance-weighted methods (IDWM, IEWM, CCWM, ANN) and a radar-data method to the estimation of missing rainfall data. To evaluate these methods, statistics were compared using radar and station rainfall data from the Imjin River basin. The RMSE values calculated for CCWM and ANN ranged from 1.4 to 1.79 mm, and the RMSE values for data estimated from radar rainfall data ranged from 0.05 to 2.26 mm. Radar rainfall data are considered to capture spatial characteristics better than station rainfall data. The results suggest that estimates based on radar data can improve the estimation of missing rainfall data.
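
As an illustration of the distance-weighted family of methods, here is a minimal sketch of inverse distance weighting, which is assumed to be what IDWM denotes. The station coordinates, rainfall values, the exponent `power=2.0`, and the function names are hypothetical and do not come from the paper.

```python
import math

def idw_estimate(target_xy, stations, power=2.0):
    """Estimate a missing value as the inverse-distance-weighted mean of neighbours.

    stations: list of ((x, y), rainfall_mm) for gauges that did report.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), rainfall in stations:
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0.0:
            return rainfall  # co-located gauge: use its reading directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * rainfall
        weight_total += weight
    return weighted_sum / weight_total

def rmse(estimates, observations):
    """Root-mean-square error between estimated and observed values."""
    squared = [(e - o) ** 2 for e, o in zip(estimates, observations)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical gauges at distances 5 and 10 from the station with the missing record.
neighbours = [((3.0, 4.0), 10.0), ((0.0, 10.0), 20.0)]
estimate = idw_estimate((0.0, 0.0), neighbours)  # weights 1/25 and 1/100 give 12.0 mm
```

The nearer gauge dominates the estimate because its weight is four times larger. Comparing such estimates with withheld observations through `rmse` mirrors the kind of evaluation the abstract reports.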

Fuzzy Logic Modeling and Its Application to A Walking-Beam Reheating Furnace

  • Zhang, Bin;Wang, Jing-Cheng
    • International Journal of Fuzzy Logic and Intelligent Systems / v.7 no.3 / pp.182-187 / 2007
  • A fuzzy modeling method is proposed to build the dynamic model of a walking-beam reheating furnace from recorded data. In the proposed method, the number of membership functions on each variable is increased individually and the modeling accuracy is evaluated iteratively. When the modeling accuracy is satisfactory, the membership functions on each variable are fixed and the structure of the fuzzy model is determined. Because the training data are limited, as the number of membership functions increases in this process, it is highly possible that some rules are missing, i.e., no data in the training set correspond to the consequent part of a missing rule. To complete the rulebase, the output of the model constructed at the previous step is used to generate the consequent part of the missing rules. Finally, in the real-time application, a rolling update scheme for the rulebase is introduced to compensate for changes in system dynamics and to fine-tune the rulebase. The proposed method is verified by applying it to the modeling of a reheating furnace.

A longitudinal study for child aggression with Korea Welfare Panel Study data (한국복지패널 자료를 이용한 아동기 공격성에 대한 경시적 자료 분석)

  • Choi, Nayeon;Huh, Jib
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1439-1447 / 2014
  • Most of the literature on Korean child aggression is based on cross-sectional data sets. Although there is a related study with a longitudinal data set, it assumes that the repeated measurements in the longitudinal data are mutually independent. A longitudinal data analysis of Korean child aggression is therefore necessary. This study analyzes the effect of child development outcomes, including academic achievement, self-esteem, depression and anxiety, delinquency, victimization by peers, abuse by parents, and internet use time, on child aggression, using Korea Welfare Panel Study data observed three times between 2006 and 2012. Since the Korea Welfare Panel Study data have missing values, missing at random is assumed. The linear mixed effect model and restricted maximum likelihood estimation are considered.

Assessment of Reference Evapotranspiration Equations for Missing and Estimated Weather Data (기상자료의 결측과 산정에 따른 기준작물 증발산량 공식의 비교 평가)

  • Yoon, Pu Reun;Choi, Jin-Yong
    • Journal of The Korean Society of Agricultural Engineers / v.60 no.3 / pp.15-25 / 2018
  • Estimating the reference evapotranspiration is an important factor to consider in irrigation system design and agricultural water use. However, the FAO Penman-Monteith (FAO P-M) equation is limited in use because it requires a variety of meteorological data. The purpose of this study is to compare three reference evapotranspiration (ETo) equations when meteorological data are missing, for 11 study weather stations. First, the FAO P-M equation is used to estimate the reference evapotranspiration with the actual solar radiation data $R_n$ and the actual vapor pressure $e_a$. Then, for the cases where $R_n$ and $e_a$ are missing, the reference evapotranspiration is calculated with the FAO P-M, Priestley-Taylor (P-T), and Hargreaves (HG) equations using other meteorological factors. Second, MAE, RMSE, and $R^2$ are calculated to compare the ETo values from the equations. When $e_a$ data are missing, ETo from the Hargreaves equation in coastal areas and from the Priestley-Taylor equation in inland areas showed relatively high correlation with FAO P-M. When $R_n$ data are missing, or when both $e_a$ and $R_n$ are missing, the $R^2$ value of the Priestley-Taylor equation was highest in coastal areas, and the $R^2$ values of the Hargreaves equation were high for 7 inland areas. The sensitivity analysis showed that net radiation was the most sensitive factor for the P-T and HG equations; for FAO P-M, the most sensitive factor was net radiation, followed by relative humidity, air temperature, and wind speed. Therefore, the reference evapotranspiration equation should be selected according to the conditions, considering proximity to the coast, the type of missing weather data, the correlation, and the magnitude of error.
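
To show why the Hargreaves equation is attractive when radiation and vapor pressure data are missing, here is a minimal sketch of its standard temperature-based form as given in the FAO-56 guidelines. Note that the radiation input here is the extraterrestrial radiation, which can be computed from latitude and date, not the $R_n$ of the abstract. The input values and function names are hypothetical, and the paper may use a different variant.

```python
import math

def hargreaves_eto(t_max, t_min, ra_mj):
    """Reference evapotranspiration (mm/day) from the Hargreaves equation.

    t_max, t_min: daily maximum and minimum air temperature in deg C.
    ra_mj: extraterrestrial radiation in MJ m^-2 day^-1 (assumed to be known).
    """
    t_mean = (t_max + t_min) / 2.0
    ra_mm = 0.408 * ra_mj  # convert MJ m^-2 day^-1 to equivalent mm/day of evaporation
    return 0.0023 * (t_mean + 17.8) * math.sqrt(t_max - t_min) * ra_mm

def rmse(estimated, reference):
    """Root-mean-square error, one of the comparison metrics named in the abstract."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical summer day: 30 deg C max, 20 deg C min, Ra = 40 MJ m^-2 day^-1.
eto = hargreaves_eto(30.0, 20.0, 40.0)  # about 5.08 mm/day
```

Only temperature and a radiation term that does not need to be measured are required, so the equation remains usable when $R_n$ and $e_a$ records are missing. A series of such estimates could then be compared against FAO P-M values with `rmse`.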

Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique (결측 데이터 보정법에 의한 의사 데이터로 조정된 예측 최적화 방법)

  • Kim, Jeong-Woo
    • Journal of the Korea Academia-Industrial cooperation Society / v.20 no.2 / pp.200-209 / 2019
  • When forecasting future values, a model estimated by minimizing training errors can yield test errors higher than the training errors. This is the over-fitting problem, caused by an increase in model complexity when the model is focused only on a given dataset. Some regularization and resampling methods have been introduced to reduce test errors by alleviating this problem, but they have been designed for use with only a given dataset. In this paper, we propose a new optimization approach that reduces test errors by transforming a test error minimization problem into a training error minimization problem. To carry out this transformation, we needed additional data for the given dataset, termed pseudo data. To make proper use of pseudo data, we used three types of missing data imputation techniques. As an optimization tool, we chose the least squares method and combined it with an extra pseudo data instance. Furthermore, we present numerical results supporting our proposed approach, which yielded lower test errors than the ordinary least squares method.

Long-gap Filling Method for the Coastal Monitoring Data (해양모니터링 자료의 장기결측 보충 기법)

  • Cho, Hong-Yeon;Lee, Gi-Seop;Lee, Uk-Jae
    • Journal of Korean Society of Coastal and Ocean Engineers / v.33 no.6 / pp.333-344 / 2021
  • A technique is developed for filling the long gaps that occur frequently in ocean monitoring data. The method estimates the unknown values in a long gap as the sum of the estimated trend and selected residual components for the missing interval. The method was used to impute data over long missing intervals of about 1 month, such as temperature and water temperature in the Ulleungdo ocean buoy data. The imputed data differed depending on the monitoring parameter, but the variation pattern was reproduced appropriately. Although the method introduces bias and variance errors through the estimation of the trend and residual components, the bias error in estimating statistical measures caused by long-term missing data is greatly reduced. The mean and the 90% confidence interval of the gap-filling model's RMS errors are 0.93 and 0.35~1.95, respectively.
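
The "trend plus residual" idea can be sketched as follows. This is only a rough illustration under strong assumptions, not the authors' estimator: the trend is taken as a straight line between the mean levels on either side of the gap, and the residuals are borrowed from the observed segment immediately before the gap. The function name `fill_long_gap`, the `window` parameter, and the synthetic series are hypothetical.

```python
import numpy as np

def fill_long_gap(values, window=24):
    """Fill one contiguous NaN gap with a linear trend plus borrowed residuals."""
    values = np.asarray(values, dtype=float)
    missing = np.flatnonzero(np.isnan(values))
    if missing.size == 0:
        return values.copy()
    start, stop = missing[0], missing[-1] + 1  # the gap is values[start:stop]
    gap_len = stop - start
    if start < gap_len or stop >= values.size:
        raise ValueError("need observed data before (>= gap length) and after the gap")

    # Trend: straight line between the mean levels just before and just after the gap.
    level_before = values[max(0, start - window):start].mean()
    level_after = values[stop:stop + window].mean()
    trend = np.linspace(level_before, level_after, gap_len + 2)[1:-1]

    # Residuals: deviations of the preceding observed segment from its own mean.
    donor = values[start - gap_len:start]
    residual = donor - donor.mean()

    filled = values.copy()
    filled[start:stop] = trend + residual
    return filled

# Synthetic hourly series with a slow drift and a daily cycle, then a 30-step gap.
t = np.arange(200)
series = 10.0 + 0.01 * t + np.sin(2 * np.pi * t / 24)
series[100:130] = np.nan
filled = fill_long_gap(series)
```

Borrowing residuals from a neighbouring segment keeps the short-term variability that pure interpolation would smooth away, at the cost of possible phase mismatch. That trade-off is consistent with the bias and variance errors the abstract mentions.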

Genetic Algorithm Based Attribute Value Taxonomy Generation for Learning Classifiers with Missing Data (유전자 알고리즘 기반의 불완전 데이터 학습을 위한 속성값계층구조의 생성)

  • Joo Jin-U;Yang Ji-Hoon
    • The KIPS Transactions:PartB / v.13B no.2 s.105 / pp.133-138 / 2006
  • Learning with Attribute Value Taxonomies (AVT) has shown that it is possible to construct accurate, compact, and robust classifiers from a partially missing dataset (a dataset that contains attribute values specified with different levels of precision). Yet, in many cases AVTs are generated by experts or people with specialized knowledge in their domain. Unfortunately, these user-provided AVTs can be time-consuming to construct and can be misguided during the AVT building process. Moreover, experts are occasionally unavailable to provide an AVT for a particular domain. Against this background, this paper introduces an AVT generating method called GA-AVT-Learner, which finds a near-optimal AVT for a given training dataset using a genetic algorithm. We conducted experiments generating AVTs through GA-AVT-Learner with a variety of real-world datasets and compared these AVTs with other types of AVTs, such as HAC-AVTs and user-provided AVTs. The experiments show that GA-AVT-Learner provides AVTs that yield more accurate and compact classifiers and improve performance in learning from missing data.