• Title/Summary/Keyword: Missing data


Analysis of Missing Data Using an Empirical Bayesian Method (경험적 베이지안 방법을 이용한 결측자료 연구)

  • Yoon, Yong Hwa;Choi, Boseung
    • The Korean Journal of Applied Statistics / v.27 no.6 / pp.1003-1016 / 2014
  • Proper imputation of missing data is an important step toward reliable analysis of survey data. This paper deals with both a model-based imputation method and a model estimation method. We used a Bayesian method to resolve the boundary solution problem that arises when maximum likelihood estimation is applied. We also address the problem of selecting a missing-data mechanism model by using forecasting results and comparing model accuracies. We used MWPE (modified within precinct error) (Bautista et al., 2007) to measure prediction accuracy. We applied the proposed ML and Bayesian methods to exit poll data from the 2012 Korean presidential election. In this analysis, the missing at random mechanism yielded better predictions than the missing not at random mechanism.

Incomplete data handling technique using decision trees (결정트리를 이용하는 불완전한 데이터 처리기법)

  • Lee, Jong Chan
    • Journal of the Korea Convergence Society / v.12 no.8 / pp.39-45 / 2021
  • This paper discusses how to handle incomplete data that include missing values. Processing a missing value optimally means obtaining, from the information contained in the training data, the estimate closest to the original value and replacing the missing value with it. To achieve this, we use the decision tree that a classifier builds while classifying information. Specifically, the tree is obtained by training the C4.5 classifier on only the complete records, those without missing values, among all training data. The nodes of this decision tree hold classification variable information; nodes closer to the root contain more information, and each leaf node forms a classification region through its path from the root. In addition, the average of the data events classified into each region is recorded for that region. Events that include a missing value are input to this decision tree, and the region closest to the event is found by traversing the tree according to the information at each node. The average value recorded in this region is taken as the estimate of the missing value, which completes the compensation process.

Development of a Machine Learning Model for Imputing Time Series Data with Massive Missing Values (결측치 비율이 높은 시계열 데이터 분석 및 예측을 위한 머신러닝 모델 구축)

  • Bangwon Ko;Yong Hee Han
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.17 no.3 / pp.176-182 / 2024
  • In this study, we compared and analyzed various methods of missing data handling to build a machine learning model that can effectively analyze and predict time series data with a high percentage of missing values. For this purpose, Predictive State Model Filtering (PSMF), MissForest, and Imputation By Feature Importance (IBFI) methods were applied, and their prediction performance was evaluated using LightGBM, XGBoost, and Explainable Boosting Machines (EBM) machine learning models. The results of the study showed that MissForest and IBFI performed the best among the methods for handling missing values, reflecting the nonlinear data patterns, and that XGBoost and EBM models performed better than LightGBM. This study emphasizes the importance of combining nonlinear imputation methods and machine learning models in the analysis and prediction of time series data with a high percentage of missing values, and provides a practical methodology.

Missing Data Imputation Using Permanent Traffic Counts on National Highways (일반국토 상시 교통량자료를 이용한 교통량 결측자료 추정)

  • Ha, Jeong-A;Park, Jae-Hwa;Kim, Seong-Hyeon
    • Journal of Korean Society of Transportation / v.25 no.1 s.94 / pp.121-132 / 2007
  • Until now, permanent traffic volumes on national highways have been counted by Automatic Vehicle Classification (AVC). When the counted data contain missing items or errors, they must be revised to remain statistically reliable. This study estimates corrected data using autoregression and seasonal AutoRegressive Integrated Moving Average (ARIMA) models. Verification of seasonal ARIMA showed that the longer the missing period, the greater the error. Autoregression gave better verification results than seasonal ARIMA. Traffic data are affected more by the present state than by past patterns. However, autoregression can be applied only when the data include similar neighborhood patterns, and even then the data cannot be corrected when they are missing because of low quality or errors. Therefore, such data should be corrected using past patterns and seasonal ARIMA when the missing data occur over short periods.

Filling in Water Temperature Data of Aquatic Environments using a Pre-constructed Relationship

  • Lee, Khil-Ha
    • Journal of Environmental Science International / v.26 no.10 / pp.1125-1133 / 2017
  • This study presents a method for filling in missing river water temperature data using a pre-constructed mathematical relationship between air and water temperatures. A regression between water temperatures at individual stations and ambient air temperatures at nearby weather stations can provide a practical way to represent missing water temperature data for an entire region. Air and water temperature data collected from two test sites (one coastal and one inland) were individually fitted to a nonlinear regression model. To account for seasonal hysteresis effects, separate functions were fitted to the data in the rising and falling limbs. A single-criterion, multi-parameter optimization technique was used to determine the optimal parameter sets. This method minimizes the differences between the time series of the measured and estimated data. The constructed air-water temperature relationship was subsequently applied to represent missing water temperature data. The RMSEs (MBEs) were in the range of $1.843-1.976^{\circ}C$ ($-0.329-0.201^{\circ}C$), and the coefficients of determination were in the range of 0.92-0.96. The results demonstrate that the water temperatures predicted with the regression equations were reasonably accurate.
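
For reference, the two error measures quoted in this abstract are commonly defined as shown below. This is the usual textbook form, not necessarily the exact formula used in the paper. The symbols are assumptions introduced here: $T_i^{\mathrm{obs}}$ and $T_i^{\mathrm{est}}$ denote the measured and estimated water temperatures at time step $i$, and $n$ is the number of time steps. The sign convention for MBE (estimated minus measured) is also an assumption.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(T_i^{\mathrm{est}} - T_i^{\mathrm{obs}}\right)^{2}},
\qquad
\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}\left(T_i^{\mathrm{est}} - T_i^{\mathrm{obs}}\right)
```

Under this convention, a negative MBE indicates that the fitted relationship underestimates water temperature on average.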

A Research for Imputation Method of Photovoltaic Power Missing Data to Apply Time Series Models (태양광 발전량 데이터의 시계열 모델 적용을 위한 결측치 보간 방법 연구)

  • Jeong, Ha-Young;Hong, Seok-Hoon;Jeon, Jae-Sung;Lim, Su-Chang;Kim, Jong-Chan;Park, Chul-Young
    • Journal of Korea Multimedia Society / v.24 no.9 / pp.1251-1260 / 2021
  • This paper discusses missing data processing using the simple moving average (SMA) and the Kalman filter, and compares the values predicted by the two methods. Time series analysis is a common approach to time series data in the photovoltaic field. A photovoltaic system records data irregularly, whenever the power value changes. Irregularly recorded data must be converted into a consistent format to obtain accurate results. Missing data arise in the process of converting the records to equal intervals. For this reason, the missing values were imputed using SMA and the Kalman filter. The Kalman filter fits the observed data better than SMA. The SMA graph is a stepped line, whereas the Kalman filter graph is a smooth line. The MAPE of the SMA prediction is 0.00737%, and the MAPE of the Kalman prediction is 0.00078%. However, the time complexity of SMA is $O(N)$, while that of the Kalman filter is $O(D^2)$ for a D-dimensional object. Accordingly, we suggest choosing the method with the available computational power in mind.
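
The two imputation approaches named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the series has already been resampled to equal intervals with gaps marked as `None`, and it uses a scalar random-walk Kalman filter as a stand-in for whatever state model the paper used. The function names, the `window` size, and the noise variances are all hypothetical choices.

```python
from typing import List, Optional

Series = List[Optional[float]]


def sma_impute(series: Series, window: int = 3) -> List[float]:
    """Fill each gap with the mean of the last `window` values (observed or already filled)."""
    filled: List[float] = []
    for value in series:
        if value is None:
            history = filled[-window:]
            if not history:
                raise ValueError("series must start with an observed value")
            value = sum(history) / len(history)
        filled.append(value)
    return filled


def kalman_impute(series: Series,
                  process_var: float = 1e-2,
                  measurement_var: float = 1e-1) -> List[float]:
    """Scalar random-walk Kalman filter; gaps are filled with the predicted state."""
    if not series or series[0] is None:
        raise ValueError("series must start with an observed value")
    estimate = series[0]
    error_var = measurement_var  # assumed initial uncertainty
    filled = [estimate]
    for value in series[1:]:
        # Predict step: the random-walk state is unchanged, its uncertainty grows.
        error_var += process_var
        if value is None:
            filled.append(estimate)  # no measurement, keep the prediction
            continue
        # Update step: blend the prediction with the new measurement.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (value - estimate)
        error_var *= (1.0 - gain)
        filled.append(value)  # observed values are kept as recorded
    return filled


def mape(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    power = [10.0, 12.0, None, 14.0, None]
    print(sma_impute(power, window=2))
    print(kalman_impute(power))
```

Both routines make a single pass over the series. A multi-dimensional Kalman filter would replace the scalar gain with matrix operations, which is where the higher per-step cost mentioned in the abstract comes from.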

Bridge Health Monitoring with Consideration of Environmental Effects

  • Kim, Yuhee;Kim, Hyunsoo;Shin, Soobong;Park, Jong-Chil
    • Journal of the Korean Society for Nondestructive Testing / v.32 no.6 / pp.648-660 / 2012
  • Reliable response measurements are extremely important for proper bridge health monitoring, but incomplete and unreliable data may be acquired because of sensor problems and environmental effects. When a sensor malfunctions, parts of the measured data can be missing, so the structural health condition cannot be monitored reliably. In addition, environmental effects such as temperature variations can change dynamic characteristics such as natural frequencies as if the structure were damaged. To overcome these problems, this paper proposes a systematic data analysis procedure that recovers missing data and eliminates environmental effects from the measured data. It also proposes a health index, calculated statistically from the revised data, to evaluate the health condition of a bridge. The proposed method was examined using numerically simulated data for a truss structure and then applied to a set of field data measured on a cable-stayed bridge.

Developing a Method to Define Mountain Search Priority Areas Based on Behavioral Characteristics of Missing Persons

  • Yoo, Ho Jin;Lee, Jiyeong
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.37 no.5 / pp.293-302 / 2019
  • In mountain accidents, it is important for the search team commander to determine the search area in order to secure the Golden Time. Within this period, assistance and treatment for the individual concerned will most likely prevent further injury and harm. This paper proposes a method for determining search priority areas based on missing persons' behavior and statistics on missing-person incidents. GIS (Geographic Information System) and MCDM (Multi Criteria Decision Making) are integrated by applying the WLC (Weighted Linear Combination) technique. Missing persons were classified into five types, and their behavioral characteristics were analyzed to extract seven geographic analysis factors. Next, index values were set for each missing person type and factor according to the behavioral characteristics, and the raster layers generated by multiplying each factor by its weight were superimposed to define models for selecting search priority areas. Each weight is calculated with the AHP (Analytical Hierarchy Process) from pairwise comparisons provided by search operation experts. Finally, the model was applied to a missing-person case through a virtual scenario, the priority area was selected, and the behavioral and topographical characteristics of the missing persons were compared with the selected area. Mountain rescue experts judged the results 'appropriate' in terms of the behavior analysis, analysis factor extraction, experimental process, and results for the missing persons.
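
The weighting and overlay steps described in this abstract can be sketched as follows. This is a sketch under stated assumptions, not the paper's model: AHP weights are approximated with the row geometric-mean method (the paper does not say which AHP calculation it used), rasters are plain nested lists of index values, and the function names and the three example factors are hypothetical.

```python
import math
from typing import List

Raster = List[List[float]]


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def weighted_linear_combination(layers: List[Raster], weights: List[float]) -> Raster:
    """Cell-by-cell weighted sum of equally sized raster layers."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for w, layer in zip(weights, layers)) for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    # Hypothetical expert judgments for three factors, e.g. trail distance,
    # slope, and water proximity: entry [i][j] says how much more important
    # factor i is than factor j.
    pairwise = [
        [1.0, 3.0, 5.0],
        [1.0 / 3.0, 1.0, 3.0],
        [1.0 / 5.0, 1.0 / 3.0, 1.0],
    ]
    weights = ahp_weights(pairwise)

    # Hypothetical 2x2 index rasters, one per factor (higher = more likely).
    layers = [
        [[0.9, 0.2], [0.4, 0.1]],
        [[0.5, 0.5], [0.8, 0.3]],
        [[0.1, 0.7], [0.6, 0.2]],
    ]
    priority = weighted_linear_combination(layers, weights)
    print(weights)
    print(priority)
```

Cells with the highest combined score would be searched first. A fuller AHP treatment would also check the consistency ratio of the pairwise matrix before trusting the weights.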

The development of statistical methods for retrieving MODIS missing data: Mean bias, regressions analysis and local variation method (MODIS 손실 자료 복원을 위한 통계적 방법 개발: 평균 편차 방법, 회귀 분석 방법과 지역 변동 방법)

  • Kim, Min Wook;Yi, Jonghyuk;Park, Yeon Gu;Song, Junghyun
    • Journal of Satellite, Information and Communications / v.11 no.4 / pp.94-101 / 2016
  • Satellite data used in remote sensing have limitations: especially with visible-range sensors, clouds and other environmental factors cause missing data. In this study, using land surface temperature data from the MODerate resolution Imaging Spectro-radiometer (MODIS), we developed three methods for retrieving missing satellite data: the mean bias, regression analysis, and local variation methods. These methods use the previous day's data as reference data. To validate them, we selected a specific measurement ratio and used artificial missing data from 2014 to 2015. The local variation method showed low accuracy, with a root mean square error (RMSE) of more than 2 K in some cases, whereas the regression analysis method gave reliable results in most cases, with small RMSE values of approximately 1.13 K. The RMSE of the mean bias method, approximately 1.32 K, was similar to that of the regression analysis method.
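
One plausible reading of the mean bias and regression analysis methods is sketched below. The abstract states only that the previous day's data serve as the reference, so the exact formulas here are assumptions: the mean bias fill adds the average today-minus-reference difference to the reference value, and the regression fill applies an ordinary least-squares line fitted on pixels valid in both images. Function names are hypothetical, and images are flattened to lists with `None` for missing pixels.

```python
from typing import List, Optional, Tuple

Pixels = List[Optional[float]]


def _valid_pairs(today: Pixels, reference: Pixels) -> List[Tuple[float, float]]:
    """Pairs (today, reference) for pixels observed in both images."""
    return [(t, r) for t, r in zip(today, reference) if t is not None and r is not None]


def mean_bias_fill(today: Pixels, reference: Pixels) -> Pixels:
    """Fill gaps with the reference value shifted by the mean today-minus-reference bias."""
    pairs = _valid_pairs(today, reference)
    bias = sum(t - r for t, r in pairs) / len(pairs)
    return [
        t if t is not None else (None if r is None else r + bias)
        for t, r in zip(today, reference)
    ]


def regression_fill(today: Pixels, reference: Pixels) -> Pixels:
    """Fill gaps with a least-squares line today = slope * reference + intercept."""
    pairs = _valid_pairs(today, reference)
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    # Assumes the valid reference pixels are not all identical (non-zero variance).
    slope = (sum((r - mean_r) * (t - mean_t) for t, r in pairs)
             / sum((r - mean_r) ** 2 for _, r in pairs))
    intercept = mean_t - slope * mean_r
    return [
        t if t is not None else (None if r is None else slope * r + intercept)
        for t, r in zip(today, reference)
    ]


if __name__ == "__main__":
    # Hypothetical land surface temperatures in kelvin; the third pixel is cloud-covered today.
    previous_day = [280.0, 282.0, 284.0, 286.0]
    today = [281.0, 283.0, None, 287.0]
    print(mean_bias_fill(today, previous_day))
    print(regression_fill(today, previous_day))
```

Pixels missing in both images stay `None`, since neither method has a reference value to work from.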

Proposal to Supplement the Missing Values of Air Pollution Levels in Meteorological Dataset (기상 데이터에서 대기 오염도 요소의 결측치 보완 기법 제안)

  • Jo, Dong-Chol;Hahn, Hee-Il
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.21 no.1 / pp.181-187 / 2021
  • Recently, various air pollution factors have been measured and analyzed to reduce the damage that air pollution causes. In this process, many missing values occur for various reasons. Compensating for them ordinarily requires a vast amount of training data. This paper proposes a statistical technique that effectively compensates for missing values arising in the measurement of ozone, carbon dioxide, and ultra-fine dust using only a small amount of training data. The proposed algorithm first extracts a group of meteorological variables expected to help correct the missing values, using statistical information such as the correlation between meteorological data and air pollution factors and the associated p-values. It then compensates for the missing values by analyzing this group. To confirm the performance of the proposed algorithm, we analyze its characteristics through various experiments and compare it with well-known representative algorithms.