• Title/Summary/Keyword: missing data (데이터 결측)

Search results: 134

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • Jun, Sung-Hae;Park, Jung-Eun;Oh, Kyung-Whan
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.6 / pp.789-792 / 2004
  • Missing values of many kinds arise in fields such as web mining, bioinformatics, and statistical data analysis, and they make training data sparse. Missing values are usually replaced by predictions based on the mean or the mode. More advanced imputation methods, such as the conditional mean, tree methods, and the Markov chain Monte Carlo algorithm, can also be used. However, the predictive accuracy of general imputation models decreases as the ratio of missing values in the training data increases. Moreover, the number of available imputations becomes limited as the missing ratio grows. To address this problem, we propose using statistical learning theory to preprocess missing values, specifically the support vector regression of Vapnik. The proposed method can be applied to sparse training data. We verified the performance of our model using data sets from the UCI Machine Learning Repository.
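
The mean and mode replacement that the abstract describes as the usual baseline can be sketched as follows. This is a minimal illustration, not the authors' support vector regression method; the function name `impute_column`, the use of `None` to mark a missing entry, and the sample columns are all assumptions made for the example.

```python
from statistics import mean, mode


def impute_column(values, strategy="mean"):
    """Fill None entries with the mean or mode of the observed entries.

    Hypothetical helper: 'mean' suits numeric columns, 'mode' suits
    categorical ones. None marks a missing value (an assumption here).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


# Made-up sample columns with missing entries.
ages = [23, None, 31, 27, None]
colors = ["red", "blue", None, "red"]

imputed_ages = impute_column(ages, "mean")
imputed_colors = impute_column(colors, "mode")
```

Every missing entry in a column receives the same fill value, so the variance of the column shrinks as the missing ratio grows, which is one reason the abstract looks for a regression-based alternative.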

The Study for Estimating Traffic Volumes on Urban Roads Using Spatial Statistic and Navigation Data (공간통계기법과 내비게이션 자료를 활용한 도시부 도로 교통량 추정연구)

  • HONG, Dahee;KIM, Jinho;JANG, Doogik;LEE, Taewoo
    • Journal of Korean Society of Transportation / v.35 no.3 / pp.220-233 / 2017
  • Traffic volumes are fundamental data widely used in traffic analyses such as origin-destination establishment, calculation of total kilometers traveled, and congestion evaluation. In practice, the small number of links on which traffic-volume data are collected in a large urban highway network has weakened the quality of these analyses. This study proposes a method for estimating the traffic volume on a highway link where no collection device is available, applying a spatial statistical technique to (1) traffic-volume data from TOPIS and the National Transport Information Center of the Ministry of Land, Infrastructure and Transport, and (2) navigation data from a private navigation service. Because interrupted and uninterrupted flows have different traffic-flow characteristics, two component models were prepared for them, respectively: a piecewise constant function and regression kriging. A comparison of the traffic volumes estimated by the proposed method with those counted in the field showed an error level of 6.26% in MAPE and 5,410 in RMSE, and a prediction error of 20.3% in MAPE.
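
The two error measures reported in the abstract, MAPE and RMSE, can be computed as in the sketch below. The function names and the sample volumes are invented for illustration and are unrelated to the study's data.

```python
import math


def mape(observed, estimated):
    """Mean absolute percentage error, in percent.

    Observed values must be nonzero, because each error is divided by them.
    """
    ratios = [abs((o - e) / o) for o, e in zip(observed, estimated)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(observed, estimated):
    """Root mean square error, in the same unit as the data."""
    squared = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up link volumes (vehicles per day): field counts vs. model estimates.
counted = [1000.0, 2000.0, 4000.0]
estimated = [1100.0, 1800.0, 4000.0]

mape_value = mape(counted, estimated)
rmse_value = rmse(counted, estimated)
```

MAPE is relative to each observed volume, while RMSE is in vehicles and is dominated by high-volume links, which is why the two are usually reported together.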

Assessment of Missing Data Estimation with Rain Radar (강우레이더를 활용한 강수량 결측 보정에 관한 연구)

  • Kim, Tae Hyung;Lee, Jong-Hyeon;Lee, Yeong-Gon;Jang, Seung-Yeong;Choe, Gyu-Hyeon
    • Proceedings of the Korea Water Resources Association Conference / 2018.05a / pp.310-310 / 2018
  • Precipitation is generally measured by various authorities. Among them, MOLIT conducts hydrological surveys for water resource management, including flood and low-flow forecasting, drought countermeasures, and streamflow management. There are 424 observatories in total, and their precipitation measurements are collected and quality-assured at 10-minute intervals. Missing measurements can be adjusted or estimated from nearby observatories and radar reflectivity when the total amount of precipitation is available. The objective of this study is therefore to suggest a method for estimating missing data with rain radar reflectivity. To validate the suggested method, data from 50 observatories were obtained, and its efficiency was analyzed by comparing estimated and observed precipitation. The results show that the suggested method is reliable and can be used as a method for quality assurance.

A Study of the Method for Estimating the Missing Data from Weather Measurement Instruments (인공신경망을 이용한 기상관측장비 결측 보완 기술에 관한 연구)

  • Min, Jae-Sik;Lee, Moo-Hun;Jee, Joon-Bum;Jang, Min
    • Journal of Digital Convergence / v.14 no.8 / pp.245-252 / 2016
  • The purpose of this study is to compensate for missing weather information from ASOS and AWS using artificial neural networks. We collected temperature, relative humidity, and wind velocity for August over five years (2011-2015) and designed sample artificial neural networks under the assumption that data from the Seoul weather station were missing. A sensitivity study on the number of epochs showed that early stopping occurred at 2,000 epochs. The correlation between observation and prediction was higher than 0.6 overall, and higher than 0.9 and 0.8 for temperature and humidity, respectively. As the number of epochs increased, RMSE decreased gradually and training time increased exponentially. The predictability at 40 epochs reached more than 80% of the improvement obtained by the time of early stopping. With a learning time within 2 seconds, the approach is expected to fill missing values rapidly and thus make more detailed weather information usable.
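
Early stopping, which the abstract reports at 2,000 epochs, can be illustrated with a generic patience-based rule. The abstract does not state the criterion the authors used, so the rule, the function name `early_stopping_epoch`, its `patience` and `min_delta` parameters, and the loss values below are all assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by more than min_delta for `patience`
    consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best loss: reset the patience counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Made-up validation losses; the drop at epoch 7 is never reached.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch = early_stopping_epoch(losses, patience=3)
```

A larger `patience` trades extra training time for a lower risk of stopping before a late improvement, which matches the abstract's observation that most of the gain arrives long before the stopping point.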

A comparison of imputation methods using nonlinear models (비선형 모델을 이용한 결측 대체 방법 비교)

  • Kim, Hyein;Song, Juwon
    • The Korean Journal of Applied Statistics / v.32 no.4 / pp.543-559 / 2019
  • Data often include missing values for various reasons. If the missing data mechanism is not MCAR, an analysis based only on fully observed cases may cause estimation bias and decrease the precision of the estimates, since partially observed cases are excluded. Missing values cause more serious problems when data include many variables. Many imputation techniques have been suggested to overcome this difficulty. However, imputation methods using parametric models may not fit real data that do not satisfy the model assumptions. In this study, we review imputation methods using nonlinear models, such as kernel, resampling, and spline methods, which are robust to model assumptions. In addition, we suggest utilizing imputation classes to improve imputation accuracy, or adding random errors to correctly estimate the variance of the estimates in nonlinear imputation models. The performance of imputation methods using nonlinear models is compared under various simulated data settings. Simulation results indicate that the performance of the imputation methods differs as data settings change. However, imputation based on kernel regression or the penalized spline performs better in most situations. Utilizing imputation classes or adding random errors improves the performance of imputation methods using nonlinear models.
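
Kernel regression imputation with optional random errors can be sketched as below. The sketch uses a Nadaraya-Watson estimator with a Gaussian kernel and draws the random error from the observed residuals; both choices, along with the function name, the `bandwidth` parameter, and the sample data, are assumptions for illustration and may differ from the procedures compared in the paper.

```python
import math
import random


def kernel_regression_impute(x, y, bandwidth=1.0, add_noise=False, rng=None):
    """Fill None entries of y with a Nadaraya-Watson estimate from x.

    With add_noise=True, a residual drawn at random from the observed
    cases is added to each imputed value so that imputed data are not
    artificially smooth. A very small bandwidth can make every weight
    underflow to zero for a point far from all observed cases.
    """
    rng = rng or random.Random(0)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    if not observed:
        raise ValueError("no observed cases to fit the kernel regression")

    def predict(x0):
        # Gaussian kernel weights centered on the target point x0.
        weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2)
                   for xi, _ in observed]
        weighted = sum(w * yi for w, (_, yi) in zip(weights, observed))
        return weighted / sum(weights)

    residuals = [yi - predict(xi) for xi, yi in observed]
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            value = predict(xi)
            if add_noise:
                value += rng.choice(residuals)
            filled.append(value)
        else:
            filled.append(yi)
    return filled


# Made-up data: y is missing at x = 2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 2.0, None, 6.0, 8.0]

smooth = kernel_regression_impute(x, y, bandwidth=1.0)
noisy = kernel_regression_impute(x, y, bandwidth=1.0, add_noise=True)
```

The deterministic fill sits on the fitted curve, so analyses of the completed data tend to understate variance; adding a residual is one simple way to restore that spread.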

Determination of the Optimal Aggregation Interval Size of Individual Vehicle Travel Times Collected by DSRC in Interrupted Traffic Flow Section of National Highway (국도 단속류 구간에서 DSRC를 활용하여 수집한 개별차량 통행시간의 최적 수집 간격 결정 연구)

  • PARK, Hyunsuk;KIM, Youngchan
    • Journal of Korean Society of Transportation / v.35 no.1 / pp.63-78 / 2017
  • The purpose of this study is to determine the optimal aggregation interval that increases reliability when estimating a representative value of individual vehicle travel times collected by DSRC equipment in an interrupted traffic flow section of a national highway. To this end, we use data following a bimodal asymmetric distribution, which is the most representative distribution of individual vehicle travel times collected in interrupted traffic flow sections, estimate the MSE (mean square error) as the aggregation interval varies, and determine the optimal aggregation interval. The MSE is estimated with the maximum estimation error equation of the t-distribution, which can be used for asymmetric distributions. The analysis considered only aggregation intervals of 3 minutes or more, leaving out intervals of 1-2 minutes, at which data collection is routinely lost because of signal stops in the interrupted traffic flow section. An aggregation interval that produces gaps in data collection introduces additional error in the missing data correction process and was therefore excluded. As a result, the optimal aggregation interval for the minimum MSE was 3-5 minutes. Considering both the efficiency of system operation and the improved reliability of the travel time calculation, it is effective to keep the usual basic aggregation interval of 5 minutes and to reduce it to 3 minutes under congestion.

Dataset Augmentation Technique for Crack Detection of Wood Building (목조건물 크랙 감지를 위한 데이터셋 증강 기법)

  • Kim, Beom-Jun;Kim, Inki;Lim, Hyunseok;Gwak, Jeonghwan
    • Proceedings of the Korean Society of Computer Information Conference / 2021.07a / pp.645-647 / 2021
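  • This paper proposes a technique that augments a data set by moving only the cracks in images of wooden buildings. To create training data for crack detection in images, the technique does not perform data augmentation through transformations of the whole image such as flip, rotation, shift, and rescale; instead, it generates new data using only a single object, the crack. Because the object is manipulated only within the region of interest, more data can be obtained than with existing methods, and because the crack does not move outside the region of interest, the resulting data contain no outliers or missing values. In addition, new data can be created by arbitrarily generating cracks in images that contain none. In conclusion, this paper proposes a data augmentation technique for crack detection training that outperforms existing methods.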

Missing Value Estimation and Sensor Fault Identification using Multivariate Statistical Analysis (다변량 통계 분석을 이용한 결측 데이터의 예측과 센서이상 확인)

  • Lee, Changkyu;Lee, In-Beum
    • Korean Chemical Engineering Research / v.45 no.1 / pp.87-92 / 2007
  • Recently, the development of process monitoring systems to detect and diagnose process abnormalities has attracted attention in process systems engineering. Normal data obtained from processes provide information on process characteristics that can be used for modeling, monitoring, and control. Since modern chemical and environmental processes have high dimensionality, strong correlation, severe dynamics, and nonlinearity, it is not easy to analyze a process through a model-based approach. To overcome the limitations of the model-based approach, many system engineers and academic researchers have focused on statistical approaches combined with multivariate analysis such as principal component analysis (PCA) and partial least squares (PLS). Several multivariate analysis methods have been modified so that they can be applied to chemical processes with specific characteristics such as dynamics and nonlinearity. This paper discusses missing value estimation and sensor fault identification based on process variable reconstruction using dynamic PCA and canonical variate analysis.

A Development of Personalized Recommendation System using Spark GraphX (Spark GraphX를 활용한 개인 추천 시스템 개발)

  • Kim, Sungsook;Park, Kiejin;Lu, Sun
    • Proceedings of the Korea Information Processing Society Conference / 2018.05a / pp.41-43 / 2018
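  • Social data are connected through interactions among countless individuals on the Internet, and it is important to analyze such data to identify the structure and characteristics inherent in the subject of analysis. In particular, for personalized recommendation, it is efficient to derive recommendation values quickly and accurately using the relationship graph of individual data items. However, existing recommendation techniques have difficulty immediately reflecting the constant arrival of new users and items, and when the data are sparse and contain many missing values, the recommendation system faces severe constraints on computation space and time. This paper therefore designs and develops a personalized recommendation system using Spark GraphX, runs graph-based recommendation that reflects the complex factors underlying users and items, and confirms the quality of the personalized recommendation results.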

Imputation method for missing data based on clustering and measure of property (군집화 및 특성도를 이용한 결측치 대체 방법)

  • Kim, Sunghyun;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.31 no.1 / pp.29-40 / 2018
  • Missing values arise for various reasons when collecting data. They influence the analysis and its results; consequently, various methods of processing missing values have been studied. In repeated measurement data, values at later time points may be affected by the value at the initial time point. However, no existing imputation method makes use of this idea. Therefore, we propose a new missing value imputation method that uses clustering at the initial time point of the repeated measurement data and the measure of property proposed by Kim and Kim (The Korean Communications in Statistics, 30, 463-473, 2017). We also use Monte Carlo simulations to compare the performance of established methods and the suggested methods on repeated measurement data.