• Title/Summary/Keyword: missing value imputation (결측치 대체)

Using Missing Values in the Model Tree to Change Performance for Predict Cholesterol Levels (모델트리의 결측치 처리 방법에 따른 콜레스테롤수치 예측의 성능 변화)

  • Jung, Yong Gyu; Won, Jae Kang; Sihn, Sung Chul
    • Journal of Service Research and Studies / v.2 no.2 / pp.35-43 / 2012
  • Data mining is of interest in virtually every field rather than in any one specific area, and it is applied heavily across many domains. It is used in decision-making, in analyzing data and uncovering hidden correlations, and in finding actionable information and making predictions. However, some data sets contain many missing values in their variables and do not have a large number of records. In this paper, missing values are handled in several ways within the model tree algorithm, and the resulting models are used to predict cholesterol levels. Experiments are run for each missing-value treatment to analyze performance, and an efficient alternative for handling missing data is presented.

Performance Evaluation of an Imputation Method based on Generative Adversarial Networks for Electric Medical Record (전자의무기록 데이터에서의 적대적 생성 알고리즘 기반 결측값 대치 알고리즘 성능분석)

  • Jo, Yong-Yeon; Jeong, Min-Yeong; Hwangbo, Yul
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.879-881 / 2019
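  • Large-scale data collected in clinical settings, such as electronic medical records (EMR), have high latent value for clinical interpretation and a wide range of uses, but they are difficult to analyze because they contain many missing values and are therefore highly sparse. In particular, missing values that arise while EMR information is collected are random and arbitrary, and they act as a major factor that lowers analysis accuracy and degrades the performance of predictive models, so missing value imputation is indispensable. Recently, algorithms that use deep learning have been emerging in place of the conventionally used statistics-based imputation algorithms. In this paper, we apply Generative Adversarial Imputation Nets, a recent missing value imputation algorithm based on the Generative Adversarial Network, and analyze its performance on EMR data.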

Imputation of Multiple Missing Values by Normal Mixture Model under Markov Random Field: Application to Imputation of Pixel Values of Color Image (마코프 랜덤 필드 하에서 정규혼합모형에 의한 다중 결측값 대체기법: 색조영상 결측 화소값 대체에 응용)

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods / v.16 no.6 / pp.925-936 / 2009
  • There are many approaches to imputing missing values in the i.i.d. case. However, imputation techniques for the Markov random field (MRF) case are hard to find. In this paper, we show that, under several practical assumptions, imputation under an MRF reduces to imputation by fitting a normal mixture model (NMM). Our multivariate normal mixture model-based approach under an MRF is applied to impute the missing pixel values of a 3-variate (R, G, B) color image, providing a technique to smooth the imputed values.

An EM Algorithm-Based Approach for Imputation of Pixel Values in Color Image (색조영상에서 랜덤결측화소값 대체를 위한 EM 알고리즘 기반 기법)

  • Kim, Seung-Gu
    • The Korean Journal of Applied Statistics / v.23 no.2 / pp.305-315 / 2010
  • In this paper, a frequentist approach to imputing the R, G, and B component values of randomly missing pixels in a color image is provided. Under the assumption that the given image is a realization of a Gaussian Markov random field, the model is designed so that each neighboring pixel value of a given pixel (independently) follows a normal distribution whose covariance matrix is scaled by an evaluation of the similarity between the two pixel values, so that the imputation is not affected by neighbors of a different color. An approximate EM-based algorithm that maximizes the underlying likelihood is implemented to estimate the parameters and to impute the missing pixel values. Experiments demonstrate its effectiveness through a performance comparison with a popular interpolation method.

Pairwise fusion approach to cluster analysis with applications to movie data (영화 데이터를 위한 쌍별 규합 접근방식의 군집화 기법)

  • Kim, Hui Jin; Park, Seyoung
    • The Korean Journal of Applied Statistics / v.35 no.2 / pp.265-283 / 2022
  • MovieLens data consist of recorded movie ratings and have often been used to measure evaluation scores in recommender system research. In this paper, we provide additional information obtained by clustering user-specific genre preferences using movie rating data and movie genre data. Because the number of movies each user has rated is very small compared to the total number of movies, the missing rate in this data is very high. For this reason, existing clustering methods are of limited use. We propose a convex clustering-based method using a pairwise fused penalty, motivated by the analysis of MovieLens data. In particular, the proposed clustering method performs missing value imputation and, at the same time, uses the movie ratings and genre weights of each movie to cluster the genre preference information of each individual. We solve the proposed optimization problem with the alternating direction method of multipliers algorithm. Simulations and an application to MovieLens data show that the proposed clustering method is less sensitive to noise and outliers than the existing method.

Imputation of Missing Data Based on Hot Deck Method Using K-nn (K-nn을 이용한 Hot Deck 기반의 결측치 대체)

  • Kwon, Soonchang
    • Journal of Information Technology Services / v.13 no.4 / pp.359-375 / 2014
  • Researchers cannot avoid missing data when collecting data, because some respondents, arbitrarily or non-arbitrarily, do not answer questions in studies and experiments. Missing data not only inflate and distort standard deviations, but also make parameter estimation less convenient and research results less reliable. Although hot deck is widely used, researchers have shown little interest in it, since it handles missing data in ambiguous ways. Hot deck can be complemented with k-nn, a machine learning method that can organize donor groups closest to the properties of the records with missing data. Focusing on this role of k-nn, this study imputes missing data with a hot deck method based on k-nn. The proposed method was compared with listwise deletion and with mean, mode, linear regression, and SVM imputation on nominal and ratio data types, and it yielded data closest to the original values. Simulations with different numbers of neighbors and different distance measures were carried out, and k-nn achieved better performance. This study thus rediscovers hot deck imputation, which has failed to attract the attention of researchers. As a result, it should help in selecting non-parametric methods that are less likely to be affected by the structure of missing data and its causes.
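
The idea of a k-nn based hot deck described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `knn_hot_deck_impute`, the Euclidean distance, the random choice of donor, and the assumption that all other columns are fully observed are choices made here for the example.

```python
import math
import random


def knn_hot_deck_impute(rows, target, k=3, seed=0):
    """Fill None values in column `target` by copying the observed value
    of a donor drawn at random from the k nearest complete rows.

    Assumptions (illustrative only): every column other than `target`
    is numeric and fully observed; distance is Euclidean.
    """
    rng = random.Random(seed)
    features = [j for j in range(len(rows[0])) if j != target]
    # Donor pool: rows whose target value is observed.
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        return math.dist([a[j] for j in features], [b[j] for j in features])

    imputed = []
    for r in rows:
        filled = list(r)  # copy so the input is left untouched
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: distance(r, d))[:k]
            filled[target] = rng.choice(nearest)[target]
        imputed.append(filled)
    return imputed


data = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 11.0],
    [5.0, 5.0, 50.0],
    [1.05, 1.0, None],  # record with a missing target value
]
print(knn_hot_deck_impute(data, target=2, k=1)[3][2])  # 10.0
```

Unlike mean or regression imputation, the imputed value is always a value actually observed in a similar record, which is the defining property of hot deck methods.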

The Comparison of Imputation Methods in Time Series Data with Missing Values (시계열자료에서 결측치 추정방법의 비교)

  • Lee, Sung-Duck; Choi, Jae-Hyuk; Kim, Duck-Ki
    • Communications for Statistical Applications and Methods / v.16 no.4 / pp.723-730 / 2009
  • Missing values in time series can be treated either as unknown parameters and estimated by maximum likelihood, or as random variables and predicted by the expectation of the unknown values given the data. The purpose of this study is to impute missing values in incomplete data under both views, as maximum likelihood estimates and as random variables, and to compare the two methods using an ARMA model. For illustration, monthly mumps data reported from the national capital region over the years 2001-2006 are used, and the results of the two methods are compared using the SSF (sum of squares of forecasting errors).
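
To make the "random variable" view above concrete, the sketch below imputes a single interior missing value by its conditional expectation given its two neighbors. It assumes a stationary Gaussian AR(1) model with known coefficient `phi` and mean `mu`, which is a simplification of the ARMA setting in the abstract; the function name `ar1_conditional_mean` is introduced here for illustration only.

```python
def ar1_conditional_mean(prev_value, next_value, phi, mu=0.0):
    """Conditional expectation of a missing x_t given x_{t-1} and x_{t+1}
    under a stationary Gaussian AR(1) model with known phi and mean mu:

        E[x_t | x_{t-1}, x_{t+1}]
            = mu + phi / (1 + phi^2) * ((x_{t-1} - mu) + (x_{t+1} - mu))
    """
    weight = phi / (1.0 + phi ** 2)
    return mu + weight * ((prev_value - mu) + (next_value - mu))


# Example: phi = 0.5, zero mean, neighbors 2.0 and 4.0.
print(round(ar1_conditional_mean(2.0, 4.0, phi=0.5), 6))  # 2.4
```

In practice `phi` and `mu` would themselves have to be estimated from the incomplete series, and consecutive gaps or endpoints need a more general treatment than this two-neighbor formula.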

Comparison of Single Imputation Methods in 2×2 Cross-Over Design with Missing Observations (2×2 교차계획법에서 결측치가 있을 때의 결측치 처리 방법 비교에 관한 연구)

  • Jo, Bobae; Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.28 no.3 / pp.529-540 / 2015
  • A cross-over design is frequently used in clinical trials (especially in bioequivalence tests with a parametric method) to compare two treatments. In cross-over designs, missing values frequently occur in the second period. Usually, subjects with missing values are removed before analysis. However, this can be unsuitable in clinical trials with a small sample size. In this paper, we compare single imputation methods in a 2×2 cross-over design when missing values exist in the second period. Additionally, parametric and nonparametric methods are compared after the single imputation methods are applied. A Monte Carlo simulation study compares the type I error and power of the methods.

Store Sales Prediction Using Gradient Boosting Model (그래디언트 부스팅 모델을 활용한 상점 매출 예측)

  • Choi, Jaeyoung; Yang, Heeyoon; Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.2 / pp.171-177 / 2021
  • With the rapid development of machine learning, it has come to be used in diverse ways not only in industry but also in daily life. Applications of machine learning to financial data have also attracted interest. Herein, we apply machine learning algorithms to store sales data and present future applications for fintech enterprises. We use several missing data processing methods and apply gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) to predict the future revenue of individual stores. We found that median imputation of missing data combined with the XGBoost algorithm gives the best accuracy. Both fintech enterprises and their customers can benefit from the proposed method: stores can receive financial assistance from fintech companies in advance, while these companies can offer such financial support at low risk.
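
Median imputation, the treatment the abstract above reports as most accurate, can be sketched in a few lines. This is a generic illustration using only the standard library; the function name `median_impute` and the use of `None` to mark missing entries are assumptions for the example, and the paper's actual preprocessing pipeline is not described in the abstract.

```python
import statistics


def median_impute(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


# The median (3.0) is not pulled toward the outlier 100.0,
# whereas the mean of the observed values would be about 34.67.
print(median_impute([3.0, None, 1.0, 100.0]))  # [3.0, 3.0, 1.0, 100.0]
```

When the imputed column feeds a predictive model, the median would normally be computed on the training split only and then reused for validation and test data.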

Comparison of Machine Learning Techniques in Urban Weather Prediction using Air Quality Sensor Data (실외공기측정기 자료를 이용한 도심 기상 예측 기계학습 모형 비교)

  • Jong-Chan Park; Heon Jin Park
    • The Journal of Bigdata / v.6 no.2 / pp.39-49 / 2021
  • Recently, large and diverse weather data are being collected by sensors from various sources. Efforts to predict fine dust concentration through machine learning are being made everywhere, and this study compares PM10 and PM2.5 prediction models using data from 840 outdoor air meters installed throughout the city. Predicting the fine dust concentration 5 minutes ahead makes it possible to provide information in real time, and it can serve as the basis for developing models that predict 10 minutes, 30 minutes, and 1 hour ahead. Data preprocessing, such as noise removal and missing value replacement, was performed, and derived variables that account for temporal and spatial factors were created. The parameters of the models were selected through the response surface method. XGBoost, Random Forest, and Deep Learning (Multilayer Perceptron) are used as predictive models to examine the difference between observed and predicted fine dust concentrations and to compare performance across models.