• Title/Summary/Keyword: Imputation method

Cluster Analysis of Incomplete Microarray Data with Fuzzy Clustering

  • Kim, Dae-Won
    • Journal of the Korean Institute of Intelligent Systems / v.17 no.3 / pp.397-402 / 2007
  • In this paper, we present a method for clustering incomplete microarray data using alternating optimization, in which a prior imputation method is not required. To reduce the influence of imputation in preprocessing, we take an alternating optimization approach to find better estimates during the iterative clustering process. This method improves the estimates of missing values by exploiting cluster information, such as cluster centroids, and all available non-missing values in each iteration. The clustering results of the proposed method are significantly more relevant to the biological gene annotations than those of other methods, indicating its effectiveness and potential for clustering incomplete gene expression data.
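The alternating scheme described in this abstract can be illustrated with a minimal sketch. It assumes a standard fuzzy c-means objective and re-estimates each missing entry as a membership-weighted average of the centroids; the function name `fcm_with_missing` and all parameters are hypothetical, and the paper's actual update rules may differ.

```python
import numpy as np

def fcm_with_missing(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means in which NaN entries are re-estimated every iteration.

    Illustrative sketch only; not the published algorithm.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means computed over observed values only.
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Step 1: update centroids from the currently completed data.
        centroids = (W.T @ X_hat) / W.sum(axis=0)[:, None]
        # Step 2: update memberships from squared distances to centroids.
        d2 = ((X_hat[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Step 3: re-estimate missing entries from the cluster information;
        # observed entries are always kept as they are.
        W = U ** m
        estimates = (W @ centroids) / W.sum(axis=1, keepdims=True)
        X_hat = np.where(missing, estimates, X)
    return X_hat, U, centroids

# Toy data: two well-separated groups, one missing value in each.
X = np.array([[0.0, 0.0], [0.1, np.nan], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [np.nan, 5.1]])
X_hat, U, centroids = fcm_with_missing(X, n_clusters=2)
```

No separate imputation pass is run before clustering: the initial column means are only a starting point and are overwritten in the first iteration.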

Pairwise fusion approach to cluster analysis with applications to movie data (영화 데이터를 위한 쌍별 규합 접근방식의 군집화 기법)

  • Kim, Hui Jin;Park, Seyoung
    • The Korean Journal of Applied Statistics / v.35 no.2 / pp.265-283 / 2022
  • MovieLens data consist of recorded movie ratings and have often been used to measure evaluation scores in recommender system research. In this paper, we provide additional information obtained by clustering users' genre preferences using movie rating data and movie genre data. Because the number of ratings per user is very small compared to the total number of movies, the missing rate in the data is very high, which limits the applicability of existing clustering methods. We propose a convex clustering-based method using a pairwise fused penalty, motivated by the analysis of MovieLens data. In particular, the proposed method performs missing-value imputation and, at the same time, uses the movie ratings and genre weights of each movie to cluster the genre preferences of each individual. We solve the proposed optimization problem using the alternating direction method of multipliers algorithm. Simulations and an application to MovieLens data show that the proposed clustering method is less sensitive to noise and outliers than the existing method.

Estimation of Survival Function and Median Survival Time in Interval-Censored Data (구간중도절단자료에서 생존함수와 중간생존시간에 대한 추정)

  • Yun, Eun-Young;Kim, Choong-Rak
    • The Korean Journal of Applied Statistics / v.23 no.3 / pp.521-531 / 2010
  • Interval-censored observations are common in medical and epidemiologic studies; however, few studies exist because of the complexity and special structure of interval censoring. This paper introduces the imputation method and the self-consistency method for interval-censored data. We propose a new method of generating random numbers under an interval-censoring setup. Through simulation studies, we compare the two methods under various simulation schemes in terms of the mean squared error for estimating the median survival time and the mean integrated squared error for estimating the survival function. Under a moderate censoring percentage, the mean imputation method performed better than the self-consistency method in estimating the median survival time and the survival function.
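As a rough illustration of imputation for interval-censored data, the sketch below replaces each censoring interval with its midpoint and then reads the median survival time off the empirical survival function. Midpoint imputation is used here as a simple stand-in; the paper's mean imputation method is not specified in the abstract and may differ. All function names are hypothetical.

```python
import numpy as np

def midpoint_impute(left, right):
    """Replace each censoring interval (left, right] by its midpoint."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    return (left + right) / 2.0

def empirical_survival(times, t):
    """Proportion of imputed failure times strictly greater than t."""
    return float(np.mean(np.asarray(times, dtype=float) > t))

def median_survival_time(times):
    """Smallest imputed time t at which the empirical survival is <= 0.5."""
    for t in np.sort(np.asarray(times, dtype=float)):
        if empirical_survival(times, t) <= 0.5:
            return float(t)
    return float("nan")

# Four subjects, each known only to have failed between two inspections.
left = [0.0, 2.0, 4.0, 6.0]
right = [2.0, 4.0, 6.0, 8.0]
imputed = midpoint_impute(left, right)
median = median_survival_time(imputed)
```

Once the intervals are replaced by single values, any estimator for complete data applies directly, which is the main appeal of imputation-based approaches.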

Missing values imputation for time course gene expression data using the pattern consistency index adaptive nearest neighbors (시간경로 유전자 발현자료에서 패턴일치지수와 적응 최근접 이웃을 활용한 결측값 대치법)

  • Shin, Heyseo;Kim, Dongjae
    • The Korean Journal of Applied Statistics / v.33 no.3 / pp.269-280 / 2020
  • Time course gene expression data are large-scale data observed over time in microarray experiments. Such data make it possible to measure the expression levels of genes simultaneously. However, the experimental process is complex, and missing values occur frequently for various reasons. In this paper, we propose pattern consistency index adaptive nearest neighbors (PANN) as a method of missing value imputation. This method combines the adaptive nearest neighbors (ANN) method, which reflects local characteristics, with the pattern consistency index, which considers the degree of consistency in gene expression between observations over time points. We conducted a Monte Carlo simulation study on two yeast time course datasets to evaluate the usefulness of the proposed PANN method.
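For orientation, the sketch below shows plain k-nearest-neighbor imputation, the baseline idea that nearest-neighbor methods build on. It does not implement the adaptive neighbor selection or the pattern consistency index proposed in the paper; the function name `knn_impute` and the fixed `k` are assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaN entries of each row using the k nearest complete rows.

    Plain KNN baseline (fixed k, Euclidean distance on observed columns);
    not the PANN method.
    """
    X = np.asarray(X, dtype=float)
    X_out = X.copy()
    # Candidate neighbors: rows (genes) with no missing values.
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Distance is computed only over the columns observed in this row.
        dists = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Impute each missing column by the neighbors' mean in that column.
        X_out[i, miss] = complete[nearest][:, miss].mean(axis=0)
    return X_out

# Rows are genes, columns are time points; the last gene has one gap.
X = np.array([[1.00, 2.0, 3.00],
              [1.10, 2.1, 3.10],
              [10.0, 20.0, 30.0],
              [1.05, np.nan, 3.05]])
filled = knn_impute(X, k=2)
```

An adaptive variant would choose the number of neighbors per gene rather than fixing `k` globally.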

A Concordance Study of the Preprocessing Orders in Microarray Data (마이크로어레이 자료의 사전 처리 순서에 따른 검색의 일치도 분석)

  • Kim, Sang-Cheol;Lee, Jae-Hwi;Kim, Byung-Soo
    • The Korean Journal of Applied Statistics / v.22 no.3 / pp.585-594 / 2009
  • In microarray experiments, researchers convert the processed images of raw data into data suitable for statistical analysis; this step is called preprocessing. Preprocessing of microarray data consists of image filtering, imputation, and normalization. Several methods of normalization and imputation have been studied, but the order in which the two procedures should be applied has not. This study examines how the order of the preprocessing steps affects the identification of differentially expressed genes (DEG), using two-dye cDNA microarray data from colon cancer and gastric cancer. That is, we compare which combinations of imputation and normalization steps can detect the DEG. We used two imputation methods (K-nearest neighbor, Bayesian principal component analysis) and three normalization methods (global, within-print tip group, variance stabilization). Applying each pair in either order gives 12 preprocessing procedures (2 imputation methods × 3 normalization methods × 2 orders). We computed a concordance measure of DEG using the datasets to which the 12 different preprocessing orders were applied. When variance stabilization was used as the normalization method, the sensitivity of DEG detection varied slightly.
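The count of 12 preprocessing procedures can be checked by enumeration. The sketch below assumes that the 12 arise from 2 imputation methods, 3 normalization methods, and 2 possible orders of the two steps; the labels are illustrative and not taken from the paper's software.

```python
from itertools import product

# Labels are illustrative; the abstract names the methods but not any code.
imputation_methods = ["K-nearest neighbor", "Bayesian PCA"]
normalization_methods = ["global", "within-print tip group",
                         "variance stabilization"]
orders = ["imputation first", "normalization first"]

# Every (imputation, normalization, order) combination is one procedure.
pipelines = list(product(imputation_methods, normalization_methods, orders))

for imputation, normalization, order in pipelines:
    print(f"{imputation} + {normalization} ({order})")
```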

A modified estimating equation for a binary time varying covariate with an interval censored changing time

  • Kim, Yang-Jin
    • Communications for Statistical Applications and Methods / v.23 no.4 / pp.335-341 / 2016
  • Interval censored failure time data often occur in observational studies where a subject is followed periodically. Instead of an exact failure time, only the two inspection times that bracket it are available. Several methods have been suggested to analyze interval censored failure time data (Sun, 2006). In this article, we are concerned with a binary time-varying covariate whose changing time is interval censored. A modified estimating equation is proposed by extending the approach suggested for the case of a missing covariate. Based on simulation results, the proposed method performs better than other simple imputation methods. The ACTG 181 dataset was analyzed as a real example.

A Generation and Accuracy Evaluation of Common Metadata Prediction Model Using Public Bicycle Data and Imputation Method

  • Kim, Jong-Chan;Jung, Se-Hoon
    • Journal of Korea Multimedia Society / v.25 no.2 / pp.287-296 / 2022
  • Today, air pollution is becoming a severe issue worldwide, and various policies are being implemented to address environmental pollution. In major cities, public bicycles are installed and operated to reduce pollution and solve transportation problems, and operational information is collected in real time. However, research using public bicycle operation data has been limited. This study uses daily weather data from the Korea Meteorological Agency and real-time air pollution data from the Korea Environment Corporation to predict the daily number of bicycle rentals. Cross-validation, principal component analysis, and multiple regression analysis were used to determine the independent variables of the predictive model. The study then selected the variables that satisfied the significance level, constructed a model, predicted the daily number of bicycle rentals, and measured the accuracy.

An EM Algorithm-Based Approach for Imputation of Pixel Values in Color Image (색조영상에서 랜덤결측화소값 대체를 위한 EM 알고리즘 기반 기법)

  • Kim, Seung-Gu
    • The Korean Journal of Applied Statistics / v.23 no.2 / pp.305-315 / 2010
  • In this paper, a frequentist approach is provided for imputing the values of the R, G, and B components of randomly missing pixels in a color image. Under the assumption that the given image is a realization of a Gaussian Markov random field, the model is designed so that each neighboring pixel value of a given pixel independently follows a normal distribution whose covariance matrix is scaled by a measure of the similarity between the two pixel values, so that the imputation is not affected by neighbors of a different color. An approximate EM-based algorithm that maximizes the underlying likelihood is implemented to estimate the parameters and to impute the missing pixel values. Experiments are presented to show its effectiveness through a performance comparison with a popular interpolation method.

Sparse Web Data Analysis Using MCMC Missing Value Imputation and PCA Plot-based SOM (MCMC 결측치 대체와 주성분 산점도 기반의 SOM을 이용한 희소한 웹 데이터 분석)

  • Jun, Sung-Hae;Oh, Kyung-Whan
    • The KIPS Transactions:PartD / v.10D no.2 / pp.277-282 / 2003
  • Knowledge discovery from the web has been studied extensively. Using web logs as training data for efficient predictive models poses some difficulties. In this paper, we study a method to eliminate sparseness from web log data and to perform web user clustering. The sparseness of the web data is removed by missing value imputation based on Bayesian inference with MCMC. Web user clustering is then performed using self-organizing maps based on a 3-D plot of principal components. Finally, experimental results on KDD Cup data illustrate the problem-solving process and the performance evaluation.

A Novel on Auto Imputation and Analysis Prediction Model of Data Missing Scope based on Machine Learning (머신러닝기반의 데이터 결측 구간의 자동 보정 및 분석 예측 모델에 대한 연구)

  • Jung, Se-Hoon;Lee, Han-Sung;Kim, Jun-Yeong;Sim, Chun-Bo
    • Journal of Korea Multimedia Society / v.25 no.2 / pp.257-268 / 2022
  • When raw data contain missing values, ignoring them and proceeding with the analysis reduces accuracy because the sample size decreases. Compared with simply removing missing values, imputation combined with the analysis of patterns and significant values can compensate for the loss of analysis quality and accuracy caused by bias. In this study, we investigate irregular data patterns and methods for handling missing data using machine learning techniques, with the aim of correcting missing values. We propose to replace missing data with data from a similar past point in time, identified from the situation at the time the missing data occurred. Unlike previous studies, the proposed data correction technique is a new algorithm using DNN and KNN-MLE techniques. In the performance evaluation, the ANAE value improved by about 0.041 to 0.321 compared with the existing missing section correction algorithm.