• Title/Summary/Keyword: missing data


Missing Data Modeling based on Matrix Factorization of Implicit Feedback Dataset (암시적 피드백 데이터의 행렬 분해 기반 누락 데이터 모델링)

  • Ji, JiaQi;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering, v.23 no.5, pp.495-507, 2019
  • Data sparsity is one of the main challenges for recommender systems. A recommender system contains massive amounts of data, of which only a small part is observed and the rest is missing. Most studies assume that data are missing at random, so they train the recommendation model on observed data only and then recommend items to users. In practice, however, data are not missing at random. In our research, we treat the missing data as negative examples of users' interest. Three sampling methods are integrated into the SVD++ algorithm, yielding the SVD++_W, SVD++_R and SVD++_KNN algorithms. Experimental results show that the proposed sampling methods improve precision in Top-N recommendation over the baseline algorithms. Among the three improved algorithms, SVD++_KNN performs best, which suggests that the KNN sampling method is a more effective way to extract negative examples of users' interest.
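
The idea of treating unobserved user-item pairs as negative examples can be sketched as follows. This is a minimal illustration that assumes uniform random sampling over unobserved items; it is not the paper's SVD++_W, SVD++_R or SVD++_KNN procedure, and the names `sample_negatives`, `observed` and `all_items` are hypothetical.

```python
import random


def sample_negatives(observed, all_items, n_neg, seed=0):
    """Draw up to n_neg unobserved items per user as negative examples.

    Assumption: negatives are drawn uniformly at random; the abstract
    does not specify how its three sampling methods work.
    """
    rng = random.Random(seed)
    negatives = {}
    for user, seen in observed.items():
        # Candidate negatives are the items the user never interacted with.
        candidates = sorted(set(all_items) - set(seen))
        k = min(n_neg, len(candidates))
        negatives[user] = rng.sample(candidates, k)
    return negatives


# Toy implicit-feedback data (hypothetical).
observed = {"u1": {"a", "b"}, "u2": {"c"}}
all_items = ["a", "b", "c", "d", "e"]
negatives = sample_negatives(observed, all_items, n_neg=2)
```

The sampled pairs would then be added to the training set with a negative label alongside the observed interactions before fitting a factorization model.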

Development of Missing Item Detection and Management System under Cell Type Packaging Processes (Cell 방식 포장공정에서의 Missing Item 검사 및 관리 시스템 개발)

  • Kim, Hyeon-Woo;Choi, Hyun-Eui;An, Ho-Gyun;Yoon, Tae-Sung
    • Proceedings of the IEEK Conference, 2009.05a, pp.344-346, 2009
  • A cell type packaging line is more suitable than a conveyor type packaging line for products with many models and small quantities, such as mobile phones or MP3 players. A cell type packaging line can package various product models, but it can lead to wrong product compositions and missing items, so an automatic missing item detection system is needed. We designed a missing item detection system with a bar code reader, infrared sensors, and a digital camera, and also developed programs for sensor data acquisition, image data processing, the GUI, and data management.


PhysioCover: Recovering the Missing Values in Physiological Data of Intensive Care Units

  • Kim, Sun-Hee;Yang, Hyung-Jeong;Kim, Soo-Hyung;Lee, Guee-Sang
    • International Journal of Contents, v.10 no.2, pp.47-58, 2014
  • Physiological signals provide important clues in the diagnosis and prediction of disease. Analyzing these signals is important in health and medicine. In particular, data preprocessing for physiological signal analysis is a vital issue because missing values, noise, and outliers may degrade the analysis performance. In this paper, we propose PhysioCover, a system that can recover missing values of physiological signals that were monitored in real time. PhysioCover integrates a gradual method and EM-based Principal Component Analysis (PCA). This approach can (1) more readily recover long- and short-term missing data than existing methods, such as traditional EM-based PCA, linear interpolation, 5-average and Missing Value Singular Value Decomposition (MSVD), (2) more effectively detect hidden variables than PCA and Independent Component Analysis (ICA), and (3) offer fast computation time through real-time processing. Experimental results with the physiological data of an intensive care unit show that the proposed method assigns more accurate missing values than previous methods.

Algorithms for Handling Incomplete Data in SVM and Deep Learning (SVM과 딥러닝에서 불완전한 데이터를 처리하기 위한 알고리즘)

  • Lee, Jong-Chan
    • Journal of the Korea Convergence Society, v.11 no.3, pp.1-7, 2020
  • This paper introduces two techniques for dealing with incomplete data, together with algorithms for learning from such data. The first method processes the incomplete data by assigning equal probability to every value the missing variable can take, and then learns from this data with an SVM. This technique ensures that the more frequently a variable is missing, the higher its entropy, so that it is not selected in the decision tree. This method is characterized by ignoring all remaining information in the missing variable and assigning a new value. The new method, by contrast, calculates the entropy probability from the remaining information, excluding the missing value, and uses it as an estimate of the missing variable. In other words, it uses the large amount of information that is not lost in the incomplete training data to recover some of the missing information, and then learns using deep learning. The two methods are evaluated by selecting one variable at a time from the training data and iteratively comparing the results obtained with varying proportions of data lost in that variable.

A Study on Imputation using Adjusted Cohen Method

  • Chung, Sung-Suk;Chun, Young-Min;Lee, Sun-Kyung
    • Journal of the Korean Data and Information Science Society, v.17 no.3, pp.871-888, 2006
  • Many studies have been done to develop procedures for dealing with missing values. The most common method is to assign other values to the missing data. The purpose of our study is to propose adjusted Cohen methods and to compare their efficiency with that of other methods through a simulation study. The adjusted Cohen methods use an auxiliary variable to rank the variable with missing values. This leads to a reduced mean square error (MSE) compared with the Cohen method.


Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

  • Linder, Henry;Zhang, Yuping
    • Communications for Statistical Applications and Methods, v.26 no.4, pp.411-430, 2019
  • Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancers may respond differently to pathway-targeted therapies. Thus, it is important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic and epigenomic data. However, missing data are often present in the -omic profiles of interest. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputation of missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. The advantages of the proposed statistical framework are demonstrated through simulations and real applications to breast cancer subtypes. We demonstrate superior power to identify pathway disturbances, compared with other imputation strategies. We also identify differential pathway activity across different breast tumor subtypes.

Pattern-Mixture Model of the Cox Proportional Hazards Model with Missing Binary Covariates (결측이 있는 이산형 공변량에 대한 Cox비례위험모형의 패턴-혼합 모델)

  • Youk, Tae-Mi;Song, Ju-Won
    • The Korean Journal of Applied Statistics, v.25 no.2, pp.279-291, 2012
  • When fitting a Cox proportional hazards model with missing covariates, it is inefficient to exclude observations with missing values from the analysis. Furthermore, if the missing-data mechanism is not Missing Completely At Random (MCAR), doing so may lead to biased parameter estimation. Many approaches have been suggested to handle the Cox proportional hazards model when covariates are sometimes missing, but they are based on the selection model. This paper suggests an approach to handle the Cox proportional hazards model with missing covariates by using the pattern-mixture model (Little, 1993). The pattern-mixture model is expressed by the joint distribution of survival time and the missing-data mechanism. In the pattern-mixture model, many models can be considered by setting up various restrictions, and different results under various restrictions indicate the sensitivity of the model to missing covariates. A simulation study was conducted to show the sensitivity of parameter estimation under different restrictions in a pattern-mixture model. The proposed approach was also applied to mouse leukemia data.

Statistical Methods for Multivariate Missing Data in Health Survey Research (보건조사연구에서 다변량결측치가 내포된 자료를 효율적으로 분석하기 위한 통계학적 방법)

  • Kim, Dong-Kee;Park, Eun-Cheol;Sohn, Myong-Sei;Kim, Han-Joong;Park, Hyung-Uk;Ahn, Chae-Hyung;Lim, Jong-Gun;Song, Ki-Jun
    • Journal of Preventive Medicine and Public Health, v.31 no.4 s.63, pp.875-884, 1998
  • Missing observations are common in medical research and health survey research. Several statistical methods to handle the missing data problem have been proposed. The EM algorithm (Expectation-Maximization algorithm) is one of the ways of efficiently handling the missing data problem based on sufficient statistics. In this paper, we developed statistical models and methods for survey data with multivariate missing observations. In particular, we adopted the EM algorithm to handle the multivariate missing observations. We assume that the multivariate observations follow a multivariate normal distribution, where the mean vector and the covariance matrix are primarily of interest. We applied the proposed statistical method to analyze data from a health survey. The data set we used came from a physician survey on the Resource-Based Relative Value Scale (RBRVS). In addition to the EM algorithm, we applied the complete case analysis, which uses only completely observed cases, and the available case analysis, which utilizes all available information. The residual and normal probability plots were evaluated to assess the assumption of normality. We found that the residual sum of squares from the EM algorithm was smaller than those of the complete-case and the available-case analyses.
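
The EM approach described in this abstract can be illustrated on the simplest multivariate case. The sketch below assumes a bivariate normal model in which `xs` is fully observed and only some entries of `ys` are missing (marked `None`); the function name `em_bivariate_normal` and the toy data are hypothetical, and the paper's own model and survey data are not reproduced here.

```python
def em_bivariate_normal(xs, ys, n_iter=50):
    """EM estimates of the mean and covariance of (x, y).

    Assumptions: (x, y) is bivariate normal, every x is observed,
    and missing y values are given as None.
    """
    n = len(xs)
    obs = [y for y in ys if y is not None]
    # x is complete, so its maximum-likelihood estimates are fixed.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Initialise the y parameters from the observed y values only.
    mu_y = sum(obs) / len(obs)
    s_yy = sum((y - mu_y) ** 2 for y in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics for y, y^2 and x*y.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + beta * (x - mu_x)  # E[y | x]
                eyy = ey * ey + resid_var      # E[y^2 | x]
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: maximum-likelihood updates from the expected statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return (mu_x, mu_y), (s_xx, s_xy, s_yy)


# Toy data with one missing y value (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, None]
mean, cov = em_bivariate_normal(xs, ys)
```

Unlike complete-case analysis, which would drop the last row, this uses the observed `x` of the incomplete case when estimating the mean of `y`.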


Rank Tests for Multivariate Linear Models in the Presence of Missing Data

  • Lee, Jae-Won;Reboussin, David M.
    • Journal of the Korean Statistical Society, v.26 no.3, pp.319-332, 1997
  • The application of multivariate linear rank statistics to data with item nonresponse is considered. Only a modest extension of the complete data techniques is required when the missing data may be thought of as a random sample, and an appropriate modification of the covariances is derived. A proof of the asymptotic multivariate normality is given. A review of some related results in the literature is presented and applications including longitudinal and repeated measures designs are discussed.


The Interpolation Method for the missing AIS Data of Ship

  • Nguyen, Van-Suong;Im, Nam-kyun;Lee, Sang-min
    • Journal of Navigation and Port Research, v.39 no.5, pp.377-384, 2015
  • Interpolating missing AIS data can recover lost data on a ship's state, which can then provide useful information for VTS stations or other ships. Previous research has introduced some interpolation methods; however, problems with missing AIS data remain. This paper proposes a new method that combines linear interpolation, cubic Hermite interpolation and an identification mechanism to overcome some of those limitations. First, AIS data on ship position, COG, SOG and HDG are divided into separate time series. Then the characteristics of the missing data are investigated using the identification mechanism, and an appropriate interpolation is selected for each time series to match those characteristics. Numerical experiments are carried out using real AIS data to validate the algorithm, and the results are compared with the previous method, after which the proposed method is suggested for interpolating the actual missing area. The interpolation results show that this approach can be applied well in practice.
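
The two interpolation schemes named in this abstract can be sketched for a single time series as follows. This is an illustrative sketch only: the paper's identification mechanism is not reproduced, the function names are hypothetical, and the endpoint rates `v0` and `v1` are assumed to be known (for a position series they might be derived from speed and course, which is an assumption here).

```python
def linear_interp(t, t0, x0, t1, x1):
    """Linearly interpolate a value at time t between (t0, x0) and (t1, x1)."""
    s = (t - t0) / (t1 - t0)
    return x0 + s * (x1 - x0)


def cubic_hermite(t, t0, x0, v0, t1, x1, v1):
    """Cubic Hermite interpolation between two samples.

    x0, x1 are the values and v0, v1 the rates of change (dx/dt)
    at t0 and t1; the rates are assumed to be available.
    """
    h = t1 - t0
    s = (t - t0) / h
    # Standard Hermite basis polynomials on the unit interval.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * x0 + h10 * h * v0 + h01 * x1 + h11 * h * v1


# Fill a gap at t = 5 s between samples at t = 0 s and t = 10 s (toy values).
x_lin = linear_interp(5.0, 0.0, 0.0, 10.0, 10.0)
x_her = cubic_hermite(5.0, 0.0, 0.0, 1.0, 10.0, 10.0, 1.0)
```

When the endpoint rates agree with the straight-line slope, as in this toy case, both schemes give the same value; they differ when the rates indicate that the series was curving across the gap.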