Sparse Data Cleaning using Multiple Imputations

  • Published : 2004.06.01


Real-world data such as web log files tend to be incomplete, yet we must still extract useful knowledge from them to support good decisions. Web log data contain much useful information, including hyperlink information and the web usage of connected users. However, web data are too large to use directly for effective knowledge discovery and, to make matters worse, they are very sparse. We address this sparsity problem with multiple imputation based on the Markov chain Monte Carlo (MCMC) method. Imputing the missing values turns sparse web data into complete data. Our approach may be a useful tool for discovering knowledge from sparse data sets. The sparser the data, the better the MCMC imputation performs. We verified our approach with experiments on data from the UCI Machine Learning Repository.
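To make the idea of MCMC-based multiple imputation concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a single numeric variable that follows a normal distribution with a noninformative prior, and it alternates an imputation step (draw the missing values) with a posterior step (draw the mean and variance), in the style of data augmentation. The function names `mcmc_impute` and `pool_mean`, the parameters `m`, `burn_in`, and `thin`, and the toy data are all hypothetical choices made for illustration.

```python
import random
import statistics


def mcmc_impute(values, m=5, burn_in=50, thin=10, seed=0):
    """Return m completed copies of `values`; None marks a missing entry.

    Assumption: the data are i.i.d. normal with a noninformative prior.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    missing_idx = [i for i, v in enumerate(values) if v is None]
    n = len(values)

    # Start the chain from the observed-data mean and variance.
    mu = statistics.fmean(observed)
    var = statistics.variance(observed)

    completed_sets = []
    for step in range(1, burn_in + m * thin + 1):
        # I-step: draw each missing entry from N(mu, var).
        completed = list(values)
        for i in missing_idx:
            completed[i] = rng.gauss(mu, var ** 0.5)

        # P-step: draw var, then mu, from their posterior given the
        # completed data (scaled inverse chi-square, then normal).
        ybar = statistics.fmean(completed)
        s2 = statistics.variance(completed)
        chi2 = rng.gammavariate((n - 1) / 2, 2.0)  # chi-square, n-1 d.o.f.
        var = (n - 1) * s2 / chi2
        mu = rng.gauss(ybar, (var / n) ** 0.5)

        # Keep every `thin`-th draw after burn-in as one imputed data set.
        if step > burn_in and (step - burn_in) % thin == 0:
            completed_sets.append(completed)
    return completed_sets


def pool_mean(completed_sets):
    """Combine per-data-set means with Rubin's rules."""
    m = len(completed_sets)
    estimates = [statistics.fmean(c) for c in completed_sets]
    within = statistics.fmean(
        [statistics.variance(c) / len(c) for c in completed_sets]
    )
    between = statistics.variance(estimates)
    total_var = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total_var


# Toy data with two missing entries (hypothetical values).
data = [1.0, None, 2.0, 3.0, None, 4.0, 5.0]
imputed = mcmc_impute(data, m=5)
estimate, variance = pool_mean(imputed)
print(estimate, variance)
```

Each of the `m` completed data sets can be analysed with ordinary complete-data methods, and the results are then pooled. The pooled variance adds the between-imputation spread to the average within-imputation variance, so the uncertainty caused by the missing values is not ignored.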



  1. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, 'Classification and regression trees', Wadsworth & Brooks, 1984
  2. C. Conversano, C. Cappelli, 'Missing data incremental imputation through tree based methods', 14th Conference on Computational Statistics, 24-28 August 2002, Berlin, Germany, 2002
  3. C. Cortes, V. Vapnik, 'Support Vector Networks', Machine Learning, 1995
  4. R. Fletcher, 'Practical Methods of Optimization', John Wiley & Sons, Inc. New York, 1989
  5. S. Haykin, 'Neural Networks', 2nd edition. Prentice Hall, 1999
  6. D. C. Hoaglin, F. Mosteller, J. W. Tukey, 'Understanding robust and exploratory data analysis', John Wiley & Sons, Inc. New York, 1983
  7. P. W. Lavori, R. Dawson, D. Shera, 'A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data', Statistics in Medicine, 1995
  8. R. J. A. Little, 'A Test of Missing Completely at Random for Multivariate Data with Missing Values', Journal of the American Statistical Association, 1988
  9. P. R. Rosenbaum, D. B. Rubin, 'The Central Role of the Propensity Score in Observational Studies for Causal Effects', Biometrika, 1983
  10. D. B. Rubin, 'Inference with missing data', Biometrika, 1976
  11. D. B. Rubin, 'Multiple Imputation for Nonresponse in Surveys', John Wiley & Sons Inc. New York, 1987
  12. D. B. Rubin, 'Multiple Imputation After 18+ Years', Journal of the American Statistical Association, 1996
  13. J. L. Schafer, 'Analysis of Incomplete Multivariate Data', Chapman and Hall. New York, 1997
  14. V. Vapnik, 'The Nature of Statistical Learning Theory', Springer. New York, 1995