http://dx.doi.org/10.5391/IJFIS.2004.4.1.119

Sparse Data Cleaning using Multiple Imputations  

Jun, Sung-Hae (Department of Statistics, Cheongju University)
Lee, Seung-Joo (Department of Statistics, Cheongju University)
Oh, Kyung-Whan (Department of Computer Science, Sogang University)
Publication Information
International Journal of Fuzzy Logic and Intelligent Systems / v.4, no.1, 2004, pp. 119-124
Abstract
Real-world data such as web log files tend to be incomplete, yet useful knowledge must still be extracted from them to support optimal decisions. Web log data contain much useful information, including hyperlink structure and the usage patterns of connected users. However, web data are too large to use directly for effective knowledge discovery and, to make matters worse, they are very sparse. We overcome this sparsity problem by using the Markov Chain Monte Carlo (MCMC) method for multiple imputation. This missing-value imputation turns sparse web data into complete data. Our approach may be a useful tool for discovering knowledge from sparse data sets. The performance of MCMC imputation improves as the sparseness of the data increases. We verified our work through experiments on data from the UCI Machine Learning Repository.
Keywords
Cleaning of Sparse Data; Multiple Imputation; Markov Chain Monte Carlo
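The abstract does not spell out the imputation procedure, so the following is only a minimal sketch of one common form of MCMC multiple imputation: data augmentation under a multivariate normal model with a noninformative prior. That model choice, the function name `mcmc_impute`, and the parameters `n_imputations`, `burn_in`, and `thin` are all assumptions made for illustration and are not taken from the paper. The sampler alternates a P-step (draw the mean and covariance given the currently completed data) with an I-step (redraw each missing entry given the observed entries in its row), and keeps a completed copy of the data at spaced intervals.

```python
import numpy as np

def mcmc_impute(X, n_imputations=5, burn_in=50, thin=10, seed=0):
    """Return a list of completed copies of X (NaN marks a missing entry).

    Illustrative data-augmentation sampler; assumes the rows are draws
    from one multivariate normal and that n - 1 >= number of columns.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    # Starting point: fill each gap with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    rows_with_gaps = np.flatnonzero(missing.any(axis=1))
    completed = []
    for it in range(1, burn_in + thin * n_imputations + 1):
        # P-step: Sigma ~ inverse-Wishart(n - 1, S), mu ~ N(xbar, Sigma / n).
        xbar = filled.mean(axis=0)
        centered = filled - xbar
        S = centered.T @ centered
        Z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(S), size=n - 1)
        sigma = np.linalg.inv(Z.T @ Z)
        sigma = (sigma + sigma.T) / 2  # remove round-off asymmetry
        mu = rng.multivariate_normal(xbar, sigma / n)
        # I-step: draw missing values from their conditional normal.
        for i in rows_with_gaps:
            m = missing[i]
            o = ~m
            if o.any():
                B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                cond_mean = mu[m] + B @ (X[i, o] - mu[o])
                cond_cov = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
            else:
                # Row is entirely missing: draw from the marginal.
                cond_mean, cond_cov = mu, sigma
            cond_cov = (cond_cov + cond_cov.T) / 2
            filled[i, m] = rng.multivariate_normal(cond_mean, cond_cov)
        # Keep every `thin`-th completed data set after burn-in.
        if it > burn_in and (it - burn_in) % thin == 0:
            completed.append(filled.copy())
    return completed

# Usage on synthetic data with roughly 30% of entries removed at random.
demo_rng = np.random.default_rng(1)
demo = demo_rng.multivariate_normal([0.0, 1.0, 2.0], np.eye(3) + 0.5, size=100)
demo[demo_rng.random(demo.shape) < 0.3] = np.nan
imputations = mcmc_impute(demo, n_imputations=3, burn_in=10, thin=2)
# Pooled point estimate of the column means: average over the imputations.
pooled_means = np.mean([c.mean(axis=0) for c in imputations], axis=0)
```

Each completed data set would normally be analysed separately and the results pooled, which is what distinguishes multiple imputation from filling each gap once. The short burn-in used in the usage lines is for illustration only.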