http://dx.doi.org/10.7232/JKIIE.2012.38.4.276

Missing Value Imputation based on Locally Linear Reconstruction for Improving Classification Performance  

Kang, Pilsung (Industrial and Information Systems Engineering, Seoul National University of Science and Technology (Seoultech))
Publication Information
Journal of Korean Institute of Industrial Engineers / v.38, no.4, 2012, pp. 276-284
Abstract
Classification algorithms generally assume that the data are complete. However, missing values are common in real data sets for various reasons. In this paper, we propose using locally linear reconstruction (LLR) for missing value imputation to improve classification performance when missing values exist. We first investigate how much missing values degrade classification performance at various missing ratios. Then, we compare the proposed LLR imputation with three well-known single imputation methods over three different classifiers using eight data sets. The experimental results showed that (1) every imputation method, even the very simple ones, helped to improve classification accuracy; (2) among the imputation methods, the proposed LLR imputation was the most effective over all missing ratios; and (3) when the missing ratio was relatively high, LLR was outstanding, and its classification accuracy was as high as that obtained from the complete data set.
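
The abstract does not spell out the LLR procedure, so the following is only a minimal sketch of one plausible reading: find the nearest complete instances using the observed attributes, solve for sum-to-one reconstruction weights on those attributes, and apply the same weights to the neighbors' values of the missing attributes. The function name `llr_impute`, the parameters `k` and `reg`, and the regularized sum-to-one weight formulation (borrowed from locally linear embedding) are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def llr_impute(x, complete, k=3, reg=1e-3):
    """Fill NaN entries of x using its k nearest rows in `complete`.

    Hypothetical sketch, not the paper's implementation.
    x        : 1-D array with NaN marking missing attributes
    complete : 2-D array of instances without missing values
    k, reg   : assumed neighbor count and ridge regularization strength
    """
    x = np.asarray(x, dtype=float)
    complete = np.asarray(complete, dtype=float)
    missing = np.isnan(x)
    if not missing.any():
        return x.copy()
    obs = ~missing

    # Neighbor search uses only the attributes observed in x.
    dists = np.linalg.norm(complete[:, obs] - x[obs], axis=1)
    idx = np.argsort(dists, kind="stable")[:k]
    neighbors = complete[idx]

    # Weights minimizing ||x_obs - sum_j w_j * n_j,obs||^2 with sum(w) = 1,
    # obtained from the regularized local Gram matrix.
    diff = neighbors[:, obs] - x[obs]
    gram = diff @ diff.T
    gram += reg * (np.trace(gram) + 1e-12) * np.eye(len(idx))
    w = np.linalg.solve(gram, np.ones(len(idx)))
    w /= w.sum()

    # Reuse the weights to reconstruct the missing attributes.
    filled = x.copy()
    filled[missing] = w @ neighbors[:, missing]
    return filled

# Toy data in which the third attribute equals the sum of the first two.
complete = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0],
                     [1.0, 1.0, 2.0]])
query = np.array([0.5, 0.5, np.nan])
imputed = llr_impute(query, complete, k=3)
print(imputed)
```

In this toy case the imputed third attribute should land close to 1.0, since the missing attribute is a linear function of the observed ones; the regularization term introduces a small bias. The imputed rows would then be passed to any classifier in place of the incomplete ones.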
Keywords
Locally Linear Reconstruction (LLR); Missing Value Imputation; Classification