http://dx.doi.org/10.9716/KITS.2014.13.4.359

Imputation of Missing Data Based on Hot Deck Method Using K-nn  

Kwon, Soonchang (Division of International Trade, Incheon National University)
Publication Information
Journal of Information Technology Services / v.13, no.4, 2014, pp. 359-375
Abstract
Missing data are unavoidable in data collection, because some respondents, whether arbitrarily or not, leave questions unanswered in studies and experiments. Missing data not only inflate and distort standard deviations but also make parameter estimation harder and reduce the reliability of research results. Although hot deck imputation is widely used, it has drawn little research interest because the way it handles missing data is ambiguous. Hot deck can be complemented with k-nn, a machine learning method that forms donor groups from the cases closest in their properties to the case with missing data. This study therefore imputes missing data with a hot deck method based on k-nn. The method was compared with listwise deletion and with mean, mode, linear regression, and SVM imputation on nominal and ratio data types, and it yielded imputed values closest to the original values. Simulations with different numbers of neighbors and different distance measures were also carried out, through which better k-nn performance was achieved. The study thus rediscovers hot deck imputation, which has failed to attract the attention of researchers. Its results should help in selecting non-parametric methods that are less likely to be affected by the structure of missing data and its causes.
Keywords
Missing Data; Imputation; K-nn; Hot Deck
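
The idea summarized in the abstract, using k-nn to form a donor group and then borrowing an observed value from it, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `knn_hot_deck`, the Euclidean distance on standardized covariates, the random draw from the donor group, and the toy data are all assumptions made for illustration. The sketch also assumes that only the target column has missing values.

```python
import numpy as np

def knn_hot_deck(X, target, k=3, seed=None):
    """Fill NaNs in column `target` with a value copied from one of the
    k nearest donors (rows where `target` is observed).

    Assumptions (not taken from the paper): covariates are numeric and
    fully observed, distance is Euclidean on standardized covariates,
    and the donor is drawn uniformly at random from the k neighbors.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = X.copy()

    covars = [j for j in range(X.shape[1]) if j != target]
    missing = np.isnan(X[:, target])
    donors = np.flatnonzero(~missing)
    if donors.size == 0:
        raise ValueError("no donor rows with an observed target value")

    # Standardize covariates so no single column dominates the distance.
    Z = X[:, covars]
    scale = Z.std(axis=0)
    scale[scale == 0] = 1.0
    Z = (Z - Z.mean(axis=0)) / scale

    for i in np.flatnonzero(missing):
        dist = np.sqrt(((Z[donors] - Z[i]) ** 2).sum(axis=1))
        donor_group = donors[np.argsort(dist, kind="stable")[:k]]
        # Hot deck step: copy an actually observed value from a donor.
        out[i, target] = X[rng.choice(donor_group), target]
    return out

# Toy data (hypothetical): two clusters; the last two rows lack column 2.
data = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 11.0],
    [0.9, 1.2, 12.0],
    [8.0, 8.0, 90.0],
    [8.2, 7.9, 95.0],
    [1.0, 1.1, np.nan],
    [8.1, 8.0, np.nan],
])
filled = knn_hot_deck(data, target=2, k=2, seed=0)
print(filled[5:, 2])
```

Because the imputed value is copied from a real donor rather than computed, the same sketch applies to numerically coded nominal data as well as ratio data. The number of neighbors `k` and the distance measure are the two settings the abstract reports varying in its simulations.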