[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.9716/KITS.2014.13.4.359

Imputation of Missing Data Based on Hot Deck Method Using K-nn

Kwon, Soonchang (Division of Trade, Incheon National University)

Publication Information

Journal of Information Technology Services, v.13, no.4, 2014, pp. 359-375

Abstract

Researchers cannot avoid missing data when collecting data, because some respondents, arbitrarily or not, leave questions unanswered in studies and experiments. Missing data not only inflate and distort standard deviations but also make parameter estimation harder and undermine the reliability of research results. Although hot deck imputation is widely used, it has drawn little interest from researchers because it handles missing data in ambiguous ways. Hot deck can be complemented with k-nn, a machine learning method that can form donor groups closest in their properties to the records with missing data. Motivated by this role of k-nn, this study imputes missing data with a hot deck method that uses k-nn. The approach was compared against listwise deletion and mean, mode, linear regression, and SVM imputation on nominal and ratio data types, and it yielded the values closest to the originals. Simulations with different numbers of neighbors and different distance measures were carried out, and these improved the performance of k-nn. The study thereby rediscovers hot deck imputation, which had failed to attract researchers' attention. Its results should help in selecting non-parametric methods that are less affected by the structure and causes of missing data.
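
The idea in the abstract, using k-nn to form a donor group and then filling each gap with a donor's observed value, can be illustrated with a minimal sketch. This is not the author's implementation: the function name `knn_hot_deck`, the Minkowski exponent `p`, the random draw among the k donors, and the absence of feature scaling are all assumptions made for illustration, and nominal variables are assumed to be numerically coded.

```python
import numpy as np

def knn_hot_deck(X, k=3, p=2, seed=0):
    """Fill each NaN with the observed value of one of the k nearest donor rows.

    Hypothetical helper: distances are Minkowski with exponent p (p=2 is
    Euclidean), computed only over the columns the recipient row has observed.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    filled = X.copy()
    missing = np.isnan(X)

    for i in np.flatnonzero(missing.any(axis=1)):      # recipient rows
        shared = ~missing[i]                           # columns observed in row i
        shared_idx = np.flatnonzero(shared)
        for j in np.flatnonzero(missing[i]):           # gaps in row i
            # Donors must have column j and every shared column observed.
            donors = np.flatnonzero(
                ~missing[:, j] & ~missing[:, shared].any(axis=1)
            )
            if donors.size == 0:
                continue                               # no donor: leave the gap
            diffs = X[np.ix_(donors, shared_idx)] - X[i, shared]
            dist = (np.abs(diffs) ** p).sum(axis=1) ** (1.0 / p)
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            # Hot deck step: copy a real observed value from one drawn donor.
            filled[i, j] = X[rng.choice(nearest), j]
    return filled

# Toy data (made up for this sketch): the last row lacks its second value.
data = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.02, np.nan]]
print(knn_hot_deck(data, k=1)[3, 1])  # prints 10.0
```

Because the imputed value is always copied from an actual donor, the same routine applies to coded nominal data and to ratio data. Varying `k` and `p` corresponds loosely to the simulations over neighbor counts and distance measures that the abstract mentions.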

Keywords

Missing Data; Imputation; K-nn; Hot Deck

