
A Missing Data Imputation by Combining K Nearest Neighbor with Maximum Likelihood Estimation for Numerical Software Project Data  

Lee, Dong-Ho (Department of Computer Science, KAIST)
Yoon, Kyung-A (Department of Computer Science, KAIST)
Bae, Doo-Hwan (Department of Computer Science, KAIST)
Abstract
Missing data is a common problem when building analysis or prediction models from software project data. For small software project data sets, imputation methods are known to handle missing data more effectively than deletion methods. K nearest neighbor (K-NN) imputation is a suitable imputation method for software project data, but it cannot use the non-missing information in incomplete project instances. In this paper, we propose a missing data imputation approach for numerical software project data that combines K nearest neighbor imputation with maximum likelihood estimation; we also extend the average absolute error measure with normalization for more accurate evaluation. Our approach overcomes this limitation of K nearest neighbor imputation and outperforms K nearest neighbor imputation on our real data sets.
Keywords
missing data imputation; K-NN; maximum likelihood estimation; software project data
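To make the limitation named in the abstract concrete, the following is a minimal sketch of baseline K-NN imputation in which only fully observed project instances can act as donors, so the observed values of other incomplete instances are never used. It is not the authors' combined K-NN and maximum likelihood method, whose details are not given here. The function names, the toy data, and the choice to normalize each absolute error by its column's value range are all assumptions made for illustration; the abstract does not say how the error measure is normalized.

```python
import math


def knn_impute_complete_donors(rows, k=2):
    """Fill None entries in each incomplete row from its k nearest complete rows."""
    # Donor pool: only fully observed rows (the baseline limitation).
    donors = [r for r in rows if None not in r]
    if not donors:
        raise ValueError("no complete rows available as donors")
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        # Euclidean distance over the features this row actually has.
        def dist(donor):
            return math.sqrt(sum((donor[j] - row[j]) ** 2 for j in observed))

        nearest = sorted(donors, key=dist)[:k]
        # Replace each missing value with the mean of the nearest donors' values.
        filled = [
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ]
        result.append(filled)
    return result


def normalized_mae(truth, imputed, missing_cells):
    """Mean absolute error over imputed cells, each scaled by its column's range.

    Range scaling is an assumed normalization, chosen only for illustration.
    """
    total = 0.0
    for i, j in missing_cells:
        column = [r[j] for r in truth]
        col_range = (max(column) - min(column)) or 1.0  # guard constant columns
        total += abs(truth[i][j] - imputed[i][j]) / col_range
    return total / len(missing_cells)


# Hypothetical toy data: two numeric features per project, one value removed.
truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
observed = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, None]]
imputed = knn_impute_complete_donors(observed, k=2)
error = normalized_mae(truth, imputed, [(3, 1)])
```

Scaling each error by the column range keeps features measured in different units comparable when errors are averaged, which is one plausible motivation for a normalized error measure.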