http://dx.doi.org/10.3745/KTSDE.2018.7.6.205

A Comparative Study on Similarity Measure Techniques for Cross-Project Defect Prediction  

Ryu, Duksan (KAIST, School of Computing)
Baik, Jongmoon (KAIST, School of Computing)
Publication Information
KIPS Transactions on Software and Data Engineering, Vol.7, No.6, 2018, pp.205-220
Abstract
Software defect prediction helps allocate scarce project resources effectively by focusing software quality assurance activities on the modules identified as fault-prone. When a company has collected sufficient historical data, Within-Project Defect Prediction (WPDP) can predict fault-prone modules accurately. When a company does not maintain such historical data, a classifier can instead be built with Cross-Project Defect Prediction (CPDP). Because CPDP builds its classifier from data collected on projects in other organizations, the main obstacle to an accurate classifier is that the distributions of the source and target projects differ. Identifying effective similarity measure techniques is therefore crucial to obtaining high CPDP performance, and that is the aim of this paper. We compare various similarity measure techniques and evaluate the effectiveness of the similarity weights they produce, verifying the results with a statistical significance test and an effect size test. The results show that the k-Nearest Neighbor (k-NN), LOcal Correlation Integral (LOCI), and Range methods are the top three performers, and that their predictive performance is comparable to that of WPDP.
Keywords
Cross-Project Defect Prediction; Similarity Measure; Outlier Detection;
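Illustrative example: the abstract describes weighting source-project instances by their similarity to the target project before training a CPDP classifier. The short sketch below shows one way such similarity weights could be computed in the spirit of the k-NN method; the function name, the distance-to-weight formula, and the toy data are illustrative assumptions, not the paper's implementation.

# A minimal sketch of k-NN-style similarity weighting for cross-project
# defect prediction (CPDP). All names and the weighting formula are
# hypothetical; they are not taken from the paper.
import numpy as np

def knn_similarity_weights(source_X, target_X, k=10):
    """Weight each source instance by its distance to the target data.

    A source instance whose k-th nearest target neighbor is close gets a
    weight near 1; a distant one gets a weight near 0, so training can
    emphasize source data distributed like the target project.
    """
    weights = np.empty(len(source_X))
    for i, s in enumerate(source_X):
        # Euclidean distance from this source instance to every target instance.
        d = np.linalg.norm(target_X - s, axis=1)
        # Distance to the k-th nearest target neighbor (capped by target size).
        kth = np.sort(d)[min(k, len(d)) - 1]
        weights[i] = 1.0 / (1.0 + kth)
    return weights

# Toy usage: 100 source instances and 30 target instances, 20 metrics each.
rng = np.random.default_rng(0)
source_X = rng.normal(size=(100, 20))
target_X = rng.normal(loc=0.5, size=(30, 20))
w = knn_similarity_weights(source_X, target_X, k=10)

A classifier that accepts per-instance weights (for example, scikit-learn's fit(X, y, sample_weight=w)) can then be trained on the weighted source data so that source instances resembling the target project dominate learning.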
References
1 S. Kim, E. Whitehead, and Y. Zhang, "Classifying software changes: Clean or buggy?," IEEE Trans. Softw. Eng., Vol.34, No.2, pp.181-196, 2008.
2 K. O. Elish and M. O. Elish, "Predicting defect-prone software modules using support vector machines," J. Syst. Softw., Vol.81, No.5, pp.649-660, May 2008.
3 E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," J. Syst. Softw., Vol.83, No.1, pp.2-17, Jan. 2010.
4 T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, "A Systematic Literature Review on Fault Prediction Performance in Software Engineering," IEEE Trans. Softw. Eng., Vol.38, No.6, pp.1276-1304, Nov. 2012.
5 T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, "Defect prediction from static code features: current results, limitations, new approaches," Autom. Softw. Eng., Vol.17, No.4, pp.375-407, May 2010.
6 M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: A benchmark and an extensive comparison," Empir. Softw. Eng., Vol.17, No.4-5, pp.531-577, Aug. 2012.
7 T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, pp.91-100, 2009.
8 Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Autom. Softw. Eng., Vol.19, No.2, pp.167-199, Jul. 2011.
9 Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Inf. Softw. Technol., Vol.54, No.3, pp.248-256, Mar. 2012.
10 J. Nam, S. J. Pan, and S. Kim, "Transfer defect learning," in Proceedings of the 35th International Conference on Software Engineering, pp.382-391, 2013.
11 D. Ryu, J. Jang, and J. Baik, "A Hybrid Instance Selection using Nearest-Neighbor for Cross-Project Defect Prediction," J. Comput. Sci. Technol., Vol.30, No.5, pp.969-980, 2015.
12 G. Woodbury, "An Introduction to Statistics." Cengage Learning, 2001.
13 D. Ryu, O. Choi, and J. Baik, "Value-cognitive boosting with a support vector machine for cross-project defect prediction," Empir. Softw. Eng., Vol.21, No.1, pp.43-71, Feb. 2016.
14 B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empir. Softw. Eng., Vol.14, No.5, pp.540-578, Jan. 2009.
15 P.-N. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining." Addison-Wesley, 2006.
16 T. Grbac, G. Mausa, and B. Basic, "Stability of Software Defect Prediction in Relation to Levels of Data Imbalance," in Proceedings of the 2nd Workshop of SQAMIA, 2013.
17 N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," J. Artif. Intell. Res., Vol.16, pp.321-357, 2002.
18 C. C. Aggarwal, "Outlier Analysis." New York, NY: Springer New York, 2013.
19 N. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," Am. Stat., Vol.46, No.3, pp.175-185, 1992.
20 H.-P. Kriegel, M. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp.444-452, 2008.
21 R. Hamming, "Error Detecting and Error Correcting Codes," Bell Syst. Tech. J., Vol.29, No.2, pp.147-160, 1950.
22 B. Raman and T. R. Ioerger, "Enhancing Learning using Feature and Example selection," Texas A&M Univ. Coll. Station. TX, USA, 2003.
23 E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Stat., Vol.33, No.3, pp.1065-1076, 1962.
24 M. Breunig, H. Kriegel, R. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp.93-104, 2000.
25 S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast outlier detection using the local correlation integral," in Proceedings of the 19th International Conference on Data Engineering (ICDE), pp.315-326, 2003.
26 S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, Vol.28, No.2, pp.129-137, 1982.
27 I. T. Jolliffe, "Principal Component Analysis." Springer, 2002.
28 T. Kohonen, "Self-organized formation of topologically correct feature maps," Biol. Cybern., Vol.43, No.1, pp.59-69, 1982.
29 C. M. Bishop, "Pattern Recognition and Machine Learning." New York, NY: Springer, 2006.
30 B. Turhan, A. Tosun Misirli, and A. Bener, "Empirical evaluation of the effects of mixed project data on learning defect predictors," Inf. Softw. Technol., Vol.55, No.6, pp.1101-1118, Jun. 2013.
31 P. C. Mahalanobis, "On the generalised distance in statistics," in Proceedings of the National Institute of Sciences of India, Vol.2, No.1, pp.49-55, 1936.
32 M. Jureczko and D. Spinellis, "Using Object-Oriented Design Metrics to Predict Software Defects," in Models and Methods of System Dependability. Oficyna Wydawnicza Politechniki Wroclawskiej, 2010, pp.69-81.
33 T. Menzies et al., "The PROMISE Repository of empirical software engineering data," 2012. [Online]. Available: http://openscience.us/repo/.
34 K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?," in Proceedings of the 7th International Conference on Database Theory (ICDT), pp.217-235, 1999.
35 S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Trans. Softw. Eng., Vol.34, No.4, pp.485-496, 2008.
36 M. Hall, E. Frank, and G. Holmes, "The WEKA data mining software: An update," ACM SIGKDD Explor. Newsl., Vol.11, No.1, pp.10-18, 2009.
37 T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Trans. Softw. Eng., Vol.33, No.1, pp.2-13, 2007.
38 S. Wang and X. Yao, "Using Class Imbalance Learning for Software Defect Prediction," IEEE Trans. Reliab., Vol.62, No.2, pp.434-443, Jun. 2013.
39 B. Turhan, A. Tosun, and A. Bener, "Empirical Evaluation of Mixed-Project Defect Prediction Models," in Proceedings of the 37th EUROMICRO Conference on Software Engineering and Advanced Applications, pp.396-403, 2011.
40 Y. Kamei, S. Matsumoto, A. Monden, K. I. Matsumoto, B. Adams, and A. E. Hassan, "Revisiting common bug prediction findings using effort-aware models," in Proceedings of the IEEE International Conference on Software Maintenance (ICSM), 2010.
41 M. Friedman, "The use of ranks to avoid the assumption of normality implicit in the analysis of variance," J. Am. Stat. Assoc., Vol.32, No.200, pp.675-701, 1937.
42 M. Friedman, "A comparison of alternative tests of significance for the problem of m rankings," Ann. Math. Stat., Vol.11, No.1, pp.86-92, 1940.
43 J. Demsar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., Vol.7, pp.1-30, 2006.
44 J. Tukey, "Comparing individual means in the analysis of variance," Biometrics, Vol.5, No.2, pp.99-114, 1949.
45 P. Nemenyi, "Distribution-free multiple comparisons," Ph.D. dissertation, Princeton University, 1963.
46 O. J. Dunn, "Multiple comparisons among means," J. Am. Stat. Assoc., Vol.56, No.293, pp.52-64, 1961.
47 F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bull., Vol.1, No.6, pp.80-83, 1945.
48 A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in 2011 33rd International Conference on Software Engineering (ICSE), pp.1-10, 2011.
49 D. M. J. Tax, "DDtools, the Data Description Toolbox for Matlab." 2014.
50 A. Vargha and H. D. Delaney, "A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong," J. Educ. Behav. Stat., Vol.25, No.2, pp.101-132, 2000.