A Pragmatic Framework for Predicting Change Prone Files Using Machine Learning Techniques with Java-based Software

  • Loveleen Kaur (Department of Computer Science and Engineering, Thapar University)
  • Ashutosh Mishra (Department of Computer Science and Engineering, Thapar University)
  • Received : 2019.11.07
  • Accepted : 2020.04.02
  • Published : 2020.09.30

Abstract

This study extensively analyzes the performance of various Machine Learning (ML) techniques in predicting the version-to-version change-proneness of Java source code files. Seventeen object-oriented metrics are used to predict change-prone files with 31 ML techniques, and the proposed framework is applied to consecutive releases of two Java-based software projects available as plug-ins. The models are validated using 10-fold cross-validation and inter-release validation, and statistical tests provide supplementary information on the reliability and significance of the results. The experimental results indicate that the ML techniques perform differently under the different validation settings. The results also confirm the proficiency of the selected ML techniques for developing change-proneness prediction models. Such models could help software engineers classify change-prone Java files in the initial stages of software development, which in turn aids in estimating the trend of change-proneness over future versions.
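The two validation settings named in the abstract differ in how training and test data are separated. The following is a minimal sketch of that difference only, not the authors' implementation: the function names and release labels are hypothetical, no classifier is trained, and shuffling and stratification are omitted for brevity.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    The files of a single release are partitioned into k folds, and each
    fold serves as the test set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def inter_release_pairs(releases: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each release with its successor: train on the first, test on the second."""
    return list(zip(releases, releases[1:]))


if __name__ == "__main__":
    # Hypothetical release labels, used only for illustration.
    for train_release, test_release in inter_release_pairs(["v1.0", "v1.1", "v1.2"]):
        print(f"train on {train_release}, test on {test_release}")
```

In the 10-fold setting every file comes from the same release, whereas in the inter-release setting a model is always evaluated on a later release than the one it was trained on. That difference is one plausible reason for techniques to rank differently under the two settings.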
