1. Introduction
Software faults are imperfections introduced during the software development process that cause software to fail to meet the desired expectations. Faults can be introduced by numerous problems occurring in the requirement specification, design, coding and testing phases. They can also be caused by external factors, such as environmental disturbances or human actions, either accidental or deliberate. If faults are introduced in early phases such as requirements and are not identified there, they can be very difficult to remove in later phases. Software design faults result from the design of algorithms, control, logic, data elements, module interface descriptions, and external software/hardware/user interface descriptions. Coding defects derive from errors in implementing the code; they can arise from a failure to understand programming language constructs or from miscommunication with the designers. Test plans, test cases, test harnesses, and test procedures can also contain defects. A study commissioned by the Department of Commerce's National Institute of Standards and Technology (NIST) estimated that software failures cost the U.S. economy approximately $60 billion each year, equivalent to about 0.6 percent of gross domestic product. The study noted that improvements in testing could reduce this cost by approximately $20 billion, but would not eliminate all software errors [1].
The cost of correcting a fault increases so rapidly because of the rigorous process of software fault detection and correction. In the early phases the software exists only on paper, and correcting a fault may require only a change in the documents. Software bug removal takes around 40% of the total development effort [2]. Identifying which modules are most likely to be error prone is therefore a critical task: if faulty modules are identified in early phases, software resources for bug removal can be allocated to those modules in an appropriate manner. Software fault prediction can be defined as follows. Let Xd = {x1, x2, …, xn} be a set of vectors in Rp, where xi represents a software module described by the software metrics assigned to it at design time, and let Y = {fp, nfp} denote its binary class (fault-prone or not fault-prone). Assume there is an unknown function S which maps Xd to Y. Our goal is to estimate S as an efficient combination of preprocessing and learner for fault prediction.
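As a concrete illustration of this formulation, the following sketch (a hypothetical example assuming Python with scikit-learn and synthetic metric data, not the tooling used in this study) fits a classifier that approximates S from design-time metric vectors and reports its AUC:

# Illustrative sketch: estimating the unknown function S that maps
# design-time metric vectors to the binary class {fp, nfp}.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical data: 200 modules described by 4 design metrics
# (e.g., branch count, cyclomatic complexity, design complexity, call pairs).
X = rng.poisson(lam=5.0, size=(200, 4)).astype(float)
y = (X[:, 1] + rng.normal(0, 2, 200) > 7).astype(int)   # 1 = fault-prone (fp), 0 = nfp

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=1)
S_hat = GaussianNB().fit(X_train, y_train)               # learned approximation of S
print("AUC:", roc_auc_score(y_test, S_hat.predict_proba(X_test)[:, 1]))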
In this paper we develop models for faulty module prediction at the design phase using design-time software metrics. For this purpose we have extracted design metrics from 11 projects; the fault data for these 11 software projects are taken from the PROMISE repository [3]. After describing the project data sets, we conduct an extensive empirical study investigating the effect of preprocessing and dimensionality reduction on different machine learning techniques for fault identification using design metrics only. We find that models built from design metrics are useful because they are available in early phases of the development life cycle and show efficient predictability. We therefore recommend that, in future, researchers explore the effects of attributes from the design phase of the development life cycle on software fault prediction. It would also be interesting to study how faults identified at design time can improve predictive capability at subsequent levels. The remainder of the paper is organized as follows. Section 2 describes related work in the area. Section 3 discusses the software projects and their corresponding design metrics. Section 4 presents the 28 learning schemes, which combine preprocessing and dimensionality reduction with machine learners. Section 5 provides the results and discusses their implications, and Section 6 concludes with a summary and future work.
2. Related Work
Software researchers have observed that the cost incurred to correct faults increases exponentially with the time they remain uncorrected in the system. It is therefore advisable to eliminate as many faults as early as possible during software development [4]. Code reviews are labor-intensive, and one member can inspect only 8 to 20 LOC/minute; the same review process is repeated for all members of the review team, which can be as large as four or six [7]. One effective approach for early identification of faulty modules is software fault prediction, which investigates and learns the characteristics of individual code and design segments to predict the fault-prone modules [4]. A number of earlier studies have been conducted, but they used metrics collected after the coding phase, i.e., all metrics of both the coding and design phases [4, 5, 16]. In this study, our emphasis is on an early fault prediction model at the design phase for identification of faulty segments. For this purpose, a large number of software design characteristics have been extracted from standard software projects. These design characteristics include design metrics such as Branch_Count, Call_Pairs, Condition_Count and Cyclomatic_Complexity. The detailed list of design metrics used in the study is shown in Table 2.
Table 2. Design metrics used in this study
Koru et al. investigated change-prone classes with tree-based models on K-Office and Mozilla using class-level metrics. They found that the majority of changes are rooted in a small proportion (around 20%) of classes, and that tree-based models were very useful for identifying these change-prone classes [19]. We have also conducted a study to identify the impact of feature reduction methods on the software fault prediction performance of learners [18].
In 2012, Singh et al. [16] applied two different machine learning algorithms, J48 (a decision tree) and Naive Bayes, to an open source project developed in an object-oriented language. They used object-oriented metrics and reported prediction accuracies of 98.15% and 95.58%, respectively. Shull et al. [6] found that around 60% of bugs can be identified by manual review. Lessmann et al. [5] suggested that, to gain confidence, we need reliable research and comparative studies of software prediction models. Hall [20] mentioned that data preprocessing, such as the removal of non-informative features or the discretization of continuous attributes, may improve the performance of some classifiers. Thus, we stress that the aim of this paper is to evaluate different pre-processing and dimensionality reduction processes for finding effective fault prediction models at the design phase.
3. Metrics Description
The software projects used in this study are from the NASA Metrics Data Program (MDP), and we have extracted design-level metrics for these projects. The eleven software projects used are described in Table 1. Column 2 of Table 1 shows the total number of modules in each software project and column 3 shows the number of modules that contain faults. Column 5 gives the language used in the development of the project. The KC1 and KC3 projects of a NASA mission are responsible for storage management of ground data. MW1 is a zero-gravity combustion experiment. MC1 is software for a combustion experiment on the space shuttle. MC2 is a video guidance system. JM1 is a real-time ground system that uses simulations to generate certain predictions for the mission. PC1, PC2, PC3, PC4 and PC5 are flight software for earth-orbiting satellites. More details of the project development can be found in [21].
Table 1. Datasets used in the study
We have extracted the design metrics shown in Table 2 from these projects. The projects KC3, PC3, PC4, MW1, MC2 and PC2 have 15 design metrics in total; MC1 and PC5 have 14 design metrics; and KC1, JM1 and PC1 have four design attributes, including branch count, cyclomatic complexity and design complexity. All 11 projects contain a module segment id and its associated fault status, which indicates whether the module is faulty or not faulty. After removing the other attributes, the project data are ready for pre-processing. We have used three different pre-processing schemes (in addition to leaving the data unchanged), and after pre-processing we have applied seven different machine learning algorithms. A detailed description of each project is given in Table 1. The design metrics shown in Table 2 were computed with McCabe IQ 7.1 [8].
The design metrics are derived from design-phase artifacts such as design diagrams, data flow diagrams, control flow graphs and Unified Modelling Language diagrams. Their main advantage is that they are available before the coding phase. The design metrics mined from design artifacts and used in this study are listed in Table 2.
4. Experimental Design
Seven machine learning algorithms from Weka were used for the empirical data analysis [9]. We use 11 NASA MDP projects, with the design metrics extracted for each project, and we use the area under the receiver operating characteristic curve (AUC) to measure the performance of the fault prediction models. This experiment is intended to demonstrate the effectiveness of early fault prediction with different preprocessing and learning schemes applied to design metrics. We have used various well-known and effective classifiers from different categories. For this purpose, 28 different learning schemes have been designed from the following data pre-processors and learning algorithms.
• Four data pre-processors: None (data unchanged); Discretization (entropy-based discretization); Binning (equal-width binning); Principal component analysis (PCA).
• Seven learning algorithms: Naive Bayes (NB), Voting feature intervals (VFI), Simple Logistic, KNN, J48, OneR and Random Forest.
Twenty-eight learning schemes: the combination of four data pre-processors and seven learning algorithms yields a total of 28 different learning schemes. For example, the Naive Bayes learner has four schemes:
• NB + None
• NB + PCA
• NB + binning
• NB + entropy-based discretization
A sketch of how such a grid of schemes can be assembled is shown below.
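The grid of schemes can be enumerated mechanically. The sketch below is an assumed analogue using scikit-learn pipelines rather than the Weka components of the experiments: KBinsDiscretizer stands in for the binning and entropy-based steps, and VFI and OneR are omitted because scikit-learn has no direct equivalents.

# Sketch of enumerating pre-processor x learner schemes with scikit-learn
# stand-ins for the Weka components used in the paper.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

preprocessors = {
    "none": "passthrough",
    "pca": PCA(n_components=0.95),
    "equal_width_binning": KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
    # scikit-learn has no Fayyad-Irani discretizer; quantile binning is a rough proxy.
    "discretization_proxy": KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"),
}
learners = {
    "NB": GaussianNB(),
    "SimpleLogistic_proxy": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "J48_proxy": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    # VFI and OneR have no scikit-learn equivalents and are omitted here.
}

schemes = {
    f"{l_name}+{p_name}": Pipeline([("prep", p), ("clf", l)])
    for p_name, p in preprocessors.items()
    for l_name, l in learners.items()
}
print(len(schemes), "scheme analogues built (the paper's full grid of 28 also includes VFI and OneR)")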
PCA is used widely in different analysis domains, such as pattern recognition, neurocomputing and computer graphics, for extracting relevant information from a data set. Since PCA is a well-known, general-purpose dimensionality reduction technique, its details are omitted here.
We used the state-of-the-art supervised discretization technique developed by Fayyad and Irani [10] for binary discretization of a continuous attribute into two intervals. This entropy-based discretization process is a viable choice, as it improves accuracy [17, 20].
Given a set of samples S, the entropy of S can be calculated from Eq. (1):
Ent(S) = − Σi p(Ci, S) log2 p(Ci, S),    (1)
where p(Ci, S) is the proportion of samples in S that belong to class Ci.
Now suppose S is partitioned into two intervals S1 and S2 using a boundary T. The entropy after the partition can be calculated from Eq. (2):
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2).    (2)
T is taken from the midpoints of adjacent feature values, and Ent(S1) and Ent(S2) are the entropies of S1 and S2, respectively. The goal is to obtain the maximum information gain after the split, where the gain is calculated as
Gain(S, T) = Ent(S) − E(S, T).
All candidate boundaries are evaluated and the split with the maximum gain is chosen; equivalently, the partition point selected for discretization is the one that minimizes the entropy function E(S, T). The process is applied recursively to the resulting intervals, and partitioning continues only while the gain remains above an acceptable level δ, i.e. while
Ent(S) − E(S, T) > δ.
The minimum description length (MDL) principle is used in [10] to determine the stopping criterion for the recursive discretization process. Recursive partitioning of attribute A within a set of values S containing N samples stops if the gain satisfies
Gain(S, T) < log2(N − 1) / N + Δ(A, T; S) / N,
where
Δ(A, T; S) = log2(3^k − 2) − [k · Ent(S) − k1 · Ent(S1) − k2 · Ent(S2)]
and k, k1 and k2 are the numbers of class labels represented in S, S1 and S2, respectively.
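A compact reference implementation of this recursive entropy/MDL procedure is sketched below in Python (assuming NumPy); the study itself used the Weka implementation, so this is only an illustrative reconstruction of the criterion described above.

import numpy as np

def entropy(y):
    """Ent(S): class entropy of the label vector y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdl_cut_points(x, y):
    """Fayyad-Irani binary splits of feature x, accepted while the MDL criterion holds."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    cuts = []

    def split(lo, hi):
        xs, ys = x[lo:hi], y[lo:hi]
        n, ent_s = hi - lo, entropy(ys)
        best_gain, best_i = -1.0, None
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue                      # candidate boundaries lie between distinct values
            e = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
            gain = ent_s - e                  # Gain(S, T) = Ent(S) - E(S, T)
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            return
        s1, s2 = ys[:best_i], ys[best_i:]
        k, k1, k2 = len(np.unique(ys)), len(np.unique(s1)), len(np.unique(s2))
        delta = np.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(s1) - k2 * entropy(s2))
        if best_gain <= (np.log2(n - 1) + delta) / n:
            return                            # MDL stopping criterion: reject the split
        cuts.append((xs[best_i - 1] + xs[best_i]) / 2.0)
        split(lo, lo + best_i)                # recurse on both resulting intervals
        split(lo + best_i, hi)

    split(0, len(x))
    return sorted(cuts)

# Example: discretize a hypothetical cyclomatic-complexity column against fault labels.
x = np.array([1, 2, 2, 3, 8, 9, 10, 12, 3, 4], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])
print(mdl_cut_points(x, y))   # one cut point (6.0) separating low from high complexity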
The next pre-processing scheme used is the simple equal-width binning algorithm, which divides the data into k intervals of equal size. The width of the intervals is i = (max − min) / k, and the interval boundaries are min + i, min + 2i, …, min + (k − 1)i.
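As a small illustration (a sketch assuming NumPy, with the bin count k as a free parameter), equal-width binning of a single metric can be written as:

import numpy as np

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals of width (max - min) / k."""
    values = np.asarray(values, dtype=float)
    width = (values.max() - values.min()) / k
    edges = values.min() + width * np.arange(1, k)        # min+i, min+2i, ..., min+(k-1)i
    return np.digitize(values, edges)                     # bin index in 0..k-1

print(equal_width_bins([1, 2, 5, 9, 10, 20], k=4))        # -> [0 0 0 1 1 3]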
In this experiment we have used 10-fold cross validation: we randomly take 90 percent of the data as training data and the remaining 10% as test data, and this process is repeated 10 times. The experiment yields 3,080 AUC values (11 data sets × one design metric group × 7 machine learners × 4 pre-processors per learner × 10 runs). For each model we performed the same cross validation to obtain the AUC. Table 3 shows the AUC for each combination of data set, pre-processor and machine learner, and we compare the performance of the models built from pre-processing and machine learning on the design metrics.
Table 3. AUC comparison of the 28 different learning schemes
4.1 Machine learners
The first of the seven learning algorithms is the Naive Bayes classifier (NB). NB has been used extensively in fault-proneness identification, for example in [4]. Naive Bayes is a probabilistic classifier; it is easy to apply to different types of datasets and provides good results. It assumes that feature values are statistically independent given the target value of the instance, so the probability of observing the conjunction of attribute values is just the product of the probabilities of the individual attributes [15]. The target value output by Naive Bayes is therefore
vNB = argmax over vj ∈ V of P(vj) ∏i P(ai | vj),
where vNB denotes the target value output by the Naive Bayes classifier and the conditional probabilities P(ai | vj) are estimated from the training set. The OneR classifier is a simple, accurate algorithm that generates one rule for each predictor in the data and then selects the rule with the smallest total error as its "one rule"; studies have shown that OneR produces rules only slightly less accurate than state-of-the-art classification algorithms. KNN (k-nearest neighbors) is a simple algorithm that stores all available cases and predicts the class of a test instance from a similarity measure: it finds the training instance closest to the test instance in Euclidean distance and predicts the same class as that training instance, taking the first match if several instances lie at the same distance. Simple Logistic builds logistic regression models, fitting them with LogitBoost using simple regression functions as base learners and determining how many iterations to perform by cross-validation, which provides automatic attribute selection; details can be found in [14]. VFI (Voting Feature Intervals) classifies by letting discretized numeric attribute intervals vote [9]. J48 is a Java implementation of Quinlan's C4.5 algorithm [11]. J48 and Random Forest are well-known algorithms, so their details are skipped here. The procedure we used to evaluate the models is 10-fold cross validation for each dataset with each pre-processing and machine learning combination.
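Before turning to the evaluation procedure, the following toy sketch (with made-up prior and conditional probabilities for two discretized design metrics) illustrates how the Naive Bayes decision rule given above combines them into a prediction:

# Toy illustration of vNB = argmax_vj P(vj) * prod_i P(ai | vj)
# using made-up probabilities for two discretized design metrics.
priors = {"fp": 0.2, "nfp": 0.8}
cond = {
    # P(attribute value | class), as if estimated from a hypothetical training set
    ("complexity=high", "fp"): 0.7, ("complexity=high", "nfp"): 0.2,
    ("branch_count=high", "fp"): 0.6, ("branch_count=high", "nfp"): 0.3,
}

def naive_bayes_predict(attributes):
    scores = {}
    for v in priors:
        score = priors[v]
        for a in attributes:
            score *= cond[(a, v)]
        scores[v] = score
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_predict(["complexity=high", "branch_count=high"])
print(label, scores)   # fp: 0.2*0.7*0.6 = 0.084 vs nfp: 0.8*0.2*0.3 = 0.048 -> "fp"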
Procedure EVALUATE (data, learner)
Input: data – the project data on which the learner is built [KC1, KC3, PC1, PC3, PC4, MW1, MC2, JM1, MC1, PC2, PC5]
learner – the learning scheme [Naive Bayes (NB), VFI, Simple Logistic, KNN, J48, OneR, Random Forest]
Preprocessing = {None, PCA, equal-width binning, entropy-based discretization}
Output: D_AUC – the AUC of D_PREDICTOR on D_TEST over 10-fold cross validation
FOR i = 1 TO 10
   FOR EACH data
      FOR EACH preprocessing
         P_DATA = preprocess(data)                 // apply the pre-processing to the data
         Split P_DATA into 90% D_TRAIN and 10% D_TEST
         D_PREDICTOR = train LEARNER on D_TRAIN    // construct the predictor from D_TRAIN
         D_AUC = test D_PREDICTOR on D_TEST        // evaluate the predictor on the test data
      END FOR
   END FOR
END FOR
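A runnable analogue of this procedure is sketched below, under the assumption of scikit-learn stand-ins for the Weka learners and a generic CSV file of design metrics with a 'defective' label column (the file layout and column name are hypothetical); it reports the mean AUC over stratified 10-fold cross validation for each pre-processor/learner pair.

# Sketch of the evaluation procedure with scikit-learn stand-ins (not the
# original Weka setup): 10-fold cross validation, AUC per scheme per dataset.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

preprocessors = {
    "none": "passthrough",
    "pca": PCA(n_components=0.95),
    "binning": KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
}
learners = {"NB": GaussianNB(), "RF": RandomForestClassifier(n_estimators=100)}

def evaluate(csv_path, label_column="defective"):          # hypothetical file layout
    data = pd.read_csv(csv_path)
    X = data.drop(columns=[label_column]).to_numpy(dtype=float)
    y = data[label_column].to_numpy()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    results = {}
    for p_name, prep in preprocessors.items():
        for l_name, learner in learners.items():
            pipe = Pipeline([("prep", prep), ("clf", learner)])
            aucs = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
            results[f"{l_name}+{p_name}"] = aucs.mean()     # mean AUC over the 10 folds
    return results

# Example usage with a hypothetical file name:
# for name, auc in evaluate("kc3_design_metrics.csv").items():
#     print(f"{name}: AUC = {auc:.3f}")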
We use the area under the receiver operating characteristic curve (AUC) to evaluate the effectiveness of fault prediction, as it is one of the most informative indicators of prediction accuracy. If the AUC is greater than 0.5, the learner has some capability to identify fault-prone modules [5]. The classifier performance measures are derived from the confusion matrix, which gives the TP, FP, FN and TN counts. TP and TN are the correctly identified faulty and not-faulty modules, respectively. An FP occurs when the outcome is predicted as faulty when the module is actually not faulty, and an FN occurs when the outcome is incorrectly predicted as not faulty when the module is actually faulty.
Receiver Operating Characteristic (ROC) curves compare classification performance by plotting the TP rate on the y-axis against the FP rate on the x-axis across all possible decision thresholds.
An ideal ROC curve has a shape that covers the maximum area; the ideal point on the curve is reached when no positive examples are classified incorrectly and all negative examples are classified as negative. ROC curves of continuous measures are problematic when two curves intersect each other: as shown in Fig. 1, curve C1 intersects C2, and determining which model dominates requires obtaining the area under each curve. Every model gets a different value for the area under the curve, so the AUC is used for the evaluation of model performance. The AUC is independent of the decision criterion selected and of prior probabilities, and the comparison of AUCs can establish a dominance relationship between classifiers [12]. The bigger the area under the curve, the better the model. Unlike other measures, the AUC does not depend on the imbalance of the training set [13]; thus, comparing the AUCs of two classifiers is fairer and more informative than comparing their misclassification rates.
Fig. 1. ROC curves of two experiments
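For reference, the following sketch (assuming scikit-learn and synthetic scores for two hypothetical classifiers C1 and C2) shows how the ROC points, the AUC used for this comparison, and the TP/FP/FN/TN counts of the confusion matrix are computed from predicted scores:

# Comparing two classifiers by AUC when their ROC curves may cross,
# using synthetic scores purely for illustration.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                       # 1 = faulty, 0 = not faulty
scores_c1 = y_true * 0.5 + rng.random(200) * 0.7            # hypothetical classifier C1
scores_c2 = y_true * 0.3 + rng.random(200) * 0.9            # hypothetical classifier C2

for name, scores in [("C1", scores_c1), ("C2", scores_c2)]:
    fpr, tpr, _ = roc_curve(y_true, scores)                 # points of the ROC curve
    auc = roc_auc_score(y_true, scores)                     # area under that curve
    tn, fp, fn, tp = confusion_matrix(y_true, scores > 0.5).ravel()
    print(f"{name}: AUC={auc:.3f}  TP={tp} FP={fp} FN={fn} TN={tn}")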
5. Results
The AUC is measured across the 10-fold cross validation for each learning scheme on each data set, and Fig. 2 shows the performance of each model on each data set. It can be seen from Table 3 that VFI and Naive Bayes with discretization performed best on average over the 11 data sets, with AUCs of 0.736 and 0.735, respectively. This suggests choosing the Naive Bayes or VFI learning scheme after discretization across different data sets. Looking at individual data sets, for KC1, Naive Bayes with PCA performed well with an AUC of 0.716; for PC1, Random Forest performed best with an AUC of 0.621; and for KC3 (AUC 0.79) and PC3 (AUC 0.708), Naive Bayes with discretization performed better than the other models. On the PC4 design data, Random Forest with discretization performed best with an AUC of 0.791. For MW1, Simple Logistic with PCA achieves an AUC of 0.766; for MC2, Simple Logistic with PCA achieves 0.688; and for JM1, Simple Logistic with PCA achieves 0.676. For MC1, Simple Logistic with discretization achieves an AUC of 0.89, and PC2 achieves an AUC of 0.757 with Naive Bayes. On the PC5 data set, Simple Logistic with discretization achieved the highest AUC of 0.96, while Naive Bayes with discretization achieved 0.95. This reveals that different attribute selectors (pre-processors) can be suitable for different learning algorithms.
Fig. 2. Performance measured by AUC for the 28 schemes
It is evident from the results in Table 3 that applying the machine learning algorithms after the discretization process improves the performance of the Naive Bayes, VFI, KNN, Simple Logistic and Random Forest classifiers. The performance of each scheme is shown in Fig. 2, which clearly indicates that Naive Bayes performs consistently well on every dataset.
Five out of the seven classifiers showed improved prediction capability, measured by the average AUC over the 11 data sets, when used with discretized data. When PCA is used as the dimensionality reduction pre-processing step, only Simple Logistic and OneR show improvement. The maximum AUC for each data set is shown in bold font.
The binning pre-processing step only slightly improved the KNN AUC, from 0.583 to 0.586. Furthermore, Naive Bayes and VFI with discretization outperformed, on average, all of the 28 classification strategies. To summarize, since the Naive Bayes and VFI learning schemes dominate when applied after discretization, we should choose the Naive Bayes or VFI scheme across different data sets. The results show that Naive Bayes and VFI with discretization improve the performance of software fault prediction, and that entropy-based discretization also helps improve the performance of the other learning algorithms. The results also show that principal component analysis degrades the performance of Naive Bayes; similar findings have been reported by other researchers, such as Hall et al. [20].
6. Conclusion
Developing fault-free software within a budgeted cost is a major concern of software organizations. Predicting faulty modules prior to coding improves knowledge about those modules, so they can be corrected at lower cost and with less effort. The goal of predicting fault-prone modules at design time using machine learning and data pre-processing is to minimize testing cost and to develop reliable software. Using this information at design time, software managers can make changes to module designs and allocate project resources toward the modules that are fault prone; fault prediction thus enables developers to address faults in the design phase itself. In this research we evaluated and analyzed techniques for predicting faults at the design phase, performing extensive experiments to explore the impact of the different elements of a learning scheme on fault prediction. From these results, we conclude that a data pre-processor/dimensionality reduction method can play different roles with different learning algorithms on different data sets. The results strongly endorse building fault predictors using VFI or Naive Bayes with discretization, and the discretization process was also shown to improve the prediction capability of other learners. In this experiment, each data set was divided into 10 parts and 10-fold cross validation was performed; in each pass, 90 percent of the data was used as training data and the remainder as test data, and the study produced 3,080 AUC values in total. The results establish that design metrics can be used as accurate software fault indicators in the early stages of software development.
References
[1] http://www.cse.lehigh.edu/~gtan/bug/localCopies/nistReport.pdf
[2] B. Boehm, Software Engineering Economics, Prentice Hall, Englewood Cliffs, NJ, 1981, p. 40.
[3] The PROMISE repository, www.promisedata.org
[4] T. Menzies, J. Greenwald, and A. Frank, "Data Mining Static Code Attributes to Learn Defect Predictors," IEEE Trans. Software Eng., vol. 33, no. 1, pp. 2-13, Jan. 2007. https://doi.org/10.1109/TSE.2007.256941
[5] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings," IEEE Trans. Software Eng., vol. 34, no. 4, pp. 485-496, July/Aug. 2008. https://doi.org/10.1109/TSE.2008.35
[6] F. Shull, V. Basili, B. Boehm, A. Brown, P. Costa, M. Lindvall, et al., "What We Have Learned about Fighting Defects," in Proceedings of the 8th International Software Metrics Symposium, Ottawa, Canada, 2002, pp. 249-258.
[7] T. Menzies, D. Raffo, S. Setamanit, Y. Hu, and S. Tootoonian, "Model-Based Tests of Truisms," in Proceedings of IEEE ASE 2002.
[8] "DO-178B and McCabe IQ," available at http://www.mccabe.com/iq_research_whitepapers.htm
[9] Weka machine learning toolkit, http://www.cs.waikato.ac.nz/~ml/weka/
[10] U. M. Fayyad and K. B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1993, pp. 1022-1027.
[11] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[12] S. Lee, "Noisy Replication in Skewed Binary Classification," Computational Statistics and Data Analysis, vol. 34, 2000.
[13] A. Kolcz, A. Chowdhury, and J. Alspector, "Data Duplication: An Imbalance Problem," in Workshop on Learning from Imbalanced Data Sets (ICML), 2003.
[14] N. Landwehr, M. Hall, and E. Frank, "Logistic Model Trees," Machine Learning, vol. 59, no. 1-2, pp. 161-205, 2005. https://doi.org/10.1007/s10994-005-0466-3
[15] T. Shatovskaya, V. Repka, and A. Good, "Application of the Bayesian Networks in the Informational Modeling," in Proceedings of the International Conference on Modern Problems of Radio Engineering, Telecommunications, and Computer Science, Lviv-Slavsko, Ukraine, 2006, p. 108.
[16] P. Singh and S. Verma, "Empirical Investigation of Fault Prediction Capability of Object Oriented Metrics of Open Source Software," in 2012 International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 323-327, May 30 - June 1, 2012.
[17] P. Singh and S. Verma, "An Investigation of the Effect of Discretization on Defect Prediction Using Static Measures," in IEEE International Conference on Advances in Computing, Control, and Telecommunication Technologies, 2009, pp. 837-839.
[18] P. Singh and S. Verma, "Effectiveness Analysis of Consistency Based Feature Selection in Software Fault Prediction," International Journal of Advancements in Computer Science & Information Technology, vol. 02, no. 1, pp. 01-09, 2012.
[19] A. G. Koru and H. Liu, "Identifying and Characterizing Change-Prone Classes in Two Large-Scale Open-Source Products," Journal of Systems and Software, vol. 80, no. 1, pp. 63-73, 2007. https://doi.org/10.1016/j.jss.2006.05.017
[20] M. A. Hall and G. Holmes, "Benchmarking Attribute Selection Techniques for Discrete Class Data Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, 2003.
[21] G. Boetticher, T. Menzies, and T. J. Ostrand, "The PROMISE Repository of Empirical Software Engineering Data," West Virginia University, Lane Department of Computer Science and Electrical Engineering, 2007.