1. Introduction
Text classification has been deployed in several domains to automate systems in order to decrease computational time and cost. The goal of such systems is to assign texts to the proper classes based on their content. Examples include filtering of spam emails [1], classification of web pages [2], sentiment analysis [3], identification of authors [4], and design pattern classification and selection [5, 6]. In these domains, the use of a large number of features affects the classification decisions and the computation time of the classifiers used in an automated system [7]. In order to reduce the adverse effect of uninformative features in text categorization, filter-based (rather than embedded and wrapper) feature selection techniques are applied [8]. Numerous filter-based methods, such as DF (Document Frequency) [9], IGI (Improved Gini Index) [10], Correlation, GR (Gain Ratio), IG (Information Gain) [11], and DFS (Distinguishing Feature Selector) [12], have been introduced and applied in text classification domains [7, 13]. However, due to the variation in the discriminative power of filter-based feature selection techniques and in the feature sets they construct, a systematic way to construct a more representative feature set is still required. Researchers have successfully applied deep learning algorithms in several domains, such as fault localization [14], organization of design patterns according to expert opinion [15], extraction of semantic features from source code, digit recognition [16], code suggestion via modeling of programming languages [17], use of the structural information of programs, and use of program source code to extract features for defect prediction [18, 19]. Motivated by the successful application of deep learning algorithms in these areas [20] and by the need for a more representative feature set, we leverage a powerful deep learning technique.
Like Hussain et al. [15], we leverage an influential and representative algorithm, namely the DBN (Deep Belief Network) [21], to learn features from the constructed VSM (Vector Space Model). The DBN algorithm aims to reconstruct the training data with high probability. A DBN is a multi-level neural network that obtains a high-level abstraction of the input data by constructing a deep architecture. It consists of one visible layer and several hidden layers. Each layer is composed of a number of stochastic neurons and is fully connected with its adjacent layers. We perform a few pre-processing tasks, such as removal of stop words and word stemming, to bring the data into feature vector form together with the feature frequencies. The input of the DBN algorithm is constructed from these feature vectors. In this paper, we conduct several experiments. In each experiment, three widely used classifiers are used to assess the efficacy of the proposed method in terms of improving their classification decisions. The major contributions of the proposed work are:
- We use the DBN algorithm to retrieve the influential features. The DBN learns these features from the document vocabulary in order to increase the classifier’s performance.
- We appraise the capability of the proposed method by comparing it with state-of-the-art filter-based feature selection techniques in terms of the construction of a more representative feature set.
- We employ the proposed method with three well-known classifiers and three benchmark datasets in the context of text classification and measure its efficacy.
Several weighting schemes are used to assign weights to features and rank them accordingly. Well-known weighting schemes include Binary, TF (Term Frequency), Entropy, TFIDF (Term Frequency Inverse Document Frequency), and LTC (Length Term Collection) [22].
The remainder of the article is organized in eight sections. In Section 2, we summarize the related work on feature selection techniques and on the application of deep learning in the software engineering domain. In Section 3, we describe the working process of the proposed method. In Section 4, we present a toy case study. In Section 5, we present the experimental procedure. In Section 6, we discuss the experimental results and their implications. In Section 7, we summarize the evaluation results. In Sections 8 and 9, we present the threats to the validity of the proposed method and the conclusion, respectively.
2. Related Work
We summarize the related work to highlight the use of existing filter-based techniques and the application of deep learning (for feature set construction) to improve the performance of automated systems based on text classification.
2.1. Feature Selection Methods
Feature selection techniques are used to decrease the adverse effect of uninformative features. The existing feature selection techniques can be categorized into three schemes: filter-based, wrapper, and embedded. The methods of the last two schemes (i.e. wrapper and embedded) need a learning model [7]. Ogura et al. [23] describe the characteristics of filter-based methods and group them as one-sided or two-sided. In one-sided methods, each feature is assigned several scores, depending on the number of class labels. These scores can be grouped into membership class scores (i.e. Score ≥ 0) and non-membership class scores (i.e. Score < 0). The Odds Ratio (OR) is a common example of a one-sided method. The aim of OR is to compute the membership and non-membership scores of each feature by normalizing its numerator and denominator values. In two-sided methods, a single score (based on the combination of non-membership and membership scores) is associated with each feature. IGI, DF, and IG are common examples of two-sided feature selection methods [6, 13]. Similarly, Tasci and Gungor analysed certain filter-based techniques in terms of assigning a single or a multi-class based score to each feature and grouped them as local and global [24]. The classifier’s performance in text classification can also be improved by handling the class imbalance issue. In this regard, Gunal empirically investigated the improvement in classification performance obtained with a genetic algorithm and filter-based methods [25], and reported that combined filter methods outperform a single filter method.
Recently, Uysal presented IGFSS (Improved Global Feature Selection Scheme) to construct a more representative feature set and reported an improvement in the classification decisions of classifiers. The aim of IGFSS is to construct a feature set with an even class-wise distribution of features [7]. Although we observed several methods in the literature, feature selection remains an open issue for the research community in the context of text classification. This motivates us to present a new method for the construction of a more representative feature set to increase the classifier’s performance.
Table 1. Summary of existing related work
2.2. Deep learning in Software Engineering (SE) domain
In the Software Engineering (SE) domain, the research community has introduced and adopted several deep learning algorithms for certain tasks. The related work is summarized in Table 1, which indicates the importance of this line of research.
In a recent study, Hussain et al. [15] leveraged the DBN (Deep Belief Network), a deep learning algorithm, for the construction of a more representative feature set. They conducted three case studies to describe the effectiveness of their method for the organization of software design patterns from the perspective of different domains. The organization of software design patterns is evaluated with respect to the classification schemes of domain experts. Liu [27] leveraged the DBN algorithm and introduced a new text classification method to improve the classification decisions of the Support Vector Machine (SVM) for Chinese text. The aim of that study is to improve the F1 measure by incorporating a DBN to develop a high-level abstraction of the text for the classification decision via SVM. Readers interested in the applications of deep learning can refer to the work of Bengio et al. [20].
3. Proposed method
In order to address the bias of existing filter-based methods, we investigate the capability of the DBN algorithm to construct a more representative feature set that increases the classifier’s performance. The proposed method is structured into three phases, namely Pre-processing (Phase 1), Construction of Feature Set (Phase 2), and Text Classification (Phase 3).
Fig. 1. Layout of the proposed method
3.1. Preprocessing
In the pre-processing phase, three tasks are used to transform documents into strings and then into feature vectors. The first task removes frequent stop words. Such words, for example pronouns, conjunctions, and prepositions, carry no information. The second task stems groups of related words that have the same conceptual meaning. Through this task, numerous words are removed. Porter’s stemmer¹ is a widely used word stemming algorithm that transforms English words into their stems using a set of rules. Finally, the third task represents each feature vector as an integer vector, which is a prerequisite for the DBN algorithm. The research community has reported several weighting methods, such as Entropy, TF, TFIDF, Binary, LTC, and TFC. In this paper, we employ TF for the construction of the integer vectors. All integer vectors have the same size, as required for the input of the DBN algorithm.
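The three pre-processing tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the stop-word list is a tiny hypothetical sample, `crude_stem` is a simplified suffix stripper standing in for Porter’s stemmer, and the two example documents are invented.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to",
              "is", "are", "it", "he", "she", "they"}


def crude_stem(word):
    """Strip a few common suffixes. This is NOT Porter's algorithm,
    only a stand-in that keeps the sketch self-contained."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(document):
    """Task 1 and 2: lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_vectors(documents):
    """Task 3: return the sorted vocabulary and one TF integer vector per document."""
    processed = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in processed for t in doc})
    vectors = []
    for doc in processed:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors


# Invented example documents.
docs = ["The sunflower is growing in the garden",
        "Sunflowers and tulips are plants"]
vocab, vecs = tf_vectors(docs)
print(vocab)
print(vecs)
```

Because every vector is indexed by the same vocabulary, all vectors have the same length, which matches the fixed-size input requirement stated above.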
3.2. Feature Set Construction
In this phase, a feature set is constructed for the appropriate classification of documents using the DBN algorithm. The goal of the DBN algorithm is to learn from the given input (feature vectors) through a multi-level neural network and to reconstruct the input data with high probability. The DBN architecture has three kinds of layers: 1) one input layer, 2) N hidden layers, and 3) one output (i.e. top) layer. Each layer includes a set of nodes. The model can be adapted to the data complexity by selecting the number of hidden layers and the number of nodes in each layer. The joint distribution of the input and hidden layers is described by equation 1.
\(P\left(v, h^{1}, \ldots, h^{l}\right)=P\left(v | h^{1}\right)\left(\prod_{k=1}^{l} P\left(h^{k} | h^{k+1}\right)\right)\) (1)
Where \(v\), \(l\), and \(h^{k}\) denote the feature vector of the input layer, the total number of hidden layers, and the feature vector of the \(k^{th}\) layer, respectively. \(P\left(h^{k} | h^{k+1}\right)\) describes the conditional distribution of two adjacent layers (for example, \(k\) and \(k+1\)). This distribution is calculated from Restricted Boltzmann Machines (RBMs). For each node, the DBN learns the probabilities of the connections from the current node to the nodes of the upper level. Backward validation is performed by the DBN to reconstruct the input feature set by tuning the weights between nodes across the layers. The visible nodes of the \((j+1)^{th}\) RBM are obtained from the hidden nodes of the \(j^{th}\) RBM. The energy \(E\) for a visible layer \(v\) and a hidden layer \(h\) is computed through equation 2.
\(E(v, h)=-\sum_{i=1}^{n_{v}} b_{i}^{v} v_{i}-\sum_{j=1}^{n_{h}} b_{j}^{h} h_{j}-\sum_{i=1}^{n_{v}} \sum_{j=1}^{n_{h}} v_{i} h_{j} w_{i j}\) (2)
Where \(v_{i}\) and \(h_{j}\) are the \(i^{th}\) and \(j^{th}\) nodes of \(v\) and \(h\), \(b_{i}^{v}\) and \(b_{j}^{h}\) are the biases associated with these nodes, \(w_{i j}\) is the weight between \(v_{i}\) and \(h_{j}\), and \(n_{v}\) and \(n_{h}\) are the numbers of visible and hidden nodes. The energy \(E\) is related to the weights and to the probability of the generated data through equation 3 and equation 4.
\(\partial E(v, h) / \partial w_{i j}=-v_{i} h_{j}\) (3)
\(P(v)=\sum_{h} e^{-E(v, h)} / \sum_{u, s} e^{-E(u, s)}\) (4)
Where the denominator sums over all visible configurations \(u\) and hidden configurations \(s\). The activation probability of a hidden node \(h_{j}\) given the visible layer \(v\) of one RBM is represented via equation 5.
\(P\left(h_{j}=1 | v\right)=\sigma\left(b_{j}^{h}+\sum_{i=1}^{n_{v}} w_{i j} v_{i}\right)\) (5)
Where \(\sigma(x)=1 /\left(1+e^{-x}\right)\). The current state of the hidden nodes is used to derive a new visible node using equation 6.
\(P\left(v_{i}=1 | h\right)=\sigma\left(b_{i}^{v}+\sum_{j=1}^{n_{h}} w_{i j} h_{j}\right)\) (6)
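The energy of equation 2 and the two conditional distributions of equations 5 and 6 can be illustrated with a small sketch. It assumes the standard RBM form with binary units, in which equation 6 uses the visible bias and the hidden states; the weights, biases, and the input vector below are hypothetical toy values, and the function names are introduced only for this illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, b_v, b_h, w):
    """Equation 2: energy of a joint configuration; w[i][j] links v_i and h_j."""
    visible_term = sum(b_v[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b_h[j] * h[j] for j in range(len(h)))
    interaction = sum(v[i] * h[j] * w[i][j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction


def p_hidden_given_visible(v, b_h, w):
    """Equation 5: P(h_j = 1 | v) for every hidden node j."""
    return [sigmoid(b_h[j] + sum(w[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b_h))]


def p_visible_given_hidden(h, b_v, w):
    """Equation 6: P(v_i = 1 | h) for every visible node i."""
    return [sigmoid(b_v[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_v))]


def reconstruct(v, b_v, b_h, w, rng):
    """One up-down pass: sample hidden states from v, then return P(v | h)."""
    h = [1 if rng.random() < p else 0
         for p in p_hidden_given_visible(v, b_h, w)]
    return p_visible_given_hidden(h, b_v, w)


# Hypothetical toy parameters: 3 visible nodes, 2 hidden nodes.
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]
b_v = [0.0, 0.1, -0.1]
b_h = [0.2, -0.3]
v = [1, 0, 1]

print(energy(v, [1, 0], b_v, b_h, w))
print(p_hidden_given_visible(v, b_h, w))
print(reconstruct(v, b_v, b_h, w, random.Random(0)))
```

In a stacked DBN, the hidden activations of one such RBM would serve as the visible input of the next; training of the weights (for example by contrastive divergence) is outside the scope of this sketch.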
3.3. Text Classification
Several supervised learning techniques are used in automated systems based on text classification. However, no single classifier has been reported to outperform the others on all problems [28]. It is therefore recommended to identify the best-performing classifier for any new problem. Moreover, in the text categorization domain, the research community has reported good performance for Naïve Bayes, Support Vector Machine (SVM), and the C4.5 Decision Tree [7]. In this paper, we consider these classifiers to assess the efficacy of the proposed method.
4. Toy Case Study
The aim of the toy case study is to illustrate the analysis and working of the proposed method. The corpus of the toy case study consists of eight documents, 11 unique features, and 3 classes. The description of the documents is shown in Table 2.
Table 2. Document’s description for Toy case study
Although the selection of the top-N features remains an issue in building a more representative feature set, ranking the features by the discriminative power of a feature selection method helps to construct a feature set.
Table 3. Unique features with their frequencies and rankings for Toy case study
The pre-processing activities (Section 3.1) are applied to the corpus (Table 2). The list of unique features with their frequencies and rankings with respect to the discriminative power of each feature selection method is shown in Table 3.
The ranking of the unique features of the toy example helps to construct a feature set based on the top-N values (N can be selected in the range 1 to 11). The Document Frequency (DF) feature selection method depends on the frequency of a unique feature in the whole corpus. Consequently, the frequency of features can be used to rank them. For example, the features “Alchemilla” and “Sunflower” are ranked first based on their highest frequencies. Similarly, the features are ranked according to the discriminative power of IG, GR, GI, and the proposed method (DBN). The feature set with N=5 for each selected method is shown in Table 4.
Table 4. Feature sets with top N=5
Two main observations can be drawn from Table 4: 1) the feature sets differ in the features they include and 2) the order of the features differs between the constructed feature sets.
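The DF-based ranking and top-N selection described above can be sketched as follows. The sketch assumes the common definition of DF as the number of documents containing a feature and breaks ties alphabetically (both assumptions); the corpus is a hypothetical stand-in and does not reproduce the documents of Table 2.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each feature, the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return df


def top_n_features(documents, n):
    """Rank features by DF (ties broken alphabetically) and keep the top n."""
    df = document_frequency(documents)
    ranked = sorted(df, key=lambda feature: (-df[feature], feature))
    return ranked[:n]


# Hypothetical pre-processed corpus; NOT the documents of Table 2.
corpus = [
    ["sunflower", "alchemilla", "rose"],
    ["sunflower", "tulip"],
    ["alchemilla", "sunflower"],
    ["rose", "alchemilla"],
]

print(top_n_features(corpus, 2))
```

Replacing the sort key with another scoring function (for example IG or GR) would change both the selected features and their order, which is the variation noted in the two observations above.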
5. Experiment Setup
In this section, we describe the experiments performed to examine the efficacy of the proposed method against four global filter-based methods, namely DF [9], IGI [10], GR, and IG [7, 11].
5.1. Dataset
We used three widely used datasets with varying characteristics, namely Reuters ModApte Split², WebKB³, and Classic3⁴. The number of document classes and the sample size of each dataset are shown in Table 5.
Table 5. Summary of Datasets
5.2. Evaluation Criteria
In text classification, Micro-F1 and Macro-F1 are two well-known measures used to assess the performance of classifiers without or with class discrimination, respectively [7]. Since our aim in this paper is to formulate a more representative feature set without class discrimination, we used Micro-F1 (equation 9), which incorporates two widely used performance measures, precision (equation 7) and recall (equation 8), for the classification decisions.
\(P=\frac{\sum_{c=1}^{C} \mathrm{TP}_{c}}{\sum_{c=1}^{C}\left(\mathrm{TP}_{c}+\mathrm{FP}_{c}\right)}\), Micro-Averaging (7)
\(R=\frac{\sum_{c=1}^{C} \mathrm{TP}_{c}}{\sum_{c=1}^{C}\left(\mathrm{TP}_{c}+\mathrm{FN}_{c}\right)}\), Micro-Averaging (8)
\(\text{Micro-}F1=\frac{2 \cdot P \cdot R}{P+R}\) (9)
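Equations 7 to 9 can be computed directly from per-class counts, as in the following minimal sketch. The function name and the counts for the three classes are hypothetical and serve only to illustrate the pooling of counts across classes.

```python
def micro_f1(tp, fp, fn):
    """Micro-averaged precision, recall, and F1 from per-class counts.

    tp, fp, fn are lists holding the true positives, false positives,
    and false negatives of each class, in the same class order.
    """
    total_tp = sum(tp)
    predicted = total_tp + sum(fp)   # denominator of equation 7
    actual = total_tp + sum(fn)      # denominator of equation 8
    precision = total_tp / predicted if predicted else 0.0
    recall = total_tp / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts for three classes.
p, r, f1 = micro_f1(tp=[8, 5, 7], fp=[2, 1, 2], fn=[1, 3, 1])
print(p, r, f1)
```

Because the counts are pooled before the ratios are taken, large classes weigh more than small ones, which is why Micro-F1 is described above as a measure without class discrimination.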
5.3. Experiment Procedure for Construction of Feature Set
In the proposed method, we used the DBN algorithm to build a representative feature set to improve the classifier’s performance. We used the darch package (CRAN) for R, an environment for statistical computing, to perform the experiments. We used the DBN algorithm with default parameters (i.e. nodes in each layer, number of hidden layers, and iterations) for each dataset. However, the parameter values can be tuned with respect to data complexity in order to obtain more effective outcomes. Certain optimization techniques are required to find settings that lead to high predictive accuracy, and the selection of the best tuning technique depends on the hyperparameter configurations and their complex interactions [29, 30]. We used 10-fold cross validation [31] in each experiment and compared the classifier’s performance on the feature set constructed by the proposed method and on those constructed by existing filter-based techniques. To preserve consistency, we keep the number of filtered features the same as the number obtained from the proposed method.
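The 10-fold cross validation protocol can be sketched as an index generator. This is a plain (non-stratified) variant written for illustration; the text does not state whether stratification or a particular random seed was used, so both choices below are assumptions, as is the sample size of 25.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each document appears in exactly one test fold, so averaging the per-fold scores uses every sample once for testing.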
6. Results and Discussion
In order to assess the efficacy of the proposed method, we performed three experiments. In each experiment, we compare the efficacy of the proposed method with existing global filter-based methods in terms of improvement in the classifier’s performance. We coined the term classifier instance⁵ to denote the association between a classifier and a feature set constructed via a feature selection method. For example, five instances of Naïve Bayes are created with the five feature sets constructed in the experiment (e.g. NB+IG describes the association between NB and the feature set constructed through IG). We consider the top 100 ranked features to construct a feature set in each experiment. We used the evaluation criterion described in Section 5.2 and summarize the results in Table 6 and Fig. 2.
⁵ One instance of a classifier refers to its use with one feature set. For example, five instances of SVM are created with the five feature sets constructed through the feature selection methods under study.
Fig. 2. Classifier’s Performance on a) Reuters b) Classic3 and c) Web KB datasets with feature sets
The main findings from the experimental results are:
- The performance of the classifiers (in terms of micro F-measure) on the feature set constructed via the proposed method indicates its applicability and effectiveness.
- In all experiments, on the feature set constructed via the proposed method, the performance of SVM remains better than that of Naïve Bayes and Decision Tree, in some cases by a minor difference. For example, in Fig. 2-a, SVM (88%) outperforms NB (77%) and DT (87%).
- From the comparative assessment, we draw two conclusions. Firstly, we observed an improvement in the classifier’s performance, with a minor difference, between the feature sets constructed via the proposed method and via existing filter-based methods. For example, in Fig. 2-a, the performance of SVM on the feature set of Gini Index and on that of the proposed method differs only slightly. Secondly, in some cases we observe no difference in the classifier’s performance between the feature set of Gini Index and that of the proposed method. For example, in Fig. 2-a, the performance of Naïve Bayes is the same on both feature sets.
- In some cases, we do not observe better performance of the classifiers on the feature set of the proposed method. For example, in Fig. 2-a, the performance of Naïve Bayes on the feature set constructed via the proposed method is not notable (i.e. F-measure = 77).
Table 6. Performance evaluation of classifiers in terms of Recall(R), Precision (P), and F-measure
The experimental results show no obvious difference in the mean F-measure value (or F1 score) of a classifier (with each constructed feature set) across the datasets. For example, there is no obvious difference in the performance of SVM (Fig. 2 a-c). Consequently, we applied two non-parametric tests, 1) Friedman’s test and 2) the post-hoc Nemenyi test, on the F-measure values to determine whether the differences are significant and to rank the best-performing classifier instance against the feature sets constructed through DF, IG, GI, and GR. Firstly, Friedman’s test is applied to the instances of a classifier⁵ over the three datasets to obtain the chi-square and p-values shown in Table 7.
The Friedman’s test chi-square value for the five instances of each classifier (NB, DT, and SVM) is greater than the critical value 8.53 with 3 degrees of freedom (df). Consequently, there is a significant difference between the F-measure values. We apply the post-hoc Nemenyi and ANOM (Analysis of Means) tests with α = 0.05 to identify the significant differences between a classifier’s instances and to reject the null hypothesis (H0: all instances of a classifier perform equally). Fig. 3 shows the ranking of the best-performing classifier instances of NB, DT, and SVM, respectively. From the results of Table 4 and Fig. 3, we can conclude that there is a significant improvement in classification performance when the proposed method is used to construct the feature set.
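For readers unfamiliar with Friedman’s test, the chi-square statistic can be computed from within-dataset ranks as in the sketch below. It uses the textbook formula without tie correction; the scores are hypothetical values for five classifier instances on three datasets and are not the F-measures reported in this study.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_chi_square(scores):
    """scores[d][m] is the score of instance m on dataset d (no tie correction)."""
    n = len(scores)        # number of datasets (blocks)
    k = len(scores[0])     # number of classifier instances (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for m, rank in enumerate(average_ranks(row)):
            rank_sums[m] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical scores: rows are datasets, columns are five classifier instances.
scores = [
    [0.61, 0.64, 0.66, 0.67, 0.70],
    [0.58, 0.60, 0.63, 0.65, 0.69],
    [0.55, 0.59, 0.62, 0.66, 0.68],
]
statistic = friedman_chi_square(scores)
print(statistic)
```

The statistic is then compared with the chi-square critical value for the chosen significance level; a post-hoc test such as Nemenyi is only meaningful when this comparison rejects the null hypothesis.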
Fig. 3. ANOM Tests Results for Ranking of Instances of a) NB b) DT and c) SVM Classifiers
Table 7. Friedman’s Test Results (with F-value) For All Instances of DT (Decision Tree), NB (Naïve Bayes), and SVM(Support Vector Machine) Classifiers
Subsequently, experiments are performed to measure the running time of classifier training and testing. Since 10-fold cross validation is applied, the time is recorded from the start to the end of its execution. For each dataset, we evaluate the running time of each classifier. Since only three datasets are used as benchmarks in this study, we present the running times via the bar charts shown in Fig. 4.
Fig. 4. Runtime analysis of classifiers on a) Reuters b) Classic3 and c) WebKB datasets with feature sets
7. Evaluation Summary
The main aim of the proposed method is to use a widely used deep learning algorithm, namely DBN, to construct a more representative feature set and to assess its influence on the performance of the classifiers used in the study. In each experiment, three supervised learning techniques are applied to assess the efficacy of the proposed method. The proposed method is employed on three widely used datasets. We observe better performance of SVM as compared to the other classifiers. The main implications of the proposed study are:
- On the Reuters dataset, the classification decisions of the best-performing classifier, SVM, with the proposed method (F-measure = 80) improve by 10% in terms of F-measure compared to SVM with the Document Frequency feature selection method (F-measure = 76). However, we observe a 1.2% improvement when SVM is used with the Information Gain feature selection method.
- On the Classic3 dataset, the classification decisions of the best-performing classifier, SVM, with the proposed method (F-measure = 83) improve by 7.79% in terms of F-measure compared to SVM with the Document Frequency feature selection method (F-measure = 77). However, we observe very little improvement when SVM is used with the Gini Index feature selection method.
- On the Web KB dataset, the classification decisions of the best-performing classifier, SVM, with the proposed method (F-measure = 86) improve by 10.25% in terms of F-measure compared to SVM with the Document Frequency feature selection method (F-measure = 78). However, we observe a 3.61% improvement when SVM is used with the Information Gain feature selection method.
- We observed variations in the effect of feature selection techniques on the performance of learners, which might depend on the dataset characteristics. For example, the increase in the performance of the best-performing classifier, SVM, is 10%, 7.79%, and 13.42% for the Reuters, Classic3, and Web KB datasets, respectively.
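As an illustration of how such percentages can be derived, and assuming that they denote the relative improvement over the baseline F-measure (an assumption, since the calculation is not defined in the text), the Classic3 figure follows from the two F-measure values reported above:

\(\text{Improvement}=\frac{F_{\text{proposed}}-F_{\text{DF}}}{F_{\text{DF}}} \times 100=\frac{83-77}{77} \times 100 \approx 7.79 \%\)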
8. Threats to Validity
There are a few threats to the validity of our study. The first threat concerns the generalization of the reported experimental results. Since we used a limited number of datasets and classifiers, we cannot generalize the experimental results. The study can be replicated with a larger number of classifiers and datasets to generalize the results. The second threat is related to the effectiveness of the DBN algorithm, which we ran with its default parameters. The selection of a better optimization technique may help to find the best hyperparameter configurations and improve the predictive accuracy. The third threat is related to the performance measure used to evaluate the efficacy of the proposed method. The Macro-F1 (rather than Micro-F1) measure can be used to assess the efficacy of the proposed method with class discrimination. The fourth threat is related to the selection of the top-N (i.e. N=100) features. The results of all experiments are reported with the top-N features. Consequently, the reported results may change for any other value of N.
9. Conclusion
The proposed method yields promising results, which indicate its efficacy in constructing a representative feature set using the DBN (Deep Belief Network) algorithm. Constructing feature sets on the basis of the semantics of words across the documents can help to increase the classifier’s performance in an automated system. We used the DBN algorithm to construct a feature set and demonstrated an increase in the classifier’s performance. We assessed the efficacy of the proposed method with three benchmark datasets and three widely used classifiers: DT, NB, and SVM. The experimental results demonstrate the efficacy of the proposed method in improving the classification decisions in an automated system. However, the efficacy of the proposed method can be further improved by tuning the DBN algorithm’s parameters. In future work, we will analyse the applicability of the proposed method to handle the multi-class problem via parameter tuning.
Acknowledgement
This research is supported by the Higher Education Commission (HEC), Pakistan funds (Project No. 1936/SRGP/R&D/HEC/2018 and 1933/SRGP/R&D/HEC/2018).
References
[1] I. Idris and A. Selamat, "Improved Email Spam Detection Model With Negative Selection Algorithm And Particle Swarm Optimization," Applied Soft Computing, 22, p. 11-27, 2014. https://doi.org/10.1016/j.asoc.2014.05.002
[2] E. Sarac and S.A. Ozel, "An Ant Colony Optimization Based Feature Selection For Web Page Classification," The Scientific World Journal, p. 1-16, 2014.
[3] W. Medhat, A. Hassan, and H. Korashy, "Sentiment Analysis Algorithms and Applications: A Survey," Ain Shams Engineering Journal, 5, p. 1093-1113, 2014. https://doi.org/10.1016/j.asej.2014.04.011
[4] C. Zhang, X. Wu, Z. Niu, and W. Ding, "Authorship Identification From Unstructured Texts," Knowledge-Based Systems, 66, p. 99-111, 2014. https://doi.org/10.1016/j.knosys.2014.04.025
[5] S. Hussain, J. Keung, A.A. Khan, and K.E. Bennin, "A Methodology to Automate the Selection of Design Patterns," in Proc. of International Conference on Computers, Software and Application, IEEE, 2016.
[6] S. Hussain, J. Keung, and A.A. Khan, "Software design patterns classification and selection using text categorization approach," Applied Soft Computing, 58, p. 225-244, 2017. https://doi.org/10.1016/j.asoc.2017.04.043
[7] A. K. Uysal, "An Improved Global Feature Selection Scheme for Text Classification," Expert Systems with Applications, 43, p. 82-92, 2016. https://doi.org/10.1016/j.eswa.2015.08.050
[8] H. Uguz, "A Two-Stage Feature Selection Method For Text Classification By Using Information Gain, Principal Component Analysis and Genetic Algorithm," Knowledge-Based Systems, 24 (7), p. 1024-1032, 2011. https://doi.org/10.1016/j.knosys.2011.04.014
[9] Y. Yang and J.O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," in Proc. of the 14th International Conference on Machine Learning, p. 412-420, 1997.
[10] W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, and Z. Wang, "A Novel Feature Selection Algorithm For Text Categorization," Expert Systems with Applications, 33 (1), p. 1-5, 2007. https://doi.org/10.1016/j.eswa.2006.04.001
[11] C. Lee and G.G. Lee, "Information Gain And Divergence-Based Feature Selection for Machine Learning-Based Text Categorization," Information Processing and Management, 42 (1), p. 155-165, 2006. https://doi.org/10.1016/j.ipm.2004.08.006
[12] A. K. Uysal and S. Gunal, "A Novel Probabilistic Feature Selection Method for Text Classification," Knowledge-Based Systems, 36, p. 226-235, 2012. https://doi.org/10.1016/j.knosys.2012.06.005
[13] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, 3, p. 1289-1305, 2003.
[14] A. Lam, A. Nguyen, H. Nguyen, and T. Nguyen, "Combining deep learning with information retrieval to localize buggy files for bug reports," in Proc. of 30th IEEE ASE Conference, p. 476-481, 2015.
[15] S. Hussain et al., "Implications of Deep Learning for the Automation of Design Patterns Organization," Journal of Parallel and Distributed Computing, 2017.
[16] G. E. Hinton and R. R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, 313 (5786), p. 504-507, 2006. https://doi.org/10.1126/science.1127647
[17] M. White, C. Vendome, M. L. Vasquez, and D. Poshyvanyk, "Toward deep learning software repositories," in Proc. of MSR'15, p. 334-345, 2015.
[18] S. Wang, T. Liu, and L. Tan, "Automatically Learning Semantic Features for Defect Prediction," in Proc. of IEEE International Conference on Software Engineering (ICSE), 2016.
[19] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep learning for just-in-time defect prediction," in Proc. of QRS, p. 17-26, 2015.
[20] Y. Bengio, A. Courville, and P. Vincent, "Representation Learning: A Review and New Perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (8), p. 1798-1828, 2013. https://doi.org/10.1109/TPAMI.2013.50
[21] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning Algorithm for deep belief nets," Neural Computation, 18 (7), p. 1527-1554, 2006. https://doi.org/10.1162/neco.2006.18.7.1527
[22] A. Hotho, A. Nürnberger, and G. Paaß, "A Brief Survey of Text Mining," Journal for Computational Linguistics and Language Technology, 20, p. 19-62, 2005.
[23] H. Ogura, H. Amano, and M. Kondo, "Distinctive Characteristics of a Metric using Deviation from Poisson for Feature Selection," Journal of Expert Systems with Applications, 37, p. 2273-2281, 2010. https://doi.org/10.1016/j.eswa.2009.07.045
[24] S. Tasci and T. Gungor, "Comparison of Text Feature Selection Policies and Using an Adaptive Framework," Expert Systems with Applications, 40, p. 4871-4886, 2013. https://doi.org/10.1016/j.eswa.2013.02.019
[25] S. Gunal, "Hybrid Feature Selection for Text Classification," Turkish Journal of Electrical Engineering and Computer Sciences, 20, p. 1296-1311, 2012.
[26] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, "Malware classification with recurrent networks," in Proc. of ICASSP, p. 1916-1920, 2015.
[27] T. Liu, "A Novel Text Classification Approach Based on Deep Belief Network," in Proc. of ICONIP 2010, Part I, LNCS 6443, p. 314-321, 2010.
[28] M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?," Journal of Machine Learning Research, 15 (1), p. 3133-3181, 2014.
[29] R. Bermudez-Chacon, G. H. Gonnet, and K. Smith, "Automatic problem-specific hyperparameter optimization and model selection for supervised machine learning: Technical Report," Tech. rep., Zurich, 2015.
[30] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms," in KDD '13 Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, p. 847-855, 2013.
[31] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, "An Empirical Comparison of Model Validation Techniques for Defect Prediction Models," IEEE Transactions on Software Engineering, PP(99), 2016.