Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification

  • Received : 2018.09.07
  • Accepted : 2018.12.05
  • Published : 2019.02.28

Abstract

Because big-data text mining extracts a large number of features from a large volume of data, clustering and classification can suffer from high computational complexity and unreliable analysis results. In particular, the term-document matrix obtained through text mining represents term-document features but is sparse. We designed an advanced genetic algorithm (GA) that extracts features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect document-term relationships in feature extraction, and a predetermined number of features is selected through an iterative process. We also use a sparsity score to improve the performance of the detection model: if a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. By placing the sparsity score s(F) in the denominator of the fitness function, we search for a model that has low sparsity and a high TF-IDF score. We verified the proposed algorithm by applying it to text classification. The results show that the algorithm achieves higher performance (speed and accuracy) in attack mail classification.

Keywords

1. Introduction

Big data refers to very large amounts of data and encompasses a range of methodologies for collecting, processing, storing, managing, and analyzing such data. In particular, text mining, which has recently been adopted in many industries, is an important technique for analyzing unstructured big data.

Text mining tends to extract more terms (features) as the amount of data increases. Because big-data text mining extracts a large number of features from a large volume of data, clustering and classification can suffer from high computational complexity and unreliable analysis results. In particular, the term-document matrix (TDM) obtained through text mining represents term-document features but is sparse. In a sparse matrix, it is difficult to retrieve useful information and to trust the analysis results. Therefore, various studies have been conducted on feature selection and the reduction of data dimensions [1].
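The sparsity of a term-document matrix can be seen on even a tiny corpus. The following is a minimal sketch in Python (the experiments in this paper used R); the three example documents and the helper names `term_document_matrix` and `zero_ratio` are hypothetical and serve only as illustration.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][k] counts term i in document k."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def zero_ratio(matrix):
    """Fraction of matrix cells that are zero."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

# Hypothetical three-document corpus, for illustration only.
docs = ["free money now", "meeting agenda attached", "free meeting"]
terms, tdm = term_document_matrix(docs)
print(terms)
print(zero_ratio(tdm))  # 10 of the 18 cells are zero
```

Even here more than half of the cells are zero, and the share generally grows as the vocabulary grows faster than the number of terms per document.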

This study focuses on selecting an optimized set of features from the corpus. A genetic algorithm (GA) is used to extract the desired number of terms (features) according to term importance, which is calculated with the fitness equation presented in Section 3. The feature-selection method aims to lower computational complexity and increase analytical performance. It improves on FSGA (Feature Selection based on Genetic Algorithm), which was proposed in [1].

We designed an advanced GA that extracts features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) [2] is used to reflect document-term relationships in feature extraction, and a predetermined number of features is selected through an iterative process. We also use a sparsity score to improve the performance of the detection model: if a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. By placing the sparsity score s(F) in the denominator of the fitness function, we search for a model that has low sparsity and a high TF-IDF score. We verified the proposed algorithm by applying it to text classification.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 describes feature-selection techniques using GAs in text mining. Section 4 presents the experiment and analysis results. Finally, Section 5 presents the conclusions.

2. Related Works

2.1 Intelligent Security Data Analysis

Intelligent security data analysis [2] uses intelligent methods and systems to analyze security data. Its goals are to improve the performance of security technologies, including attack detection, attack analysis, security assessment, and vulnerability analysis, and to strengthen security hardening through the prediction of, and defense against, new attacks.

(Figure 1) Types of Imbalanced Data: (a) Class Overlapping, (b) Small Disjuncts

Intelligent methods or intelligent systems can be described as methods that enable searching, learning, analysis, and prediction based on knowledge, models, patterns, and features extracted from collected data and from monitoring data obtained by computers. Intelligent systems can include data mining (classification, clustering, association analysis, ensembles, etc.), information fusion (Bayesian, fuzzy, etc.), soft computing (heuristic search, evolutionary programming, etc.), and artificial intelligence (AI) algorithms. Many studies have applied these intelligent systems to security technologies, since it is difficult to handle evolving attack techniques with existing simple filtering-based and signature-based analysis methods. In particular, intelligent methods are well suited to research on intrusion detection (insider and outsider) and threat inference (threat assessment or risk assessment).

Our target is intelligent security systems; thus, we studied security systems based on data analysis. This paper introduces the concept of intelligent systems and the intelligent-system approaches to intrusion detection and threat inference on which our studies have focused. An improved security system is proposed by applying GAs to intelligent systems. The basic concepts and related studies needed to apply intelligent algorithms to anomaly detection and intrusion reasoning are briefly introduced below.

2.1.1 Security Data

Before analyzing intelligent security data, it is necessary to define security data and their characteristics. Security data broadly means "structured and unstructured data that are collected by various sensors to ensure confidentiality, integrity, and availability against security attacks and threats." From the perspective of attack defense, it can be narrowly defined as "all kinds of data that are used to detect security attacks and threats and to analyze their patterns, trends, and signs."

The sensors involved can be various kinds of hardware and software, such as host PCs, security software, and security systems, which include intrusion detection systems (IDS), firewalls (FW), and enterprise security management (ESM). These sensors collect and log data, mainly storing monitoring status, real-time conditions, and analysis results. Each sensor holds data in many different forms.

Security data can be classified into two categories: data designed for security purposes (e.g., IDS and FW blocking records) and data not created for security purposes but used in security data analysis (e.g., network packets or social network service (SNS) data). Insider attacks and threats have recently increased; thus, SNS data, member records, and unstructured data such as CCTV footage are also included in security data analysis [3]. With these data, studies on attack and threat analysis are being actively carried out in the field of security data analysis.

2.1.2 Imbalance Problem in Security Data Analysis

These extensive security data sets contain a lot of imbalanced data, which can reduce the performance of learning algorithms and lead to incorrect predictions; therefore, they can degrade the detection rate of security systems and the analysis performance of algorithms. As shown in Figure 1, most data used for solving real-world problems contain imbalanced data, which makes it hard to build a good model when studying algorithms. In particular, security data usually contain a large amount of normal data and relatively little attack and anomaly data; security data are therefore characterized by imbalance.

Low-frequency data are not classified into the normal category. To process and analyze such data in classification, they must be balanced. That is, to improve poor classification accuracy it is necessary to understand the characteristics of the data frequencies and find effective data categories. It is also necessary to derive applicable and meaningful data through domain-specific data analysis. The need to handle imbalanced data is now recognized in fields related to data mining, and several research communities have taken up the problem. The Association for the Advancement of Artificial Intelligence (AAAI), the International Conference on Machine Learning (ICML), and the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) have each hosted workshops or special issues on learning from imbalanced data [4].

The problem of imbalanced data is commonly considered important in the field of data processing for knowledge discovery. Learning from imbalanced data poses major challenges, including the evaluation of learning algorithms and the cost of imbalanced data [5, 6, 7, 8].

There have been many studies on improving conventional algorithms for use with imbalanced data sets. In recent years, there have also been studies on sampling from increasingly large data sets. Several of these have focused on generating a subset of the data by selecting the most efficient features.

Security data that are divided into two types, normal and attack data, tend to be imbalanced because they require dichotomous analysis, such as attack/detection or normal/abnormal (anomaly). However, few studies on security data aim to solve the imbalanced-data problem in data mining; many existing security data analyses are based on data created for research. A significant benefit of solving data imbalance is improved performance of intelligent security data analysis (ISDA). Feature-selection techniques are used to address data imbalance in ISDA, and in this paper we propose a GA-based feature selection for this purpose.

2.2 Feature Selection for Intelligent System Design and Application

Feature selection is a method of selecting a subset of important or relevant features from an entire feature set in order to improve analysis time and performance and to reduce dimensionality. Feature selection is mainly divided into wrapper approaches and filter approaches. Wrapper approaches evaluate feature sets based on their performance when classifying training data; they use a document classifier to select suitable features. The computational complexity of wrapper approaches is greater than that of filter approaches because they run the classification for every candidate feature set. However, the classification accuracy of wrapper approaches is better than that of filter approaches, since the properties of the classifier are taken into account [9, 10].

Filter approaches evaluate candidate feature sets based on the intrinsic information of the data set. The distance between data instances representing feature sets, or the information gain, is measured to determine whether each feature should be selected. Filter approaches are limited in that they cannot consider the relationships between features, because only the weights of independent single features are used for feature selection [9].
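A filter approach can be sketched as scoring each feature on its own and keeping the top-scoring ones. The Python sketch below uses information gain on binary term presence as the score; the function names, the tie-breaking rule, and the four-document example are assumptions made for illustration and are not taken from [9].

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(present, labels):
    """Gain from splitting the documents on presence/absence of one term."""
    n = len(labels)
    with_term = [y for p, y in zip(present, labels) if p]
    without = [y for p, y in zip(present, labels) if not p]
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional

def filter_select(features, labels, k):
    """Rank features independently by information gain and keep the top k."""
    scores = {name: information_gain(col, labels) for name, col in features.items()}
    # Ties are broken alphabetically (an arbitrary choice for this sketch).
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]

# Hypothetical data: one presence flag per document for each term.
labels = ["spam", "spam", "ham", "ham"]
features = {
    "free": [True, True, False, False],
    "meeting": [False, False, True, False],
    "the": [True, True, True, True],
}
print(filter_select(features, labels, 2))
```

Because each term is scored in isolation, two redundant terms would both rank highly, which is the limitation described above.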

Feature selection plays an important role in anomaly detection, since it is intended to eliminate unimportant or inappropriate features from the full set of data features [11]. Feature selection usually improves generalization and makes the data more understandable, and it reduces computational complexity, data dimensionality, and redundancy, which improves the performance of anomaly detection.

As shown in Figure 1, feature selection largely consists of three steps: 1) sub-feature set generation, 2) sub-feature set evaluation, and 3) validation. The most important part is the sub-feature set evaluation, for which there are five approaches: score-based, entropy/mutual information-based, correlation-based, consistency-based, and detection accuracy-based [12].

3. AFSGA : Advanced FSGA

3.1 Fitness Function Studies and Design

The fitness function in GAs is the fitness equation used to evaluate the superiority of the given solution. In this study, the fitness function for text mining has been designed to evaluate the importance of the given term. The corresponding notation is as follows:

  • \(F=\left\{F_{1}, F_{2}, \ldots, F_{n}\right\}\) ≜ the set of features (chromosome)
  • \(D=\left\{D_{1}, D_{2}, \ldots, D_{N}\right\}\) ≜ the set of documents
  • \(N\) ≜ the number of documents
  • \(s(F)\) ≜ sparsity of the document-term matrix using \(F\)
  • \(x_i = F_i\) ≜ the value of the \(i\)-th feature
  • \(tf_{ik}\) ≜ term frequency of feature \(F_i \in F\) in document \(D_k \in D\)
  • \(df_k\) ≜ document frequency, the number of documents that include \(F_k \in F\)
  • \(idf_k\) ≜ inverse document frequency of feature \(F_k \in F\) over the documents in \(D\) = \(\log((N-df_k)/df_k)\)

The fitness function is given in equation (1),

\(\max F=\frac{\sum_{i=1}^{|F|} \sum_{k=1}^{|D|}\left(t f_{x_{i} k} \times i d f_{k}\right)}{e^{s(F)}}\)       (1)

where

\(s(F)=1-\frac{\sum_{i=1}^{|F|} \sum_{k=1}^{|D|} f\left(x_{i k}\right)}{N \times n-\sum_{i=1}^{|F|} \sum_{k=1}^{|D|} f\left(x_{i k}\right)}\)   ,

\(f(x)=\left\{\begin{array}{ll}{1,} & {x=0} \\{0,} & {x \neq 0}\end{array}\right.\)       (2)

Equation (1) does not simply use a term's overall frequency; it uses the term's relative frequency with respect to each document to obtain the fitness value of each solution, so features are selected by relative importance. Raw frequency alone has two weaknesses: a term may occur frequently yet be used with a different meaning in each document, and low-frequency terms, which may represent important features of a document, are undervalued. The fitness function was therefore designed around TF-IDF to take document-term relationships into account.

We also use the sparsity ratio s(F) (equation (2)) to improve the performance of the detection model. Sparsity is the ratio of zero values in a data set (or matrix). If a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. Therefore, s(F) is placed in the denominator of the fitness function so that the search favors a model with low sparsity and a high TF-IDF score.

Furthermore, the sparsity ratio enters the denominator through an exponential function. This keeps the denominator from becoming 0, and it lowers the total fitness more strongly as the sparsity ratio rises. The sparsity ratio thereby gains weight in the fitness value, which is intended to improve the ability to handle class imbalance.
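The structure of the fitness function, a TF-IDF sum divided by an exponential sparsity penalty, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R implementation: `sparsity` uses the textual definition (the share of zero cells) rather than the exact printed form of equation (2), the inverse document frequency is indexed by feature, terms that occur in no document or in every document are given an idf of 0 to avoid log(0), and the small matrix is hypothetical.

```python
import math

def idf(df, n_docs):
    """idf = log((N - df) / df), as in the notation list.
    Assumption: df == 0 or df == N yields 0.0 to avoid log(0) or division by zero."""
    if df == 0 or df == n_docs:
        return 0.0
    return math.log((n_docs - df) / df)

def sparsity(rows):
    """Share of zero cells in the selected rows (textual definition of s(F))."""
    cells = [v for row in rows for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

def fitness(tdm, selected):
    """tdm[i][k] is the frequency of term i in document k;
    selected is a non-empty list of term indices (the chromosome)."""
    rows = [tdm[i] for i in selected]
    n_docs = len(tdm[0])
    score = 0.0
    for row in rows:
        df = sum(1 for v in row if v > 0)
        score += sum(row) * idf(df, n_docs)  # sum over documents of tf * idf
    return score / math.exp(sparsity(rows))  # e^{s(F)} is never zero

# Hypothetical 3-term x 4-document matrix.
tdm = [
    [2, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 3, 0],
]
print(fitness(tdm, [0, 2]))
print(fitness(tdm, [0, 1]))
```

With this idf definition a term that appears in more than half of the documents receives a negative weight, so the second selection scores lower than the first.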

3.2 Time Complexity and Fitness Curve

The sub-feature selection loop contributes a factor of \(O(t/l)\) to the time complexity, where \(t\) is the total length of the selected feature set and \(l\) is the length of each sub-selected feature set. The time complexity of a typical GA is

\(O(n^2 \times \text{fitness function})\)

[13], where \(n\) is the length of the chromosome. In this algorithm, the time complexity of the fitness function is \(O(d)\), where \(d\) is the number of documents. Finally, the time complexity of the proposed feature-selection algorithm is

\(O\left(n^2 \times d\right) \times O(t/l) = O(n \times d \times t)\)

where the simplification to \(n \times d \times t\) holds when \(l = n\).

Figure 2 shows the fitness curves of the proposed algorithm for one sub-feature selection. Analyzing the fitness curve of a GA indicates both the difficulty of the optimization problem and the performance of the algorithm. The curves show that finding an optimal spam mail detection model in a single sub-feature search is a very difficult problem, because the algorithm does not reach the best model even after many generations. Nevertheless, the algorithm can be expected to find the best model eventually, because the fitness value converges at about 800 generations and converges faster as the population size increases.

(Figure 2) Fitness Curve of Proposed Algorithm

4. Experiment Result

4.1 Environment

The hardware and operating system environment used in the experiment are as follows.

  • CPU: Intel Core i5 650 3.20 GHz
  • RAM: 7 GB
  • OS: Windows 7 Enterprise K 64bit

The open-source software R (version 3.02) [14] was used for the experiment. The text mining (tm) package in R [15] was used for text mining. The GA algorithms in the R GA package [16] were modified to run in the R environment. The classification experiments used the KNN classifier [17].

The feature-selection experiment was performed using a TF-IDF GA. The clustering results were analyzed and compared with the original corpus. The GA parameters used are as follows:

  • Size of population = 200
  • Length of chromosome = 55
  • Probability of crossover = 0.8
  • Probability of mutation = 0.2
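To show how these four parameters enter a GA loop, the following Python sketch runs a simple generational GA. It is not the modified R GA package code used in the experiments: the bit-string encoding, tournament selection, single-point crossover, one-bit mutation, elitism, and the placeholder objective `toy_fitness` are all assumptions chosen to keep the example short.

```python
import random

POP_SIZE = 200      # size of population
CHROM_LEN = 55      # length of chromosome
P_CROSSOVER = 0.8   # probability of crossover
P_MUTATION = 0.2    # probability of mutation

def toy_fitness(chrom):
    """Placeholder objective (number of 1-bits); not the paper's fitness."""
    return sum(chrom)

def tournament(pop, fitness, rng, k=2):
    """Pick the fitter of k randomly drawn chromosomes."""
    return max(rng.sample(pop, k), key=fitness)

def evolve(fitness, generations, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]
    history = []  # best fitness per generation, i.e. the fitness curve
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        nxt = [best[:]]  # elitism: carry the best chromosome over unchanged
        while len(nxt) < POP_SIZE:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            child = a[:]
            if rng.random() < P_CROSSOVER:
                cut = rng.randrange(1, CHROM_LEN)  # single-point crossover
                child = a[:cut] + b[cut:]
            if rng.random() < P_MUTATION:
                pos = rng.randrange(CHROM_LEN)     # flip one random bit
                child[pos] = 1 - child[pos]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve(toy_fitness, generations=20)
print(history[0], history[-1])
```

Because the best chromosome is always carried over, `history` never decreases; plotting it gives a fitness curve of the kind discussed in Section 3.2.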

4.2 Document Data Set

The data set used in the experiments consists of 300 documents from the LingSpam data set [18], which is used for spam mail classification and clustering. The data set was divided into two classes, normal mail and spam mail, and the clustering performance was measured. The experiments used 252 normal mails and 48 spam mails.
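As a worked illustration of how imbalanced this split is (an added example, not a result from the experiments), consider a trivial classifier that labels every mail as normal. Its accuracy would be

\(ACC=\frac{252}{252+48}=0.84\)

even though it detects none of the 48 spam mails, so its recall on the spam class is \(0/48=0\). Accuracy alone is therefore a weak indicator on this data set.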

4.3 Measuring Method of the Text Classification Experiment

In the evaluation step, a score function is applied to evaluate performance using the feature set derived from the feature-selection method. During learning, a heuristic search is repeated to find a feature set that meets the given criteria until the final feature set is selected. The optimal feature set is chosen through evaluation and search because a partial feature set is selected from the high-dimensional feature sets.

In this case, the F-measure is used as the score for evaluating document classification. The F-measure is widely used to evaluate text classification results by combining precision and recall [19]. The precision \(P_i\) and the recall \(R_i\) are calculated by equations (3) and (4), respectively, and the F-measure \(F_i\) by equation (5).

\(P_{i}=\frac{T P_{i}}{T P_{i}+F P_{i}}\)       (3)

\(R_{i}=\frac{T P_{i}}{T P_{i}+F N_{i}}\)       (4)

\(F_{i}=\frac{2 \times P_i \times R_i}{P_i + R_i}\)       (5)

where \(d_i\) indicates the number of documents included in category \(i\), and \(N\) is the number of categories.
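Precision, recall, and the F-measure can be computed directly from the counts, as in the short Python sketch below. The function names and the example counts are hypothetical; the last two lines plug in the precision and recall values reported in Section 4.4 to show that they reproduce the reported F-measures.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (F-measure)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_recall_f(tp, fp, fn):
    """Precision and recall from TP, FP, and FN counts, plus the F-measure."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, f_measure(p, r)

# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=40, fp=5, fn=10)
print(round(p, 4), round(r, 4), round(f, 4))

# Precision/recall pairs reported in Section 4.4.
print(round(f_measure(1.0, 0.9664), 4))  # 0.9829
print(round(f_measure(1.0, 0.9213), 4))  # 0.959
```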

4.4 Experiment Results

For the class imbalance problem, several measures are used to evaluate detection algorithms: ETS (Equitable Threat Score), CSI (Critical Success Index), PAG (Post Agreement), and ACC (Accuracy) [20, 21, 22, 23, 24]. These measures are commonly used to verify dichotomous (yes/no) forecasts in areas such as weather forecasting and detection. They are computed from the entries of the confusion matrix in Table 1, which correspond to the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts used in the F1-measure. In equations (6) to (9), \(H\) denotes hits (TP), \(M\) misses (FN), \(F\) false alarms (FP), and \(C\) correct negatives (TN).

(Table 1) Confusion Matrix

\(ETS=\frac{H-a_r}{H+M+F-a_r}\)  ,

\(a_r=\frac{(H+M)(H+F)}{H+M+F+C}\)       (6)

\(CSI=\frac{H}{H+M+F}\)       (7)

\(PAG=\frac{H}{H+F}\)       (8)

\(ACC=\frac{H+C}{H+M+F+C}\)       (9) 
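Equations (6) to (9) translate directly into code. In the Python sketch below, `h`, `m`, `f`, and `c` stand for hits, misses, false alarms, and correct negatives; the function name and the example counts are hypothetical and are not taken from the experiments.

```python
def forecast_scores(h, m, f, c):
    """ETS, CSI, PAG, and ACC from the four confusion-matrix counts.
    Assumes the denominators are non-zero (for example, h + f > 0)."""
    total = h + m + f + c
    a_r = (h + m) * (h + f) / total  # hits expected by chance
    return {
        "ETS": (h - a_r) / (h + m + f - a_r),
        "CSI": h / (h + m + f),
        "PAG": h / (h + f),
        "ACC": (h + c) / total,
    }

# Hypothetical counts, for illustration only.
scores = forecast_scores(h=40, m=10, f=5, c=45)
for name, value in scores.items():
    print(name, round(value, 4))
```

Unlike ACC, neither ETS nor CSI counts correct negatives as successes, and ETS further discounts the hits expected by chance, which makes both less sensitive to a dominant normal class.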

The experiment shows that AFSGA achieved its best Precision when only 5% of the features were selected, and its best Recall and F-measure when 50% were selected, with values of 1, 0.9664, and 0.9829, respectively. Using all the features gave P = 1, R = 0.9213, and F = 0.9590. Compared with AFSGA, the best performance in F-measure and Precision was achieved when all the features were used, whereas AFSGA achieved the better result in Recall.

An analysis of the overall experiment result shows that, in general, the higher the feature-selection ratio, the better the Precision, F-measure, and Recall, with the exception of several cases shown in Table 2 and Figure 3.

(Table 2) Classification Result

(Figure 3) Result of F1-measure (graph)

In classification, learning is carried out from already-classified examples, so the more data is used for learning, the better the classification result.

However, classification performance did not deteriorate much when AFSGA was used, and a superior result was achieved at some selection ratios, which confirms the performance and potential usefulness of AFSGA in document classification. To show better classification performance, however, AFSGA needs further improvement.

The ETS/CSI/PAG/ACC results in Table 3 and Figure 4 show that FSGA achieved its best results when only 5% of the features were selected, with ETS = 0.8657, CSI = 0.9664, PAG = 1, and ACC = 0.9739. This is better than classification using all of the features. The ETS and CSI scores indicate that the algorithm is able to handle the class imbalance problem and to find an optimal forecast model. Therefore, the proposed algorithm performs well on spam mail detection.

Table 4 shows the classification time measured for each feature-selection ratio, and Figure 5 shows a graph of the result. The higher the feature-selection ratio, the longer the classification takes. Taking the classification performance and the classification time together, document classification using FSGA can be useful, depending on the purpose and the environment.

(Table 3) Result of ETS/CSI/PAG/ACC

(Figure 4) Result of ETS/CSI/PAG/ACC (graph)

Because the classification time can be greatly reduced while document-classification performance stays within a usable level, the method is valuable in a big-data environment, for real-time document classification, or where computing resources are insufficient. If the purpose is to classify as accurately as possible, irrespective of the classification time, a better result may be achieved by using all of the features.

Considering the overall results of the document-classification experiment, the proposed FSGA algorithm can be used efficiently for document classification.

(Table 4) Result of Classification Time

(Figure 5) Result of Classification Time

5. Conclusion

In this paper, a feature-selection (term selection) method was proposed to enhance the effectiveness of analysis in text mining. A new GA was designed for text mining, due to its optimal search performance. In addition, to maintain genetic diversity, the algorithm was modified to select the final feature set using partial feature sets.

We also used the sparsity ratio s(F) to improve the performance of the detection model. Sparsity is the ratio of zero values in a data set (or matrix). If a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. Therefore, s(F) was placed in the denominator of the fitness function so that the search favors a model with low sparsity and a high TF-IDF score.

In the document-classification experiment, classification performance was better when all features were used than when AFSGA was used, with the exception of Recall. The ETS and CSI scores indicate that the algorithm is able to handle the class imbalance problem and to find an optimal forecast model.

References

  1. Sung-Sam Hong, Wanhee Lee, and Myung-Mook Han, "The Feature Selection Method based on Genetic Algorithm for Efficient of Text Clustering and Text Classification," International Journal of Advance Soft Computing Application, Vol. 5, No. 3, 2013.
  2. Sung-Sam Hong, Dong-Wook Kim and Myung-Mook Han, "Feature-Selection Algorithm based on Genetic Algorithms for Intelligent Security Data Analysis of Unstructured Data," KSII The 12th Asia Pacific International Conference on Information Science and Technology(APIC-IST), Chiangmai, Thailand, 2017
  3. Daniel L. Costa, Matthew L. Collins, Samuel J. Perl, Michael J. Albrethsen, George Silowash, and Derrick Spooner, "An Ontology for Insider Threat Indicators," Proceedings of the 9th Conference on Semantic Technology for Intelligence, Defense, and Security, Fairfax VA, pp.48-53, 2014.
  4. He, Haibo and Edwardo Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, Vol.21, No.9, pp.1263-1284, 2009. https://doi.org/10.1109/tkde.2008.239
  5. Chawla, Nitesh V., Nathalie Japkowicz, and Aleksander Kotcz, "Editorial: special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, Vol.6, No.1, pp.1-6, 2004. https://doi.org/10.1145/1046456.1046457
  6. Eun-Jin Kim, Uk Heo, Byoung-Chul Kim, Il-Kyu Eom, and Young-In Kim, "More Realistic Data Generation for the Imbalanced Class Problem," Journal of Korean Institute Of Information Technology, Vol.9, No.11, pp.143-150, 2011.
  7. Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera, "A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), Vol.42, No.4, pp.463-484, 2012. https://doi.org/10.1109/tsmcc.2011.2161285
  8. Zhongbin Sun, Qinbao Song, Xiaoyan Zhu, Heli Sun, Baowen Xu, and Yuming Zhou, "A novel ensemble method for classifying imbalanced data," The Journal of the Pattern Recognition Society, Vol.48, No.5, pp.1623-1637, 2015. https://doi.org/10.1016/j.patcog.2014.11.014
  9. Robertson, Stephen. "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, Vol.60, No.5, pp.503-520, 2004. https://doi.org/10.1108/00220410410560582
  10. Joo-ho In, Jung-ho Kim, and Soo-hoan Cahe, "Combined Feature Set and Hybrid Feature Selection Method for Effective Document Classification," Journal of Korean Society for Internet Information, Vol.14, No.5, pp 49-57, 2013. https://doi.org/10.7472/jksii.2013.14.5.49
  11. John, G., Kohavi, R., and Pfleger, K., "Irrelevant Features and the Subset Selection Problem," In Proceedings of 11th International Conference on Machine Learning, New Brunswick, NJ, pp.121-129, 1994. https://doi.org/10.1016/b978-1-55860-335-6.50023-4
  12. Monowar H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, "Network Anomaly Detection: Methods, Systems and Tools," IEEE COMMUNICATIONS SURVEYS & TUTORIALS, Vol.16, No.1, pp.303-336, 2014. https://doi.org/10.1109/surv.2013.052213.00046
  13. C. J. Van Rijsbergen, Information Retrieval, 2nd ed., Butterworths, London, 1979.
  14. http://www.r-project.org/
  15. http://cran.r-project.org/web/packages/tm/index.html
  16. http://cran.r-project.org/web/packages/GA/index.html
  17. https://cran.r-project.org/package=e1071
  18. I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering," 11th European Conference on Machine Learning (ECML 2000), Warsaw, Poland, pp. 9-17, 2000.
  19. Bratko, Andrej; et al. "Spam filtering using statistical data compression models," The Journal of Machine Learning Research, No.7, pp. 2673-2698, 2006
  20. THOMAS M. HAMILL, and JOSIP JURAS, "Measuring forecast skill: is it real skill or is it the varying climatology?," Quarterly Journal of the Royal Meteorological Society, Vol.132, No.621c, pp.2905-2923, 2006. https://doi.org/10.1256/qj.06.25
  21. Roberts, N. M., and H. W. Lean, "Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events," Monthly Weather Review, Vol.136, No.1, pp. 78-97, 2008. https://doi.org/10.1175/2007mwr2123.1
  22. http://www.cawcr.gov.au/projects/verification/
  23. Nigro, M.A., J.J. Cassano and M.W. Seefeldt, "A weather-pattern-based approach to evaluate the Antarctic Mesoscale Prediction System (AMPS) forecasts : Comparison to automatic weather station observations," Weather Forecasting, Vol.26, No.2, pp.184-198, 2011. https://doi.org/10.1175/2010waf2222444.1
  24. Wilks, D.S., Statistical Methods in the Atmospheric Sciences. 3rd Edition. Elsevier, p. 676, 2011. https://doi.org/10.1016/c2010-0-65519-2