I. INTRODUCTION
Traditional rule-based security solutions hardly detect advanced attacks such as zero-day attacks and advanced persistent threats (APT). Attackers acquire advanced skills and exploit unknown vulnerabilities to bypass security solutions. Machine-learning (ML) is becoming a prevalent way of detecting advanced attacks with unexpected patterns [1]. ML is based on statistical and mathematical algorithms rather than rule-based algorithms. ML techniques contributes to improving performance of intrusion detection systems (IDS). Numerous studies have been addressing ML-based IDS techniques since KDD CUP 99(KDD) appeared in 1999. KDD is the most widely used dataset generated by Defense Advanced Research Projects Agency (DARPA) for IDS evaluation. KDD consists of four types of attack-labeled data including denial of service(DoS), probe, U2R (User-to-Root), and R2L (Remote-toLocal). Hasan [2] focuses on 2-class classification and multi-class classification using support vector machine (SVM). Mulay [3] employs Random Forest as well as SVM for intrusion detection and preprocess the KDD dataset through binary encoding and data rescale. Beghdad [4] classifies normal and malicious traffic based on SVM and then detect attacks based on Decision Tree. Numerous studies employ deep-learning (DL) for intrusion detects. Jia [5] and Yuchen [6] proposes a convolutional neural network (CNN) model for IDS and trains KDD with images. Le [7], Staudemeyer [8], and Kim [9] suggest an IDS model based on LSTM-RNN. Later, NSL-KDD and gureKDD appeared to improve the problems of KDD. The latest well-known datasets are CIC IDS 2017(CIC-2017) [10] and CSE-CIC-IDS 2018 (CIC2018) [11]. CIC-2017 contains recent attacks with similar form of PCAP. It was created based on B-Profile system [12]. After CIC-2017 released, several studies suggest intrusion detection model using CIC-2017 based on ML [13-14]. CIC-2018 is the most up-to-date dataset including common attacks for IDS evaluation. CIC-2018 was not generated based on KDD and consists of 7 types of attack scenarios-labeled data (specifically 16 types of attacks) including brute-force, DoS, and Botnet. CIC-2018 contains massive network traffic and system logs. We can hardly find CIC-2018 studies using DL compared to ML-based studies on CIC-2018 [15-16]. In this paper, we develop an intrusion detection model based on CNN, one of DL algorithms used to train image datasets. We first convert the CIC-2018 numerical data into images. We then develop a CNN-based intrusion detection model by organizing convolutional layers and max-pooling layers. Furthermore, we train the images based on the proposed model and evaluate its performance by comparing experimental results with that of a recurrent neural network (RNN) model. Lastly, we discuss on a way of improving the performance. CNN and RNN are fundamental deep learning models for image data and time-series data, respectively. Inception [25] as well as ResNet [26] are based on CNN. Long Short-Term Memory (LSTM) [27] is an advanced model of RNN. By employing these fundamental models, we are able to identify the optimal analysis model for the characteristics of CIC-2018. Furthermore, we could improve the performance using those advanced models in the future. The remainder of this paper is organized as follows. Section 2 briefly describes existing ML-based studies on intrusion detection as well as DL algorithms we use in this work. In Section 3, we design our CNN-based intrusion model along with features. We evaluate the proposed model discuss a preprocessing issue for the better performance in Section 4. Finally, the conclusion is in Section 5.
II. RELATED WORKS
KDD CUP 99(KDD) was generated for IDS evaluation and includes four types of attacks such as DoS, R2L, U2R, and probing. KDD consists of 41 features including traffic features, basic and content features of each TCP connection. KDD has been widely used for data mining and ML studies on intrusion detection. Table 1 shows existing ML/DL-based studies on intrusion detection using KDD.
Table 1. ML/DL-based intrusion detection studies using KDD
Some studies employ ML technique such as SVM, Decision Tree, and Artificial Neural Network (ANN) [6,17]. Most of DL-based studies use CNN, RNN, LSTM and Deep Neural Network (DNN) algorithms [7-9], [17-18]. Moreover, some studies focus on preprocessing techniques of KDD [19-20]. NSL-KDD was generated to resolve some issues in KDD, especially duplicated records and lack of patterns of several attacks. Chuanlong [21] studies an intrusion detection model using Recurrent Neural Network (RNN) using NSL-KDD. Canadian Institute for Cybersecurity (CIC) generated IDS datasets in 2012, 2017 and 2018. In 2012, ISCX IDS 2012 (CIC-2012)[22] was generated by injecting 4 types of attacks including infiltration attacks from inside, HTTP DoS attacks, DDoS(distributed denial of service) attacks and brute force attacks. Tamim [23] detects attacks in CIC-2012 based on CNN. He generates input images by converting destination payloads and classifies the images into normal and attack, while we classify two or more attacks in CIC-2018 based on a multi-class classification. CIC-2017 [10] and CIC-2018 [11] are the most up-to-date datasets for IDS evaluation. CIC-2017 contains network traffic with most common attack families including brute force attacks, heartbleed attacks, botnets, DDOS attacks and web attacks. Faker [13] studies intrusion detection using CIC-2017 and UNSWNB15 datasets. This study removes socket information to prevent model overfitting. To reduce data size, they remove null values and unimportant traffic information. They also convert string values into numerical values and normalize the values. If there are missing data or infinite data, they make two versions of data set. First, replace all of missing and infinite data into average data. Second, remove all the missing and infinite data. They evaluate their model with the two kinds of datasets. As training algorithms, DNN (Deep Neural Network), Random Forest, and Gradient Boosting Tree classification are used. X. Zhang [14] focuses on intrusion detection using Deep Forest. They preprocess the datasets using based on the P-ZigZag encoding method and apply an inverse discrete cosine transform (IDCT) into the preprocessed datasets. CIC-2018 contains more recent network traffic with/without attacks. CIC-2018 was generated by collecting network traffic and system logs for about 80 features. Qianru [15] analyzes the CIC-2018 dataset employing ML techniques. This study preprocesses the dataset by eliminating normal data and noise data, and then remove unnecessary values after decimal point. With these preprocessing methods, the size of CIC-2018 decreased by 4MB.
Table 2. Analysis of Intrusion Detection Studies using CIC-2017.
Table 3. Analysis of Intrusion Detection Studies using CIC-2018.
III. METHODS
3.1. Datasets and features
CSE-CIC-IDS2018(CIC-2018) is a dataset containing network traffic and system logs. CIC-2018 consists of 10 days of sub-datasets collected on different days through injecting 16 types of attacks. This dataset was generated using CICFlowMeter-V3 [24] and contains about 80 types of features. These features provide forward and backward directions of network flow and packets. The size of CIC2018 is more than 400GB, which is the larger amount than that of CIC-2017. We can develop a DL-based IDS model and evaluate its performance using CIC-2018.
Table 4. Type of injected attacks and amounts of sub-datasets.
Table 4 shows a list of the injected attacks and the amount of each sub-dataset we use in this work. We have preprocessed incomplete data such as null values and infinite values. We have extracted 80 types of common features from all the sub-datasets. Using these common features, we can develop and evaluate each attack model in the same environment. We finally choose 79 features except ‘Timestamp’ as shown in Table 5. The details of the features can be found at the website of CIC-2018 [11].
Table 5. Extracted features for CNN-based intrusion detection.
3.2. Design of our CNN model
CNN is the most commonly used deep learning algorithm for image training. In order to develop a CNN-based intrusion model, converting the CIC-2018 dataset into images is required. We convert each labeled data into 13x6 size of images because each data contains 78 features except the ‘Label’ feature. The ‘Label’ is used for image classification. A CNN model consists of convolutional layers, max-pooling layers, and a fully connected layer. We can find out the optimal CNN model by organizing those layers along with modeling parameters such as a kernel size, number of kernels, and ratio of dropout. Figure 1 shows our CNN model for CIC-2018.
Fig. 1. Our CNN model and parameters.
We deploy two convolutional layers and the two maxpooling layers behind each convolutional layer. Although the max pooling layer is not mandatory for a CNN model, we deploy the layer because there is very low possibility of losing important features from the max pooling as the converted images only contain numerical data rather than hidden signatures. In addition, we use ‘relu’ as an activation function for each convolutional layer. In order to reduce overfitting, dropout is applied after each step of the max pooling. Finally, a fully connected layer is deployed behind the last max-pooling layer.
IV. RESULTS AND DISCUSSION
We train each sub-dataset described in Table 4 based on our CNN model. Thirty percent of each sub-dataset is used for the testing set. We set the training parameters as shown in Table 6.
Table 6. Training parameters for our experiments.
In order to evaluate the performance of our model, we also train the dataset based on RNN model and compare the experimental results with each other. We design the RNN model based on ‘vanilla RNN’ with 10 units. Figure 2 shows the experimental results of CNN and RNN models. In most sub-datasets, our CNN model has a higher accuracy than that of the RNN model. The accuracy is measured as follows:
\(\text {Accuracy}=\frac{2 \times p \text {recision} \times \text {recall}}{\text {precision+recall}}\) (1)
\(\text { where } \quad \text { precision }=\frac{\text {true positives}}{\text {true positivest false positives}}\) and \(r e c a l l=\frac{\text {true positives}}{\text {false negatives+true positives}}\)
Especially, the accuracies of SD-2, SD-3, SD-5, and SD-9 with CNN are about 10% to 60% higher than that of using RNN.
Fig. 2. Comparison of the accuracy of CNN and RNN model.
Although our experimental results show that our CNN model detects attacks in CIC-2018 with high accuracy, we still need to figure out a way of improving the accuracy of each attack. For instance, the accuracy of SD-3 is 0.9677 in Figure 2. According to the confusion matrix of SD-3, however, the accuracies of ‘DoS- oldenEye’ and ‘DoSSlowloris’ are 0.66 and 0.47 while the accuracy of ‘benign’ is 0.99 as shown Table 7.
Table 7. Accuracy of each attack with a CNN model.
It means our model has the best performance in classifying benign data because the CIC-2018 dataset provides much more ‘benign’ data than attack-labeled data.
Here we adjust the ratio of labeled data for the better performance of DL, although the original dataset better represents the real-world network environment and distinguishing anomalous traffic from massive benign traffic in the real network is challenging. In ML and DL, a data preprocessing is an important strategy for high. We preprocessed sub-datasets (SD-3, SD-6, SD-7, SD-8, and SD-9) with low accuracy in attack-labeled data so that the amount of benign data must not be more than five times than that of the smallest amount of attacklabeled data. We then train the datasets using our CNN model. Figure 3 compares the accuracy of each attack before and after the preprocessing considering the data ratio. The experimental results show that the accuracies of most attacks dramatically increase through the preprocessing. We can find out the optimal ratio of benign and attack-labeled data through repeatitive preprocessing and training.
Fig. 3. Comparison of accuracy of before and after preprocessing.
V. CONCLUSION
We have employed DL techniques for intrusion detection. CIC-2018 has been used as an IDS dataset in this work. We have designed a CNN model consisting of two convolutional layers and two max-pooling layers and converted the dataset into images. These images have been trained based on the proposed CNN model and the experimental results showed that our model detects benign and attack data in CIC-2018 with high accuracy. In order to evaluate the performance of our model, we have also trained the dataset using RNN. In the multiclass classification, our CNN model is more accurate than the RNN model when applied to CIC-2018, the latest CIC dataset, using the image-based deep learning method introduced in Tami’s work [23]. Furthermore, we have suggested a way of improving the performance by preprocessing the dataset considering ratio of benign and attack-labeled data. The experimental results showed that the accuracy of attack-labeled data increased through the preprocessing method. In the future, we will train another IDS dataset based on our CNN model and find out the optimal model by reorganizing the convolutional layers along with CNN parameters.
Acknowledgement
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A1B07050543) and also partially supported by special research grant from Seoul Women's University (2019)
References
- Jiyeon Kim, Yulim Ahn, and Eunjung Choi, "Network Intrusion Detection using Machine Learning Techniques", in Proceeding of International Conference on Culture Technology 2019, August 2019.
- Hasan, Md. Al & Nasser, Mohammed & Pal, Biprodip & Ahmad, Shamim, "Support Vector Machine and Random Forest Modeling for Intrusion Detection System (IDS)," Journal of Intelligent Learning Systems and Applications, vol. 06, pp. 45-52, 2014.
- Mulay, Snehal & Devale, P.R. & Garje, Goraksh, "Intrusion Detection System Using Support Vector Machine and Decision Tree," International Journal of Computer Applications vol. 3. 10.5120/758-993, 2010.
- Beghad, Rachid, "Training all the KDD data set to classif and detect attacks," Neural Network World, vol. 17, pp. 81-91, 2017.
- Jia, F. & Kong, L.-Z., "Intrusion Detection Algorithm Based on Convolutional Neural Network," Beijing Ligong Daxue Xuebao/Transaction of Beijing Institute of Technology, vol. 37, pp. 1271-1275, 2017.
- Yuchen Liu, Shengli Liu and Xing Zhao, "Intrusion Detection Algorithm Based on Convolutional Neural Network", in Proceeding of the 4th International Conference on Engineering Technology and Application, 2017.
- Jihyun Kim, Howon Kim, An Effective Intrusion Detection Classifier Using Long Short-Term Memory with Gradient Descent Optimization, Proceeding of the 2017 IEEE International Conference on Platform Technology and Service (PlatCon), pp. 1-6, 2017..
- R. C. Staudemeyer and C. W. Omlin, "Evaluating performance of long short-term memory recurrent neural networks on intrusion detection data," In Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, pp. 218-224, 2013.
- G. Kim, H. Yi, J. Lee, Y. Paek, and S. Yoon, "LSTM-Based System-Call Language Modeling and Robust Ensemble Method for Designing Host-Based Intrusion Detection Systems," arXiv preprint arXiv:1611.01726, 2016.
- Intrusion Detection Evaluation Dataset (CICIDS2017), https://www.unb.ca/cic/datasets/ids-2017.html
- CSE-CIC-IDS2018 on AWS, https://www.unb.ca /cic/datasets/ids-2018.html
- Sharafaldin I., Gharib A., Habibi Lashkari A., and Ghorbani A. A.. Towards a reliable intrusion detection benchmark dataset, Software Networking, vol. 2017, no. 1, pp. 177-200, 2017. https://doi.org/10.13052/jsn2445-9739.2017.009
- Faker, Osama & Dogdu, Erdogan, "Intrusion Detection Using Big Data and Deep Learning Techniques," in Proceedings of the 2019 ACM Southeast Conference, pp. 86-93. 2019.
- Zhang Xueqin, Chen Jiahao, Zhou Yue, Han, Liangxiu, Lin Jiajun, "A Multiple-layer Representation Learning Model for Network-Based Attack Detection," IEEE Access. pp. 1-1. 10.1109/ACCESS.2019.2927465, 2019.
- Zhou Qianru, Pezaros Dimitrios, "Evaluation of Machine Learning Classifiers for Zero-Day Intrusion Detection -- An Analysis on CIC-AWS-2018 dataset," 2018.
- Jackins, V., and D. Shalini Punithavathani. "An anomaly-based network intrusion detection system using ensemble clustering," International Journal of Enterprise Network Management, vol. 9.3-4, pp. 251-260, 2018. https://doi.org/10.1504/IJENM.2018.094664
- Y. X. Meng, "The practice on using machine learning for network anomaly intrusion detection," in Proceeding of Machine Learning and Cybernetics (ICMLC), 2011 International Conference, vol. 2, pp. 576-581, IEEE, 2011.
- J. Kim, N. Shin, S. Y. Jo, and S. H. Kim, "Method of intrusion detection using deep neural network," in Proceeding of IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 313-316, 2017.
- K. Leung and C. Leckie, "Unsupervised anomaly detection in network intrusion detection using clusters," In Proceedings of the Twenty-eighth Australasian conference on Computer Science, vol. 38, pp. 333-342, 2005.
- J. J. Davis and A. J. Clark, "Data preprocessing for anomaly based network intrusion detection: A review," Computers & Security, vol. 30, no. 6, pp. 353-375, 2011. https://doi.org/10.1016/j.cose.2011.05.008
- Chuanlong Yin, Yuefei Zhu, Jinlong Fei, and Xinzheng He, "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks", IEEE Access, vol 5, pp. 21954-21961, 2017. https://doi.org/10.1109/ACCESS.2017.2762418
- Intrusion detection evaluation dataset (ISCXIDS 2012), https://www.unb.ca/cic/datasets/ids.html
- Tamim Mirza, Building an Intrusion Detection System using Deep Learning, https://towardsdatascience.com/building-an-intrusion-detection-system-using-deep-learning-b9488332b321, August 2018.
- CICFlowMeter, https://www.unb.ca/cic/research/applications.html#CICFlowMeter
- SZEGEDY, Christian, et al. Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence. 2017.
- HE, Kaiming et al., "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
- GERS, Felix A.; SCHMIDHUBER, Jürgen; CUMMINS, Fred. Learning to forget: Continual prediction with LSTM. 1999.
- VINAYAKUMAR, R., et al., "Deep Learning Approach for Intelligent Intrusion Detection System," IEEE Access, vol. 7: 41525-41550, 2019. https://doi.org/10.1109/ACCESS.2019.2895334