Comparative Study of Anomaly Detection Accuracy of Intrusion Detection Systems Based on Various Data Preprocessing Techniques

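  • Kyungseon Park (박경선; Department of Knowledge Information Engineering, Ajou University) ;
  • Kangseok Kim (김강석; Department of Cyber Security, Ajou University)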
  • Received : 2021.09.15
  • Accepted : 2021.10.17
  • Published : 2021.11.30

Abstract

An intrusion detection system (IDS) detects abnormal behavior that violates security, identifying abnormal operations and preventing attacks on a system. Existing intrusion detection systems were designed around statistical analysis of traffic patterns. However, rapidly advancing technology means that modern systems generate far more varied traffic, and the limits of those methods have become clear. To overcome these limits, intrusion detection methods that apply various machine learning techniques are being actively studied. This paper compares data preprocessing techniques that can improve anomaly detection accuracy, using NGIDS-DS (Next Generation IDS Dataset), which was generated by simulation equipment from traffic in various network environments. Padding and sliding window were used for preprocessing, and oversampling with an Adversarial Auto-Encoder (AAE) was applied to address the imbalance between the proportions of normal and abnormal data. In addition, Skip-gram, one of the Word2Vec techniques, was used to extract feature vectors from the preprocessed sequence data, and an improvement in detection accuracy was confirmed. PCA-SVM and GRU were used as the models for the comparative experiments, and the results were better when sliding window, Skip-gram, AAE, and GRU were applied.
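
The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how NGIDS-DS records are encoded, so the sketch assumes each record is a sequence of integer event IDs, and the function names, the padding token `PAD`, and the sample `trace` are all hypothetical. It contrasts padding (one fixed-length sample per sequence) with sliding window (several overlapping samples per sequence), and lists the (center, context) pairs that Skip-gram trains on.

```python
from typing import List, Tuple

PAD = 0  # hypothetical padding token; assumes real event IDs start at 1


def pad_sequence(seq: List[int], length: int, pad: int = PAD) -> List[int]:
    """Truncate or right-pad one sequence to exactly `length` items."""
    return seq[:length] + [pad] * max(0, length - len(seq))


def sliding_windows(seq: List[int], size: int, step: int = 1) -> List[List[int]]:
    """Cut one sequence into overlapping windows of `size` items.

    Returns an empty list when the sequence is shorter than `size`.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def skipgram_pairs(seq: List[int], radius: int = 1) -> List[Tuple[int, int]]:
    """List the (center, context) pairs within `radius` positions."""
    pairs: List[Tuple[int, int]] = []
    for i, center in enumerate(seq):
        lo, hi = max(0, i - radius), min(len(seq), i + radius + 1)
        pairs.extend((center, seq[j]) for j in range(lo, hi) if j != i)
    return pairs


trace = [3, 7, 7, 2, 9]  # made-up event IDs, not taken from NGIDS-DS

padded = pad_sequence(trace, 8)       # one sample, filled out with PAD
windows = sliding_windows(trace, 3)   # several samples, no filler values
pairs = skipgram_pairs(trace[:3], 1)  # training pairs for an embedding model
```

In this sketch, padding adds filler values that carry no information, while sliding window produces more samples made only of observed events. Whether that difference accounts for the reported results is not stated in the abstract.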

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF), funded by the Korean government (Ministry of Science and ICT) (No. NRF-2019R1F1A1059036).
