Scalable and Accurate Intrusion Detection using n-Gram Augmented Naive Bayes and Generalized k-Truncated Suffix Tree

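  • 강대기 (Division of Computer and Information Engineering, Dongseo University)
  • 황기현 (Division of Computer and Information Engineering, Dongseo University)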
  • Published : 2009.04.30

Abstract

The n-gram approach is widely used in intrusion detection applications. However, it has several problems, including poor scalability and double counting of features. To address these problems, we apply n-gram augmented Naive Bayes, backed by a k-truncated suffix tree (k-TST) storage mechanism, directly to the classification of intrusive sequences, and compare its performance with that of Naive Bayes and Support Vector Machines (SVM) using n-gram features in experiments on host-based intrusion detection benchmark data sets. Experimental results on the University of New Mexico (UNM) benchmark data sets show that the n-gram augmented method, which resolves the independence violation that arises when n-gram features are applied directly to Naive Bayes (i.e., Naive Bayes with n-gram features), yields intrusion detectors with higher accuracy than Naive Bayes with n-gram features and accuracy comparable to SVM with n-gram features. For scalable and efficient counting, we store the n-gram features in a k-truncated suffix tree. With this storage mechanism, we tested the classifiers with n-grams of length up to 20, which illustrates the scalability and accuracy of n-gram augmented Naive Bayes with k-truncated suffix tree storage.
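The counting role that the k-truncated suffix tree plays can be illustrated with a small sketch. The code below is not the generalized k-TST itself: it is an uncompressed, depth-limited suffix trie, a simplified stand-in that stores every n-gram of length at most k together with its occurrence count. The class name `TruncatedSuffixTrie`, its methods, and the system-call traces are hypothetical and chosen only for illustration.

```python
class TruncatedSuffixTrie:
    """Uncompressed trie over all n-grams of length <= k, with occurrence counts.

    Simplified stand-in for a generalized k-truncated suffix tree: edges are
    not compressed and construction is the naive O(len * k) insertion.
    """

    def __init__(self, k):
        self.k = k
        self.root = {"count": 0, "children": {}}

    def add_sequence(self, seq):
        # Insert the first k symbols of every suffix. Each node visited along
        # the way records one more occurrence of the n-gram spelled by its path.
        for start in range(len(seq)):
            node = self.root
            for symbol in seq[start:start + self.k]:
                children = node["children"]
                if symbol not in children:
                    children[symbol] = {"count": 0, "children": {}}
                node = children[symbol]
                node["count"] += 1

    def count(self, ngram):
        # Walk the path spelled by the n-gram; a missing edge means count 0.
        node = self.root
        for symbol in ngram:
            node = node["children"].get(symbol)
            if node is None:
                return 0
        return node["count"]


# Hypothetical system-call traces; adding several sequences to one trie is
# what makes the structure "generalized".
trie = TruncatedSuffixTrie(k=3)
trie.add_sequence(["open", "read", "mmap", "read", "mmap", "close"])
trie.add_sequence(["open", "read", "close"])

print(trie.count(["read"]))
print(trie.count(["read", "mmap"]))
print(trie.count(["open", "read", "mmap", "read"]))  # longer than k, so not stored
```

Because the trie never grows deeper than k, counts for every n-gram length from 1 to k are available from a single structure, which is the property that makes testing long n-grams practical.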

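The n-gram approach is used in many intrusion detection systems that apply machine learning. However, the n-gram approach is difficult to scale, and the n-grams extracted from a given sequence overlap one another. To address these problems, this study applies an n-gram augmented naive Bayes algorithm based on the generalized k-truncated suffix tree (k-TST) to the classification of intrusion sequences. To evaluate the proposed system, we compared the standard naive Bayes algorithm with n-gram features, the support vector machine (SVM) algorithm, and the proposed n-gram augmented naive Bayes algorithm on host-based intrusion detection benchmark data. Results on the publicly available host-based intrusion detection benchmark data from the University of New Mexico show that the n-gram augmented method resolves the violation of the independence assumption that arises when n-grams are applied directly to naive Bayes (i.e., standard naive Bayes with n-gram features) and, at the same time, produces more accurate intrusion detectors.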
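The independence violation mentioned above comes from overlap: adjacent n-grams share n-1 symbols, so treating them as independent features counts the same evidence several times. One common way to avoid this, assumed here because the abstract does not spell out the exact model, is to score each symbol once, conditioned on the n-1 symbols before it. The sketch below follows that formulation with add-alpha smoothing; the class name `NGramAugmentedNB`, the smoothing choice, and the toy traces are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


class NGramAugmentedNB:
    """Scores log P(c) + sum_i log P(x_i | previous n-1 symbols, c)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n                # n-gram length (n=1 reduces to plain naive Bayes)
        self.alpha = alpha        # add-alpha smoothing constant (assumed)
        self.class_counts = Counter()
        self.ngram_counts = {}    # label -> Counter of (context, symbol)
        self.context_counts = {}  # label -> Counter of context
        self.vocab = set()

    def _pairs(self, seq):
        # Each position contributes exactly one (context, symbol) pair, so no
        # symbol is counted twice. Contexts near the start are shorter.
        for i, symbol in enumerate(seq):
            yield tuple(seq[max(0, i - self.n + 1):i]), symbol

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            self.class_counts[label] += 1
            ngrams = self.ngram_counts.setdefault(label, Counter())
            contexts = self.context_counts.setdefault(label, Counter())
            for context, symbol in self._pairs(seq):
                ngrams[(context, symbol)] += 1
                contexts[context] += 1
                self.vocab.add(symbol)
        return self

    def log_score(self, seq, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        v = len(self.vocab) + 1  # one extra slot reserved for unseen symbols
        for context, symbol in self._pairs(seq):
            num = self.ngram_counts[label][(context, symbol)] + self.alpha
            den = self.context_counts[label][context] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, seq):
        return max(self.class_counts, key=lambda c: self.log_score(seq, c))


# Hypothetical toy traces, not the UNM data.
train = [
    (["open", "read", "write", "close"], "normal"),
    (["open", "read", "read", "close"], "normal"),
    (["open", "exec", "setuid", "exec"], "intrusion"),
    (["exec", "setuid", "exec", "close"], "intrusion"),
]
model = NGramAugmentedNB(n=2).fit([s for s, _ in train], [y for _, y in train])
print(model.predict(["open", "read", "close"]))
print(model.predict(["setuid", "exec", "exec"]))
```

In this sketch the counts live in `Counter` objects; in a larger system the same (context, symbol) counts could be read from a truncated suffix structure instead, which is the pairing the abstract describes.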
