Malware classification using statistical techniques

  • Won, Sungmin (Department of Statistics, Ewha Womans University) ;
  • Kim, Hyunjoo (Department of Statistics, Ewha Womans University) ;
  • Song, Jongwoo (Department of Statistics, Ewha Womans University)
  • Received : 2017.08.14
  • Accepted : 2017.10.12
  • Published : 2017.12.31

Abstract

Ransomware such as WannaCry has drawn worldwide attention, renewing interest in methods for limiting the damage caused by malware. When new malware appears, its attack type must be identified quickly so that the damage can be minimized. This study builds models that classify malware by type using a range of statistical techniques: multinomial logistic regression, random forest, gradient boosting, and support vector machines. The analysis also identifies variables that play an important role in classifying the type of malicious software.
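The four techniques named in the abstract can be compared side by side in a few lines. The sketch below is only an illustration, not the authors' pipeline: it assumes scikit-learn, uses synthetic data from `make_classification` as a stand-in for a malware feature table, and the sample size, feature count, number of classes, and hyperparameters are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a malware feature table (NOT the study's data):
# 600 samples, 20 numeric features, 4 hypothetical malware types.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One model per technique listed in the abstract. The linear model and the
# SVM are scale-sensitive, so they are wrapped with a standardizing step.
models = {
    "multinomial logistic": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Fit each model on the training split and record held-out accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracy[name]:.3f}")

# Impurity-based importances from the random forest give a rough ranking of
# which features contribute most to separating the classes.
forest = models["random forest"]
top_features = forest.feature_importances_.argsort()[::-1][:5].tolist()
print("top feature indices:", top_features)
```

On real data, the feature ranking at the end is one simple way to look for the kind of key variables the abstract mentions, though impurity-based importances can be biased toward high-cardinality features and are best cross-checked with another measure.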
