Building a Cybersecurity AI Dataset: A Survey of Malware Detection Techniques

  • Received : 2024.10.24
  • Accepted : 2024.11.04
  • Published : 2024.11.30

Abstract

Datasets are a foundational step in the development of any Artificial Intelligence (AI)-powered solution. In cybersecurity, especially in malware detection and mitigation, AI datasets focused on malware can play a critical role in improving the accuracy and efficiency of AI models. In this paper, we explore several recent techniques used in the construction of malware AI datasets, identify gaps, and recommend practical solutions to address them. Specifically, we explore various frameworks and techniques for improving data collection, preprocessing, and dataset validation. Furthermore, we explore recent approaches applied in AI-based malware detection. In particular, we examine shallow learning, deep learning, bio-inspired computing, behavior-based detection, heuristic-based approaches, and hybrid approaches. We then draw observations and recommend specific strategies for improving both the construction of malware AI datasets and detection techniques. Through this research, we also contribute to the ongoing and much-needed efforts to combat malware attacks by providing a framework for building quality, malware-focused cybersecurity AI datasets, thereby improving current state-of-the-art AI-powered malware detection systems.

Acknowledgement

This research was supported by 'The Construction Project for Regional Base Information Security Cluster', a grant funded by the Ministry of Science and ICT and Busan Metropolitan City in 2024.
