Building a Cybersecurity AI Dataset: A Survey of Malware Detection Techniques

Niringiye Godfrey; Bruce Ndibanje; Hoon Jae Lee

doi:10.7236/IJIBC.2024.16.4.409

International Journal of Internet, Broadcasting and Communication

Volume 16 Issue 4 / Pages 409-431 / 2024 / 2288-4920 (pISSN) / 2288-4939 (eISSN)

The Institute of Internet, Broadcasting and Communication (한국인터넷방송통신학회)

Niringiye Godfrey (Department of Computer Engineering, Dongseo University) ;
Bruce Ndibanje (TechDivision) ;
Hoon Jae Lee (Dongseo University, Department of Information Security)

Received : 2024.10.24
Accepted : 2024.11.04
Published : 2024.11.30

Abstract

Datasets are foundational to the development of any Artificial Intelligence (AI)-powered solution. In cybersecurity, especially in malware detection and mitigation, AI datasets focused on malware can play a critical role in improving the accuracy and efficiency of AI models. In this paper, we explore several recent techniques used in the construction of malware AI datasets, identify gaps, and recommend practical solutions to address them. Specifically, we explore various frameworks and techniques for improving data collection, preprocessing, and dataset validation. Furthermore, we explore recent approaches applied in AI-based malware detection. In particular, we examine shallow learning, deep learning, bio-inspired computing, behavior-based detection, heuristic-based approaches, and hybrid approaches. We then draw observations and recommend specific strategies for improving both the construction of malware AI datasets and the detection techniques. Through this research, we also contribute to the ongoing and much-needed efforts to combat malware attacks by providing a framework for building quality malware-focused cybersecurity AI datasets, thereby improving current state-of-the-art AI-powered malware detection systems.
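
As a rough illustration of the preprocessing and dataset validation steps mentioned in the abstract, the sketch below deduplicates samples by content hash and checks class balance. It is not the framework proposed in the paper: the function names, the toy byte strings, and the `min_share` threshold are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a sample's raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(samples):
    """Drop byte-identical samples, keeping the first occurrence of each hash.

    `samples` is an iterable of (content_bytes, label) pairs (hypothetical layout).
    Returns a list of (sha256_hex, label) records.
    """
    seen = set()
    unique = []
    for content, label in samples:
        digest = sha256_hex(content)
        if digest not in seen:
            seen.add(digest)
            unique.append((digest, label))
    return unique


def class_balance(records):
    """Return the share of each label among the records."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_imbalanced(balance, min_share=0.3):
    """Flag the dataset if any class falls below `min_share` (assumed threshold)."""
    return any(share < min_share for share in balance.values())


# Toy data standing in for collected binaries; not real malware.
toy_samples = [
    (b"MZ\x90\x00benign-a", "benign"),
    (b"MZ\x90\x00benign-b", "benign"),
    (b"MZ\x90\x00benign-a", "benign"),  # exact duplicate of the first sample
    (b"MZ\x90\x00benign-c", "benign"),
    (b"MZ\x90\x00benign-d", "benign"),
    (b"MZ\x90\x00malware-a", "malware"),
]

records = deduplicate(toy_samples)
balance = class_balance(records)
print(len(records), balance, is_imbalanced(balance))
```

Hashing only catches exact duplicates; near-duplicate variants of the same malware family would need a similarity measure, which this sketch does not attempt.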

Acknowledgement

This thesis was supported by 'The Construction Project for Regional Base Information Security Cluster', a grant funded by the Ministry of Science and ICT and Busan Metropolitan City in 2024.
