http://dx.doi.org/10.7472/jksii.2020.21.1.45

Research on text mining based malware analysis technology using string information  

Ha, Ji-hee (Department of Information Security, Hoseo University)
Lee, Tae-jin (Division of Computer and Information Engineering, Hoseo University)
Publication Information
Journal of Internet Computing and Services / v.21, no.1, 2020, pp. 45-55
Abstract
With advances in information and communication technology, the number of new and variant malware samples grows rapidly every year, and the spread of Internet of Things and cloud computing technology is diversifying the types of malware in circulation. In this paper, we propose a malware analysis method based on string information, which can be used regardless of the operating system environment and which reflects library call information related to malicious behavior. Attackers can easily create malware by reusing existing code or by using automated authoring tools, and the resulting malware behaves much like existing malware. Because most strings that can be extracted from malware are closely related to its malicious behavior, we weight them with a text mining based method to obtain effective features for malware analysis. From the processed data, we build models with several machine learning algorithms and run experiments on malware detection (malicious versus benign) and on malware group classification. The data was compared and verified using files for both the Windows and Linux operating systems. Detection accuracy is about 93.5%, and group classification accuracy is about 90%. The proposed technique is widely applicable because it is relatively simple and fast, and because it works as a single, operating system independent model: there is no need to build a separate model for each group when classifying malware groups. In addition, because the string information is extracted through static analysis, it can be processed faster than with analysis methods that execute the code directly.
Keywords
Malware; Malware Analysis; String; Text Mining; TFIDF
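
The pipeline described in the abstract (extract strings statically, then weight them with a text mining method) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the extraction rule or the weighting formula, so the sketch assumes printable ASCII runs of at least 4 characters (the default of the Unix `strings` tool) and a plain TF-IDF weight, tf × log(N / df). The function names `extract_strings` and `tfidf` and the sample byte strings are hypothetical.

```python
import math
import re
from collections import Counter

MIN_LEN = 4  # assumed minimum string length, not taken from the paper


def extract_strings(data: bytes, min_len: int = MIN_LEN) -> list[str]:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each string per file by tf * log(N / df) (assumed variant)."""
    n = len(docs)
    # Document frequency: in how many files does each string occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty files
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


# Two made-up byte blobs standing in for binary files.
sample_a = b"\x00\x01CreateRemoteThread\x00\x02kernel32.dll\x00ab\x00"
sample_b = b"\x00kernel32.dll\x00\x03MessageBoxA\x00"

docs = [extract_strings(sample_a), extract_strings(sample_b)]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, w)
```

With this weighting, a string that appears in every file (here `kernel32.dll`) gets weight zero, while strings specific to one file are weighted up. The resulting per-file weight vectors are the kind of feature that could then be passed to a machine learning classifier for detection or group classification.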