http://dx.doi.org/10.3745/KTCCS.2019.8.2.35

Distributed Processing System Design and Implementation for Feature Extraction from Large-Scale Malicious Code  

Lee, Hyunjong (Department of Software, Dankook University)
Euh, Seongyul (Department of Software, Dankook University)
Hwang, Doosung (Department of Software, Dankook University)
Publication Information
KIPS Transactions on Computer and Communication Systems / v.8, no.2, 2019, pp. 35-40
Abstract
Traditional malware detection is vulnerable to malware that has been modified by polymorphism or obfuscation techniques. By learning patterns embedded in malware code, machine learning algorithms can detect similar behaviors and could replace current detection methods. Data must be collected continuously in order to learn malicious code patterns that change over time. However, storing and processing a large number of malware files incurs high space and time complexity. In this paper, an HDFS-based distributed processing system is designed to reduce space complexity and accelerate feature extraction. Using the distributed processing system, we extract two filtering-based API features, a 2-gram feature and an APICFG feature, and compare the generalization performance of ensemble learning models. In experiments, feature extraction was about 3.75 times faster than on a single computer, and space usage was about 5 times more efficient. Among the features compared, the 2-gram feature gave the best classification performance, but its learning time was long due to high dimensionality.
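As a rough illustration of the 2-gram API feature mentioned in the abstract, the sketch below counts adjacent API-call pairs in a map/reduce style. It is not the authors' implementation: it runs in plain Python rather than on Hadoop, the function names `map_api_2grams` and `reduce_counts` are hypothetical, the API call sequences are made-up examples, and the filtering step described in the paper is not modeled.

```python
from collections import Counter


def map_api_2grams(api_calls):
    """Map step (hypothetical): emit ((api_i, api_i+1), 1) for each adjacent pair."""
    return [((a, b), 1) for a, b in zip(api_calls, api_calls[1:])]


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the emitted values per 2-gram key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


# Made-up API call sequences standing in for the output of a static or
# dynamic analysis tool; real sequences would be far longer.
samples = {
    "sample_a": ["CreateFileA", "WriteFile", "CloseHandle"],
    "sample_b": ["CreateFileA", "WriteFile", "WriteFile", "CloseHandle"],
}

# One sparse 2-gram count vector per sample.
features = {
    name: reduce_counts(map_api_2grams(calls))
    for name, calls in samples.items()
}

for name, vector in features.items():
    print(name, dict(vector))
```

Because the number of possible 2-grams grows with the square of the API vocabulary size, vectors built this way become high-dimensional quickly, which is consistent with the long learning time the abstract reports for this feature.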
Keywords
Distributed Processing System; Malware Detection; Feature Extraction; Machine Learning;