http://dx.doi.org/10.3837/tiis.2019.12.019

Fast k-NN based Malware Analysis in a Massive Malware Environment  

Hwang, Jun-ho (Department of Information Security, College of Engineering, Hoseo University)
Kwak, Jin (Department of Cyber Security, College of Information Technology, Ajou University)
Lee, Tae-jin (Department of Information Security, College of Engineering, Hoseo University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.13, no.12, 2019, pp. 6145-6158
Abstract
Responding to the large volume of indiscriminately distributed malicious code, as well as to intelligent APT attacks, is a challenge for the security industry today. As a result, machine learning algorithms are being studied as a means of proactive prevention rather than after-the-fact response. The k-NN algorithm is widely used because it is intuitive and well suited to handling malicious code as unstructured data. In malicious code analysis, it also makes it easy to classify new samples on the basis of previously analyzed ones: for example, malicious code families can be classified, or variants analyzed, through similarity analysis against existing samples. The main disadvantage of the k-NN algorithm, however, is that search time grows with the amount of training data. We propose a fast k-NN algorithm that addresses this computation speed problem while retaining the advantages of k-NN. In the test environment, the proposed algorithm completed a search over 6.25 million malicious code samples with an average of only 19.71 similarity comparisons. Given how the algorithm works, fast k-NN can be used to search not only malware and SSDEEP hashes but any data that can be vectorized. In the future, wherever a k-NN approach is needed and central nodes can be selected effectively for clustering large amounts of data, we expect that it will be possible to design sophisticated machine-learning-based systems in a variety of environments.
Keywords
k-Nearest Neighbor; Clustering; Malware;
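The abstract describes the approach only at a high level: cluster the data around central nodes so that a query needs few similarity comparisons. The following is a minimal sketch of that general idea, not the authors' algorithm. It assumes samples are already vectorized, uses Euclidean distance as a stand-in for a similarity measure such as an SSDEEP comparison, and uses a single level of clustering with hand-picked centers. The names `build_index` and `fast_knn` are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two equal-length vectors (stand-in similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


def build_index(points, centers):
    """Assign every point to the cluster of its nearest central node."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[nearest_center(p, centers)].append(p)
    return clusters


def fast_knn(query, centers, clusters, k):
    """Search only the cluster whose center is nearest to the query.

    Returns the k closest members and the number of distance
    computations performed (centers compared + members compared).
    """
    members = clusters[nearest_center(query, centers)]
    comparisons = len(centers) + len(members)
    neighbors = sorted(members, key=lambda p: euclidean(query, p))[:k]
    return neighbors, comparisons


# Toy data: two well-separated groups, one hand-picked center per group.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (10.0, 10.0), (10.1, 9.9), (9.8, 10.2)]
centers = [points[0], points[3]]
clusters = build_index(points, centers)

neighbors, comparisons = fast_knn((10.05, 10.0), centers, clusters, k=2)
print(neighbors, comparisons)
```

A search restricted to one cluster is approximate, because a true nearest neighbor can sit just across a cluster boundary. The saving in comparisons also depends heavily on how the central nodes are chosen, which is the point the abstract raises for future work.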