• Title/Summary/Keyword: malware classification

API Feature Based Ensemble Model for Malware Family Classification (악성코드 패밀리 분류를 위한 API 특징 기반 앙상블 모델 학습)

  • Lee, Hyunjong;Euh, Seongyul;Hwang, Doosung
    • Journal of the Korea Institute of Information Security & Cryptology / v.29 no.3 / pp.531-539 / 2019
  • This paper proposes training features for malware family analysis and evaluates the multi-class classification performance of ensemble models. We construct training data by extracting API and DLL information from malware executables and use the Random Forest and XGBoost algorithms, both of which are based on decision trees. API, API-DLL, and DLL-CM features for malware detection and family classification are derived by analyzing the API and DLL information frequently used by malware and converting high-dimensional features into low-dimensional ones. The proposed feature selection method reduces data dimensionality and speeds up learning. In the performance comparison, the malware detection rate is 93.0% for Random Forest, the accuracy on the malware family dataset is 92.0% for XGBoost, and the false positive rate on the malware family dataset including benign samples is about 3.5% for both Random Forest and XGBoost.
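
The idea of keeping only frequently used APIs to shrink the feature space might be sketched as follows. This is a minimal illustration, not the authors' implementation: the sample API lists, the function names `select_frequent_apis` and `to_feature_vector`, and the cutoff `top_k` are all hypothetical, and the abstract does not state how frequency was measured.

```python
from collections import Counter

def select_frequent_apis(samples, top_k):
    """Return the top_k API names that appear in the most samples."""
    doc_freq = Counter()
    for apis in samples:
        doc_freq.update(set(apis))  # count each API once per sample
    # Sort by descending frequency, then by name so ties are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]

def to_feature_vector(apis, vocabulary):
    """Encode one sample as a binary vector over the selected APIs."""
    present = set(apis)
    return [1 if name in present else 0 for name in vocabulary]

# Hypothetical imported-API lists for three executables (assumed data).
samples = [
    ["CreateFileA", "WriteFile", "RegSetValueExA"],
    ["CreateFileA", "VirtualAlloc", "WriteFile"],
    ["CreateFileA", "InternetOpenA"],
]
vocabulary = select_frequent_apis(samples, top_k=2)
vectors = [to_feature_vector(s, vocabulary) for s in samples]
```

The resulting low-dimensional vectors could then be passed to any tree-based classifier; the choice of a binary encoding here is an assumption made for brevity.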

Method of Similarity Hash-Based Malware Family Classification (유사성 해시 기반 악성코드 유형 분류 기법)

  • Kim, Yun-jeong;Kim, Moon-sun;Lee, Man-hee
    • Journal of the Korea Institute of Information Security & Cryptology / v.32 no.5 / pp.945-954 / 2022
  • Billions of malicious code samples are detected every year, of which only 0.01% are new types of malware. An effective malware type classification tool is therefore needed, but previous approaches cannot quickly analyze large volumes of malicious code because they require complex and extensive data preprocessing. To solve this problem, this paper proposes a method that classifies malware types based on similarity hashes, without complex data preprocessing. The approach trains an XGBoost model on the similarity hash information of the malware. To evaluate it, we used the BIG-15 dataset, which is widely used in the field of malware classification. The malicious code was classified with an accuracy of 98.9%, and 3,432 benign files were identified with 100% accuracy. This result is superior to most recent studies that use complex preprocessing and deep learning models. The proposed approach is therefore expected to enable more efficient malware classification.

CNN-based Android Malware Detection Using Reduced Feature Set

  • Kim, Dong-Min;Lee, Soo-jin
    • Journal of the Korea Society of Computer and Information / v.26 no.10 / pp.19-26 / 2021
  • The performance of deep learning-based malware detection and classification models depends largely on how the feature set used for training is constructed. In this paper, we propose an approach for selecting the feature set that maximizes detection performance in CNN-based Android malware detection. The features to be included were selected with the Chi-Square test, which is widely used for feature selection in machine learning and deep learning. To validate the proposed approach, a CNN model was trained on 36 features selected from the CICAndMal2017 dataset, and its malware detection performance was measured. As a result, an accuracy of 99.99% was achieved in binary classification and 98.55% in multiclass classification.
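
Chi-square feature selection of the kind mentioned in this abstract can be illustrated with a small sketch. It computes the Pearson chi-square statistic from a contingency table of feature value against class label and keeps the highest-scoring features. The toy data, the function names, and the value of `k` are assumptions for illustration; the paper's exact procedure is not given in the abstract.

```python
def chi_square_score(feature, labels):
    """Pearson chi-square statistic for one discrete feature against class labels."""
    n = len(labels)
    score = 0.0
    for v in sorted(set(feature)):
        row_total = sum(1 for f in feature if f == v)
        for c in sorted(set(labels)):
            col_total = sum(1 for y in labels if y == c)
            observed = sum(1 for f, y in zip(feature, labels) if f == v and y == c)
            expected = row_total * col_total / n  # count expected under independence
            score += (observed - expected) ** 2 / expected
    return score

def select_top_features(rows, labels, k):
    """Return the sorted indices of the k highest-scoring features, plus all scores."""
    n_features = len(rows[0])
    scores = [chi_square_score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k]), scores

# Hypothetical binary features (e.g. permission present or absent) and labels.
rows = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign (assumed encoding)
selected, scores = select_top_features(rows, labels, k=2)
```

In this toy data the first feature matches the label exactly and scores highest, while the second is independent of the label and scores zero.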

Research on Malware Classification with Network Activity for Classification and Attack Prediction of Attack Groups (공격그룹 분류 및 예측을 위한 네트워크 행위기반 악성코드 분류에 관한 연구)

  • Lim, Hyo-young;Kim, Wan-ju;Noh, Hong-jun;Lim, Jae-sung
    • The Journal of Korean Institute of Communications and Information Sciences / v.42 no.1 / pp.193-204 / 2017
  • The security of Internet systems depends critically on keeping anti-virus (AV) software up to date and maintaining high detection accuracy against new malware. However, malware variants evolve so quickly that conventional signature-based detection cannot detect them. In this paper, we propose a malware classification method based on sequence patterns generated from the network flows of malware samples. We evaluated the method on 766 malware samples and obtained a classification accuracy of approximately 40.4%. In this study, malware was classified using only its network behavior, excluding code and other characteristics, so the method is expected to be developed further in the future. It could also help predict attack groups and prevent additional attacks.

Analysis of Malware Group Classification with eXplainable Artificial Intelligence (XAI기반 악성코드 그룹분류 결과 해석 연구)

  • Kim, Do-yeon;Jeong, Ah-yeon;Lee, Tae-jin
    • Journal of the Korea Institute of Information Security & Cryptology / v.31 no.4 / pp.559-571 / 2021
  • Along with the increased prevalence of computers, the distribution of malware by attackers to ordinary users has also increased. Research on malware detection continues to this day, and in recent years it has focused on detection and analysis using AI. However, AI algorithms have the disadvantage that they cannot explain why they detect and classify a sample as malware. XAI techniques have emerged to overcome this limitation and make AI practical: with XAI, it is possible to provide a basis for the AI's final judgment. In this paper, we performed malware group classification using XGBoost and Random Forest and interpreted the results with SHAP. Both classification models showed a high classification accuracy of about 99%, and comparing the top 20 API features derived through XAI with the main APIs of malware showed that the results could be interpreted and understood to a certain degree. Future work will build on this to study direct improvement of AI reliability.

Malware Classification Possibility based on Sequence Information (순서 정보 기반 악성코드 분류 가능성)

  • Yun, Tae-Uk;Park, Chan-Soo;Hwang, Tae-Gyu;Kim, Sung Kwon
    • Journal of KIISE / v.44 no.11 / pp.1125-1129 / 2017
  • LSTM (Long Short-Term Memory) is a kind of RNN (Recurrent Neural Network) in which the next state is updated by remembering previous states. The calling sequence of a malware sample can be defined as the system call function invoked at each time step. In this paper, we use the system call sequences of malware as input for malware classification, exploiting the LSTM's ability to remember previous states. We run an experiment to show that our method can classify malware and measure how accuracy changes with the length of the system call sequences.
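
Varying the sequence length, as this abstract describes, usually requires turning each system call trace into a fixed-length integer sequence before it reaches a recurrent model. The sketch below shows one common way to do that, by truncating long traces and zero-padding short ones. The call names, the function names, and the padding convention are assumptions, not details taken from the paper.

```python
def build_vocabulary(sequences):
    """Assign an integer id to each system call name; 0 is reserved for padding."""
    vocab = {}
    for seq in sequences:
        for call in seq:
            if call not in vocab:
                vocab[call] = len(vocab) + 1
    return vocab

def encode_fixed_length(seq, vocab, length):
    """Map calls to ids, truncate to `length`, and right-pad with zeros."""
    ids = [vocab[call] for call in seq[:length]]
    return ids + [0] * (length - len(ids))

# Hypothetical system call traces from two samples.
traces = [
    ["open", "read", "write", "close"],
    ["open", "mmap"],
]
vocab = build_vocabulary(traces)
encoded = [encode_fixed_length(t, vocab, length=3) for t in traces]
```

Repeating the encoding with different values of `length` gives the inputs needed to compare accuracy across sequence lengths.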

IoT Malware Detection and Family Classification Using Entropy Time Series Data Extraction and Recurrent Neural Networks (엔트로피 시계열 데이터 추출과 순환 신경망을 이용한 IoT 악성코드 탐지와 패밀리 분류)

  • Kim, Youngho;Lee, Hyunjong;Hwang, Doosung
    • KIPS Transactions on Software and Data Engineering / v.11 no.5 / pp.197-202 / 2022
  • IoT (Internet of Things) devices are attacked by malware because of many security vulnerabilities, such as weak IDs/passwords and unauthenticated firmware updates. However, the diversity of CPU architectures makes it difficult to set up a malware analysis environment and to design features. In this paper, we design time series features from the byte sequence of executable files so that the features are independent of CPU architecture, and analyze them using recurrent neural networks. The proposed feature is a fixed-length time series pattern extracted from the byte sequence by calculating partial entropy and applying linear interpolation. Temporal changes in the extracted feature are analyzed by RNN and LSTM. In the experiments, IoT malware detection showed high performance, while malware family classification showed low performance. When the entropy patterns of each malware family were compared visually, the Tsunami and Gafgyt families showed similar patterns, which explains the low performance. LSTM is more suitable than RNN for learning temporal changes in the proposed malware features.
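
The feature described in this abstract (partial entropy over the byte sequence, then linear interpolation to a fixed length) can be sketched roughly as follows. The non-overlapping window scheme, the window size, the output length, and all function names are assumptions; the abstract does not specify them.

```python
import math
from collections import Counter

def shannon_entropy(chunk):
    """Shannon entropy of a bytes object, in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in Counter(chunk).values())

def entropy_series(data, window):
    """Entropy of consecutive non-overlapping windows (assumed windowing scheme)."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]

def resample_linear(series, length):
    """Linearly interpolate a series to a fixed number of points."""
    if not series:
        return [0.0] * length
    if len(series) == 1 or length == 1:
        return [series[0]] * length
    out = []
    for k in range(length):
        pos = k * (len(series) - 1) / (length - 1)  # fractional index into series
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out

# Toy byte sequence: a constant block followed by a block of four distinct bytes.
data = bytes([0, 0, 0, 0, 0, 1, 2, 3])
series = entropy_series(data, window=4)
feature = resample_linear(series, length=5)
```

Because the series is resampled to a fixed number of points, executables of different sizes yield inputs of the same length for a recurrent model.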

Dimensionality Reduction of Feature Set for API Call based Android Malware Classification

  • Hwang, Hee-Jin;Lee, Soojin
    • Journal of the Korea Society of Computer and Information / v.26 no.11 / pp.41-49 / 2021
  • All application programs, including malware, call Application Programming Interfaces (APIs) when they execute. Exploiting this characteristic, attempts to detect and classify malware based on API call information have recently been actively studied. However, datasets containing API call information incur high computational cost and long processing times. In addition, information that has little bearing on malware classification may degrade the classification accuracy of the learning model. In this paper, we therefore propose a method for extracting an essential feature set after reducing the dimensionality of API call information by applying various feature selection methods. We used CICAndMal2020, a recently released Android malware dataset, for the experiments. After extracting the essential feature set with the various feature selection methods, Android malware classification was performed using a CNN (Convolutional Neural Network) and the results were analyzed. The results showed that the selected feature set and the weight priority vary with the feature selection method. In binary classification, malware was classified with 97% accuracy even when the feature set was reduced to 15% of its total size. In multiclass classification, an average accuracy of 83% was achieved while reducing the feature set to 8% of its total size.

Light-weight Classification Model for Android Malware through the Dimensional Reduction of API Call Sequence using PCA

  • Jeon, Dong-Ha;Lee, Soo-Jin
    • Journal of the Korea Society of Computer and Information / v.27 no.11 / pp.123-130 / 2022
  • Recently, studies on the detection and classification of Android malware based on API call sequences have been actively carried out. However, API call sequence-based malware classification has serious limitations, such as excessive time and resource consumption in malware analysis and learning model construction, due to the vast amount of data and the high dimensionality of the features. In this study, we analyzed various classification models, such as LightGBM, Random Forest, and k-Nearest Neighbors, after significantly reducing the feature dimension using PCA (Principal Component Analysis) on the CICAndMal2020 dataset, which contains vast API call information. The experimental results show that PCA significantly reduces the feature dimension while maintaining the characteristics of the original data and achieves efficient malware classification performance. Both binary and multi-class classification achieve higher accuracy than previous studies, even when the data features are reduced to less than 1% of the total size.

A Study on Variant Malware Detection Techniques Using Static and Dynamic Features

  • Kang, Jinsu;Won, Yoojae
    • Journal of Information Processing Systems / v.16 no.4 / pp.882-895 / 2020
  • The amount of malware increases exponentially every day and poses a threat to networks and operating systems. Most new malware is a variant of existing malware. Dealing with the numerous variants is difficult because they bypass existing signature-based malware detection methods. Research on automated methods for detecting and processing variant malware has therefore been conducted continuously. This report proposes a method of extracting feature data from files and detecting malware using machine learning. Feature data were extracted from 7,000 malware files and 3,000 benign files using static and dynamic malware analysis tools. A malware classification model was constructed using multiple DNN, XGBoost, and Random Forest layers, and its performance was analyzed. The proposed method achieved up to 96.3% accuracy.