
Dynamic RNN-CNN Malware Classifier for Arbitrary-Dimension Input Data

  • Received : 2018.12.17
  • Accepted : 2019.04.25
  • Published : 2019.05.31

Abstract

This study proposes a malware classification model that can handle input data of arbitrary length, using the Microsoft Malware Classification Challenge dataset. The approach builds on prior work that converts malware data into images: a large malware sample produces many images, while a small sample produces only a few. The generated image sequence is treated as time-series data and learned by a Dynamic RNN. Applying an attention mechanism, only the highest-weighted RNN output is kept, and this output is then learned again by a Residual CNN, which performs the final malware classification. In experiments, the proposed model achieved a micro-average F1 score of 92% on the validation set. These results confirm that the model can learn and classify data of arbitrary length without any special feature extraction or dimensionality reduction.
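As an illustration of the pipeline described above, the sketch below is a minimal PyTorch implementation of one possible reading of the architecture (it is not the authors' code): the byte stream is split into a variable-length sequence of 64×64 grayscale images, a GRU encodes the sequence, an attention layer scores every time step and only the highest-weighted output is kept, and that vector is reshaped into a feature map for a small residual CNN that predicts the nine challenge classes. The chunk size, hidden width, and reshape dimensions are assumptions, not values taken from the paper.

# Minimal sketch of the Dynamic RNN-CNN pipeline (assumed sizes, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

CHUNK = 64 * 64          # bytes per generated "image" (assumed 64x64)
NUM_CLASSES = 9          # Microsoft Malware Classification Challenge classes

def bytes_to_images(data: bytes) -> torch.Tensor:
    """Split a malware byte string into a variable-length sequence of
    64x64 grayscale images; the last chunk is zero-padded."""
    buf = torch.zeros(((len(data) + CHUNK - 1) // CHUNK) * CHUNK)
    buf[: len(data)] = torch.tensor(list(data), dtype=torch.float32) / 255.0
    return buf.view(-1, 1, 64, 64)           # (seq_len, 1, 64, 64)

class ResidualBlock(nn.Module):
    """3x3 conv block with an identity shortcut (cf. Fig. 3)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class DynamicRNNCNN(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.embed = nn.Linear(64 * 64, hidden)   # per-image embedding
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # scalar attention score per time step
        self.res = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), ResidualBlock(16), ResidualBlock(16)
        )
        self.fc = nn.Linear(16 * 32 * 32, NUM_CLASSES)

    def forward(self, images):                     # images: (seq_len, 1, 64, 64)
        seq = self.embed(images.flatten(1)).unsqueeze(0)   # (1, seq_len, hidden)
        out, _ = self.rnn(seq)                     # (1, seq_len, hidden)
        scores = self.attn(out).squeeze(-1)        # (1, seq_len)
        best = out[0, scores.argmax(dim=1)]        # keep only the highest-weighted output
        fmap = best.view(1, 1, 32, 32)             # reshape the 1024-d vector to a 32x32 map
        return self.fc(self.res(fmap).flatten(1))  # (1, NUM_CLASSES) logits

logits = DynamicRNNCNN()(bytes_to_images(b"\x4d\x5a" + bytes(5000)))
print(logits.shape)   # torch.Size([1, 9])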

Keywords

Fig. 1 Variable length encoder

Fig. 2 Classifier with Residual Convolutional Layers

Fig. 3 Residual Conv 3×3 Layer in Fig. 2

Fig. 4 Loss and accuracy graph

Fig. 5 Normalized confusion matrix

Fig. 6 Micro-average ROC curve

Fig. 7 Macro-average ROC curve

Table 1 Number of training and validation data for each class

Table 2 Computer Hardware Specifications for Experiments


Cited by

  1. Sensor Data Collection and Refining System for Machine Learning-Based Cloud, vol. 25, no. 2, 2021, https://doi.org/10.6109/jkiice.2021.25.2.165