Modeling and Selecting Optimal Features for Machine Learning Based Detections of Android Malwares

Lee, Kye Woong; Oh, Seung Taek; Yoon, Young

doi:10.3745/KTSDE.2019.8.11.427

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 8, Issue 11 / Pages 427-432 / 2019 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)

Proposal of Optimal Feature Selection and Modeling Methods for Machine Learning Based Detection of Malicious Android Mobile Apps

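Kye Woong Lee (Computer Engineering Major, Hongik University);
Seung Taek Oh (Netcoretech Co., Ltd.);
Young Yoon (Department of Computer Engineering, Hongik University)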

Received : 2019.07.08
Accepted : 2019.09.04
Published : 2019.11.30

Abstract

In this paper, we propose three approaches to modeling Android malware. The first relies on human security experts to meticulously select the feature set. The second selects the 300 features with the highest importance from among the features in the top 99% by occurrence rate. The third combines multiple models and identifies malware through weighted voting. In addition, we applied a novel method of eliminating permission information, which used to be regarded as a critical factor for distinguishing malware. With our carefully generated feature sets and weighted voting by the ensemble algorithm, we reached a highest malware detection accuracy of 97.8%. We also verified that discarding the permission information led to improvements in both the false positive and false negative rates.
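The weighted voting used in the third approach can be illustrated with a minimal sketch. The abstract does not say which models form the ensemble or how their weights are chosen, so the model names, the weights, and the 0.5 decision threshold below are all hypothetical.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary predictions (1 = malware, 0 = benign) by weighted voting.

    predictions: dict mapping a model name to its 0/1 prediction for one app.
    weights:     dict mapping a model name to a non-negative weight.
    The model names, weights, and threshold are illustrative assumptions,
    not values taken from the paper.
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted share of the models that voted "malware".
    score = sum(weights[name] * predictions[name] for name in predictions) / total
    return 1 if score > threshold else 0


# Hypothetical ensemble of three models with integer weights.
weights = {"model_a": 3, "model_b": 1, "model_c": 2}
verdict = weighted_vote({"model_a": 1, "model_b": 0, "model_c": 1}, weights)
```

Here the two models voting "malware" hold 5 of the 6 weight units, so the app is flagged. With equal weights the scheme reduces to plain majority voting.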

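As Android's share of the mobile operating system market has grown, most mobile malware threats now arise on Android. However, as both benign and malicious apps evolve, approaches that judge maliciousness from a single feature type such as permissions have run into validity problems, and we seek to overcome this through the extraction of diverse features and machine learning. In this paper, we use the static analysis tool Androguard to extract from APK files five types of features required for execution, which characterize the training data. We also present three methods of modeling based on the important features extracted. The first method builds a model from a combination of 132 features carefully selected by security experts. The second takes the 8,004 features that fall in the top 99% by occurrence frequency across the 7,000 apps in the training data, uses a random forest classifier to select the 300 with the highest feature importance, and then builds a model from them. The last method integrates multiple models trained on the 300 features into a single weighted voting model. In addition, to improve the false positive and false negative rates, we reconstructed the feature set with all permission information excluded and modeled it in the same setting as above. Finally, using the ensemble algorithm model, that is, the weighted voting model, accuracy improved to 97.8%, and the false positive rate was confirmed to have improved to 1.9%.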
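The two-stage selection in the second method (a frequency filter followed by an importance ranking) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: "top 99%" is read here as keeping the top 99% of features when ranked by the number of apps that contain them, the feature names are hypothetical, and the importance scores are passed in precomputed (in practice they could come from a fitted random forest, for example the `feature_importances_` attribute of scikit-learn's `RandomForestClassifier`).

```python
from collections import Counter


def frequent_features(apps, keep_fraction=0.99):
    """Rank features by how many apps contain them and keep the top fraction.

    apps: list of feature-name sets, one per app.
    Reading "top 99%" as a cut on this ranking is an assumption.
    """
    counts = Counter(f for app in apps for f in set(app))
    # Descending count, then name, so that ties are resolved deterministically.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return ranked[: int(len(ranked) * keep_fraction)]


def top_k_by_importance(candidates, importance, k=300):
    """Return the k candidate features with the highest importance score."""
    ranked = sorted(candidates, key=lambda f: (-importance.get(f, 0.0), f))
    return ranked[:k]


# Toy data with hypothetical feature names and importance scores.
apps = [
    {"api:sendTextMessage", "intent:BOOT_COMPLETED"},
    {"api:sendTextMessage", "service:Updater"},
    {"api:sendTextMessage", "intent:BOOT_COMPLETED", "receiver:SmsHook"},
]
importance = {
    "api:sendTextMessage": 0.1,
    "intent:BOOT_COMPLETED": 0.5,
    "receiver:SmsHook": 0.3,
}
candidates = frequent_features(apps, keep_fraction=0.75)
selected = top_k_by_importance(candidates, importance, k=2)
```

On this toy input the filter keeps three of the four features and the ranking then picks the two with the largest scores. With the figures in the abstract, the same two steps would narrow the feature set to 8,004 candidates and then to 300.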
