Detection of Malicious PDF based on Document Structure Features and Stream Objects

  • Kang, Ah Reum (Dept. of Big Data Engineering, Soonchunhyang University) ;
  • Jeong, Young-Seob (Dept. of Big Data Engineering, Soonchunhyang University) ;
  • Kim, Se Lyeong (Korea Internet & Security Agency(KISA)) ;
  • Kim, Jonghyun (Electronics and Telecommunication Research Institute (ETRI)) ;
  • Woo, Jiyoung (Dept. of Big Data Engineering, Soonchunhyang University) ;
  • Choi, Sunoh (Electronics and Telecommunication Research Institute (ETRI))
  • Received : 2018.10.01
  • Accepted : 2018.11.01
  • Published : 2018.11.30

Abstract

In recent years, there has been an increasing number of ways to distribute document-based malicious code by exploiting vulnerabilities in document files. Because document-type malware is not itself an executable file, it can easily bypass existing security programs, so research on a model to detect it is necessary. In this study, we extract the main features from the document structure and from the JavaScript contained in stream objects. In addition, when JavaScript is inserted, we extract keywords that occur frequently in malicious code, such as function names, reserved words, and readable strings in the script. We then build a machine learning model that distinguishes between benign and malicious documents. To make the detector difficult to bypass, we aim for good performance with a black-box algorithm. For the experiment, we analyze a larger number of documents than previous studies did. Experimental results show a 98.9% detection rate across three different types of algorithms. SVM, a black-box algorithm that makes obfuscation difficult, shows much higher performance than in previous studies.
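As a rough illustration of the kind of feature extraction the abstract describes, the following Python sketch counts structural keywords in raw PDF bytes and keyword occurrences in an embedded script. The keyword lists, function names, and sample document are assumptions made for this example; the abstract does not list the features the authors actually use, and this sketch does not decode compressed streams.

```python
import re
from collections import Counter

# Hypothetical keyword lists chosen for illustration only;
# they are NOT the feature set used in the paper.
STRUCTURE_KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/Page"]
JS_KEYWORDS = ["eval", "unescape", "function", "var"]


def count_structure_features(pdf_bytes: bytes) -> dict:
    """Count occurrences of selected PDF name objects and object markers."""
    features = {}
    for kw in STRUCTURE_KEYWORDS:
        # The lookahead keeps /Page from also matching /Pages.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        features[kw.decode()] = len(re.findall(pattern, pdf_bytes))
    # Indirect objects look like "12 0 obj"; "endobj" is not counted here.
    features["obj"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", pdf_bytes))
    # \b prevents "endstream" from being counted as "stream".
    features["stream"] = len(re.findall(rb"\bstream\b", pdf_bytes))
    return features


def count_js_keywords(script: str) -> dict:
    """Count how often each selected keyword appears as an identifier."""
    tokens = Counter(re.findall(r"[A-Za-z_$][\w$]*", script))
    return {kw: tokens[kw] for kw in JS_KEYWORDS}


# A tiny hand-written sample, not a complete or valid PDF file.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>\nendobj\n"
    b"3 0 obj\n<< /S /JavaScript /JS (var s = unescape('%41'); eval(s);) >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)

structure = count_structure_features(sample)
script = count_js_keywords("var s = unescape('%41'); eval(s);")
print(structure)
print(script)
```

The two dictionaries could be concatenated into one numeric feature vector per document and passed to a classifier such as an SVM, which is the type of model the abstract reports on.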

Fig. 1. PDF structure

Fig. 2. trailer

Fig. 4. body

Fig. 5. benign (above) vs. malicious (below) file

Fig. 3. Feature importance

Fig. 3. cross-reference table

Table 1. PDF Version

Table 2. PDF Feature Statistics

Table 3. Algorithm performance
