File Type Identification Using CNN and GRU

File Type Identification and Classification Using CNN and GRU (translated from the Korean title)

  • 성민규 (Cyber Security Major, Ajou University)
  • 손태식 (Department of Cyber Security, Ajou University)
  • Received : 2024.01.10
  • Accepted : 2024.03.02
  • Published : 2024.04.30

Abstract

The rapid growth of digital data has made digital forensics increasingly important, and file type identification is one of its core tasks. To identify file types quickly and accurately, researchers have been developing identification models based on artificial intelligence. However, existing studies cannot identify several file types that are widely used in Korea, which limits their usefulness there. This paper therefore proposes a more accurate and robust file type identification model that combines Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU). The proposed model shows the best performance on the FFT-75 dataset and also effectively identifies file types with a high usage share in Korea, such as HWP, ALZ, and EGG. Its performance is validated through comparison with three existing models (CNN-CO, FiFTy, and CNN-LSTM). The CNN- and GRU-based file type identification and classification model achieved 68.2% accuracy on 512-byte file fragments and 81.4% accuracy on 4096-byte file fragments.
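
As a rough illustration of the fragment sizes reported in the abstract, the following Python sketch cuts a byte string into 512-byte or 4096-byte fragments, converts each fragment into a sequence of integer byte values that a byte-level classifier could consume, and computes accuracy as the share of correctly labelled fragments. The function names, the choice to drop a short trailing fragment, and the toy data are assumptions made for this example; they are not taken from the paper or from the FFT-75 tooling.

```python
from typing import List, Sequence


def split_into_fragments(data: bytes, fragment_size: int) -> List[bytes]:
    """Cut data into non-overlapping fragments of exactly fragment_size bytes.

    Assumption for this sketch: a trailing piece shorter than
    fragment_size is dropped rather than padded.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    usable = len(data) - len(data) % fragment_size
    return [data[i:i + fragment_size] for i in range(0, usable, fragment_size)]


def to_model_input(fragment: bytes) -> List[int]:
    """Represent a fragment as a list of byte values in the range 0..255."""
    return list(fragment)


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of fragments whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one label is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


if __name__ == "__main__":
    # Toy stand-in for file content: 10,240 bytes of repeating values.
    data = bytes(range(256)) * 40

    small = split_into_fragments(data, 512)
    large = split_into_fragments(data, 4096)
    print(len(small), len(large))  # 20 2

    # Hypothetical labels, only to show how accuracy is computed.
    predicted = ["hwp", "alz", "egg", "pdf"]
    actual = ["hwp", "alz", "egg", "zip"]
    print(accuracy(predicted, actual))  # 0.75
```

In this sketch a 4096-byte fragment carries eight times as many bytes as a 512-byte fragment, so a classifier sees more context per sample at the larger size.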

References

  1. M. C. Amirani, M. Toorani, and A. Beheshti Shirazi, "A New approach to Content-based File Type Detection," IEEE Symposium on Computers and Communications, 2008, pp. 1103-1108.
  2. Jonghoon Won, Minji Kang, Jisung Park, Jihong Kim, "File-Fragment Type Identification using Selected N-grams by Apriori Algorithm," in Proc. Korea Software Congress (KSC). Gangwon State, 2018, pp. 1459-1461.
  3. F. Mansouri Hanis and M. Teimouri, "Dataset for file fragment classification of textual file formats," BMC Res. Notes, vol. 12, no. 1, p. 801, Dec. 2019.
  4. S. Fitzgerald, G. Mathews, C. Morris, and O. Zhulyn, "Using NLP techniques for file fragment classification," Digit. Invest., vol. 9, pp. S44-S49, 2012. https://doi.org/10.1016/j.diin.2012.05.008
  5. N. L. Beebe, L. A. Maddox, L. Liu, and M. Sun, "Sceadan: Using concatenated N-Gram vectors for improved file and data type classification," IEEE Trans. Inf. Forensics Security, vol. 8, no. 9, pp. 1519-1530, 2013. https://doi.org/10.1109/TIFS.2013.2274728
  6. T. Xu, M. Xu, Y. Ren, J. Xu, H. Zhang, and N. Zheng, "A file fragment classification method based on grayscale image," J. Comput., vol. 9, no. 8, pp. 1863-1870, 2014.
  7. N. Zheng, J. Wang, T. Wu, and M. Xu, "A fragment classification method depending on data type," in Proc. IEEE Int. Conf. Comput. Inf. Technol.; Ubiquitous Comput. Commun.; Dependable, Autonomic Secure Comput.; Pervasive Intell. Comput., pp. 1948-1953, 2015.
  8. N. Beebe, L. Liu, and M. Sun, "Data type classification: Hierarchical class-to-type modeling," in Advances in Digital Forensics XII (IFIP Advances in Information and Communication Technology). New Delhi, India: Springer, pp. 325-343, 2016.
  9. Q. Chen et al., "File Fragment Classification Using Grayscale Image Conversion and Deep Learning in Digital Forensics," 2018 IEEE Security and Privacy Workshops (SPW), pp. 140-147, May 2018.
  10. Manish Bhatt, Avdesh Mishra, Md. Wasi Ul Kabir, S. E. Blake-Gatto, Rishav Rajendra, Tamjidul Hoque, and Irfan Ahmed, "Hierarchy-Based File Fragment Classification," Mach. Learn. Knowl. Extr., vol. 2, no. 3, pp. 216-232, 2020. https://doi.org/10.3390/make2030012
  11. Anirudh Bhat, Aryan Likhite, Swaraj Chavan, and Leena Ragha, "File Fragment Classification using Content Based Analysis," ITM Web of Conferences, 2021.
  12. G. Mittal, P. Korus and N. Memon, "FiFTy: Large-Scale File Fragment Type Identification Using Convolutional Neural Networks," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 28-41, 2021. https://doi.org/10.1109/TIFS.2020.3004266
  13. Md. Enamul Haque and Mehmet Engin Tozal, "Byte embeddings for file fragment classification," Future Generation Computer Systems, vol. 127, pp. 448-461, 2022. https://doi.org/10.1016/j.future.2021.09.019
  14. K. M. Saaim, M. Felemban, S. Alsaleh, and A. Almulhem, "Light-Weight File Fragments Classification Using Depthwise Separable Convolutions," IFIP Adv. Inf. Commun. Technol., vol. 648, pp. 196-211, 2022. https://doi.org/10.1007/978-3-031-06975-8_12
  15. M. Ghaleb, K. Saaim, M. Felemban, S. Al-Saleh, and A. Al-Mulhem, "File Fragment Classification using Light-Weight Convolutional Neural Networks," arXiv, May 01, 2023.
  16. Nan Zhu, Yang Liu, Kun Wang and Changyou Ma, "File Fragment Type Identification Based on CNN and LSTM," Proceedings of the 2023 7th International Conference on Digital Signal Processing, Association for Computing Machinery, New York, NY, USA, pp. 16-22, 2023.
  17. Govind Mittal, Pawel Korus, and Nasir Memon, File Fragment Type (FFT)-75 Dataset [Online]. Available: http://dx.doi.org/10.21227/kfxw-8084.
  18. Simson Garfinkel, Paul Farrell, Vassil Roussev, and George Dinolt, "Bringing science to digital forensics with standardized forensic corpora," Digit. Investig., vol. 6, pp. S2-S11, 2009. https://doi.org/10.1016/j.diin.2009.06.016