Spam Image Detection Model based on Deep Learning for Improving Spam Filter

  • Received : 2021.01.14
  • Accepted : 2022.02.26
  • Published : 2023.06.30

Abstract

With the development and spread of modern technology, anyone can easily communicate through services such as social network services (SNS) on a personal computer (PC) or smartphone. These technologies have brought many benefits, but they have also brought harmful effects, one of which is spam. Spam is unwanted or rejected information sent to unspecified users. Continuous exposure to such information inconveniences service users, and if it is not filtered correctly, the quality of service deteriorates. Recently, spammers (the people who send spam) have been creating more malicious spam by rendering spam text as distorted images so that optical character recognition (OCR)-based spam filters cannot easily detect it. Fortunately, the degree of transformation applied to image spam circulating on social media is not yet severe. In mail systems, however, spammers have applied various modifications to spam images to neutralize OCR, so the same situation can arise with spam images on social media. Spammers have been shown to interfere with OCR reading through transformations such as geometric distortion, noise addition, and blurring. Various techniques for filtering image spam have been studied, but methods of evading image spam identification with obfuscated images continue to develop as well. In this paper, we propose a deep learning-based spam image detection model that improves on existing OCR-based spam image detection performance and compensates for its vulnerabilities. The proposed model extracts text features and image features from the image using four sub-models. First, the OCR-based text model extracts the text-related features from the input image: whether the image contains spam words, and the word embedding vector. Then, the convolutional neural network (CNN)-based image model extracts image obfuscation and image feature vectors from the input image. The final spam image classifier then uses the extracted features to determine whether the image is spam. In an evaluation by F1-score, the proposed model performed about 14 points higher than OCR-based spam image detection.
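The feature-fusion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the layers, weights, or vocabulary of the proposed model, so everything named here is an assumption made for illustration: the `SPAM_WORDS` list, the two-dimensional placeholder embeddings and image features, the hand-set `WEIGHTS` and `BIAS`, and the use of a single logistic unit as a stand-in for the final classifier. The OCR and CNN stages are represented only by their outputs, which are passed in as plain values.

```python
import math
from typing import List, Sequence

# Hypothetical spam vocabulary; the paper's actual word list is not given.
SPAM_WORDS = {"free", "winner", "casino"}


def spam_word_feature(ocr_text: str) -> float:
    """Return 1.0 if the OCR output contains a known spam word, else 0.0."""
    tokens = (t.strip(".,!?") for t in ocr_text.lower().split())
    return 1.0 if any(t in SPAM_WORDS for t in tokens) else 0.0


def fuse_features(
    spam_word: float,
    word_embedding: Sequence[float],
    obfuscation: float,
    image_features: Sequence[float],
) -> List[float]:
    """Concatenate the four sub-model outputs into one feature vector."""
    return [spam_word, *word_embedding, obfuscation, *image_features]


def spam_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic unit used here as a stand-in for the final classifier."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hand-set illustrative parameters (assumed, not learned and not from the paper).
# Order: spam word flag, 2-d embedding, obfuscation score, 2-d image features.
WEIGHTS = [3.0, 0.5, 0.5, 3.0, 0.5, 0.5]
BIAS = -1.5

if __name__ == "__main__":
    # Clean spam image: OCR reads the text, so the spam-word flag fires.
    clean = fuse_features(
        spam_word_feature("You are a WINNER! Free casino chips"),
        [0.2, -0.1], 0.1, [0.3, 0.1],
    )
    # Obfuscated spam image: OCR output is garbled, but the image branch
    # reports a high obfuscation score.
    obfuscated = fuse_features(
        spam_word_feature("Y0u ar3 a W1NN3R"),
        [0.0, 0.0], 0.9, [0.3, 0.1],
    )
    print("clean spam:", spam_probability(clean, WEIGHTS, BIAS) > 0.5)
    print("obfuscated spam:", spam_probability(obfuscated, WEIGHTS, BIAS) > 0.5)
```

In this toy setting, a filter that relied on the spam-word flag alone would miss the obfuscated image, whereas the fused vector still carries the obfuscation signal from the image branch. That is the kind of compensation the abstract attributes to combining text and image features.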

Acknowledgement

This research was the result of a study on the "HPC Support" Project, supported by the Ministry of Science and ICT and NIPA. This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2020R1I1A3073313).
