A Novel Image Captioning based Risk Assessment Model

  • Min-Sung Jeon (Department of Computer Science, Graduate School, Chungbuk National University) ;
  • Jae-Pil Ko (Department of Computer Engineering, Kumoh National Institute of Technology) ;
  • Kyung-Joo Cheoi (Department of Software, College of Electrical and Computer Engineering, Chungbuk National University)
  • Received : 2023.11.09
  • Accepted : 2023.12.06
  • Published : 2023.12.31

Abstract

Purpose
We introduce a surveillance system designed to overcome a key limitation of conventional surveillance systems: their narrow focus on object-centric behavior analysis.

Design/methodology/approach
We propose a risk-assessment approach that uses image captioning to generate descriptive captions capturing the interactions among objects, actions, and spatial elements in an observed scene. To support this methodology, we built a dedicated training dataset of [image-caption-danger score] pairs. We fine-tuned the BLIP-2 model on this dataset and used BERT to interpret the semantic content of the generated captions when assigning risk levels.

Findings
In experiments on our self-constructed dataset, we show that the dataset provides rich information for risk assessment and that our model performs strongly on this task. Compared with models pre-trained on established datasets, our generated captions more fully capture the object attributes, behaviors, and spatial context that the surveillance system requires, and they adapt to novel sentence structures, making them versatile across a range of contexts.
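To make the two-stage pipeline concrete, the following is a minimal sketch (not the authors' released code) of the caption-then-score flow described above, written with the Hugging Face transformers library. The public checkpoint names (Salesforce/blip2-opt-2.7b, bert-base-uncased), the five-level risk scale, and the file scene.jpg are illustrative assumptions; the paper's fine-tuned weights and [image-caption-danger score] dataset are not reproduced here.

```python
# Sketch of the two-stage pipeline the abstract describes:
# (1) caption a frame with BLIP-2, (2) map the caption to a danger
# score with a BERT classifier. Checkpoint names, the 5-level risk
# scale, and "scene.jpg" are illustrative assumptions.
import torch
from PIL import Image
from transformers import (
    Blip2ForConditionalGeneration,
    Blip2Processor,
    BertForSequenceClassification,
    BertTokenizer,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 1: image -> caption. In the paper this model is fine-tuned on
# the [image-caption-danger score] pairs; here we load a public base.
cap_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
cap_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("scene.jpg")  # hypothetical surveillance frame
inputs = cap_processor(images=image, return_tensors="pt").to(device, dtype)
generated = cap_model.generate(**inputs, max_new_tokens=40)
caption = cap_processor.batch_decode(generated, skip_special_tokens=True)[0].strip()

# Stage 2: caption -> danger score. The classification head below is
# freshly initialized; it would be fine-tuned on caption/danger-score
# labels before its predictions mean anything.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
scorer = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5  # assumed 5-level danger scale
).to(device)

encoded = tokenizer(caption, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    logits = scorer(**encoded).logits
risk_level = int(logits.argmax(dim=-1))
print(f"caption: {caption!r} -> predicted risk level: {risk_level}")
```

The same structure leaves the paper's specific design choices open: BLIP-2 could be adapted with LoRA rather than full fine-tuning, and the danger score could be predicted by regression instead of the assumed five-way classification.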

Acknowledgement

This work was supported by a research grant from Chungbuk National University in the 2022 academic year.

References

  1. Ministry of the Interior and Safety, "Comprehensive Countermeasures for the Reorganization of the National Safety System," 2023.1.27.
  2. Alairaji, R. A., Aljazaery, I. A., and ALRikabi, H. S., "Abnormal behavior detection of students in the examination hall from surveillance videos," In Advanced Computational Paradigms and Hybrid Intelligent Computing, 2021, pp. 113-125.
  3. Chang, C. W., Chang, C. Y., and Lin, Y. Y., "A hybrid CNN and LSTM-based deep learning model for abnormal behavior detection," Multimedia Tools and Applications, Vol. 81, No. 9, 2022, pp. 11825-11843. https://doi.org/10.1007/s11042-021-11887-9
  4. Chen, W., Ma, K. T., Yew, Z. J., Hur, M., and Khoo, D. A., "TEVAD: Improved video anomaly detection with captions," IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5548-5558.
  5. Devlin, J., Chang, M. W., Lee, K., and Toutanova, K., "BERT: Pre-training of deep bidirectional transformers for language understanding," Proceedings of NAACL-HLT, 2019, pp. 4171-4186.
  6. Dilawari, A., Khan, M. U. G., Al-Otaibi, Y. D., Rehman, Z. U., Rahman, A. U., and Nam, Y., "Natural language description of videos for smart surveillance," Applied Sciences, Vol. 11, No. 9, 2021, pp. 3730-3741. https://doi.org/10.3390/app11093730
  7. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al., "An image is worth 16x16 words: Transformers for image recognition at scale," International Conference on Learning Representations, 2021, arXiv:2010.11929.
  8. Duan, J., Yu, S., Tan, N., Yi, L., and Tan, C., "BOSS: A benchmark for human belief prediction in object-context scenarios," 2022, arXiv:2206.10665.
  9. Graves, A., Fernandez, S., and Schmidhuber, J., "Bidirectional LSTM networks for improved phoneme classification and recognition," International Conference on Artificial Neural Networks, 2005, pp. 799-804.
  10. He, K., Zhang, X., Ren, S., and Sun, J., "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
  11. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W., "LoRA: Low-rank adaptation of large language models," 2021, arXiv:2106.09685v2.
  12. Jha, S., Seo, C., Yang, E., and Joshi, G. P., "Real time object detection and tracking system for video surveillance system," Multimedia Tools and Applications, Vol. 80, 2021, pp. 3981-3996. https://doi.org/10.1007/s11042-020-09749-x
  13. Kingma, D. P., and Ba, J., "Adam: A method for stochastic optimization," International Conference on Learning Representations, 2015, arXiv:1412.6980.
  14. Li, J., Li, D., Savarese, S., and Hoi, S., "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," 2023, arXiv:2301.12597.
  15. Lin, K., Li, L., Lin, C. C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., and Wang, L., "SwinBERT: End-to-end transformers with sparse attention for video captioning," IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17949-17958.
  16. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollar, P., "Microsoft COCO: Common objects in context," European Conference on Computer Vision, 2014, pp. 740-755.
  17. OpenAI, "GPT-4 technical report," 2023.
  18. Perez, M., Kot, A. C., and Rocha, A., "Detection of real-world fights in surveillance videos," IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 2662-2666.
  19. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I., "Improving language understanding by generative pre-training," 2018.
  20. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., and Jitsev, J., "LAION-5B: An open large-scale dataset for training next generation image-text models," Advances in Neural Information Processing Systems, Vol. 35, 2022, pp. 25278-25294.
  21. Simonyan, K., and Zisserman, A., "Very deep convolutional networks for large-scale image recognition," 3rd International Conference on Learning Representations, 2015, pp. 1-14.
  22. Sultani, W., Chen, C., and Shah, M., "Real-world anomaly detection in surveillance videos," IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479-6488.
  23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., "Attention is all you need," Advances in Neural Information Processing Systems, Vol. 30, 2017.
  24. Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z., "Not only look, but also listen: Learning multimodal violence detection under weak supervision," 16th European Conference on Computer Vision, 2020, pp. 322-339.
  25. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L., "OPT: Open pre-trained transformer language models," 2022, arXiv:2205.01068v4.
  26. Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H., "OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework," International Conference on Machine Learning, 2022, pp. 23318-23340.