Optimization of Action Recognition based on Slowfast Deep Learning Model using RGB Video Data

Korean title (translated): Optimization of Abnormal Behavior Recognition Based on the Slowfast Model Using RGB Video Data

  • Jeong, Jae-Hyeok (Dept. of Human Intelligence & Robot Engineering, Sangmyung University)
  • Kim, Min-Suk (Dept. of Human Intelligence & Robot Engineering, Sangmyung University)
  • Received : 2022.08.14
  • Accepted : 2022.08.25
  • Published : 2022.08.31

Abstract

Human Action Recognition (HAR), together with related tasks such as anomaly and object detection, has become an active research area that applies Artificial Intelligence (AI) methods to analyze patterns of human action in crime-prone areas, media services, and industrial facilities. HAR is particularly important for real-time systems that process video streaming data, and many HAR-based applications are currently being developed and improved. In this paper, we propose and analyze a deep-learning-based HAR scheme that is more efficient and can be applied to media services that use RGB video streaming data, without feature-extraction pre-processing. As the method, we adopt Slowfast, a Deep Neural Network (DNN) model, and apply it to an open dataset (HMDB-51 or UCF101) to improve prediction accuracy.
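To illustrate why a Slowfast-style model can work directly on RGB frames, the sketch below shows the two-pathway temporal sampling described in the original SlowFast paper by Feichtenhofer et al.: a slow pathway reads a few frames at a large stride, and a fast pathway reads many frames at a stride that is smaller by a factor alpha. The function name `sample_pathways` is hypothetical, and the defaults (slow stride 16, alpha 8) come from that paper, not from this abstract, which does not state the settings used by the authors.

```python
def sample_pathways(num_frames, slow_stride=16, alpha=8):
    """Return (slow_indices, fast_indices) for one clip of raw RGB frames.

    slow_stride and alpha are assumed defaults from the SlowFast paper;
    the abstract above does not specify the values used in this work.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    # The fast pathway samples alpha times more densely than the slow one.
    fast_stride = slow_stride // alpha
    slow = list(range(0, num_frames, slow_stride))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


# A 64-frame clip yields 4 slow-pathway frames and 32 fast-pathway frames.
slow_idx, fast_idx = sample_pathways(64)
print(len(slow_idx), len(fast_idx))  # 4 32
```

Both index lists refer to the same raw RGB clip, so no separate feature-extraction step such as skeleton or pose estimation is involved at this stage.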
