
A New Residual Attention Network based on Attention Models for Human Action Recognition in Video

  • Kim, Jee-Hyun (Dept. of Software Engineering, Seoil University)
  • Cho, Young-Im (Dept. of Computer Engineering, Gachon University)
  • Received : 2019.11.18
  • Accepted : 2020.01.20
  • Published : 2020.01.31

Abstract

With the development of deep learning technology and advances in computing power, video-based research is attracting increasing attention. The biggest difference between video data and image data is that video contains a large amount of temporal as well as spatial information, and consequently a far greater volume of data; for this reason video has drawn intense interest in computer vision, and action recognition is one of its main research focuses. Recognizing human actions in video, however, remains an extremely complex and challenging problem. Studies of human cognition have shown that attention mechanisms, when mimicked in artificial intelligence, form an efficient model of perception, and such a model is well suited to processing both image information and complex continuous video information. We introduce this attention mechanism into video action recognition, attending to the human actions in a video and effectively improving recognition efficiency. In this paper, we propose a new 3D residual attention network, a convolutional neural network built on two attention models, to recognize human actions in video. In our evaluation, the proposed model achieved up to 90.7% accuracy.
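The core idea of a residual attention block, gating convolutional features with attention maps and adding an identity shortcut, can be sketched as follows. This is a minimal NumPy illustration only: it assumes the two attention models are a channel attention (squeeze-and-excitation style) and a spatial attention, uses random weights in place of learned layers, and omits the 3D convolutions and exact layer sizes of the proposed network, none of which are specified in the abstract.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    # x: (C, T, H, W) feature volume from a 3D conv layer.
    # Squeeze: global average pool over time and space -> one value per channel.
    squeezed = x.mean(axis=(1, 2, 3))                    # shape (C,)
    # Excite: a tiny bottleneck; random weights stand in for learned FC layers.
    rng = np.random.default_rng(0)
    c = x.shape[0]
    w1 = rng.standard_normal((c // 2, c)) * 0.1
    w2 = rng.standard_normal((c, c // 2)) * 0.1
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))  # shape (C,)
    return gate[:, None, None, None]                     # broadcastable over (T, H, W)

def spatial_attention(x):
    # Average over channels -> one saliency value per spatio-temporal location.
    pooled = x.mean(axis=0, keepdims=True)               # shape (1, T, H, W)
    return sigmoid(pooled)

def residual_attention_block(x):
    # Attention-weighted features plus an identity shortcut, as in residual learning:
    #   out = x + channel_gate * spatial_gate * x
    return x + channel_attention(x) * spatial_attention(x) * x

x = np.random.default_rng(1).standard_normal((8, 4, 6, 6))  # (C, T, H, W)
y = residual_attention_block(x)
print(y.shape)  # (8, 4, 6, 6)
```

The identity shortcut means the block can only amplify or pass through features, never suppress them below the input, which keeps gradients flowing through deep stacks of such blocks.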


