DeepAct: A Deep Neural Network Model for Activity Detection in Untrimmed Videos

  • Received : 2017.05.08
  • Accepted : 2017.09.01
  • Published : 2018.02.28

Abstract

We propose a novel deep neural network model for detecting human activities in untrimmed videos. Human activity detection in a video involves two steps: extracting features that are effective for recognizing human activities in a long untrimmed video, and then detecting the activities from those extracted features. To extract rich features from video segments that capture patterns unique to each activity, we employ two different convolutional neural network models, C3D and I-ResNet. To detect human activities from the resulting sequence of feature vectors, we use BLSTM, a bi-directional recurrent neural network model. Through experiments on ActivityNet 200, a large-scale benchmark dataset, we demonstrate the high performance of the proposed DeepAct model.
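
To make the two-step pipeline concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation. It treats the two CNNs as fixed per-segment feature extractors; the feature dimensions (4096 for C3D, 2048 for I-ResNet), the hidden size, and the extra background class are illustrative assumptions rather than values from the paper.

import torch
import torch.nn as nn

class DeepActSketch(nn.Module):
    """Concatenate per-segment C3D and I-ResNet features, then classify
    each segment with a bi-directional LSTM (BLSTM)."""

    def __init__(self, c3d_dim=4096, resnet_dim=2048, hidden=512, num_classes=200):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=c3d_dim + resnet_dim,
            hidden_size=hidden,
            batch_first=True,
            bidirectional=True,  # reads the segment sequence forward and backward
        )
        # One score per activity class, plus an assumed background class.
        self.head = nn.Linear(2 * hidden, num_classes + 1)

    def forward(self, c3d_feats, resnet_feats):
        # c3d_feats:    (batch, num_segments, c3d_dim)
        # resnet_feats: (batch, num_segments, resnet_dim)
        x = torch.cat([c3d_feats, resnet_feats], dim=-1)
        out, _ = self.blstm(x)  # (batch, num_segments, 2 * hidden)
        return self.head(out)   # per-segment classification scores

# Example: one video split into 32 segments.
model = DeepActSketch()
scores = model(torch.randn(1, 32, 4096), torch.randn(1, 32, 2048))
print(scores.shape)  # torch.Size([1, 32, 201])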

Fig. 1. Example of human activity detection in a video.

Fig. 2. The process of activity detection in video.

Fig. 3. Feature extraction with two different convolutional neural networks.

Fig. 4. Bi-directional LSTM (BLSTM) model.

Fig. 5. Classification score for each activity (a_j) per video segment (t_i).

Fig. 6. Threshold-based activity localization.
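
The localization step in Fig. 6 can be sketched in a few lines: for one activity class, runs of consecutive segments whose score exceeds a threshold are merged into (start, end) intervals. The threshold value and the one-second segment length below are illustrative assumptions, not values from the paper.

def localize(scores, threshold=0.5, segment_sec=1.0):
    """Merge runs of segments with score >= threshold into time intervals."""
    intervals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # an activity interval opens here
        elif s < threshold and start is not None:
            intervals.append((start * segment_sec, i * segment_sec))
            start = None  # the interval closes
    if start is not None:  # the activity runs to the end of the video
        intervals.append((start * segment_sec, len(scores) * segment_sec))
    return intervals

# Per-segment scores for one activity class:
print(localize([0.1, 0.7, 0.8, 0.2, 0.9]))  # [(1.0, 3.0), (4.0, 5.0)]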

Fig. 7. Evaluation of localization performance. (a)-(c) three examples of localization results with two different classification models, LSTM and BLSTM, and (d) one example of localization results with two different feature models, C3D only and C3D+I-ResNet.

Table 1. Comparison of feature models

Table 2. Comparison of classification models

Table 3. Comparison with previous models in terms of activity localization
