http://dx.doi.org/10.3745/JIPS.04.0059

DeepAct: A Deep Neural Network Model for Activity Detection in Untrimmed Videos  

Song, Yeongtaek (Dept. of Computer Science, Kyonggi University)
Kim, Incheol (Dept. of Computer Science, Kyonggi University)
Publication Information
Journal of Information Processing Systems, vol. 14, no. 1, 2018, pp. 150-161
Abstract
We propose DeepAct, a novel deep neural network model for detecting human activities in untrimmed videos. Detecting human activities in a video involves two steps: extracting features that are effective for recognizing activities in a long untrimmed video, and then detecting the activities from those extracted features. To extract rich features from video segments that capture the patterns unique to each activity, we employ two different convolutional neural network models, C3D and I-ResNet. To detect human activities from the resulting sequence of feature vectors, we use BLSTM, a bi-directional recurrent neural network model. Experiments on ActivityNet 200, a large-scale benchmark dataset, demonstrate the high performance of the proposed DeepAct model.
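The pipeline described in the abstract can be summarized in a short sketch: per-segment feature vectors from two CNN backbones are fused and fed to a bi-directional LSTM that scores each segment. The PyTorch sketch below is illustrative only, not the authors' implementation; the concatenation fusion, the feature dimensions (4096-d C3D and 2048-d I-ResNet vectors per segment), the hidden size, and the extra background class are all assumptions.

```python
import torch
import torch.nn as nn

class DeepActSketch(nn.Module):
    """Illustrative sketch of the DeepAct pipeline:
    per-segment CNN features -> BLSTM -> per-segment activity scores."""

    def __init__(self, c3d_dim=4096, resnet_dim=2048,
                 hidden_dim=512, num_classes=200):
        super().__init__()
        # Bi-directional LSTM over the sequence of fused segment features.
        self.blstm = nn.LSTM(input_size=c3d_dim + resnet_dim,
                             hidden_size=hidden_dim,
                             batch_first=True,
                             bidirectional=True)
        # Per-time-step classifier over concatenated forward/backward states;
        # the extra class for background (non-activity) segments is an assumption.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes + 1)

    def forward(self, c3d_feats, resnet_feats):
        # c3d_feats: (batch, T, 4096); resnet_feats: (batch, T, 2048).
        # Fusion by simple concatenation is an assumption for illustration.
        fused = torch.cat([c3d_feats, resnet_feats], dim=-1)
        outputs, _ = self.blstm(fused)      # (batch, T, 2 * hidden_dim)
        return self.classifier(outputs)     # (batch, T, num_classes + 1)

# Example: one untrimmed video split into 32 segments.
model = DeepActSketch()
scores = model(torch.randn(1, 32, 4096), torch.randn(1, 32, 2048))
print(scores.shape)  # torch.Size([1, 32, 201])
```

Contiguous runs of segments assigned the same activity class would then be merged into temporal detections; how DeepAct post-processes the per-segment scores is not specified in this abstract.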
Keywords
Activity Detection; Bi-directional LSTM; Deep Neural Networks; Untrimmed Video;
References
1 G. Singh and F. Cuzzolin, "Untrimmed video classification for activity detection: submission to ActivityNet challenge," 2016 [Online]. Available: https://arxiv.org/abs/1607.01979.
2 S. Karaman, L. Seidenari, and A. Del Bimbo, "Fast saliency based pooling of Fisher encoded dense trajectories," in Proceedings of European Conference on Computer Vision (ECCV) Workshop, Zurich, Switzerland, 2014, pp. 1-4.
3 Z. Shou, D. Wang, and S. F. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 1049-1058.
4 L. Wang, Y. Qiao, and X. Tang, "Action recognition and detection by combining motion and appearance features," in Proceedings of European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014, pp. 1-6.
5 D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
6 V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, "DAPs: deep action proposals for action understanding," in Proceedings of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016, pp. 768-784.
7 A. Montes, A. Salvador, S. Pascual, and X. Giro-i-Nieto, "Temporal activity detection in untrimmed videos with recurrent neural networks," 2017 [Online]. Available: https://arxiv.org/abs/1608.08128.
8 K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
9 L. Wang, Y. Qiao, and X. Tang, "Video action detection with relational dynamic-poselets," in Proceedings of European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014, pp. 565-580.
10 H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 2013, pp. 3551-3558.
11 K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Montreal, Canada, 2014, pp. 568-576.
12 M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
13 F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: a large-scale video benchmark for human activity understanding," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 961-970.
14 S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
15 J. Zheng, Z. Jiang, and R. Chellappa, "Cross-view action recognition via transferable dictionary learning," IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2542-2556, 2016.
16 L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: towards good practices for deep action recognition," in Proceedings of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016, pp. 20-36.
17 R. Wang and D. Tao, "UTS at ActivityNet 2016," in ActivityNet Large Scale Activity Recognition Challenge Workshop, Las Vegas, NV, 2016, pp. 1-6.
18 Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. F. Chang, "CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 5734-5743.
19 Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang, "A pursuit of temporal accuracy in general activity detection," 2017 [Online]. Available: https://arxiv.org/abs/1703.02716.