http://dx.doi.org/10.3837/tiis.2021.10.011

Two-Stream Convolutional Neural Network for Video Action Recognition  

Qiao, Han (School of Computer, South China Normal University)
Liu, Shuang (School of Computer, South China Normal University)
Xu, Qingzhen (School of Computer, South China Normal University)
Liu, Shouqiang (School of Artificial Intelligence, Faculty of Engineering, South China Normal University)
Yang, Wanggan (Nelson Mandela College of Government and Social Sciences, Southern University and Agricultural & Mechanical College)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / vol. 15, no. 10, 2021, pp. 3668-3684
Abstract
Video action recognition is widely used in video surveillance, behavior detection, human-computer interaction, medically assisted diagnosis, and motion analysis. However, video action recognition can be disturbed by many factors, such as background and illumination. A two-stream convolutional neural network trains separate spatial and temporal models on the video and fuses them at the output end. The multi-segment two-stream convolutional neural network model trains on temporal and spatial information from the video, extracts features from each stream, fuses them, and then determines the category of the video action. This paper adopts Google's Xception model together with transfer learning, using an Xception model trained on ImageNet as the initial weights. This largely overcomes the model underfitting caused by insufficient video behavior data, effectively reduces the influence of various disturbing factors in the video, improves accuracy, and reduces training time. Furthermore, to make up for the shortage of data, the Kinetics-400 dataset was used for pre-training, which greatly improved the accuracy of the model. Through this applied research, the expected goal is basically achieved, and the design of the original two-stream model is improved.
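The multi-segment, late-fusion scheme described above can be sketched in a few lines: each stream scores every sampled segment, each stream's scores are averaged into a segmental consensus, and the two consensus vectors are combined by a weighted sum. This is a minimal illustrative sketch with made-up scores, not the paper's implementation; in the actual model the per-segment scores would come from two Xception-based networks (one fed RGB frames, one fed optical flow), and the fusion weight is an assumption.

```python
import numpy as np

# Hypothetical per-segment class scores (3 segments, 4 action classes).
# In the paper's setup these would be produced by the spatial stream
# (RGB frames) and the temporal stream (stacked optical flow).
spatial_scores = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.6, 0.2, 0.1, 0.1],
                           [0.5, 0.3, 0.1, 0.1]])
temporal_scores = np.array([[0.2, 0.6, 0.1, 0.1],
                            [0.1, 0.7, 0.1, 0.1],
                            [0.2, 0.5, 0.2, 0.1]])

def fuse_multi_segment(spatial, temporal, spatial_weight=0.5):
    """Average each stream's scores over segments, then fuse by weighted sum."""
    spatial_consensus = spatial.mean(axis=0)    # segmental consensus, spatial stream
    temporal_consensus = temporal.mean(axis=0)  # segmental consensus, temporal stream
    return spatial_weight * spatial_consensus + (1 - spatial_weight) * temporal_consensus

fused = fuse_multi_segment(spatial_scores, temporal_scores)
predicted_class = int(np.argmax(fused))  # index of the recognized action class
```

With these example scores, the temporal stream's consistent preference for class 1 outweighs the spatial stream's preference for class 0 after averaging, so the fused prediction is class 1.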
Keywords
video action recognition; multi-segment; two-stream convolutional neural network; transfer learning; pre-training