http://dx.doi.org/10.3837/tiis.2019.07.015

Video Representation via Fusion of Static and Motion Features Applied to Human Activity Recognition  

Arif, Sheeraz (School of Information and Electronics, Beijing Institute of Technology)
Wang, Jing (School of Information and Electronics, Beijing Institute of Technology)
Fei, Zesong (School of Information and Electronics, Beijing Institute of Technology)
Hussain, Fida (School of Electrical and Information Engineering, Jiangsu University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS), vol. 13, no. 7, 2019, pp. 3599-3619
Abstract
In human activity recognition systems, both static and motion information play a crucial role in achieving efficient and competitive results. Most existing methods extract video features insufficiently and do not investigate the relative contribution of the static and motion components. Our work highlights this problem and proposes a Static-Motion Fused Features Descriptor (SMFD), which intelligently leverages both static and motion features in the form of a descriptor. First, static features are learned by a two-stream 3D convolutional neural network. Second, trajectories are extracted by tracking key points, and only trajectories located in the central region of the original video frame are selected, in order to reduce irrelevant background trajectories as well as computational complexity. Then, shape and motion descriptors are obtained along with key points by using SIFT flow. Next, a Cholesky transformation is introduced to fuse the static and motion feature vectors and guarantee the equal contribution of all descriptors. Finally, a Long Short-Term Memory (LSTM) network is utilized to discover long-term temporal dependencies and make the final prediction. To confirm the effectiveness of the proposed approach, extensive experiments have been conducted on three well-known datasets: UCF101, HMDB51, and YouTube. The findings show that the resulting recognition system is on par with state-of-the-art methods.
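The Cholesky-based fusion is the step most readers will want to see concretely. The abstract does not spell out the formulation, so below is a minimal Python sketch of the standard Cholesky-transformation trick for combining two normalized feature vectors with a chosen correlation; the function name, the rho parameter, and the assumption that both descriptors share the same dimensionality are illustrative, not taken from the paper.

import numpy as np

def cholesky_fuse(static_feat, motion_feat, rho=np.sqrt(0.5)):
    # Hypothetical helper (an assumption, not the paper's exact method):
    # fuse two descriptors via the Cholesky factor of a 2x2 correlation matrix.
    # L2-normalize both vectors so neither stream dominates by scale.
    x = static_feat / (np.linalg.norm(static_feat) + 1e-12)
    y = motion_feat / (np.linalg.norm(motion_feat) + 1e-12)
    # The Cholesky factor of [[1, rho], [rho, 1]] is [[1, 0], [rho, sqrt(1 - rho^2)]];
    # its second row gives the mixing weights for the fused vector.
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    return L[1, 0] * x + L[1, 1] * y

# With rho = sqrt(0.5), both mixing weights equal sqrt(0.5), so the static
# and motion streams contribute equally, matching the "equal contribution"
# claim in the abstract. Both inputs must share the same dimensionality.
static_feat = np.random.randn(4096)   # e.g. a 3D-CNN static feature vector
motion_feat = np.random.randn(4096)   # e.g. a trajectory/SIFT-flow descriptor
fused = cholesky_fuse(static_feat, motion_feat)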
Keywords
Activity recognition; static features; motion features; trajectories; CNN; LSTM