http://dx.doi.org/10.4218/etrij.2019-0510

Video augmentation technique for human action recognition using genetic algorithm  

Nida, Nudrat (Department of Computer Engineering, UET)
Yousaf, Muhammad Haroon (Department of Computer Engineering, UET)
Irtaza, Aun (Department of Computer Science, UET)
Velastin, Sergio A. (Applied Artificial Intelligence Research Group, Department of Computer Science and Engineering, University Carlos III de Madrid)
Publication Information
ETRI Journal / v.44, no.2, 2022, pp. 327-338
Abstract
Classification models for human action recognition require robust features and large training sets for good generalization. Data augmentation methods are therefore employed on imbalanced training sets to achieve higher accuracy. However, the samples generated by conventional augmentation only reflect existing samples within the training set; their feature representations are less diverse and hence contribute to less precise classification. This paper presents new data augmentation and action representation approaches to grow training sets. The proposed approach is based on two fundamental concepts: virtual video generation for augmentation and representation of the action videos through robust features. Virtual videos are generated from the motion history templates of action videos and convolved using a convolutional neural network to generate deep features. Furthermore, guided by the objective function of a genetic algorithm, the spatiotemporal features of different samples are combined to generate the representations of the virtual videos, which are then classified through an extreme learning machine classifier on the MuHAVi-Uncut, IXMAS, and IAVID-1 datasets.
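To make the augmentation idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation): a genetic algorithm evolves convex mixing weights over the deep features of same-class videos to synthesize "virtual" feature samples. The fitness used here, rewarding candidates that stay close to the class centroid while remaining distinct from existing samples, is an assumed stand-in for the paper's objective function; the CNN feature extractor and the extreme learning machine classifier are omitted.

```python
# Hedged sketch of GA-based feature augmentation (assumptions noted above);
# it is not the paper's exact objective, backbone, or classifier.
import numpy as np

rng = np.random.default_rng(0)

def synthesize_virtual_features(class_feats, n_virtual=10, pop_size=30,
                                generations=50, mutation_std=0.05):
    """class_feats: (n_samples, dim) deep features of one action class."""
    n, _ = class_feats.shape
    centroid = class_feats.mean(axis=0)

    def decode(weights):
        w = np.abs(weights)
        w = w / (w.sum() + 1e-12)            # convex combination weights
        return w @ class_feats               # candidate virtual feature

    def fitness(weights):
        v = decode(weights)
        closeness = -np.linalg.norm(v - centroid)                    # class consistency
        diversity = np.min(np.linalg.norm(class_feats - v, axis=1))  # novelty
        return closeness + diversity

    virtual = []
    for _ in range(n_virtual):
        pop = rng.random((pop_size, n))
        for _ in range(generations):
            scores = np.array([fitness(ind) for ind in pop])
            parents = pop[np.argsort(scores)[-pop_size // 2:]]       # selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n)                              # crossover point
                child = np.concatenate([a[:cut], b[cut:]])
                child += rng.normal(0, mutation_std, n)               # mutation
                children.append(child)
            pop = np.vstack([parents, children])
        best = pop[np.argmax([fitness(ind) for ind in pop])]
        virtual.append(decode(best))
    return np.vstack(virtual)

# Example: 12 real samples with 64-D deep features -> 5 virtual samples.
real = rng.normal(size=(12, 64))
print(synthesize_virtual_features(real, n_virtual=5).shape)  # (5, 64)
```

In such a pipeline, the synthesized feature vectors would be appended to the real training features of the corresponding class before training the classifier.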
Keywords
computer vision; evolutionary deep features augmentation; genetic algorithm; human action recognition; video augmentation;