http://dx.doi.org/10.3837/tiis.2019.02.015

Two-Person Interaction Recognition Based on Effective Hybrid Learning

Ahmed, Minhaz Uddin (Department of Computer Engineering, Inha University)
Kim, Yeong Hyeon (Department of Computer Engineering, Inha University)
Kim, Jin Woo (Department of Computer Engineering, Inha University)
Bashar, Md Rezaul (Science, Technology and Management Crest)
Rhee, Phill Kyu (Department of Computer Engineering, Inha University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS), vol. 13, no. 2, pp. 751-770, 2019
Abstract
Action recognition is an essential task in computer vision because of its many prospective applications, such as security surveillance, machine learning, and human-computer interaction. The availability of more video data than ever before, together with the strong performance of deep convolutional neural networks, also makes these networks well suited to action recognition in video. Unfortunately, the limitations of hand-crafted video features and the scarcity of benchmark datasets make multi-person action recognition in video data a challenging task. In this work, we propose a deep convolutional neural network-based Effective Hybrid Learning (EHL) framework for two-person interaction classification in video. Our approach exploits a pre-trained network model (VGG16, from the University of Oxford Visual Geometry Group) and extends Faster R-CNN (a state-of-the-art region-based convolutional neural network detector). We combine a semi-supervised learning method with an active learning method to improve overall performance. Numerous types of two-person interactions exist in the real world, which makes this a challenging task; in our experiments, we consider a limited set of actions (hugging, fighting, linking arms, talking, and kidnapping) in two environments, simple and complex. We show that our trained model, under an active semi-supervised learning architecture, gradually improves in performance. In the simple environment, using the Intelligent Technology Laboratory (ITLab) dataset from Inha University, accuracy reaches 95.6%; in the complex environment, accuracy reaches 81%. Compared to supervised learning methods, our method also reduces data-labeling time on the ITLab dataset. We further conduct extensive experiments on human action recognition benchmarks, such as the UT-Interaction and HMDB51 datasets, and obtain better performance than state-of-the-art approaches.
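
The abstract describes the pipeline only at a high level: an ImageNet-pretrained VGG16 is fine-tuned for the interaction classes, and unlabeled video frames are absorbed through alternating semi-supervised (pseudo-labeling) and active-learning (human-query) rounds. The following minimal PyTorch sketch illustrates that hybrid loop; the confidence thresholds (tau_hi, tau_lo), the frame-level classification granularity, and all helper names are illustrative assumptions, not the authors' published procedure.

```python
# Minimal sketch (not the authors' code) of the EHL idea: a pre-trained VGG16
# classifier refined by combined semi-supervised and active learning.
import torch
import torch.nn.functional as F
import torchvision

def build_model(num_classes: int) -> torch.nn.Module:
    # Transfer learning: reuse VGG16's ImageNet features and replace the
    # final classifier layer with one sized for the interaction classes.
    model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
    model.classifier[6] = torch.nn.Linear(4096, num_classes)
    return model

def split_unlabeled(model, unlabeled_frames, tau_hi=0.95, tau_lo=0.60):
    """One hybrid round: keep confident predictions as pseudo-labels
    (semi-supervised) and queue uncertain frames for a human annotator
    (active learning); frames in between wait for a later round."""
    model.eval()
    pseudo_labeled, to_annotate = [], []
    with torch.no_grad():
        for frame in unlabeled_frames:  # frame: (3, 224, 224) tensor
            probs = F.softmax(model(frame.unsqueeze(0)), dim=1).squeeze(0)
            confidence, predicted_class = probs.max(dim=0)
            if confidence >= tau_hi:
                pseudo_labeled.append((frame, predicted_class.item()))
            elif confidence < tau_lo:
                to_annotate.append(frame)
    return pseudo_labeled, to_annotate
```

In each round, the pseudo-labeled frames would be added to the training set and the queried frames annotated by hand, which is how a scheme of this kind trades a small amount of human effort for the reduction in labeling time that the abstract reports.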
Keywords
Action Recognition; Convolutional Neural Network; Deep Architecture; Transfer Learning
Citations & Related Records
연도 인용수 순위
  • Reference
1 M. Li and I. K. Sethi, "Confidence-Based Active Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251-1261, 2006.   DOI
2 J. Sourati, M. Akcakaya, D. Erdogmus, T. K. Leen, and J. G. Dy, "A Probabilistic Active Learning Algorithm Based on Fisher Information Ratio," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 8, pp. 2023-2029, 2018.   DOI
3 J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, and M. Sedlmair, "Comparing Visual-Interactive Labeling with Active Learning: An Experimental Study," IEEE Trans. Vis. Comput. Graph., vol. 24, no. 1, pp. 298-308, 2018.   DOI
4 S. Hao, J. Lu, P. Zhao, C. Zhang, S. C. H. Hoi, and C. Miao, "Second-Order Online Active Learning and Its Applications," IEEE Trans. Knowl. Data Eng., vol. 30, no. 7, pp. 1338-1351, 2018.   DOI
5 Sinno Jialin Pan, Qiang Yang, "a Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.   DOI
6 G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu, "Text classification without negative examples revisit," IEEE Trans. Knowl. Data Eng., vol. 18, no. 1, pp. 6-20, 2006.   DOI
7 P. Wu and T. G. Dietterich, "Improving SVM accuracy by training on auxiliary data sources," in Proc. of Int. Conf. Mach. Learn., pp. 110-118, 2004.
8 Y. Jie, Y. Qiang, and N. Lionel, "Adaptive Temporal Radio Maps for Indoor Location Estimation," Pervasive Comput. Commun. 2005. PerCom 2005. Third IEEE Int. Conf., vol. 7, no. 7, pp. 85-94, 2005.
9 R. Gonzalez and R. Woods, Digital image processing. 2002.   DOI
10 A. T. Lopes, E. de Aguiar, A. F. De Souza, and T. Oliveira-Santos, "Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order," Pattern Recognit., vol. 61, pp. 610-628, 2017.   DOI
11 S. Prakash Sahoo and S. Ari, "On an algorithm for Human Action Recognition," Expert Syst. Appl., 2018.
12 G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, "R-CNNs for Pose Estimation and Action Detection," arXiv Prepr. arXiv1406.5212, pp. 1-8, 2014.
13 K. N. E. H. Slimani, Y. Benezeth, and F. Souami, "Human interaction recognition based on the co-occurrence of visual words," in Proc. of IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., pp. 461-466, 2014.
14 I. Laptev and C. Schmid, "Long-term Temporal Convolutions for Action Recognition To cite this version : Long-term Temporal Convolutions for Action Recognition," vol. 40, no. 6, pp. 1510-1517, 2015.
15 Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," ISCAS 2010 - 2010 IEEE Int. Symp. Circuits Syst. Nano-Bio Circuit Fabr. Syst., pp. 253-256, 2010.
16 C. Szegedy et al., "Going Deeper with Convolutions," pp. 1-9, 2014.
17 D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proc. of 33rd Annu. Meet. Assoc. Comput. Linguist. -, pp. 189-196, 1995.
18 A. B. Goldberg, "Multi-Manifold Semi-Supervised Learning," pp. 169-176, 2009.
19 U. Ahsan, C. Sun, and I. Essa, "DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks," Computer Vision and Pattern Recognition, 2018.
20 J. Zhang, Y. Han, J. Tang, Q. Hu, and J. Jiang, "Semi-Supervised Image-to-Video Adaptation for Video Action Recognition," IEEE Trans. Cybern., vol. 47, no. 4, pp. 960-973, 2017.   DOI
21 B. Yao and L. Fei-Fei, "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. of IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, pp. 17-24, 2010.
22 S. Jones and L. Shao, "A Multigraph Representation for Improved Unsupervised / Semi-supervised Learning of Human Actions," Cvpr, 2014.
23 T. Zhang, S. Liu, C. Xu, and H. Lu, "Boosted multi-class semi-supervised learning for human action recognition," Pattern Recognit., vol. 44, no. 10-11, pp. 2334-2342, 2011.   DOI
24 N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Proc. of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.
25 I. Laptev, "On space-time interest points," International Journal of Computer Vision, 2005.
26 M. Hasan and A. K. Roy-Chowdhury, "A Continuous Learning Framework for Activity Recognition Using Deep Hybrid Feature Models," Ieee Tmm, vol. 17, no. 11, pp. 1909-1922, 2015.
27 J. K. Ryoo, M. S. and Aggarwal, "Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)," 2010.
28 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
29 K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Vision and Pattern Recognition, pp. 1-14, 2014.
30 O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015.   DOI
31 H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. of IEEE Int. Conf. Comput. Vis., no. November 2011, pp. 2556-2563, 2011.
32 T. and G. H. Tieleman, "Lecture 6.5-rmsprop: Divide the gradient by a running average ofits recent magnitude.," COURSERA Neural Networks Form. Learn., 2012.
33 D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in Proc. of conference paper at the 3rd International Conference for Learning Representations, pp. 1-15, 2014.
34 Y. Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proc. of the 22nd ACM international conference on Multimedia, pp. 675-678, 2014.
35 S. Chetlur et al., "cuDNN: Efficient Primitives for Deep Learning."
36 M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc.of IEEE Int. Conf. Comput. Vis., no. Iccv, pp. 1593-1600, 2009.
37 W. Brendel and S. Todorovic, "Learning spatiotemporal graphs of human activities," in Proc. of IEEE Int. Conf. Comput. Vis., no. Iccv, pp. 778-785, 2011.
38 K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770-778, 2016.
39 C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional Two-Stream Network Fusion for Video Action Recognition," Cvpr, no. i, pp. 1933-1941, 2016.
40 A. Richard, "A BoW-equivalent Recurrent Neural Network for Action Recognition Bag-of-Words Model as Neural Network," Bmvc2015, 2015.
41 S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, 2017.   DOI
42 A. Vedaldi and K. Lenc, "MatConvNet Convolutional Neural Networks for MATLAB," 2016.
43 K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps," Iclr, p. 1-, 2014.
44 G. Gkioxari, U. C. Berkeley, R. Girshick, and U. C. Berkeley, "Contextual Action Recognition with R*CNN," Cvpr, 2015.
45 M. Stikic, K. Van Laerhoven, and B. Schiele, "Exploring semi-supervised and active learning for activity recognition," Wearable Comput. 2008. ISWC 2008. 12th IEEE Int. Symp., pp. 81-88, 2008.
46 B. Settles, Active Learning, vol. 6, no. 1. 2012.
47 H. Wang, A. Klaser, C. Schmid, and C. L. Liu, "Action recognition by dense trajectories," in Proc. of IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 3169-3176, 2011.
48 X. Peng, C. Zou, Y. Qiao, and Q. Peng, "Action Recognition with Stacked Fisher Vectors," Eccv, pp. 581-595, 2014.
49 P. F. Felzenszwalb, R. B. Girshick, D. Mcallester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models."