http://dx.doi.org/10.4218/etrij.2019-0230

Three-stream network with context convolution module for human-object interaction detection  

Siadari, Thomhert S. (ICT Major of ETRI School, University of Science and Technology)
Han, Mikyong (City and Transportation ICT Research Department, Electronics and Telecommunications Research Institute)
Yoon, Hyunjin (ICT Major of ETRI School, University of Science and Technology)
Publication Information
ETRI Journal / v.42, no.2, 2020, pp. 230-238
Abstract
Human-object interaction (HOI) detection is a popular computer vision task that detects interactions between humans and objects. This task is useful in many applications that require a deeper semantic understanding of scenes. Current HOI detection networks typically consist of a feature extractor followed by detection layers comprising small filters (e.g., 1 × 1 or 3 × 3). Although small filters can capture local spatial features with few parameters, their small receptive fields prevent them from capturing the larger context needed to recognize interactions between humans and distant objects. We therefore propose a three-stream HOI detection network that employs a context convolution module (CCM) in each stream. The CCM captures larger contexts from input feature maps by combining large separable convolution layers with residual-based convolution layers; because it uses fewer large separable filters, it does so without increasing the number of parameters. We evaluate our HOI detection method on two benchmark datasets, V-COCO and HICO-DET, and demonstrate its state-of-the-art performance.
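The parameter argument in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that "separable" means a spatially separable pair (a k × 1 convolution followed by a 1 × k convolution), and the kernel size 15, the channel widths 256 and 64, and all function names are hypothetical choices, not values taken from the paper.

```python
def conv_params(kh, kw, c_in, c_out, bias=True):
    """Learnable parameters in one 2D convolution with a kh x kw kernel."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)


def dense_params(k, c_in, c_out):
    """Parameters of a single dense k x k convolution."""
    return conv_params(k, k, c_in, c_out)


def separable_params(k, c_in, c_mid, c_out):
    """Parameters of a k x 1 convolution followed by a 1 x k convolution.

    c_mid is the number of intermediate filters; choosing it smaller than
    c_in is one way to read "fewer large separable filters" (an assumption).
    """
    return conv_params(k, 1, c_in, c_mid) + conv_params(1, k, c_mid, c_out)


def receptive_field(kernel_sizes):
    """Receptive field along one axis for stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


# Hypothetical sizes, chosen only to illustrate the trade-off.
small = dense_params(3, 256, 256)            # one 3 x 3 layer
large = dense_params(15, 256, 256)           # one dense 15 x 15 layer
sep = separable_params(15, 256, 64, 256)     # 15 x 1 then 1 x 15, 64 mid filters

print("3x3 dense:      ", small, "params, receptive field", receptive_field([3]))
print("15x15 dense:    ", large, "params, receptive field", receptive_field([15]))
print("15x15 separable:", sep, "params, receptive field", receptive_field([15, 1]))
```

Under these assumed sizes, the separable pair covers a 15 × 15 region with fewer parameters than a single dense 3 × 3 layer, whereas a dense 15 × 15 layer would need about 25 times as many parameters as the 3 × 3 layer.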
Keywords
context convolution module; deep learning; HOI detection; human-object interactions; three-stream network