http://dx.doi.org/10.3837/tiis.2019.06.023

A Multi-Stage Convolution Machine with Scaling and Dilation for Human Pose Estimation

Nie, Yali (Dept. of Electronics Engineering, Chonbuk National University)
Lee, Jaehwan (Dept. of Electronics Engineering, Chonbuk National University)
Yoon, Sook (Dept. of Computer Engineering, Mokpo National University)
Park, Dong Sun (IT Convergence Research Center, Chonbuk National University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.13, no.6, 2019, pp. 3182-3198

Abstract

Vision-based human pose estimation is considered a challenging research subject because of problems such as confounding background clutter, the diversity of human appearance, and illumination changes in scenes. To tackle these problems, we propose a new multi-stage convolution machine for estimating human pose. To provide better heatmap predictions of body joints, the proposed machine produces a prediction at each stage, with a receptive field large enough to learn long-range spatial relationships. The stages are composed of different modules according to their purposes. A pyramid stacking module and a dilation module handle human poses at multiple scales. Their multi-scale information from different receptive fields is fused by concatenation, which captures more contextual information from different features. In addition, the spatial and channel information of a given input is converted into gating factors by squeezing each feature map to a single numeric value that reflects its importance, so that each network channel receives a different weight. In experiments on the standard LSP and MPII pose benchmarks, the proposed architecture achieved higher accuracy than other ConvNet-based architectures.
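
The abstract's point about a receptive field large enough for long-range spatial relationships can be made concrete with a little arithmetic. The sketch below is not the authors' code. It only applies the standard formula for the effective size of a dilated kernel, k + (k - 1)(d - 1), and accumulates the receptive field over a stack of layers. The function names and the two three-layer configurations are hypothetical and chosen purely for illustration; the paper's actual layer settings are not given in this record.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field (one axis) of a stack of convolution layers.

    layers: list of (kernel_size, dilation, stride) tuples, input to output.
    """
    rf = 1      # a single input pixel sees itself
    jump = 1    # distance in input pixels between adjacent outputs
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, for illustration only (not the paper's configuration).
plain_stack = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]    # three ordinary 3x3 layers
dilated_stack = [(3, 1, 1), (3, 2, 1), (3, 4, 1)]  # dilations 1, 2, 4

print(receptive_field(plain_stack))    # 7
print(receptive_field(dilated_stack))  # 15
```

With stride 1 throughout, the three plain 3x3 layers see a 7x7 window, while dilations of 1, 2 and 4 widen it to 15x15 with the same number of weights. This is the general motivation for using dilation in dense prediction tasks such as heatmap regression.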

Keywords

CNN; Human pose estimation; Multi-stage; Pyramid stacking; Dilation; Gating
