http://dx.doi.org/10.3837/tiis.2021.12.010

Dual Attention Based Image Pyramid Network for Object Detection

Dong, Xiang (Institute of Information Science, Beijing Jiaotong University)
Li, Feng (Institute of Information Science, Beijing Jiaotong University)
Bai, Huihui (Institute of Information Science, Beijing Jiaotong University)
Zhao, Yao (Institute of Information Science, Beijing Jiaotong University)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.15, no.12, 2021, pp. 4439-4455

Abstract

Compared with two-stage object detection algorithms, one-stage algorithms provide a better trade-off between real-time performance and accuracy. However, these methods treat intermediate features equally and therefore lack the flexibility to emphasize information that is meaningful for classification and localization. They also ignore the interaction of contextual information across scales, which is important for detecting medium and small objects. To tackle these problems, we propose an image pyramid network based on a dual attention mechanism (DAIPNet) for one-stage object detection, which builds an image pyramid to enrich spatial information while emphasizing informative multi-scale features through dual attention. Our framework uses a pre-trained backbone as the standard detection network and the designed image pyramid network (IPN) as an auxiliary network that provides complementary information. The dual attention mechanism is composed of the adaptive feature fusion module (AFFM) and the progressive attention fusion module (PAFM). AFFM automatically attends to feature maps of differing importance from the backbone and the auxiliary network, while PAFM adaptively learns channel-attentive information during the context transfer process. Furthermore, in the IPN, we build an image pyramid to extract scale-wise features from downsampled images of different scales; these features are further fused at different stages to enrich scale-wise information and learn more comprehensive feature representations. Experimental results are reported on the MS COCO dataset. With a 300 × 300 input, our proposed detector achieves 32.6% mAP on MS COCO test-dev, superior to the compared state-of-the-art methods.
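As an illustration only, the two ideas named in the abstract (an image pyramid built from downsampled inputs, and channel-wise attention) can be sketched in a few lines. The abstract does not say how the IPN downsamples or how PAFM computes channel weights, so this pure-Python sketch assumes 2×2 average pooling for the pyramid and a parameter-free sigmoid gate on each channel's global average as a stand-in for learned channel attention. The function names `downsample2x`, `build_image_pyramid`, and `channel_gate` are hypothetical and do not come from the paper.

```python
import math


def downsample2x(image):
    """Halve height and width by averaging each 2x2 block (odd edges are cropped)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_image_pyramid(image, levels):
    """Return a list of `levels` images, each half the size of the previous one."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid


def channel_gate(channels):
    """Scale each channel by sigmoid(global average of that channel).

    Assumption: a learned attention module would compute the weight with
    trainable layers; here the gate has no parameters at all.
    """
    gated = []
    for ch in channels:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        weight = 1.0 / (1.0 + math.exp(-mean))
        gated.append([[weight * v for v in row] for row in ch])
    return gated


# Toy 8x8 single-channel "image" with values 0..63 in row-major order.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = build_image_pyramid(image, levels=3)
sizes = [(len(p), len(p[0])) for p in pyramid]  # spatial size of each level
```

In a detector of the kind described, each pyramid level would be passed through a feature extractor and the resulting feature maps fused with backbone features; that fusion step is not sketched here because the abstract gives no detail about it.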

Keywords

Dual attention mechanism; Adaptive feature fusion module; Progressive attention fusion module; Image pyramid network; Multi-scale object detection
