High-Speed Transformer for Panoptic Segmentation

  • Baek, Jong-Hyeon (Department of Computer Engineering, Chungnam National University)
  • Kim, Dae-Hyun (Department of Computer Engineering, Chungnam National University)
  • Lee, Hee-Kyung (Electronics and Telecommunications Research Institute)
  • Choo, Hyon-Gon (Electronics and Telecommunications Research Institute)
  • Koh, Yeong Jun (Department of Computer Engineering, Chungnam National University)
  • Received : 2022.10.17
  • Accepted : 2022.12.08
  • Published : 2022.12.20

Abstract

Recent high-performance panoptic segmentation models are based on transformer architectures. However, transformer-based panoptic segmentation methods are inherently slower than convolution-based methods, since the attention mechanism in the transformer has quadratic complexity with respect to image resolution. In addition, the sine and cosine computations for positional embedding in the transformer are another bottleneck in computation time. To address these problems, we adopt three modules to speed up the inference runtime of transformer-based panoptic segmentation. First, we perform channel-level reduction using depth-wise separable convolution on the inputs of the transformer decoder. Second, we replace sine- and cosine-based positional encoding with convolution operations, called conv-embedding. Third, we apply separable self-attention to the transformer encoder, lowering the quadratic complexity in the number of image pixels to linear complexity. As a result, when all three modules are used, the proposed model achieves 44% higher frames per second than the baseline on the ADE20K panoptic validation dataset.
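
To make the three modules concrete, the sketch below shows one plausible PyTorch-style realization of a depth-wise separable channel reduction, a convolutional positional embedding (conv-embedding), and a separable self-attention whose cost is linear in the number of tokens. The class names, layer sizes, and exact layer arrangements are illustrative assumptions based on the abstract and the cited separable self-attention work (Mehta & Rastegari, 2022), not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableReduction(nn.Module):
    # Channel-level reduction for the decoder inputs: a depth-wise 3x3 convolution
    # followed by a point-wise 1x1 convolution that shrinks the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):                      # x: (B, C_in, H, W)
        return self.pointwise(self.depthwise(x))

class ConvEmbedding(nn.Module):
    # Conv-embedding: a depth-wise convolution whose output is added to the features,
    # replacing the sine/cosine positional encoding (exact design is an assumption).
    def __init__(self, ch):
        super().__init__()
        self.pos = nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.pos(x)

class SeparableSelfAttention(nn.Module):
    # Separable self-attention (after Mehta & Rastegari, 2022): a single context
    # vector replaces the N x N attention map, so the cost grows linearly with the
    # number of tokens N instead of quadratically.
    def __init__(self, dim):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)     # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, d)
        scores = F.softmax(self.to_scores(x), dim=1)                   # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)   # (B, 1, d)
        out = F.relu(self.to_value(x)) * context                       # broadcast over N
        return self.proj(out)

if __name__ == "__main__":
    feat = torch.randn(1, 256, 64, 64)                   # backbone feature map
    feat = DepthwiseSeparableReduction(256, 128)(feat)   # (1, 128, 64, 64)
    feat = ConvEmbedding(128)(feat)                      # add positional information
    tokens = feat.flatten(2).transpose(1, 2)             # (1, 4096, 128) token sequence
    print(SeparableSelfAttention(128)(tokens).shape)     # torch.Size([1, 4096, 128])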

Keywords

Funding

This work was supported partly by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00011, Video Coding for Machine) and partly by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2022R1I1A3069113).

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 30, 2017.
  2. J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "CoCa: Contrastive Captioners are Image-Text Foundation Models," arXiv:2205.01917, 2022. doi: https://doi.org/10.48550/arXiv.2205.01917
  3. Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen, and B. Guo, "Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation," arXiv:2205.14141, 2022. doi: https://doi.org/10.48550/arXiv.2205.14141
  4. W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks," arXiv:2208.10442, 2022. doi: https://doi.org/10.48550/arXiv.2208.10442
  5. F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y. Shum, "Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation," arXiv:2206.02777, 2022. doi: https://doi.org/10.48550/arXiv.2206.02777
  6. H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation," in Proc. ECCV, pp.108-126, 2020. doi: https://doi.org/10.1007/978-3-030-58548-8_7
  7. S. Mehta and M. Rastegari, "Separable Self-attention for Mobile Vision Transformers," arXiv:2206.02680, 2022. doi: https://doi.org/10.48550/arXiv.2206.02680
  8. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861, 2017. doi: https://doi.org/10.48550/arXiv.1704.04861
  9. B. Cheng, A. Schwing, and A. Kirillov, "Per-Pixel Classification is Not All You Need for Semantic Segmentation," in Proc. NIPS, 34, 2021.
  10. Y. Li, G. Yuan, Y. Wen, E. Hu, G. Evangelidis, S. Tulyakov, Y. Wang, and J. Ren, "EfficientFormer: Vision Transformers at MobileNet Speed," arXiv:2206.01191, 2022. doi: https://doi.org/10.48550/arXiv.2206.01191
  11. S. Mehta and M. Rastegari, "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer," arXiv:2110.02178, 2022. doi: https://doi.org/10.48550/arXiv.2110.02178
  12. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in Proc. CVPR, Salt Lake City, USA, pp.4510-4520, 2018. doi: https://doi.org/10.1109/CVPR.2018.00474
  13. W. Zhang, Z. Huang, G. Luo, T. Chen, X. Wang, W. Liu, G. Yu, and C. Shen, "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation," in Proc. CVPR, New Orleans, USA, pp.12083-12093, 2022. doi: https://doi.org/10.1109/CVPR52688.2022.01177
  14. A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, "Panoptic segmentation," in Proc. CVPR, California, USA, pp.9404-9413, 2019. doi: https://doi.org/10.1109/CVPR.2019.00963
  15. B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation," in Proc. CVPR, pp. 12475-12485, 2020. doi: https://doi.org/10.1109/CVPR42600.2020.01249
  16. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers," in Proc. ECCV, pp.213-229, 2020. doi: https://doi.org/10.1007/978-3-030-58452-8_13
  17. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. CVPR, Nevada, USA, pp.770-778, 2016. doi: https://doi.org/10.1109/CVPR.2016.90
  18. I. Loshchilov and F. Hutter, "Decoupled Weight Decay Regularization," in ICLR, 2019.
  19. Y. Wu, G. Zhang, Y. Gao, X. Deng, K. Gong, X. Liang, and L. Lin, "Bidirectional Graph Reasoning Network for Panoptic Segmentation," in Proc. CVPR, pp.9080-9089, 2020. doi: https://doi.org/10.1109/CVPR42600.2020.00910
  20. Y. Wu, G. Zhang, H. Xu, X. Liang, and L. Lin, "Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation," in Proc. NeurIPS, 2020.
  21. B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-Attention Mask Transformer for Universal Image Segmentation," in Proc. CVPR, New Orleans, USA, pp.1290-1299, 2022. doi: https://doi.org/10.1109/CVPR52688.2022.00135