High-Speed Transformer for Panoptic Segmentation

  • Baek, Jong-Hyeon (Department of Computer Engineering, Chungnam National University)
  • Kim, Dae-Hyun (Department of Computer Engineering, Chungnam National University)
  • Lee, Hee-Kyung (Electronics and Telecommunications Research Institute)
  • Choo, Hyon-Gon (Electronics and Telecommunications Research Institute)
  • Koh, Yeong Jun (Department of Computer Engineering, Chungnam National University)
  • Received : 2022.10.17
  • Accepted : 2022.12.08
  • Published : 2022.12.20

Abstract

Recent high-performance panoptic segmentation models are based on transformer architectures. However, transformer-based panoptic segmentation methods are inherently slower than convolution-based methods, since the attention mechanism in the transformer has quadratic complexity with respect to image resolution. In addition, the sine and cosine computations for positional embedding in the transformer are another bottleneck in computation time. To address these problems, we adopt three modules to speed up the inference runtime of transformer-based panoptic segmentation. First, we perform channel-level reduction using depth-wise separable convolution on the inputs of the transformer decoder. Second, we replace sine- and cosine-based positional encoding with convolution operations, called conv-embedding. Third, we apply separable self-attention to the transformer encoder, lowering the quadratic complexity in the number of image pixels to linear complexity. As a result, when all three modules are used, the proposed model achieves 44% higher frames per second than the baseline on the ADE20K panoptic validation dataset.
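
To make the three modules concrete, the sketch below shows one plausible PyTorch-style realization of a depth-wise separable channel reduction, a convolutional positional embedding (conv-embedding), and a separable self-attention whose cost is linear in the number of tokens. The class names, layer sizes, and exact layer arrangements are illustrative assumptions based on the abstract and the cited separable self-attention work (Mehta & Rastegari, 2022), not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableReduction(nn.Module):
    # Channel-level reduction for the decoder inputs: a depth-wise 3x3 convolution
    # followed by a point-wise 1x1 convolution that shrinks the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):                      # x: (B, C_in, H, W)
        return self.pointwise(self.depthwise(x))

class ConvEmbedding(nn.Module):
    # Conv-embedding: a depth-wise convolution whose output is added to the features,
    # replacing the sine/cosine positional encoding (exact design is an assumption).
    def __init__(self, ch):
        super().__init__()
        self.pos = nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.pos(x)

class SeparableSelfAttention(nn.Module):
    # Separable self-attention (after Mehta & Rastegari, 2022): a single context
    # vector replaces the N x N attention map, so the cost grows linearly with the
    # number of tokens N instead of quadratically.
    def __init__(self, dim):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)     # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, d)
        scores = F.softmax(self.to_scores(x), dim=1)                   # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)   # (B, 1, d)
        out = F.relu(self.to_value(x)) * context                       # broadcast over N
        return self.proj(out)

if __name__ == "__main__":
    feat = torch.randn(1, 256, 64, 64)                   # backbone feature map
    feat = DepthwiseSeparableReduction(256, 128)(feat)   # (1, 128, 64, 64)
    feat = ConvEmbedding(128)(feat)                      # add positional information
    tokens = feat.flatten(2).transpose(1, 2)             # (1, 4096, 128) token sequence
    print(SeparableSelfAttention(128)(tokens).shape)     # torch.Size([1, 4096, 128])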

Keywords

Funding

This work was supported partly by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00011, Video Coding for Machine) and partly by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2022R1I1A3069113).

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 30, 2017.
  2. J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "CoCa: Contrastive Captioners are Image-Text Foundation Models," arXiv:2205.01917, 2022. doi: https://doi.org/10.48550/arXiv.2205.01917
  3. Y. Wei, H. Hu, Z. Xie, Z. Zhang, Y. Cao, J. Bao, D. Chen, and B. Guo, "Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation," arXiv:2205.14141, 2022. doi: https://doi.org/10.48550/arXiv.2205.14141
  4. W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som, and F. Wei, "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks," arXiv:2208.10442, 2022. doi: https://doi.org/10.48550/arXiv.2208.10442
  5. F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y. Shum, "Mask DINO: Towards a Unified Transformer-based Framework for Object Detection and Segmentation," arXiv:2206.02777, 2022. doi: https://doi.org/10.48550/arXiv.2206.02777
  6. H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation," in Proc. ECCV, pp.108-126, 2020. doi: https://doi.org/10.1007/978-3-030-58548-8_7
  7. S. Mehta and M. Rastegari, "Separable Self-attention for Mobile Vision Transformers," arXiv:2206.02680, 2022. doi: https://doi.org/10.48550/arXiv.2206.02680
  8. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861, 2017. doi: https://doi.org/10.48550/arXiv.1704.04861
  9. B. Cheng, A. Schwing, and A. Kirillov, "Per-Pixel Classification is Not All You Need for Semantic Segmentation," in Proc. NIPS, 34, 2021.
  10. Y. Li, G. Yuan, Y. Wen, E. Hu, G. Evangelidis, S. Tulyakov, Y. Wang, and J. Ren, "EfficientFormer: Vision Transformers at MobileNet Speed," arXiv:2206.01191, 2022. doi: https://doi.org/10.48550/arXiv.2206.01191
  11. S. Mehta and M. Rastegari, "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer," arXiv:2110.02178, 2022. doi: https://doi.org/10.48550/arXiv.2110.02178
  12. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in Proc. CVPR, Salt Lake City, USA, pp.4510-4520, 2018. doi: https://doi.org/10.1109/CVPR.2018.00474
  13. W. Zhang, Z. Huang, G. Luo, T. Chen, X. Wang, W. Liu, G. Yu, and C. Shen, "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation," in Proc. CVPR, New Orleans, USA, pp.12083-12093, 2022. doi: https://doi.org/10.1109/CVPR52688.2022.01177
  14. A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, "Panoptic segmentation," in Proc. CVPR, California, USA, pp.9404-9413, 2019. doi: https://doi.org/10.1109/CVPR.2019.00963
  15. B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation," in Proc. CVPR, pp. 12475-12485, 2020. doi: https://doi.org/10.1109/CVPR42600.2020.01249
  16. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers," in Proc. ECCV, pp.213-229, 2020. doi: https://doi.org/10.1007/978-3-030-58452-8_13
  17. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. CVPR, Nevada, USA, pp.770-778, 2016. doi: https://doi.org/10.1109/CVPR.2016.90
  18. I. Loshchilov and F. Hutter, "Decoupled Weight Decay Regularization," in ICLR, 2019.
  19. Y. Wu, G. Zhang, Y. Gao, X. Deng, K. Gong, X. Liang, and L. Lin, "Bidirectional Graph Reasoning Network for Panoptic Segmentation," in Proc. CVPR, pp.9080-9089, 2020. doi: https://doi.org/10.1109/CVPR42600.2020.00910
  20. Y. Wu, G. Zhang, H. Xu, X. Liang, and L. Lin, "Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation," in Proc. NeurIPS, 2020.
  21. B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-Attention Mask Transformer for Universal Image Segmentation," in Proc. CVPR, New Orleans, USA, pp.1290-1299, 2022. doi: https://doi.org/10.1109/CVPR52688.2022.00135