Acknowledgement
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) in 2022 (No. 2021-0-00994, Integrated Platform for Education and Development of Sustainable and Robust Autonomous Driving AI, and No. RS-2022-00167194, Trustworthy AI for Mission-Critical Systems).
References
- S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Proceedings of Advances in Neural Information Processing Systems, 2015.
- J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
- K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, D. Tao, "A Survey on Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 1, pp. 73-86, 2023.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," Proceedings of International Conference on Learning Representations, 2021.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, "End-to-End Object Detection with Transformers," Proceedings of European Conference on Computer Vision, pp. 213-229, 2020.
- Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, "You Only Look at One Sequence: Rethinking Transformer in Vision Through Object Detection," Proceedings of Advances in Neural Information Processing Systems, Vol. 34, pp. 26183-26197, 2021.
- X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, "Deformable DETR: Deformable Transformers for End-to-End Object Detection," Proceedings of International Conference on Learning Representations, 2021.
- B. Roh, J. W. Shin, W. Shin, S. Kim, "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity," Proceedings of International Conference on Learning Representations, 2022.
- D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, J. Wang, "Conditional DETR for Fast Training Convergence," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3651-3660, 2021.
- T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, "Microsoft COCO: Common Objects in Context," Proceedings of European Conference on Computer Vision, pp. 740-750, 2014.
- M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, Vol. 88, No. 2, pp. 303-338, 2010. https://doi.org/10.1007/s11263-009-0275-4
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, "Attention Is All You Need," Proceedings of Advances in Neural Information Processing Systems, pp. 6000-6010, 2017.
- J. Devlin, M. W. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 1, pp. 4171-4186, 2019.
- D. W. Otter, J. R. Medina, J. K. Kalita, "A Survey of the Usages of Deep Learning for Natural Language Processing," IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 2, pp. 604-624, 2020.
- K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
- P. Dollar, M. Singh, R. Girshick, "Fast and Accurate Model Scaling," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 924-932, 2021.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows," Proceedings of IEEE International Conference on Computer Vision, pp. 10012-10022, 2021.
- T. Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, "Focal Loss for Dense Object Detection," Proceedings of IEEE International Conference on Computer Vision, pp. 2980-2988, 2017.