http://dx.doi.org/10.14372/IEMEK.2022.17.6.319

A Survey on Vision Transformers for Object Detection Task  

Jungmin Ha (Kookmin University)
Hyunjong Lee (Kookmin University)
Jungmin Eom (Kookmin University)
Jaekoo Lee (Kookmin University)
Abstract
The Transformer is among the most prominent deep learning architectures; it has achieved great success in natural language processing and has also shown strong performance in computer vision. In this survey, we categorize transformer-based models for computer vision, particularly for the object detection task, and perform comprehensive comparative experiments to understand the characteristics of each model. We then evaluate the models, subdivided into standard transformers, transformers with key-point attention, and transformers that add attention with coordinates, by comparing their object detection accuracy and real-time performance. For this comparison we use two metrics: frames per second (FPS) and mean average precision (mAP). Finally, through various experiments, we identify trends and relationships between detection accuracy and real-time performance across several transformer models.
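The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's evaluation code: `iou` and `measure_fps` are hypothetical helper names, a detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold (0.5 for PASCAL VOC; averaged over 0.5 to 0.95 for COCO-style mAP), and full mAP additionally averages precision over recall levels and classes.

```python
import time

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format.

    IoU underlies mAP: a predicted box matches a ground-truth box
    only when their IoU exceeds the evaluation threshold.
    """
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def measure_fps(model_fn, images):
    """Average frames per second of a detector over a list of images."""
    start = time.perf_counter()
    for img in images:
        model_fn(img)  # one forward pass per image
    elapsed = time.perf_counter() - start
    return len(images) / elapsed
```

For example, two identical boxes give an IoU of 1.0, while boxes sharing half their width give 1/3 (intersection 50, union 150).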
Keywords
Object Detection; Transformer; Inductive Bias; Computer Vision; Deep Learning
References
1 S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks," Proceedings of Advances in Neural Information Processing Systems, 2015.
2 J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You Only Look Once: Unified, Real-time Object Detection," Proceedings of Computer Vision and Pattern Recognition, pp. 779-788, 2016.
3 K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, D. Tao, "A Survey on Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 1, pp. 73-86, 2023.
4 A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," Proceedings of International Conference on Learning Representations, 2021.
5 N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, "End-to-end Object Detection with Transformers," Proceedings of European Conference on Computer Vision, pp. 213-229, 2020.
6 Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, "You Only Look at One Sequence: Rethinking Transformer in Vision Through Object Detection," Proceedings of Advances in Neural Information Processing Systems, Vol. 34, pp. 26183-26197, 2021.
7 X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, "Deformable DETR: Deformable Transformers for End-to-end Object Detection," Proceedings of International Conference on Learning Representations, 2021.
8 B. Roh, J. W. Shin, W. Shin, S. Kim, "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity," Proceedings of International Conference on Learning Representations, 2022.
9 D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, J. Wang, "Conditional DETR for Fast Training Convergence," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3651-3660, 2021.
10 T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, "Microsoft COCO: Common Objects in Context," Proceedings of European Conference on Computer Vision, pp. 740-755, 2014.
11 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, Vol. 88, No. 2, pp. 303-338, 2010.
12 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, "Attention is all you Need," Advances in Neural Information Processing Systems, pp. 6000-6010, 2017.
13 J. Devlin, M. W. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 1, pp. 4171-4186, 2019.
14 D. W. Otter, J. R. Medina, J. K. Kalita, "A Survey of the Usages of Deep Learning for Natural Language Processing," IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 2, pp. 604-624, 2020.
15 K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
16 P. Dollar, M. Singh, R. Girshick, "Fast and Accurate Model Scaling," Proceedings of Computer Vision and Pattern Recognition, pp. 924-932, 2021.
17 T. Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, "Focal Loss for Dense Object Detection," Proceedings of IEEE International Conference on Computer Vision, pp. 2980-2988, 2017.
18 Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows," Proceedings of International Conference on Computer Vision, pp. 10012-10022, 2021.