
A Dual-Structured Self-Attention for Improving the Performance of Vision Transformers


  • Received : 2023.09.04
  • Accepted : 2023.09.15
  • Published : 2023.09.30

Abstract

In this paper, we propose a dual-structured self-attention method that compensates for the weak local feature extraction of the vision transformer's self-attention. Vision transformers are more computationally efficient than convolutional neural networks in object classification, object segmentation, and video recognition, but they are comparatively weak at extracting local features. To address this problem, many studies rely on window or shifted-window attention, but these methods weaken the advantages of self-attention-based transformers because their multi-stage encoders increase computational complexity. This paper proposes a dual-structured self-attention that combines global self-attention with a neighborhood network to improve the locality inductive bias over existing methods. The neighborhood network, which extracts local context information, has much lower computational complexity than the window structure. CIFAR-10 and CIFAR-100 were used to compare the proposed dual-structured self-attention transformer with the existing transformer, and the experiments showed Top-1 accuracy improvements of 0.63% and 1.57%, respectively.
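To make the dual-branch idea concrete, the sketch below pairs a global multi-head self-attention branch with a simple local-context branch over the patch-token grid and fuses the two by addition. The abstract does not specify the neighborhood network's internals, so the depthwise 3x3 convolution used as the local branch, the additive fusion, and all names and sizes (DualSelfAttentionBlock, grid_size, dim) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualSelfAttentionBlock(nn.Module):
    """Illustrative dual-structured token mixer: a global multi-head
    self-attention branch in parallel with a lightweight local
    "neighborhood" branch (here a depthwise 3x3 convolution over the
    token grid). The additive fusion and all layer sizes are assumptions,
    not the paper's exact design."""

    def __init__(self, dim: int, num_heads: int = 4, grid_size: int = 8):
        super().__init__()
        self.grid_size = grid_size  # tokens per side (e.g., an 8x8 patch grid)
        self.norm = nn.LayerNorm(dim)
        # Global branch: standard multi-head self-attention over all tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local branch: depthwise conv aggregates each token's 3x3 neighborhood.
        self.neighborhood = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) with num_tokens == grid_size ** 2
        b, n, d = x.shape
        h = self.norm(x)

        # Global self-attention branch.
        global_out, _ = self.attn(h, h, h, need_weights=False)

        # Neighborhood branch: reshape tokens back to a 2-D grid, convolve, flatten.
        grid = h.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        local_out = self.neighborhood(grid).flatten(2).transpose(1, 2)

        # Fuse the two branches and add the residual connection.
        return x + global_out + local_out


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 128)  # 2 images, 8x8 patch tokens, embedding dim 128
    block = DualSelfAttentionBlock(dim=128, num_heads=4, grid_size=8)
    print(block(tokens).shape)        # torch.Size([2, 64, 128])
```

Unlike window-based schemes, the local branch here needs no token partitioning or shifting, which is one way a neighborhood-style aggregation can keep the added complexity low.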


Keywords

Acknowledgement

This work was supported by Seokyeong University in 2022 and 2023.
