http://dx.doi.org/10.5909/JBE.2022.27.4.538

Window Attention Module Based Transformer for Image Classification  

Kim, Sanghoon (Department of Electrical and Electronics Engineering, Konkuk University)
Kim, Wonjun (Department of Electrical and Electronics Engineering, Konkuk University)
Publication Information
Journal of Broadcast Engineering, vol. 27, no. 4, 2022, pp. 538-547
Abstract
Recently introduced Transformer-based image classification methods show remarkable performance improvements over conventional neural network-based methods. To capture local features effectively, much research has explored applying the Transformer to images partitioned into multiple window regions; however, such designs still learn the relationships between windows insufficiently. In this paper, we propose a Transformer architecture that overcomes this limitation by reflecting inter-window relationships during learning. After self-attention is computed within each window, the proposed method estimates the importance of each window region through a compression step and a fully connected layer. The estimated importance values, which serve as learned weights encoding the relationships among windows, are then used to scale the corresponding window features and re-calibrate them. Experimental results show that the proposed method effectively improves the performance of existing Transformer-based methods.
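To make the described mechanism concrete, the following is a minimal PyTorch sketch of the general idea: window-wise self-attention followed by squeeze-and-excitation-style re-calibration across windows. The class name WindowAttentionModule, the mean-pooling squeeze, the reduction ratio, and all dimensions are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class WindowAttentionModule(nn.Module):
    """Illustrative sketch (not the paper's code): self-attention within
    each window, then each window is squeezed to one descriptor, passed
    through a fully connected bottleneck, and rescaled by the resulting
    importance weight (cf. Squeeze-and-Excitation)."""

    def __init__(self, dim, num_windows, num_heads=4, reduction=4):
        super().__init__()
        # self-attention applied independently inside each window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # fully connected layers operating across the window dimension
        self.fc = nn.Sequential(
            nn.Linear(num_windows, num_windows // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_windows // reduction, num_windows),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, num_windows, tokens_per_window, dim)
        b, w, t, d = x.shape
        tokens = x.reshape(b * w, t, d)
        tokens, _ = self.attn(tokens, tokens, tokens)  # window-wise self-attention
        tokens = tokens.reshape(b, w, t, d)
        # squeeze: one scalar descriptor per window (mean over tokens and channels)
        desc = tokens.mean(dim=(2, 3))                 # (b, num_windows)
        weights = self.fc(desc)                        # learned per-window importance
        # re-calibrate: scale every window's features by its importance
        return tokens * weights[:, :, None, None]

# Usage with assumed sizes: 2 images, 16 windows of 7x7 tokens, channel dim 96
x = torch.randn(2, 16, 49, 96)
out = WindowAttentionModule(dim=96, num_windows=16)(x)
```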
Keywords
Image classification; Transformer; Self-attention; Window-attention;
Citations & Related Records
연도 인용수 순위