http://dx.doi.org/10.5909/JBE.2022.27.4.538

Window Attention Module Based Transformer for Image Classification  

Kim, Sanghoon (Department of Electrical and Electronics Engineering, Konkuk University)
Kim, Wonjun (Department of Electrical and Electronics Engineering, Konkuk University)
Publication Information
Journal of Broadcast Engineering, vol. 27, no. 4, 2022, pp. 538-547
Abstract
Recently introduced Transformer-based image classification methods show remarkable performance improvements over conventional neural network-based methods. To capture local features effectively, much research has explored applying the Transformer to images partitioned into multiple window regions; however, such designs still learn the relationships between windows insufficiently. In this paper, we propose a Transformer architecture that overcomes this limitation by reflecting inter-window relationships during learning. After self-attention is computed within each window, the proposed method estimates the importance of each window region through a compression step and a fully connected layer. The estimated importance values, which serve as learned weights encoding the relationships among windows, are then used to scale the corresponding window features and re-calibrate them. Experimental results show that the proposed method effectively improves the performance of existing Transformer-based methods.
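To make the described mechanism concrete, the following is a minimal PyTorch sketch of the general idea: window-wise self-attention followed by squeeze-and-excitation-style re-calibration across windows. The class name WindowAttentionModule, the mean-pooling squeeze, the reduction ratio, and all dimensions are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class WindowAttentionModule(nn.Module):
    """Illustrative sketch (not the paper's code): self-attention within
    each window, then each window is squeezed to one descriptor, passed
    through a fully connected bottleneck, and rescaled by the resulting
    importance weight (cf. Squeeze-and-Excitation)."""

    def __init__(self, dim, num_windows, num_heads=4, reduction=4):
        super().__init__()
        # self-attention applied independently inside each window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # fully connected layers operating across the window dimension
        self.fc = nn.Sequential(
            nn.Linear(num_windows, num_windows // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_windows // reduction, num_windows),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, num_windows, tokens_per_window, dim)
        b, w, t, d = x.shape
        tokens = x.reshape(b * w, t, d)
        tokens, _ = self.attn(tokens, tokens, tokens)  # window-wise self-attention
        tokens = tokens.reshape(b, w, t, d)
        # squeeze: one scalar descriptor per window (mean over tokens and channels)
        desc = tokens.mean(dim=(2, 3))                 # (b, num_windows)
        weights = self.fc(desc)                        # learned per-window importance
        # re-calibrate: scale every window's features by its importance
        return tokens * weights[:, :, None, None]

# Usage with assumed sizes: 2 images, 16 windows of 7x7 tokens, channel dim 96
x = torch.randn(2, 16, 49, 96)
out = WindowAttentionModule(dim=96, num_windows=16)(x)
```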
Keywords
Image classification; Transformer; Self-attention; Window-attention;
Citations & Related Records
연도 인용수 순위