
An Efficient Matrix Multiplier Available in Multi-Head Attention and Feed-Forward Network of Transformer Algorithms


  • Seok-Woo Chang (Dept. of Semiconductor Systems Engineering, Sejong University)
  • Dong-Sun Kim (Dept. of Semiconductor Systems Engineering, Sejong University)
  • Received: 2024.02.27
  • Accepted: 2024.03.26
  • Published: 2024.03.31

Abstract

With the advancement of natural language processing (NLP) models, conversational AI such as ChatGPT has become increasingly popular. Implementing the Transformer algorithm, which underlies the latest NLP models, in hardware is therefore important for improving processing speed and reducing power consumption. In particular, the multi-head attention and the feed-forward network, which analyze the relationships between the words of a sentence through matrix multiplication, are the most computationally intensive core components of the Transformer. In this paper, we propose a variable systolic array whose latency adapts to the number of input words, improving matrix multiplication speed. In addition, quantization is applied in a form that preserves the Transformer's accuracy, increasing memory efficiency and computation speed. For evaluation, we verify the clock cycles required by the multi-head attention and feed-forward network and compare the performance with that of other multipliers.
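To make the computation pattern described above concrete, the following is a minimal NumPy sketch, not the paper's hardware design: it shows the matrix multiplications that dominate multi-head attention and the position-wise feed-forward network, a simple symmetric int8 quantization step, and a generic cycle estimate for a tiled systolic-array matmul whose latency grows with the number of input words. The dimension names (d_model, n_heads, d_ff), the 16x16 array size, and the cycle formula are illustrative assumptions, not the proposed variable systolic array.

```python
# Illustrative sketch only; dimensions, array size, and cycle model are assumptions.
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention over one sentence; X has shape (n_words, d_model)."""
    n_words, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # three (n_words, d_model) matmuls
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = (Q[:, s] @ K[:, s].T) / np.sqrt(d_head)   # (n_words, n_words)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
        heads.append(w @ V[:, s])                          # (n_words, d_head)
    return np.concatenate(heads, axis=-1) @ Wo             # output projection

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: two more matmuls with a ReLU in between."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization (illustrative)."""
    scale = np.abs(W).max() / 127.0
    return np.clip(np.round(W / scale), -127, 127).astype(np.int8), scale

def systolic_matmul_cycles(M, K, N, rows=16, cols=16):
    """Rough cycles for an (M,K)@(K,N) product on a rows x cols systolic array,
    tiled over the output; latency grows with M, the number of input words."""
    tiles = -(-M // rows) * -(-N // cols)                  # ceiling division
    return tiles * (K + rows + cols - 1)                   # per-tile compute + fill/drain

# Example: a 10-word sentence with d_model = 512, n_heads = 8, d_ff = 2048.
n_words, d_model, n_heads, d_ff = 10, 512, 8, 2048
rng = np.random.default_rng(0)
X = rng.standard_normal((n_words, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
Wq_q, q_scale = quantize_int8(Wq)                          # int8 weights plus scale
Y = feed_forward(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads), W1, b1, W2, b2)
print(Y.shape, systolic_matmul_cycles(n_words, d_model, d_model))
```

The cycle model only mirrors the idea stated in the abstract that latency should track the number of input words; the actual organization of the proposed variable systolic array may differ.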



Acknowledgement

This work was supported by the faculty research fund of Sejong University in 2023.
