Structured Pruning for Efficient Transformer Model Compression

  • Eunji Yoo (Pohang University of Science and Technology)
  • Youngjoo Lee (Pohang University of Science and Technology)
  • Received : 2023.08.31
  • Accepted : 2023.10.12
  • Published : 2023.10.31

Abstract

With the recent development of generative AI technology by major IT companies, the size of Transformer models is growing exponentially, now exceeding a trillion parameters. To keep such AI services sustainable, model compression is essential. In this paper, we identify a hardware-friendly structured pruning pattern and propose a compression method for Transformer models. Because the compression exploits the characteristics of the model's algorithm, the model size can be reduced while performance is preserved as much as possible. Experiments show that, when pruning the GPT-2 and BERT language models, the proposed structured pruning achieves performance nearly identical to fine-grained pruning even in highly sparse regions. The approach reduces model parameters by 80% with only a 0.003% accuracy loss compared to fine-grained pruning, and its structured form enables hardware acceleration.
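
The abstract does not spell out the pruning pattern itself; as a loose illustration only (not the paper's method), the Python sketch below contrasts fine-grained element-wise magnitude pruning with a vector-wise structured variant that zeroes whole contiguous weight segments, the kind of regular sparsity pattern hardware can exploit. The function names, the segment length vec_len, and the use of NumPy are assumptions made for this sketch.

import numpy as np

def fine_grained_prune(w, sparsity):
    # Element-wise magnitude pruning: zero the individually smallest |w| entries.
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def vector_wise_prune(w, sparsity, vec_len=8):
    # Structured pruning: score contiguous length-vec_len segments of each row
    # by L2 norm and zero the weakest segments, so the surviving weights stay in
    # dense, hardware-friendly vectors (hypothetical pattern, not the paper's).
    rows, cols = w.shape
    assert cols % vec_len == 0
    segs = w.reshape(rows, cols // vec_len, vec_len)
    scores = np.linalg.norm(segs, axis=-1).ravel()      # one score per segment
    k = int(sparsity * scores.size)
    keep = np.ones(scores.size, dtype=bool)
    if k > 0:
        keep[np.argsort(scores)[:k]] = False            # drop the weakest segments
    segs = segs * keep.reshape(rows, cols // vec_len, 1)
    return segs.reshape(rows, cols)

# Tiny demo on a random weight matrix at 80% sparsity.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
for name, fn in (("fine-grained", fine_grained_prune), ("vector-wise", vector_wise_prune)):
    print(name, "sparsity:", float((fn(w, 0.8) == 0).mean()))

Both routines reach roughly the same overall sparsity; the paper's claim is that a well-chosen structured pattern can match the accuracy of the fine-grained case (within 0.003% at 80% parameter reduction) while keeping the regularity needed for hardware acceleration.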

Acknowledgement

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00779, Development of high-speed encrypted data processing technology guaranteeing privacy-preserving hardware); by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under the Next-Generation Intelligent Semiconductor Technology Development Program (Devices) (RS-2023-00258227); and by the IITP grant funded by the Korea government (MSIT) in 2023 (RS-2023-00229849, Development of an integrated MPU/Connectivity/lightweight neural network semiconductor based on an eFLASH foundry process for IoT Intelligence).
