Development Trends of Low-Precision Data Types for NPU Semiconductors

Trends of Low-Precision Processing for AI Processor

  • Hyeji Kim (Artificial Intelligence Processor Research Laboratory) ;
  • Jinho Han (Artificial Intelligence Processor Research Laboratory) ;
  • Youngsu Kwon (Intelligent Semiconductor Research Division)
  • Published : 2022.02.01

Abstract

With the increasing size of transformer-based neural networks, lightweight algorithms and efficient AI accelerators have been developed to train these huge networks within a practical design time. In this article, we present a survey of state-of-the-art research on low-precision computational algorithms, especially floating-point formats, and their hardware accelerators. We describe the trends by focusing on the work of two leading research groups, IBM and Seoul National University, which have deep knowledge of both AI algorithms and hardware architecture. For the low-precision algorithms, we summarize two efficient floating-point formats (hybrid FP8 and radix-4 FP4) together with the accuracy-preserving training algorithms that form the main research stream. Moreover, we describe AI processor architectures that support low-bit mixed-precision computing units, including an integer engine.
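
To make the two formats named above concrete, the following Python sketch decodes bit patterns for a configurable binary floating-point layout (used here for the hybrid FP8 forward/backward splits) and for a 4-bit radix-4 exponent-only layout. This is an illustrative sketch, not the article's or the cited papers' reference implementation: the 1-4-3/1-5-2 splits and the sign-plus-3-bit radix-4 layout follow the formats surveyed in the article, while the bias values, subnormal handling, and all function names are assumptions made only for illustration.

# Illustrative sketch: decoding low-precision floating-point bit patterns.
# Bias values, subnormal handling, and names are assumptions for illustration.

def decode_fp(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode 1 sign bit | exp_bits exponent | man_bits mantissa (binary radix).

    Uses an IEEE-like bias of 2**(exp_bits - 1) - 1 and supports subnormals;
    infinities and NaNs are ignored for brevity.
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_radix4_fp4(bits: int, bias: int = 3) -> float:
    """Decode 1 sign bit | 3-bit exponent with radix 4 and no mantissa.

    Each code maps to +/- 4**(exp - bias); the bias is an assumed value, and
    any reserved zero encoding of the published format is omitted here.
    """
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = bits & 0b111
    return sign * 4.0 ** (exp - bias)

# Hybrid FP8 pairs a 1-4-3 split (more mantissa, finer resolution) for the
# forward pass with a 1-5-2 split (more exponent, wider dynamic range) for
# gradients in the backward pass.
print(decode_fp(0b0_0111_100, exp_bits=4, man_bits=3))   # 1.5 (forward format)
print(decode_fp(0b0_01111_10, exp_bits=5, man_bits=2))   # 1.5 (backward format)
print(decode_radix4_fp4(0b0_101))                        # 16.0 = 4**(5 - 3)

The key design point this illustrates is the trade-off between exponent and mantissa bits: at a fixed 8-bit budget, giving bits to the exponent widens dynamic range (useful for gradients), while giving them to the mantissa improves resolution (useful for activations and weights), and the radix-4 exponent stretches the representable range of a 4-bit code even further.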

Keywords

Acknowledgements

This work was supported by the ICT R&D program of MSIT/IITP [2018-0-00195, Artificial Intelligence Processor Research Laboratory].
