Development Trends of Low-Precision Data Types for NPU Semiconductors

Trends of Low-Precision Processing for AI Processor

  • Hyeji Kim (Artificial Intelligence Processor Research Laboratory) ;
  • Jinho Han (Artificial Intelligence Processor Research Laboratory) ;
  • Youngsu Kwon (Intelligent Semiconductor Research Division)
  • Published : 2022.02.01

Abstract

With the increasing size of transformer-based neural networks, lightweight algorithms and efficient AI accelerators have been developed to train these huge networks within a practical design time. In this article, we present a survey of state-of-the-art research on low-precision computational algorithms, especially floating-point formats, and their hardware accelerators. We describe the trends by focusing on the work of two leading research groups, IBM and Seoul National University, which have deep knowledge of both AI algorithms and hardware architecture. For the low-precision algorithms, we summarize two efficient floating-point formats (hybrid FP8 and radix-4 FP4) together with the accuracy-preserving training algorithms that form the main research stream. Moreover, we describe AI processor architectures that support low-bit mixed-precision computing units, including an integer engine.
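
To make the two formats named above concrete, the following Python sketch decodes bit patterns for a configurable binary floating-point layout (used here for the hybrid FP8 forward/backward splits) and for a 4-bit radix-4 exponent-only layout. This is an illustrative sketch, not the article's or the cited papers' reference implementation: the 1-4-3/1-5-2 splits and the sign-plus-3-bit radix-4 layout follow the formats surveyed in the article, while the bias values, subnormal handling, and all function names are assumptions made only for illustration.

# Illustrative sketch: decoding low-precision floating-point bit patterns.
# Bias values, subnormal handling, and names are assumptions for illustration.

def decode_fp(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode 1 sign bit | exp_bits exponent | man_bits mantissa (binary radix).

    Uses an IEEE-like bias of 2**(exp_bits - 1) - 1 and supports subnormals;
    infinities and NaNs are ignored for brevity.
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_radix4_fp4(bits: int, bias: int = 3) -> float:
    """Decode 1 sign bit | 3-bit exponent with radix 4 and no mantissa.

    Each code maps to +/- 4**(exp - bias); the bias is an assumed value, and
    any reserved zero encoding of the published format is omitted here.
    """
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = bits & 0b111
    return sign * 4.0 ** (exp - bias)

# Hybrid FP8 pairs a 1-4-3 split (more mantissa, finer resolution) for the
# forward pass with a 1-5-2 split (more exponent, wider dynamic range) for
# gradients in the backward pass.
print(decode_fp(0b0_0111_100, exp_bits=4, man_bits=3))   # 1.5 (forward format)
print(decode_fp(0b0_01111_10, exp_bits=5, man_bits=2))   # 1.5 (backward format)
print(decode_radix4_fp4(0b0_101))                        # 16.0 = 4**(5 - 3)

The key design point this illustrates is the trade-off between exponent and mantissa bits: at a fixed 8-bit budget, giving bits to the exponent widens dynamic range (useful for gradients), while giving them to the mantissa improves resolution (useful for activations and weights), and the radix-4 exponent stretches the representable range of a 4-bit code even further.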

Keywords

Acknowledgements

This work was supported by the ICT R&D program of MSIT/IITP [2018-0-00195, Artificial Intelligence Processor Research Laboratory].
