[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.6109/jkiice.2022.26.6.850

Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression

Chae, Byeong-Cheol (Department of Electronic and Information Engineering, Korea University)

Publication Information

Journal of the Korea Institute of Information and Communication Engineering, v.26, no.6, 2022, pp. 850-858

Abstract

To deploy Gated Recurrent Units (GRUs) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that supports structured compression. First, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Second, the energy consumed by external memory accesses is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can process a structured sparse model, benefiting from the algorithm-hardware co-design workflow. Moreover, inference tasks can be performed flexibly across all functional dimensions, sequence lengths, and numbers of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPS on a structured sparse GRU network without batching. Compared with CPU and GPU implementations, the low-cost FPGA accelerator achieves 57x and 30x improvements in latency and 300x and 23.44x improvements in energy efficiency, respectively. The proposed accelerator thus serves as an early study for real-time embedded applications and demonstrates the potential for further development.
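
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of one common reading of "structured top-k pruning": every row of a weight matrix keeps its k largest-magnitude entries and the rest are zeroed, so that all rows carry the same number of nonzeros. The function names `topk_prune_rows` and `sparsity`, the row-wise granularity, and the toy matrix are assumptions for illustration and are not taken from the paper.

```python
def topk_prune_rows(weights, k):
    """Keep the k largest-magnitude entries in each row and zero the rest.

    Assumed granularity: one row of a weight matrix. The paper's actual
    pruning unit is not given in the abstract.
    """
    pruned = []
    for row in weights:
        # Rank column indices by descending |w|; ties go to the lower index.
        ranked = sorted(range(len(row)), key=lambda j: (-abs(row[j]), j))
        keep = set(ranked[:k])
        pruned.append([w if j in keep else 0.0 for j, w in enumerate(row)])
    return pruned


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Hypothetical 2x4 weight matrix, pruned to 2 nonzeros per row.
W = [
    [0.9, -0.1, 0.05, -0.7],
    [0.2, 0.3, -0.8, 0.01],
]
W_pruned = topk_prune_rows(W, k=2)
print(W_pruned)
print(sparsity(W_pruned))
```

Because every row ends up with exactly k nonzeros, the per-row workload is identical, which is the kind of regularity that tends to make a sparse model easier to schedule on fixed hardware than unstructured pruning does.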

Keywords

AI Accelerator; FPGA; GRU; Human Action Recognition; Quantization; Pruning

