
Optimizing 2-stage Tiling-based Matrix Multiplication in FPGA-based Neural Network Accelerator

  • Received : 2022.10.21
  • Accepted : 2022.11.23
  • Published : 2022.12.31

Abstract

Accelerating neural networks has become an important topic in computer vision, and a hardware accelerator is essential even for lightweight models. However, most accelerators support only direct convolution operators; when an accelerator does not provide a GEMM (general matrix multiplication) operation, the computation typically falls back to the CPU. In this paper, we propose an optimization technique for 2-stage tiling-based GEMM routines on VTA. We improve the performance of the matrix multiplication routine by maximizing the reuse of the input matrix and by optimizing operation pipelining. We also apply the proposed technique to the DarkNet framework to verify the improvement. The proposed GEMM method is more than 2.4 times faster than the non-optimized GEMM method, and the inference performance of our DarkNet framework improves by at least 2.3 times.
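To make the two-stage scheme concrete, below is a minimal scalar-C sketch of the kind of tiled GEMM the abstract describes. The names gemm_2stage, BLK, and TILE and the tile sizes are illustrative assumptions, not the paper's actual routine or VTA configuration. The outer (stage-1) loops keep one tile of the input matrix A resident while sweeping every column tile of B, and the inner (stage-2) blocks stand in for the accelerator's native GEMM instruction.

    #include <stdint.h>

    static inline int min_int(int a, int b) { return a < b ? a : b; }

    /* Hypothetical tile sizes, chosen only for illustration: BLK stands in
     * for the accelerator's native GEMM block and TILE for an outer tile
     * sized to fit the on-chip buffer. The actual VTA configuration used
     * in the paper may differ. */
    enum { BLK = 16, TILE = 8 * BLK };

    /* C[M][N] += A[M][K] * B[K][N] with 2-stage tiling (C must be
     * zero-initialized by the caller). Stage 1 walks TILE-sized tiles,
     * keeping the current tile of A fixed while iterating over all column
     * tiles of B, so each loaded A tile is reused roughly N/TILE times.
     * Stage 2 walks BLK-sized blocks inside a tile; on an accelerator
     * each block would be one native GEMM instruction, emulated here in
     * scalar C with int8 inputs and int32 accumulation. */
    void gemm_2stage(int M, int N, int K,
                     const int8_t *A, const int8_t *B, int32_t *C)
    {
        for (int mo = 0; mo < M; mo += TILE)
            for (int ko = 0; ko < K; ko += TILE)
                for (int no = 0; no < N; no += TILE) {  /* A tile reused here */
                    int mE = min_int(mo + TILE, M);
                    int kE = min_int(ko + TILE, K);
                    int nE = min_int(no + TILE, N);
                    for (int mi = mo; mi < mE; mi += BLK)
                        for (int ki = ko; ki < kE; ki += BLK)
                            for (int ni = no; ni < nE; ni += BLK)
                                /* one native BLK x BLK block */
                                for (int m = mi; m < min_int(mi + BLK, mE); ++m)
                                    for (int k = ki; k < min_int(ki + BLK, kE); ++k) {
                                        int32_t a = A[(long)m * K + k];
                                        for (int n = ni; n < min_int(ni + BLK, nE); ++n)
                                            C[(long)m * N + n] += a * (int32_t)B[(long)k * N + n];
                                    }
                }
    }

The pipelining optimization mentioned in the abstract (overlapping tile loads, native GEMM blocks, and result stores) cannot be expressed in this scalar emulation; on VTA it would correspond to the hardware's decoupled access-execute pipeline.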


Acknowledgement

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2018-0-00769, Development of Neuromorphic Computing SW Platform Technology for Artificial Intelligence Systems).
