Optimizing 2-stage Tiling-based Matrix Multiplication in FPGA-based Neural Network Accelerator
Jinse Kwon (Chungnam National University), Jemin Lee (ETRI), Yongin Kwon (ETRI), Jeman Park (ETRI), Misun Yu (ETRI), Taeho Kim (ETRI), Hyungshin Kim (Chungnam National University)
1 | J. W. Bae, B. G. Han, "Implementation of Deep Learning-based Label Inspection System Applicable to Edge Computing Environments," IEMEK J. Embed. Sys. Appl., Vol. 17, No. 2, pp. 77-83, 2022 (in Korean). |
2 | J. Y. Choi, H. J. Lee, C. W. Jeong, H. C. Jung, "Development of AI Service with Surgical Tools Segmentation and Action Recognition," IEMEK J. Embed. Sys. Appl., Vol. 16, No. 2, pp. 51-57, 2021 (in Korean). |
3 | H. D. Kim, "Design of Speech Enhancement U-Net for Embedded Computing," IEMEK J. Embed. Sys. Appl., Vol. 15, No. 5, pp. 227-234, 2020 (in Korean). |
4 | H. J. Kim, "Analysis of Reduced-Width Truncated Mitchell Multiplication for Inferences Using CNNs," IEMEK J. Embed. Sys. Appl., Vol. 15, No. 5, pp. 235-242, 2020 (in Korean). |
5 | J. M. Lee, M. S. Yu, Y. I. Kwon, T. H. Kim, "Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for fast Deployment," Future Generation Computer Systems, Vol. 132, pp. 124-135, 2022. |
6 | G. Y. Kwon, S. W. Park, T. W. Suh, "Cycle-accurate NPU Simulator and Performance Evaluation According to Data Access Strategies," IEMEK J. Embed. Sys. Appl., Vol. 17, No. 4, pp. 217-228, 2022 (in Korean). |
7 | https://coral.ai/products/dev-board |
8 | https://www.intel.com/content/www/us/en/developer/tools/neural-compute-stick/overview.html |
9 | T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, A. Krishnamurthy, "A Hardware-software Blueprint for Flexible Deep Learning Specialization," IEEE Micro, Vol. 39, No. 5, pp. 8-16, 2019. |
10 | S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv preprint arXiv:1410.0759, 2014. |
11 | https://www.openblas.net/ |
12 | https://github.com/Reference-LAPACK/lapack |
13 | https://www.arm.com/technologies/compute-library |
14 | https://github.com/Maratyszcza/NNPACK |
15 | E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, Y. Wang, "Intel Math Kernel Library," High-Performance Computing on the Intel® Xeon Phi™, Springer, Cham, pp. 167-188, 2014. |
16 | https://github.com/AlexeyAB/darknet |
17 | K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016. |
18 | A. Anderson, A. Vasudevan, C. Keane, D. Gregg, "Low-memory GEMM-based Convolution Algorithms for Deep Neural Networks," arXiv preprint arXiv:1709.03395, 2017. |
19 | J. S. Park, K. M. Bin, K. H. Lee, "mGEMM: Low-latency Convolution with Minimal Memory Overhead Optimized for Mobile Devices," Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services, pp. 222-234, 2022. |
20 | M. S. Cho, D. Brand, "MEC: Memory-efficient Convolution for Deep Neural Network," International Conference on Machine Learning, PMLR, pp. 815-824, 2017. |
21 | M. Dukhan, "The Indirect Convolution Algorithm," arXiv preprint arXiv:1907.02129, 2019. |
22 | https://docs.nvidia.com/cuda/cublas/index.html |
23 | C. Nugteren, "CLBlast: A Tuned OpenCL BLAS Library," Proceedings of the International Workshop on OpenCL, pp. 1-10, 2018. |
24 | https://github.com/clMathLibraries/clBLAS |
25 | K. Goto, R. A. van de Geijn, "High-performance Implementation of the Level-3 BLAS," ACM Transactions on Mathematical Software (TOMS), Vol. 35, No. 1, pp. 1-14, 2008. |