http://dx.doi.org/10.7471/ikeee.2020.24.2.559

Memory Data Layout and DMA Transfer Techniques for Efficient Data Transfer in a CNN Accelerator

Cho, Seok-Jae (Dept. of Electronics Engineering, Pusan National University)
Park, Sungkyung (Dept. of Electronics Engineering, Pusan National University)
Park, Chester Sungchung (Dept. of Electronics Engineering, Konkuk University)
Publication Information
Journal of IKEEE, vol. 24, no. 2, 2020, pp. 559-569
Abstract
Artificial-intelligence applications based on the convolutional neural network (CNN), one of the deep learning algorithms, use off-chip memory to store the data of the convolution layers. DMA can reduce the processor load incurred on every data transfer, and performance degradation can be reduced further by changing the order in which convolution-layer data are transferred to the accelerator's global buffer. For the basic layout, whose memory addresses are contiguous, SG-DMA with descriptor presetting showed about a 3.4-fold performance improvement over ordinary DMA, while for the ideal layout, whose memory addresses are discontiguous, ordinary DMA was about 1,396 cycles faster than SG-DMA. Experiments showed that combining the memory data layout with the appropriate DMA type can reduce the DMA preset overhead by about 86 percent.
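As a rough illustration of the trade-off described above, the C sketch below (hypothetical code, not taken from the paper; the names sg_desc, TILE_H, etc. are assumptions, and memcpy stands in for a real DMA engine) contrasts a tile whose data are contiguous in off-chip memory, which needs a single transfer setup, with a tile whose rows are strided, which needs one scatter-gather descriptor per row.

```c
/* Hypothetical sketch: contiguous vs. strided tile transfer.
 * memcpy emulates the DMA engine; the descriptor count models the
 * per-transfer preset overhead that the paper measures in cycles. */
#include <stdio.h>
#include <string.h>

typedef struct sg_desc {            /* one scatter-gather descriptor    */
    const void     *src;            /* source address in off-chip DRAM  */
    void           *dst;            /* destination in the global buffer */
    size_t          len;            /* bytes to move                    */
    struct sg_desc *next;           /* link to the next descriptor      */
} sg_desc;

enum { H = 32, W = 32, TILE_H = 8, TILE_W = 8 };   /* toy layer sizes  */

int main(void) {
    static float dram[H * W];             /* feature map in off-chip memory */
    static float gbuf[TILE_H * TILE_W];   /* accelerator's global buffer    */
    for (int i = 0; i < H * W; i++) dram[i] = (float)i;

    /* Case 1: the tile is stored contiguously in DRAM, so one
     * descriptor (or a plain DMA register setup) moves all of it.   */
    memcpy(gbuf, dram, sizeof(gbuf));
    printf("contiguous tile: 1 descriptor\n");

    /* Case 2: row-major layout leaves the tile's rows strided by W,
     * so SG-DMA needs one descriptor per row, chained together.     */
    sg_desc chain[TILE_H];
    for (int r = 0; r < TILE_H; r++) {
        chain[r].src  = &dram[r * W];        /* row r, stride W floats */
        chain[r].dst  = &gbuf[r * TILE_W];
        chain[r].len  = TILE_W * sizeof(float);
        chain[r].next = (r + 1 < TILE_H) ? &chain[r + 1] : NULL;
    }
    for (sg_desc *d = chain; d != NULL; d = d->next)  /* walk the chain */
        memcpy(d->dst, d->src, d->len);
    printf("strided tile: %d descriptors\n", TILE_H);
    return 0;
}
```

The point of the sketch is only the descriptor count: a data layout that makes each tile contiguous lets a single preset cover the whole transfer, while a strided layout multiplies the presets whose cost the abstract's cycle figures quantify.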
Keywords
CNN; memory data layout; loop tiling; accelerator; scatter-gather DMA
Citations & Related Records
연도 인용수 순위
  • Reference
1 Ma, Yufei, et al. "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2017. DOI: 10.1145/3020078.3021736
2 A. Krizhevsky, I. Sutskever, and G. E. Hinton. "Imagenet classification with deep convolutional neural networks," In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, "Advances in Neural Information Processing Systems 25," Curran Associates, Inc., pp.1097-1105, 2012.
3 K. Simonyan and A. "Zisserman. Very deep convolutional networks for largescale image recognition," CoRR, abs/1409.1556, 2014.
4 Chen, Y. H., Emer, J., & Sze, V. (2018). "Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks," arXiv preprint arXiv:1807.07928
5 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Microsoft Research, "Deep Residual Learning for Image Recognition," arXiv:1512.03385v1, 2015.
6 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto. Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861v1, 2017.
7 Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," arXiv:1707.01083, Dec 2017.
8 Mingxing Tan, Quoc V. L, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv:1905.11946v3, 2019.
9 LeCun, Yann, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," in IEEE, 1998. DOI: 10.1109/5.726791
10 K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556v6, 2015.
11 C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. Cnp: "An fpga-based processor for convolutional networks. In Field Programmable Logic and Applications," 2009. FPL 2009. International Conference on IEEE, pp.32-37, 2009. DOI: 10.1109/FPL.2009.5272559
12 Google. Improving photo search: A step across the semantic gap. http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html.
13 V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in IEEE CVPRW, 2014. DOI: 10.1109/CVPRW.2014.106
14 S. Ji, W. Xu, M. Yang, and K. Yu. "3d convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., Vol.35, No.1, pp.221-231, 2013. DOI: 10.1109/TPAMI.2012.59   DOI
15 S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. "A programmable parallel Accelerator for learning and classication," In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pp.273-284. ACM, 2010.
16 R. Hadsell, A. Erkan, P. Sermanet, J. Ben, K. Kavukcuoglu, U. Muller, and Y. LeCun, "A multi-range vision strategy for autonomous offroad navigation," in Proc. Robotics and Applications (RA'07), 2007.
17 Y. Ma, Y. Cao, S. Vrudhula and J. Seo, "Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.26, no.7, pp.1354-1367, 2018. DOI: 10.1109/TVLSI.2018.2815603   DOI
18 Lukas Cavigelli, Luca Benini "Origami: A 803 GOp/s/W Convolutional Network Accelerator in Origami: A 803 GOp/s/W Convolutional Network Accelerator," 2017.
19 Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam, "Shidiannao: shifting vision processing closer to the sensor," in Proceedings of the 42nd. Annual International Symposium on Computer Architecture, pp.92-104, 2015.
20 Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, XiaobingFeng, Xuehai Zhou, and Yunji Chen "PuDianNao: A Polyvalent Machine Learning Accelerator," in ASPLOS '15 Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015. DOI: 10.1145/2786763.2694358
21 Youngjin Jo, Youngnam Kim, Sanghyuk Jung, Yong Ho Song "Implementation of Low Cost and High Performance DMA for PCI Express based SSD," in Korea Institute Of Communication Sciences, 2012.
22 Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable Accelerator for deep convolutional neural networks," in IEEE Journal of Solid-State Circuits (JSSC), Vol.52, No.1, pp.127-138, 2017. DOI: 10.1109/JSSC.2016.2616357   DOI
23 Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 43rd Annual International Symposium on Computer Architecture (ISCA), 2016. DOI: 10.1145/3007787.3001177
24 Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, "DaDianNao: A Machine-Learning Supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014. DOI: 10.1109/MICRO.2014.58
25 Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in ASPLOS '14 Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, 2014. DOI: 10.1145/2644865.2541967
26 C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in FPGA, 2015. DOI: 10.1145/2684746.2689060
27 GUO, Kaiyuan, et al. "A survey of fpga-based neural network accelerator," arXiv preprint arXiv: 1712.08934, 2017.