http://dx.doi.org/10.7471/ikeee.2020.24.2.559

Memory Data Layout and DMA Transfer Techniques for Efficient Data Transfer in a CNN Accelerator

Cho, Seok-Jae (Dept. of Electronics Engineering, Pusan National University)
Park, Sungkyung (Dept. of Electronics Engineering, Pusan National University)
Park, Chester Sungchung (Dept. of Electronics Engineering, Konkuk University)
Publication Information
Journal of IKEEE, vol. 24, no. 2, 2020, pp. 559-569
Abstract
Artificial-intelligence applications based on the convolutional neural network (CNN), one of the deep learning algorithms, use off-chip memory to store the data of the convolution layers. DMA can reduce the processor load incurred on every data transfer, and performance degradation can be reduced further by changing the order in which convolution-layer data are transferred to the accelerator's global buffer. For the basic layout, whose memory addresses are contiguous, SG-DMA with descriptor presetting showed about a 3.4-fold performance improvement over ordinary DMA, while for the ideal layout, whose memory addresses are discontiguous, ordinary DMA was about 1,396 cycles faster than SG-DMA. Experiments showed that combining the memory data layout with the appropriate DMA type can reduce the DMA preset overhead by about 86 percent.
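As a rough illustration of the trade-off described above, the C sketch below (hypothetical code, not taken from the paper; the names sg_desc, TILE_H, etc. are assumptions, and memcpy stands in for a real DMA engine) contrasts a tile whose data are contiguous in off-chip memory, which needs a single transfer setup, with a tile whose rows are strided, which needs one scatter-gather descriptor per row.

```c
/* Hypothetical sketch: contiguous vs. strided tile transfer.
 * memcpy emulates the DMA engine; the descriptor count models the
 * per-transfer preset overhead that the paper measures in cycles. */
#include <stdio.h>
#include <string.h>

typedef struct sg_desc {            /* one scatter-gather descriptor    */
    const void     *src;            /* source address in off-chip DRAM  */
    void           *dst;            /* destination in the global buffer */
    size_t          len;            /* bytes to move                    */
    struct sg_desc *next;           /* link to the next descriptor      */
} sg_desc;

enum { H = 32, W = 32, TILE_H = 8, TILE_W = 8 };   /* toy layer sizes  */

int main(void) {
    static float dram[H * W];             /* feature map in off-chip memory */
    static float gbuf[TILE_H * TILE_W];   /* accelerator's global buffer    */
    for (int i = 0; i < H * W; i++) dram[i] = (float)i;

    /* Case 1: the tile is stored contiguously in DRAM, so one
     * descriptor (or a plain DMA register setup) moves all of it.   */
    memcpy(gbuf, dram, sizeof(gbuf));
    printf("contiguous tile: 1 descriptor\n");

    /* Case 2: row-major layout leaves the tile's rows strided by W,
     * so SG-DMA needs one descriptor per row, chained together.     */
    sg_desc chain[TILE_H];
    for (int r = 0; r < TILE_H; r++) {
        chain[r].src  = &dram[r * W];        /* row r, stride W floats */
        chain[r].dst  = &gbuf[r * TILE_W];
        chain[r].len  = TILE_W * sizeof(float);
        chain[r].next = (r + 1 < TILE_H) ? &chain[r + 1] : NULL;
    }
    for (sg_desc *d = chain; d != NULL; d = d->next)  /* walk the chain */
        memcpy(d->dst, d->src, d->len);
    printf("strided tile: %d descriptors\n", TILE_H);
    return 0;
}
```

The point of the sketch is only the descriptor count: a data layout that makes each tile contiguous lets a single preset cover the whole transfer, while a strided layout multiplies the presets whose cost the abstract's cycle figures quantify.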
Keywords
CNN; memory data layout; loop tiling; accelerator; scatter-gather DMA
Citations & Related Records
연도 인용수 순위
  • Reference
1 Ma, Yufei, et al. "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2017. DOI: 10.1145/3020078.3021736
2 A. Krizhevsky, I. Sutskever, and G. E. Hinton. "Imagenet classification with deep convolutional neural networks," In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, "Advances in Neural Information Processing Systems 25," Curran Associates, Inc., pp.1097-1105, 2012.
3 K. Simonyan and A. "Zisserman. Very deep convolutional networks for largescale image recognition," CoRR, abs/1409.1556, 2014.
4 Chen, Y. H., Emer, J., & Sze, V. (2018). "Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks," arXiv preprint arXiv:1807.07928
5 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Microsoft Research, "Deep Residual Learning for Image Recognition," arXiv:1512.03385v1, 2015.
6 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto. Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861v1, 2017.
7 Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," arXiv:1707.01083, Dec 2017.
8 Mingxing Tan, Quoc V. L, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," arXiv:1905.11946v3, 2019.
9 LeCun, Yann, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," in IEEE, 1998. DOI: 10.1109/5.726791
10 K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556v6, 2015.
11 C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. Cnp: "An fpga-based processor for convolutional networks. In Field Programmable Logic and Applications," 2009. FPL 2009. International Conference on IEEE, pp.32-37, 2009. DOI: 10.1109/FPL.2009.5272559
12 Google. Improving photo search: A step across the semantic gap. http://googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html.
13 V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in IEEE CVPRW, 2014. DOI: 10.1109/CVPRW.2014.106
14 S. Ji, W. Xu, M. Yang, and K. Yu. "3d convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., Vol.35, No.1, pp.221-231, 2013. DOI: 10.1109/TPAMI.2012.59   DOI
15 S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. "A programmable parallel Accelerator for learning and classication," In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pp.273-284. ACM, 2010.
16 R. Hadsell, A. Erkan, P. Sermanet, J. Ben, K. Kavukcuoglu, U. Muller, and Y. LeCun, "A multi-range vision strategy for autonomous offroad navigation," in Proc. Robotics and Applications (RA'07), 2007.
17 Y. Ma, Y. Cao, S. Vrudhula and J. Seo, "Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.26, no.7, pp.1354-1367, 2018. DOI: 10.1109/TVLSI.2018.2815603   DOI
18 Lukas Cavigelli, Luca Benini "Origami: A 803 GOp/s/W Convolutional Network Accelerator in Origami: A 803 GOp/s/W Convolutional Network Accelerator," 2017.
19 Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam, "Shidiannao: shifting vision processing closer to the sensor," in Proceedings of the 42nd. Annual International Symposium on Computer Architecture, pp.92-104, 2015.
20 Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, XiaobingFeng, Xuehai Zhou, and Yunji Chen "PuDianNao: A Polyvalent Machine Learning Accelerator," in ASPLOS '15 Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015. DOI: 10.1145/2786763.2694358
21 Youngjin Jo, Youngnam Kim, Sanghyuk Jung, Yong Ho Song "Implementation of Low Cost and High Performance DMA for PCI Express based SSD," in Korea Institute Of Communication Sciences, 2012.
22 Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable Accelerator for deep convolutional neural networks," in IEEE Journal of Solid-State Circuits (JSSC), Vol.52, No.1, pp.127-138, 2017. DOI: 10.1109/JSSC.2016.2616357   DOI
23 Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 43rd Annual International Symposium on Computer Architecture (ISCA), 2016. DOI: 10.1145/3007787.3001177
24 Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, "DaDianNao: A Machine-Learning Supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014. DOI: 10.1109/MICRO.2014.58
25 Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in ASPLOS '14 Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, 2014. DOI: 10.1145/2644865.2541967
26 C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in FPGA, 2015. DOI: 10.1145/2684746.2689060
27 GUO, Kaiyuan, et al. "A survey of fpga-based neural network accelerator," arXiv preprint arXiv: 1712.08934, 2017.