
A Study on Memory Data Layout and DMA Transfer Techniques for Efficient Data Transfer in CNN Accelerators

  • Received : 2020.06.01
  • Accepted : 2020.06.22
  • Published : 2020.06.30

Abstract

CNN applications, which use one of the deep learning algorithms, store the large volume of convolution-layer data in off-chip memory on the hardware side, and performance can be improved by using DMA to reduce the processor load on every data transfer. Performance degradation can also be reduced by changing the order in which convolution-layer data is transferred to the accelerator's global buffer. For the basic layout, whose data resides at discontiguous memory addresses, SG-DMA (scatter-gather DMA) showed roughly a 3.4x improvement over ordinary DMA in the DMA pre-setup phase, while for the ideal layout, whose data resides at contiguous memory addresses, both ordinary DMA and SG-DMA incurred an overhead of about 1,396 cycles. Experiments confirmed that the most efficient combination of memory data layout and DMA type can reduce the processor's DMA pre-setup load by about 86 percent.
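The trade-off described above can be made concrete with a short sketch. The C code below is illustrative only and is not taken from the paper: the register block (dma_regs_t), descriptor format (sg_desc_t), and function names are hypothetical stand-ins for a generic DMA engine. It shows why the basic (strided) layout forces an ordinary DMA engine to be reprogrammed once per row, why an SG-DMA descriptor chain moves that setup work off the processor, and why the ideal (contiguous) layout needs only a single setup with either engine.

```c
/* Minimal sketch, assuming a generic memory-mapped DMA engine.
 * All register and descriptor layouts here are hypothetical placeholders,
 * not a specific vendor's API. */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical register block of an ordinary DMA engine. */
typedef struct {
    volatile uint32_t src;   /* source address in off-chip memory */
    volatile uint32_t dst;   /* destination address in the global buffer */
    volatile uint32_t len;   /* transfer length in bytes */
    volatile uint32_t ctrl;  /* bit 0: start / busy */
} dma_regs_t;

/* Hypothetical SG descriptor: one per discontiguous chunk, chained in memory. */
typedef struct sg_desc {
    uint32_t src;
    uint32_t dst;
    uint32_t len;
    struct sg_desc *next;    /* NULL terminates the chain */
} sg_desc_t;

/* Basic layout + ordinary DMA: feature-map rows are strided (discontiguous)
 * in off-chip memory, so the processor must reprogram the engine once per
 * row. This repeated pre-setup is the processor load the paper measures. */
void ordinary_dma_basic_layout(dma_regs_t *dma,
                               uint32_t src_base, uint32_t dst_base,
                               uint32_t row_bytes, uint32_t row_stride,
                               unsigned rows)
{
    for (unsigned r = 0; r < rows; r++) {
        dma->src  = src_base + r * row_stride;  /* discontiguous source */
        dma->dst  = dst_base + r * row_bytes;   /* packed into the buffer */
        dma->len  = row_bytes;
        dma->ctrl = 1u;                         /* start transfer */
        while (dma->ctrl & 1u) { }              /* wait for completion */
    }
}

/* Basic layout + SG-DMA: the processor builds the descriptor chain once,
 * then the engine walks the chain on its own. */
void sg_dma_basic_layout(sg_desc_t *descs,
                         uint32_t src_base, uint32_t dst_base,
                         uint32_t row_bytes, uint32_t row_stride,
                         unsigned rows)
{
    for (unsigned r = 0; r < rows; r++) {
        descs[r].src  = src_base + r * row_stride;
        descs[r].dst  = dst_base + r * row_bytes;
        descs[r].len  = row_bytes;
        descs[r].next = (r + 1 < rows) ? &descs[r + 1] : NULL;
    }
    /* A real engine would now be pointed at &descs[0] and started once. */
}

/* Ideal layout: the data is already contiguous, so a single setup suffices
 * and ordinary DMA and SG-DMA behave almost identically. */
void ordinary_dma_ideal_layout(dma_regs_t *dma,
                               uint32_t src, uint32_t dst, uint32_t total_bytes)
{
    dma->src  = src;
    dma->dst  = dst;
    dma->len  = total_bytes;
    dma->ctrl = 1u;
    while (dma->ctrl & 1u) { }
}
```

In this sketch the per-row register writes in ordinary_dma_basic_layout are pure processor overhead, whereas the SG chain is built once and handed off to the engine; that difference is the kind of pre-setup cost the reported 3.4x improvement and 86 percent load reduction quantify.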

Keywords
