Cycle-accurate NPU Simulator and Performance Evaluation According to Data Access Strategies

  • Received : 2022.07.09
  • Accepted : 2022.08.01
  • Published : 2022.08.31

Abstract

Demand is growing for deep neural networks (DNNs) in embedded applications such as classification and object detection. Because of constraints on power, performance, and area, DNN processing in the embedded domain often requires custom acceleration hardware such as a neural processing unit (NPU). Processing DNN models requires a large amount of data, and transferring that data to the NPU without stalls is crucial for performance. In this paper, we develop a cycle-accurate NPU simulator for evaluating diverse NPU microarchitectures. In addition, we propose a novel technique that reduces the number of memory accesses when processing the convolutional layers of convolutional neural networks (CNNs) on the NPU. The main idea is to reuse the data that overlaps between the previous and current input windows, with data memory interleaving making it possible to quickly read consecutive data from unaligned locations. We implemented the proposed technique in the cycle-accurate NPU simulator and measured performance with LeNet-5, VGGNet-16, and ResNet-50. The experiments show up to a 2.08x speedup in processing one convolutional layer compared to the baseline.
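The window-reuse idea described in the abstract can be illustrated with a small counting model. The sketch below is not the paper's simulator: the function names, the low-order interleaving scheme (bank = address mod number of banks), and the example sizes are all assumptions chosen for illustration. It counts how many input elements are fetched while a square window slides across one row, with and without reusing the overlapping columns, and checks whether an unaligned run of consecutive words falls into distinct banks.

```python
def window_reads(width, kernel, stride, reuse):
    """Count input elements fetched while a kernel x kernel window
    slides horizontally across an input that is `width` columns wide.

    Hypothetical model: without reuse every window position fetches
    all kernel*kernel elements; with reuse only the columns that were
    not part of the previous window are fetched.
    """
    positions = (width - kernel) // stride + 1
    full_window = kernel * kernel
    if not reuse:
        return positions * full_window
    # Columns that enter the window at each step (at most `kernel`).
    new_cols = min(stride, kernel)
    return full_window + (positions - 1) * new_cols * kernel


def banks_touched(start, length, num_banks):
    """Banks hit by `length` consecutive words starting at address
    `start`, assuming low-order interleaving (bank = address % num_banks)."""
    return [(start + i) % num_banks for i in range(length)]


def conflict_free(start, length, num_banks):
    """True if every word of the run lands in a different bank, so the
    run could be read in parallel even though `start` is unaligned."""
    banks = banks_touched(start, length, num_banks)
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    # Illustrative sizes only: a 32-column input, 5x5 window, stride 1.
    print(window_reads(32, 5, 1, reuse=False))  # 700
    print(window_reads(32, 5, 1, reuse=True))   # 160
    # An unaligned 4-word run starting at address 3 over 4 banks.
    print(banks_touched(3, 4, 4))               # [3, 0, 1, 2]
    print(conflict_free(3, 4, 4))               # True
```

Under these assumptions the reuse case fetches far fewer elements per row sweep, and the unaligned run still maps to four different banks. These are access counts in a toy model, not cycle counts, so they should not be compared with the speedups reported in the paper.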

Acknowledgement

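This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with funding from the Korean government (Ministry of Science and ICT) in 2022 (No. 2019-0-00533, Verification of Architectural Security Vulnerabilities in Computer Processors and Attack Detection and Response). This work was also supported by the National Research Foundation of Korea (NRF) with funding from the Korean government (Ministry of Science and ICT) (NRF-2022R1A2C1011469). This research was conducted with the support of Samsung Electronics (project number IO210204-08384-01).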
