Warp-Based Load/Store Reordering to Improve GPU Time Predictability

  • Huangfu, Yijie (Department of Electrical and Computer Engineering, Virginia Commonwealth University)
  • Zhang, Wei (Department of Electrical and Computer Engineering, Virginia Commonwealth University)
  • Received : 2017.04.21
  • Accepted : 2017.06.02
  • Published : 2017.06.30

Abstract

While graphics processing units (GPUs) can improve the performance of real-time embedded applications that require high throughput, estimating the worst-case execution time (WCET) of GPU programs is challenging, because modern GPUs are designed for average-case performance rather than time predictability. This paper proposes a reordering framework that regulates accesses to the GPU data cache, which improves the accuracy of the estimated GPU L1 data cache miss rate at low performance overhead. The improved cache miss rate estimation in turn yields tighter WCET estimates for GPU programs.
