Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

  • Huangfu, Yijie (Department of Electrical and Computer Engineering, Virginia Commonwealth University) ;
  • Zhang, Wei (Department of Electrical and Computer Engineering, Virginia Commonwealth University)
  • Received : 2017.04.21
  • Accepted : 2017.06.02
  • Published : 2017.06.30

Abstract

Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically exhibit considerably less temporal/spatial locality. Moreover, the L1 data cache is shared by many threads whose combined working set is typically much larger than the L1 cache itself, making it critical to bypass the L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompilation. Moreover, we introduce a hybrid method that integrates static profiling information with hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and that the hybrid method achieves performance comparable to state-of-the-art compiler-based bypassing at considerably lower profiling cost.
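To make the idea of hardware-based bypassing concrete, the sketch below models one common flavor of such a mechanism: a per-load-PC saturating counter that learns whether a load instruction's accesses tend to miss in the L1 and, if so, routes future accesses around the cache. This is an illustrative assumption for exposition, not the specific predictor proposed in the paper; all names, counter widths, and thresholds are hypothetical.

```python
# Illustrative sketch (NOT the paper's actual mechanism): a per-load-PC
# saturating-counter bypass predictor. Hits increment the counter, misses
# decrement it; a load bypasses the L1 once its counter saturates low.
# Counter width and threshold values are assumptions for illustration.

class BypassPredictor:
    def __init__(self, max_count=3, bypass_threshold=0):
        self.max_count = max_count          # saturating counter ceiling
        self.bypass_threshold = bypass_threshold
        self.counters = {}                  # load PC -> saturating counter

    def should_bypass(self, pc):
        # Unseen loads default to caching (counter starts saturated high);
        # only loads that have repeatedly missed are bypassed.
        return self.counters.get(pc, self.max_count) <= self.bypass_threshold

    def update(self, pc, hit):
        # Train the counter with the L1 outcome of this access.
        c = self.counters.get(pc, self.max_count)
        c = min(self.max_count, c + 1) if hit else max(0, c - 1)
        self.counters[pc] = c


# Usage: a streaming load (pc=0x40) that always misses is learned and
# bypassed, while an unseen load (pc=0x80) still uses the cache.
pred = BypassPredictor()
for _ in range(4):
    pred.update(0x40, hit=False)
print(pred.should_bypass(0x40))  # True: counter saturated low, bypass L1
print(pred.should_bypass(0x80))  # False: no history, cache normally
```

A hybrid scheme in this spirit could seed the counters from static profiling (e.g., marking known-streaming loads as bypassed at launch) so the hardware only has to adapt the residual, ambiguous loads at run time.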

References

  1. W. Jia, K. A. Shaw, and M. Martonosi, "Characterizing and improving the use of demand-fetched caches in GPUs," in Proceedings of the 26th ACM International Conference on Supercomputing, Venice, Italy, 2012, pp. 15-24.
  2. X. Xie, Y. Liang, G. Sun, and D. Chen, "An efficient compiler framework for cache bypassing on GPUs," in Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, 2013, pp. 516-523.
  3. NVIDIA, CUDA Programming Guide version 5.5, https://developer.nvidia.com/cuda-toolkit-55-archive.
  4. NVIDIA, "NVIDIA's next generation CUDA compute architecture: Fermi," 2009 [Internet], https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
  5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: a benchmark suite for heterogeneous computing," in Proceedings of IEEE International Symposium on Workload Characterization (IISWC 2009), Austin, TX, 2009, pp. 44-54.
  6. NVIDIA, Parallel Thread Execution ISA version 4.0, https://developer.nvidia.com.
  7. A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), Boston, MA, 2009, pp. 163-174.
  8. M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan, "A compiler framework for optimization of affine loop nests for GPGPUs," in Proceedings of the 22nd Annual International Conference on Supercomputing, Island of Kos, Greece, 2008, pp. 225-234.
  9. E. Z. Zhang, Y. Jiang, Z. Guo, and X. Shen, "Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping," in Proceedings of the 24th ACM International Conference on Supercomputing, Tsukuba, Japan, 2010, pp. 115-126.
  10. Y. Yang, P. Xiang, J. Kong, and H. Zhou, "A GPGPU compiler for memory optimization and parallelism management," ACM SIGPLAN Notices, vol. 45, no. 6, pp. 86-97, 2010.
  11. I. J. Sung, J. A. Stratton, and W. M. W. Hwu, "Data layout transformation exploiting memory-level parallelism in structured grid many-core applications," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, Vienna, Austria, 2010, pp. 513-522.
  12. J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, "Many-thread aware prefetching mechanisms for GPGPU applications," in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Atlanta, GA, 2010, pp. 213-224.
  13. M. Rhu, M. Sullivan, J. Leng, and M. Erez, "A locality-aware memory hierarchy for energy-efficient GPU architectures," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, 2013, pp. 86-98.
  14. A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "Orchestrated scheduling and prefetching for GPGPUs," ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 332-343, 2013.
  15. M. Bauer, H. Cook, and B. Khailany, "CudaDMA: optimizing GPU memory bandwidth via warp specialization," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, 2011.
  16. D. H. Woo and H. S. Lee, "COMPASS: a programmable data prefetcher using idle GPU shaders," ACM SIGPLAN Notices, vol. 45, no. 3, pp. 297-310, 2010.
  17. M. Moazeni, A. Bui, and M. Sarrafzadeh, "A memory optimization technique for software-managed scratchpad memory in GPUs," in Proceedings of IEEE 7th Symposium on Application Specific Processors (SASP'09), San Francisco, CA, 2009, pp. 43-49.
  18. Y. Yang, P. Xiang, M. Mantor, N. Rubin, and H. Zhou, "Shared memory multiplexing: a novel way to improve GPGPU throughput," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN, 2012, pp. 283-292.
  19. V. Mekkat, A. Holey, P. C. Yew, and A. Zhai, "Managing shared last-level cache in a heterogeneous multicore processor," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, Edinburgh, Scotland, 2013, pp. 225-234.
  20. X. Chen, L. W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W. M. Hwu, "Adaptive cache management for energy-efficient GPU computing," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 2014, pp. 343-355.
  21. Y. Tian, S. Puthoor, J. L. Greathouse, B. M. Beckmann, and D. A. Jimenez, "Adaptive GPU cache bypassing," in Proceedings of the 8th Workshop on General Purpose Processing using GPUs, San Francisco, CA, 2015, pp. 25-35.
  22. X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang, "Coordinated static and dynamic cache bypassing for GPUs," in Proceedings of 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015, pp. 76-88.
  23. C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, "Locality-driven dynamic GPU cache bypassing," in Proceedings of the 29th ACM on International Conference on Supercomputing, Newport Beach, CA, 2015, pp. 67-77.
  24. A. Gonzalez, C. Aliagas, and M. Valero, "A data cache with multiple caching strategies tuned to different types of locality," in Proceedings of the 9th International Conference on Supercomputing, Barcelona, Spain, 1995, pp. 338-347.
  25. G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun, "A modified approach to data cache management," in Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, MI, 1995, pp. 93-103.
  26. T. L. Johnson, D. A. Connors, M. C. Merten, and W. M. Hwu, "Run-time cache bypassing," IEEE Transactions on Computers, vol. 48, no. 12, pp. 1338-1354, 1999. https://doi.org/10.1109/12.817393
  27. H. Liu, M. Ferdman, J. Huh, and D. Burger, "Cache bursts: a new approach for eliminating dead blocks and increasing cache efficiency," in Proceedings of 2008 41st IEEE/ACM International Symposium on Microarchitecture (MICRO-41), Lake Como, Italy, 2008, pp. 222-233.
  28. M. Kharbutli and Y. Solihin, "Counter-based cache replacement and bypassing algorithms," IEEE Transactions on Computers, vol. 57, no. 4, pp. 433-447, 2008. https://doi.org/10.1109/TC.2007.70816
  29. Y. Wu, R. Rakvic, L. L. Chen, C. C. Miao, G. Chrysos, and J. Fang, "Compiler managed micro-cache bypassing for high performance EPIC processors," in Proceedings of 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), Istanbul, Turkey, 2002, pp. 134-145.
  30. Z. Wang, K. S. McKinley, A. L. Rosenberg, and C. C. Weems, "Using the compiler to improve cache replacement decisions," in Proceedings of 2002 International Conference on Parallel Architectures and Compilation Techniques, Charlottesville, VA, 2002, pp. 199-208.