KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

  • Vo, Viet Tan (ICT Convergence System Engineering, Chonnam National University)
  • Kim, Cheol Hong (School of Computer Science and Engineering, Soongsil University)
  • Received : 2021.05.17
  • Accepted : 2021.07.17
  • Published : 2021.12.31

Abstract

Modern graphics processing unit (GPU) architectures offer significant hardware resource enhancements for parallel computing. However, without software optimization, GPUs continue to exhibit hardware resource underutilization. In this paper, we show that different warp scheduling schemes are needed during different kernel execution periods to improve resource utilization. Existing warp schedulers are unaware of kernel progress and therefore cannot provide an effective scheduling policy throughout execution. In addition, we identify the potential to improve resource utilization on GPUs with multiple warp schedulers by sharing stalling warps with selected warp schedulers. To address these efficiency issues, we coordinate a kernel-aware warp scheduler with a warp-sharing mechanism (KAWS). The proposed warp scheduler tracks the execution progress of the running kernel and switches to a more effective scheduling policy when kernel progress reaches a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalling warps to warp schedulers whose execution pipeline units are ready. Our design achieves, on average, 7.97% higher performance than the traditional warp scheduler while incurring only marginal additional hardware overhead.
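The two decisions described above can be summarized in a few lines of logic. The C++ sketch below illustrates them under stated assumptions: the policy pair (a greedy-then-oldest policy before the switch, a loose round-robin policy after), the progress threshold, and all type and function names (Policy, WarpScheduler, select_policy, pick_warp) are hypothetical choices for illustration only and do not reflect the authors' actual hardware design.

// Minimal C++ sketch of the two decisions described in the abstract:
// (1) switching the scheduling policy once kernel progress indicates
// resource underutilization, and (2) letting a scheduler whose pipeline
// is ready issue a warp that is stalling at a sibling scheduler.
// The GTO/LRR policy pair, the threshold, and all names are assumptions.
#include <cstddef>
#include <vector>

enum class Policy { GTO, LRR };   // greedy-then-oldest vs. loose round-robin

struct Warp {
    int  id;
    bool ready;      // has a decoded instruction waiting to issue
    bool stalling;   // blocked because its home scheduler's pipeline is busy
};

struct WarpScheduler {
    std::vector<Warp> warps;   // warps assigned to this scheduler
    bool pipeline_ready;       // execution pipeline unit can accept an instruction
};

// Kernel-aware switch: once the fraction of completed thread blocks passes a
// threshold, the remaining warps no longer fill the SM, so change the policy.
Policy select_policy(std::size_t finished_blocks, std::size_t total_blocks,
                     double threshold = 0.75) {  // threshold value is assumed
    double progress =
        static_cast<double>(finished_blocks) / static_cast<double>(total_blocks);
    return (progress < threshold) ? Policy::GTO : Policy::LRR;
}

// Warp sharing: a scheduler with a ready pipeline first issues one of its own
// ready warps; if none exists, it borrows a ready-but-stalling warp from a
// sibling scheduler so its pipeline does not sit idle.
Warp* pick_warp(WarpScheduler& self, std::vector<WarpScheduler>& all) {
    if (!self.pipeline_ready) return nullptr;
    for (Warp& w : self.warps)
        if (w.ready && !w.stalling) return &w;            // own warp first
    for (WarpScheduler& other : all) {
        if (&other == &self) continue;
        for (Warp& w : other.warps)
            if (w.ready && w.stalling) return &w;         // shared warp
    }
    return nullptr;   // nothing to issue this cycle
}

int main() {
    // Example: 900 of 1000 thread blocks finished -> policy switches off GTO.
    return select_policy(900, 1000) == Policy::LRR ? 0 : 1;
}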

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C1009031).
