http://dx.doi.org/10.3745/JIPS.01.0084

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs  

Vo, Viet Tan (ICT Convergence System Engineering, Chonnam National University)
Kim, Cheol Hong (School of Computer Science and Engineering, Soongsil University)
Publication Information
Journal of Information Processing Systems / v.17, no.6, 2021, pp. 1157-1169
Abstract
Modern graphics processing unit (GPU) architectures offer significantly enhanced hardware resources for parallel computing. Without software optimization, however, GPUs continue to exhibit hardware resource underutilization. In this paper, we show that different warp scheduling schemes are needed during different kernel execution periods to improve resource utilization. Existing warp schedulers are unaware of kernel execution progress and therefore cannot adapt their scheduling policy accordingly. In addition, we identify the potential to improve resource utilization in GPUs with multiple warp schedulers by sharing stalling warps among selected warp schedulers. To address these efficiency issues, we propose KAWS, which coordinates a kernel-aware warp scheduler with a warp-sharing mechanism. The proposed warp scheduler tracks the execution progress of the running kernel and switches to a more effective scheduling policy once kernel progress reaches a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalling warps to other warp schedulers whose execution pipeline units are ready. On average, our design outperforms the traditional warp scheduler by 7.97% while incurring marginal additional hardware overhead.
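The two ideas in the abstract, switching the scheduling policy once kernel progress passes an underutilization point and handing stalling warps to a scheduler whose pipeline is ready, can be illustrated with a minimal sketch. This is not the authors' implementation: the policy names, the 75% progress threshold, and all data structures (Warp, WarpScheduler, share_stalled_warp) are hypothetical placeholders chosen for illustration only.

```cpp
// Minimal sketch of kernel-aware policy switching and warp sharing.
// All names, thresholds, and structures are assumptions, not the KAWS design.
#include <cstddef>
#include <deque>
#include <vector>

enum class Policy { GreedyThenOldest, LooseRoundRobin };

struct Warp {
    int id = 0;
    bool stalled = false;   // e.g., waiting on a long-latency memory operation
};

struct WarpScheduler {
    std::deque<Warp> warps;       // warps currently assigned to this scheduler
    bool pipeline_ready = true;   // execution pipeline can accept an instruction
    Policy policy = Policy::GreedyThenOldest;

    // Kernel-aware policy switch: once the fraction of finished thread blocks
    // exceeds a (hypothetical) threshold, too few warps remain to hide latency
    // under a greedy policy, so fall back to round-robin-style issuing.
    void update_policy(std::size_t finished_ctas, std::size_t total_ctas) {
        const double progress =
            total_ctas ? static_cast<double>(finished_ctas) / total_ctas : 0.0;
        policy = (progress > 0.75) ? Policy::LooseRoundRobin
                                   : Policy::GreedyThenOldest;
    }

    // Pick one warp to issue this cycle according to the current policy.
    const Warp* issue() {
        for (std::size_t i = 0; i < warps.size(); ++i) {
            Warp& w = warps[i];
            if (w.stalled) continue;
            if (policy == Policy::LooseRoundRobin) {
                // Rotate the issued warp to the back so others get priority next cycle.
                warps.push_back(w);
                warps.erase(warps.begin() + static_cast<std::ptrdiff_t>(i));
                return &warps.back();
            }
            return &w;   // greedy: keep issuing from the same oldest ready warp
        }
        return nullptr;  // every warp is stalled this cycle
    }
};

// Warp sharing (simplified): move one stalling warp to another scheduler whose
// execution pipeline is ready, instead of leaving that pipeline idle.
inline bool share_stalled_warp(WarpScheduler& from,
                               std::vector<WarpScheduler*>& schedulers) {
    for (std::size_t i = 0; i < from.warps.size(); ++i) {
        if (!from.warps[i].stalled) continue;
        for (WarpScheduler* s : schedulers) {
            if (s == &from || !s->pipeline_ready) continue;
            s->warps.push_back(from.warps[i]);
            from.warps.erase(from.warps.begin() + static_cast<std::ptrdiff_t>(i));
            return true;
        }
    }
    return false;
}
```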
Keywords
GPU; Multiple Warp Schedulers; Resource Underutilization; Warp Scheduling