http://dx.doi.org/10.3745/KTCCS.2017.6.5.219

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization  

Thuan, Do Cong (School of Electronics and Computer Engineering, Chonnam National University)
Choi, Yong (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Jong Myon (School of Electrical Engineering, University of Ulsan)
Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
Publication Information
KIPS Transactions on Computer and Communication Systems, Vol.6, No.5, pp.219-230, 2017.
Abstract
General-Purpose Graphics Processing Units (GPGPUs) build on a massively parallel architecture and rely on multithreading to exploit parallelism. With programming models such as CUDA and OpenCL, GPGPUs have become highly effective at exploiting the abundant thread-level parallelism exposed by parallel applications. Unfortunately, modern GPGPUs cannot efficiently utilize their available hardware resources for many general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long-latency instructions, which results in lost opportunities to improve performance. This paper studies the effects of the hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that alleviates the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long-latency instructions and warps with short-latency instructions, and then schedules the long-latency warps before the short-latency warps. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that dynamically reduces the number of streaming multiprocessors to which thread blocks are assigned when a high degree of contention is detected in the memory system and interconnection network. Based on our experiments on a 15-streaming-multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.
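To make the scheduling idea in the abstract more concrete, the Python sketch below illustrates the two ideas at a high level: issuing long-latency warps of a thread block before short-latency warps, and shrinking the set of streaming multiprocessors that receive new thread blocks when contention is high. All names, fields, the latency classification, and the thresholds (Warp, next_op, LONG_LATENCY_OPS, active_sm_count, the 0.8 threshold, the halving policy) are illustrative assumptions, not the authors' actual hardware design.

```python
# Minimal sketch of the two-group warp scheduling idea plus the contention-based
# SM throttling heuristic described in the abstract. Fields, thresholds, and the
# latency test are illustrative assumptions, not the paper's hardware implementation.
from dataclasses import dataclass

LONG_LATENCY_OPS = {"load_global", "store_global", "texture_fetch"}  # assumed set

@dataclass
class Warp:
    warp_id: int
    next_op: str        # opcode of the next instruction to issue
    ready: bool = True  # not stalled on an outstanding dependency

def schedule_thread_block(warps):
    """Return warps in issue order: long-latency warps before short-latency ones.

    Issuing long-latency warps first lets their memory requests start early, so
    short-latency warps can execute while those requests are still in flight.
    """
    long_lat = [w for w in warps if w.ready and w.next_op in LONG_LATENCY_OPS]
    short_lat = [w for w in warps if w.ready and w.next_op not in LONG_LATENCY_OPS]
    return long_lat + short_lat

def active_sm_count(total_sms, contention_level, threshold=0.8, min_sms=1):
    """Supplemental technique: reduce the number of SMs that receive new thread
    blocks when memory/interconnect contention is high (threshold and halving
    policy are assumed here)."""
    if contention_level > threshold:
        return max(min_sms, total_sms // 2)
    return total_sms

# Example: warps waiting to issue global loads are scheduled ahead of ALU-only warps.
warps = [Warp(0, "add"), Warp(1, "load_global"), Warp(2, "mul"), Warp(3, "load_global")]
print([w.warp_id for w in schedule_thread_block(warps)])  # -> [1, 3, 0, 2]
print(active_sm_count(total_sms=15, contention_level=0.9))  # -> 7
```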
Keywords
GPGPU; Parallelism; Performance; Warp Scheduling; Resource Utilization;