http://dx.doi.org/10.9708/jksci.2016.21.2.009

An IPC-based Dynamic Cooperative Thread Array Scheduling Scheme for GPUs  

Son, Dong Oh (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Jong Myon (School of Electrical Engineering, University of Ulsan)
Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
Abstract
Recently, many research groups have focused on GPGPUs to improve the performance of computing systems. GPGPUs can execute general-purpose applications as well as graphics applications by using parallel GPU hardware resources, processing thousands of threads through warp scheduling and CTA (cooperative thread array) scheduling. In this paper, we first use the traditional CTA scheduler to assign varying numbers of CTAs to the SMs (streaming multiprocessors). Our simulation results show that statically increasing the number of CTAs assigned to an SM does not improve performance. To overcome this limitation of traditional CTA scheduling schemes, we propose a new IPC-based dynamic CTA scheduling scheme. Compared to traditional CTA scheduling schemes, the proposed dynamic scheme improves GPU performance by up to 13.1%.
Keywords
General Purpose computation on the Graphics Processing Unit; Cooperative Thread Array Scheduling Schemes; Performance; Parallelism
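The dynamic scheme described in the abstract can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the paper's implementation: class names, thresholds, and the one-CTA-at-a-time adjustment policy are all assumptions. It only shows the core idea of using each SM's measured IPC to decide whether assigning more CTAs to that SM still pays off.

```python
# Hypothetical sketch of IPC-based dynamic CTA scheduling.
# All names and the adjustment policy are illustrative assumptions,
# not the scheme proposed in the paper.

class SM:
    """A streaming multiprocessor with a runtime-adjustable CTA allotment."""

    def __init__(self, max_ctas):
        self.max_ctas = max_ctas      # hardware limit on concurrent CTAs
        self.assigned = 1             # start with a single CTA
        self.ipc_history = []         # IPC samples measured each interval

    def record_ipc(self, ipc):
        self.ipc_history.append(ipc)


def adjust_cta_count(sm):
    """Grow the CTA allotment while measured IPC keeps improving;
    back off once an extra CTA lowers IPC (resource contention)."""
    if len(sm.ipc_history) < 2:
        return sm.assigned            # not enough samples to compare yet
    prev, cur = sm.ipc_history[-2], sm.ipc_history[-1]
    if cur > prev and sm.assigned < sm.max_ctas:
        sm.assigned += 1              # more parallelism still pays off
    elif cur < prev and sm.assigned > 1:
        sm.assigned -= 1              # contention detected: shed one CTA
    return sm.assigned
```

Under this (assumed) policy, an SM whose IPC rises interval over interval receives more CTAs, while an SM whose IPC drops after the last increase gives one back, so the allotment settles near the point where additional CTAs stop helping — in contrast to a static assignment, which cannot react to such per-SM behavior.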