http://dx.doi.org/10.9708/jksci.2019.24.04.001

Latency Hiding based Warp Scheduling Policy for High Performance GPUs  

Kim, Gwang Bok (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Jong Myon (IT Convergence Department, University of Ulsan)
Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
Abstract
The LRR (Loose Round Robin) warp scheduling policy for GPU architectures results in high warp-level parallelism and balanced loads across multiple warps. However, the traditional LRR policy makes multiple warps execute long-latency operations at the same time. When no more warps can be issued under long-latency operations, GPU throughput may be degraded significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better utilization of memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions based on the GTO (Greedy Then Oldest) policy in order to reduce memory stalls. When no warp can issue a memory instruction any more, the warp scheduler selects a warp for a computation instruction in a round-robin manner. Furthermore, the proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
Keywords
GPUs; Warp Scheduler; Latency Hiding; Thread Level Parallelism; Data Locality;
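As a rough illustration of the issue-selection order described in the abstract, the following Python sketch shows memory instructions being prioritized in GTO (Greedy Then Oldest) order, with a round-robin fallback to computation instructions when no warp can issue a memory instruction. Class and field names are hypothetical, and the abstract's use of recently-committed-warp information is omitted here because its details are not given in this section.

```python
# Illustrative sketch only (not the authors' implementation) of the
# latency-hiding warp selection policy described in the abstract.

class Warp:
    def __init__(self, warp_id, age):
        self.warp_id = warp_id
        self.age = age                # smaller value = older warp (GTO tiebreak)
        self.ready = False            # operands/scoreboard ready this cycle
        self.next_is_memory = False   # next instruction is a load/store

class LatencyHidingScheduler:
    def __init__(self, warps):
        self.warps = warps
        self.greedy_warp = None       # warp issued last cycle ("greedy" part of GTO)
        self.rr_pointer = 0           # round-robin pointer for the compute fallback

    def select_warp(self):
        # 1) Memory instructions first, in GTO order: keep issuing from the same
        #    warp while it has a ready memory instruction; otherwise pick the
        #    oldest warp whose next ready instruction is a memory instruction.
        if (self.greedy_warp is not None and self.greedy_warp.ready
                and self.greedy_warp.next_is_memory):
            return self.greedy_warp

        mem_ready = [w for w in self.warps if w.ready and w.next_is_memory]
        if mem_ready:
            chosen = min(mem_ready, key=lambda w: w.age)  # oldest first
            self.greedy_warp = chosen
            return chosen

        # 2) No warp can issue a memory instruction: select a warp with a ready
        #    computation instruction in round-robin order, so computation
        #    overlaps with (hides) the latency of already-issued memory requests.
        n = len(self.warps)
        for i in range(n):
            idx = (self.rr_pointer + i) % n
            w = self.warps[idx]
            if w.ready and not w.next_is_memory:
                self.rr_pointer = (idx + 1) % n
                self.greedy_warp = None
                return w
        return None  # nothing can issue this cycle (stall)
```

The intent of the two-level ordering is that memory requests are pushed out as early as possible (GTO keeps a warp's requests clustered for locality), while the round-robin fallback spreads computation across warps so that memory latency is overlapped rather than serialized.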