http://dx.doi.org/10.3745/KTCCS.2019.8.5.111

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs  

Kim, Gwang Bok (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Jong Myon (School of IT Convergence, University of Ulsan)
Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)
Publication Information
KIPS Transactions on Computer and Communication Systems / v.8, no.5, 2019, pp. 111-118
Abstract
Recent graphics processing units (GPUs) provide high throughput through powerful hardware resources. However, massive memory accesses degrade GPU performance due to cache inefficiency. GPU performance can therefore be improved by reducing thread-level parallelism when the cache suffers memory contention. In this paper, we propose a dynamic warp scheduler that controls thread-level parallelism according to the degree of cache contention. In general, the greedy-then-oldest (GTO) warp issue policy exhibits lower parallelism than the loose round-robin (LRR) policy. Accordingly, the proposed warp scheduler employs the LRR policy when Miss Status Holding Register (MSHR) utilization is low, and switches to the GTO policy to reduce thread-level parallelism when MSHR utilization is high. By dynamically selecting the more efficient scheduling policy, the proposed technique outperforms both the LRR and GTO policies. Our experimental results show that it improves IPC by 12.8% and 3.5% on average over LRR and GTO, respectively.
Keywords
GPU; Warp Scheduling; Cache; MSHR; Parallelism;
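The switching policy described in the abstract can be sketched as follows. This is a minimal illustrative model, not the authors' implementation: the class name, the 50% high-watermark threshold, and the age bookkeeping are assumptions made for the example.

```python
from collections import deque

class MshrAwareScheduler:
    """Toy model of a dynamic warp scheduler that picks LRR or GTO
    each cycle based on MSHR utilization (threshold is assumed)."""

    def __init__(self, num_warps, mshr_entries, high_watermark=0.5):
        self.mshr_entries = mshr_entries
        self.high_watermark = high_watermark
        self.rr_queue = deque(range(num_warps))      # LRR rotation order
        self.age = {w: w for w in range(num_warps)}  # lower value = older warp
        self.last = None                             # warp GTO is greedy on

    def issue(self, ready, mshrs_in_use):
        """Select the next warp to issue from the set of ready warp ids."""
        utilization = mshrs_in_use / self.mshr_entries
        if utilization >= self.high_watermark:
            # GTO: keep issuing from the current warp until it stalls,
            # then fall back to the oldest ready warp. This lowers
            # thread-level parallelism and eases cache contention.
            if self.last not in ready:
                self.last = min(ready, key=lambda w: self.age[w])
            return self.last
        # LRR: rotate through all warps to maximize parallelism
        for _ in range(len(self.rr_queue)):
            w = self.rr_queue[0]
            self.rr_queue.rotate(-1)
            if w in ready:
                return w
        return None
```

In this sketch the scheduler consults MSHR occupancy on every issue decision; a hardware design would more likely sample utilization over a window of cycles before switching policies.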