MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs

MSHR Utilization-Based Dynamic Warp Scheduler for Improving GPU Performance

  • Received : 2018.11.15
  • Accepted : 2019.02.12
  • Published : 2019.05.31

Abstract

Recent graphics processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive memory accesses degrade GPU performance because of cache inefficiency. Therefore, GPU performance can be improved by reducing thread parallelism when the cache suffers from memory contention. In this paper, we propose a dynamic warp scheduler which controls thread parallelism according to the degree of cache contention. In general, the greedy-then-oldest (GTO) warp issue policy shows lower parallelism than the loose round-robin (LRR) policy. Therefore, the proposed warp scheduler employs the LRR policy when Miss Status Holding Register (MSHR) utilization is low. On the other hand, the GTO policy is employed in order to reduce thread parallelism when MSHR utilization is high. Because it selects the more efficient scheduling policy dynamically, the proposed technique shows better performance than the fixed LRR and GTO policies. According to our experimental results, the proposed technique improves IPC by 12.8% over LRR and by 3.5% over GTO on average.

GPUs provide high throughput based on powerful hardware resources capable of parallel processing. However, when excessive memory requests occur, cache efficiency drops and GPU performance can decrease significantly. When contention in the cache becomes severe, reducing the number of concurrently executed threads alleviates the contention and can improve overall performance. In this paper, we propose a warp scheduling technique that dynamically adjusts parallelism according to the degree of cache contention. Among existing warp scheduling policies, LRR provides higher warp-level parallelism than GTO. Therefore, the proposed warp scheduler applies the LRR policy when the Miss Status Holding Registers (MSHRs), which reflect the degree of L1 data cache contention, show low utilization. Conversely, when MSHR utilization is high, the GTO policy is applied to determine warp priority and lower warp-level parallelism. Because the proposed technique selects the scheduling policy dynamically, it achieves higher IPC and better cache efficiency than the fixed LRR and GTO policies. Experimental results show that the proposed dynamic warp scheduling technique improves IPC by about 12.8% over LRR and about 3.5% over GTO.
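To make the selection rule concrete, the sketch below shows one way the per-cycle policy decision described in the abstract could be expressed. It is only an illustration based on the abstract: the identifiers MshrMonitor, select_policy, and threshold_t are assumptions, not the authors' implementation.

```cpp
// Minimal sketch of the proposed policy-selection logic (illustrative, not the
// authors' code). Names such as MshrMonitor, select_policy, and threshold_t are
// assumptions; the paper only states that the scheduler uses GTO when MSHR
// utilization exceeds a threshold T and LRR otherwise.

#include <cstddef>

enum class WarpSchedPolicy { LRR, GTO };

struct MshrMonitor {
    std::size_t entries_in_use;   // currently allocated MSHR entries
    std::size_t capacity;         // total MSHR entries of the L1 data cache

    double utilization() const {
        return static_cast<double>(entries_in_use) / static_cast<double>(capacity);
    }
};

// Selects the warp scheduling policy for the next issue cycle.
// High MSHR utilization indicates L1 data cache contention, so warp-level
// parallelism is reduced by switching from LRR to GTO.
WarpSchedPolicy select_policy(const MshrMonitor& mshr, double threshold_t) {
    return (mshr.utilization() >= threshold_t) ? WarpSchedPolicy::GTO
                                               : WarpSchedPolicy::LRR;
}
```

In a cycle-accurate GPU model, a routine like select_policy would be consulted each issue cycle before warps are prioritized; threshold_t corresponds to the MSHR threshold T whose sweep is compared in Fig. 4.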

Keywords

Fig. 1. Streaming Multiprocessor Pipeline

Fig. 2. Warp-Level Parallelism on an SM

Fig. 3. Block Diagram of Warp Scheduler with Proposed Units

Fig. 4. IPC Comparison with Different MSHR Threshold T

Fig. 5. L1 Data Cache Miss Rate Comparison

Fig. 6. MSHRs Usage of 3MM

Fig. 7. Interconnection Network Stall Comparison

Fig. 8. L2 Cache Miss Rate

Table 1. System Configuration

Table 2. Benchmarks
