Fig. 1. Streaming Multiprocessor Pipeline
Fig. 2. Warp-Level Parallelism on an SM
Fig. 3. Block Diagram of Warp Scheduler with Proposed Units
Fig. 4. IPC Comparison with Different MSHR Threshold T
Fig. 5. L1 Data Cache Miss Rate Comparison
Fig. 6. MSHRs Usage of 3MM
Fig. 7. Interconnection Network Stall Comparison
Fig. 8. L2 Cache Miss Rate
Table 1. System Configuration
Table 2. Benchmarks
References
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y. Cho, and S. Ryu, "Improving GPGPU resource utilization through alternative thread block scheduling," in High Performance Computer Architecture (HPCA), International Symposium on. IEEE, pp.260-271, 2014.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Microarchitecture (MICRO), Annual IEEE/ACM International Symposium on. IEEE, pp.308-317, 2011.
- Y. Zhang, Z. Xing, C. Liu, C. Tang, and Q. Wang, "Locality based warp scheduling in GPGPUs," Future Generation Computer Systems. 2017.
- B. Wang, Y. Zhu, and W. Yu, "OAWS: Memory occlusion aware warp scheduling," Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, pp.45-55, 2016.
- J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, "LaPerm: Locality aware scheduler for dynamic parallelism on GPUs," ACM SIGARCH Computer Architecture News 44.3, pp.584-595, 2016.
- Y. Liu et al., "Barrier-aware warp scheduling for throughput processors," in Proceedings of the 2016 International Conference on Supercomputing. ACM, Article 42, 2016.
- NVIDIA CUDA Programming [internet], http://www.nvidia.com/object/cuda_home_new.html
- M. Lee et al., "iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs," High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, pp.370-381, 2016.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, pp.72-83, 2012.
- J. Zhang, Y. He, F. Shen, Q. A. Li, and H. Tan, "Memory Request Priority Based Warp Scheduling for GPUs," Chinese Journal of Electronics, Vol.27, No.7, pp.985-994, 2018. https://doi.org/10.1049/cje.2018.05.003
- B. Wang, W. Yu, X. H. Sun, and X. Wang, "DaCache: Memory divergence-aware GPU cache management," in Proceedings of the 29th ACM on International Conference on Supercomputing, pp.89-98, 2015.
- K. Choo, D. Troendle, E. A. Gad, and B. Jang, "Contention-Aware Selective Caching to Mitigate Intra-Warp Contention on GPUs," Parallel and Distributed Computing (ISPDC), pp.1-8, 2017.
- X. Chen, L. W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W. M. Hwu, "Adaptive cache management for energy-efficient GPU computing," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp.343-355, 2014.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp.163-174, 2009.
- S. Che, M. Boyer, M. Jiayuan, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Proceedings of the International Symposium on Workload Characterization (IISWC), pp.44-54, 2009.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," Innovative Parallel Computing (InPar), pp.1-10, 2012.
- J. A. Stratton et al., "Parboil: A revised benchmark suite for scientific and commercial throughput computing," Center for Reliable and High-Performance Computing, Technical Report IMPACT-12-01, 2012.