1 |
NVIDIA, "CUDA C Programming Guide," 2012.
|
2 |
Khronos OpenCL Group, "The OpenCL Specification," 2011.
|
3 |
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 72-83, 2012.
|
4 |
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 99-110, 2013.
|
5 |
Kim, G. B., Kim, J. M., & Kim, C. H., "Dynamic Selective Warp Scheduling for GPUs Using L1 Data Cache Locality Information," International Conference on Parallel and Distributed Computing: Applications and Technologies. Springer, Singapore, pp. 230-239, 2018.
|
6 |
Zhang, Y., Xing, Z., Liu, C., Tang, C., & Wang, Q., "Locality based warp scheduling in GPGPUs," Future Generation Computer Systems, 82, pp. 520-527, 2018.
|
7 |
ElTantawy, A., & Aamodt, T. M., "Warp scheduling for fine-grained synchronization," In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 375-388, 2018.
|
8 |
Oh, Yunho, et al. "Adaptive Cooperation of Prefetching and Warp Scheduling on GPUs." IEEE Transactions on Computers 68.4 (2019): 609-616.
|
9 |
S. Y. Lee, A. Arunkumar, and C. J. Wu, "CAWA: Coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads," ACM SIGARCH Computer Architecture News (ISCA), pp. 515-527, 2015.
|
10 |
V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 308-317, 2011.
|
11 |
M. Lee, G. Kim, J. Kim, W. Seo, Y. Cho, and S. Ryu, "iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs," High Performance Computer Architecture (HPCA), IEEE International Symposium on. pp. 370-381, 2016.
|
12 |
M. K. Yoon, Y. Oh, S. Lee, S. H. Kim, D. Kim, and W. W. Ro, "DRAW: Investigating benefits of adaptive fetch group size on GPU," Performance Analysis of Systems and Software (ISPASS), pp. 183-192, 2015.
|
13 |
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, pp. 163-174, 2009.
|
14 |
"NVIDIA CUDA SDK Code Samples," http://developer.nvidia.com/cuda-downloads, 2015.
|
15 |
M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "A hierarchical thread scheduler and register file for energy-efficient throughput processors," ACM Transactions on Computer Systems (TOCS), Vol. 30, No. 2, April 2012.
|
16 |
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," Proceedings of the International Symposium on Workload Characterization (IISWC), pp. 44-54, 2009.
|
17 |
S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," Innovative Parallel Computing (InPar), pp. 1-10, 2012.
|
18 |
J. A. Stratton, C. Rodrigues, J. I. Sung, et al. "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," Center for Reliable and High-Performance Computing, 2012.
|