1 |
A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.395-406, 2013.
|
2 |
S.-Y. Lee, A. Arunkumar, and C.-J. Wu, "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp.515-527, 2015.
|
3 |
W. Jia, K. Shaw, and M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.272-283, 2014.
|
4 |
X. Xie et al., "Coordinated Static and Dynamic Cache Bypassing for GPUs," in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.76-88, 2015.
|
5 |
V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ACM, pp.308-317, 2011.
|
6 |
J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "GPU Computing," in Proceedings of the IEEE, Vol.96, No.5, pp.879-899, 2008.
|
7 |
A. Bakhoda, G. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proceedings of the 2009 International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), pp.163-174, Apr. 2009.
|
8 |
M. Lee et al., "Improving GPGPU Resource Utilization through Alternative Thread Block Scheduling," in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.260-271, 2014.
|
9 |
V. V. P. Harish and P. J. Narayanan, "Large Graph Algorithms for Massively Multithreaded Architectures," Technical Report, IIIT, 2009.
|
10 |
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, pp.73-82, 2008.
|
11 |
V. Volkov and J. W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra," in Proceedings of the ACM/IEEE Conference on Supercomputing, pp.1-11, 2008.
|
12 |
M. Gebhart, R. D. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), pp.235-246, 2011.
|
13 |
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.72-83, 2012.
|
14 |
Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," Computer Architecture Letters, Vol.8, No.1, pp.25-28, 2009.
|
15 |
W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, pp.407-420, 2007.
|
16 |
J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE Computer Society, pp.213-224, 2010.
|
17 |
A. Jog et al., "Orchestrated Scheduling and Prefetching for GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), pp.332-343, Tel-Aviv, Israel, 2013.
|
18 |
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and Wen-Mei W. Hwu, "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, pp.73-82, 2008.
|
19 |
NVIDIA, "CUDA C Programming Guide," 2012.
|
20 |
M. Garland et al., "Parallel Computing Experiences with CUDA," IEEE Micro, Vol.28, No.4, 2008.
|
21 |
A. Munshi, "The OpenCL Specification," Version 1.2, Khronos OpenCL Working Group, 2011.
|
22 |
O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in CSE Penn State Tech Report, TR-CSE-2012-006, 2012.
|
23 |
W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.356-367, 2011.
|
24 |
H.-Y. Cheng, C.-H. Lin, J. Li, and C.-L. Yang, "Memory Latency Reduction via Thread Throttling," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.53-64, 2010.
|
25 |
K. M. Abdalla et al., "Scheduling and Execution of Compute Tasks," US Patent US20130185725, 2013.
|
26 |
J. D. Owens et al., "A Survey of General-Purpose Computation on Graphics Hardware," in Eurographics 2005, State of the Art Reports, pp.21-51, Aug. 2005.
|
27 |
K. Krewell, "AMD's Fusion Finally Arrives," Microprocessor Report, 2011.
|
28 |
K. Krewell, "NVIDIA Lowers the Heat on Kepler," Microprocessor Report, 2012.
|
29 |
NVIDIA, Whitepaper: NVIDIA's Next Generation CUDA Compute and Graphics Architecture: Fermi.
|
30 |
NVIDIA, "NVIDIA Tegra Multiprocessor Architecture," Feb. 2010.
|
31 |
J. Chen et al., "Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency," in Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing, pp.441-451, 2013.
|
32 |
D. Kirk, "NVIDIA CUDA Software and GPU Parallel Computing Architecture," in ISMM, pp.103-104, 2007.
|
33 |
NVIDIA, CUDA SDK [Internet], http://developer.nvidia.com/gpu-computing-sdk.
|