http://dx.doi.org/10.5392/JKCA.2015.15.07.001

Analysis on the Active/Inactive Status of Computational Resources for Improving the Performance of the GPU  

Choi, Hongjun (The Attached Institute of ETRI)
Son, Dongoh (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Jongmyon (School of Electrical Engineering, University of Ulsan)
Kim, Cheolhong (School of Electronics and Computer Engineering, Chonnam National University)
Publication Information
Abstract
In recent high-performance computing systems, GPGPU has been widely used to process general-purpose applications as well as graphics applications, since the GPU provides computational resources optimized for massively parallel processing. Unfortunately, GPGPU cannot fully exploit the computational resources of the GPU when executing general-purpose applications, because such applications are not optimized for the GPU architecture. Therefore, we provide a research guideline for improving the performance of computing systems that use GPGPU. To accomplish this, we analyze the factors that degrade GPU performance. In this paper, in order to clearly classify the causes of these negative factors, GPU core status is defined as one of five statuses: fully active, partial active, idle, memory stall, and GPU core stall. All statuses except fully active cause performance degradation. We evaluate the ratio of each GPU core status depending on the characteristics of the benchmarks in order to identify the specific reasons for degraded GPU performance. According to our simulation results, the partial active, idle, memory stall, and GPU core stall statuses are induced by underutilization of computational resources, low parallelism, frequent memory requests, and structural hazards, respectively.
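The five-status classification described above can be illustrated with a small sketch. The record fields, status names, and priority order below are illustrative assumptions for exposition, not the paper's actual simulator instrumentation:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical per-cycle sample from a GPU core; field names are assumptions.
@dataclass
class CycleSample:
    active_lanes: int      # SIMD lanes doing useful work this cycle
    total_lanes: int       # SIMD width of the core
    has_ready_warp: bool   # at least one warp is schedulable
    memory_stall: bool     # core is blocked on an outstanding memory request
    pipeline_stall: bool   # structural hazard blocks issue in the pipeline

def classify(s: CycleSample) -> str:
    """Map one cycle to one of the five GPU core statuses (assumed priority)."""
    if s.memory_stall:
        return "memory_stall"       # high memory requests
    if s.pipeline_stall:
        return "gpu_core_stall"     # structural hazard
    if not s.has_ready_warp:
        return "idle"               # low parallelism: nothing to schedule
    if s.active_lanes == s.total_lanes:
        return "fully_active"       # all lanes busy; no degradation
    return "partial_active"         # resource underutilization (e.g. divergence)

def status_ratios(trace: list[CycleSample]) -> dict[str, float]:
    """Fraction of simulated cycles spent in each status."""
    counts = Counter(classify(s) for s in trace)
    return {status: n / len(trace) for status, n in counts.items()}

# Tiny synthetic trace: one cycle of each degraded status plus one fully active.
trace = [
    CycleSample(32, 32, True, False, False),   # fully active
    CycleSample(16, 32, True, False, False),   # partial active
    CycleSample(0, 32, False, False, False),   # idle
    CycleSample(0, 32, True, True, False),     # memory stall
]
print(status_ratios(trace))
```

In a real simulator the trace would come from per-cycle pipeline statistics rather than hand-built samples; the point is only that each degraded status maps to one distinct cause, as the abstract states.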
Keywords
GPU; General-Purpose Applications; GPGPU; GPU Core Status;
Citations & Related Records
Times Cited By KSCI: 2
1 K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," In Proceedings of 7th Conference on Architectural Support for Programming Languages and Operating Systems, pp.2-11, 1996.
2 V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, "Clock rate versus IPC: the end of the road for conventional microarchitectures," In Proceedings of International Symposium on Computer Architecture, pp.248-259, 2000.
3 H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark Silicon and the End of Multicore Scaling," In Proceedings of International Symposium on Computer Architecture, pp.365-376, 2011.
4 iSuppli Market Research, available at http://www.isuppli.com/
5 M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," IEEE Computer, Vol.41, No.7, pp.33-38, 2008.
6 Y. H. Jang, C. Park, J. H. Park, N. Kim, and K. H. Yoo, "Parallel Processing for Integral Imaging Pickup using Multiple Threads," International Journal of Korea Contents, Vol.5, No.4, pp.30-34, 2009.
7 Y. H. Jang, C. Park, J. S. Jung, J. H. Park, N. Kim, J. S. Ha, and K. H. Yoo, "Integral Imaging Pickup Method of Bio-Medical Data using GPU and Octree," International Journal of Korea Contents, Vol.10, No.9, pp.1-9, 2009.
8 NVIDIA Corporation, available at http://www.nvidia.com/
9 NVIDIA's Next Generation CUDA Compute Architecture: Fermi, available at http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
10 H. J. Choi, D. O. Son, J. M. Kim, and C. H. Kim, "Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization," Journal of SuperComputing, Vol.69, No.1, pp.330-356, 2014.
11 I. Buck, "Gpu computing with nvidia cuda," In Proceedings of International Conference on Special Interest Group on Computer Graphics and Interactive Techniques(SIGGRAPH), p.6, 2007.
12 T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy, and S. Hahn, "Operating System Support for Overlapping-ISA Heterogeneous Multi-core Architectures," In Proceedings of International Symposium on High Performance Computer Architecture, pp.1-12, 2010.
13 Performance Comparison between CPU and GPU, available at http://www.ncsa.illinois.edu/-kindr/projects/hpca/files/ppac09_presentation.pdf
14 V. W. Lee, C. K. Kim, J. Chhugani, M. Deisher, D. H. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," In Proceedings of International Symposium on Computer Architecture, pp.451-460, 2010.
15 General-purpose computation on graphics hardware, available at http://www.gpgpu.org
16 V. Narasiman, C. J. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," In Proceedings of international symposium on Microarchitecture, pp.308-317, 2011.
17 Y. Zhang and J. D. Owens, "A Quantitative Performance Analysis Model for GPU Architectures," In Proceedings of International Symposium on High Performance Computer Architecture, pp.382-393, 2011.
18 E. Blem, M. Sinclair, and K. Sankaralingam, "Challenge Benchmarks That Must be Conquered to Sustain the GPU Revolution," In Proceedings of Workshop on Emerging Applications for Manycore Architecture, 2010.
19 W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," In Proceedings of Microarchitecture, pp.407-420, 2007.
20 J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," In Proceedings of International Symposium on Computer Architecture, pp.235-246, 2010.
21 W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," In Proceedings of International Symposium on High Performance Computer Architecture, pp.25-36, 2011.
22 O. Mutlu and T. Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," In Proceedings of International Symposium on Computer Architecture, pp.63-74, 2008.
23 A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pp.395-406, 2013.
24 NVIDIA SDK, available at http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
25 H. J. Choi, H. G. Jeon, and C. H. Kim, "Quantitative Analysis of the Negative Factors on the GPU Performance," Journal of KIISE : Computing Practices and Letters, Vol.18, No.4, pp.282-287, 2012.
26 J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch : Enabling Energy Optimizations in GPGPUs," In Proceedings of International Symposium on Computer Architecture, pp.487-498, 2013.
27 A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," In Proceedings of 9th International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.
28 S. Che, M. Boyer, M. Jiayuan, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," In Proceedings of the International Symposium on Workload Characterization, pp.44-54, 2009.