Analysis of GPU Performance and Memory Efficiency according to Task Processing Units  

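Son, Dong Oh (School of Electronics and Computer Engineering, Chonnam National University)
Sim, Gyu Yeon (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Cheol Hong (School of Electronics and Computer Engineering, Chonnam National University)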
Publication Information
Smart Media Journal / v.4, no.4, 2015, pp. 56-63
Abstract
Modern GPUs can execute massively parallel computations by exploiting many GPU cores. The GPGPU architecture, one approach to exploiting the outstanding computational resources of the GPU, effectively executes general-purpose applications as well as graphics applications. In this paper, we investigate how the number of CTAs (Cooperative Thread Arrays) on an SM (Streaming Multiprocessor) affects memory efficiency and performance, since analyzing this relationship can guide researchers who study GPUs in improving performance. Our simulation results show that, for most benchmarks, increasing the number of CTAs on an SM improves performance. Some benchmarks, however, show no performance improvement. This is because the number of CTAs generated from the same kernel is small or the number of CTAs executed simultaneously is insufficient. To classify the performance results more precisely, we also analyze the relationship between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect our analysis to help studies that aim to improve parallelism and memory efficiency on the GPGPU architecture.
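The two cases in which the abstract reports no improvement (too few CTAs generated by a kernel, or too few CTAs running simultaneously) can be illustrated with a minimal sketch. It is not the simulator or method used in the paper. The class `SMLimits`, both function names, and the default limits (Fermi-like values: 8 CTAs, 1536 threads, 32768 registers, and 48 KB of shared memory per SM) are assumptions chosen for illustration, and allocation granularity is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits. Defaults are illustrative, Fermi-like assumptions."""
    max_ctas: int = 8              # hardware cap on resident CTAs per SM
    max_threads: int = 1536        # resident threads per SM
    registers: int = 32768         # 32-bit registers per SM
    shared_mem_bytes: int = 49152  # shared memory per SM


def resident_ctas_per_sm(threads_per_cta, regs_per_thread, smem_per_cta,
                         limits=SMLimits()):
    """Upper bound on CTAs resident on one SM: the tightest resource wins."""
    bounds = [limits.max_ctas, limits.max_threads // threads_per_cta]
    if regs_per_thread > 0:
        bounds.append(limits.registers // (regs_per_thread * threads_per_cta))
    if smem_per_cta > 0:
        bounds.append(limits.shared_mem_bytes // smem_per_cta)
    return min(bounds)


def ctas_in_flight(grid_ctas, num_sms, ctas_per_sm):
    """CTAs running at once, capped by how many CTAs the kernel launches."""
    return min(grid_ctas, num_sms * ctas_per_sm)
```

Under these assumptions, a hypothetical kernel that launches only 30 CTAs on a 15-SM GPU has 30 CTAs in flight whether the per-SM limit is 4 or 8, so raising the limit cannot help. A kernel that launches 1000 CTAs goes from 60 to 120 CTAs in flight when the limit doubles.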
Keywords
General Purpose computation on the Graphics Processing Unit; Memory; Performance; Cooperative Thread Array Scheduling Schemes
Citations & Related Records
Times Cited By KSCI: 3