http://dx.doi.org/10.5392/JKCA.2014.14.03.022

Analysis on the GPU Performance according to Hierarchical Memory Organization  

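Choi, Hongjun (Department of Electronics and Computer Engineering, Chonnam National University)
Kim, Jongmyon (School of Electrical Engineering, University of Ulsan)
Kim, Cheolhong (Department of Electronics and Computer Engineering, Chonnam National University)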
Abstract
Recently, GPGPU computing has become widely used: because GPUs provide hardware optimized for parallel processing, they now serve general-purpose workloads as well as graphics. The memory system strongly affects the performance of parallel processors such as GPUs. GPUs implement a hierarchical memory architecture to deliver high memory bandwidth, and they also rely on memory address coalescing and memory request merging. This paper analyzes GPU performance under various memory organizations. According to our simulation results, adding an 8KB, 16KB, 32KB, or 64KB L1 cache improves GPU performance by 15.5%, 21.5%, 25.5%, and 30.9%, respectively, compared with a configuration without an L1 cache. However, the experimental results also show that performance decreases for some benchmarks because data dependencies increase the number of memory transactions. Moreover, when cache misses are frequent, average memory access latency increases as the cache hierarchy becomes deeper.
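The two memory effects named in the abstract can be illustrated with a minimal sketch. It is not the simulator or the model used in the paper. The 128-byte segment size, the 32-thread warp, the 4-byte word, and every latency and miss rate below are hypothetical values chosen only for illustration, and the function names are invented for this example.

```python
def count_transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's addresses.

    With address coalescing, accesses that fall in the same aligned
    segment are served by a single memory transaction.
    segment_bytes=128 is an assumed value, not taken from the paper.
    """
    return len({addr // segment_bytes for addr in addresses})


def average_latency(levels, memory_latency):
    """Average memory access latency for a serially searched hierarchy.

    levels: list of (hit_latency, local_miss_rate) pairs, ordered from
    the cache closest to the core outward. Every access that reaches a
    level pays that level's lookup latency, even if it then misses.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that reach the current level
    for hit_latency, miss_rate in levels:
        total += reach * hit_latency
        reach *= miss_rate
    return total + reach * memory_latency


WARP_SIZE = 32   # assumed number of threads per warp
WORD_BYTES = 4   # assumed access width per thread

# Consecutive words: all 32 accesses share one 128-byte segment.
coalesced = [t * WORD_BYTES for t in range(WARP_SIZE)]
# Stride of 32 words: each thread touches a different segment.
strided = [t * WORD_BYTES * 32 for t in range(WARP_SIZE)]

# Hypothetical latencies in cycles and hypothetical local miss rates.
L1 = (20, 0.9)
L2 = (100, 0.9)
DRAM_LATENCY = 400

one_level = average_latency([L1], DRAM_LATENCY)
two_levels = average_latency([L1, L2], DRAM_LATENCY)

print(count_transactions(coalesced), count_transactions(strided))
print(one_level, two_levels)
```

Under these assumed numbers the consecutive pattern needs one transaction and the strided pattern needs 32. With a 90% miss rate at every level, the two-level hierarchy averages about 434 cycles against about 380 cycles for a single level, because each additional level adds lookup time to accesses that miss anyway. With low miss rates the deeper hierarchy comes out ahead instead.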
Keywords
GPU; Memory System; Hierarchical Memory Architecture; Memory Request Merging