Analysis of Programming Techniques for Creating Optimized CUDA Software  

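Kim, Sung-Soo (Department of Computer Engineering, Sogang University)
Kim, Dong-Heon (Department of Computer Engineering, Sogang University)
Woo, Sang-Kyu (Department of Computer Engineering, Sogang University)
Ihm, In-Sung (Department of Computer Engineering, Sogang University)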
Abstract
Unlike general-purpose CPUs, GPUs are specialized many-core streaming processors, and thanks to their outstanding parallel computing capacity they are replacing CPUs in a growing range of computations. In response to this trend, NVIDIA has recently released a new parallel computing architecture called CUDA (Compute Unified Device Architecture), which offers a flexible GPU programming environment for GPGPU (general-purpose GPU) computing. To produce efficient parallel software with the CUDA API, programmers need a clear understanding of many aspects of the GPU's computing architecture. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and we review how those techniques affect execution performance. In particular, we use a specific problem as an example to analyze several factors that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several guidelines that may be applied effectively in CUDA-based parallel programming.
Keywords
GPU; many-core processor; parallel programming; CUDA; memory hierarchy; latency hiding; occupancy; Sobel operator