• Title/Summary/Keyword: GPU programming

Search Result 60, Processing Time 0.029 seconds

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, the GPUs have been specialized as many-core streaming processors, and are frequently replacing the CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In order to respond to such trend, NVIDIA has recently issued a new parallel computing architecture called CUDA(Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU(General Purpose GPU) computing. In general, when programmers use the CUDA API, they should clearly understand many aspects of GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through a lot of experiment and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performances, such as effective accesses to hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be utilized effectively in CUDA-based parallel programming.

Calculation Effect of GPU Parallel Programing for Planar Multibody System Dynamics (평면 다물체 동역학 해석에서 GPU 병렬 프로그래밍의 계산효과)

  • Jun, C.W.;Sohn, J.H.
    • Journal of Power System Engineering
    • /
    • v.16 no.4
    • /
    • pp.12-16
    • /
    • 2012
  • In this paper, the equations of motions for planar multibody dynamics are established for considering the parallel programming based on GPU. Cartesian coordinates are used to formulate the equations of motion and implicit integration method called HHT-alpha is employed. Open chain multibody system is considered for computer simulation. CUDA toolkit is employed for establishing the GPU parallel programming. The exactness of the analysis is verified from the comparison with ADAMS. The results from parallel computing based on GPU are compared with the results from the sequential programming based on CPU in terms of calculation time. The multiple pendulum with bodies and joints is employed for the computer simulation. In the pendulum system that has 290 bodies, the parallel program indicates an improved efficiency of about 25.5 second(15.5% improvement). It is noted that the larger the size of system is, the time efficiency is better.

Three-dimensional Wave Propagation Modeling using OpenACC and GPU (OpenACC와 GPU를 이용한 3차원 파동 전파 모델링)

  • Kim, Ahreum;Lee, Jongwoo;Ha, Wansoo
    • Geophysics and Geophysical Exploration
    • /
    • v.20 no.2
    • /
    • pp.72-77
    • /
    • 2017
  • We calculated 3D frequency- and Laplace-domain wavefields using time-domain modeling and Fourier transform or Laplace transform. We adopted OpenACC and GPU for an efficient parallel calculation. The OpenACC makes it easy to use GPU accelerators by adding directives in conventional C, C++, and Fortran programming languages. Accordingly, one doesn't have to learn new GPGPU programming languages such as CUDA or OpenCL to use GPU. An OpenACC program allocates GPU memory, transfers data between the host CPU and GPU devices and performs GPU operations automatically or following user-defined directives. We compared performance of 3D wave propagation modeling programs using OpenACC and GPU to that using single-core CPU through numerical tests. Results using a homogeneous model and the SEG/EAGE salt model show that the OpenACC programs are approximately 53 and 30 times faster than those using single-core CPU.

EFFICIENT COMPUTATION OF COMPRESSIBLE FLOW BY HIGHER-ORDER METHOD ACCELERATED USING GPU (고차 정확도 수치기법의 GPU 계산을 통한 효율적인 압축성 유동 해석)

  • Chang, T.K.;Park, J.S.;Kim, C.
    • Journal of computational fluids engineering
    • /
    • v.19 no.3
    • /
    • pp.52-61
    • /
    • 2014
  • The present paper deals with the efficient computation of higher-order CFD methods for compressible flow using graphics processing units (GPU). The higher-order CFD methods, such as discontinuous Galerkin (DG) methods and correction procedure via reconstruction (CPR) methods, can realize arbitrary higher-order accuracy with compact stencil on unstructured mesh. However, they require much more computational costs compared to the widely used finite volume methods (FVM). Graphics processing unit, consisting of hundreds or thousands small cores, is apt to massive parallel computations of compressible flow based on the higher-order CFD methods and can reduce computational time greatly. Higher-order multi-dimensional limiting process (MLP) is applied for the robust control of numerical oscillations around shock discontinuity and implemented efficiently on GPU. The program is written and optimized in CUDA library offered from NVIDIA. The whole algorithms are implemented to guarantee accurate and efficient computations for parallel programming on shared-memory model of GPU. The extensive numerical experiments validates that the GPU successfully accelerates computing compressible flow using higher-order method.

Stereo-To-Multiview Conversion System Using FPGA and GPU Device (FPGA와 GPU를 이용한 스테레오/다시점 변환 시스템)

  • Shin, Hong-Chang;Lee, Jinwhan;Lee, Gwangsoon;Hur, Namho
    • Journal of Broadcast Engineering
    • /
    • v.19 no.5
    • /
    • pp.616-626
    • /
    • 2014
  • In this paper, we introduce a real-time stereo-to-multiview conversion system using FPGA and GPU. The system is based on two different devices so that it consists of two major blocks. The first block is a disparity estimation block that is implemented on FPGA. In this block, each disparity map of stereoscopic video is estimated by DP(dynamic programming)-based stereo matching. And then the estimated disparity maps are refined by post-processing. The refined disparity map is transferred to the GPU device through USB 3.0 and PCI-express interfaces. Stereoscopic video is also transferred to the GPU device. These data are used to render arbitrary number of virtual views in next block. In the second block, disparity-based view interpolation is performed to generate virtual multi-view video. As a final step, all generated views have to be re-arranged into a single image at full resolution for presenting on the target autostereoscopic 3D display. All these steps of the second block are performed in parallel on the GPU device.

High-Performance Korean Morphological Analyzer Using the MapReduce Framework on the GPU

  • Cho, Shi-Won;Lee, Dong-Wook
    • Journal of Electrical Engineering and Technology
    • /
    • v.6 no.4
    • /
    • pp.573-579
    • /
    • 2011
  • To meet the scalability and performance requirements of data analyses, which often involve voluminous data, efficient parallel or concurrent algorithms and frameworks are essential. We present a high-performance Korean morphological analyzer which employs the MapReduce framework on the graphics processing unit (GPU). MapReduce is a programming framework introduced by Google to aid the development of web search applications on a large number of central processing units (CPUs). GPUs are designed as a special-purpose co-processor. Their programming interfaces are typically formulated for graphics applications. Compared to CPUs, GPUs have greater computation power and memory bandwidth; however, GPUs are more difficult to program because of the design of their architectures. The performance of the Korean morphological analyzer using the MapReduce framework on the GPU is evaluated in comparison with the CPU-based model. The proposed Korean Morphological analyzer shows promising scalable performance on distributed computing with the GPU.

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.5
    • /
    • pp.133-138
    • /
    • 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, its use is becoming more common, but GPU programming is still a big obstacle in terms of productivity. In particular, due to the difficulty of managing host memory and GPU memory separately, research is being actively conducted in terms of convenience and performance, and various CPU-GPU memory transfer programming methods are suggested. Meanwhile, recently many SoC (System on a Chip) products such as Apple M1 and NVIDIA Tegra that bundle CPU, GPU, and integrated memory into one large silicon package are emerging. In this study, data between CPU and GPU devices are used in such an integrated memory device and performance-related research is conducted during transmission. It shows different characteristics from the existing environment in which the host memory and GPU memory in the CPU are separated. Here, we want to compare performance by CPU-GPU data transmission method in NVIDIA SoC chips, which are integrated memory devices, and NVIDIA SMX-based V100 GPU devices. For the experimental workload for performance comparison, a two-dimensional matrix transposition example frequently used in HPC applications was used. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method for each GPU device, the transfer performance difference between page-locked memory and pageable memory, overall performance comparison, and performance comparison by workload size. Through this experiment, it was confirmed that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC chip by supporting I/O cache consistency.

CUDA programming environment을 활용한 Path-Integral Monte Carlo Simulation의 구현

  • Lee, Hwa-Young;Im, Eun-Jin
    • Proceedings of the Korea Society for Industrial Systems Conference
    • /
    • 2009.05a
    • /
    • pp.196-199
    • /
    • 2009
  • 높아지는 Graphic Processing Unit (GPU)의 연산 성능과 GPU에서의 범용 프로그래밍을 위한 개발 환경의 개발, 보급으로 인해 GPU를 일반연산에 활용하는 연구가 활발히 진행되고 있다. 이와같이 일반 연산에 활용되고 있는 GPU로 nVidia Tesla와 AMD/ATI의 FireStream 들이 있다. 특수목적 연산 장치인 GPU를 일반 연산을 위해 프로그래밍하기 위해서는 그에 맞는 프로그램 개발 환경이 필요한데 nVidia에서 개발한 CUDA (Compute Unified Device Architecture) 환경은 자사의 GPU 프로그램 개발을 위해 제공되는 개발 환경이다. CUDA 개발 환경은 nVidia GPU 프로그래밍 뿐만 아니라 차세대 이종 병렬 프로그램 개발 환경의 공개 표준으로 논의되고 있는 OpenCL (Open Computing Language) 와 유사한 특징을 보일 것으로 예상되기 때문에 그 중요성은 특정 GPU 에만 국한되지 않는다. 본 논문에서는 경로 적분 몬테 카를로 (Path Integral Monte Carlo) 방법을 CUDA 개발 환경을 사용하여 nVidia GPU 상에서 병렬화한 결과를 제시하였다.

  • PDF

Matrix Multiplication Acceleration with GPU and Locality (GPU와 지역성을 이용한 행렬 곱셈 가속)

  • Kwon, Oh-Young;Lee, Chang-Mug
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2009.10a
    • /
    • pp.902-903
    • /
    • 2009
  • Matrix multiplication is widely used in scientific and engineering field. Locality can improve the execution performance of matrix multiplication. A method for accelerating matrix multiplication is presented. This method uses both CPU and GPU computing power in PC. The presented method improved execution time about %15~30% than the method which uses only GPU.

  • PDF

Non-Photorealistic Rendering using GPU Programming Technique (GPU 프로그래밍 기법을 이용한 비사실적 랜더링)

  • Bat-Ochir, Bolormaa;Sung, Kyung;Kim, Soo-Kyun
    • Journal of Advanced Navigation Technology
    • /
    • v.15 no.6
    • /
    • pp.1228-1233
    • /
    • 2011
  • NPR(Non-Photorealistic rendering) technique is developing by every years. NPR is inspired on artistic styles, which is painting, drawing, technical illustration, animation and cartoon. There have many application programs for NPR, which is popular and useful of animations, even on game industrial. In traditional computer graphics focused on non-photorealism, but this method need much more memory and time. Recent years, Many NPR methods present advanced rendering technique and real time technique using graphic accelerator. This paper propose to explain NPR with GPU programming.