• Title/Summary/Keyword: GPU optimization

An Optimization for fast digital hologram generation based on GPU (GPU기반의 디지털 홀로그램 고속 생성을 위한 최적화 기법)

  • Song, Joong-Seok; Park, Jong-Il
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2011.07a / pp.18-21 / 2011
  • Digital holograms are generally generated by the computer-generated hologram (CGH) technique. In principle, however, the CGH technique requires a large amount of computation and high complexity, so generating digital holograms in real time is very difficult. In this paper, CUDA, the parallel processing architecture of the graphics processing unit (GPU), is used for fast CGH computation, and OpenMP is additionally used for multi-GPU processing. Furthermore, optimization techniques such as constantization, vectorization, and loop unrolling are proposed. As a result, the techniques proposed in this paper improved the CGH computation speed by a factor of about 8,300 compared with the conventional CPU implementation.

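For illustration only, the following is a minimal CUDA sketch of the kind of per-pixel CGH accumulation the abstract describes, with constants hoisted out of the loop ("constantization") and a loop-unrolling hint; the structure layout, names, and parameters are assumptions rather than details from the paper.

```cuda
#include <cuda_runtime.h>
#include <math.h>

struct ObjPoint { float x, y, z, amp; };   // hypothetical object-point layout

// Each thread computes one hologram pixel by summing fringe contributions
// from all object points.
__global__ void cghKernel(const ObjPoint* __restrict__ pts, int numPts,
                          float* __restrict__ hologram,
                          int width, int height, float pitch, float k)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // "Constantization": pixel coordinates are computed once, outside the loop.
    float hx = (px - width  * 0.5f) * pitch;
    float hy = (py - height * 0.5f) * pitch;

    float acc = 0.0f;
    #pragma unroll 4                       // loop-unrolling hint
    for (int i = 0; i < numPts; ++i) {
        float dx = hx - pts[i].x;
        float dy = hy - pts[i].y;
        float r  = sqrtf(dx * dx + dy * dy + pts[i].z * pts[i].z);
        acc += pts[i].amp * cosf(k * r);   // fringe contribution
    }
    hologram[py * width + px] = acc;
}
```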

Computationally Efficient Implementation of a Hamming Code Decoder Using Graphics Processing Unit

  • Islam, Md Shohidul; Kim, Cheol-Hong; Kim, Jong-Myon
    • Journal of Communications and Networks / v.17 no.2 / pp.198-202 / 2015
  • This paper presents a computationally efficient implementation of a Hamming code decoder on a graphics processing unit (GPU) to support real-time software-defined radio, a software alternative for realizing wireless communication. The Hamming code algorithm is challenging to parallelize effectively on a GPU because it works on sparsely located data items with several conditional statements, leading to non-coalesced, long-latency global memory accesses and severe thread divergence. To address these issues, we propose an optimized GPU implementation of the Hamming code that exploits the parallelism inherent in the algorithm. Experimental results on a compute unified device architecture (CUDA)-enabled NVIDIA GeForce GTX 560 with 335 cores show that the proposed approach achieves a 99x speedup over the equivalent CPU-based implementation.
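
As a rough illustration of the one-thread-per-codeword parallelism such a decoder can use, here is a minimal Hamming(7,4) syndrome-decoding kernel; the bit packing, buffer names, and launch configuration are assumptions and not taken from the paper.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void hamming74Decode(const uint8_t* __restrict__ codewords,
                                uint8_t* __restrict__ data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint8_t c = codewords[i];              // bits 0..6 hold codeword bits c1..c7
    int c1 = (c >> 0) & 1, c2 = (c >> 1) & 1, c3 = (c >> 2) & 1;
    int c4 = (c >> 3) & 1, c5 = (c >> 4) & 1, c6 = (c >> 5) & 1, c7 = (c >> 6) & 1;

    // Syndrome: each bit checks one parity group; the value is the
    // 1-indexed position of a single-bit error (0 means no error).
    int s = (c1 ^ c3 ^ c5 ^ c7)
          | ((c2 ^ c3 ^ c6 ^ c7) << 1)
          | ((c4 ^ c5 ^ c6 ^ c7) << 2);

    // Single-bit correction; this short conditional typically compiles to a
    // predicated instruction rather than a divergent branch.
    if (s) c ^= (uint8_t)(1u << (s - 1));

    // Extract the four data bits from codeword positions 3, 5, 6, 7.
    data[i] = (uint8_t)( ((c >> 2) & 1)        // d1
                       | (((c >> 4) & 1) << 1) // d2
                       | (((c >> 5) & 1) << 2) // d3
                       | (((c >> 6) & 1) << 3)); // d4
}
```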

Optimizing Skyline Query Processing Algorithms on CUDA Framework (CUDA 프레임워크 상에서 스카이라인 질의처리 알고리즘 최적화)

  • Min, Jun; Han, Hwan-Soo; Lee, Sang-Won
    • Journal of KIISE: Databases / v.37 no.5 / pp.275-284 / 2010
  • GPUs are stream processors based on multiple cores, which can process large volumes of data at high speed with large memory bandwidth. Furthermore, GPUs are less expensive than multi-core CPUs. Recently, the use of GPUs for general-purpose computing has become widespread. The CUDA architecture from NVIDIA is one effort to help developers use GPUs in their application domains. In this paper, we propose techniques to parallelize a skyline algorithm that uses a simple nested-loop structure. To employ the CUDA programming model, we apply optimization techniques that make the skyline algorithm fit within the performance restrictions of the CUDA architecture. According to our experimental results, our optimization techniques improve the original skyline algorithm by 80%.
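
For readers unfamiliar with skyline processing, the sketch below shows how a brute-force dominance test maps onto one CUDA thread per candidate point; it illustrates the nested-loop structure only and is not the paper's optimized algorithm (DIM, the data layout, and the "smaller is better" convention are assumptions).

```cuda
#include <cuda_runtime.h>

#define DIM 4   // hypothetical number of skyline dimensions

// Marks each point as belonging to the skyline (1) or dominated (0).
__global__ void markDominated(const float* __restrict__ pts,  // n x DIM, row-major
                              int n, int* __restrict__ inSkyline)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    bool dominated = false;
    for (int j = 0; j < n && !dominated; ++j) {
        if (j == i) continue;
        bool allLE = true, oneLT = false;
        for (int d = 0; d < DIM; ++d) {
            float a = pts[j * DIM + d], b = pts[i * DIM + d];
            allLE &= (a <= b);
            oneLT |= (a <  b);
        }
        dominated = allLE && oneLT;   // point j dominates point i
    }
    inSkyline[i] = dominated ? 0 : 1;
}
```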

Efficient Representation of Pore Flow, Absorption, Emission and Diffusion using GPU-Accelerated Cloth-Liquid Interaction

  • Jong-Hyun Kim
    • Journal of the Korea Society of Computer and Information / v.29 no.6 / pp.23-29 / 2024
  • In this paper, we propose a fast GPU-based method for representing the pore flow, absorption, emission, and diffusion effects produced by cloth-liquid interaction using smoothed particle hydrodynamics (SPH), a particle-based fluid solver: 1) a unified framework for GPU-based representation of the various physical effects produced by cloth-liquid interaction; 2) a method for efficiently calculating the saturation of a node based on SPH and transferring it to the surrounding porous particles; 3) a method for improving stability based on Darcy's law to reliably calculate the direction of fluid absorption and release; 4) a method for controlling the amount of fluid absorbed by the porous particles according to the direction of flow; and finally, 5) a method for releasing the SPH particles without exceeding their maximum mass. The main advantage of the proposed method is that all computations are performed on the GPU, allowing the porous materials, pore flow, absorption, emission, diffusion, and related effects produced by the interaction of cloth and fluid to be modeled quickly.
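
Purely as a hedged sketch of step 2) above (and the clamping in step 5)), the kernel below gathers liquid mass from nearby SPH particles with a poly6-type smoothing kernel and caps each node's saturation; it uses a brute-force neighbor loop, and every name and constant is an illustrative assumption rather than the paper's code.

```cuda
#include <cuda_runtime.h>
#include <math.h>

struct Particle { float3 pos; float mass; };

__global__ void absorbLiquid(const Particle* __restrict__ fluid, int numFluid,
                             const float3* __restrict__ nodePos,
                             float* __restrict__ saturation, int numNodes,
                             float h, float maxSaturation, float absorbRate)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numNodes) return;

    const float h2 = h * h;
    const float poly6 = 315.0f / (64.0f * 3.14159265f * powf(h, 9.0f));

    float gain = 0.0f;
    for (int j = 0; j < numFluid; ++j) {          // brute force; a real solver uses a grid
        float dx = nodePos[i].x - fluid[j].pos.x;
        float dy = nodePos[i].y - fluid[j].pos.y;
        float dz = nodePos[i].z - fluid[j].pos.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < h2) {
            float w = poly6 * (h2 - r2) * (h2 - r2) * (h2 - r2);
            gain += absorbRate * fluid[j].mass * w;  // weighted liquid uptake
        }
    }
    // Clamp so a node never holds more than its maximum saturation.
    saturation[i] = fminf(saturation[i] + gain, maxSaturation);
}
```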

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo; Kim, Dong-Heon; Woo, Sang-Kyu; Ihm, In-Sung
    • Journal of KIISE: Computing Practices and Letters / v.16 no.7 / pp.775-787 / 2010
  • Unlike general-purpose CPUs, GPUs have been specialized as many-core streaming processors and are frequently replacing CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In response to this trend, NVIDIA has introduced a parallel computing architecture called CUDA (Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU (General-Purpose GPU) computing. In general, when programmers use the CUDA API, they must clearly understand many aspects of the GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that can be used effectively in CUDA-based parallel programming.
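
As one concrete example of the hierarchical-memory optimization discussed above, here is a textbook shared-memory tiling kernel (not taken from the article): each block stages TILE x TILE sub-matrices in shared memory so that global loads are coalesced and each loaded element is reused TILE times. TILE and the row-major layout are assumptions, and n is assumed to be a multiple of TILE.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for n x n row-major matrices; launch with grid (n/TILE, n/TILE)
// and block (TILE, TILE).
__global__ void matmulTiled(const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Coalesced loads from global memory into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```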

Fast Generation of Digital Hologram Based on Multi-GPU (Multi-GPU 기반의 고속 디지털 홀로그램 생성)

  • Song, Joong-Seok; Park, Jung-Sik; Seo, Young-Ho; Park, Jong-Il
    • Journal of Broadcast Engineering / v.16 no.6 / pp.1009-1017 / 2011
  • Fast generation of digital holograms is important for real-time holography broadcasting. In this paper, we propose a method that parallelizes the computer-generated holography (CGH) algorithm for digital hologram generation and accelerates it on multiple graphics processing units (multi-GPU) with the help of the Compute Unified Device Architecture (CUDA) and Open Multi-Processing (OpenMP). In addition, we propose optimization methods such as variable fixation, vectorization, and loop unrolling to make the CGH algorithm even faster. Experimental results show that our method is about 9,700 times faster than a CPU-based one.
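
A hedged host-side sketch of the OpenMP + multi-GPU pattern the abstract describes follows: one OpenMP thread per GPU, each generating one horizontal band of the hologram. The kernel body is a placeholder and all names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical per-slice CGH kernel; the body is a placeholder because only
// the multi-GPU launch pattern is of interest here.
__global__ void cghSliceKernel(float* slice, int width, int rows, int rowOffset)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < rows)
        slice[y * width + x] = 0.0f;   // real fringe computation would go here
}

void generateHologramMultiGPU(float* d_slice[], int width, int height, int numGpus)
{
    int rowsPerGpu = height / numGpus;            // assume height divides evenly

    #pragma omp parallel num_threads(numGpus)     // one host thread per GPU
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);                         // bind this thread to GPU g

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (rowsPerGpu + 15) / 16);
        // Each GPU generates only its own horizontal band of the hologram.
        cghSliceKernel<<<grid, block>>>(d_slice[g], width, rowsPerGpu,
                                        g * rowsPerGpu);
        cudaDeviceSynchronize();
    }
}
```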

A Study on the Performance Improvement of Software Digital Filter using GPU (GPU를 이용한 소프트웨어 디지털 필터의 성능개선에 관한 연구)

  • Yeom, Jae-Hwan; Oh, Se-Jin; Roh, Duk-Gyoo; Jung, Dong-Kyu; Hwang, Ju-Yeon; Oh, Chungsik; Kim, Hyo-Ryoung
    • Journal of the Institute of Convergence Signal Processing / v.19 no.4 / pp.153-161 / 2018
  • This paper describes the performance improvement of a software (SW) digital filter using a GPU (Graphics Processing Unit). The previously developed SW digital filter runs on a CPU (Central Processing Unit) and is therefore slow. The GPU was introduced to filter EAVN (East Asian VLBI Network) observation data faster and to allow the filtered data to be processed together with data from other stations. To increase the computational speed of the SW digital filter, an NVIDIA Titan V GPU board with built-in Tensor Cores was used. By filtering 95 seconds of 2 Gbps (512 MHz BW, 1-IF) observation data, processing times of about 0.78 and 1.1 times the observing time were achieved for the 1 Gbps (16 MHz BW, 16-IF) and 2 Gbps (32 MHz BW, 16-IF) modes, respectively. In addition, 2 Gbps data observed simultaneously with the KVN (Korean VLBI Network) at 1 and 2 Gbps were digitally filtered and compared with the 1 Gbps data, yielding similar values for the cross-power spectrum, phase, and SNR (Signal-to-Noise Ratio). These results confirm the effectiveness of the GPU-based SW digital filter developed in this research for data processing and analysis. In the future, it is expected that observation data can be filtered in real time once the source code is optimized for distributed processing across multiple GPU boards.
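
To illustrate how a software digital filter maps onto the GPU, here is a generic FIR-filtering sketch (not the EAVN pipeline): one CUDA thread per output sample, with the filter taps held in constant memory. The tap count and buffer layout are assumptions.

```cuda
#include <cuda_runtime.h>

#define NTAPS 64

// Filter coefficients; copy from the host with cudaMemcpyToSymbol(d_taps, ...).
__constant__ float d_taps[NTAPS];

__global__ void firFilter(const float* __restrict__ in,
                          float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int t = 0; t < NTAPS; ++t) {
        int idx = i - t;
        if (idx >= 0)                       // zero padding at the start of the stream
            acc += d_taps[t] * in[idx];
    }
    out[i] = acc;
}
```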

Geographic information 3D Synthetic Model based on Regular Mesh (Regular Mesh 기반 지리정보 3D 합성모델)

  • Jung, Ji-Hwan; Hwang, Sun-Myung; Kim, Sung-Ho
    • Journal of Advanced Navigation Technology / v.15 no.4 / pp.616-625 / 2011
  • There are two representative geometry rendering methods: one is Geometry Clipmaps and the other is ROAM 2.0. We propose an extended Geometry Clipmaps algorithm that relies on the GPU rather than the CPU for faster rendering and a wider visible area. The extended algorithm presents a mesh configuration method for each LOD level, a way to configure the mesh network between levels, a mesh-block method for rendering optimization using view-frustum culling (VFC), and an image mapping method to obtain resolution up to 1 m.
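
As a hedged sketch of the per-block work such an approach involves, the kernel below picks a distance-based LOD level and culls mesh blocks against the view frustum with a bounding-sphere test; the six-plane frustum representation and the logarithmic level rule are common practice, not details from the paper.

```cuda
#include <cuda_runtime.h>
#include <math.h>

struct Plane { float nx, ny, nz, d; };   // plane: nx*x + ny*y + nz*z + d = 0, normal points inward

__global__ void classifyBlocks(const float4* __restrict__ blockSphere,  // xyz = center, w = radius
                               const Plane* __restrict__ frustum,       // 6 frustum planes
                               float3 eye, float baseDist,
                               int* __restrict__ lodLevel,
                               int* __restrict__ visible, int numBlocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBlocks) return;

    float4 s = blockSphere[i];

    // Distance-based LOD: the level grows logarithmically with distance to the eye.
    float dx = s.x - eye.x, dy = s.y - eye.y, dz = s.z - eye.z;
    float dist = sqrtf(dx * dx + dy * dy + dz * dz);
    int lvl = (int)floorf(log2f(fmaxf(dist, baseDist) / baseDist));
    lodLevel[i] = lvl < 0 ? 0 : lvl;

    // View-frustum culling: the block is invisible if its bounding sphere lies
    // entirely behind any frustum plane.
    int vis = 1;
    for (int p = 0; p < 6; ++p) {
        float signedDist = frustum[p].nx * s.x + frustum[p].ny * s.y
                         + frustum[p].nz * s.z + frustum[p].d;
        if (signedDist < -s.w) { vis = 0; break; }
    }
    visible[i] = vis;
}
```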

Accelerating Molecular Dynamics Simulation Using Graphics Processing Unit

  • Myung, Hun-Joo; Sakamaki, Ryuji; Oh, Kwang-Jin; Narumi, Tetsu; Yasuoka, Kenji; Lee, Sik
    • Bulletin of the Korean Chemical Society / v.31 no.12 / pp.3639-3643 / 2010
  • We have developed a CUDA-enabled version of a general-purpose molecular dynamics simulation code for the GPU. Implementation details, including the parallelization scheme and performance optimization, are described. We focus on the non-bonded force calculation because it is the most time-consuming part of a molecular dynamics simulation. Timing results for the CUDA-enabled and CPU versions were obtained and compared for a biomolecular system containing 23,558 atoms. The CUDA-enabled versions were found to be faster than the CPU version. This suggests that the GPU can be useful hardware for molecular dynamics simulation.
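
For illustration, a brute-force Lennard-Jones sketch of the non-bonded force evaluation the abstract identifies as the dominant cost is shown below; one thread accumulates the force on one atom from all others. A production code would use neighbor/cell lists; epsilon, sigma, and the data layout here are assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void ljForces(const float4* __restrict__ pos,   // xyz = atom position
                         float4* __restrict__ force,
                         int nAtoms, float epsilon, float sigma, float cutoff2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 pi = pos[i];
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;

    for (int j = 0; j < nAtoms; ++j) {
        if (j == i) continue;
        float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y, dz = pi.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > cutoff2) continue;                 // simple distance cutoff

        float sr2 = sigma * sigma / r2;
        float sr6 = sr2 * sr2 * sr2;
        // Lennard-Jones force magnitude per unit distance:
        // F/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
        float fScale = 24.0f * epsilon * (2.0f * sr6 * sr6 - sr6) / r2;
        fx += fScale * dx; fy += fScale * dy; fz += fScale * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.0f);
}
```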

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

  • Vo, Viet Tan; Kim, Cheol Hong
    • Journal of Information Processing Systems / v.17 no.6 / pp.1157-1169 / 2021
  • Modern graphics processing unit (GPU) architectures offer significant hardware resources for parallel computing. However, without software optimization, GPUs frequently exhibit hardware resource underutilization. In this paper, we show the need to apply different warp-scheduling schemes during different kernel execution periods to improve resource utilization. Existing warp schedulers are unaware of kernel progress and therefore cannot provide an effective scheduling policy. In addition, we identify the potential to improve resource utilization on GPUs with multiple warp schedulers by sharing stalled warps among selected warp schedulers. To address these efficiency issues, we coordinate a kernel-aware warp scheduler with a warp-sharing mechanism (KAWS). The proposed warp scheduler tracks the execution progress of the running kernel and switches to a more effective scheduling policy when the kernel reaches a point of resource underutilization. Meanwhile, the warp-sharing mechanism distributes stalled warps to warp schedulers whose execution pipeline units are ready. On average, our design achieves performance 7.97% higher than that of the traditional warp scheduler, with marginal additional hardware overhead.