• Title/Summary/Keyword: Graphics Processing Unit (GPU)

154 search results

Efficient Parallel Block-layered Nonbinary Quasi-cyclic Low-density Parity-check Decoding on a GPU

  • Thi, Huyen Pham;Lee, Hanho
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.6 no.3
    • /
    • pp.210-219
    • /
    • 2017
  • This paper proposes a modified min-max algorithm (MMMA) for nonbinary quasi-cyclic low-density parity-check (NB-QC-LDPC) codes and a corresponding efficient parallel block-layered decoder architecture on a graphics processing unit (GPU) platform. The algorithm removes multiplications over the Galois field (GF) in the merger step to reduce decoding latency without any performance loss. Implementing the NB-QC-LDPC decoder on a GPU improves both flexibility and scalability, and data and memory structures suited to parallel computing are designed for the GPU decoding. Implementation results for NB-QC-LDPC codes over GF(32) and GF(64) demonstrate that parallel block-layered decoding on a GPU shortens the decoding runtime and achieves a higher coding gain, down to a bit error rate of $10^{-10}$ and a frame error rate of $10^{-7}$, compared to existing methods.
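
As context for the merger step mentioned above, the following is a minimal CUDA C++ sketch of the generic elementary min-max check-node merge, not the paper's MMMA: two incoming message vectors over GF(2^p) are combined, with field addition reduced to bitwise XOR. The paper's specific contribution, eliminating the GF multiplications that permute messages by the parity-check coefficients, is not reproduced here.

```
#include <float.h>

#define Q 32  // GF(32); addition in GF(2^p) is bitwise XOR

// One elementary min-max merge: combine two incoming reliability
// vectors m1, m2 (one entry per field element) into out, where
// out[c] = min over all (a, b) with a XOR b == c of max(m1[a], m2[b]).
void minmax_merge(const float m1[Q], const float m2[Q], float out[Q]) {
    for (int c = 0; c < Q; ++c) out[c] = FLT_MAX;
    for (int a = 0; a < Q; ++a)
        for (int b = 0; b < Q; ++b) {
            int c = a ^ b;                            // GF(2^p) addition
            float v = m1[a] > m2[b] ? m1[a] : m2[b];  // max of the pair
            if (v < out[c]) out[c] = v;               // keep the minimum
        }
}
```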

Optimized Construction and Visualization of GPU-based Adaptive and Continuous Signed Distance Field, and Its Applications (GPU기반 적응형 및 연속적인 부호 거리장의 최적화된 구성과 시각화, 그리고 그 응용 사례)

  • Moon, Seong-Hyeok;Kim, Jong-Hyun
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2021.07a
    • /
    • pp.655-658
    • /
    • 2021
  • This paper proposes a method for quickly constructing and visualizing an optimized adaptive signed distance field using the GPU architecture. The quadtree is transferred efficiently to GPU memory, and each thread computes Euclidean distances to the triangles in parallel to find the shortest distance. We present an optimization technique that uses the GPU to rapidly compute an adaptive signed distance field from a 3D triangle mesh, and show that cross-section viewing, value lookup at a given position, real-time ray tracing, and collision handling can be performed quickly and efficiently. Moreover, because the proposed framework computes the signed distance field of even a high-polygon mesh in about one second, it is applicable not only to rigid bodies but also to deformable bodies.
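
As an illustration of the per-thread distance computation described above, here is a hedged CUDA sketch: one thread per sample point computes the exact point-to-triangle distance (the standard barycentric-region method from Ericson's Real-Time Collision Detection) against every triangle by brute force. The paper's quadtree pruning, sign determination, and memory layout are omitted, and all names are placeholders.

```
#include <cuda_runtime.h>
#include <math.h>

__device__ float3 sub3(float3 a, float3 b){ return make_float3(a.x-b.x, a.y-b.y, a.z-b.z); }
__device__ float dot3(float3 a, float3 b){ return a.x*b.x + a.y*b.y + a.z*b.z; }

// Distance from point p to triangle (a,b,c): classify p against the
// triangle's Voronoi regions via barycentric tests, then measure to
// the closest vertex, edge point, or interior point.
__device__ float pointTriDist(float3 p, float3 a, float3 b, float3 c) {
    float3 ab = sub3(b,a), ac = sub3(c,a), ap = sub3(p,a);
    float d1 = dot3(ab,ap), d2 = dot3(ac,ap);
    if (d1 <= 0.f && d2 <= 0.f) return sqrtf(dot3(ap,ap));          // vertex a
    float3 bp = sub3(p,b);
    float d3 = dot3(ab,bp), d4 = dot3(ac,bp);
    if (d3 >= 0.f && d4 <= d3) return sqrtf(dot3(bp,bp));           // vertex b
    float vc = d1*d4 - d3*d2;
    if (vc <= 0.f && d1 >= 0.f && d3 <= 0.f) {                      // edge ab
        float v = d1 / (d1 - d3);
        float3 d = make_float3(ap.x - v*ab.x, ap.y - v*ab.y, ap.z - v*ab.z);
        return sqrtf(dot3(d,d));
    }
    float3 cp = sub3(p,c);
    float d5 = dot3(ab,cp), d6 = dot3(ac,cp);
    if (d6 >= 0.f && d5 <= d6) return sqrtf(dot3(cp,cp));           // vertex c
    float vb = d5*d2 - d1*d6;
    if (vb <= 0.f && d2 >= 0.f && d6 <= 0.f) {                      // edge ac
        float w = d2 / (d2 - d6);
        float3 d = make_float3(ap.x - w*ac.x, ap.y - w*ac.y, ap.z - w*ac.z);
        return sqrtf(dot3(d,d));
    }
    float va = d3*d6 - d5*d4;
    if (va <= 0.f && (d4 - d3) >= 0.f && (d5 - d6) >= 0.f) {        // edge bc
        float w = (d4 - d3) / ((d4 - d3) + (d5 - d6));
        float3 bc = sub3(c,b);
        float3 d = make_float3(bp.x - w*bc.x, bp.y - w*bc.y, bp.z - w*bc.z);
        return sqrtf(dot3(d,d));
    }
    float denom = 1.f / (va + vb + vc);                             // interior
    float v = vb * denom, w = vc * denom;
    float3 q = make_float3(a.x + v*ab.x + w*ac.x, a.y + v*ab.y + w*ac.y,
                           a.z + v*ab.z + w*ac.z);
    float3 d = sub3(p,q);
    return sqrtf(dot3(d,d));
}

// One thread per sample point: brute-force scan over all triangles,
// stored as three consecutive float3 vertices per triangle.
__global__ void unsignedDistanceKernel(const float3* pts, int nPts,
                                       const float3* tris, int nTris,
                                       float* dist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPts) return;
    float best = 1e30f;
    for (int t = 0; t < nTris; ++t)
        best = fminf(best, pointTriDist(pts[i], tris[3*t], tris[3*t+1], tris[3*t+2]));
    dist[i] = best;
}
```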


Numerical Computing on Graphics Hardware

  • Ihm, In-Sung
    • Proceedings of the Korean Society of Visualization Conference
    • /
    • 2004.04a
    • /
    • pp.57-63
    • /
    • 2004
  • The performance of graphics accelerators from vendors such as ATI and NVIDIA, now installed in ordinary general-purpose PCs, has improved to a degree that bears no comparison with a few years ago. Along with this speedup, one of the most dramatic changes is the emergence of the programmable graphics pipeline, which, unlike the traditional fixed-function graphics pipeline, lets programmers freely program the accelerator's functionality. The GPU (Graphics Processing Unit) on these accelerators can be regarded as a simple SIMD processor; in particular, the pixel shader stage offers very high throughput, so there have been active attempts to parallelize existing numerical algorithms through it. This talk briefly surveys such attempts to solve various numerical computations using graphics accelerators.


Development of Diffusive Wave Rainfall-Runoff Model Based on CUDA FORTRAN (CUDA FORTRAN 기반 확산파 강우유출모형 개발)

  • Kim, Boram;Kim, Hyeong-Jun;Yoon, Kwang Seok
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2021.06a
    • /
    • pp.287-287
    • /
    • 2021
  • In this study, a diffusive wave rainfall-runoff model was developed using CUDA (Compute Unified Device Architecture) Fortran. CUDA Fortran is a GPGPU (General-Purpose computing on Graphics Processing Units) technology that allows parallel algorithms running on the graphics processing unit (GPU) to be written in the Fortran language. Because a GPU consists of many arithmetic logic units (ALUs) specialized for graphics workloads, it can perform more operations at once than a central processing unit (CPU). Accordingly, the CUDA Fortran-based diffusive wave model can shorten the computation time of numerical simulations with a distributed rainfall-runoff model. The governing equations of the distributed model consist of the diffusive wave equations and the Green-Ampt infiltration model, and the diffusive wave equations were discretized with the finite volume method. The accuracy of the CUDA Fortran-based diffusive wave model was verified against previously published hydraulic experiments and a CPU-based rainfall-runoff model, and its computational efficiency was compared with the CPU-based diffusive wave model. The CUDA Fortran-based model produced results similar to the hydraulic experiments and the CPU-based model, while its computation time was reduced by up to a factor of about 100 on the hardware used in this study.
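
To make the parallelization pattern concrete, here is a minimal sketch in CUDA C (the paper itself uses CUDA Fortran): one thread per grid cell performs an explicit finite-volume-style update of water depth from its four neighbors. The paper's actual diffusive wave fluxes and Green-Ampt coupling are not reproduced; the Laplacian-like balance below is a placeholder.

```
// Explicit time step of a diffusion-type update, one thread per cell.
__global__ void diffusiveWaveStep(const float* h, float* hNew,
                                  int nx, int ny, float dt, float dx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // skip boundary
    int c = j * nx + i;
    // Placeholder flux balance over the four cell faces (5-point stencil).
    float div = (h[c-1] + h[c+1] + h[c-nx] + h[c+nx] - 4.0f * h[c]) / (dx * dx);
    hNew[c] = h[c] + dt * div;   // explicit update of water depth
}
```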


Design of a Dispatch Unit & Operand Selection Unit for Improving the SIMT Based GP-GPU Instruction Performance (SIMT구조 GP-GPU의 명령어 처리 성능 향상을 위한 Dispatch Unit과 Operand Selection Unit설계)

  • Kwak, Jae Chang
    • Journal of IKEEE
    • /
    • v.19 no.3
    • /
    • pp.455-459
    • /
    • 2015
  • This paper proposes a dispatch unit for a GP-GPU with an SIMT architecture, supporting the acceleration of general-purpose computation as well as graphics processing. If all the information of the operands used by instructions issued from the warp scheduler is decoded, unnecessary operand loads occur, increasing the register load. To resolve this problem, this paper proposes a method that reduces operand loads and register pressure by decoding only the operand information that is actually needed, using a pre-decoding scheme. The operand information from the dispatch unit is passed to the operand selection unit while preventing register bank collisions, improving overall performance. In the simulation test, the total clock cycles required to process 10,000 arbitrary instructions issued from the warp scheduler were measured using ModelSim SE 10.0b. The results show that the dispatch unit equipped with the proposed pre-decoding function improves processing performance by about 12% compared to the conventional method.
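
Since the proposed design is hardware, the following CUDA C++ fragment is only a conceptual software analogy of the pre-decoding idea, under instruction encodings that are pure assumptions: decode a compact operand descriptor first, then read just the registers the instruction actually uses, so unnecessary operand loads (and the bank collisions they cause) are avoided.

```
#include <cstdint>

struct PreDecoded {
    uint8_t numSrcRegs;   // how many source registers are really needed
    uint8_t srcReg[3];    // register indices (at most 3 sources assumed)
};

// Pre-decode: extract only the operand information from the raw
// instruction word (field positions are illustrative assumptions).
PreDecoded preDecode(uint32_t inst) {
    PreDecoded d;
    d.numSrcRegs = (inst >> 30) & 0x3;
    for (int k = 0; k < 3; ++k)
        d.srcReg[k] = (inst >> (8 * k)) & 0xFF;
    return d;
}

// Operand selection: fetch only the needed registers, so reads of
// unused operands never reach the register banks.
void selectOperands(const PreDecoded& d, const uint32_t* regFile,
                    uint32_t* operands) {
    for (int k = 0; k < d.numSrcRegs; ++k)
        operands[k] = regFile[d.srcReg[k]];
}
```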

Real-time Ray-tracing Chip Architecture

  • Yoon, Hyung-Min;Lee, Byoung-Ok;Cheong, Cheol-Ho;Hur, Jin-Suk;Kim, Sang-Gon;Chung, Woo-Nam;Lee, Yong-Ho;Park, Woo-Chan
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.4 no.2
    • /
    • pp.65-70
    • /
    • 2015
  • In this paper, we describe the world's first real-time ray-tracing chip architecture. Ray-tracing technology generates higher-quality 3D graphics images than current rasterization technology by providing four essential light effects: shadow, reflection, refraction, and transmission. The real-time ray-tracing chip, named RayChip, includes a real-time ray-tracing graphics processing unit and an accelerating tree-building unit. An ARM Ltd. central processing unit (CPU) and other peripherals are also included to support all processes of 3D graphics applications. By using the accelerating tree-building unit, named RayTree, to minimize the CPU load, the chip can employ a low-end CPU, decreasing both silicon area and power consumption. Evaluation shows that RayChip delivers performance adequate for real-time ray tracing at high-definition (HD) resolution, with the rendered images scaled to full-HD resolution. The chip also integrates the Linux operating system and the familiar OpenGL for Embedded Systems application programming interface for easy application development.

Accelerating Numerical Analysis of Reynolds Equation Using Graphic Processing Units (그래픽처리장치를 이용한 레이놀즈 방정식의 수치 해석 가속화)

  • Myung, Hun-Joo;Kang, Ji-Hoon;Oh, Kwang-Jin
    • Tribology and Lubricants
    • /
    • v.28 no.4
    • /
    • pp.160-166
    • /
    • 2012
  • This paper presents a Reynolds equation solver for hydrostatic gas bearings, implemented to run on graphics processing units (GPUs). The original analysis code for the central processing unit (CPU) was ported to the GPU using the compute unified device architecture (CUDA). The red-black Gauss-Seidel (RBGS) algorithm was employed in place of the original Gauss-Seidel algorithm for the iterative pressure solver, because the latter has data dependencies between neighboring nodes. The GPU program was tested on an NVIDIA GTX 580 system and compared to the original CPU program on an AMD Llano system. In the iterative pressure calculation, the GPU program ran 20-100 times faster than the original CPU code. Comparing wall-clock times including all pre- and post-processing code, the GPU code still delivered 4-12 times faster performance than the CPU code for our target problem.
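
A minimal CUDA sketch of the red-black Gauss-Seidel half-sweep follows; a 5-point Laplacian stands in for the discretized Reynolds equation, whose actual coefficients depend on the bearing geometry and are not given in the abstract. Because same-colored cells share no neighbors, each half-sweep is fully data-parallel, which is exactly why RBGS replaces plain Gauss-Seidel on the GPU.

```
// One RBGS half-sweep: update only cells whose (i + j) parity matches
// `color`. Host code launches color 0 (red) then color 1 (black) per
// iteration; dx2 is the squared grid spacing.
__global__ void rbgsSweep(float* p, const float* rhs,
                          int nx, int ny, float dx2, int color) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;   // one color per kernel launch
    int c = j * nx + i;
    // Gauss-Seidel update of the 5-point Laplace stencil with source rhs.
    p[c] = 0.25f * (p[c-1] + p[c+1] + p[c-nx] + p[c+nx] - dx2 * rhs[c]);
}
```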

GPU-Based ECC Decode Unit for Efficient Massive Data Reception Acceleration

  • Kwon, Jisu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems
    • /
    • v.16 no.6
    • /
    • pp.1359-1371
    • /
    • 2020
  • When a device transmits and receives large amounts of data, reliable communication is crucial for normal operation and for preventing the abnormal behavior that errors cause. This paper therefore assumes an environment in which massive data is received sequentially and protected by an error correction code (ECC) that can detect and correct errors by itself. Because an embedded system has limited resources, such as a low-performance processor or a small memory, it requires efficient operation of applications. We propose accelerating ECC decoding with the graphics processing unit (GPU) built into the embedded system when receiving a large amount of data. In the matrix-vector multiplication underlying the Hamming code used for the ECC operation, the matrix is expressed in compressed sparse row (CSR) format and a sparse matrix-vector product is used. The multiplication is performed in a GPU kernel, and the Hamming-code computation is likewise accelerated so that the ECC operation can run in parallel. The proposed technique is implemented with CUDA on a GPU-embedded target board, the NVIDIA Jetson TX2, and its execution time is compared with that of the CPU.
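
As a hedged sketch of the operation described above, the CUDA kernel below computes a CSR sparse matrix-vector product over GF(2), one thread per row of the parity-check matrix H; since the stored entries of a binary H are all ones, each product reduces to XORing selected bits of the received vector. The paper's surrounding decode steps and exact data layout are assumptions.

```
// CSR SpMV over GF(2): syndrome = H * r (mod 2), one thread per row.
__global__ void csrSpmvGf2(const int* rowPtr, const int* colIdx,
                           const unsigned char* r, unsigned char* syndrome,
                           int nRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    unsigned char acc = 0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        acc ^= r[colIdx[k]];   // stored entries of binary H are all 1,
                               // so each term is just a selected bit
    syndrome[row] = acc;       // nonzero syndrome bits locate the error
}
```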

KAWS: Coordinate Kernel-Aware Warp Scheduling and Warp Sharing Mechanism for Advanced GPUs

  • Vo, Viet Tan;Kim, Cheol Hong
    • Journal of Information Processing Systems
    • /
    • v.17 no.6
    • /
    • pp.1157-1169
    • /
    • 2021
  • Modern graphics processing unit (GPU) architectures offer significantly enhanced hardware resources for parallel computing, yet without software optimization GPUs continually underutilize those resources. In this paper, we show the need to alter the warp-scheduling scheme during different kernel execution periods to improve resource utilization: existing warp schedulers are unaware of kernel progress and so cannot provide an effective scheduling policy throughout execution. We also identify the potential to improve resource utilization on GPUs with multiple warp schedulers by sharing stalling warps with selected warp schedulers. To address these efficiency issues, we coordinate a kernel-aware warp scheduler with a warp-sharing mechanism (KAWS). The proposed warp scheduler tracks the execution progress of the running kernel and adapts to a more effective scheduling policy when progress reaches a point of resource underutilization, while the warp-sharing mechanism distributes stalling warps to warp schedulers whose execution pipeline units are ready. Our design performs on average 7.97% better than the traditional warp scheduler and incurs only marginal additional hardware overhead.
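
The proposal is a hardware scheduler, so the following CUDA C++ fragment is only a conceptual sketch of the two coordinated ideas, with assumed policy names (GTO and loose round-robin) standing in for whatever policies the paper evaluates: switch the scheduling policy once kernel progress passes an underutilization point, and hand stalling warps to a scheduler whose pipeline can issue.

```
#include <vector>

enum Policy { GTO, LRR };   // assumed pair of scheduling policies

struct Warp { int id; bool stalled; };

struct Scheduler {
    Policy policy = GTO;
    std::vector<Warp> warps;
    bool pipelineReady = true;
};

// Kernel-aware switch: adapt the policy by execution progress (0..1).
void kernelAwareUpdate(Scheduler& s, double progress, double threshold) {
    s.policy = (progress >= threshold) ? LRR : GTO;
}

// Warp sharing: reassign a stalling warp to a scheduler that can issue.
void shareStalledWarps(std::vector<Scheduler>& scheds) {
    for (auto& src : scheds)
        for (auto& w : src.warps) {
            if (!w.stalled) continue;
            for (auto& dst : scheds) {
                if (&dst == &src || !dst.pipelineReady) continue;
                dst.warps.push_back(w);   // hand the warp to a ready scheduler
                w.stalled = false;        // source-side bookkeeping elided
                break;
            }
        }
}
```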

Parallel Range Query processing on R-tree with Graphics Processing Units (GPU를 이용한 R-tree에서의 범위 질의의 병렬 처리)

  • Yu, Bo-Seon;Kim, Hyun-Duk;Choi, Won-Ik;Kwon, Dong-Seop
    • Journal of Korea Multimedia Society
    • /
    • v.14 no.5
    • /
    • pp.669-680
    • /
    • 2011
  • R-trees are widely used in areas such as geographical information systems, CAD systems, and spatial databases to efficiently index multi-dimensional data. As the data sets used in these areas grow in size and complexity, however, range query operations on R-trees must become still faster to meet area-specific constraints. To address this problem, various research efforts have developed strategies for accelerating query processing on R-trees, using buffer mechanisms or parallelizing query processing across multiple disks and processors. As part of these strategies, approaches that parallelize R-tree query processing on graphics processing units (GPUs) have been explored. GPUs can improve performance through faster calculation and fewer disk accesses, but they can also add overhead from high memory access latencies and the low data-exchange rate between the GPU and the CPU. In this paper, to address this overhead and use the GPU efficiently, we propose a novel approach that uses the GPU as a buffer to parallelize query processing on the R-tree. The buffer algorithm improves performance by reducing the number of disk accesses and maximizing coalesced memory access, thereby minimizing GPU memory access latencies. Extensive performance studies show that the proposed approach achieves up to 5 times higher query performance than the original CPU-based R-trees.
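
For illustration, a hedged CUDA sketch of the data-parallel leaf test is given below, under assumptions the abstract does not spell out: the buffered R-tree leaf entries are laid out as structure-of-arrays rectangles (which enables the coalesced memory access the paper emphasizes), with one thread per entry testing overlap against the query window.

```
struct Rect { float xmin, ymin, xmax, ymax; };

// One thread per buffered leaf entry: axis-aligned overlap test against
// the query rectangle q. SoA layout gives coalesced loads per warp.
__global__ void rangeQuery(const float* xmin, const float* ymin,
                           const float* xmax, const float* ymax,
                           int n, Rect q, int* hit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool overlap = xmin[i] <= q.xmax && xmax[i] >= q.xmin &&
                   ymin[i] <= q.ymax && ymax[i] >= q.ymin;
    hit[i] = overlap ? 1 : 0;   // compact the hit flags on the host
                                // or with a prefix scan on the device
}
```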