• Title/Summary/Keyword: Graphics Processing Units

Search Result 85, Processing Time 0.025 seconds

Parallel LDPC Decoding on a Heterogeneous Platform using OpenCL

  • Hong, Jung-Hyun;Park, Joo-Yul;Chung, Ki-Seok
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.6
    • /
    • pp.2648-2668
    • /
    • 2016
  • Modern mobile devices are equipped with various accelerated processing units to handle computationally intensive applications; therefore, Open Computing Language (OpenCL) has been proposed to fully take advantage of the computational power in heterogeneous systems. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes on an embedded heterogeneous platform using an OpenCL framework. The LDPC code is one of the most popular and strongest error correcting codes for mobile communication systems. Each step of LDPC decoding has different parallelization characteristics. In the proposed LDPC decoder, steps suitable for task-level parallelization are executed on the multi-core central processing unit (CPU), and steps suitable for data-level parallelization are processed by the graphics processing unit (GPU). To improve the performance of OpenCL kernels for LDPC decoding operations, explicit thread scheduling, vectorization, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance and high power efficiency by using heterogeneous multi-core processors on a unified computing framework.

Parallel Computation of FDTD algorithm using CUDA (CUDA를 이용한 FDTD 알고리즘의 병렬처리)

  • Lee, Ho-Young;Park, Jong-Hyun;Kim, Jun-Seong
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.47 no.4
    • /
    • pp.82-87
    • /
    • 2010
  • Modern GPUs(Graphic Processing Units) provide computing capability higher than that of the general CPUs(Central Processor Units). With supports of programmability of graphics pipeline GP-GPU(General Purpose computation on GPU) has gained much attention expanding its application area. This paper compares sequential and massively parallel implementations of FDTD(Finite Difference Time Domain) algorithm using CUDA(Compute Unified Device Architecture). Experimental results show upto 45X speedup over conventional CPU execution.

New Two-Level L1 Data Cache Bypassing Technique for High Performance GPUs

  • Kim, Gwang Bok;Kim, Cheol Hong
    • Journal of Information Processing Systems
    • /
    • v.17 no.1
    • /
    • pp.51-62
    • /
    • 2021
  • On-chip caches of graphics processing units (GPUs) have contributed to improved GPU performance by reducing long memory access latency. However, cache efficiency remains low despite the facts that recent GPUs have considerably mitigated the bottleneck problem of L1 data cache. Although the cache miss rate is a reasonable metric for cache efficiency, it is not necessarily proportional to GPU performance. In this study, we introduce a second key determinant to overcome the problem of predicting the performance gains from L1 data cache based on the assumption that miss rate only is not accurate. The proposed technique estimates the benefits of the cache by measuring the balance between cache efficiency and throughput. The throughput of the cache is predicted based on the warp occupancy information in the warp pool. Then, the warp occupancy is used for a second bypass phase when workloads show an ambiguous miss rate. In our proposed architecture, the L1 data cache is turned off for a long period when the warp occupancy is not high. Our two-level bypassing technique can be applied to recent GPU models and improves the performance by 6% on average compared to the architecture without bypassing. Moreover, it outperforms the conventional bottleneck-based bypassing techniques.

Development of Diffusive Wave Rainfall-Runoff Model Based on CUDA FORTRAN (CUDA FORTEAN기반 확산파 강우유출모형 개발)

  • Kim, Boram;Kim, Hyeong-Jun;Yoon, Kwang Seok
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2021.06a
    • /
    • pp.287-287
    • /
    • 2021
  • 본 연구에서는 CUDA(Compute Unified Device Architecture) 포트란을 이용하여 확산파 강우 유출모형을 개발하였다. CUDA 포트란은 그래픽 처리 장치(Graphic Processing Unit: GPU)에서 수행하는 병렬 연산 알고리즘을 포트란 언어를 사용하여 작성할 수 있도록 하는 GPU상의 범용계산(General-Purpose Computing on Graphics Processing Units: GPGPU) 기술이다. GPU는 그래픽 처리 작업에 특화된 다수의 산술 논리 장치(Arithmetic Logic Unit: ALU)로 구성되어 있어서 중앙 처리 장치(Central Processing Unit: CPU)보다 한 번에 더 많은 연산 수행이 가능하다. 이에 따라, CUDA 포트란기반 확산파모형은 분포형 강우유출모형의 수치모의 연산시간을 단축시킬 수 있다. 분포형모형의 지배방정식은 확산파모형과 Green-Ampt모형으로 구성되었고, 확산파모형은 유한체적법을 이용하여 이산화 하였다. CUDA 포트란기반 확산파모형의 정확성은 기존 연구된 수리실험 결과 및 CPU기반 강우유출모형과 비교하였으며, 연산소요시간에 대한 효율성은 CPU기반 확산파모형과 비교하였다. 그 결과 CUDA 포트란기반 확산파모형의 결과는 수리실험 결과 및 CPU기반 강우유출모형의 결과와 유사한 결과를 나타냈다. 또한, 연산소요시간은 CPU 기반 확산파모형의 연산소요시간보다 단축되었으며, 본 연구에 사용된 장비를 기준으로 최대 100배 정도 단축되었다.

  • PDF

Development and Validation of the GPU-based 3D Dynamic Analysis Code for Simulating Rock Fracturing Subjected to Impact Loading (충격 하중 시 암석의 파괴거동해석을 위한 GPGPU 기반 3차원 동적해석기법의 개발과 검증 연구)

  • Min, Gyeong-Jo;Fukuda, Daisuke;Oh, Se-Wook;Cho, Sang-Ho
    • Explosives and Blasting
    • /
    • v.39 no.2
    • /
    • pp.1-14
    • /
    • 2021
  • Recently, with the development of high-performance processing devices such as GPGPU, a three-dimensional dynamic analysis technique that can replace expensive rock material impact tests has been actively developed in the defense and aerospace fields. Experimentally observing or measuring fracture processes occurring in rocks subjected to high impact loads, such as blasting and earth penetration of small-diameter missiles, are difficult due to the inhomogeneity and opacity of rock materials. In this study, a three-dimensional dynamic fracture process analysis technique (3D-DFPA) was developed to simulate the fracture behavior of rocks due to impact. In order to improve the operation speed, an algorithm capable of GPGPU operation was developed for explicit analysis and contact element search. To verify the proposed dynamic fracture process analysis technique, the dynamic fracture toughness tests of the Straight Notched Disk Bending (SNDB) limestone samples were simulated and the propagation of the reflection and transmission of the stress waves at the rock-impact bar interfaces and the fracture process of the rock samples were compared. The dynamic load tests for the SNDB sample applied a Pulse Shape controlled Split Hopkinson presure bar (PS-SHPB) that can control the waveform of the incident stress wave, the stress state, and the fracture process of the rock models were analyzed with experimental results.

Analysis of GPU Performance and Memory Efficiency according to Task Processing Units (작업 처리 단위 변화에 따른 GPU 성능과 메모리 접근 시간의 관계 분석)

  • Son, Dong Oh;Sim, Gyu Yeon;Kim, Cheol Hong
    • Smart Media Journal
    • /
    • v.4 no.4
    • /
    • pp.56-63
    • /
    • 2015
  • Modern GPU can execute mass parallel computation by exploiting many GPU core. GPGPU architecture, which is one of approaches exploiting outstanding computational resources on GPU, executes general-purpose applications as well as graphics applications, effectively. In this paper, we investigate the impact of memory-efficiency and performance according to number of CTAs(Cooperative Thread Array) on a SM(Streaming Multiprocessors), since the analysis of relation between number of CTA on a SM and them provides inspiration for researchers who study the GPU to improve the performance. Our simulation results show that almost benchmarks increasing the number of CTAs on a SM improve the performance. On the other hand, some benchmarks cannot provide performance improvement. This is because the number of CTAs generated from same kernel is a little or the number of CTAs executed simultaneously is not enough. To precisely classify the analysis of performance according to number of CTA on a SM, we also analyze the relations between performance and memory stall, dram stall due to the interconnect congestion, pipeline stall at the memory stage. We expect that our analysis results help the study to improve the parallelism and memory-efficiency on GPGPU architecture.

Warp-Based Load/Store Reordering to Improve GPU Time Predictability

  • Huangfu, Yijie;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.2
    • /
    • pp.58-68
    • /
    • 2017
  • While graphics processing units (GPUs) can be used to improve the performance of real-time embedded applications that require high throughput, it is challenging to estimate the worst-case execution time (WCET) of GPU programs, because modern GPUs are designed for improving the average-case performance rather than time predictability. In this paper, a reordering framework is proposed to regulate the access to the GPU data cache, which helps to improve the accuracy of the estimation of GPU L1 data cache miss rate with low performance overhead. Also, with the improved cache miss rate estimation, tighter WCET estimations can be achieved for GPU programs.

Memory-Efficient Belief Propagation for Stereo Matching on GPU (GPU 에서의 고속 스테레오 정합을 위한 메모리 효율적인 Belief Propagation)

  • Choi, Young-Kyu;Williem, Williem;Park, In Kyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2012.11a
    • /
    • pp.52-53
    • /
    • 2012
  • Belief propagation (BP) is a commonly used global energy minimization algorithm for solving stereo matching problem in 3D reconstruction. However, it requires large memory bandwidth and data size. In this paper, we propose a novel memory-efficient algorithm of BP in stereo matching on the Graphics Processing Units (GPU). The data size and transfer bandwidth are significantly reduced by storing only a part of the whole message. In order to maintain the accuracy of the matching result, the local messages are reconstructed using shared memory available in GPU. Experimental result shows that there is almost an order of reduction in the global memory consumption, and 21 to 46% saving in memory bandwidth when compared to the conventional algorithm. The implementation result on a recent GPU shows that we can obtain 22.8 times speedup in execution time compared to the execution on CPU.

  • PDF

Spectral Modeling Synthesis of Haegeum using GPU (GPU를 이용한 해금의 스펙트럼 모델링)

  • Islam, Md Shohidul;Islam, Md Rashedul;Farid, Fahmid Al;Kim, Jong-Myon
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2014.01a
    • /
    • pp.5-8
    • /
    • 2014
  • This paper presents a parallel approach of formant synthesis method for haegeum on graphics processing units (GPU) using spectral modeling. Spectral modeling synthesis (SMS) is a technique that models time-varying spectra as a combination of sinusoids and a time-varying filtered noise component. A second-order digital resonator by the impulse-invariant transform (IIT) is applied to generate deterministic components and the results are band-pass filtered to adjust magnitude. The noise is calculated by first generating the sinusoids with formant synthesis, subtracting them from the original sound, and then removing some harmonics remained. The synthesized sounds are consequently by adding sinusoids, which are shown to be similar to the original Haegeum sounds. Furthermore, GPU accelerates the synthesis process enabling- real time music synthesis system development, supporting more sound effect, and multiple musical sound compositions.

  • PDF

GPU based Sound Synthesis of Guitar using Physical Modeling (물리적 모델링을 이용한 GPU 기반 기타 음 합성)

  • Kang, Seong-Mo;Kim, Cheol-Hong;Kim, Jong-Myon
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2012.07a
    • /
    • pp.1-2
    • /
    • 2012
  • 본 논문에서는 GPU 컴퓨팅 환경에서 물리적 모델링 기반의 음 합성 알고리즘을 수행하는 경우에 GPU의 개수에 따른 성능 및 에너지 효율의 변화를 분석한다. 실험결과, 6개의 GPU를 사용하였을 때 가장 좋은 성능을 보였으며, 1개의 GPU에서 가장 높은 에너지 효율을 보였다.

  • PDF