• Title/Abstract/Keywords: GPU algorithm


High-Performance Multi-GPU Rendering Based on Implicit Synchronization

  • 김영욱;이성길
    • 정보과학회 논문지 / Vol. 42, No. 11 / pp. 1332-1338 / 2015
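  • Interest in multi-GPU rendering has recently grown as a way to support high-quality, ultra-high-resolution real-time rendering. To achieve high performance with multiple GPUs in real-time rendering, the data transfer latency between GPUs and the frame compositing overhead must be taken into account. To minimize this overhead and improve multi-GPU efficiency, this paper proposes a technique that improves the synchronization of split frame rendering by basing it on implicit queries. It also proposes a message-queue-based rendering scheduling algorithm that supports frame compositing under this implicit synchronization. Experiments confirmed that the proposed algorithm improves efficiency by more than 200% compared with the existing algorithm.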

GPU-Based Parallel Collision Detection for Deformable Objects

  • 성낙준;김민상;홍민;최유주
    • 정보처리학회논문지:소프트웨어 및 데이터공학 / Vol. 7, No. 1 / pp. 25-32 / 2018
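  • Deformable-object simulation requires far more computation than rigid-body simulation, so it needs an effective collision detection method. However, a CPU-based collision detection algorithm applied unchanged to a GPU cannot make full use of the GPU's performance, so collision detection algorithms and data structures optimized for the GPU are required. This study therefore proposes a GPU-based parallel collision detection algorithm for the mass-spring system, which is widely used to represent deformable objects. The proposed method uses a parallel algorithm and data structure that reduce collision detection cost through a GPU-based culling algorithm built on an AABB-octree structure. The efficiency of the proposed algorithm was demonstrated through comparison experiments against an existing method that tests all triangle pairs for collision in parallel. In these experiments, the proposed method showed an average performance improvement of about 24% over the existing method. The proposed method is therefore expected to improve the performance of real-time simulation of deformable objects.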
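
The culling idea in the abstract above can be illustrated with a minimal, serial Python sketch. It only shows the axis-aligned bounding box (AABB) overlap test used to discard triangle pairs that cannot collide; the octree hierarchy and the GPU parallelization described by the authors are not reproduced here. The function names (`triangle_aabb`, `aabbs_overlap`, `candidate_pairs`) and the sample triangles are hypothetical, chosen for illustration only.

```python
from itertools import combinations

def triangle_aabb(tri):
    """Return (min_corner, max_corner) of a triangle given as three (x, y, z) points."""
    xs, ys, zs = zip(*tri)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def aabbs_overlap(a, b):
    """Two boxes overlap only if their intervals overlap on every axis."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))

def candidate_pairs(triangles):
    """Broad phase: keep only the index pairs whose bounding boxes overlap."""
    boxes = [triangle_aabb(t) for t in triangles]
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)
            if aabbs_overlap(boxes[i], boxes[j])]

# Hypothetical input: two nearby triangles and one far away.
triangles = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((0.5, 0.5, -1), (0.5, 0.5, 1), (1.5, 0.5, 0)),
    ((5, 5, 5), (6, 5, 5), (5, 6, 5)),
]
pairs = candidate_pairs(triangles)
```

Only the pairs that survive this test would need an exact triangle-triangle intersection check. A spatial hierarchy such as an octree would further avoid enumerating every pair in the first place.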

Large-scale 3D fast Fourier transform computation on a GPU

  • Jaehong Lee;Duksu Kim
    • ETRI Journal / Vol. 45, No. 6 / pp. 1035-1045 / 2023
  • We propose a novel graphics processing unit (GPU) algorithm that can handle a large-scale 3D fast Fourier transform (i.e., 3D-FFT) problem whose data size is larger than the GPU's memory. A 1D FFT-based 3D-FFT computational approach is used to solve the limited device memory issue. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are communicated between the host and device memories efficiently through the pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with the state-of-the-art multicore CPU FFT library (i.e., fastest Fourier transform in the West [FFTW]) and a prior GPU-based 3D-FFT algorithm. Our method achieves higher performance (up to 2.89 times) than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably in massive-scale problems, whereas our method's performance is stable.
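
The 1D FFT-based approach mentioned in the abstract above rests on the separability of the multidimensional discrete Fourier transform: a 3D transform equals 1D transforms applied along each axis in turn. The following pure-Python sketch checks that property on a tiny volume, using a naive O(n^2) DFT instead of a real FFT library. It does not reproduce the authors' transposition, pinned buffers, or streams; all function names and the sample volume are hypothetical.

```python
import cmath

def dft_1d(x):
    """Naive 1D discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft_3d_by_axes(vol):
    """3D DFT computed as batches of independent 1D DFTs along z, then y, then x."""
    nx, ny, nz = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[complex(v) for v in row] for row in plane] for plane in vol]
    for i in range(nx):                      # z axis: rows are already contiguous
        for j in range(ny):
            out[i][j] = dft_1d(out[i][j])
    for i in range(nx):                      # y axis: gather a strided vector
        for k in range(nz):
            col = dft_1d([out[i][j][k] for j in range(ny)])
            for j in range(ny):
                out[i][j][k] = col[j]
    for j in range(ny):                      # x axis: gather a strided vector
        for k in range(nz):
            col = dft_1d([out[i][j][k] for i in range(nx)])
            for i in range(nx):
                out[i][j][k] = col[i]
    return out

def dft_3d_direct(vol):
    """Reference: the 3D DFT evaluated straight from its definition."""
    nx, ny, nz = len(vol), len(vol[0]), len(vol[0][0])
    return [[[sum(vol[a][b][c]
                  * cmath.exp(-2j * cmath.pi * (u * a / nx + v * b / ny + w * c / nz))
                  for a in range(nx) for b in range(ny) for c in range(nz))
              for w in range(nz)] for v in range(ny)] for u in range(nx)]

# Hypothetical 2 x 3 x 4 test volume.
vol = [[[float(a * 12 + b * 4 + c) for c in range(4)] for b in range(3)] for a in range(2)]
by_axes = dft_3d_by_axes(vol)
direct = dft_3d_direct(vol)
max_err = max(abs(by_axes[u][v][w] - direct[u][v][w])
              for u in range(2) for v in range(3) for w in range(4))
```

Because every 1D vector in a pass is independent of the others, a pass can be split into batches that each fit in device memory. The strided gathers along y and x also hint at why a transposition into a contiguous layout helps transfer efficiency.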

The GPU-based Parallel Processing Algorithm for Fast Inspection of Semiconductor Wafers

  • 박영대;김준식;주효남
    • 제어로봇시스템학회논문지 / Vol. 19, No. 12 / pp. 1072-1080 / 2013
  • At present, many vision inspection techniques are used in industrial production. In the semiconductor industry in particular, the vision inspection system for wafers is very important. Inspection techniques for semiconductor wafer production must also deliver both high precision and high speed. To achieve these objectives, parallel processing of the inspection algorithm is essential. In this paper, we propose a GPU (Graphics Processing Unit)-based parallel processing algorithm for the fast inspection of semiconductor wafers. The proposed algorithm is implemented on NVIDIA GPU boards. The defect detection performance of the GPU implementation is the same as that of a single-CPU implementation, but the proposed method runs about 210 times faster than the single-CPU version.

A Parallel Processing of Finding Neighbor Agents in Flocking Behaviors Using GPU

  • 이재문
    • 한국게임학회 논문지 / Vol. 10, No. 5 / pp. 95-102 / 2010
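  • This paper proposes a parallel algorithm for flocking on a GPU. CUDA was used as the GPU parallel processing architecture, and its characteristics and limitations were analyzed. Based on that analysis, performance was improved by parallelizing the search for neighbor agents, which is the most expensive step in flocking. The proposed algorithm was implemented on a GTX 285, and its performance was compared experimentally with that of an existing spatial partitioning algorithm. The comparison showed that the proposed algorithm is up to about 9 times faster in terms of execution time.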
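
To make the neighbor-search step in the abstract above concrete, here is a minimal serial Python sketch of a brute-force radius query. It is not the paper's algorithm, whose details the abstract does not give; it only shows that each agent's query reads shared positions and writes its own result, which is the kind of independence a one-thread-per-agent GPU mapping can exploit. The function name `neighbors_of`, the 2D positions, and the radius are hypothetical.

```python
import math

def neighbors_of(i, positions, radius):
    """Indices of all other agents within `radius` of agent i (brute force)."""
    xi, yi = positions[i]
    return [j for j, (xj, yj) in enumerate(positions)
            if j != i and math.hypot(xj - xi, yj - yi) <= radius]

# Hypothetical agent positions in 2D.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]

# Each iteration is independent of the others, so the loop body could run
# as one parallel work item per agent.
neighbor_lists = [neighbors_of(i, positions, 1.5) for i in range(len(positions))]
```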

GPU Acceleration of Range Doppler Algorithm for Real-Time SAR Image Generation

  • 정동민;이우경;이명진;정윤호
    • 전기전자학회논문지 / Vol. 27, No. 3 / pp. 265-272 / 2023
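  • In this paper, GPU-accelerated kernels for the RDA (Range Doppler Algorithm) were developed for real-time image formation based on FMCW (Frequency Modulated Continuous Wave) SAR (Synthetic Aperture Radar). Pinned memory was used to minimize the data transfer time between the host and the GPU device, and the kernels were organized so that all RDA operations run on the GPU, which minimizes the number of data transfers. A dataset was acquired through an FMCW drone SAR experiment, and the GPU acceleration was measured on a system with an Intel i7-9700K CPU, 32GB RAM, and an Nvidia RTX 3090 GPU. When the host-device data transfer time was included, a speedup of up to 3.41 times over the CPU was measured. When the data transfer time was excluded and only the computation was measured, a speedup of up to 156 times was confirmed.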

GPU-based Stereo Matching Algorithm with the Strategy of Population-based Incremental Learning

  • Nie, Dong-Hu;Han, Kyu-Phil;Lee, Heng-Suk
    • Journal of Information Processing Systems / Vol. 5, No. 2 / pp. 105-116 / 2009
  • To solve the general problems surrounding the application of genetic algorithms in stereo matching, two measures are proposed. Firstly, the strategy of simplified population-based incremental learning (PBIL) is adopted to reduce the problems with memory consumption and search inefficiency, and a scheme for controlling the distance of neighbors for disparity smoothness is inserted to obtain a wide-area consistency of disparities. In addition, an alternative version of the proposed algorithm, without the use of a probability vector, is also presented for simpler set-ups. Secondly, programmable graphics hardware (GPU) consists of multiple multiprocessors and offers powerful parallelism at low cost. Therefore, in order to decrease the running time further, a model of the proposed algorithm that can be run on programmable graphics hardware (GPU) is presented for the first time. The algorithms are implemented on the CPU as well as on the GPU and are evaluated by experiments. The experimental results show that the proposed algorithm offers better performance than traditional BMA methods with a deliberate relaxation and its modified version in terms of both running speed and stability. A comparison of computation times on the GPU and the CPU shows that the GPU's speed-up over the CPU grows as the image size increases.

GPU Algorithm for Outer Boundaries of a Triangle Set

  • 경민호
    • 한국CDE학회논문집 / Vol. 17, No. 4 / pp. 262-273 / 2012
  • We present a novel GPU algorithm to compute outer cell boundaries of a 3D arrangement subdivided by a given set of triangles. An outer cell boundary is defined as a 2-manifold surface consisting of subdivided polygons facing outward. Many geometric problems, such as Minkowski sums, sweep volumes, lower/upper envelopes, and Boolean operations, can be reduced to finding outer cell boundaries with specific properties. Computing outer cell boundaries, however, is very time-consuming and is also susceptible to numerical errors. To address these problems, we develop a GPU-based algorithm with a robust scheme combining interval arithmetic and multi-level precision. The proposed algorithm is tested on Minkowski sums of several polygonal models and shows a 5-20 times speedup over an existing algorithm running on the CPU.

Matrix Addition & Scalar Multiplication on the GPU

  • 박상근
    • 융복합기술연구소 논문집 / Vol. 8, No. 1 / pp. 15-20 / 2018
  • GPUs have recently acquired the programmability to perform general-purpose computation quickly by running thousands of threads concurrently. This paper presents a parallel GPU computation algorithm for dense matrix-matrix addition and scalar multiplication using the OpenGL compute shader. These operations can serve as fundamental building blocks for many high-performance computing applications. Experimental results on an NVIDIA Quadro 4000 show that the proposed algorithm runs 21 times faster than the CPU algorithm and achieves 16 GFLOPS in single precision for dense matrices of size 4,096. This performance suggests that the algorithm is practical for real applications.
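
Matrix addition and scalar multiplication, as described in the abstract above, are element-wise operations, so each output element can be computed by its own invocation. The Python sketch below emulates, serially, how a compute-style dispatch covers a flat array with fixed-size work groups and a bounds check. It is an illustration under stated assumptions, not the paper's shader: `dispatch`, `add_kernel`, `scale_kernel`, the work-group size of 4, and the sample data are all hypothetical.

```python
def dispatch(kernel, n_elements, local_size, *args):
    """Serially emulate a 1D dispatch: ceil(n / local_size) groups of local_size invocations."""
    num_groups = (n_elements + local_size - 1) // local_size
    for group in range(num_groups):
        for local in range(local_size):
            gid = group * local_size + local   # global invocation index
            if gid < n_elements:               # the last group may be partially filled
                kernel(gid, *args)

def add_kernel(gid, a, b, out):
    """One invocation adds one pair of elements."""
    out[gid] = a[gid] + b[gid]

def scale_kernel(gid, alpha, a, out):
    """One invocation scales one element."""
    out[gid] = alpha * a[gid]

# Hypothetical 2 x 3 matrices stored row-major in flat lists.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
total = [0.0] * len(a)
scaled = [0.0] * len(a)

dispatch(add_kernel, len(a), 4, a, b, total)
dispatch(scale_kernel, len(a), 4, 2.0, a, scaled)
```

With 6 elements and a group size of 4, two groups are launched and the bounds check skips the two unused invocations in the second group.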

Accurate and efficient GPU ray-casting algorithm for volume rendering of unstructured grid data

  • Gu, Gibeom;Kim, Duksu
    • ETRI Journal / Vol. 42, No. 4 / pp. 608-618 / 2020
  • We present a novel GPU-based ray-casting algorithm for volume rendering of unstructured grid data. Our volume rendering system uses a ray-casting method that guarantees accurate rendering results. We also employ the per-pixel intersection list concept in the Bunyk algorithm to guarantee an accurate result for non-convex meshes. For efficient memory access for the lists on the GPU, we represent the intersection lists for all faces as an array with our novel construction algorithm. With the intersection lists, we perform ray-casting on a GPU, and a GPU thread handles each ray. To increase ray-coherency in a thread block and improve memory access efficiency, we extend a prior image-tile-based work distribution method to fit modern GPU architectures. We also show that a prior approach using a per-thread local buffer to reduce redundant computation is not appropriate for modern GPU architectures. Instead, we take an on-demand calculation strategy that achieves better performance even though it allows duplicate computations. We applied our method to three unstructured grid datasets with different characteristics. With a GPU, our method achieved up to 36.5 times higher performance for the ray-casting process and 19.7 times higher performance for the whole volume rendering process compared with the Bunyk algorithm using a CPU core. Also, our approach showed up to 8.2 times higher performance than a GPU-based cell projection method while generating more accurate rendering results. These results demonstrate the efficiency and accuracy of our method.