• Title/Summary/Keyword: GPU 병렬처리

Search Result 246, Processing Time 0.032 seconds

Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm (병렬 Shifted Sort 알고리즘의 Warp 단위 CUDA 구현 최적화)

  • Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.18 no.4
    • /
    • pp.739-745
    • /
    • 2017
  • This paper presents and discusses an implementation of the GPU shifted sorting method to find approximate k nearest neighbors which executes within "warp", the minimum execution unit in GPU parallel architecture. Also, this paper presents the comparison results with other two common nearest neighbor searching methods, GPU-based kd-tree and ANN (Approximate Nearest Neighbor) library. The proposed implementation focuses on the cases when k is small, i.e. 2, 4, 8, and 16, which are handled efficiently within warp to consider it is very common for applications to handle small k's. Also, this paper discusses optimization ways to implementation by improving memory management in a loop for the CUB open library and adopting CUDA commands which are supported by GPU hardware. The proposed implementation shows more than 16-fold speed-up against GPU-based other methods in the tests, implying that the improvement would become higher for more larger input data.

Design of a SIMT architecture GP-GPU Using Tile based on Graphic Pipeline Structure (타일 기반 그래픽 파이프라인 구조를 사용한 SIMT 구조 GP-GPU 설계)

  • Kim, Do-Hyun;Kim, Chi-Yong
    • Journal of IKEEE
    • /
    • v.20 no.1
    • /
    • pp.75-81
    • /
    • 2016
  • This paper proposes a design of the tile based on graphic pipeline to improve the graphic application performance in SIMT based GP-GPU. The proposed Tile based on graphics pipeline avoids unnecessary graphic processing operation, and processes the rasterization step in parallel. The massive data processing in parallel through SIMT architecture improve the computational performance, thereby improving the 3D graphic pipeline performance. The more vertex data of 3D model, the higher performance. The proposed structure was confirmed to improve processing performance of up to 3 times from about 1.18 times as compared to 'RAMP' and previous studies.

Multi-Scale Contact Analysis Between Net and Numerous Particles (그물망과 대량입자의 멀티 스케일 접촉해석)

  • Jun, Chul Woong;Sohn, Jeong Hyun
    • Transactions of the Korean Society of Mechanical Engineers A
    • /
    • v.38 no.1
    • /
    • pp.17-23
    • /
    • 2014
  • Graphics processing units (GPUs) are ideal for solving problems involving parallel data computations. In this study, the GPU is used for effectively carrying out a multi-body dynamic simulation with particle dynamics. The Hilber-Hushes-Taylor (HHT) implicit integration algorithm is used to solve the integral equations. For detecting collisions among particles, the spatial subdivision algorithm and discrete-element methods (DEM) are employed. The developed program is verified by comparing its results with those of ADAMS. The numerical efficiencies of the serial program using the CPU and the parallel program using the GPU are compared in terms of the number of particles, and it is observed that when the number of particles is greater, more computing time is saved by using the GPU. In the present example, when the number of particles is 1,300, the computational speed of the parallel analysis program is about 5 times faster than that of the serial analysis program.

Real-time 3D Visualization Method of Landslide disaster prediction Simulation using GPU (GPU을 이용한 토사재해 예측 시뮬레이션의 3D 실시간 가시화 방법)

  • Song, Sang-Min;Cho, Kwang-Joon;Ok, Soo-yol
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.7
    • /
    • pp.1630-1638
    • /
    • 2015
  • In this paper, we propose a GPU-based interactive and plausible visualization method for the silt and landslide simulation results computed with SPH. By empirical experiments, we verify that our GPU-accelerated screen space mesh method can be effectively used for visualizing the landslide disaster simulation. The method proposed in this paper make it possible to overcome the limitation of previous simulations where the experience obtained by trials and errors plays the most important roles. Because the realtime visualization enables interactive observation of simulation results and efficient data assimilation, the accuracy of the simulation can be significantly improved in an efficient way.

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture (GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현)

  • Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.18 no.3
    • /
    • pp.613-619
    • /
    • 2017
  • This paper presents a CUDA (Compute Unified Device Architecture) code to achieve correct GPU parallel segmented prefix operation results with less than 32 segment length for large data arrays. Mark Harris and Michael Garland had published CUDA code to address the tasks. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents a CUDA code that generates correct results. The segmented parallel prefix operation presented in this paper can be applied as a building block to various large parallel processing algorithms including the k-nearest neighbor search problems.

A Study on GPU-based Iterative ML-EM Reconstruction Algorithm for Emission Computed Tomographic Imaging Systems (방출단층촬영 시스템을 위한 GPU 기반 반복적 기댓값 최대화 재구성 알고리즘 연구)

  • Ha, Woo-Seok;Kim, Soo-Mee;Park, Min-Jae;Lee, Dong-Soo;Lee, Jae-Sung
    • Nuclear Medicine and Molecular Imaging
    • /
    • v.43 no.5
    • /
    • pp.459-467
    • /
    • 2009
  • Purpose: The maximum likelihood-expectation maximization (ML-EM) is the statistical reconstruction algorithm derived from probabilistic model of the emission and detection processes. Although the ML-EM has many advantages in accuracy and utility, the use of the ML-EM is limited due to the computational burden of iterating processing on a CPU (central processing unit). In this study, we developed a parallel computing technique on GPU (graphic processing unit) for ML-EM algorithm. Materials and Methods: Using Geforce 9800 GTX+ graphic card and CUDA (compute unified device architecture) the projection and backprojection in ML-EM algorithm were parallelized by NVIDIA's technology. The time delay on computations for projection, errors between measured and estimated data and backprojection in an iteration were measured. Total time included the latency in data transmission between RAM and GPU memory. Results: The total computation time of the CPU- and GPU-based ML-EM with 32 iterations were 3.83 and 0.26 see, respectively. In this case, the computing speed was improved about 15 times on GPU. When the number of iterations increased into 1024, the CPU- and GPU-based computing took totally 18 min and 8 see, respectively. The improvement was about 135 times and was caused by delay on CPU-based computing after certain iterations. On the other hand, the GPU-based computation provided very small variation on time delay per iteration due to use of shared memory. Conclusion: The GPU-based parallel computation for ML-EM improved significantly the computing speed and stability. The developed GPU-based ML-EM algorithm could be easily modified for some other imaging geometries.

Efficient Collaboration Method Between CPU and GPU for Generating All Possible Cases in Combination (조합에서 모든 경우의 수를 만들기 위한 CPU와 GPU의 효율적 협업 방법)

  • Son, Ki-Bong;Son, Min-Young;Kim, Young-Hak
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.7 no.9
    • /
    • pp.219-226
    • /
    • 2018
  • One of the systematic ways to generate the number of all cases is a combination to construct a combination tree, and its time complexity is O($2^n$). A combination tree is used for various purposes such as the graph homogeneity problem, the initial model for calculating frequent item sets, and so on. However, algorithms that must search the number of all cases of a combination are difficult to use realistically due to high time complexity. Nevertheless, as the amount of data becomes large and various studies are being carried out to utilize the data, the number of cases of searching all cases is increasing. Recently, as the GPU environment becomes popular and can be easily accessed, various attempts have been made to reduce time by parallelizing algorithms having high time complexity in a serial environment. Because the method of generating the number of all cases in combination is sequential and the size of sub-task is biased, it is not suitable for parallel implementation. The efficiency of parallel algorithms can be maximized when all threads have tasks with similar size. In this paper, we propose a method to efficiently collaborate between CPU and GPU to parallelize the problem of finding the number of all cases. In order to evaluate the performance of the proposed algorithm, we analyze the time complexity in the theoretical aspect, and compare the experimental time of the proposed algorithm with other algorithms in CPU and GPU environment. Experimental results show that the proposed CPU and GPU collaboration algorithm maintains a balance between the execution time of the CPU and GPU compared to the previous algorithms, and the execution time is improved remarkable as the number of elements increases.

Implememtation of Fast Rasterizer processing using GPGPU based on SIMT structure (SIMT 구조 기반 GPGPU를 이용한 고속 Rasterizer 구현)

  • Kim, Chiyong
    • Journal of IKEEE
    • /
    • v.21 no.3
    • /
    • pp.276-279
    • /
    • 2017
  • In this paper, SIMT structure based GPGPU (General Purpose Computing on Graphics Processing Units) is used for accelerating the Rasterizer which constitutes the screen of the display device in pixel unit. The GPU has a large number of ALUs, and the processing is very fast because of parallel processing. Therefore, in this paper, we implemented a rasterizer that generates a 3D graphics model using a CPU that performs operations sequentially and a GPU that performs operations in parallel. We confirmed that proposed rasterizer in this paper is 1.45 times better than rasterizer using Intel CPU when generating one frame.

Implementation of GPU Based Polymorphic Worm Detection Method and Its Performance Analysis on Different GPU Platforms (GPU를 이용한 Polymorphic worm 탐지 기법 구현 및 GPU 플랫폼에 따른 성능비교)

  • Lee, Sunwon;Song, Chihwan;Lee, Injoon;Joh, Taewon;Kang, Jaewoo
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2010.11a
    • /
    • pp.1458-1461
    • /
    • 2010
  • 작년 7월 7일에 있었던 DDoS 공격과 같이 악성 코드로 인한 피해의 규모가 해마다 증가하고 있다. 특히 변형 웜(Polymorphic Worm)은 기존의 방법으로 1차 공격에서의 탐지가 어렵기 때문에 그 위험성이 더 크다. 이에 본 연구에서는 바이오 인포매틱스(Bioinformatics) 분야에서 유전자들의 유사성과 특징을 찾기 위한 방법 중 하나인 Local Alignment를 소개하고 이를 변형 웜 탐지에 적용한다. 또한 수행의 병렬화 및 알고리즘 변형을 통하여 기존 알고리즘의 $O(n^4)$수행시간이라는 단점을 극복한다. 병렬화는 NVIDIA사의 GPU를 이용한 CUDA 프로그래밍과 AMD사의 GPU를 사용한 OpenCL 프로그래밍을 통하여 수행되었다. 이로써 각 GPGPU 플랫폼에서의 Local Alignment를 이용한 변형 웜 탐지 알고리즘의 성능을 비교하였다.

GPU-Based Image Stitching for Camera Array (카메라 어레이를 위한 GPU 기반 이미지 병합)

  • Bae, Do Hyun;Lee, Young-joon;Shin, Heejae;Bayartsogt, Munkhbayar;Kim, Minho;Kim, Jin Suk
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.04a
    • /
    • pp.352-354
    • /
    • 2012
  • 본 논문에서는 웹캠 카메라 어레이(camera array)로 얻은 여러 장의 이미지를 빠른 속도로 봉합(stitching)하여 고해상도 이미지를 얻기 위해 그래픽스 하드웨어를 이용하는 병렬 알고리즘을 제시한다. 고정된 레이아웃의 카메라 어레이를 이용하여 평면 혹은 원경을 촬영하는 경우, 기존에 널리 쓰이던 평면 사영 이미지 봉합(planar projective image stitching)과 선형 혼합(linear blending)을 통해 만족스런 결과를 얻을 수 있다. 본 논문에서는 이러한 연산을 그래픽스 하드웨어에서 병렬처리 함으로써 추후 실시간 고해상도 동영상 스트리밍 이미지 병합에 활용할 수 있을 정도로 빠른 속도로 처리하는 방법을 제시한다.