• Title/Summary/Keyword: Parallel GPU


Real-time 3D Visualization Method of Landslide Disaster Prediction Simulation using GPU (GPU을 이용한 토사재해 예측 시뮬레이션의 3D 실시간 가시화 방법)

  • Song, Sang-Min;Cho, Kwang-Joon;Ok, Soo-yol
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.7 / pp.1630-1638 / 2015
  • In this paper, we propose a GPU-based interactive and plausible visualization method for silt and landslide simulation results computed with SPH. Through empirical experiments, we verify that our GPU-accelerated screen-space mesh method can be used effectively to visualize the landslide disaster simulation. The proposed method makes it possible to overcome the limitation of previous simulations, where experience obtained by trial and error played the most important role. Because real-time visualization enables interactive observation of simulation results and efficient data assimilation, the accuracy of the simulation can be improved significantly and efficiently.
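
A minimal sketch of the first stage of a screen-space approach of this kind, assuming a simple pinhole projection and a hypothetical particle layout (not the authors' exact pipeline): each SPH particle is projected to the screen and the nearest depth per pixel is kept, after which the depth map can be smoothed and meshed.

```cuda
#include <cuda_runtime.h>

struct Particle { float x, y, z; };               // camera-space position

// depthBuf must be initialized to 0xFFFFFFFF (far plane) before launch.
__global__ void splatDepth(const Particle* p, int n,
                           unsigned int* depthBuf, // depth stored as ordered uint bits
                           int width, int height,
                           float fx, float fy)     // pinhole focal lengths in pixels
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || p[i].z <= 0.0f) return;

    // Perspective projection to pixel coordinates.
    int u = (int)(fx * p[i].x / p[i].z + 0.5f * width);
    int v = (int)(fy * p[i].y / p[i].z + 0.5f * height);
    if (u < 0 || u >= width || v < 0 || v >= height) return;

    // Non-negative floats keep their order when reinterpreted as unsigned ints,
    // so atomicMin retains the closest particle at each pixel.
    atomicMin(&depthBuf[v * width + u], __float_as_uint(p[i].z));
}
```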

Parallel Connected Component Labeling Based on the Selective Four Directional Label Search Using CUDA

  • Soh, Young-Sung;Hong, Jung-Woo
    • Journal of the Institute of Convergence Signal Processing / v.16 no.3 / pp.83-89 / 2015
  • Connected component labeling (CCL) is a mandatory step in image segmentation, where objects are extracted and uniquely labeled. CCL is a computationally expensive operation and is therefore often performed in a parallel processing framework to reduce execution time. Various parallel CCL methods have been proposed in the literature, among them the NSZ label equivalence (NSZ-LE) method, the modified 8-directional label selection (M8DLS) method, the HYBRID1 method, and the HYBRID2 method. Soh et al. showed that HYBRID2 outperforms the others and is the best so far. In this paper we propose a new hybrid parallel CCL algorithm, termed HYBRID3, that combines selective four-directional label search (S4DLS) with label backtracking (LB). We show that the average percentage speedup of the proposed method over M8DLS is around 60% higher than that of HYBRID2 over M8DLS for various kinds of images.
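
The exact S4DLS and label-backtracking logic is specific to the paper, but the per-pixel parallel structure it builds on can be illustrated with a generic propagation pass (an assumption-laden sketch; image and label layouts are invented here): each foreground pixel repeatedly adopts the smallest label among itself and its four neighbours until no label changes.

```cuda
#include <cuda_runtime.h>

// One propagation pass; relaunch until *changed stays 0.
__global__ void propagateLabels(const unsigned char* img, int* label,
                                int w, int h, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x;
    if (img[idx] == 0) return;                 // background pixel

    int best = label[idx];
    if (x > 0     && img[idx - 1] && label[idx - 1] < best) best = label[idx - 1];
    if (x < w - 1 && img[idx + 1] && label[idx + 1] < best) best = label[idx + 1];
    if (y > 0     && img[idx - w] && label[idx - w] < best) best = label[idx - w];
    if (y < h - 1 && img[idx + w] && label[idx + w] < best) best = label[idx + w];

    if (best < label[idx]) {
        label[idx] = best;
        *changed = 1;
    }
}
```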

Accelerating Group Fusion for Ligand-Based Virtual Screening on Multi-core and Many-core Platforms

  • Mohd-Hilmi, Mohd-Norhadri;Al-Laila, Marwah Haitham;Hassain Malim, Nurul Hashimah Ahamed
    • Journal of Information Processing Systems / v.12 no.4 / pp.724-740 / 2016
  • The performance issues of screening large compound databases against multiple query compounds in virtual screening highlight a common concern in chemoinformatics applications. This study investigates these problems by choosing group fusion as a pilot model and presents efficient solutions on parallel platforms, specifically the multi-core architecture of the CPU and the many-core architecture of the graphics processing unit (GPU). A study of sequential group fusion and a proposed design of parallel CUDA group fusion are presented. The design addresses the two important stages of group fusion, namely similarity search and fusion (MAX rule), using embarrassingly parallel and parallel reduction models. Sequential, optimized sequential, and parallel OpenMP versions of group fusion were implemented and evaluated, and the analysis of these three approaches informed the design of the parallel CUDA version so as to achieve high computational intensity. The parallel CUDA version outperformed the sequential and parallel OpenMP versions in both execution time and speedup: it was 5-10x faster, since both the similarity search and the MAX fusion stages were CUDA-optimized.
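
The two stages named above can be sketched as follows, under assumed fingerprint and score layouts (illustrative only, not the paper's implementation): the similarity search is embarrassingly parallel with one thread per database compound, and the MAX-rule fusion is a shared-memory parallel reduction over the scores each compound received from the different queries.

```cuda
#include <cuda_runtime.h>

// Stage 1: one thread per database compound, fixed query fingerprint.
__global__ void similarity(const float* db, const float* query,
                           int nCompounds, int dim, float* score)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCompounds) return;
    float dot = 0.f, a = 0.f, b = 0.f;
    for (int k = 0; k < dim; ++k) {
        float x = db[c * dim + k], y = query[k];
        dot += x * y; a += x * x; b += y * y;
    }
    score[c] = dot / (a + b - dot + 1e-12f);   // continuous Tanimoto coefficient
}

// Stage 2: MAX-rule fusion. One block per compound; threads reduce over queries.
// Launch with blockDim.x a power of two and blockDim.x*sizeof(float) shared memory.
__global__ void maxFusion(const float* scores,  // laid out as [nQueries][nCompounds]
                          int nQueries, int nCompounds, float* fused)
{
    extern __shared__ float s[];
    int c = blockIdx.x;                         // compound index
    float best = 0.f;
    for (int q = threadIdx.x; q < nQueries; q += blockDim.x)
        best = fmaxf(best, scores[q * nCompounds + c]);
    s[threadIdx.x] = best;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0) fused[c] = s[0];
}
```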

CUDA-based Parallel Bi-Conjugate Gradient Matrix Solver for BioFET Simulation (BioFET 시뮬레이션을 위한 CUDA 기반 병렬 Bi-CG 행렬 해법)

  • Park, Tae-Jung;Woo, Jun-Myung;Kim, Chang-Hun
    • Journal of the Institute of Electronics Engineers of Korea CI / v.48 no.1 / pp.90-100 / 2011
  • We present a parallel bi-conjugate gradient (Bi-CG) matrix solver for large-scale BioFET simulations based on recent graphics processing units (GPUs), which enable large-scale parallel processing at very low cost. The proposed method focuses on solving the Poisson equation in parallel, which requires massive computational resources not only in semiconductor simulation but also in various other fields, including computational fluid dynamics and heat transfer simulation. As a result, our solver is around 30 times faster than traditional methods based on single-core CPU systems when solving the Poisson equation in a 3D FDM (finite difference method) scheme. The proposed method is implemented and tested in NVIDIA's CUDA (Compute Unified Device Architecture) environment, which enables general-purpose parallel processing on GPUs. Unlike other similar GPU-based approaches, which usually apply 32-bit single-precision floating-point arithmetic, we use 64-bit double-precision operations for better convergence. Applications on the CUDA platform are rather easy to implement but very hard to optimize for performance; in this regard, we also discuss the optimization strategy of the proposed method.
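
Each Bi-CG iteration is dominated by matrix-vector products with the discretized Poisson operator, plus dot products and vector updates. Below is a hedged sketch of the double-precision 7-point FDM Laplacian kernel such a solver would call repeatedly; grid layout and boundary handling are simplified assumptions, not the paper's code.

```cuda
#include <cuda_runtime.h>

// y = A*x for the 7-point Laplacian on an nx*ny*nz grid, interior points only
// (Dirichlet boundary values are left untouched). h2inv = 1/h^2.
__global__ void laplacian7pt(const double* x, double* y,
                             int nx, int ny, int nz, double h2inv)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1 || k <= 0 || k >= nz - 1)
        return;

    int id = (k * ny + j) * nx + i;
    y[id] = h2inv * (x[id - 1] + x[id + 1] +
                     x[id - nx] + x[id + nx] +
                     x[id - nx * ny] + x[id + nx * ny] -
                     6.0 * x[id]);
}
```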

Integer-Pel Motion Estimation for HEVC on Compute Unified Device Architecture (CUDA)

  • Lee, Dongkyu;Sim, Donggyu;Oh, Seoung-Jun
    • IEIE Transactions on Smart Processing and Computing / v.3 no.6 / pp.397-403 / 2014
  • A new video compression standard called High Efficiency Video Coding (HEVC) has recently been released onto the market. HEVC provides higher coding performance than previous standards, but at the cost of a significant increase in encoding complexity, particularly in motion estimation (ME). At the same time, the computing capabilities of Graphics Processing Units (GPUs) have become more powerful. This paper proposes a parallel integer-pel ME (IME) algorithm for HEVC on a GPU using the Compute Unified Device Architecture (CUDA). In the proposed IME, concurrent parallel reduction (CPR) is introduced. CPR performs several parallel reduction (PR) operations concurrently to solve two problems of conventional PR: low thread utilization and high thread-synchronization latency. The proposed encoder reduces the share of IME in the encoder to almost zero with a 2.3% increase in bitrate, and the proposed IME is up to 172.6 times faster than the IME in the HEVC reference model.
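
One way to picture the idea behind CPR (this is an illustration under assumed data layouts, not the paper's kernel): each warp of a block reduces the SAD values of its own candidate set with shuffle instructions, so several reductions proceed concurrently and no block-wide synchronization is required.

```cuda
#include <cuda_runtime.h>

// One warp reduces one set of 32 SAD candidates to its minimum.
__global__ void minSadPerWarp(const unsigned int* sad,   // [numSets][32] candidates
                              unsigned int* bestSad, int numSets)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x & 31;
    if (warpId >= numSets) return;

    unsigned int v = sad[warpId * 32 + lane];
    // Warp-level min reduction: 5 shuffle steps for 32 candidates, no __syncthreads.
    for (int offset = 16; offset > 0; offset >>= 1)
        v = min(v, __shfl_down_sync(0xffffffffu, v, offset));

    if (lane == 0) bestSad[warpId] = v;
}
```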

GPU-Accelerated Single Image Depth Estimation with Color-Filtered Aperture

  • Hsu, Yueh-Teng;Chen, Chun-Chieh;Tseng, Shu-Ming
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.3 / pp.1058-1070 / 2014
  • There are two major ways to implement depth estimation: multiple-image depth estimation and single-image depth estimation. The former has a high hardware cost because it uses multiple cameras, but its software algorithm is simple; conversely, the latter has a low hardware cost but a complex software algorithm. One recent trend in this field is to make systems compact, or even portable, and to simplify the optical elements attached to a conventional camera. In this paper, we present an implementation of single-image depth estimation using a graphics processing unit (GPU) in a desktop PC, and achieve real-time operation via our evolutionary algorithm and parallel processing technique, employing a compute shader. The method greatly accelerates the compute-intensive single-view depth estimation from 0.003 frames per second (fps) (implemented in MATLAB) to 53 fps, almost twice the real-time standard of 30 fps. To the best of our knowledge, no previous paper discusses the optimization of depth estimation using a single image, and the frame rate of our final result is better than that of previous studies using multiple images, which is about 20 fps.
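
With a color-filtered aperture, the R and B channels view the scene through laterally displaced sub-apertures, so per-pixel depth can be read from the relative shift between color channels. The brute-force search below is only a hedged illustration of that principle under an assumed shift model; the paper itself accelerates the estimation with an evolutionary algorithm on a compute shader.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Assumed model: R shifts by +d and B by -d relative to G, with d depending on depth.
__global__ void depthFromColorShift(const float* R, const float* G, const float* B,
                                    float* depth, int w, int h, int maxDisp)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < maxDisp || x >= w - maxDisp || y >= h) return;

    float bestCost = 1e30f;
    int bestD = 0;
    for (int d = -maxDisp; d <= maxDisp; ++d) {
        // Score how well the shifted channels align with G at this candidate shift.
        float cost = fabsf(R[y * w + x + d] - G[y * w + x]) +
                     fabsf(B[y * w + x - d] - G[y * w + x]);
        if (cost < bestCost) { bestCost = cost; bestD = d; }
    }
    depth[y * w + x] = (float)bestD;            // disparity as a depth proxy
}
```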

Implementation of Parallel Processing Interpolation Algorithm for Multicore GPU (다중코어 GPU를 위한 병렬처리 보간 알고리즘 구현)

  • Lee, Kwang-Yeob;Kim, Chi-Yong
    • Journal of IKEEE / v.16 no.4 / pp.304-309 / 2012
  • As display resolutions continue to increase, the amount of data and computation that graphics hardware must process is also increasing, and the amount of data processed by the rasterizer is growing especially rapidly. Instead of the bilinear algorithm [1] used by conventional interpolation, this paper uses an algorithm based on barycentric (center-of-gravity) coordinates computed from triangle areas, which makes the interpolation easier to parallelize in the rasterizer. The designed rasterizer was implemented and verified in an FPGA environment and compared with a conventional rasterizer, showing approximately 50% higher performance than the conventional design.
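
A sketch of area-based (barycentric) attribute interpolation of the kind the abstract describes, with an invented vertex and attribute layout: the weight for each vertex is the area of the sub-triangle opposite it divided by the full triangle area, and since every pixel is computed independently the scheme parallelizes naturally.

```cuda
#include <cuda_runtime.h>

__device__ float edgeArea2(float ax, float ay, float bx, float by,
                           float cx, float cy)
{
    // Twice the signed area of triangle (a, b, c).
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

// v: the triangle's 3 vertices in screen space; attr: one scalar attribute per vertex.
__global__ void interpolateAttribute(const float2* v, const float* attr,
                                     float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float px = x + 0.5f, py = y + 0.5f;
    float area = edgeArea2(v[0].x, v[0].y, v[1].x, v[1].y, v[2].x, v[2].y);
    if (area == 0.f) return;                         // degenerate triangle

    float w0 = edgeArea2(v[1].x, v[1].y, v[2].x, v[2].y, px, py) / area;
    float w1 = edgeArea2(v[2].x, v[2].y, v[0].x, v[0].y, px, py) / area;
    float w2 = 1.0f - w0 - w1;
    if (w0 < 0.f || w1 < 0.f || w2 < 0.f) return;    // pixel outside the triangle

    out[y * w + x] = w0 * attr[0] + w1 * attr[1] + w2 * attr[2];
}
```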

Building a Dynamic Analyzer for CUDA based System

Alshammari, Salah T.
    • International Journal of Computer Science & Network Security / v.23 no.8 / pp.77-84 / 2023
  • The use of GPUs in general-purpose computing is on the rise due to increases in their programmability and in performance requirements. Tools such as NVIDIA's CUDA allow programmers to code algorithms in a C-like language for execution on the graphics processing unit (GPU). Unfortunately, many performance and correctness bugs occur in parallel programs, and CUDA tool support for finding them is still limited. A dynamic analyzer that finds performance and correctness bugs in CUDA programs facilitates the development of sophisticated applications, especially under modern computing requirements: race condition bugs affect program correctness, and shared-memory bank conflicts degrade overall performance. The technique instruments programs so that the memory locations accessed by different threads are recorded and the code is checked for bugs. The instrumented source code is executed directly in CUDA's device emulation mode, which reports all detected errors to the user. This degree of automation helps programmers find subtle bugs in highly complex programs or in programs that cannot be analyzed manually.
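
A hedged illustration of the kind of instrumentation described (everything here is an assumption, not the paper's tool): shared-memory accesses are routed through a logging helper that records (thread id, byte offset) pairs, so an offline pass can flag two threads touching the same location (a potential race) or the same bank within one warp (a bank conflict).

```cuda
#include <cuda_runtime.h>

struct AccessRecord { unsigned int tid; unsigned int addr; };

__device__ AccessRecord g_log[1 << 20];        // global access log
__device__ unsigned int g_logCount = 0;

// Instrumented replacement for a shared-memory load; addr is a byte offset.
__device__ float loggedLoad(const float* shmemBase, unsigned int addr)
{
    unsigned int slot = atomicAdd(&g_logCount, 1u);
    if (slot < (1u << 20)) {
        g_log[slot].tid  = threadIdx.x + blockDim.x * threadIdx.y;
        g_log[slot].addr = addr;               // offline pass derives bank = (addr/4) % 32
    }
    return shmemBase[addr / sizeof(float)];
}
```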

Parallel Range Query processing on R-tree with Graphics Processing Units (GPU를 이용한 R-tree에서의 범위 질의의 병렬 처리)

  • Yu, Bo-Seon;Kim, Hyun-Duk;Choi, Won-Ik;Kwon, Dong-Seop
    • Journal of Korea Multimedia Society / v.14 no.5 / pp.669-680 / 2011
  • R-trees are widely used in areas such as geographic information systems, CAD systems, and spatial databases to efficiently index multi-dimensional data. As the data sets used in these areas grow in size and complexity, however, range query operations on R-trees need to become even faster to meet area-specific constraints. To address this problem, various research efforts have developed strategies for accelerating query processing on R-trees, either by using buffering mechanisms or by parallelizing query processing across multiple disks and processors. As part of these strategies, approaches that parallelize query processing on R-trees using Graphics Processing Units (GPUs) have been explored. The use of GPUs can improve performance through faster calculations and reduced disk accesses, but it can also introduce overhead caused by high memory access latencies and the low data exchange rate between the GPU and the CPU. In this paper, to address these overhead problems and to use GPUs efficiently, we propose a novel approach that uses the GPU as a buffer to parallelize query processing on R-trees. The proposed buffering algorithm improves performance by reducing the number of disk accesses and maximizing coalesced memory access, thereby minimizing GPU memory access latency. Through extensive performance studies, we observed that the proposed approach achieves up to 5 times higher query performance than the original CPU-based R-trees.
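
A minimal sketch of the parallel overlap test at the heart of such a range query, assuming leaf-level MBRs already buffered flat in device memory (the paper's buffer management and coalescing scheme is not reproduced here): one thread tests one entry against the query rectangle.

```cuda
#include <cuda_runtime.h>

struct MBR { float xmin, ymin, xmax, ymax; };

__global__ void rangeQuery(const MBR* entries, int n, MBR q, int* hit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const MBR& e = entries[i];
    // Two rectangles overlap unless one lies strictly to one side of the other.
    bool overlap = !(e.xmax < q.xmin || q.xmax < e.xmin ||
                     e.ymax < q.ymin || q.ymax < e.ymin);
    hit[i] = overlap ? 1 : 0;                  // result list compacted on the host
}
```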

A new warp scheduling technique for improving the performance of GPUs by utilizing MSHR information (GPU 성능 향상을 위한 MSHR 정보 기반 워프 스케줄링 기법)

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing / v.13 no.3 / pp.72-83 / 2017
  • GPUs can provide high throughput and hide latency by executing many warps in parallel. MSHRs (Miss Status Holding Registers) for the L1 data cache track cache miss requests until the required data is serviced from lower-level memory. In recent GPUs, excessive requests for cache resources cause underutilization of GPU resources due to cache resource reservation failures. In this paper, we propose a new warp scheduling technique to reduce stall cycles under MSHR resource shortage. The cache miss rate of each warp is predicted based on the observation that a warp shows similar cache miss rates over a long period. Warps with low predicted miss rates, or computation-intensive warps, are given high priority for issue when the MSHR is full. Our proposal improves GPU performance by using cache resources more efficiently, based on cache miss rate prediction and monitoring of the MSHR entries. According to our experimental results, reservation fail cycles are reduced by 25.7% and IPC is increased by 6.2% with the proposed scheduling technique compared to the loose round-robin scheduler.
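
An illustrative, simulator-style sketch of the scheduling decision (all structures, thresholds, and names here are assumptions, not the paper's implementation): when the L1D MSHR is full, issue priority shifts to ready warps whose recent behaviour predicts a low miss rate, so they are less likely to stall on reservation failures; otherwise the baseline order (e.g., round-robin) is kept.

```cuda
// Host-side scheduling sketch; would live in a GPU timing simulator, not on the device.
#include <vector>

struct WarpState {
    int   id;
    float predictedMissRate;   // e.g., misses / accesses over a recent window
    bool  ready;               // has an instruction ready to issue
};

// Returns the id of the warp to issue next, or -1 if none is ready.
int pickWarpToIssue(const std::vector<WarpState>& warps,
                    int mshrUsed, int mshrCapacity)
{
    bool mshrPressure = mshrUsed >= mshrCapacity;   // MSHR full: favour low-miss warps
    int best = -1;
    for (int i = 0; i < (int)warps.size(); ++i) {
        if (!warps[i].ready) continue;
        if (best < 0 ||
            (mshrPressure &&
             warps[i].predictedMissRate < warps[best].predictedMissRate))
            best = i;
        // Without MSHR pressure the first ready warp (baseline order) is kept.
    }
    return best < 0 ? -1 : warps[best].id;
}
```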