• Title/Summary/Keyword: CUDA

Search Result 295

Acceleration techniques for GPGPU-based Maximum Intensity Projection (GPGPU 환경에서 최대휘소투영 렌더링의 고속화 방법)

  • Kye, Hee-Won; Kim, Jun-Ho
    • Journal of Korea Multimedia Society, v.14 no.8, pp.981-991, 2011
  • MIP (Maximum Intensity Projection) is a volume rendering technique that is essential for medical imaging systems. MIP rendering based on ray casting produces high-quality images but takes a long time. Our aim is to improve the rendering speed using GPGPU (general-purpose computing on graphics processing units). In this paper, we present a ray casting algorithm based on CUDA (Compute Unified Device Architecture), NVIDIA's GPGPU programming platform, and propose new acceleration methods for it: block-based space leaping, which skips unnecessary regions of the volume data; a bisection method, which quickly finds a block edge; and an initial value estimation method, which improves the probability of space leaping. With the proposed methods, we noticeably improve the rendering speed without degrading image quality.
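
The block-based space leaping described above can be pictured with a minimal CUDA sketch, not the paper's code: one thread casts one axis-aligned ray and skips any block whose precomputed maximum cannot raise the running maximum. The bisection-based block-edge search and the initial value estimation are omitted, and all identifiers (d_volume, d_blockMax, BLK) are assumptions for illustration.

```cuda
// Minimal sketch (not the paper's code): one thread casts one axis-aligned ray
// through the volume; per-block maxima (d_blockMax) let the ray skip a whole
// block when it cannot raise the running maximum. All names are assumptions.
#include <cuda_runtime.h>

#define BLK 8  // assumed edge length of a volume block

__global__ void mipRayCast(const unsigned char* d_volume, const unsigned char* d_blockMax,
                           unsigned char* d_image, int dimX, int dimY, int dimZ)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY) return;

    int blocksX = (dimX + BLK - 1) / BLK, blocksY = (dimY + BLK - 1) / BLK;
    int blocksZ = (dimZ + BLK - 1) / BLK;
    int bx = x / BLK, by = y / BLK;

    unsigned char best = 0;  // running maximum along the ray
    for (int bz = 0; bz < blocksZ; ++bz) {
        // Block-based space leaping: skip the block if it cannot beat 'best'.
        unsigned char bmax = d_blockMax[(bz * blocksY + by) * blocksX + bx];
        if (bmax <= best) continue;
        int z0 = bz * BLK, z1 = min(z0 + BLK, dimZ);
        for (int z = z0; z < z1; ++z) {
            unsigned char v = d_volume[(z * dimY + y) * dimX + x];
            if (v > best) best = v;
        }
    }
    d_image[y * dimX + x] = best;
}
```

The per-block maxima would be computed once per volume in a separate pass, which is what makes the skip test cheap relative to sampling every voxel along the ray.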

An Optimization Method for Hologram Generation on Multiple GPU-based Parallel Processing (다중 GPU기반 홀로그램 생성을 위한 병렬처리 성능 최적화 기법)

  • Kook, Joongjin
    • Smart Media Journal, v.8 no.2, pp.9-15, 2019
  • Since the computational cost of hologram generation grows rapidly with the size of the point cloud, parallel processing on multiple GPUs using the CUDA and/or OpenCL libraries has recently become popular. A CUDA kernel must be organized into threads, blocks, and grids appropriately for the number of cores and the memory size of the GPU. In addition, in multi-GPU environments the workload must be distributed grid-by-grid, block-by-block, or thread-by-thread according to the number of GPUs. To evaluate the performance of CGH (computer-generated hologram) generation, we compared the computation speed on a CPU, on a single GPU, and in multi-GPU environments while gradually increasing the number of points in the point cloud from 10 to 1,000,000. We also present a memory structure design and a calculation method for CUDA-based parallel processing that accelerate CGH generation in multiple-GPU environments.
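
A rough sketch of the grid-by-grid distribution described above, under assumptions that are not from the paper: the point cloud is replicated on every device, hologram rows are split evenly across GPUs, and each thread accumulates a simplified Fresnel phase contribution from every point. Names such as cghRows and the phase model are illustrative only.

```cuda
// Multi-GPU sketch: rows are split across GPUs; each thread sums the (simplified
// Fresnel) contribution of every point-cloud element to one hologram pixel.
#include <cuda_runtime.h>
#include <math.h>

struct Point3 { float x, y, z, amp; };  // assumed point-cloud element (z > 0)

__global__ void cghRows(const Point3* pts, int nPts, float* fringe,
                        int width, int rowBegin, int rowEnd,
                        float pitch, float k /* 2*pi/lambda */)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = rowBegin + blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= rowEnd) return;

    float px = col * pitch, py = row * pitch, acc = 0.0f;
    for (int j = 0; j < nPts; ++j) {            // sum over the whole point cloud
        float dx = px - pts[j].x, dy = py - pts[j].y;
        float phase = k * (dx * dx + dy * dy) / (2.0f * pts[j].z);  // Fresnel approx.
        acc += pts[j].amp * cosf(phase);
    }
    fringe[(row - rowBegin) * width + col] = acc;  // per-GPU local buffer
}

// Host side: one slice of hologram rows per GPU (point cloud copied to each device).
void generateOnAllGpus(const Point3* d_pts[], int nPts, float* d_fringe[],
                       int width, int height, float pitch, float k, int nGpu)
{
    int rowsPerGpu = (height + nGpu - 1) / nGpu;
    dim3 block(16, 16);
    for (int g = 0; g < nGpu; ++g) {
        cudaSetDevice(g);
        int r0 = g * rowsPerGpu;
        int r1 = r0 + rowsPerGpu; if (r1 > height) r1 = height;
        dim3 grid((width + 15) / 16, (r1 - r0 + 15) / 16);
        cghRows<<<grid, block>>>(d_pts[g], nPts, d_fringe[g], width, r0, r1, pitch, k);
    }
    for (int g = 0; g < nGpu; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```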

CUDA-based Parallel Bi-Conjugate Gradient Matrix Solver for BioFET Simulation (BioFET 시뮬레이션을 위한 CUDA 기반 병렬 Bi-CG 행렬 해법)

  • Park, Tae-Jung; Woo, Jun-Myung; Kim, Chang-Hun
    • Journal of the Institute of Electronics Engineers of Korea CI, v.48 no.1, pp.90-100, 2011
  • We present a parallel bi-conjugate gradient (Bi-CG) matrix solver for large-scale BioFET simulations based on recent graphics processing units (GPUs), which enable large-scale parallel processing at very low cost. The proposed method focuses on solving the Poisson equation in parallel, which requires massive computational resources not only in semiconductor simulation but also in various other fields, including computational fluid dynamics and heat transfer simulation. As a result, our solver is around 30 times faster than traditional methods on single-core CPU systems when solving the Poisson equation in a 3D FDM (finite difference method) scheme. The proposed method is implemented and tested in NVIDIA's CUDA (Compute Unified Device Architecture) environment, which enables general-purpose parallel processing on GPUs. Unlike other similar GPU-based approaches, which usually apply 32-bit single-precision floating-point arithmetic, we use 64-bit double-precision operations for better convergence. Applications on the CUDA platform are rather easy to implement but very hard to optimize, so we also discuss the optimization strategy of the proposed method.
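
As a hedged illustration of the double-precision building block such a solver iterates over, the sketch below applies the 3D FDM Poisson operator (a 7-point Laplacian with zero Dirichlet boundaries) as a matrix-free matrix-vector product. The full Bi-CG loop, its dot products and vector updates, and the paper's optimizations are not shown, and all names are assumptions.

```cuda
// Double-precision, matrix-free mat-vec for the 3-D FDM Poisson operator.
// This is the operation a Bi-CG iteration applies repeatedly; the solver loop
// itself (residuals, search directions, dot products) is omitted here.
#include <cuda_runtime.h>

__global__ void laplacian7(const double* x, double* y, int nx, int ny, int nz, double h2inv)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || k >= nz) return;

    // Neighbor access with zero Dirichlet boundary values.
    auto at = [&](int a, int b, int c) -> double {
        if (a < 0 || a >= nx || b < 0 || b >= ny || c < 0 || c >= nz) return 0.0;
        return x[(c * ny + b) * nx + a];
    };
    double center = at(i, j, k);
    y[(k * ny + j) * nx + i] = h2inv *
        (6.0 * center - at(i - 1, j, k) - at(i + 1, j, k)
                      - at(i, j - 1, k) - at(i, j + 1, k)
                      - at(i, j, k - 1) - at(i, j, k + 1));
}
```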

Frequency Hopping Signal Analysis Using High-Speed Parallel Processing (고속 병렬처리 기법을 활용한 주파수 도약 신호 분석)

  • Lee, Kwang-Yong; Yoon, Hyun-Chul; Lee, Hyeon-Hwi
    • The Journal of Korean Institute of Electromagnetic Engineering and Science, v.25 no.2, pp.251-254, 2014
  • In this paper, we study a technique for extracting and analyzing frequency hopping (FH) signals using a high-speed parallel processing structure. Unlike fixed-frequency signals, FH signals are difficult to detect and analyze because FH systems use many random frequencies instead of a single carrier frequency. To solve this problem, we designed a method that analyzes FH signals using high-speed parallel processing. To apply parallel processing, we use CUDA on a GPU and compare serial processing with parallel processing. As a result, using CUDA on a GPU is about 8.53 times faster than serial processing.

Method of Multi Thread Management based on Shader Instruction for Mobile GPGPU (GPGPU를 위한 쉐이더 명령어기반 멀티 스레드 관리 기법)

  • Lee, Kwang-Yeob; Park, Tae-Ryong
    • Journal of IKEEE, v.16 no.4, pp.310-315, 2012
  • This paper designs a multi-threaded mobile GPGPU optimized for the mobile environment and verifies an effective thread management method for the multi-threaded mobile processor. Thread management uses no dedicated hardware and is implemented with software (shader) instructions. To verify the thread management method, a lane detection algorithm was implemented to compare NVIDIA's CUDA architecture and the designed GPGPU in terms of thread management efficiency, with the number of threads normalized to 48. The implemented lane detection algorithm consists of a Gaussian filter and a Sobel edge detection algorithm. As a result, the designed GPGPU's thread efficiency is up to 2 times higher than CUDA's.
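
For reference, the Sobel edge-detection stage of such a lane detection workload looks roughly like the CUDA kernel below. This is a generic sketch used only to picture the per-pixel workload, not the paper's shader-instruction-based implementation; the Gaussian pre-filter stage is omitted.

```cuda
// Naive Sobel edge detection: one thread per interior pixel of an 8-bit image.
#include <cuda_runtime.h>
#include <math.h>

__global__ void sobel(const unsigned char* src, unsigned char* dst, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    // 3x3 Sobel gradients in x and y.
    int gx = -src[(y-1)*w + x-1] + src[(y-1)*w + x+1]
             - 2*src[y*w + x-1]  + 2*src[y*w + x+1]
             - src[(y+1)*w + x-1] + src[(y+1)*w + x+1];
    int gy = -src[(y-1)*w + x-1] - 2*src[(y-1)*w + x] - src[(y-1)*w + x+1]
             + src[(y+1)*w + x-1] + 2*src[(y+1)*w + x] + src[(y+1)*w + x+1];
    int mag = (int)sqrtf((float)(gx*gx + gy*gy));
    dst[y*w + x] = (unsigned char)(mag > 255 ? 255 : mag);  // clamp gradient magnitude
}
```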

Parallelization of CUSUM Test in a CUDA Environment (CUDA 환경에서 CUSUM 검증의 병렬화)

  • Son, Changhwan; Park, Wooyeol; Kim, HyeongGyun; Han, KyungSook; Pyo, Changwoo
    • KIISE Transactions on Computing Practices, v.21 no.7, pp.476-481, 2015
  • We have parallelized the cumulative sum (CUSUM) test of NIST's statistical random number test suite in a CUDA environment. Storing the random walk in an array instead of in scalar variables eliminates the data dependence, and this change in data structure makes it possible to apply parallel scans, scatters, and reductions at each stage of the test. In addition, serial data exchanges between the CPU and GPU are removed by migrating the CPU's tasks to the GPU. Finally, we optimized global memory accesses. The overall speedup is 23 times over the sequential version. Our results contribute to improving the security of random numbers used for cryptographic keys as well as reducing the time needed to evaluate randomness.
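
The data-structure change described above can be sketched with Thrust primitives: storing the whole random walk in an array lets an inclusive scan produce the partial sums and a transform-reduce take the maximum absolute value. This is a minimal illustration under assumed names, not the paper's kernels, and the p-value computation of the NIST test is omitted.

```cuda
// CUSUM test statistic z = max_k |S_k| computed with parallel scan + reduction
// instead of a sequential loop. Compile with nvcc as a .cu file.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

struct ToStep {  // map bit epsilon_i to step X_i = 2*epsilon_i - 1
    __host__ __device__ int operator()(unsigned char b) const { return b ? 1 : -1; }
};
struct AbsInt {  // |S_k|
    __host__ __device__ int operator()(int s) const { return s < 0 ? -s : s; }
};

int cusumStatistic(const thrust::device_vector<unsigned char>& bits)
{
    thrust::device_vector<int> walk(bits.size());
    // Steps of the random walk, stored in an array (no scalar data dependence).
    thrust::transform(bits.begin(), bits.end(), walk.begin(), ToStep());
    // Partial sums S_1..S_n via a parallel inclusive scan.
    thrust::inclusive_scan(walk.begin(), walk.end(), walk.begin());
    // z = max_k |S_k| via a parallel transform-reduce.
    return thrust::transform_reduce(walk.begin(), walk.end(), AbsInt(), 0,
                                    thrust::maximum<int>());
}
```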

Efficient Implementation of Convolutional Neural Network Using CUDA (CUDA를 이용한 Convolutional Neural Network의 효율적인 구현)

  • Ki, Cheol-Min; Cho, Tai-Hoon
    • Journal of the Korea Institute of Information and Communication Engineering, v.21 no.6, pp.1143-1148, 2017
  • Artificial intelligence and deep learning are currently attracting great attention, and these technologies are being applied to various fields. Among the many algorithms in artificial intelligence, convolutional neural networks perform well; a convolutional neural network adds convolution layers to a multi-layer neural network. When convolutional neural networks are used on small amounts of data, or when the layer structure is not complicated, speed is not a concern. However, training takes a long time when the training data is large and the layer structure is complicated, and in these cases GPU-based parallel processing is frequently needed. In this paper, we implement convolutional neural networks using CUDA and show that their training is faster and more efficient than training with some other frameworks or programs.
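
As a minimal illustration of the kind of kernel such an implementation parallelizes (not the authors' code), the sketch below computes a direct forward convolution: one thread produces one output activation by looping over input channels and the filter window. The memory layout and the absence of padding and stride are assumptions made for brevity.

```cuda
// Naive forward convolution (valid, stride 1): grid.z indexes output channels,
// each thread computes one output activation for one (row, column, channel).
#include <cuda_runtime.h>

__global__ void convForward(const float* in, const float* w, const float* bias, float* out,
                            int C, int H, int W, int K, int R /* filter size */)
{
    int OH = H - R + 1, OW = W - R + 1;
    int ox = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int oy = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int k  = blockIdx.z;                              // output channel
    if (ox >= OW || oy >= OH || k >= K) return;

    float acc = bias[k];
    for (int c = 0; c < C; ++c)                       // input channels
        for (int r = 0; r < R; ++r)                   // filter rows
            for (int s = 0; s < R; ++s)               // filter columns
                acc += in[(c * H + oy + r) * W + ox + s] *
                       w[((k * C + c) * R + r) * R + s];
    out[(k * OH + oy) * OW + ox] = acc;
}
```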

Analysis of GPU-based Parallel Shifted Sort Algorithm by comparing with General GPU-based Tree Traversal (일반적인 GPU 트리 탐색과의 비교실험을 통한 GPU 기반 병렬 Shifted Sort 알고리즘 분석)

  • Kim, Heesu; Park, Taejung
    • Journal of Digital Contents Society, v.18 no.6, pp.1151-1156, 2017
  • Traversing tree data structures on a GPU commonly yields lower performance than one expects. In this paper, we analyze the reason for this lower-than-expected performance in GPU tree traversal and show that warp divergence is caused by the branch instructions ("if ... else") that appear frequently in tree-traversal CUDA code. We also compare the parallel shifted sort algorithm, which can reduce the number of warp divergences, with a kd-tree CUDA implementation, and show that the shifted sort algorithm runs faster thanks to fewer warp divergences. In our analysis, the shifted sort algorithm ran about 16-fold faster than the kd-tree CUDA implementation for $2^{23}$ query points and $2^{23}$ data points in $R^3$ space. The performance gap tends to increase in proportion to the number of query points and data points.
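
The divergent branch the abstract points to can be illustrated with a simple per-thread kd-tree descent (the node layout and names are assumptions, not the paper's code): adjacent threads in a warp hold different queries, so the if/else below routes them into different subtrees and the warp executes both paths serially.

```cuda
// Per-thread kd-tree descent: the data-dependent branch below is what causes
// warp divergence when the 32 threads of a warp disagree on the path.
#include <cuda_runtime.h>

struct KdNode { float split; int axis, left, right; };  // left == -1 marks a leaf

__global__ void descendToLeaf(const KdNode* nodes, const float3* queries,
                              int* leafOf, int nQueries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;
    float3 p = queries[q];
    int node = 0;                                  // start at the root
    while (nodes[node].left != -1) {               // descend until a leaf
        float coord = (nodes[node].axis == 0) ? p.x
                    : (nodes[node].axis == 1) ? p.y : p.z;
        if (coord < nodes[node].split)             // <-- divergent "if ... else":
            node = nodes[node].left;               //     threads in one warp pick
        else                                       //     different subtrees, so the
            node = nodes[node].right;              //     warp serializes both sides
    }
    leafOf[q] = node;
}
```

The shifted sort approach described in the abstract avoids this pattern by replacing per-query pointer chasing with data-parallel sorting passes in which all threads of a warp execute the same instruction stream.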

An Efficient Multidimensional Scaling Method based on CUDA and Divide-and-Conquer (CUDA 및 분할-정복 기반의 효율적인 다차원 척도법)

  • Park, Sung-In; Hwang, Kyu-Baek
    • Journal of KIISE: Computing Practices and Letters, v.16 no.4, pp.427-431, 2010
  • Multidimensional scaling (MDS) is a widely used method for dimensionality reduction, whose purpose is to represent high-dimensional data in a low-dimensional space while preserving the distances among objects as much as possible. MDS has mainly been applied to data visualization and feature selection. Among the various MDS methods, classical MDS is not readily applicable, on normal desktop computers, to data with large numbers of objects because of its computational complexity: it needs to solve an eigenpair problem on a dissimilarity matrix based on Euclidean distance, so the running time and memory requirements of classical MDS grow quickly as n (the number of objects) increases, restricting its use in large-scale domains. In this paper, we propose an efficient approximation algorithm for classical MDS based on divide-and-conquer and CUDA. Through a set of experiments, we show that our approach is highly efficient and effective for the analysis and visualization of data consisting of several thousand objects.
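
One step of classical MDS that maps naturally onto CUDA is double-centering the squared-distance matrix, $B = -\frac{1}{2} J D^{(2)} J$, computed element-wise as $b_{ij} = -\frac{1}{2}(d^2_{ij} - \bar{d}^2_{i\cdot} - \bar{d}^2_{\cdot j} + \bar{d}^2_{\cdot\cdot})$. The sketch below shows only this step under assumed names; the eigenpair solve and the paper's divide-and-conquer scheme are not shown.

```cuda
// Double-centering of the squared-distance matrix for classical MDS:
// b_ij = -1/2 * (d2_ij - rowMean_i - colMean_j + grandMean).
// Row/column/grand means of d2 are assumed to be precomputed on the device.
#include <cuda_runtime.h>

__global__ void doubleCenter(const float* d2, const float* rowMean,
                             const float* colMean, float grandMean,
                             float* B, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i >= n || j >= n) return;
    B[i * n + j] = -0.5f * (d2[i * n + j] - rowMean[i] - colMean[j] + grandMean);
}
```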