Title/Summary/Keyword: GPU implementation

Design and Implementation of GPU Based Time-Variant Volume Rendering Program and User-Friendly Transfer Function Editor (GPU 기반의 Time-Variant 볼륨 렌더링 프로그램과 사용자 친화적인 전이함수 에디터의 설계 및 구현)

  • Lee, Joong-Youn; Hur, Young-Ju; Koo, Gee-Bum
    • Proceedings of the HCI Society of Korea Conference / 2007.02a / pp.1025-1030 / 2007
  • There is continuing demand from academia and industry for real-time rendering not only of static volume data such as human-body images, but also of dynamically changing time-variant volume data such as fluid flow. Time-variant data is typically several to hundreds of times larger than static volume data, which has made real-time visualization difficult. Meanwhile, with the rapid advance of PC graphics hardware, there have been ongoing attempts to perform, on a single commodity PC, the real-time volume rendering of time-variant data that was previously feasible only through parallel/distributed rendering on supercomputers or clusters of machines. The GPU's vertex and fragment shaders, with vector operations optimized for numerical computation and user programmability, have made fast volume rendering possible on an ordinary PC. In this paper, we use the GPU to visualize time-variant volume data quickly, and we design and implement a user-friendly interface for the resulting GPU volume rendering program. In particular, we focus on the transfer function editor, so that transfer functions, which must change dynamically over time, can be created as conveniently as possible.

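The core of this kind of renderer is a per-pixel ray-marching loop that samples the volume and classifies each sample through the transfer function. A minimal sketch of that loop as a CUDA kernel follows; the paper itself works with vertex/fragment shaders, and the texture objects, orthographic camera, and parameter names here are illustrative assumptions rather than its actual code.

```cuda
// Hedged sketch: front-to-back ray marching with a 1D transfer function.
// volume_tex: 3D scalar field; tf_tex: 1D RGBA transfer function (for a
// time-variant sequence, the editor would supply a new tf_tex per frame).
__global__ void raymarch(cudaTextureObject_t volume_tex,
                         cudaTextureObject_t tf_tex,
                         float4* image, int W, int H,
                         float step, int max_steps)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    // Orthographic rays along +z through the unit cube (camera omitted).
    float3 pos = make_float3(x / (float)W, y / (float)H, 0.0f);
    float4 accum = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int i = 0; i < max_steps && accum.w < 0.95f; ++i) {
        float s  = tex3D<float>(volume_tex, pos.x, pos.y, pos.z);
        float4 c = tex1D<float4>(tf_tex, s);  // classify via transfer function
        float a  = (1.0f - accum.w) * c.w;    // front-to-back compositing
        accum.x += a * c.x;
        accum.y += a * c.y;
        accum.z += a * c.z;
        accum.w += a;
        pos.z += step;
    }
    image[y * W + x] = accum;
}
```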

An Echo Processor for Medical Ultrasound Imaging Using a GPU with Massively Parallel Processing Architecture (병렬 처리 구조의 GPU를 이용한 의료 초음파 영상용 에코 신호 처리기)

  • Seo, Sin-Hyeok; Sohn, Hak-Yeol; Song, Tai-Kyong
    • Proceedings of the IEEK Conference / 2008.06a / pp.871-872 / 2008
  • The method and results of a software implementation of an echo processor for medical ultrasound imaging using a GPU (NVIDIA G80) are presented. The echo signal processing functions are restructured in a SIMD manner suited to the GPU's massively parallel processing architecture, so that the GPU's 128 ALUs are utilized at nearly 100%. A preliminary result for one image frame composed of 128 scan lines, each with 10240 16-bit samples, shows that the echo processor implemented in C can run at a high rate of 30 frames per second, which is close to optimized assembly code running on TI's TMS320C6416 DSP.

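The abstract does not detail the processing chain, but per-sample echo processing maps naturally onto one GPU thread per sample, which is the SIMD restructuring it describes. A hedged sketch, assuming baseband I/Q input and a simple envelope-detection plus log-compression stage (function names and the dynamic-range parameter are placeholders):

```cuda
// Hedged sketch: one thread per RF sample, SIMD style.
__global__ void envelope_logcompress(const short* i_samples,
                                     const short* q_samples,
                                     unsigned char* out, int n,
                                     float dyn_range_db)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float i = (float)i_samples[idx];
    float q = (float)q_samples[idx];
    float env = sqrtf(i * i + q * q);            // envelope detection
    float db  = 20.0f * log10f(env + 1.0f);      // log compression
    float v   = fminf(db / dyn_range_db, 1.0f);  // map into display range
    out[idx]  = (unsigned char)(255.0f * v);
}

// One frame (128 scan lines x 10240 samples) is a single launch:
//   int n = 128 * 10240;
//   envelope_logcompress<<<(n + 255) / 256, 256>>>(d_i, d_q, d_out, n, 60.0f);
```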

Implementation of Integrated CPU-GPU for Efficient Uniform Memory Access Method and Verification System (CPU-GPU간 긴밀성을 위한 효율적인 공유메모리 접근 방법과 검증 시스템 구현)

  • Park, Hyun-moon; Kwon, Jinsan; Hwang, Tae-ho; Kim, Dong-Sun
    • IEMEK Journal of Embedded Systems and Applications / v.11 no.2 / pp.57-65 / 2016
  • In this paper, we propose a system for efficient use of shared memory between the CPU and GPU. The system, called the Fusion Architecture, ensures consistency of the shared memory and minimizes the cache misses that frequently occur on Heterogeneous System Architecture or Unified Virtual Memory based systems. It also maximizes performance for memory-intensive jobs through efficient allocation of GPU cores. To test the architectures under various scenarios, we introduce the Fusion Architecture Analyzer, which compares OpenMP, OpenCL, CUDA, and the proposed architecture in terms of memory overhead and processing time. As a result, the proposed Fusion Architecture runs the benchmarks 55% faster and reduces memory overhead by 220% on average.
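The Fusion Architecture itself is a hardware-level proposal, but the coherent shared-memory model it targets can be illustrated in software with CUDA managed memory, where a single allocation is visible to both CPU and GPU. A minimal sketch under that assumption (this is not the paper's system):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float)); // one buffer, CPU- and GPU-visible
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                     // make GPU writes visible to the CPU
    printf("data[0] = %f\n", data[0]);           // prints 2.000000
    cudaFree(data);
    return 0;
}
```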

Memory-Efficient Belief Propagation for Stereo Matching on GPU (GPU 에서의 고속 스테레오 정합을 위한 메모리 효율적인 Belief Propagation)

  • Choi, Young-Kyu; Williem, Williem; Park, In Kyu
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2012.11a / pp.52-53 / 2012
  • Belief propagation (BP) is a commonly used global energy minimization algorithm for solving the stereo matching problem in 3D reconstruction. However, it requires large memory bandwidth and data size. In this paper, we propose a novel memory-efficient BP algorithm for stereo matching on the graphics processing unit (GPU). The data size and transfer bandwidth are significantly reduced by storing only part of the whole message. To maintain the accuracy of the matching result, the local messages are reconstructed using the shared memory available on the GPU. Experimental results show almost an order-of-magnitude reduction in global memory consumption and a 21 to 46% saving in memory bandwidth compared to the conventional algorithm. The implementation on a recent GPU achieves a 22.8 times speedup in execution time compared to execution on the CPU.

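The paper's message-compression scheme is specific to its method, but the kernel it accelerates, the min-sum message update staged in on-chip shared memory, is standard. A hedged sketch of one update step with a truncated-linear smoothness cost (the classic two-pass distance transform); LABELS, lambda, and trunc are illustrative, and this omits the paper's partial-message storage:

```cuda
#define LABELS 16  // illustrative disparity label count

// Launch with dynamic shared memory: blockDim.x * blockDim.y * LABELS floats.
__global__ void bp_message_update(const float* data_cost, // W*H*LABELS
                                  const float* msg_in,    // summed incoming messages
                                  float* msg_out, int W, int H,
                                  float lambda, float trunc)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    extern __shared__ float h[];                // per-thread slice of h(l)
    int t = (threadIdx.y * blockDim.x + threadIdx.x) * LABELS;
    int p = (y * W + x) * LABELS;

    float hmin = 1e30f;
    for (int l = 0; l < LABELS; ++l) {
        h[t + l] = data_cost[p + l] + msg_in[p + l];
        hmin = fminf(hmin, h[t + l]);
    }
    // Forward/backward passes: distance transform for the linear cost.
    for (int l = 1; l < LABELS; ++l)
        h[t + l] = fminf(h[t + l], h[t + l - 1] + lambda);
    for (int l = LABELS - 2; l >= 0; --l)
        h[t + l] = fminf(h[t + l], h[t + l + 1] + lambda);
    // Truncation, then write the outgoing message.
    for (int l = 0; l < LABELS; ++l)
        msg_out[p + l] = fminf(h[t + l], hmin + trunc);
}
```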

GPU implementation of neural networks

  • Oh, Kyoung-Su; Jung, Kee-Chul
    • Proceedings of the HCI Society of Korea Conference / 2007.02c / pp.322-325 / 2007
  • A graphics processing unit (GPU) is used for a faster artificial neural network. It implements the matrix multiplication of a neural network to enhance the time performance of a text detection system. Preliminary results produced a 20-fold performance enhancement using an ATI RADEON 9700 PRO board. The parallelism of the GPU is fully utilized by accumulating many input feature vectors and weight vectors, then converting the many inner-product operations into one matrix operation. Further research areas include benchmarking the performance with various hardware and GPU-aware learning algorithms. (c) 2004 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
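The speedup rests on the batching idea stated above: gather many input feature vectors into one matrix so that all the inner products against the weight matrix become a single matrix multiply. The 2004 implementation used pixel shaders on the ATI board; a modern CUDA restatement of the same idea (illustrative dimensions, cuBLAS column-major convention) would be:

```cuda
#include <cublas_v2.h>

// Y (m x n) = W (m x k) * X (k x n): m neurons, k inputs per neuron,
// n batched feature vectors. All pointers are device memory.
void forward_layer(cublasHandle_t handle,
                   const float* d_W, const float* d_X, float* d_Y,
                   int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_W, m,   // leading dimension of W
                        d_X, k,   // leading dimension of X
                &beta,  d_Y, m);  // all n * m inner products in one call
}
```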

Accelerating the Sweep3D for a Graphic Processor Unit

  • Gong, Chunye; Liu, Jie; Chen, Haitao; Xie, Jing; Gong, Zhenghu
    • Journal of Information Processing Systems / v.7 no.1 / pp.63-74 / 2011
  • As a powerful and flexible processor, the graphics processing unit (GPU) offers great capability for solving many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront pattern used for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on the CPU-GPU hybrid platform can be improved by up to 4.38 times compared to the CPU-based implementation.
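The wavefront constraint means only cells on the same hyperplane are mutually independent, so the concurrency of each kernel launch is bounded by the hyperplane size. A hedged sketch of the scheduling pattern on a simplified 2D analogue (anti-diagonals i + j = d); the update rule is a placeholder, not Sweep3D's transport kernel:

```cuda
// All cells on diagonal d depend only on diagonals < d, so each diagonal
// is one kernel launch and its cells run concurrently.
__global__ void sweep_diagonal(float* phi, int nx, int ny, int d)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x; // position along the diagonal
    int i = d - t, j = t;
    if (i < 0 || i >= nx || j < 0 || j >= ny) return;
    float up   = (i > 0) ? phi[(i - 1) * ny + j] : 0.0f; // upstream inflow
    float left = (j > 0) ? phi[i * ny + (j - 1)] : 0.0f;
    phi[i * ny + j] = 0.5f * (up + left) + 1.0f;         // placeholder update
}

// Host side: launch diagonals in dependency order; launches on the same
// stream serialize, so no explicit synchronization is needed between them.
//   for (int d = 0; d < nx + ny - 1; ++d)
//       sweep_diagonal<<<(d + 256) / 256, 256>>>(d_phi, nx, ny, d);
```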

Implementation of GPU Acceleration of Object Detection Application with Drone Video (드론 영상 대상 물체 검출 어플리케이션의 GPU가속 구현)

  • Park, Si-Hyun; Park, Chun-Su
    • Journal of the Semiconductor & Display Technology / v.20 no.3 / pp.117-119 / 2021
  • With the development of the industry, the use of drones for specific mission flights is being actively studied. These drones fly a specified path and perform repetitive tasks. If the drone system can detect objects in real time, the performance of these mission flights will increase. In this paper, we implement an object detection system for drone video and add GPU acceleration to maximize the efficiency of limited device resources, using TensorFlow Lite, which enables on-device inference on mobile devices, and the Mobile SDK of DJI, a drone manufacturer. For performance comparison, the average processing time per frame was measured when object detection was performed using only the CPU and when it was performed using the CPU and GPU together.

Performance Analysis of DNN inference using OpenCV Built in CPU and GPU Functions (OpenCV 내장 CPU 및 GPU 함수를 이용한 DNN 추론 시간 복잡도 분석)

  • Park, Chun-Su
    • Journal of the Semiconductor & Display Technology / v.21 no.1 / pp.75-78 / 2022
  • Deep neural networks (DNNs) have become an essential data processing architecture for implementing multiple computer vision tasks. Recently, DNN-based algorithms have achieved much higher recognition accuracy than traditional algorithms based on shallow learning. However, training and running inference on DNNs require far greater computational capability than everyday computer use. Moreover, with the increased size and depth of DNNs, CPUs may be unsatisfactory since they process serially by default. GPUs provide greater speed than CPUs because of their parallel processing nature. In this paper, we analyze the inference time complexity of DNNs using the well-known computer vision library OpenCV. We measure and analyze inference time for three cases: CPU, GPU-Float32, and GPU-Float16.
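The three measured cases correspond directly to OpenCV DNN backend/target settings. A minimal sketch of configuring them (model file and input size are placeholders; the CUDA targets require an OpenCV build compiled with CUDA support):

```cuda
// Host-side C++ using OpenCV's dnn module.
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>

static cv::Mat infer(cv::dnn::Net& net, const cv::Mat& img)
{
    cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255.0, cv::Size(224, 224));
    net.setInput(blob);
    return net.forward();  // time this call for each configuration
}

int main()
{
    cv::dnn::Net net = cv::dnn::readNet("model.onnx"); // placeholder model
    cv::Mat img = cv::imread("input.jpg");             // placeholder input

    // Case 1: CPU
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);
    infer(net, img);

    // Case 2: GPU, 32-bit floats
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);
    infer(net, img);

    // Case 3: GPU, 16-bit floats
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16);
    infer(net, img);
    return 0;
}
```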

Implementation of fast facial image detecting system based on GPU (GPU 기반 고속 얼굴 영역 검출 구현)

  • Lee, Seong-Yeon; Park, Seong-Mo; Kim, Jong-Nam
    • Proceedings of the Korea Information Processing Society Conference / 2009.04a / pp.130-131 / 2009
  • Facial region detection is a technique used across many industrial and academic fields, including face recognition and face reconstruction. Fast facial region detection relies on high-performance hardware or fast algorithms; in this paper, we implement a fast facial region detection system using CUDA, a GPU-based programming technique. Conventional facial region detection systems have struggled to reach high speeds due to processing limits, and running them fast required expensive system components, placing a burden on users. However, implementing a facial region detection system with the GPGPU technology steadily being released by graphics chipset makers such as NVIDIA can deliver better performance at a lower price. In this paper, we therefore implement a face detection system using NVIDIA's CUDA, one such general-purpose GPU technology. Experimental results confirm that the GPU-based system can detect faces faster than a CPU-based system. The proposed method is applicable, for example, to high-speed surveillance camera servers on systems equipped with an NVIDIA graphics card.

Design and Implementation of High-Performance Cryptanalysis System Based on GPUDirect RDMA (GPUDirect RDMA 기반의 고성능 암호 분석 시스템 설계 및 구현)

  • Lee, Seokmin; Shin, Youngjoo
    • Journal of the Korea Institute of Information Security & Cryptology / v.32 no.6 / pp.1127-1137 / 2022
  • Cryptanalysis and decryption technology utilizing the parallel operation of GPUs has been studied with the aim of shortening the computation time of cryptanalysis systems. These studies focus on optimizing code to improve the speed of cryptanalytic operations on a single GPU, or on simply increasing the number of GPUs to enhance parallelism. However, using a large number of GPUs without optimizing data transmission causes longer data transmission latency than using a single GPU and increases the overall computation time of the cryptanalysis system. In this paper, we investigate GPUDirect RDMA and related technologies used for high-performance data processing in deep learning and HPC research in GPU clustering environments. In addition, we present a method of designing a high-performance cryptanalysis system using these technologies. Furthermore, based on the suggested system topology, we present a method of implementing a cryptanalysis system using password cracking and GPU reduction. Finally, performance evaluation results are presented, demonstrating the high-performance technologies applied to the implemented cryptanalysis system, and the expected effects of the proposed system design are shown.
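Reading "GPU reduction" as aggregating per-candidate results across threads (an assumption; the abstract does not pin down the term), the standard building block is a shared-memory tree reduction:

```cuda
// Hedged sketch: block-level sum reduction; a password cracker would reduce
// over per-candidate match flags or scores instead of floats.
// Launch: blocks = ceil(n / (2.0 * threads)), shared = threads * sizeof(float).
__global__ void reduce_sum(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float v = 0.0f;                        // each thread loads two elements
    if (i < n)              v  = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```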