• Title/Summary/Keyword: CUDA Implementation

Implementation of Stereo Matching Algorithm using GPU (GPU를 이용한 스테레오 정합 알고리즘의 구현)

  • Choi, Hyun-Jun; Seo, Young-Ho; Kim, Dong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.15 no.3 / pp.583-588 / 2011
  • In this paper, we propose an adaptive variable-sized matching window method based on the characteristic points of the image, together with a method that increases the reliability of the cross-consistency check, in order to improve the correctness of the final disparity image. The proposed adaptive variable-sized window method segments the image using color information and finds the characteristic points inside each window. The proposed algorithm is implemented on a graphics processing unit (GPU), an NVIDIA GeForce GTX 296 programmed with CUDA. The resulting implementation runs approximately 128 times faster than the CPU implementation.
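
The pipeline described here, window-based matching followed by a cross-consistency check, maps naturally onto one CUDA thread per pixel. The sketch below illustrates that structure with a fixed-size SAD window and a left-right check; the paper's adaptive variable-sized windows and characteristic-point logic are not reproduced, and WIN, MAX_D, and all names are illustrative assumptions.

```cuda
// Minimal sketch of GPU stereo matching with a cross-consistency check:
// one thread per pixel. A fixed SAD window stands in for the paper's
// adaptive variable-sized windows.
#include <cuda_runtime.h>
#include <climits>

#define WIN   3    // half-width of the matching window (assumed)
#define MAX_D 64   // disparity search range (assumed)

// Winner-take-all SAD matching, left image as reference.
__global__ void sadDisparity(const unsigned char* left,
                             const unsigned char* right,
                             unsigned char* disp, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < MAX_D + WIN || x >= w - WIN || y < WIN || y >= h - WIN) return;

    int bestD = 0, bestCost = INT_MAX;
    for (int d = 0; d < MAX_D; ++d) {
        int cost = 0;
        for (int dy = -WIN; dy <= WIN; ++dy)
            for (int dx = -WIN; dx <= WIN; ++dx)
                cost += abs(left [(y + dy) * w + x + dx] -
                            right[(y + dy) * w + x + dx - d]);
        if (cost < bestCost) { bestCost = cost; bestD = d; }
    }
    disp[y * w + x] = (unsigned char)bestD;
}

// Keep a left disparity only if the right map agrees within one level.
__global__ void crossCheck(const unsigned char* dispL,
                           const unsigned char* dispR,
                           unsigned char* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int d  = dispL[y * w + x];
    int xr = x - d;               // corresponding pixel in the right image
    out[y * w + x] = (xr >= 0 && abs(dispR[y * w + xr] - d) <= 1)
                       ? (unsigned char)d : 255;  // 255 marks inconsistent pixels
}
```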

The Implementation of Fast Object Recognition Using Parallel Processing on CPU and GPU (CPU와 GPU의 병렬 처리를 이용한 고속 물체 인식 알고리즘 구현)

  • Kim, Jun-Chul; Jung, Young-Han; Park, Eun-Soo; Cui, Xue-Nan; Kim, Hak-Il; Huh, Uk-Youl
    • Journal of Institute of Control, Robotics and Systems / v.15 no.5 / pp.488-495 / 2009
  • This paper presents a fast feature extraction method for autonomous mobile robots based on parallel processing with OpenMP, SSE (Streaming SIMD Extensions), and CUDA. In the first step, the CPU version of the algorithm is optimized and then parallelized: the parallel code is debugged to maintain the same level of performance as the sequential version, and the steps that extract key points and compute their dominant orientations are parallelized. After extraction, the descriptor is built in parallel using SSE instructions. A GPU version based on SIFT is also implemented with CUDA. The GPU-parallel descriptor achieves up to a five-fold acceleration over the CPU-parallel descriptor, but the GPU version as a whole performs below the CPU version. The CPU version, in turn, is about 4.5 times faster than the original SIFT while maintaining robust performance.
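
As a rough CUDA analogue of the data-parallel stages described above, the sketch below computes per-pixel gradient magnitude and orientation, the quantities that SIFT's dominant-orientation assignment and descriptor construction both consume. Kernel name and layout are assumptions, not the paper's code.

```cuda
// Minimal sketch of the per-pixel gradient stage underlying SIFT
// orientation assignment and descriptors: one thread per pixel.
#include <cuda_runtime.h>
#include <math.h>

__global__ void gradientMagOri(const float* img, float* mag, float* ori,
                               int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= w - 1 || y <= 0 || y >= h - 1) return;

    float dx = img[y * w + x + 1] - img[y * w + x - 1];   // central differences
    float dy = img[(y + 1) * w + x] - img[(y - 1) * w + x];
    mag[y * w + x] = sqrtf(dx * dx + dy * dy);
    ori[y * w + x] = atan2f(dy, dx);                       // radians, [-pi, pi]
}
```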

Implementation of LTE uplink System for SDR Platform using CUDA and UHD (CUDA와 UHD를 이용한 SDR 플랫폼 용 LTE 상향링크 시스템 구현)

  • Ahn, Chi Young; Kim, Yong; Choi, Seung Won
    • Journal of Korea Society of Digital Industry and Information Management / v.9 no.2 / pp.81-87 / 2013
  • In this paper, we present an implementation of a Long Term Evolution (LTE) Uplink (UL) system on a Software Defined Radio (SDR) platform built from a conventional Personal Computer (PC), using a Graphics Processing Unit (GPU) for the SDR software modem and a Universal Software Radio Peripheral 2 (USRP2) with the USRP Hardware Driver (UHD) as the Radio Frequency (RF) transceiver. We adopted UHD because it provides flexibility in the design of the transceiver chain, and a Cognitive Radio (CR) engine was implemented using UHD libraries. The software modem runs on the GPU, which is well suited to parallel computing thanks to its large number of Arithmetic Logic Units (ALUs). In our experiments, the total processing time for a single frame of LTE UL data is about 5.00 ms on the transmit side and 6.78 ms on the receive side, which means the implemented system is capable of real-time processing of all the baseband signal processing algorithms required by the LTE UL system.
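
One way to picture the GPU side of such a modem is the batched FFT stage of the SC-FDMA receiver. The sketch below offloads it to cuFFT, transforming all symbols of one subframe in a single call; the FFT size, batch count, and buffer handling are illustrative assumptions, and UHD sample capture is replaced by a zero-filled host buffer.

```cuda
// Minimal sketch of a batched SC-FDMA FFT stage on the GPU via cuFFT.
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstdlib>
#include <cstring>

int main()
{
    const int N     = 2048;   // FFT size for 20 MHz LTE (assumed)
    const int batch = 14;     // SC-FDMA symbols per subframe, normal CP (assumed)

    cufftComplex* h_buf = (cufftComplex*)malloc(sizeof(cufftComplex) * N * batch);
    memset(h_buf, 0, sizeof(cufftComplex) * N * batch);  // stand-in for UHD samples

    cufftComplex* d_buf;
    cudaMalloc(&d_buf, sizeof(cufftComplex) * N * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, batch);  // one plan, reused every subframe

    cudaMemcpy(d_buf, h_buf, sizeof(cufftComplex) * N * batch,
               cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);  // all symbols in one call
    cudaMemcpy(h_buf, d_buf, sizeof(cufftComplex) * N * batch,
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```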

Implementation of Pedestrian Detection and Tracking with GPU at Night-time (GPU를 이용한 야간 보행자 검출과 추적 시스템 구현)

  • Choi, Beom-Joon; Yoon, Byung-Woo; Song, Jong-Kwan; Park, Jangsik
    • Journal of Broadcast Engineering / v.20 no.3 / pp.421-429 / 2015
  • This paper describes an approach to pedestrian detection and tracking in infrared imagery. We used CUDA (Compute Unified Device Architecture), a parallel computing platform, to improve the speed of video-based pedestrian detection and tracking. The detection phase is performed by an Adaboost classifier based on Haar-like features, trained on datasets generated from infrared images. After a pedestrian is detected by the Adaboost classifier, tracking is carried out by a particle filter that adaptively exploits an HSV histogram feature. The proposed approach is implemented on an NVIDIA Jetson TK1 developer board, a full-featured device well suited to software development in a Linux environment. We present the results of parallel processing with the NVIDIA GPU in the CUDA development environment and compare the detection and tracking time for night-time images on the GPU and the CPU: with the GPU, detection and tracking run approximately 6 times faster than on the CPU.
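
The particle-filter stage described above has an embarrassingly parallel core: weighting every particle independently. The sketch below shows that step with one thread per particle comparing a candidate HSV histogram against the reference via the Bhattacharyya coefficient; histogram extraction and resampling are omitted, and BINS, LAMBDA, and all names are assumptions.

```cuda
// Minimal sketch of particle weighting for an HSV-histogram particle
// filter: one thread per particle.
#include <cuda_runtime.h>
#include <math.h>

#define BINS   64      // HSV histogram bins per particle (assumed)
#define LAMBDA 20.0f   // likelihood sharpness (assumed)

__global__ void weighParticles(const float* refHist,    // [BINS], normalized
                               const float* candHists,  // [numParticles * BINS]
                               float* weights, int numParticles)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numParticles) return;

    float bc = 0.0f;                         // Bhattacharyya coefficient
    for (int b = 0; b < BINS; ++b)
        bc += sqrtf(refHist[b] * candHists[p * BINS + b]);

    weights[p] = expf(-LAMBDA * (1.0f - bc)); // smaller distance -> larger weight
}
```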

GPU Implementation Techniques of Genetic Algorithm and Comparative Studies (유전 알고리즘의 GPU 구현 기법 및 비교 연구)

  • Hyeon, Byeong-Yong; Seo, Ki-Sung
    • Journal of Institute of Control, Robotics and Systems / v.17 no.4 / pp.328-335 / 2011
  • A GPU (Graphics Processing Unit) has a SIMD (Single Instruction, Multiple Data) architecture and provides fast parallel processing. A GA (Genetic Algorithm), which requires heavy computation, is implemented on the GPU using CUDA (Compute Unified Device Architecture). Three execution models are presented, corresponding to different combinations of GA processing modules placed on the GPU. Comparison experiments between the GPU models and a CPU implementation are run on a pair of benchmark problems while varying the population size and the problem complexity.
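
The most natural GA-to-GPU mapping, which execution models like these build on, assigns one thread to evaluate one individual's fitness. The sketch below shows that layout with a placeholder sphere benchmark as the fitness function; GENES and all names are illustrative, not from the paper.

```cuda
// Minimal sketch of per-individual GA fitness evaluation on the GPU.
#include <cuda_runtime.h>

#define GENES 32   // genes per individual (assumed)

__global__ void evaluateFitness(const float* population,  // [popSize * GENES]
                                float* fitness, int popSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= popSize) return;

    float f = 0.0f;
    for (int g = 0; g < GENES; ++g) {   // sphere benchmark: sum of x_g^2
        float x = population[i * GENES + g];
        f += x * x;
    }
    fitness[i] = f;                     // selection runs on these values
}
```

Selection, crossover, and mutation can be mapped to the GPU in the same per-individual style; which modules run on the GPU versus the CPU is what distinguishes alternative execution models.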

Implementation of fast facial image detecting system based on GPU (GPU 기반 고속 얼굴 영역 검출 구현)

  • Lee, Seong-Yeon; Park, Seong-Mo; Kim, Jong-Nam
    • Proceedings of the Korea Information Processing Society Conference / 2009.04a / pp.130-131 / 2009
  • Facial region detection is a technology used across many industrial and academic fields, such as face recognition and face reconstruction. Fast facial region detection typically relies on high-performance hardware or fast algorithms; in this paper, we implement a fast facial region detection system using CUDA, a GPU-based programming technique. Conventional face detection systems have had difficulty achieving fast detection due to processing-speed limits, and making them run fast required expensive system components, placing a burden on users. However, implementing a face detection system with the GPGPU technologies that graphics chipset makers such as NVIDIA keep releasing can deliver better performance at a lower cost. Accordingly, this paper implements a face detection system using NVIDIA's CUDA, one such general-purpose GPU technology. Experimental results confirm that the GPU-based system detects faces faster than a CPU-based system. The proposed method can be applied, for example, to high-speed surveillance-camera servers on systems equipped with an NVIDIA graphics card.
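
Face detectors of this kind spend most of their time summing rectangles, which an integral image reduces to four reads. The sketch below scores a single two-rectangle Haar-like feature per sliding window, one thread per window; a real cascade evaluates many features per stage, and all geometry and names are illustrative assumptions.

```cuda
// Minimal sketch of the inner loop of a GPU face detector: one thread
// per sliding window, scoring one two-rectangle Haar-like feature.
#include <cuda_runtime.h>

// ii is an exclusive summed-area table of width iiW = imageW + 1:
// ii[y * iiW + x] = sum of all pixels above and to the left of (x, y).
__device__ int rectSum(const int* ii, int iiW, int x, int y, int rw, int rh)
{
    return ii[(y + rh) * iiW + (x + rw)] - ii[y * iiW + (x + rw)]
         - ii[(y + rh) * iiW + x]        + ii[y * iiW + x];
}

__global__ void haarWindowScore(const int* ii, float* score,
                                int imgW, int imgH, int win)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x + win >= imgW || y + win >= imgH) return;

    int iiW = imgW + 1;
    // Two-rectangle feature: top half minus bottom half of the window.
    int top    = rectSum(ii, iiW, x, y,           win, win / 2);
    int bottom = rectSum(ii, iiW, x, y + win / 2, win, win / 2);
    score[y * imgW + x] = (float)(top - bottom);
}
```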

Computationally Efficient Implementation of a Hamming Code Decoder Using Graphics Processing Unit

  • Islam, Md Shohidul; Kim, Cheol-Hong; Kim, Jong-Myon
    • Journal of Communications and Networks / v.17 no.2 / pp.198-202 / 2015
  • This paper presents a computationally efficient implementation of a Hamming code decoder on a graphics processing unit (GPU) to support real-time software-defined radio, a software alternative for realizing wireless communication. The Hamming code algorithm is challenging to parallelize effectively on a GPU because it works on sparsely located data items with several conditional statements, leading to non-coalesced, long-latency global memory accesses and heavy thread divergence. To address these issues, we propose an optimized GPU implementation of the Hamming code that exploits the parallelism inherent in the algorithm. Experimental results on a compute unified device architecture (CUDA)-enabled NVIDIA GeForce GTX 560 with 335 cores show that the proposed approach achieves a 99x speedup over the equivalent CPU-based implementation.
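
The divergence problem the abstract describes can be illustrated on the smallest Hamming code: the sketch below decodes one Hamming(7,4) codeword per thread and applies the syndrome correction branch-free so warps stay converged. The bit layout and names are assumed conventions, not the paper's design.

```cuda
// Minimal sketch of a data-parallel Hamming(7,4) decoder: one thread per
// codeword. Bit layout: the byte's low 7 bits hold p1 p2 d1 p3 d2 d3 d4,
// MSB first (assumed convention).
#include <cuda_runtime.h>

__global__ void hamming74Decode(const unsigned char* code,  // one codeword per byte
                                unsigned char* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned c = code[i];          // bits 6..0 = codeword positions 1..7
    unsigned b[7];
    for (int k = 0; k < 7; ++k)    // b[k] = bit at codeword position k+1
        b[k] = (c >> (6 - k)) & 1u;

    // Syndrome bits from the standard parity position sets.
    unsigned s1 = b[0] ^ b[2] ^ b[4] ^ b[6];   // positions 1,3,5,7
    unsigned s2 = b[1] ^ b[2] ^ b[5] ^ b[6];   // positions 2,3,6,7
    unsigned s3 = b[3] ^ b[4] ^ b[5] ^ b[6];   // positions 4,5,6,7
    unsigned syn = s1 | (s2 << 1) | (s3 << 2); // 1-based error position, 0 = none

    // Predicated correction: flip the offending bit without branching.
    c ^= syn ? (1u << (7 - syn)) : 0u;

    // Re-assemble the data bits d1 d2 d3 d4 (positions 3, 5, 6, 7).
    data[i] = (unsigned char)((((c >> 4) & 1u) << 3) | (((c >> 2) & 1u) << 2) |
                              (((c >> 1) & 1u) << 1) |  (c        & 1u));
}
```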

Parallel Implementation of Scrypt: A Study on GPU Acceleration for Password-Based Key Derivation Function

  • SeongJun Choi; DongCheon Kim; Seog Chung Seo
    • Journal of information and communication convergence engineering / v.22 no.2 / pp.98-108 / 2024
  • Scrypt is a password-based key derivation function with a memory-hard structure, proposed by Colin Percival in 2009. It was intentionally designed to be memory-intensive so that password cracking with ASICs, GPUs, and similar hardware is difficult. In this study, however, we thoroughly analyzed the operation of Scrypt and propose strategies that maximize computational parallelism in GPU environments. Through these optimizations, we achieved a performance improvement of 8284.4% compared with a traditional CPU-based Scrypt computation. Moreover, the GPU-optimized implementation presented in this paper outperforms naive GPU-based Scrypt processing by a significant margin, providing a performance improvement of 204.84% on an RTX 3090. These results demonstrate the effectiveness of the proposed approach in harnessing the computational power of GPUs and achieving remarkable performance gains in Scrypt calculations. The proposed implementation is the first GPU implementation of Scrypt, demonstrating the ability to crack Scrypt efficiently.
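
The outer structure of GPU Scrypt cracking is candidate-parallel: each thread runs the memory-hard ROMix loop over its own scratchpad slice. The sketch below shows only that skeleton, with the Salsa20/8 BlockMix and PBKDF2 stages replaced by a toy, non-cryptographic mixing stub; N_COST and all names are assumptions.

```cuda
// Minimal sketch of the candidate-parallel skeleton of GPU Scrypt:
// one thread per password candidate running the ROMix access pattern.
#include <cuda_runtime.h>

#define N_COST 1024   // Scrypt cost parameter N (kept small for the sketch)

__device__ unsigned long long mix(unsigned long long x)
{
    // Toy stand-in for BlockMix/Salsa20-8; NOT cryptographic.
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

__global__ void scryptRomixSketch(const unsigned long long* seeds, // per candidate
                                  unsigned long long* out,
                                  unsigned long long* scratch,     // [nCand * N_COST]
                                  int nCand)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCand) return;

    unsigned long long* V = scratch + (size_t)i * N_COST;
    unsigned long long x = seeds[i];

    for (int j = 0; j < N_COST; ++j) { V[j] = x; x = mix(x); }  // fill phase

    for (int j = 0; j < N_COST; ++j) {   // memory-hard phase:
        int k = (int)(x % N_COST);       // data-dependent scratchpad read
        x = mix(x ^ V[k]);
    }
    out[i] = x;
}
```

The per-candidate scratchpad is precisely what makes Scrypt memory-hard: GPU memory capacity, rather than ALU count, bounds how many candidates can run concurrently.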

Implementation of Efficient Power Method on CUDA GPU (CUDA 기반 GPU에서 효율적인 Power Method의 구현)

  • Kim, Jung-Hwan; Kim, Jin-Soo
    • Journal of the Korea Society of Computer and Information / v.16 no.2 / pp.9-16 / 2011
  • GPU computing is emerging in high-performance application areas because it exploits massive parallelism in a cost-effective way. The power method, which finds the dominant eigenvector of a given matrix, is widely used in applications such as PageRank, which calculates the importance of web pages. In this research, we parallelized the power method efficiently on the GPU and suggest how its performance can be improved further. The power method consists mainly of matrix-vector products, which parallelize easily; however, it must also test the eigenvector for convergence and rescale the vector afterwards. These operations incur several GPU kernel calls and data movement between host and GPU memories. We improved the performance of the power method by reducing the number of kernel calls, optimizing the thread allocation, and enhancing the convergence test.
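
The iteration decomposed above is easy to sketch: a matrix-vector product kernel plus a scaling pass. In the naive version below, the norm and convergence test would still run on the host, which is exactly the kernel-call and transfer overhead the paper reduces; names and the dense row-per-thread layout are illustrative assumptions.

```cuda
// Minimal sketch of the two kernels behind one power-method iteration.
#include <cuda_runtime.h>

__global__ void matVec(const float* A, const float* x, float* y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)        // dot product of row `row` with x
        acc += A[row * n + j] * x[j];
    y[row] = acc;
}

__global__ void scale(float* y, float invNorm, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= invNorm;        // candidate for fusion with matVec
}
```

Fusing the scaling into the product kernel and deciding convergence on the device are the kinds of changes that cut kernel calls and host-device round trips.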

Realtime Wideband SW DDC Using High-Speed Parallel Processing (고속 병렬처리 기법을 활용한 실시간 광대역 소프트웨어 DDC)

  • Lee, Hyeon-Hwi; Lee, Kwang-Yong; Yun, Sangbom; Park, Yeongil; Kim, Seongyo
    • The Journal of Korean Institute of Electromagnetic Engineering and Science / v.25 no.11 / pp.1135-1141 / 2014
  • Wideband DDC (Digital Down-Conversion) on signals quantized over a wide dynamic range at a high sampling rate has primarily been implemented in hardware such as FPGAs or ASICs because of its computational cost. A software implementation of real-time wideband DDC, by contrast, adapts flexibly to changing signal environments, can be reused, and costs less than a hardware implementation. In this paper, we design a high-speed parallel processing architecture for software-based wideband DDC and a system that can store the signal in real time. Finally, applying a ping-pong buffering mechanism for real-time signal reception and CUDA for high-speed signal processing, we verify that the wideband DDC design meets the real-time signal processing requirement.
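
The ping-pong buffering idea is simple to show in isolation: two pinned host buffers alternate roles, so the GPU processes one chunk while the receiver fills the other. In the sketch below the SDR receive step and the DDC kernel are stubs, and CHUNK and all names are assumptions.

```cuda
// Minimal sketch of ping-pong buffering: reception overlaps GPU work.
#include <cuda_runtime.h>
#include <cstring>

#define CHUNK (1 << 20)   // samples per buffer (assumed)

__global__ void ddcStub(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // stand-in for mix/filter/decimate
}

int main()
{
    float *h_buf[2], *d_in, *d_out;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int b = 0; b < 2; ++b)
        cudaMallocHost(&h_buf[b], CHUNK * sizeof(float)); // pinned for async copy
    cudaMalloc(&d_in,  CHUNK * sizeof(float));
    cudaMalloc(&d_out, CHUNK * sizeof(float));

    for (int iter = 0; iter < 8; ++iter) {
        int ping = iter & 1, pong = ping ^ 1;
        // GPU works on the "ping" buffer asynchronously...
        cudaMemcpyAsync(d_in, h_buf[ping], CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        ddcStub<<<(CHUNK + 255) / 256, 256, 0, stream>>>(d_in, d_out, CHUNK);
        // ...while the host "receives" the next chunk into the "pong" buffer.
        memset(h_buf[pong], 0, CHUNK * sizeof(float));    // stand-in for SDR receive
        cudaStreamSynchronize(stream);
    }

    cudaFree(d_in); cudaFree(d_out);
    for (int b = 0; b < 2; ++b) cudaFreeHost(h_buf[b]);
    cudaStreamDestroy(stream);
    return 0;
}
```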