• Title/Summary/Keyword: CUDA(CUDA)

Parallel Computation For The Edit Distance Based On The Four-Russians' Algorithm (4-러시안 알고리즘 기반의 편집거리 병렬계산)

  • Kim, Young Ho;Jeong, Ju-Hui;Kang, Dae Woong;Sim, Jeong Seop
    • KIPS Transactions on Computer and Communication Systems / v.2 no.2 / pp.67-74 / 2013
  • Approximate string matching problems have been studied in diverse fields. Recently, fast approximate string matching algorithms have been used to reduce the time and cost of next-generation sequencing. To measure the number of errors between two strings, a distance function such as the edit distance is used. Given two strings X (|X| = m) and Y (|Y| = n) over an alphabet $\Sigma$, the edit distance between X and Y is the minimum number of edit operations needed to convert X into Y. It can be computed with the well-known dynamic programming technique in O(mn) time and space. It can also be computed with the Four-Russians' algorithm, whose preprocessing step runs in $O((3|\Sigma|)^{2t} t^2)$ time and $O((3|\Sigma|)^{2t} t)$ space and whose computation step runs in O(mn/t) time and O(mn) space, where t is the block size. In this paper, we present a parallelized version of the computation step of the Four-Russians' algorithm. Our algorithm computes the edit distance between X and Y in O(m+n) time using m/t threads. We implemented both the sequential version and our parallelized version of the Four-Russians' algorithm using CUDA to compare execution times. For t = 1 and t = 2, our algorithm runs about 10 times and 3 times faster than the sequential algorithm, respectively.
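As a rough illustration of why the dynamic program can finish in O(m+n) parallel steps, the sketch below fills the same table in two orders: row by row, and one anti-diagonal at a time. It is not the authors' CUDA implementation and does not use the Four-Russians' block lookup; it corresponds loosely to the t = 1 case. The function names are hypothetical, and the inner loop that a GPU would spread over threads runs serially here.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(mn) dynamic program, filled row by row."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of x
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]


def edit_distance_wavefront(x: str, y: str) -> int:
    """Same table, filled one anti-diagonal (i + j = s) at a time.

    Cells on one anti-diagonal depend only on the two previous
    anti-diagonals, so the inner loop could be given to independent
    threads. Assumption: this serial loop only mimics that schedule.
    """
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for s in range(2, m + n + 1):  # at most m + n - 1 wavefront steps
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            j = s - i
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]
```

Both functions return 3 for the pair "kitten" and "sitting". The number of outer iterations in the wavefront version grows with m + n rather than mn, which is the source of the O(m+n) bound when enough threads are available.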

Parallelized Matrix Operation for Fast Computations of Antenna Characteristics (안테나 특성 고속 계산을 위한 병렬화 행렬 연산)

  • Cho, Yong-Heui
    • Proceedings of the Korea Contents Association Conference / 2015.05a / pp.61-62 / 2015
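  • We propose a parallel matrix computation method to speed up the analysis of large antennas used in the millimeter-wave band. To parallelize conventional Gaussian elimination, we use matrix decomposition and an iterative method. To improve the convergence of the iterative method, we also present a scheme that constructs the decomposed matrix by partially using the previous matrix solution. The proposed method can be used together with parallelization techniques such as OpenMP, MPI, and CUDA.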
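The abstract does not say which matrix decomposition or iteration is used. As a hedged illustration of the general idea, the sketch below uses the Jacobi splitting (an assumption, chosen because every row update is independent and therefore easy to parallelize) and a warm start from an earlier solution as a simple stand-in for reusing the previous matrix solution. The function name `jacobi_solve` and the test matrix are hypothetical.

```python
def jacobi_solve(a, b, x0=None, tol=1e-10, max_iter=10_000):
    """Solve a x = b by Jacobi iteration; return (solution, iterations).

    Assumption: a is strictly diagonally dominant, which guarantees
    convergence. Each entry of x_new depends only on the previous
    iterate x, so the n row updates could run in parallel.
    """
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for iteration in range(1, max_iter + 1):
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


# Hypothetical 3x3 system whose exact solution is [1, 2, 3].
a = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
b = [6.0, 17.0, 22.0]

cold_solution, cold_iterations = jacobi_solve(a, b)
# Warm start: begin from a nearby earlier solution instead of zeros.
warm_solution, warm_iterations = jacobi_solve(a, b, x0=[1.01, 1.99, 3.0])
```

Starting from a nearby solution reduces the iteration count in this example, which is the kind of benefit the abstract attributes to reusing the previous solution.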

Trends in High-Speed Machine Vision Technology Based on Parallel Image Processing (병렬 영상처리 기반의 고속 머신 비전기술동향)

  • Park, Eun-Su;Choe, Hak-Nam;Kim, Jun-Cheol;Jeong, Eum-Han;Kim, Hak-Il
    • ICROS / v.15 no.3 / pp.31-39 / 2009
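  • This article covers trends in high-speed machine vision technology based on parallel image processing. It introduces MIL, HALCON, and IPP, representative high-speed commercial image processing libraries used in machine vision, and examines parallel processing technologies under active study, such as SSE, OpenMP, and CUDA. These technologies are applied to actual image processing algorithms and their performance is compared with that of the commercial libraries. The article also reviews case studies that apply them to real machine vision tasks such as automated PCB inspection.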

A Study on the Performance of Multiple GPUs (다중 GPU의 성능에 대한 연구)

  • Kim, Yerim;Kim, Youngtae
    • Annual Conference of KIPS / 2016.04a / pp.49-50 / 2016
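  • In this paper, to examine the efficiency of multiple GPUs, we implemented a CUDA program that computes pi ($\pi$) by evaluating a definite integral, and used MPI, a parallel processing library, to drive the multiple GPUs. The experimental results show that performance increases linearly with the number of GPUs.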
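The abstract does not state which integral is evaluated. A common choice, assumed here, is $\pi = \int_0^1 4/(1+x^2)\,dx$ approximated with the midpoint rule. The sketch below splits the slices into contiguous chunks, one per worker, in the way an MPI program might assign one chunk to each GPU. The workers run serially here, and the names `partial_pi` and `pi_by_integration` are hypothetical.

```python
import math


def partial_pi(start: int, stop: int, n: int) -> float:
    """Midpoint-rule contribution of slices start..stop-1 out of n slices."""
    h = 1.0 / n
    total = 0.0
    for k in range(start, stop):
        x = (k + 0.5) * h  # midpoint of slice k
        total += 4.0 / (1.0 + x * x)
    return total * h


def pi_by_integration(n: int, workers: int) -> float:
    """Split n slices into contiguous chunks and sum the partial results.

    Assumption: each chunk stands in for the work given to one device;
    the final sum plays the role of a reduction across processes.
    """
    bounds = [n * w // workers for w in range(workers + 1)]
    return sum(partial_pi(bounds[w], bounds[w + 1], n) for w in range(workers))


estimate = pi_by_integration(100_000, workers=4)
error = abs(estimate - math.pi)
```

Because the chunks share no data until the final sum, the work divides evenly across devices, which is consistent with the linear scaling reported in the abstract.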

Performance Analysis on Parallel Processing of a Hybrid of a CPU and a GPU (CPU와 GPU의 혼합 병렬 계산에 대한 성능 분석)

  • Hwang, Keunchang;Kim, Youngtae
    • Annual Conference of KIPS / 2016.04a / pp.59-60 / 2016
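  • In this paper, we analyze the computational performance obtained when a GPU, which is drawing attention as a high-performance parallel computing device, is used in parallel with a CPU. For the analysis we used a CUDA program that computes pi ($\pi$) by integration, and measured performance while assigning part of the total computation to the CPU relative to the GPU.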

GPU Acceleration of SIFT Detection (SIFT 추출의 GPU 가속)

  • Seo, Kyoung-Taek;Kwon, Oh-Young
    • Annual Conference of KIPS / 2015.10a / pp.238-241 / 2015
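  • Feature point extraction algorithms are used in many computer vision fields, such as object recognition, robotics, and video tracking. Among them, the SIFT algorithm consists of computationally heavy steps, so processing high-resolution images takes a long time and GPU acceleration is needed. In this paper, we process the SIFT algorithm in parallel using CUDA on NVIDIA GPU hardware, and confirm a more than fourfold reduction in execution time, with higher efficiency on high-resolution images that contain many feature points.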

Thermal Imagery-based Object Detection Algorithm for Low-Light Level Nighttime Surveillance System (저조도 야간 감시 시스템을 위한 열영상 기반 객체 검출 알고리즘)

  • Chang, Jeong-Uk;Lin, Chi-Ho
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.19 no.3 / pp.129-136 / 2020
  • In this paper, we propose a thermal imagery-based object detection algorithm for low-light nighttime surveillance systems. Many features selected by the Haar-like feature selection algorithm and the existing AdaBoost algorithm are vulnerable to noise and suffer from similar or overlapping feature sets across the learning samples. The proposed algorithm removes noise from the feature set extracted from surveillance images of low-light night environments, and uses lightweight extended Haar features with the AdaBoost learning algorithm to enable fast, efficient real-time feature selection. Experiments use extended Haar feature points to recognize unpredictable moving objects in low-light nighttime environments. The AdaBoost learning algorithm, taking 800×600 thermal video frames as input, is implemented on the CUDA 9.0 platform for simulation. The object detection success rate was about 90% or higher, and processing was about 30% faster than the results obtained using histogram equalization on general images.

Bandwidth Efficient Summed Area Table Generation for CUDA (CUDA를 이용한 효율적인 합산 영역 테이블의 생성 방법)

  • Ha, Sang-Won;Choi, Moon-Hee;Jun, Tae-Joon;Kim, Jin-Woo;Byun, Hye-Ran;Han, Tack-Don
    • Journal of Korea Game Society / v.12 no.5 / pp.67-78 / 2012
  • A summed area table allows filtering of arbitrary-width box regions for every pixel in constant time per pixel. This makes it useful in image processing applications that require the sum or average of the surrounding pixel intensities. Although calculating the summed area table of an image is primarily a memory-bound job consisting of row- or column-wise summation, previous works had to endure excessive access to the high-latency global memory in order to exploit data parallelism. In this paper, we propose an efficient algorithm for generating the summed area table in the GPGPU environment, where the input is decomposed into square sub-images and intermediate data are propagated between them. As a result, global memory access is almost halved compared to previous methods, making efficient use of the available memory bandwidth. The results show a substantial increase in performance.
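For readers unfamiliar with the data structure, the sketch below shows a serial reference construction of a summed area table and the four-lookup box query that gives constant time per pixel. It does not reproduce the paper's tiled GPU algorithm; the function names and the padding convention (an extra zero row and column) are assumptions made for illustration.

```python
def summed_area_table(image):
    """Return an (h+1) x (w+1) table where sat[y][x] is the sum of all
    pixels in rows 0..y-1 and columns 0..x-1 of image."""
    h = len(image)
    w = len(image[0]) if h else 0
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum along the current row
        for x in range(w):
            row_sum += image[y][x]
            # column-wise accumulation: add the entry directly above
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1,
    using four table lookups regardless of the box size."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
lower_right_2x2 = box_sum(sat, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

The construction is a row-wise pass combined with a column-wise pass, which is why the abstract describes it as memory-bound: almost all of the work is reading and writing the table.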

Diffusion of software innovation: a Petri Net theory perspective (Petri Net 이론 관점에서 본 소프트웨어 혁신의 확산)

  • Han, Jiyeon;Ahn, Jongchang;Lee, Ook
    • Journal of the Korea Academia-Industrial cooperation Society / v.14 no.2 / pp.858-867 / 2013
  • Hardware and software are both being developed within the MPSoC environment, with ongoing work in industry and academia. This study focuses on the software side and classifies it from the perspective of parallel programming design, which can be divided into three models: data, task, and data flow. We then applied Petri Nets to CUDA and HOPES programmers to find out how well they understand each aspect of parallel programming, focusing on the two groups and on how their experience differs. Petri Nets make it easy to describe parallel programs or parallel design patterns for the task, data, and hybrid models. This research can explain what these programmers know about parallel programming and how much they know.

Acceleration of Feature-Based Image Morphing Using GPU (GPU를 이용한 특징 기반 영상모핑의 가속화)

  • Kim, Eun-Ji;Yoon, Seung-Hyun;Lee, Jieun
    • Journal of the Korea Computer Graphics Society / v.20 no.2 / pp.13-24 / 2014
  • In this study, a graphics processing unit (GPU)-based acceleration technique is proposed for feature-based image morphing. The technique uses the depth buffer of the graphics hardware to efficiently calculate the shortest distance between a pixel and the control lines. The pairs of control lines between the source image and the destination image are determined by user input, and the distance function of each control line is rendered using two rectangles and two cones. The distance between each pixel and its nearest control line is stored in the depth buffer through the graphics pipeline, and this is used to conduct the morphing operation efficiently. The per-pixel morphing operation is parallelized using the compute unified device architecture (CUDA) to reduce the morphing time. We demonstrate the efficiency of the proposed technique with several experimental results.
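The quantity that the depth buffer stores can be written down directly on the CPU. The sketch below computes the distance from a pixel to a control line treated as a line segment, then picks the nearest line. It is a plain reference computation, not the rendering-based method of the paper. One plausible reading, stated here as an assumption, is that the rectangles cover pixels whose closest point lies in the interior of the segment and the cones cover pixels closest to an endpoint. The function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # degenerate control line: both endpoints coincide
        return math.hypot(px - ax, py - ay)
    # parameter of the projection of p onto the line, clamped to the segment
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_control_line(p, lines):
    """Return (index, distance) of the control line closest to pixel p.

    Taking the minimum over all lines mirrors what a depth test does
    when every line's distance function is drawn into the same buffer.
    """
    distances = [point_segment_distance(p, a, b) for a, b in lines]
    index = min(range(len(lines)), key=distances.__getitem__)
    return index, distances[index]


# Two hypothetical control lines and one query pixel.
lines = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 5.0), (0.0, 15.0))]
index, distance = nearest_control_line((5.0, 3.0), lines)
```

Each pixel's result is independent of every other pixel's, which is why the per-pixel morphing step maps naturally onto one CUDA thread per pixel.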