• Title/Summary/Keyword: GPU optimization

Parallel String Matching and Optimization Using OpenCL on FPGA (FPGA 상에서 OpenCL을 이용한 병렬 문자열 매칭 구현과 최적화 방향)

  • Yoon, Jin Myung;Choi, Kang-Il;Kim, Hyun Jin
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.1 / pp.100-106 / 2017
  • In this paper, we propose a parallel optimization method for the Aho-Corasick (AC) algorithm and the Parallel Failureless Aho-Corasick (PFAC) algorithm using Open Computing Language (OpenCL) on a Field Programmable Gate Array (FPGA). The low throughput of a string matching engine degrades the performance of network processing. Recently, many researchers have studied string matching engines based on parallel computing, and FPGA vendors offer parallel computing platforms using OpenCL. We apply the AC and PFAC algorithms on a DE1-SoC board with a Cyclone V FPGA, with optimizations that take the FPGA architecture into account. Experiments with the PFAC algorithm consider global id, local id, local memory, and loop unrolling optimizations. The performance improvement with loop unrolling is 129 times greater than that of the AC algorithm without loop unrolling. The performance improvements with loop unrolling are 1.1, 0.2, and 1.5 times greater than those with the global id, local id, and local memory optimizations, respectively.
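
The PFAC idea named in the abstract above can be illustrated with a minimal serial sketch in Python. It is not the authors' OpenCL kernel: the function names `build_trie` and `pfac_match` are hypothetical, and the loop over start positions only stands in for what would be one work-item per input position on the FPGA. The trie has goto transitions only and no failure links, so each "work-item" stops at the first missing transition.

```python
from typing import Dict, List, Tuple


def build_trie(patterns: List[str]) -> Tuple[List[Dict[str, int]], Dict[int, int]]:
    """Build a failureless trie: goto transitions only, no failure links."""
    transitions: List[Dict[str, int]] = [{}]  # state 0 is the root
    accepting: Dict[int, int] = {}            # final state -> pattern index
    for index, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting[state] = index
    return transitions, accepting


def pfac_match(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Return (start position, pattern) pairs for every match in text."""
    transitions, accepting = build_trie(patterns)
    matches: List[Tuple[int, str]] = []
    # Each start position is independent of the others. In a parallel
    # version (assumption), every iteration would be its own work-item.
    for start in range(len(text)):
        state = 0
        for pos in range(start, len(text)):
            next_state = transitions[state].get(text[pos])
            if next_state is None:
                break  # no failure link to follow: this work-item ends
            state = next_state
            if state in accepting:
                matches.append((start, patterns[accepting[state]]))
    return matches


if __name__ == "__main__":
    print(pfac_match("ushers", ["he", "she", "his", "hers"]))
```

Because no work-item ever follows a failure link, the inner loop has a simple, bounded shape (at most the longest pattern length), which is the kind of loop that unrolling optimizations target.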

CUDA Optimization of Super-Resolution Algorithm using ELBP Classifier (ELBP 분류기를 이용한 초해상도 기법의 CUDA 최적화)

  • Choi, Ji Hoon;Song, Byung Cheol
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2016.06a / pp.92-94 / 2016
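  • Various super-resolution methods exist for restoring a high-resolution image from a low-resolution one. Among them, the super-resolution method using an ELBP classifier [1] is a single-image method that obtains the high-resolution image using pre-trained filters. However, when the algorithm runs in a typical CPU environment, it is difficult to obtain images in real time. This paper optimizes the method in a GPU environment using local memory and demonstrates the resulting acceleration. We first briefly describe the algorithm and then examine the speedup obtained as CUDA acceleration techniques [2] are applied in turn. In the end, a 5-fold speedup over the CPU environment is achieved.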

The optimization of deep learning performance for embedded systems using a zero-copy technique (Zero-copy 방식을 활용한 임베디드 환경에서의 딥러닝 성능 최적화)

  • Lee, Minhak;Kang, Woochul
    • Proceedings of the Korea Information Processing Society Conference / 2016.10a / pp.62-63 / 2016
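  • We optimized Caffe, one of the representative deep learning development frameworks, with the memory architecture of embedded systems in mind, and confirmed through measurements that it improves on the existing approach in both processing time and power consumption. Specifically, we applied a zero-copy technique suited to embedded systems that use unified memory, mapping memory regions so that both the CPU and the GPU can access them, which reduced the overhead of memory copies. For the GoogLeNet network model, we observed a 10% improvement in processing speed and a 36% reduction in power consumption.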

Artificial Intelligence for the Fourth Industrial Revolution

  • Jeong, Young-Sik;Park, Jong Hyuk
    • Journal of Information Processing Systems / v.14 no.6 / pp.1301-1306 / 2018
  • Artificial intelligence is one of the key technologies of the Fourth Industrial Revolution. This paper introduces diverse approaches across a range of research fields, such as a model-based MS approach, a deep neural network model, an image edge detection approach, a cross-layer optimization model, an LSSVM approach, a screen design approach, and a CPU-GPU hybrid approach. It also describes research on superintelligence and superconnection for IoT and big data, covering 'superintelligence-based systems and infrastructures', 'superconnection-based IoT and big data systems', 'analysis of IoT-based data and big data', 'infrastructure design for IoT and big data', 'artificial intelligence applications', and 'superconnection-based IoT devices'.

Exploration of Optimization Environment for CUDA-based Cholesky Decomposition (CUDA 기반 숄레스키 분해 성능 최적화 환경 탐색)

  • Junbeom Kang;Myungho Lee;Neungsoo Park
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.15-17 / 2024
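  • Recently, various research fields have succeeded in reducing computation time through parallel processing with the CUDA framework. Among such computations, Cholesky decomposition requires many matrix multiplications while factoring a positive definite matrix into a lower triangular matrix, so exploiting the architectural characteristics of the GPU allows substantial acceleration. In this paper, we therefore implemented a parallel Cholesky decomposition program in which the number of blocks and the number of threads per block, the key factors when assigning work to CUDA cores, can be adjusted. We measured the speedup for three different matrix sizes under various block-count and thread-count configurations. Relative to the longest run time within the same group, the best configuration for each matrix achieved an additional speedup of about 1.80 times for the 1000x1000 matrix and about 2.94 times for the 2000x2000 matrix.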
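
As a rough illustration of the factorization described above, the following Python sketch computes the lower triangular factor column by column. It is a serial simulation under stated assumptions, not the authors' CUDA program: the function name `cholesky_column_parallel` and the parameters `num_blocks` and `threads_per_block` are hypothetical, and the strided loop only mimics how rows of one column might be divided among `num_blocks * threads_per_block` workers.

```python
import math
from typing import List

Matrix = List[List[float]]


def cholesky_column_parallel(a: Matrix, num_blocks: int = 2,
                             threads_per_block: int = 4) -> Matrix:
    """Return lower-triangular L with L * L^T == a (a must be positive definite)."""
    if num_blocks < 1 or threads_per_block < 1:
        raise ValueError("num_blocks and threads_per_block must be positive")
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    total_workers = num_blocks * threads_per_block
    for j in range(n):
        # The diagonal entry must be finished before the rest of column j.
        diag = a[j][j] - sum(lower[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        lower[j][j] = math.sqrt(diag)
        # Rows below the diagonal in column j do not depend on each other,
        # so each simulated worker takes every total_workers-th row.
        for worker in range(total_workers):
            for i in range(j + 1 + worker, n, total_workers):
                partial = sum(lower[i][k] * lower[j][k] for k in range(j))
                lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


if __name__ == "__main__":
    matrix = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in cholesky_column_parallel(matrix):
        print(row)
```

In this serial form the result is the same for every worker count. On a GPU, the block and thread counts would instead change how the independent rows are scheduled, which is the configuration space the abstract explores.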

Color2Gray using Conventional Approaches in Black-and-White Photography (전통적 사진 기법에 기반한 컬러 영상의 흑백 변환)

  • Jang, Hyuk-Su;Choi, Min-Gyu
    • Journal of the Korea Computer Graphics Society / v.14 no.3 / pp.1-9 / 2008
  • This paper presents a novel optimization-based, saliency-preserving method for converting color images to grayscale in a manner consistent with the conventional approaches of black-and-white photographers. In black-and-white photography, a colored filter called a contrast filter has commonly been mounted on the camera to lighten or darken selected colors. In addition, local exposure controls such as dodging and burning are typically employed in the darkroom process to change the exposure of local areas within the print without affecting the overall exposure. Our method seeks a digital version of a conventional contrast filter to preserve visually important image features. Furthermore, conventional burning and dodging techniques are addressed, together with image similarity weights, to give edge-aware local exposure control over the image space. Our method can be efficiently optimized on the GPU. According to the experiments, a CUDA implementation enables 1-megapixel color images to be converted to grayscale at interactive frame rates.

RGB Camera-based Real-time 21 DoF Hand Pose Tracking (RGB 카메라 기반 실시간 21 DoF 손 추적)

  • Choi, Junyeong;Park, Jong-Il
    • Journal of Broadcast Engineering / v.19 no.6 / pp.942-956 / 2014
  • This paper proposes a real-time hand pose tracking method using a monocular RGB camera. Hand tracking is highly ambiguous because a hand has many degrees of freedom. Thus, to reduce the ambiguity, the proposed method adopts a step-by-step estimation scheme: palm pose estimation, finger yaw motion estimation, and finger pitch motion estimation, performed in consecutive order. Assuming a hand to be a plane, the proposed method utilizes a planar hand model, which facilitates hand model regeneration. The hand model regeneration modifies the hand model to fit the current user's hand and improves the robustness and accuracy of the tracking results. The proposed method can work in real time and does not require GPU-based processing. Thus, it can be applied to various platforms, including mobile devices such as Google Glass. The effectiveness and performance of the proposed method will be verified through various experiments.

MPEG-I RVS Software Speed-up for Real-time Application (실시간 렌더링을 위한 MPEG-I RVS 가속화 기법)

  • Ahn, Heejune;Lee, Myeong-jin
    • Journal of Broadcast Engineering / v.25 no.5 / pp.655-664 / 2020
  • Free viewpoint image synthesis is one of the important technologies in the MPEG-I (Immersive) standard. RVS (Reference View Synthesizer), developed by MPEG-I and in use in the MPEG group, is a DIBR (Depth Image-Based Rendering) program that generates an image at a virtual (intermediate) viewpoint from inputs at multiple viewpoints. RVS uses a mesh surface method based on computer graphics and outperforms previous pixel-based methods by 2.5 dB or more. Even though its OpenGL version provides a 10-fold speedup over the non-OpenGL version, it still runs slower than real time, i.e., 0.75 fps on two 2K-resolution input images. In this paper, we analyze the internals of the RVS implementation and modify its structure, achieving a 34-fold speedup and thus real-time performance (22-26 fps) through three key improvements: 1) reuse of OpenGL buffers and texture objects, 2) parallelization of file I/O and OpenGL execution, and 3) parallelization of the GPU shader program and buffer transfer.

Development of a Flooding Detection Learning Model Using CNN Technology (CNN 기술을 적용한 침수탐지 학습모델 개발)

  • Dong Jun Kim;YU Jin Choi;Kyung Min Park;Sang Jun Park;Jae-Moon Lee;Kitae Hwang;Inhwan Jung
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.23 no.6 / pp.1-7 / 2023
  • This paper develops a learning model that classifies normal roads and flooded roads using artificial intelligence technology. We expanded the diversity of the training data using various data augmentation techniques and implemented a model that performs well in various environments. Transfer learning was performed using the CNN-based ResNet152V2 model as the pre-trained model. During training, the performance of the final model was improved through various parameter tuning and optimization steps. Training was implemented in Python on a Google Colab NVIDIA Tesla T4 GPU, and the results showed that flooding situations were detected with very high accuracy on the test dataset.

CUDA-based Parallel Bi-Conjugate Gradient Matrix Solver for BioFET Simulation (BioFET 시뮬레이션을 위한 CUDA 기반 병렬 Bi-CG 행렬 해법)

  • Park, Tae-Jung;Woo, Jun-Myung;Kim, Chang-Hun
    • Journal of the Institute of Electronics Engineers of Korea CI / v.48 no.1 / pp.90-100 / 2011
  • We present a parallel bi-conjugate gradient (Bi-CG) matrix solver for large-scale Bio-FET simulations based on recent graphics processing units (GPUs), which can realize large-scale parallel processing at very low cost. The proposed method focuses on solving the Poisson equation in parallel, a task that requires massive computational resources not only in semiconductor simulation but also in various other fields, including computational fluid dynamics and heat transfer simulations. As a result, our solver is around 30 times faster than traditional methods based on single-core CPU systems in solving the Poisson equation in a 3D FDM (Finite Difference Method) scheme. The proposed method is implemented and tested in NVIDIA's CUDA (Compute Unified Device Architecture) environment, which enables general-purpose parallel processing on GPUs. Unlike other similar GPU-based approaches, which usually apply 32-bit single-precision floating-point arithmetic, we use 64-bit double-precision operations for better convergence. Applications on the CUDA platform are rather easy to implement but very hard to optimize. In this regard, we also discuss the optimization strategy of the proposed method.
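
For readers unfamiliar with Bi-CG, the iteration can be sketched in plain Python as follows. This is the textbook unpreconditioned algorithm on small dense matrices, not the authors' CUDA solver: the names `bicg`, `matvec`, `dot`, and `transpose` are hypothetical helpers, and the dense list-of-lists storage is an assumption made for brevity. Python floats are 64-bit doubles, which matches the precision the abstract mentions.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, x: Vector) -> Vector:
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(u: Vector, v: Vector) -> float:
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def transpose(a: Matrix) -> Matrix:
    return [list(column) for column in zip(*a)]


def bicg(a: Matrix, b: Vector, tol: float = 1e-10, max_iter: int = 1000) -> Vector:
    """Solve a x = b with unpreconditioned Bi-CG, starting from x = 0."""
    a_t = transpose(a)
    x = [0.0] * len(b)
    r = list(b)            # residual b - a x for the zero initial guess
    r_shadow = list(r)     # shadow residual, driven by the transpose of a
    p, p_shadow = list(r), list(r_shadow)
    rho = dot(r_shadow, r)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x
        q = matvec(a, p)
        q_shadow = matvec(a_t, p_shadow)
        denom = dot(p_shadow, q)
        if denom == 0.0 or rho == 0.0:
            raise ArithmeticError("Bi-CG breakdown")
        alpha = rho / denom
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * q_i for r_i, q_i in zip(r, q)]
        r_shadow = [s_i - alpha * q_i for s_i, q_i in zip(r_shadow, q_shadow)]
        rho_next = dot(r_shadow, r)
        beta = rho_next / rho
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
        p_shadow = [s_i + beta * p_i for s_i, p_i in zip(r_shadow, p_shadow)]
        rho = rho_next
    if dot(r, r) ** 0.5 < tol:
        return x
    raise ArithmeticError("Bi-CG did not converge")


if __name__ == "__main__":
    # 1D Poisson-style tridiagonal system (illustrative only).
    poisson = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    print(bicg(poisson, [1.0, 0.0, 1.0]))
```

Each iteration is dominated by two matrix-vector products and a handful of dot products and vector updates, which are the operations a GPU implementation would parallelize.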