• Title/Summary/Keyword: embedded GPU

Implementation and Performance Evaluation of Vector based Rasterization Algorithm using a Many-Core Processor (매니코어 프로세서를 이용한 벡터 기반 래스터화 알고리즘 구현 및 성능평가)

  • Shon, Dong-Koo;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.8 no.2
    • /
    • pp.87-93
    • /
    • 2013
  • In this paper, we implemented a vector-based rasterization algorithm for 3D graphics on a SIMD-based many-core processor consisting of 4,096 processing elements and evaluated its performance. In addition, we compared the performance and efficiency of the rasterization algorithm on the many-core processor against a commercial GPU (Graphics Processing Unit) system consisting of 7 GPUs, each of which has 512 cores. Experimental results showed that the SIMD-based many-core processor outperforms the commercial GPU system in terms of execution time (3.13x speedup), energy efficiency (17.5x better), and area efficiency (13.3x better). These results demonstrate that the SIMD-based many-core processor has potential as an embedded mobile processor.

Deep Learning-Based Real-Time Pedestrian Detection on Embedded GPUs (임베디드 GPU에서의 딥러닝 기반 실시간 보행자 탐지 기법)

  • Vien, An Gia;Lee, Chul
    • Journal of Broadcast Engineering
    • /
    • v.24 no.2
    • /
    • pp.357-360
    • /
    • 2019
  • We propose an efficient single convolutional neural network (CNN) for pedestrian detection on embedded GPUs. We first determine the optimal number of convolutional layers and the hyper-parameters for a lightweight CNN. Then, we employ a multi-scale approach to make the network robust to the sizes of pedestrians in images. Experimental results demonstrate that the proposed algorithm is capable of real-time operation while providing higher detection performance than conventional algorithms.

Parallelized Particle Swarm Optimization with GPU for Real-Time Ballistic Target Tracking (실시간 탄도 궤적 목표물 추적을 위한 GPU 기반 병렬적 입자군집최적화 기법)

  • Yunho, Han;Heoncheol, Lee;Hyeokhoon, Gwon;Wonseok, Choi;Bora, Jeong
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.17 no.6
    • /
    • pp.355-365
    • /
    • 2022
  • This paper addresses the problem of real-time tracking of a high-speed ballistic target. Particle filters can be used to overcome the nonlinearity of the motion and measurement models of a ballistic target. However, particle filters are difficult to apply to real-time systems because they generally require much computation time. This paper proposes an accelerated particle filter using a graphics processing unit (GPU) for real-time ballistic target tracking. The real-time performance of the proposed method was tested and analyzed on a widely used embedded system. Comparison with a conventional particle filter running on a CPU (central processing unit) showed that the proposed method improved real-time performance by significantly reducing computation time.

Fast Computation of DWT and JPEG2000 using GPU (GPU를 이용한 DWT 및 JPEG2000의 고속 연산)

  • Lee, Man-Hee;Park, In-Kyu;Won, Seok-Jin;Cho, Sung-Dae
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.44 no.6
    • /
    • pp.9-15
    • /
    • 2007
  • In this paper, we propose an efficient method for processing the DWT (Discrete Wavelet Transform) on a GPU (Graphics Processing Unit). Since the DWT and EBCOT (embedded block coding with optimized truncation) are the most complicated submodules in JPEG2000, we design a high-performance framework that performs the DWT in the GPU's fragment shader, based on the render-to-texture (RTT) architecture. Experimental results show a significant performance increase: the DWT runs more than 10 times faster on a modern GPU than on a modern CPU. Furthermore, replacing the DWT part of JasPer, the JPEG2000 reference software, makes the overall processing 2 to 16 times faster than the original JasPer. The GPU-driven render-to-texture architecture proposed in this paper can also be used for high-speed processing in general image and computer vision tasks.
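
For orientation, the transform that the abstract above accelerates can be sketched on the CPU. The abstract does not say which wavelet filter the authors used, so the following assumes the reversible LeGall 5/3 lifting filter that JPEG2000 defines for lossless coding, applied to one level of a 1D even-length integer signal. The function names `dwt53_forward` and `dwt53_inverse` are hypothetical, and this is not the paper's fragment-shader implementation.

```python
def dwt53_forward(x):
    """One level of the reversible 5/3 lifting DWT on an even-length list of ints.

    Assumption: whole-sample symmetric extension at both borders.
    Returns (low, high): the low-pass and high-pass coefficient lists.
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    half = n // 2

    # Predict step: each odd sample minus the floor-average of its even neighbours.
    high = []
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # mirror at the right edge
        high.append(x[2 * i + 1] - ((left + right) >> 1))

    # Update step: each even sample plus a rounded quarter of neighbouring details.
    low = []
    for i in range(half):
        prev = high[i - 1] if i > 0 else high[0]  # mirror at the left edge
        low.append(x[2 * i] + ((prev + high[i] + 2) >> 2))
    return low, high


def dwt53_inverse(low, high):
    """Undo dwt53_forward exactly by running the lifting steps in reverse order."""
    half = len(low)
    even = []
    for i in range(half):
        prev = high[i - 1] if i > 0 else high[0]
        even.append(low[i] - ((prev + high[i] + 2) >> 2))

    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[half - 1]
        x.append(even[i])
        x.append(high[i] + ((even[i] + right) >> 1))
    return x


if __name__ == "__main__":
    signal = [10, 12, 14, 13, 9, 7, 8, 20]
    low, high = dwt53_forward(signal)
    print(low, high, dwt53_inverse(low, high) == signal)
```

Each output coefficient depends only on a few neighbouring input samples, which is the property that lets such a transform be evaluated independently per pixel, for example in a fragment shader.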

Multi-DNN Acceleration Techniques for Embedded Systems with Tucker Decomposition and Hidden-layer-based Parallel Processing (터커 분해 및 은닉층 병렬처리를 통한 임베디드 시스템의 다중 DNN 가속화 기법)

  • Kim, Ji-Min;Kim, In-Mo;Kim, Myung-Sun
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.6
    • /
    • pp.842-849
    • /
    • 2022
  • With the development of deep learning technology, DNNs are increasingly used in embedded systems such as unmanned vehicles, drones, and robots. In an autonomous driving system in particular, it is crucial to run several DNNs at the same time, each with high accuracy and a large amount of computation. However, running multiple DNNs simultaneously on a relatively low-performance embedded system increases inference time. This can cause malfunctions because the operation that depends on the inference result is not performed in time. To solve this problem, the solution proposed in this paper first reduces computation by applying Tucker decomposition to DNN models with a large amount of computation, and then makes the DNN models run in parallel inside the GPU as much as possible, at the granularity of hidden layers. Experimental results show that DNN inference time decreases by up to 75.6% compared to the case before applying the proposed technique.
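
A rough sketch of why Tucker decomposition reduces computation, as the abstract above states: the code below counts multiply-accumulate operations (MACs) for a single convolution layer before and after a Tucker-2 style factorization into a 1x1 convolution, a smaller k x k core convolution, and another 1x1 convolution. The abstract does not specify this layout, so the layout, the layer shape, and the ranks are all assumptions, as are the function names.

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    """MACs of a standard k x k convolution producing an h_out x w_out feature map."""
    return c_in * c_out * k * k * h_out * w_out


def tucker2_conv_macs(c_in, c_out, k, h_out, w_out, r_in, r_out):
    """MACs after a Tucker-2 style split (assumed layout, not taken from the paper).

    Assumes stride 1 with 'same' padding, so all three stages share one output size:
      1) 1x1 conv: c_in  -> r_in   (channel reduction)
      2) kxk conv: r_in  -> r_out  (core tensor)
      3) 1x1 conv: r_out -> c_out  (channel restoration)
    """
    pixels = h_out * w_out
    reduce_stage = c_in * r_in * pixels
    core_stage = r_in * r_out * k * k * pixels
    restore_stage = r_out * c_out * pixels
    return reduce_stage + core_stage + restore_stage


if __name__ == "__main__":
    # Hypothetical layer: 256 -> 256 channels, 3x3 kernel, 14x14 output, ranks 64/64.
    original = conv_macs(256, 256, 3, 14, 14)
    factored = tucker2_conv_macs(256, 256, 3, 14, 14, r_in=64, r_out=64)
    print(original, factored, factored / original)
```

The ranks trade accuracy against computation; choosing them is a separate problem that this sketch does not address.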

Real-time signal processing of LADAR image (LADAR 영상의 실시간 신호 처리)

  • Ha, Choong-lim;Nam, Jai-du;Kim, Young-kil
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2017.05a
    • /
    • pp.387-390
    • /
    • 2017
  • With the advent of high-resolution sensors in the embedded field, the demand for heterogeneous computing continues to increase. The Logic Module is an embedded system for controlling LADAR system components and for real-time 3D imaging of laser radar image data. In this paper, we discuss the design of the Logic Module and its signal processing using CPU-GPU heterogeneous computing.

Warp-Based Load/Store Reordering to Improve GPU Time Predictability

  • Huangfu, Yijie;Zhang, Wei
    • Journal of Computing Science and Engineering
    • /
    • v.11 no.2
    • /
    • pp.58-68
    • /
    • 2017
  • While graphics processing units (GPUs) can be used to improve the performance of real-time embedded applications that require high throughput, it is challenging to estimate the worst-case execution time (WCET) of GPU programs, because modern GPUs are designed for improving the average-case performance rather than time predictability. In this paper, a reordering framework is proposed to regulate the access to the GPU data cache, which helps to improve the accuracy of the estimation of GPU L1 data cache miss rate with low performance overhead. Also, with the improved cache miss rate estimation, tighter WCET estimations can be achieved for GPU programs.

Hybrid parallel programming for Heterogeneous Multi-core performance optimization (헤테로지니어스 멀티코어 성능 최적화를 위한 하이브리드 병렬 프로그래밍)

  • Lim, Ju-Ho
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2012.06a
    • /
    • pp.7-9
    • /
    • 2012
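  • CPU designers long tried to improve performance by raising the clock speed of single-core architectures; once that approach reached its limits, CPUs evolved into multicore designs with several cores on one chip. To improve performance further, CPUs have now been combined with GPUs, which were originally built for 3D graphics computation. Combining a CPU with a GPU has shown much better performance than combining CPUs with each other, with lower power consumption and lower cost. In this paper, we combine existing parallelization models and attempt to optimize them for performance on a heterogeneous multicore system of CPU and GPU. On the CPU, we experimented with combining existing parallel programming models: SIMD, shared-memory parallel programming, and message-passing parallel programming. On the GPU, we optimized CUDA. By optimizing and combining the CPU and GPU in this way, we propose a heterogeneous multicore performance optimization method for applications that require high-performance computation.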

Gender Classification System Based on Deep Learning in Low Power Embedded Board (저전력 임베디드 보드 환경에서의 딥 러닝 기반 성별인식 시스템 구현)

  • Jeong, Hyunwook;Kim, Dae Hoe;Baddar, Wisam J.;Ro, Yong Man
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.6 no.1
    • /
    • pp.37-44
    • /
    • 2017
  • As the IoT (Internet of Things) industry spreads, it is becoming very important for devices to recognize user information by themselves, without explicit control. Gender (male or female) in particular is a dominant factor in analyzing user information because of the social and biological differences between men and women. However, since faces of each gender show diverse features, face-based gender classification remains a challenging research field. In addition, to apply a gender classification system to the IoT, the device must be small and must operate at low power. To bring gender classification to real-world use, this paper makes two contributions. The first is a new gender classification algorithm based on deep learning, and the second is a real-time gender classification system implemented on a low-power embedded board. In our experiments, we measured frames per second for gender classification and power consumption in a PC environment and in a mobile GPU environment. The results verify that the deep-learning-based gender classification system works well at low power in the mobile GPU environment compared with the PC environment.

A GPU scheduling framework for applications based on dataflow specification (데이터 플로우 기반 응용들을 위한 GPU 스케줄링 프레임워크)

  • Lee, Yongbin;Kim, Sungchan
    • Journal of Korea Multimedia Society
    • /
    • v.17 no.10
    • /
    • pp.1189-1197
    • /
    • 2014
  • Recently, general-purpose graphics processing units (GPGPUs) have come into wide use in mobile embedded systems such as smartphones and tablet PCs. Because of architectural limitations of mobile GPGPUs, only a single program can occupy a GPU at a time, in a non-preemptive way. As a result, it is difficult to meet application performance requirements such as frame rate or response time if the applications running on a GPU are not scheduled properly. To tackle this difficulty, we propose specifying applications with the synchronous data flow model of computation, so that each application is expressed as nodes and edges. The nodes of the applications are then scheduled onto the GPU individually, unlike conventional scheduling, which treats an application as a whole. This approach allows applications to share a GPU at a finer granularity, the node (or task) level, and provides several benefits, such as eliminating the need to partition applications manually and improving GPU utilization. Furthermore, any scheduling policy can be applied in response to the characteristics of the applications.
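
The node-level sharing described in the abstract above can be illustrated with a minimal simulation. It assumes each application has already been flattened into a valid firing order of its dataflow nodes, that every node runs to completion on a single non-preemptive GPU, and that round-robin is used as a placeholder policy (the abstract says any policy can be applied). The function names and the example workloads are hypothetical.

```python
from collections import deque


def schedule_node_level(apps):
    """Interleave applications on one non-preemptive GPU at node granularity.

    apps: dict mapping an application name to a list of node run times (ms),
          given in an order that respects the application's dataflow edges.
    Returns a timeline of (start, end, app, node_index) tuples.
    """
    # Each queue entry is (application name, index of its next node to run).
    queue = deque((name, 0) for name, nodes in apps.items() if nodes)
    clock = 0
    timeline = []
    while queue:
        name, idx = queue.popleft()
        duration = apps[name][idx]
        timeline.append((clock, clock + duration, name, idx))
        clock += duration  # the node occupies the GPU until it finishes
        if idx + 1 < len(apps[name]):
            queue.append((name, idx + 1))  # round-robin: go to the back of the queue
    return timeline


def completion_time(timeline, name):
    """Time at which the last node of application `name` finishes."""
    return max(end for _, end, app, _ in timeline if app == name)


if __name__ == "__main__":
    workloads = {"video": [4, 4, 4], "ui": [1, 1]}
    for entry in schedule_node_level(workloads):
        print(entry)
```

With these example workloads, scheduling whole applications in the listed order would delay the short `ui` application until all of `video` had finished, whereas node-level interleaving lets it complete earlier.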