• Title/Summary/Keyword: embedded GPU

Direct Pass-Through based GPU Virtualization for Biologic Applications (바이오 응용을 위한 직접 통로 기반의 GPU 가상화)

  • Choi, Dong Hoon; Jo, Heeseung; Lee, Myungho
    • KIPS Transactions on Software and Data Engineering, v.2 no.2, pp.113-118, 2013
  • The current GPU virtualization techniques incur large overheads when executing application programs, mainly due to the fine-grain time-sharing scheduling of the GPU among multiple Virtual Machines (VMs). Besides, the current techniques lack portability, because they include the APIs for the GPU computations in the VM monitor. In this paper, we propose a low-overhead, high-performance GPU virtualization approach on a heterogeneous HPC system based on the open-source Xen. Our proposed techniques are tailored to bio applications. In our virtualization framework, once a GPU is assigned to a VM, that VM occupies the GPU exclusively instead of time-sharing it. This improves the performance of the applications and the utilization of the GPUs. Our techniques also allow direct pass-through to the GPU by using the IOMMU virtualization features embedded in the hardware, for high portability. Experimental studies using microbiology genome analysis applications show that our proposed techniques based on direct pass-through significantly reduce the overheads compared with previous Domain0-based approaches. Furthermore, our approach closely matches, and in some cases improves on, the bare-machine performance of the applications.
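
The paper modifies the Xen layer rather than the CUDA stack, but as a rough illustration of what exclusive IOMMU pass-through provides, a CUDA program inside the guest VM can enumerate and use the GPU exactly as on bare metal. The minimal guest-side check below is a sketch, not code from the paper.

```cuda
// Minimal sketch (not from the paper): inside a guest VM with a pass-through
// GPU, the CUDA runtime enumerates the device exactly as on bare metal.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No GPU visible in this VM (pass-through not active?)\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // With exclusive pass-through, this VM owns the device entirely.
        printf("Device %d: %s, %zu MB global memory\n",
               i, prop.name, prop.totalGlobalMem >> 20);
    }
    return 0;
}
```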

Embedded GPU based Fast Image Processing for Mobile Device (임베디드 GPU 기반 영상처리 고속화 방법)

  • Lee, Kang-Woon; Beak, A-Ram; Cho, Haechul
    • Proceedings of the Korean Society of Broadcast Engineers Conference, 2014.11a, pp.39-40, 2014
  • As mobile devices equipped with cameras have become widespread, various applications that use image processing in mobile environments are spreading. Because images involve a relatively large amount of data compared with other kinds of information, performing image processing in a mobile environment is subject to physical constraints such as processing speed, power, and heat. To overcome these problems, this paper presents a method for accelerating image processing on mobile devices by using the embedded GPU (Graphics Processing Unit) as a coprocessor. In the experiments, the superiority of the acceleration method is verified and its characteristics are analyzed by comparing the performance of commonly used image processing algorithms on the CPU (Central Processing Unit) and on the GPU.
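
The experiments above run through the device's mobile GPU programming interface rather than CUDA, but the per-pixel parallelism being exploited can be sketched with a simple CUDA kernel. The RGB-to-grayscale conversion below is a hypothetical example of such an operation, not one of the paper's benchmarked algorithms.

```cuda
// Illustrative sketch only: one thread per pixel converts RGB to grayscale,
// the kind of data-parallel image operation offloaded to the GPU.
#include <cuda_runtime.h>

__global__ void rgbToGray(const unsigned char* rgb, unsigned char* gray,
                          int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float r = rgb[3 * idx + 0];
    float g = rgb[3 * idx + 1];
    float b = rgb[3 * idx + 2];
    // Standard luminance weights; the paper's algorithms may differ.
    gray[idx] = static_cast<unsigned char>(0.299f * r + 0.587f * g + 0.114f * b);
}

void launchRgbToGray(const unsigned char* d_rgb, unsigned char* d_gray,
                     int width, int height) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    rgbToGray<<<grid, block>>>(d_rgb, d_gray, width, height);
}
```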

Parallel LDPC Decoder for CMMB on CPU and GPU Using OpenCL (OpenCL을 활용한 CPU와 GPU 에서의 CMMB LDPC 복호기 병렬화)

  • Park, Joo-Yul; Hong, Jung-Hyun; Chung, Ki-Seok
    • IEMEK Journal of Embedded Systems and Applications, v.11 no.6, pp.325-334, 2016
  • Recently, Open Computing Language (OpenCL) has been proposed to provide a framework that supports heterogeneous computing platforms. By using an OpenCL framework, digital communication systems can support various protocols in a unified computing environment to achieve both high portability and high performance. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes for China Multimedia Mobile Broadcasting (CMMB) on a heterogeneous platform. Each step of LDPC decoding has different parallelization characteristics. In this paper, steps suitable for task-level parallelization are executed on the CPU, and steps suitable for data-level parallelization are processed by the GPU. To improve the performance of the proposed OpenCL kernels for LDPC decoding operations, explicit thread scheduling, loop-unrolling, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance by using heterogeneous multi-core processors on a unified computing framework.
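
The decoder in the paper is written in OpenCL; purely to illustrate the data-level parallelism assigned to the GPU, the sketch below performs a simplified (unscaled) min-sum check-node update with one thread per check node. The data layout and names are assumptions for illustration, not the paper's kernels.

```cuda
// Hypothetical sketch of a min-sum check-node update: one thread per check
// node, data-parallel across all check nodes (the paper uses OpenCL kernels).
__global__ void checkNodeUpdate(const int* rowPtr,      // CSR-style offsets per check node
                                const float* varToChk,  // incoming messages per edge
                                float* chkToVar,        // outgoing messages per edge
                                int numCheckNodes) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCheckNodes) return;

    int begin = rowPtr[c], end = rowPtr[c + 1];
    float min1 = 1e30f, min2 = 1e30f;   // two smallest magnitudes
    int   minIdx = -1;
    float signProd = 1.0f;

    // First pass: overall sign and the two smallest magnitudes.
    for (int e = begin; e < end; ++e) {
        float m = varToChk[e];
        signProd *= (m < 0.0f) ? -1.0f : 1.0f;
        float mag = fabsf(m);
        if (mag < min1) { min2 = min1; min1 = mag; minIdx = e; }
        else if (mag < min2) { min2 = mag; }
    }
    // Second pass: extrinsic message excludes the edge's own contribution.
    for (int e = begin; e < end; ++e) {
        float sign = signProd * ((varToChk[e] < 0.0f) ? -1.0f : 1.0f);
        chkToVar[e] = sign * ((e == minIdx) ? min2 : min1);
    }
}
```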

Development of People Counting Algorithm using Stereo Camera on NVIDIA Jetson TX2

  • Lee, Gyucheol; Yoo, Jisang; Kwon, Soonchul
    • International Journal of Advanced Smart Convergence, v.7 no.3, pp.8-14, 2018
  • In the field of surveillance cameras, it is possible to increase people detection accuracy by using depth information indicating the distance between the camera and the object. In general, depth information is obtained by calculating the disparity between the views of a stereo camera. However, this method is difficult to run in real time in an embedded environment because of the large amount of computation. The Jetson TX2, released by NVIDIA in March 2017, is a high-performance embedded board whose GPU enables parallel processing. In this paper, a stereo camera is attached to a Jetson TX2 to acquire depth information in real time, and we propose a people counting method using the acquired depth information. Experimental results show that the proposed method achieved a counting accuracy of 98.6% and operated in real time.
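
The paper's counting pipeline is not reproduced here, but the kind of GPU-friendly, per-pixel work involved can be sketched as follows: with an overhead stereo camera, pixels whose depth falls within an assumed head-height band are marked in a binary mask, and connected components of that mask can then be counted as head candidates on the host. The thresholds and layout below are hypothetical.

```cuda
// Illustrative sketch only (not the paper's algorithm): mark pixels whose
// depth lies in a head-height band; blob counting happens on the host.
__global__ void headBandMask(const float* depth, unsigned char* mask,
                             int width, int height,
                             float nearHead, float farHead) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float d = depth[idx];                       // distance from camera in meters
    mask[idx] = (d > nearHead && d < farHead);  // 1 inside the head-height band
}
```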

Development of a Real-Time Automatic Passenger Counting System using Head Detection Based on Deep Learning

  • Kim, Hyunduk; Sohn, Myoung-Kyu; Lee, Sang-Heon
    • Journal of Information Processing Systems, v.18 no.3, pp.428-442, 2022
  • A reliable automatic passenger counting (APC) system is a key element in transportation, related to the efficient scheduling and management of transport routes. In this study, we introduce a lightweight head detection network using deep learning that is applicable to an embedded system. Object detection algorithms using deep learning have been found to be successful. However, these algorithms essentially need a graphics processing unit (GPU) to run in real time. Therefore, we modify a Tiny-YOLOv3 network with several techniques to speed up the proposed network and make it more accurate in a non-GPU environment. Finally, we introduce an APC system that runs in real time on embedded systems, using the proposed head detection algorithm. We implement and test the proposed APC system on a Samsung ARTIK 710 board. The experimental results on three public head datasets show the detection accuracy and efficiency of the proposed head detection network compared with Tiny-YOLOv3. Moreover, to test the proposed APC system, we measured the accuracy and recognition speed over 50 instances of entering and 50 instances of exiting. These experiments showed 99% accuracy and a recognition time of 0.041 seconds even though only the CPU was used.
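
Since the system runs entirely on the CPU, the illustrative sketch below is host-only C++ and shows only the final counting step: tracked head positions are tested against a virtual counting line, and crossings in each direction increment the entry or exit count. The structure and names are hypothetical, not the paper's implementation.

```cuda
// Hypothetical host-side sketch (plain C++, no GPU): count entries and exits
// from tracked head positions crossing a virtual counting line. The paper's
// detection and tracking stages are not reproduced here.
#include <vector>

struct TrackPoint { float y_prev; float y_curr; };

void updateCounts(const std::vector<TrackPoint>& tracks, float lineY,
                  int& entered, int& exited) {
    for (const TrackPoint& t : tracks) {
        if (t.y_prev < lineY && t.y_curr >= lineY) ++entered;      // crossed downward
        else if (t.y_prev >= lineY && t.y_curr < lineY) ++exited;  // crossed upward
    }
}
```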

Performance Evaluation of Efficient Vision Transformers on Embedded Edge Platforms (임베디드 엣지 플랫폼에서의 경량 비전 트랜스포머 성능 평가)

  • Minha Lee; Seongjae Lee; Taehyoun Kim
    • IEMEK Journal of Embedded Systems and Applications, v.18 no.3, pp.89-100, 2023
  • Recently, on-device artificial intelligence (AI) solutions using mobile devices and embedded edge devices have emerged in various fields, such as computer vision, to address network traffic burdens, low-energy operation, and security problems. Although vision transformer deep learning models have outperformed conventional convolutional neural network (CNN) models in computer vision, they require more computation and more parameters than CNN models. Thus, they are not directly applicable to embedded edge devices with limited hardware resources. Many researchers have proposed various model compression methods and lightweight architectures for vision transformers; however, only a few studies evaluate the effects of these model compression techniques on performance. To address this gap, this paper presents a performance evaluation of vision transformers on embedded platforms. We investigated the behaviors of three vision transformers: DeiT, LeViT, and MobileViT. Each model's performance was evaluated by accuracy and inference time on edge devices using the ImageNet dataset. We assessed the effects of the quantization method applied to the models on latency improvement and accuracy degradation by profiling the proportion of response time occupied by major operations. In addition, we evaluated the performance of each model on GPU- and EdgeTPU-based edge devices. In our experimental results, LeViT showed the best performance on CPU-based edge devices, and DeiT-small showed the highest performance improvement on GPU-based edge devices. In addition, only the MobileViT models showed performance improvement on the EdgeTPU. Summarizing the profiling results, the degree of performance improvement of each vision transformer model was highly dependent on the proportion of operations that could be optimized on the target edge device. In summary, to apply vision transformers to on-device AI solutions, both proper operation composition and optimizations specific to the target edge device must be considered.
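
As a loose illustration of the kind of post-training quantization whose latency/accuracy trade-off is profiled above, the sketch below applies symmetric per-tensor INT8 quantization and dequantization on the GPU. Deployment frameworks normally perform this step internally; the kernels and names here are assumptions for illustration only.

```cuda
// Illustrative sketch of symmetric per-tensor INT8 quantization, the kind of
// compression whose latency/accuracy trade-off the paper profiles.
__global__ void quantizeInt8(const float* x, signed char* q, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = rintf(x[i] / scale);             // round to nearest integer
    v = fminf(fmaxf(v, -127.0f), 127.0f);      // clamp to the INT8 range
    q[i] = static_cast<signed char>(v);
}

__global__ void dequantizeInt8(const signed char* q, float* x, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    x[i] = static_cast<float>(q[i]) * scale;   // approximate original value
}
```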

Analysis of Implementing Mobile Heterogeneous Computing for Image Sequence Processing

  • BAEK, Aram; LEE, Kangwoon; KIM, Jae-Gon; CHOI, Haechul
    • KSII Transactions on Internet and Information Systems (TIIS), v.11 no.10, pp.4948-4967, 2017
  • On mobile devices, image sequences are widely used for multimedia applications such as computer vision, video enhancement, and augmented reality. However, real-time processing on mobile devices is still a challenge because of hardware constraints and demands for higher-resolution images. Recently, heterogeneous computing methods that utilize both a central processing unit (CPU) and a graphics processing unit (GPU) have been researched to accelerate image sequence processing. This paper deals with various optimization techniques such as parallel processing by the CPU and GPU, distributed processing on the CPU, frame buffer objects, and double buffering for parallel and/or distributed tasks. Using these optimization techniques both individually and in combination, several heterogeneous computing structures were implemented and their effectiveness was analyzed. The experimental results show that heterogeneous computing enables execution up to 3.5 times faster than CPU-only processing.
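
The implementation in the paper targets mobile GPU APIs, but the double-buffering idea it applies can be sketched with CUDA streams: while the GPU processes frame i in one buffer, frame i+1 is transferred into the other buffer on a second stream. The sketch below is illustrative only; for the copies to actually overlap, the host frames would need to be allocated as pinned memory (e.g., with cudaMallocHost).

```cuda
// Illustrative CUDA sketch of double buffering for an image sequence:
// transfers and kernels ping-pong between two device buffers and two streams.
#include <cuda_runtime.h>

__global__ void processFrame(unsigned char* frame, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) frame[i] = 255 - frame[i];   // placeholder per-pixel work
}

void processSequence(unsigned char** hostFrames, int numFrames, int numPixels) {
    // Grayscale frames assumed: one byte per pixel. Host frames should be
    // pinned (cudaMallocHost) for the async copies to overlap with kernels.
    unsigned char* dev[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void**)&dev[b], numPixels);
        cudaStreamCreate(&stream[b]);
    }
    int threads = 256, blocks = (numPixels + threads - 1) / threads;
    for (int f = 0; f < numFrames; ++f) {
        int b = f & 1;   // ping-pong between the two buffers/streams
        cudaMemcpyAsync(dev[b], hostFrames[f], numPixels,
                        cudaMemcpyHostToDevice, stream[b]);
        processFrame<<<blocks, threads, 0, stream[b]>>>(dev[b], numPixels);
        cudaMemcpyAsync(hostFrames[f], dev[b], numPixels,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaStreamDestroy(stream[b]);
        cudaFree(dev[b]);
    }
}
```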

CUDA based parallel design of a shot change detection algorithm using frame segmentation and object movement

  • Kim, Seung-Hyun; Lee, Joon-Goo; Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information, v.20 no.7, pp.9-16, 2015
  • This paper proposes a parallel design of a shot change detection algorithm using frame segmentation and moving blocks. In the proposed approach, the highly parallel components, such as frame histogram calculation, block histogram calculation, the Otsu threshold setting function, the frame moving operation, and block histogram comparison, are designed to run in parallel on an NVIDIA GPU. In order to minimize memory access delay and guarantee fast computation, the output of one GPU kernel becomes the input of another kernel in a pipelined manner using the shared memory of the GPU. In addition, the optimal sizes of CUDA processing blocks and threads are estimated through preliminary experiments. In the experimental evaluation of the proposed shot change detection algorithm, the detection rate of the GPU-based parallel algorithm is the same as that of the CPU-based algorithm, but the average processing time is about 6 to 8 times faster.
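
A representative piece of such a design is the frame histogram step. The sketch below is an illustration rather than the paper's code: each thread block accumulates a 256-bin histogram in shared memory and then merges the block-local histograms into the global result, reducing contention on global atomics.

```cuda
// Illustrative frame-histogram kernel: per-block shared-memory histograms
// merged into a global 256-bin histogram (hist must be zero-initialized).
__global__ void frameHistogram(const unsigned char* pixels, int numPixels,
                               unsigned int* hist) {
    __shared__ unsigned int localHist[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < numPixels; i += stride)
        atomicAdd(&localHist[pixels[i]], 1u);   // accumulate in shared memory
    __syncthreads();

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&hist[i], localHist[i]);      // merge into global memory
}
```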

A Study on GPU Computing of Bi-conjugate Gradient Method for Finite Element Analysis of the Incompressible Navier-Stokes Equations (유한요소 비압축성 유동장 해석을 위한 이중공액구배법의 GPU 기반 연산에 대한 연구)

  • Yoon, Jong Seon; Jeon, Byoung Jin; Jung, Hye Dong; Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B, v.40 no.9, pp.597-604, 2016
  • A parallel algorithm for the bi-conjugate gradient method was developed based on CUDA for parallel computation of the incompressible Navier-Stokes equations. The governing equations were discretized using a splitting P2P1 finite element method. An asymmetric stenotic flow problem was solved to validate the proposed algorithm, and the parallel performance of the GPU was then examined by measuring the elapsed times. Further, the GPU performance for sparse matrix-vector multiplication was investigated with a matrix from a fluid-structure interaction problem. A kernel was written to simultaneously compute the inner product of each row of the sparse matrix with a vector. In addition, the kernel was optimized to improve performance by using both parallel reduction and memory coalescing. In the kernel construction, the effect of the warp on the parallel performance of the present CUDA implementation was also examined. The present GPU computation was more than 7 times faster than a single CPU in double precision.
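
A common way to realize the row-wise inner product with both memory coalescing and a parallel reduction, in the spirit of the kernel described above, is a warp-per-row CSR sparse matrix-vector product. The sketch below is illustrative and uses assumed names and storage format rather than the paper's code.

```cuda
// Illustrative warp-per-row CSR SpMV: consecutive lanes read consecutive
// nonzeros (coalesced loads) and the 32 partial sums are reduced with
// warp shuffle instructions.
__global__ void spmvCsrWarpPerRow(const int* rowPtr, const int* colIdx,
                                  const double* val, const double* x,
                                  double* y, int numRows) {
    int lane = threadIdx.x & 31;                              // index within the warp
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    if (warpId >= numRows) return;

    double sum = 0.0;
    for (int e = rowPtr[warpId] + lane; e < rowPtr[warpId + 1]; e += 32)
        sum += val[e] * x[colIdx[e]];

    // Warp-level parallel reduction of the partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warpId] = sum;
}
```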

Fast Medical Volume Decompression Using GPGPU (GPGPU를 이용한 고속 의료 볼륨 영상의 압축 복원)

  • Kye, Hee-Won
    • Journal of Korea Multimedia Society, v.15 no.5, pp.624-631, 2012
  • In many medical imaging systems, volume datasets are stored in a compressed form, so the dataset has to be decompressed before it is visualized. Since the decompression process takes quite a long time, we present an acceleration method for medical volume decompression using the GPU. Our method supports both lossy and lossless compression, and progressive refinement is possible to satisfy varying user requirements. Moreover, our decompression method is well parallelized for the GPU, so the decompression takes very little time. Finally, we designed the decompression and volume rendering to work in one framework so that selective decompression is available. As a result, we gained an additional improvement in volume decompression.
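
The paper's actual codec is not described in the abstract, so the sketch below assumes, purely for illustration, a fixed-rate scheme in which each 8x8x8 brick of the volume is stored as quantized bytes with a per-brick minimum and scale; one thread block reconstructs one brick, which is also the granularity at which selective decompression could be applied.

```cuda
// Purely illustrative sketch (not the paper's codec): launch with one block
// of 512 threads per brick, e.g. decodeBricks<<<numBricks, 512>>>(...).
__global__ void decodeBricks(const unsigned char* qData,   // 512 bytes per brick
                             const float* brickMin, const float* brickScale,
                             float* volume, int bricksX, int bricksY, int bricksZ,
                             int dimX, int dimY, int dimZ) {
    int brick = blockIdx.x;                 // linear brick index over the grid
    int bz = brick / (bricksX * bricksY);
    int by = (brick / bricksX) % bricksY;
    int bx = brick % bricksX;

    int t = threadIdx.x;                    // one thread per voxel of the 8x8x8 brick
    int lz = t / 64, ly = (t / 8) % 8, lx = t % 8;
    int x = bx * 8 + lx, y = by * 8 + ly, z = bz * 8 + lz;
    if (x >= dimX || y >= dimY || z >= dimZ) return;

    // Dequantize the byte back to a float voxel value.
    float v = brickMin[brick] + brickScale[brick] * qData[brick * 512 + t];
    volume[(z * dimY + y) * dimX + x] = v;
}
```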