• Title/Summary/Keyword: Compute unified device architecture

Accelerating Numerical Analysis of Reynolds Equation Using Graphic Processing Units (그래픽처리장치를 이용한 레이놀즈 방정식의 수치 해석 가속화)

  • Myung, Hun-Joo;Kang, Ji-Hoon;Oh, Kwang-Jin
    • Tribology and Lubricants / v.28 no.4 / pp.160-166 / 2012
  • This paper presents a Reynolds equation solver for hydrostatic gas bearings, implemented to run on graphics processing units (GPUs). The original analysis code for the central processing unit (CPU) was ported to the GPU using the compute unified device architecture (CUDA). The red-black Gauss-Seidel (RBGS) algorithm was employed instead of the original Gauss-Seidel algorithm for the iterative pressure solver, because the latter has data dependencies between neighboring nodes. The GPU program was tested on an NVIDIA GTX580 system and compared with the original CPU program on an AMD Llano system. In the iterative pressure calculation, the GPU program ran 20-100 times faster than the original CPU code. A comparison of wall-clock times including all pre- and post-processing code showed that the GPU code was still 4-12 times faster than the CPU code for the target problem.
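The red-black ordering mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' solver: it assumes a 5-point Poisson model problem rather than the Reynolds equation, and the names `red_black_gauss_seidel` and `solve_demo` are hypothetical. The point is that every node of one colour reads only nodes of the other colour, so all nodes of a colour could be updated in parallel.

```python
def red_black_gauss_seidel(p, rhs, h, sweeps):
    """Relax the 5-point model problem -laplace(p) = rhs in place.

    p   : list of lists; boundary entries are held fixed (Dirichlet)
    rhs : list of lists with the same shape as p
    h   : grid spacing
    """
    n, m = len(p), len(p[0])
    for _ in range(sweeps):
        # colour 0: "red" nodes (i + j even); colour 1: "black" nodes (i + j odd)
        for colour in (0, 1):
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != colour:
                        continue
                    # All four neighbours have the opposite colour, so nodes
                    # of the current colour do not depend on each other.
                    p[i][j] = 0.25 * (p[i - 1][j] + p[i + 1][j]
                                      + p[i][j - 1] + p[i][j + 1]
                                      + h * h * rhs[i][j])
    return p


def solve_demo(n=9, sweeps=200):
    # Assumed test case: p = 1 on the top edge, 0 on the other edges, rhs = 0.
    p = [[0.0] * n for _ in range(n)]
    p[0] = [1.0] * n
    rhs = [[0.0] * n for _ in range(n)]
    return red_black_gauss_seidel(p, rhs, 1.0 / (n - 1), sweeps)
```

On a GPU, each colour pass would typically map to one kernel launch with one thread per node of that colour; the sequential loops here only show the update order.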

All Phase Discrete Sine Biorthogonal Transform and Its Application in JPEG-like Image Coding Using GPU

  • Shan, Rongyang;Zhou, Xiao;Wang, Chengyou;Jiang, Baochen
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.9 / pp.4467-4486 / 2016
  • The discrete cosine transform (DCT) based JPEG standard significantly improves the coding efficiency of image compression, but it produces unacceptably serious blocking artifacts at low bit rates and is inefficient for high-definition images. In the light of all phase digital filtering theory, this paper proposes a novel transform based on the discrete sine transform (DST), called the all phase discrete sine biorthogonal transform (APDSBT). Applying APDSBT to the JPEG scheme reduces the blocking artifacts significantly. The reconstructed image of APDSBT-JPEG is better than that of DCT-JPEG in terms of both objective quality and subjective effect. To improve the efficiency of JPEG coding, the structure of JPEG is analyzed. We analyze key factors in the design and evaluation of JPEG compression on massively parallel graphics processing units (GPUs) using the compute unified device architecture (CUDA) programming model. Experimental results show that the maximum speedup ratio of the parallel APDSBT-JPEG algorithm can exceed 100 times even on a low-end GPU. Several new parallel strategies for improving the performance of the parallel algorithm are also presented. With the optimal strategy, the efficiency can be improved by more than 10%.

Parallel Computation of FDTD algorithm using CUDA (CUDA를 이용한 FDTD 알고리즘의 병렬처리)

  • Lee, Ho-Young;Park, Jong-Hyun;Kim, Jun-Seong
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.4 / pp.82-87 / 2010
  • Modern GPUs (Graphics Processing Units) provide higher computing capability than general-purpose CPUs (Central Processing Units). With support for programmability of the graphics pipeline, GPGPU (General-Purpose computation on GPUs) has gained much attention and expanded its application area. This paper compares sequential and massively parallel implementations of the FDTD (Finite-Difference Time-Domain) algorithm using CUDA (Compute Unified Device Architecture). Experimental results show up to 45X speedup over conventional CPU execution.
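To show why FDTD parallelizes well, here is a minimal one-dimensional sketch in normalized units. It is a textbook-style leapfrog update, not the implementation from the paper; the function name `fdtd_1d`, the grid size, the Courant number, and the Gaussian source are all assumptions chosen for illustration. Within each half step, every cell update reads only values from the other field, so the inner loops have no loop-carried dependency.

```python
import math


def fdtd_1d(n_cells=200, n_steps=150, source_pos=100, courant=0.5):
    """Leapfrog update of Ez and Hy on a staggered 1D grid (normalized units)."""
    ez = [0.0] * n_cells  # electric field at integer grid points
    hy = [0.0] * n_cells  # magnetic field at half-integer grid points
    for t in range(n_steps):
        # Half step 1: update H from the spatial difference of E.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # Half step 2: update E from the spatial difference of H.
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Additive ("soft") Gaussian pulse source, an assumed excitation.
        ez[source_pos] += math.exp(-((t - 30.0) / 10.0) ** 2)
    return ez, hy
```

In a CUDA version, each of the two inner loops would naturally become a kernel with one thread per cell, with the time loop kept on the host.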

An Improved Hybrid Approach to Parallel Connected Component Labeling using CUDA

  • Soh, Young-Sung;Ashraf, Hadi;Kim, In-Taek
    • Journal of the Institute of Convergence Signal Processing / v.16 no.1 / pp.1-8 / 2015
  • In many image processing tasks, connected component labeling (CCL) is performed to extract regions of interest. CCL was usually done sequentially when image resolution was relatively low and the number of input channels was small. As image resolution rises to HD or Full HD and the number of input channels increases, sequential CCL becomes too time-consuming for real-time applications. To cope with this situation, parallel CCL frameworks were introduced in which multiple cores are used simultaneously. Several parallel CCL methods have been proposed in the literature. Among them are the NSZ label equivalence (NSZ-LE) method [1], the modified 8 directional label selection (M8DLS) method [2], and the HYBRID1 method [3]. Soh [3] showed that HYBRID1 outperforms NSZ-LE and M8DLS, and argued that HYBRID1 is by far the best. In this paper we propose an improved hybrid parallel CCL algorithm, termed HYBRID2, that hybridizes M8DLS with label backtracking (LB), and show that it runs around 20% faster than HYBRID1 for various kinds of images.
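As background for the parallel CCL methods named above, the following sketch shows the simplest data-parallel formulation: iterative minimum-label propagation over 8-connected neighbours. It is deliberately not NSZ-LE, M8DLS, HYBRID1, or HYBRID2, whose details are not given in this abstract; the function name `label_components` is hypothetical. Each pass reads the previous label image and writes a new one, so every pixel could be handled by its own thread.

```python
def label_components(image):
    """Label 8-connected foreground regions by repeated min-label propagation."""
    rows, cols = len(image), len(image[0])
    # Each foreground pixel starts with a unique label (1-based raster index);
    # background pixels keep label 0.
    labels = [[r * cols + c + 1 if image[r][c] else 0 for c in range(cols)]
              for r in range(rows)]
    changed = True
    while changed:
        changed = False
        new_labels = [row[:] for row in labels]  # double buffer
        for r in range(rows):
            for c in range(cols):
                if labels[r][c] == 0:
                    continue
                best = labels[r][c]
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                            best = min(best, labels[rr][cc])
                if best != labels[r][c]:
                    new_labels[r][c] = best
                    changed = True
        labels = new_labels
    return labels
```

This naive scheme needs a number of passes that grows with the size of the largest component, which is one reason the literature uses label equivalence and backtracking techniques instead.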

Fundamental Function Design of Real-Time Unmanned Monitoring System Applying YOLOv5s on NVIDIA TX2™ AI Edge Computing Platform

  • LEE, SI HYUN
    • International journal of advanced smart convergence / v.11 no.2 / pp.22-29 / 2022
  • In this paper, to design a real-time unmanned monitoring system, the YOLOv5s (small) object detection model was applied on the NVIDIA TX2™ AI (Artificial Intelligence) edge computing platform to implement the fundamental function of detecting objects in real time. YOLOv5s was selected for our real-time unmanned monitoring system based on a performance evaluation of object detection algorithms (for example, R-CNN, SSD, RetinaNet, and YOLOv5). In addition, the performance of the four YOLOv5 models (small, medium, large, and xlarge) was compared and evaluated. Based on these results, the YOLOv5s model, which suits the design purpose of this paper, was ported to the NVIDIA TX2™ AI edge computing system and confirmed to operate normally. The resulting real-time unmanned monitoring system can be applied to various fields such as security or monitoring systems. Future research will apply NMS (Non-Maximum Suppression) modification, model reconstruction, and parallel programming techniques using CUDA (Compute Unified Device Architecture) to improve object detection speed and performance.
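For readers unfamiliar with the NMS step named as future work, a minimal sketch of standard greedy non-maximum suppression follows. It is the generic textbook procedure, not the modification the author plans; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

The greedy loop is inherently sequential over kept boxes, while the pairwise IoU computations are independent of each other.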

Performance Improvement in HTTP Packet Extraction from Network Traffic using GPGPU (GPGPU 를 이용한 네트워크 트래픽에서의 HTTP 패킷 추출 성능 향상)

  • Han, SangWoon;Kim, Hyogon
    • Proceedings of the Korea Information Processing Society Conference / 2011.11a / pp.718-721 / 2011
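  • Real-time analysis of HTTP (Hypertext Transfer Protocol) traffic, for the purpose of detecting or blocking DDoS (Distributed Denial-of-Service) attacks or harmful traffic aimed at web services, is an essential feature of nearly every network traffic security solution. However, as the volume of HTTP traffic measured in real time grows exponentially over time, the performance burden of analyzing HTTP traffic packet by packet in real time keeps increasing. Because this burden can no longer be relieved at the application level, most attempts to address it rely on costly software accelerators or hardware-dependent dedicated equipment. In this paper, building on research into GPGPU (General-Purpose computation on Graphics Processing Units), which puts the GPU (Graphics Processing Unit) of the graphics card found in most current PCs to general-purpose use, we use NVIDIA's CUDA (Compute Unified Device Architecture) to improve, at the application level, the performance of extracting HTTP packets from network traffic. Considering the HTTP packet extraction operation alone, the GPU achieved more than 10 times the performance of the CPU.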

Development of Diffusive Wave Rainfall-Runoff Model Based on CUDA FORTRAN (CUDA FORTRAN 기반 확산파 강우유출모형 개발)

  • Kim, Boram;Kim, Hyeong-Jun;Yoon, Kwang Seok
    • Proceedings of the Korea Water Resources Association Conference / 2021.06a / pp.287-287 / 2021
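  • In this study, a diffusive wave rainfall-runoff model was developed using CUDA (Compute Unified Device Architecture) Fortran. CUDA Fortran is a general-purpose computing on graphics processing units (GPGPU) technology that allows parallel algorithms executed on the graphics processing unit (GPU) to be written in the Fortran language. Because a GPU consists of a large number of arithmetic logic units (ALUs) specialized for graphics processing, it can perform more operations at once than a central processing unit (CPU). Accordingly, a CUDA Fortran-based diffusive wave model can shorten the computation time of numerical simulations with a distributed rainfall-runoff model. The governing equations of the distributed model consist of the diffusive wave model and the Green-Ampt model, and the diffusive wave model was discretized using the finite volume method. The accuracy of the CUDA Fortran-based diffusive wave model was compared against previously published hydraulic experiment results and a CPU-based rainfall-runoff model, and its computational efficiency was compared against the CPU-based diffusive wave model. The results of the CUDA Fortran-based model were similar to those of the hydraulic experiments and the CPU-based rainfall-runoff model. The computation time was also shorter than that of the CPU-based diffusive wave model, by up to about 100 times on the equipment used in this study.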

Real-time Depth Image Refinement using Hierarchical Joint Bilateral Filter (계층적 결합형 양방향 필터를 이용한 실시간 깊이 영상 보정 방법)

  • Shin, Dong-Won;Ho, Yo-Sung
    • Journal of Broadcast Engineering / v.19 no.2 / pp.140-147 / 2014
  • In this paper, we propose a method for real-time depth image refinement. To improve the quality of the depth map acquired from a Kinect camera, we employ constant memory and texture memory, which are suitable for 2D image processing on the graphics processing unit (GPU). In addition, we apply the joint bilateral filter (JBF) in parallel to accelerate the overall execution. To enhance the quality of the depth image, we apply the JBF hierarchically using the compute unified device architecture (CUDA). Finally, we obtain the refined depth image. Experimental results showed that the proposed real-time depth image refinement algorithm improved the subjective quality of the depth image and ran at 260 frames per second.
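The joint bilateral filter at the core of this method can be sketched as follows. This is a single-level, CPU-only illustration; the hierarchical scheme and the CUDA memory optimizations described in the abstract are omitted, and the function name `joint_bilateral_filter` and the parameters `radius`, `sigma_s`, and `sigma_r` are assumptions. Each output depth value is a weighted average of nearby depths, where the range weight comes from a separate guide image (for example, the color image) rather than from the depth map itself.

```python
import math


def joint_bilateral_filter(depth, guide, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Smooth `depth` while respecting edges found in `guide` (same shape)."""
    rows, cols = len(depth), len(depth[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc, wsum = 0.0, 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    # Spatial weight: falls off with distance in the image plane.
                    w_s = math.exp(-(dr * dr + dc * dc) / (2.0 * sigma_s ** 2))
                    # Range weight: falls off with intensity difference in the guide.
                    diff = guide[rr][cc] - guide[r][c]
                    w_r = math.exp(-(diff * diff) / (2.0 * sigma_r ** 2))
                    acc += w_s * w_r * depth[rr][cc]
                    wsum += w_s * w_r
            out[r][c] = acc / wsum  # wsum >= 1 because the centre weight is 1
    return out
```

Every output pixel is computed independently from read-only inputs, which is why the filter maps naturally to one GPU thread per pixel.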

Vehicle Detection in Dense Area Using UAV Aerial Images (무인 항공기를 이용한 밀집영역 자동차 탐지)

  • Seo, Chang-Jin
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.3 / pp.693-698 / 2018
  • This paper proposes a vehicle detection method for parking areas that uses unmanned aerial vehicles (UAVs) and YOLOv2, a recent and fast real-time object detection algorithm. The YOLOv2 convolutional network can calculate the probability of each class over an entire image in a single pass, and can also predict the locations of bounding boxes. Because the whole detection process runs in a single network, it is very fast, simple, and can be optimized directly for detection performance. Sliding-window methods and the region-based convolutional neural network family of detection algorithms use many region proposals and take too much calculation time for each class, which is a disadvantage in real-time applications. This research uses the YOLOv2 algorithm to overcome that disadvantage. Darknet, OpenCV, and the Compute Unified Device Architecture serve as the software tools for object detection, and a deep learning server is used for training and for detecting each car. In the experimental results, the algorithm detected cars in a dense area from UAV images and reduced the overhead of object detection. It could be applied in real time.

Ultrahigh-Resolution Spectral Domain Optical Coherence Tomography Based on a Linear-Wavenumber Spectrometer

  • Lee, Sang-Won;Kang, Heesung;Park, Joo Hyun;Lee, Tae Geol;Lee, Eun Seong;Lee, Jae Yong
    • Journal of the Optical Society of Korea / v.19 no.1 / pp.55-62 / 2015
  • In this study we demonstrate ultrahigh-resolution spectral domain optical coherence tomography (UHR SD-OCT) with a linear-wavenumber (k) spectrometer, to accelerate signal processing and to display two-dimensional (2-D) images in real time. First, we performed a numerical simulation to find the optimal parameters for the linear-k spectrometer to achieve ultrahigh axial resolution, such as the number of grooves in a grating, the material for a dispersive prism, and the rotational angle between the grating and the dispersive prism. We found that a grating with 1200 grooves and an F2 equilateral prism at a rotational angle of $26.07^{\circ}$, in combination with a lens of focal length 85.1 mm, are suitable for UHR SD-OCT with the imaging depth range (limited by spectrometer resolution) set at 2.0 mm. As guided by the simulation results, we constructed the linear-k spectrometer needed to implement a UHR SD-OCT. The actual imaging depth range was measured to be approximately 2.1 mm, and axial resolution of $3.8{\mu}m$ in air was achieved, corresponding to $2.8{\mu}m$ in tissue (n = 1.35). The sensitivity was -91 dB with -10 dB roll-off at 1.5 mm depth. We demonstrated a 128.2 fps acquisition rate for OCT images with 800 lines/frame, by taking advantage of NVIDIA's compute unified device architecture (CUDA) technology, which allowed for real-time signal processing compatible with the speed of the spectrometer's data acquisition.