• Title/Summary/Keyword: 병렬 GPU (parallel GPU)

Performance Enhancement of Scaling Filter and Transcoder using CUDA (CUDA를 활용한 스케일링 필터 및 트랜스코더의 성능향상)

  • Han, Jae-Geun; Ko, Young-Sub; Suh, Sung-Han; Ha, Soon-Hoi
    • Journal of KIISE: Computing Practices and Letters / v.16 no.4 / pp.507-511 / 2010
  • In this paper, we propose to enhance the performance of a software transcoder by using a GPGPU for the scaling filters. Video transcoding is a technique that translates a video file into another video file with a different coding algorithm and/or a different frame size. Demand for it increases as more multimedia devices with different specifications coexist in our daily life. Since transcoding is computationally intensive, a software transcoder that runs on a CPU takes a long time. In this paper, we achieve significant speed-up by parallelizing the scaling filter on a GPGPU, which provides significantly greater computational power. Through extensive experiments with various videos of different sizes and various scaling filter options, we verify that the enhanced transcoder achieves a 36% performance improvement with the default option, and up to 101% with a certain option.
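The abstract above gives no implementation details, so the following is only a minimal CUDA sketch of the core idea of offloading a scaling filter to the GPU: a bilinear downscale of one 8-bit luma plane with one thread per output pixel. The kernel name, plane layout, and resolutions are illustrative assumptions, not the paper's transcoder code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Bilinear scaling of an 8-bit luma plane: each thread produces one output pixel.
__global__ void scaleBilinear(const unsigned char* src, int srcW, int srcH,
                              unsigned char* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the output pixel back into source coordinates.
    float sx = (x + 0.5f) * srcW / dstW - 0.5f;
    float sy = (y + 0.5f) * srcH / dstH - 0.5f;
    int x0 = max(0, (int)floorf(sx)), y0 = max(0, (int)floorf(sy));
    int x1 = min(srcW - 1, x0 + 1),  y1 = min(srcH - 1, y0 + 1);
    float fx = sx - x0, fy = sy - y0;

    float top = src[y0 * srcW + x0] * (1 - fx) + src[y0 * srcW + x1] * fx;
    float bot = src[y1 * srcW + x0] * (1 - fx) + src[y1 * srcW + x1] * fx;
    dst[y * dstW + x] = (unsigned char)(top * (1 - fy) + bot * fy + 0.5f);
}

int main()
{
    const int srcW = 1920, srcH = 1080, dstW = 1280, dstH = 720;
    unsigned char *src, *dst;
    cudaMallocManaged(&src, srcW * srcH);
    cudaMallocManaged(&dst, dstW * dstH);
    for (int i = 0; i < srcW * srcH; ++i) src[i] = (unsigned char)(i & 0xFF);

    dim3 block(16, 16), grid((dstW + 15) / 16, (dstH + 15) / 16);
    scaleBilinear<<<grid, block>>>(src, srcW, srcH, dst, dstW, dstH);
    cudaDeviceSynchronize();
    printf("dst[0] = %u\n", dst[0]);
    cudaFree(src); cudaFree(dst);
    return 0;
}
```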

Design and Implementation of an Approximate Surface Lens Array System based on OpenCL (OpenCL 기반 근사곡면 렌즈어레이 시스템의 설계 및 구현)

  • Kim, Do-Hyeong; Song, Min-Ho; Jung, Ji-Sung; Kwon, Ki-Chul; Kim, Nam; Kim, Kyung-Ah; Yoo, Kwan-Hee
    • The Journal of the Korea Contents Association / v.14 no.10 / pp.1-9 / 2014
  • The integral image used for autostereoscopic 3D display is generally generated for a flat lens array, but a flat lens array cannot provide a wide viewing range for the generated integral image. To make up for this weakness, curved lens arrays have been proposed, and due to technical and cost problems, an approximate surface lens array composed of several flat lens arrays is used instead of an ideal curved lens array. In this paper, we constructed an approximate surface lens array of 20×8 square flat lenses arranged on a sphere of 100 mm radius, and obtained about twice the viewing angle of a flat lens array. In particular, unlike existing research that generates the integral image manually, we propose an OpenCL GPU parallel algorithm for generating the integral image in real time. As a result, we obtained 12-20 frames/sec for various 3D volume data with a 15×15 approximate surface lens array.
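The paper itself is implemented in OpenCL; since the rest of this list leans on CUDA, the sketch below uses CUDA to illustrate only the general pattern of generating an integral image in parallel: one thread per output pixel, with each lens cell treated as a pinhole over a toy procedural scene. The lens counts, cell size, and scene are assumptions for illustration, not the paper's geometry.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy integral-image generation for a flat lens array: one thread per output
// pixel.  Each lens cell of LENS_PIX x LENS_PIX pixels acts as a pinhole; the
// pixel's offset inside its cell defines a viewing direction, and a simple
// procedural "scene" (a shaded disc) is sampled along that direction.
#define LENS_PIX 32          // pixels per lens in each axis (assumed)
#define LENS_NX  15          // lenses per row (assumed)
#define LENS_NY  15          // lenses per column (assumed)

__global__ void renderIntegralImage(unsigned char* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int lensX = x / LENS_PIX, lensY = y / LENS_PIX;        // which lens
    float u = (x % LENS_PIX) / (float)LENS_PIX - 0.5f;     // offset in cell
    float v = (y % LENS_PIX) / (float)LENS_PIX - 0.5f;

    // Lens center in a normalized array plane, direction tilted by (u, v).
    float cx = (lensX + 0.5f) / LENS_NX - 0.5f;
    float cy = (lensY + 0.5f) / LENS_NY - 0.5f;
    float px = cx + u, py = cy + v;                        // point on a far plane

    // Procedural scene: bright inside a disc of radius 0.3 centered at the origin.
    float d = sqrtf(px * px + py * py);
    out[y * width + x] = d < 0.3f ? (unsigned char)(255 * (1.0f - d / 0.3f)) : 0;
}

int main()
{
    const int width = LENS_NX * LENS_PIX, height = LENS_NY * LENS_PIX;
    unsigned char* img;
    cudaMallocManaged(&img, width * height);
    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    renderIntegralImage<<<grid, block>>>(img, width, height);
    cudaDeviceSynchronize();
    printf("center pixel = %u\n", img[(height / 2) * width + width / 2]);
    cudaFree(img);
    return 0;
}
```

A real implementation would trace rays against the 3D volume data and account for the tilt of each flat sub-array on the approximate surface.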

Multi-scale Texture Synthesis (다중 스케일 텍스처 합성)

  • Lee, Sung-Ho; Park, Han-Wook; Lee, Jung; Kim, Chang-Hun
    • Journal of the Korea Computer Graphics Society / v.14 no.2 / pp.19-25 / 2008
  • We synthesize a texture with different structures at different scales. Our technique is based on deterministic parallel synthesis allowing real-time processing on a GPU. A new coordinate transformation operator is used to construct a synthesized coordinate map based on different exemplars at different scales. The runtime overhead is minimal because this operator can be precalculated as a small lookup table. Our technique is effective for upsampling texture-rich images, because the result preserves texture detail well. In addition, a user can design a texture by coloring a low-resolution control image. This design tool can also be used for the interactive synthesis of terrain in the style of a particular exemplar, using the familiar 'raise and lower' airbrush to specify elevation.
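The abstract notes that the coordinate transformation operator can be precalculated as a small lookup table, so the runtime cost reduces to resolving a synthesized coordinate map against the exemplar. The CUDA sketch below shows only that final resolve step, with assumed buffer names and sizes; the construction of the map itself is not reproduced here.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Applying a precomputed synthesized-coordinate map: each output texel stores
// an (x, y) index into an exemplar texture, and one thread resolves one texel.
struct Coord { short x, y; };

__global__ void resolveCoordinateMap(const Coord* coordMap,
                                     const unsigned char* exemplar, int exW, int exH,
                                     unsigned char* out, int outW, int outH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outW || y >= outH) return;

    Coord c = coordMap[y * outW + x];
    // Wrap into the exemplar so any precomputed offset stays valid.
    int ex = ((c.x % exW) + exW) % exW;
    int ey = ((c.y % exH) + exH) % exH;
    out[y * outW + x] = exemplar[ey * exW + ex];
}

int main()
{
    const int exW = 64, exH = 64, outW = 256, outH = 256;
    Coord* coordMap; unsigned char *exemplar, *out;
    cudaMallocManaged(&coordMap, outW * outH * sizeof(Coord));
    cudaMallocManaged(&exemplar, exW * exH);
    cudaMallocManaged(&out, outW * outH);
    for (int i = 0; i < exW * exH; ++i) exemplar[i] = (unsigned char)(i % 251);
    for (int y = 0; y < outH; ++y)                  // trivial tiling map as a stand-in
        for (int x = 0; x < outW; ++x)
            coordMap[y * outW + x] = { (short)x, (short)y };

    dim3 block(16, 16), grid((outW + 15) / 16, (outH + 15) / 16);
    resolveCoordinateMap<<<grid, block>>>(coordMap, exemplar, exW, exH, out, outW, outH);
    cudaDeviceSynchronize();
    printf("out[0] = %u\n", out[0]);
    cudaFree(coordMap); cudaFree(exemplar); cudaFree(out);
    return 0;
}
```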

SimTBS: Simulator For GPGPU Thread Block Scheduling (SimTBS: GPGPU 스레드블록 스케줄링 시뮬레이터)

  • Cho, Kyung-Woon; Bahn, Hyokyung
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.4 / pp.87-92 / 2020
  • Although a GPGPU (General-Purpose GPU) can maximize performance by parallelizing a task across tens of thousands of threads, those threads are internally grouped into thread blocks, the base unit of processing and resource allocation. A thread block scheduler is a specialized hardware unit whose role is to allocate thread blocks to the GPGPU processing hardware in a round-robin manner. However, round-robin is a sequential allocation policy and is not optimized for GPGPU resource utilization. In this paper, we propose a thread block scheduler model that can analyze and quantify the performance of various thread block scheduling policies. Experimental results from a simulator implementing our model show that the legacy hardware thread block scheduling does not behave well when the workload becomes heavy.
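SimTBS itself is not published as code in this abstract; the toy host-side model below (plain C++, buildable with nvcc or any C++ compiler) only illustrates why the baseline round-robin policy can underutilize resources when per-block work is uneven. The SM count and per-block costs are invented numbers, not measurements from the paper.

```cuda
#include <cstdio>
#include <vector>

// Toy model of the baseline policy SimTBS compares against: thread blocks are
// handed to streaming multiprocessors (SMs) in round-robin order, regardless of
// how much work each block actually carries.
struct SM { long long busyUntil = 0; };

int main()
{
    const int numSMs = 4;
    // Hypothetical per-block execution times (cycles); an uneven workload.
    std::vector<long long> blockCost = {900, 100, 100, 100, 900, 100, 100, 100};

    std::vector<SM> sms(numSMs);
    for (size_t i = 0; i < blockCost.size(); ++i) {
        SM& sm = sms[i % numSMs];              // round-robin assignment
        sm.busyUntil += blockCost[i];          // block runs after earlier blocks on this SM
    }

    long long makespan = 0;
    for (int s = 0; s < numSMs; ++s) {
        printf("SM%d finishes at cycle %lld\n", s, sms[s].busyUntil);
        if (sms[s].busyUntil > makespan) makespan = sms[s].busyUntil;
    }
    printf("round-robin makespan = %lld cycles\n", makespan);
    return 0;
}
```

In this toy example a load-aware policy that hands each block to the currently least-loaded SM would cut the makespan from 1800 to 1000 cycles; quantifying exactly that kind of gap across policies is what the proposed simulator is for.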

Design and Implementation of Accelerator Architecture for Binary Weight Network on FPGA with Limited Resources (한정된 자원을 갖는 FPGA에서의 이진가중치 신경망 가속처리 구조 설계 및 구현)

  • Kim, Jong-Hyun; Yun, SangKyun
    • Journal of IKEEE / v.24 no.1 / pp.225-231 / 2020
  • In this paper, we propose a method to accelerate a binary weight network (BWN) on an FPGA with the limited resources of an embedded system. Because of the limited number of available logic elements, a single computing unit capable of handling Conv layers and FC layers of various sizes must be designed and reused. Also, if an input feature map cannot be processed in parallel at one time, the output must be calculated by reading the inputs several times. Since the number of available BRAM modules is limited, the data bit width in the BWN accelerator must be minimized. The image classification processing time of the BWN accelerator is superior to that of an embedded CPU; it is faster than a desktop PC and about 50% slower than a GPU system. Since the BWN accelerator uses a slow 50 MHz clock, it is advantageous in performance versus power.
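The accelerator is an FPGA design, so no CUDA/C++ code corresponds to it directly; the small reference routine below only illustrates the arithmetic a binary weight network exploits, namely that with weights restricted to +1/-1 every multiplication collapses into an addition or a subtraction. The kernel size, padding, and weight pattern are assumptions for illustration.

```cuda
#include <cstdio>

// Software reference for a binary-weight 3x3 convolution: weights are +1/-1,
// so each tap is an addition or a subtraction rather than a multiplication --
// the property a BWN accelerator exploits in hardware.
#define K 3

float convBinary3x3(const float* in, int w, int h, int x, int y,
                    const signed char* wgt /* values in {+1, -1} */)
{
    float acc = 0.0f;
    for (int dy = 0; dy < K; ++dy)
        for (int dx = 0; dx < K; ++dx) {
            int ix = x + dx - 1, iy = y + dy - 1;
            if (ix < 0 || iy < 0 || ix >= w || iy >= h) continue;  // zero padding
            float v = in[iy * w + ix];
            acc += (wgt[dy * K + dx] > 0) ? v : -v;   // add or subtract only
        }
    return acc;
}

int main()
{
    const int w = 4, h = 4;
    float in[w * h];
    for (int i = 0; i < w * h; ++i) in[i] = (float)i;
    signed char wgt[K * K] = {1, -1, 1, -1, 1, -1, 1, -1, 1};
    printf("output at (1,1) = %.1f\n", convBinary3x3(in, w, h, 1, 1, wgt));
    return 0;
}
```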

Development and Speed Comparison of Convolutional Neural Network Using CUDA (CUDA를 이용한 Convolutional Neural Network의 구현 및 속도 비교)

  • Ki, Cheol-min; Cho, Tai-Hoon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.05a / pp.335-338 / 2017
  • Artificial intelligence and deep learning are currently prominent topics, and these technologies are applied to various fields. Among the various algorithms in artificial intelligence, the convolutional neural network (CNN) is a strong method. A CNN extends a general neural network by adding convolution layers that extract features through convolution operations. When a CNN is used with a small amount of data, or when the layer structure is not complicated, speed is not a concern. However, the learning time grows long as the training data becomes large and the layer structure becomes complicated, so GPU-based parallel processing is widely used. In this paper, we implemented a convolutional neural network using CUDA, and its learning is faster and more efficient than the CPU-based method.
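The paper's CUDA implementation is not reproduced in the abstract, so the following is a hypothetical minimal sketch of the kind of kernel involved: a single-channel convolution layer forward pass with ReLU, one thread per output activation. Input size, filter, and launch configuration are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Minimal convolution layer forward pass: one thread per output activation,
// single input channel, single filter, 'valid' padding, ReLU activation.
__global__ void conv2dForward(const float* in, int inW, int inH,
                              const float* filt, int k,
                              float* out, int outW, int outH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outW || y >= outH) return;

    float acc = 0.0f;
    for (int fy = 0; fy < k; ++fy)
        for (int fx = 0; fx < k; ++fx)
            acc += in[(y + fy) * inW + (x + fx)] * filt[fy * k + fx];
    out[y * outW + x] = acc > 0.0f ? acc : 0.0f;   // ReLU
}

int main()
{
    const int inW = 28, inH = 28, k = 3, outW = inW - k + 1, outH = inH - k + 1;
    float *in, *filt, *out;
    cudaMallocManaged(&in, inW * inH * sizeof(float));
    cudaMallocManaged(&filt, k * k * sizeof(float));
    cudaMallocManaged(&out, outW * outH * sizeof(float));
    for (int i = 0; i < inW * inH; ++i) in[i] = 1.0f;
    for (int i = 0; i < k * k; ++i) filt[i] = 1.0f / (k * k);

    dim3 block(16, 16), grid((outW + 15) / 16, (outH + 15) / 16);
    conv2dForward<<<grid, block>>>(in, inW, inH, filt, k, out, outW, outH);
    cudaDeviceSynchronize();
    printf("out[0] = %.3f\n", out[0]);   // expect 1.000 for all-ones input
    cudaFree(in); cudaFree(filt); cudaFree(out);
    return 0;
}
```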

Real-time Eye Contact System Using a Kinect Depth Camera for Realistic Telepresence (Kinect 깊이 카메라를 이용한 실감 원격 영상회의의 시선 맞춤 시스템)

  • Lee, Sang-Beom; Ho, Yo-Sung
    • The Journal of Korean Institute of Communications and Information Sciences / v.37 no.4C / pp.277-282 / 2012
  • In this paper, we present a real-time eye contact system for realistic telepresence using a Kinect depth camera. In order to generate the eye contact image, we capture a pair of color and depth videos. Then, the single foreground user is separated from the background. Since the raw depth data contain several types of noise, we apply a joint bilateral filter. We then apply a discontinuity-adaptive depth filter to the filtered depth map to reduce the disocclusion area. From the color image and the preprocessed depth map, we construct a user mesh model at the virtual viewpoint. The entire system is implemented through GPU-based parallel programming for real-time processing. Experimental results show that the proposed system realizes eye contact efficiently, providing realistic telepresence.
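The joint bilateral filtering step mentioned above lends itself naturally to one-thread-per-pixel parallelism. The CUDA sketch below filters a depth map using a gray-scale guide image; the window radius, sigmas, and test data are assumptions, not the paper's parameters.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Joint bilateral filtering of a depth map guided by a gray-scale color image:
// a depth sample contributes according to spatial distance and to how similar
// the *guide* (color) values are, which preserves object boundaries.
#define RADIUS 3

__global__ void jointBilateral(const float* depth, const float* guide,
                               float* out, int w, int h,
                               float sigmaSpatial, float sigmaRange)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float center = guide[y * w + x];
    float sum = 0.0f, wsum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int nx = min(max(x + dx, 0), w - 1);
            int ny = min(max(y + dy, 0), h - 1);
            float ws = expf(-(dx * dx + dy * dy) / (2.0f * sigmaSpatial * sigmaSpatial));
            float diff = guide[ny * w + nx] - center;
            float wr = expf(-(diff * diff) / (2.0f * sigmaRange * sigmaRange));
            sum  += ws * wr * depth[ny * w + nx];
            wsum += ws * wr;
        }
    out[y * w + x] = sum / wsum;
}

int main()
{
    const int w = 64, h = 64;
    float *depth, *guide, *out;
    cudaMallocManaged(&depth, w * h * sizeof(float));
    cudaMallocManaged(&guide, w * h * sizeof(float));
    cudaMallocManaged(&out,   w * h * sizeof(float));
    for (int i = 0; i < w * h; ++i) { depth[i] = (i % 7) ? 1.0f : 5.0f; guide[i] = 0.5f; }

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    jointBilateral<<<grid, block>>>(depth, guide, out, w, h, 2.0f, 0.1f);
    cudaDeviceSynchronize();
    printf("out[0] = %.3f\n", out[0]);
    cudaFree(depth); cudaFree(guide); cudaFree(out);
    return 0;
}
```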

Scalable Ontology Reasoning Using GPU Cluster Approach (GPU 클러스터 기반 대용량 온톨로지 추론)

  • Hong, JinYung; Jeon, MyungJoong; Park, YoungTack
    • Journal of KIISE / v.43 no.1 / pp.61-70 / 2016
  • In recent years, techniques for large-scale ontology inference have been needed to infer new knowledge from existing knowledge at high speed and to support a diversity of semantic services. With recent advances in distributed computing, ontology inference engines have mostly been developed on Hadoop or Spark frameworks running on large clusters. Parallel programming with GPGPUs, which provide many more cores than CPUs, is also used for ontology inference. In this paper, by combining the advantages of both techniques, we propose a new method that holds large RDFS ontology data in a Spark in-memory framework and infers over the distributed data at high speed using GPGPUs. With GPGPUs, ontology reasoning over high-capacity data can be performed at low cost and with higher efficiency than conventional inference methods. In addition, we show that distributing data through the Spark cluster reduces the workload handled by the GPGPU on each node. To evaluate our approach, we used LUBM datasets ranging from LUBM10 to LUBM120. Our experimental results show that the proposed reasoning engine performs 7 times faster than a conventional approach that uses a Spark in-memory inference engine.
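The abstract describes applying RDFS entailment on the GPU over data partitioned by Spark. As a rough illustration of the GPU side only, the sketch below applies rule rdfs9 (class membership propagated along rdfs:subClassOf) to integer-encoded triples using a small dense subclass matrix; the encoding, the dense matrix, and all sizes are simplifying assumptions, not the paper's data layout.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// Simplified GPU application of RDFS rule rdfs9: for every (subject, rdf:type, C)
// fact, emit (subject, rdf:type, D) for each D with C rdfs:subClassOf D.
// Classes are integer-encoded; subClassOf is a dense numClasses x numClasses
// 0/1 matrix, which is only practical for small class vocabularies.
__global__ void rdfs9(const int* typeSubj, const int* typeClass, int numFacts,
                      const unsigned char* subClassOf, int numClasses,
                      int* outSubj, int* outClass, int* outCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numFacts) return;

    int c = typeClass[i];
    for (int d = 0; d < numClasses; ++d) {
        if (d != c && subClassOf[c * numClasses + d]) {
            int slot = atomicAdd(outCount, 1);        // reserve an output slot
            outSubj[slot]  = typeSubj[i];
            outClass[slot] = d;
        }
    }
}

int main()
{
    const int numClasses = 3, numFacts = 2;
    // Direct subclass edges only (0 -> 1, 1 -> 2); a full engine would also
    // apply the transitive-closure rule (rdfs11).
    unsigned char hSub[numClasses * numClasses] = {0,1,0, 0,0,1, 0,0,0};
    int hSubj[numFacts] = {42, 43}, hClass[numFacts] = {0, 1};

    int *subj, *cls, *oSubj, *oCls, *count; unsigned char* sub;
    cudaMallocManaged(&subj, sizeof(hSubj));  cudaMallocManaged(&cls, sizeof(hClass));
    cudaMallocManaged(&sub, sizeof(hSub));
    cudaMallocManaged(&oSubj, 16 * sizeof(int)); cudaMallocManaged(&oCls, 16 * sizeof(int));
    cudaMallocManaged(&count, sizeof(int));
    memcpy(subj, hSubj, sizeof(hSubj)); memcpy(cls, hClass, sizeof(hClass));
    memcpy(sub, hSub, sizeof(hSub)); *count = 0;

    rdfs9<<<1, 32>>>(subj, cls, numFacts, sub, numClasses, oSubj, oCls, count);
    cudaDeviceSynchronize();
    for (int i = 0; i < *count; ++i)
        printf("inferred: subject %d rdf:type class %d\n", oSubj[i], oCls[i]);
    return 0;
}
```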

Processing Speed Improvement of Software for Automatic Corner Radius Analysis of Laminate Composite using CUDA (CUDA를 이용한 적층 복합재 구조물 코너 부의 자동 구조 해석 소프트웨어의 처리 속도 향상)

  • Hyeon, Ju-Ha; Kang, Moon-Hyae; Moon, Yong-Ho; Ha, Seok-Wun
    • Journal of Convergence for Information Technology / v.9 no.7 / pp.33-40 / 2019
  • As the aerospace industry has expanded recently, there is a need to commercialize composite analysis software. Until now, commercial software has mainly been used for analyzing composites, but it has been difficult to use due to its high price and limited functions. To solve this problem, generalized, fully online automatic analysis software for both in-plane and corner radius strength has recently been developed. However, it has the disadvantage that it cannot analyze multiple failure criteria simultaneously. In this paper, we propose a method that greatly improves processing speed while handling multiple failure criteria simultaneously, using a parallel processing platform that runs on a GPU equipped with CUDA cores. We obtained satisfactory results when the analysis speed was measured on large structural data.
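Evaluating several failure criteria for every element in a single launch maps directly onto a one-thread-per-element kernel. The sketch below computes two generic textbook criteria (maximum stress and Tsai-Hill) from in-plane lamina stresses; the criteria, allowables, and data layout are assumptions and not the paper's formulation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Evaluating several lamina failure criteria for many elements in one launch:
// each thread handles one element's in-plane stresses (s1, s2, t12) and writes
// one failure index per criterion.  Criteria and allowables here are generic
// textbook forms, not the specific formulation used in the paper.
struct Stress { float s1, s2, t12; };

__global__ void failureCriteria(const Stress* st, int n,
                                float X, float Y, float S,
                                float* maxStressIdx, float* tsaiHillIdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Stress e = st[i];

    // Maximum-stress criterion: largest ratio of a stress to its allowable.
    float r1 = fabsf(e.s1) / X, r2 = fabsf(e.s2) / Y, r3 = fabsf(e.t12) / S;
    maxStressIdx[i] = fmaxf(r1, fmaxf(r2, r3));

    // Tsai-Hill criterion: failure when the index reaches 1.
    tsaiHillIdx[i] = (e.s1 * e.s1) / (X * X) - (e.s1 * e.s2) / (X * X)
                   + (e.s2 * e.s2) / (Y * Y) + (e.t12 * e.t12) / (S * S);
}

int main()
{
    const int n = 3;
    Stress* st; float *ms, *th;
    cudaMallocManaged(&st, n * sizeof(Stress));
    cudaMallocManaged(&ms, n * sizeof(float));
    cudaMallocManaged(&th, n * sizeof(float));
    st[0] = {500.f, 20.f, 30.f}; st[1] = {1200.f, 10.f, 5.f}; st[2] = {100.f, 60.f, 80.f};

    failureCriteria<<<1, 32>>>(st, n, 1500.f, 50.f, 70.f, ms, th);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i)
        printf("element %d: max-stress index %.2f, Tsai-Hill index %.2f\n", i, ms[i], th[i]);
    cudaFree(st); cudaFree(ms); cudaFree(th);
    return 0;
}
```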

Real-Time Copyright Security Scheme of Immersive Content based on HEVC (HEVC 기반의 실감형 콘텐츠 실시간 저작권 보호 기법)

  • Yun, Chang Seob; Jun, Jae Hyun; Kim, Sung Ho; Kim, Dae Soo
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.21 no.1 / pp.27-34 / 2021
  • In this paper, we propose a copyright protection scheme for real-time streaming of HEVC (High Efficiency Video Coding) based immersive content. Previous research uses encryption and modular operations for copyright pre-protection and post-protection, which causes delays for ultra-high-resolution video. The proposed scheme maximizes parallelism by using thread-pool-based DRM (Digital Rights Management) packaging applied only to HEVC's CABAC (Context Adaptive Binary Arithmetic Coding) codec together with GPU-based high-speed bit operations (XOR), thus enabling real-time copyright protection. Comparing this scheme with previous research at three resolutions, PSNR was on average 8 times higher and the processing speed differed by an average factor of 18. In addition, when the robustness of the forensic mark was compared across attacks, the differences ranged from 27-fold under recompression attacks down to 8-fold under the filter and noise attacks.
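The GPU step described above is a high-speed XOR bit operation over the packaged bitstream. The sketch below shows only the generic pattern: one thread XORs one 64-bit word with a keystream word, and applying the same keystream twice restores the data. In a real packager only selected CABAC-coded syntax elements would be touched; the buffer size and keystream here are invented.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdint>

// High-speed XOR scrambling of a bitstream buffer: each thread XORs one 64-bit
// word with a keystream word.  Running the kernel twice with the same keystream
// restores the original data.
__global__ void xorScramble(uint64_t* data, const uint64_t* keystream, size_t nWords)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < nWords) data[i] ^= keystream[i];
}

int main()
{
    const size_t nWords = 1 << 20;            // 8 MiB of payload
    uint64_t *data, *key;
    cudaMallocManaged(&data, nWords * sizeof(uint64_t));
    cudaMallocManaged(&key,  nWords * sizeof(uint64_t));
    for (size_t i = 0; i < nWords; ++i) { data[i] = i; key[i] = 0x9E3779B97F4A7C15ull * (i + 1); }

    int block = 256; int grid = (int)((nWords + block - 1) / block);
    xorScramble<<<grid, block>>>(data, key, nWords);   // scramble
    xorScramble<<<grid, block>>>(data, key, nWords);   // same key again restores the data
    cudaDeviceSynchronize();
    printf("data[123] restored to %llu\n", (unsigned long long)data[123]);
    cudaFree(data); cudaFree(key);
    return 0;
}
```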