• Title/Summary/Keyword: GPU-parallelism

Search Result 40, Processing Time 0.024 seconds

Design of Line Scratch Detection and Restoration Algorithm using GPU (GPU를 이용한 선형 스크래치 탐지와 복원 알고리즘의 설계)

  • Lee, Joon-Goo;Shim, She-Yong;You, Byoung-Moon;Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.4
    • /
    • pp.9-16
    • /
    • 2014
  • This paper proposes a linear scratch detection and restoration algorithm using pixel data comparison in a single frame or consecutive frames. There exists a high parallelism in that a scratch detection and restoration algorithm needs a large amount of comparison operations. The proposed scratch detection and restoration algorithm is designed with a GPU for fast computation. We test the proposed algorithm in sequential and parallel processing with the set of digital videos in National Archive of Korea. In the experiments, the scratch detection rate of consecutive frames is as fast as about 20% for that of a single frame. The detection and restoration rates of a GPU-based algorithm are similar to those of a CPU-based algorithm, but the parallel implementation speeds up to about 50 times.

A GPU-enabled Face Detection System in the Hadoop Platform Considering Big Data for Images (이미지 빅데이터를 고려한 하둡 플랫폼 환경에서 GPU 기반의 얼굴 검출 시스템)

  • Bae, Yuseok;Park, Jongyoul
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.1
    • /
    • pp.20-25
    • /
    • 2016
  • With the advent of the era of digital big data, the Hadoop platform has become widely used in various fields. However, the Hadoop MapReduce framework suffers from problems related to the increase of the name node's main memory and map tasks for the processing of large number of small files. In addition, a method for running C++-based tasks in the MapReduce framework is required in order to conjugate GPUs supporting hardware-based data parallelism in the MapReduce framework. Therefore, in this paper, we present a face detection system that generates a sequence file for images to process big data for images in the Hadoop platform. The system also deals with tasks for GPU-based face detection in the MapReduce framework using Hadoop Pipes. We demonstrate a performance increase of around 6.8-fold as compared to a single CPU process.

An IPC-based Dynamic Cooperative Thread Array Scheduling Scheme for GPUs

  • Son, Dong Oh;Kim, Jong Myon;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.21 no.2
    • /
    • pp.9-16
    • /
    • 2016
  • Recently, many research groups have focused on GPGPUs in order to improve the performance of computing systems. GPGPUs can execute general-purpose applications as well as graphics applications by using parallel GPU hardware resources. GPGPUs can process thousands of threads based on warp scheduling and CTA scheduling. In this paper, we utilize the traditional CTA scheduler to assign a various number of CTAs to SMs. According to our simulation results, increasing the number of CTAs assigned to the SM statically does not improve the performance. To solve the problem in traditional CTA scheduling schemes, we propose a new IPC-based dynamic CTA scheduling scheme. Compared to traditional CTA scheduling schemes, the proposed dynamic CTA scheduling scheme can increase the GPU performance by up to 13.1%.

Fast Hilbert R-tree Bulk-loading Scheme using GPGPU (GPGPU를 이용한 Hilbert R-tree 벌크로딩 고속화 기법)

  • Yang, Sidong;Choi, Wonik
    • Journal of KIISE
    • /
    • v.41 no.10
    • /
    • pp.792-798
    • /
    • 2014
  • In spatial databases, R-tree is one of the most widely used indexing structures and many variants have been proposed for its performance improvement. Among these variants, Hilbert R-tree is a representative method using Hilbert curve to process large amounts of data without high cost split techniques to construct the R-tree. This Hilbert R-tree, however, is hardly applicable to large-scale applications in practice mainly due to high pre-processing costs and slow bulk-load time. To overcome the limitations of Hilbert R-tree, we propose a novel approach for parallelizing Hilbert mapping and thus accelerating bulk-loading of Hilbert R-tree on GPU memory. Hilbert R-tree based on GPU improves bulk-loading performance by applying the inversed-cell method and exploiting parallelism for packing the R-tree structure. Our experimental results show that the proposed scheme is up to 45 times faster compared to the traditional CPU-based bulk-loading schemes.

A Software Method for Improving the Performance of Real-time Rendering of 3D Games (3D 게임의 실시간 렌더링 속도 향상을 위한 소프트웨어적 기법)

  • Whang, Suk-Min;Sung, Mee-Young;You, Yong-Hee;Kim, Nam-Joong
    • Journal of Korea Game Society
    • /
    • v.6 no.4
    • /
    • pp.55-61
    • /
    • 2006
  • Graphics rendering pipeline (application, geometry, and rasterizer) is the core of real-time graphics which is the most important functionality for computer games. Usually this rendering process is completed by both the CPU and the GPU, and a bottleneck can be located either in the CPU or the GPU. This paper focuses on reducing the bottleneck between the CPU and the GPU. We are proposing a method for improving the performance of parallel processing for real-time graphics rendering by separating the CPU operations (usually performed using a thread) into two parts: pure CPU operations and operations related to the GPU, and let them operate in parallel. This allows for maximizing the parallelism in processing the communication between the CPU and the GPU. Some experiments lead us to confirm that our method proposed in this paper can allow for faster graphics rendering. In addition to our method of using a dedicated thread for GPU related operations, we are also proposing an algorithm for balancing the graphics pipeline using the idle time due to the bottleneck. We have implemented the two methods proposed in this paper in our networked 3D game engine and verified that our methods are effective in real systems.

  • PDF

Numerical Integration based on Harmonic Oscillation and Jacobi Iteration for Efficient Simulation of Soft Objects with GPU (GPU를 활용한 고성능 연체 객체 시뮬레이션을 위한 조화진동 모델과 야코비 반복법 기반 수치 적분 기술)

  • Kang, Young-Min
    • Journal of Korea Game Society
    • /
    • v.18 no.5
    • /
    • pp.123-132
    • /
    • 2018
  • Various methods have been proposed to efficiently animate the motion of soft objects in realtime. In order to maintain the topology between the elements of the objects, it is required to employ constraint forces, which limit the size of the time steps for the numerical integration and reduce the efficiency. To tackle this, an implicit method with larger steps was proposed. However, the method is, in essence, a linear system with a large matrix, of which solution requires heavy computations. Several approximate methods have been proposed, but the approximation is obtained with an increased damping and the loss of accuracy. In this paper, new integration method based on harmonic oscillation with better stability was proposed, and it was further stabilized with the hybridization with approximate implicit method. GPU parallelism can be easily implemented for the method, and large-scale soft objects can be simulated in realtime.

An Efficient Block Cipher Implementation on Many-Core Graphics Processing Units

  • Lee, Sang-Pil;Kim, Deok-Ho;Yi, Jae-Young;Ro, Won-Woo
    • Journal of Information Processing Systems
    • /
    • v.8 no.1
    • /
    • pp.159-174
    • /
    • 2012
  • This paper presents a study on a high-performance design for a block cipher algorithm implemented on modern many-core graphics processing units (GPUs). The recent emergence of VLSI technology makes it feasible to fabricate multiple processing cores on a single chip and enables general-purpose computation on a GPU (GPGPU). The GPU strategy offers significant performance improvements for all-purpose computation and can be used to support a broad variety of applications, including cryptography. We have proposed an efficient implementation of the encryption/decryption operations of a block cipher algorithm, SEED, on off-the-shelf NVIDIA many-core graphics processors. In a thorough experiment, we achieved high performance that is capable of supporting a high network speed of up to 9.5 Gbps on an NVIDIA GTX285 system (which has 240 processing cores). Our implementation provides up to 4.75 times higher performance in terms of encoding and decoding throughput as compared to the Intel 8-core system.

Real-Time Copyright Security Scheme of Immersive Content based on HEVC (HEVC 기반의 실감형 콘텐츠 실시간 저작권 보호 기법)

  • Yun, Chang Seob;Jun, Jae Hyun;Kim, Sung Ho;Kim, Dae Soo
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.21 no.1
    • /
    • pp.27-34
    • /
    • 2021
  • In this paper, we propose a copyright protection scheme for real-time streaming of HEVC(High Efficiency Video Coding) based realistic content. Previous research uses encryption and modular operation for copyright pre-protection and copyright post-protection, which causes delays in ultra high resolution video. The proposed scheme maximizes parallelism by using thread pool based DRM(Digital Rights Management) packaging with only HEVC's CABAC(Context Adaptive Binary Arithmetic Coding) codec and GPU based high-speed bit operation(XOR), thus enabling real-time copyright protection. As a result of comparing this scheme with previous research at three resolutions, PSNR showed an average of 8 times higher performance, and the process speed showed an average of 18 times difference. In addition, as a result of comparing the robustness of the forensic mark, the filter and noise attack, which showed the largest and smallest difference, with a 27-fold difference in recompression attacks, showed an 8-fold difference.

An Implementation of a Convolutional Accelerator based on a GPGPU for a Deep Learning (Deep Learning을 위한 GPGPU 기반 Convolution 가속기 구현)

  • Jeon, Hee-Kyeong;Lee, Kwang-yeob;Kim, Chi-yong
    • Journal of IKEEE
    • /
    • v.20 no.3
    • /
    • pp.303-306
    • /
    • 2016
  • In this paper, we propose a method to accelerate convolutional neural network by utilizing a GPGPU. Convolutional neural network is a sort of the neural network learning features of images. Convolutional neural network is suitable for the image processing required to learn a lot of data such as images. The convolutional layer of the conventional CNN required a large number of multiplications and it is difficult to operate in the real-time on the embedded environment. In this paper, we reduce the number of multiplications through Winograd convolution operation and perform parallel processing of the convolution by utilizing SIMT-based GPGPU. The experiment was conducted using ModelSim and TestDrive, and the experimental results showed that the processing time was improved by about 17%, compared to the conventional convolution.

Study of Parallelization Methods for Software based Real-time HEVC Encoder Implementation (소프트웨어 기반 실시간 HEVC 인코더 구현을 위한 병렬화 기법에 관한 연구)

  • Ahn, Yong-Jo;Hwang, Tae-Jin;Lee, Dongkyu;Kim, Sangmin;Oh, Seoung-Jun;Sim, Dong-Gyu
    • Journal of Broadcast Engineering
    • /
    • v.18 no.6
    • /
    • pp.835-849
    • /
    • 2013
  • Joint Collaborative Team on Video Coding (JCT-VC), which have founded ISO/IEC MPEG and ITU-T VCEG, has standardized High Efficiency Video Coding (HEVC). Standardization of HEVC has started with purpose of twice or more coding performance compared to H.264/AVC. However, flexible and hierarchical coding block and recursive coding structure are problems to overcome of HEVC standard. Many fast encoding algorithms for reducing computational complexity of HEVC encoder have been proposed. However, it is hard to implement a real-time HEVC encoder only with those fast encoding algorithms. In this paper, for implementation of software-based real-time HEVC encoder, data-level parallelism using SIMD instructions and CPU/GPU multi-threading methods are proposed. And we also proposed appropriate operations and functional modules to apply the proposed methods on HM 10.0 software. Evaluation of the proposed methods implemented on HM 10.0 software showed 20-30fps for $832{\times}480$ sequences and 5-10fps for $1920{\times}1080$ sequences, respectively.