• Title/Summary/Keyword: GPU process

Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders (HEVC 부호화기에서 GPU 기반 정수화소 움직임 추정을 고속화하기 위한 적응적인 탐색영역 결정 방법)

  • Kim, Sangmin;Lee, Dongkyu;Sim, Dong-Gyu;Oh, Seoung-Jun
    • Journal of Broadcast Engineering / v.19 no.5 / pp.699-712 / 2014
  • In this paper, we propose a new Adaptive Search Range (ASR) decision algorithm for accelerating GPU-based Integer-pel Motion Estimation (IME) in High Efficiency Video Coding (HEVC). To decide the ASR, we classify a frame into two models using Motion Vector Differences (MVDs), then adaptively decide the search range of each model. To apply the proposed algorithm to the GPU-based ME process, the starting points of the ME are decided using only temporal Motion Vectors (MVs). The CPU decides the ASR and the starting points and transfers them to the GPU, which then performs the integer-pel ME. The proposed algorithm reduces the total encoding time by 37.9% with a BD-rate increase of 1.1% and makes ME 951.2 times faster than the CPU-based anchor. In addition, compared with simple GPU-based ME without the ASR decision, the proposed algorithm reduces the ME running time by 57.5% with a negligible coding loss of 0.6%.

Implementation of FFT on Massively Parallel GPU for DVB-T Receiver (DVB-T 수신기를 위한 대규모 병렬처리 GPU 기반의 FFT 구현)

  • Lee, Kyu Hyung;Heo, Seo Weon
    • Journal of Broadcast Engineering / v.18 no.2 / pp.204-214 / 2013
  • Recently, various studies have examined software implementations of signal processing and communication systems that exploit the massively parallel processing capability of the GPU. In this work, we focus on reducing the software simulation time of the 2K/8K FFT in DVB-T by using a GPU. We first estimate the CPU processing time of the DVB-T system, which is one of the standards for DTV transmission. Then we implement the FFT in software on NVIDIA's massively parallel GPU. We apply a stream processing method to reduce the overhead of data transfer between the CPU and GPU, a coalescing method to reduce the global memory access time, and a data structure design method to maximize shared memory usage. The results show that, when applied to the DVB-T 2K/8K mode FFT, the proposed method is approximately 20 to 30 times as fast as the CPU-based FFT processor and approximately 1.8 times as fast as the CUFFT library (version 2.1) provided by NVIDIA.
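
To make the transform in this abstract concrete, the following is a minimal CPU-side sketch of an iterative radix-2 Cooley-Tukey FFT in plain Python. It is not the authors' GPU implementation, and the function name `fft_radix2` is a hypothetical one chosen for illustration. The butterflies inside each stage are independent of one another, which is the property a GPU implementation can exploit.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Reorder the input into bit-reversed index order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    # log2(n) butterfly stages; butterflies within a stage do not depend on each other.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a
```

Calling `fft_radix2` on a list of 2048 samples corresponds in size to the 2K mode mentioned above, and 8192 samples to the 8K mode.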

Fast GPU Implementation for the Solution of Tridiagonal Matrix Systems (삼중대각행렬 시스템 풀이의 빠른 GPU 구현)

  • Kim, Yong-Hee;Lee, Sung-Kee
    • Journal of KIISE:Computer Systems and Theory / v.32 no.11_12 / pp.692-704 / 2005
  • With improvements in computer hardware, GPUs (Graphics Processing Units) now offer tremendous memory bandwidth and computation power, which has led to their use in general-purpose computation. In particular, GPU implementation of compute-intensive physics-based simulations is being actively studied. In solving the differential equations that underlie physics simulations, tridiagonal matrix systems arise repeatedly from finite-difference approximation, so the fast solution of tridiagonal matrix systems is an important research topic for physics-based simulation. We propose a fast GPU implementation for the solution of tridiagonal matrix systems. In this paper, we implement the cyclic reduction (also known as odd-even reduction) algorithm, which is a popular choice for vector processors. We obtained a considerable performance improvement in solving tridiagonal matrix systems over the Thomas method and the conjugate gradient method. The Thomas method is well known as a method for solving tridiagonal matrix systems on the CPU, and the conjugate gradient method has shown good results on the GPU. We tested the proposed method by applying it to heat conduction, advection-diffusion, and shallow water simulations. These simulations ran at over 35 frames per second on a 1024x1024 grid.
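
As a rough illustration of cyclic (odd-even) reduction, here is a serial Python sketch, not the authors' GPU code. It assumes a system of the form a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with a[0] = c[-1] = 0 and nonzero pivots (for example, a diagonally dominant matrix). The name `cyclic_reduction` is hypothetical. Each level eliminates the even-indexed unknowns, solves the half-size system for the odd-indexed ones, then back-substitutes.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive odd-even reduction."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Build the reduced system over odd-indexed unknowns.
    # Every iteration of this loop is independent, so it could run in parallel.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        alpha = -a[i] / b[i - 1]                      # removes x[i-1]
        gamma = -c[i] / b[i + 1] if has_next else 0.0  # removes x[i+1]
        ra.append(alpha * a[i - 1])
        rb.append(b[i] + alpha * c[i - 1] + (gamma * a[i + 1] if has_next else 0.0))
        rc.append(gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] + alpha * d[i - 1] + (gamma * d[i + 1] if has_next else 0.0))

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitute the even-indexed unknowns (also independent of each other).
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

The recursion depth is about log2(n), and the work inside each level has no serial dependency. The Thomas method, by contrast, sweeps through the unknowns one after another.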

Realistic and Real-Time Modeling of Numerous Trees Using Growing Environment (성장 환경을 활용한 다수의 나무에 대한 사실적인 실시간 모델링 기법)

  • Kim, Jin-Mo;Cho, Hyung-Je
    • Journal of Korea Multimedia Society / v.15 no.3 / pp.398-407 / 2012
  • We propose a tree modeling method for realistically and efficiently representing numerous trees distributed over a broad terrain. The method combines and simplifies the recursive hierarchy of tree branches with a branch generation process based on self-organization from buds, allowing users to generate trees more intuitively and efficiently. During generation, the user can interactively control the level structure and the appearance, such as branch length, distribution, and direction. In addition, we introduce an environment-adaptive model that grows many trees in varied ways by controlling them simultaneously, and we propose an efficient method of applying the growing environment. For real-time rendering of the complex tree models distributed over a broad terrain, the rendering process, LOD (level of detail) for the branch surfaces, and shader instancing are implemented on the GPU (Graphics Processing Unit). Simulation is used to confirm whether the proposed models represent numerous trees realistically and efficiently on a wide terrain.

Research on the Development of an Integral Imaging System Framework and an Improved Viewpoint Vector Rendering Method Utilizing GPU (GPU를 이용한 개선된 뷰포인트 벡터 렌더링 방식의 집적영상시스템 프레임워크에 관한 연구)

  • Lee, Bin-Na-Ra;Park, Kyoung-Shin;Cho, Yong-Joo
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.10 / pp.1767-1772 / 2006
  • A computer-generated integral imaging system is an auto-stereoscopic display system in which users see and perceive stereoscopic images when they view pre-rendered elemental images through a lens array. The process of constructing elemental images using computer graphics is called image mapping. The viewpoint vector rendering (VVR) method is an image mapping algorithm designed specifically for real-time graphics applications; it is not affected by the size of the rendered objects or the number of elemental lenses used in the integral imaging system. This paper describes a new VVR framework with considerably improved rendering performance. It also compares the previous VVR implementation with the new GPU-based one and shows that the new implementation is a substantial improvement over the old method.

Fast Double Random Phase Encoding by Using Graphics Processing Unit (GPU 컴퓨팅에 의한 고속 Double Random Phase Encoding)

  • Saifullah, Saifullah;Moon, In-Kyu
    • Proceedings of the Korea Multimedia Society Conference / 2012.05a / pp.343-344 / 2012
  • As the volume of sensitive data and the need for its secure transmission and storage grow, encryption techniques have become widespread. The performance of encoding depends largely on computation time, so a system with a shorter computation time is preferable. Double Random Phase Encoding (DRPE) is an algorithm with many sub-functions and is time-consuming when executed serially; the computation time can be significantly reduced by implementing the important functions in parallel on a Graphics Processing Unit (GPU). Computing the convolution using the Fast Fourier Transform is the most important part of DRPE, and the paper shows that performing this portion on the GPU reduces the execution time substantially; the result can be compared with MATLAB for performance analysis. An NVIDIA GeForce 310 graphics card is used, with CUDA C as the programming language.
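
For readers unfamiliar with DRPE, the sketch below shows the classic formulation on a one-dimensional signal: multiply the input by a random phase mask, transform to the Fourier domain, multiply by a second random phase mask, and transform back. It is only a sketch under stated assumptions, not the paper's CUDA code. A naive O(n^2) DFT stands in for the FFT to keep the example self-contained, and all function names are hypothetical.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_mask(n, rng):
    """Unit-magnitude values with phases drawn uniformly from [0, 2*pi)."""
    return [cmath.exp(2j * cmath.pi * rng.random()) for _ in range(n)]


def drpe_encrypt(signal, mask1, mask2):
    # Input-plane mask, forward transform, Fourier-plane mask, inverse transform.
    spectrum = dft([s * m for s, m in zip(signal, mask1)])
    return dft([v * m for v, m in zip(spectrum, mask2)], inverse=True)


def drpe_decrypt(cipher, mask1, mask2):
    # Undo each step in reverse order using the conjugate masks.
    spectrum = dft(cipher)
    field = dft([v * m.conjugate() for v, m in zip(spectrum, mask2)], inverse=True)
    return [v * m.conjugate() for v, m in zip(field, mask1)]
```

The two transforms dominate the cost, which is consistent with the abstract's point that the FFT-based convolution is the part worth moving to the GPU.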

A Dynamic Accuracy Estimation for GPU-based Monte Carlo Simulation in Tissue Optics

  • Cai, Fuhong;Lu, Wen
    • Current Optics and Photonics / v.1 no.5 / pp.551-555 / 2017
  • Tissue optics is a well-established and extensively studied area. In recent decades, Monte Carlo simulation (MCS) has been one of the standard tools for simulating light propagation in turbid media. Parallel processing dramatically increases the speed of MCS of photon migration, and some MCS-based calculations can be completed within a few seconds. Since MCS has the potential to become a real-time calculation method, a dynamic accuracy estimation, also known as a history-by-history statistical estimator, is required in the simulation code to terminate the MCS automatically once the accuracy of the results reaches a sufficiently high level. In this work, spatial and time-domain GPU-based MCS adopting the dynamic accuracy estimation are performed to calculate the light dose/reflectance in homogeneous and heterogeneous tissue media. This dynamic accuracy estimation can effectively derive the statistical error of the optical dose/reflectance during the parallel Monte Carlo process.
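
The idea of stopping a Monte Carlo run once its statistical error is small enough can be sketched as follows. This is a serial toy under stated assumptions, not the authors' GPU code: the per-history score is simply whether a photon crosses a purely attenuating slab without interacting, and the names `run_until_accurate` and `make_slab_scorer`, the batch size, and the tolerance are all hypothetical. The estimator keeps running sums of the score and its square, from which the standard error of the mean follows.

```python
import math
import random


def run_until_accurate(score_one_history, rel_tol=0.01, batch=1000,
                       max_histories=10_000_000):
    """Run histories in batches until the relative standard error is <= rel_tol."""
    n = 0
    total = 0.0      # running sum of scores
    total_sq = 0.0   # running sum of squared scores
    while n < max_histories:
        for _ in range(batch):
            s = score_one_history()
            total += s
            total_sq += s * s
        n += batch
        mean = total / n
        # Variance of the mean from the history-by-history sums.
        var_of_mean = max(total_sq / n - mean * mean, 0.0) / (n - 1)
        if mean > 0 and math.sqrt(var_of_mean) <= rel_tol * mean:
            break
    return mean, math.sqrt(var_of_mean), n


def make_slab_scorer(mu_t, thickness, rng):
    """Toy score: 1 if the first free path exceeds the slab thickness, else 0."""
    def score():
        step = -math.log(1.0 - rng.random()) / mu_t  # exponentially distributed path
        return 1.0 if step > thickness else 0.0
    return score
```

For this toy problem the exact answer is exp(-mu_t * thickness), so the returned mean and error estimate can be checked against it. In a parallel setting, each thread would accumulate its own two sums and they would be combined before the stopping test.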

The Implementation of Fast Object Recognition Using Parallel Processing on CPU and GPU (CPU와 GPU의 병렬 처리를 이용한 고속 물체 인식 알고리즘 구현)

  • Kim, Jun-Chul;Jung, Young-Han;Park, Eun-Soo;Cui, Xue-Nan;Kim, Hak-Il;Huh, Uk-Youl
    • Journal of Institute of Control, Robotics and Systems / v.15 no.5 / pp.488-495 / 2009
  • This paper presents a fast feature extraction method for autonomous mobile robots that uses parallel processing based on OpenMP, SSE (Streaming SIMD Extensions), and CUDA programming. In the first step, for the CPU version, the algorithms and code are optimized and then parallelized. The parallel algorithms are debugged to maintain the same level of performance, and the process of extracting key points and obtaining the dominant orientation of each key point is parallelized. After extraction, a parallel descriptor is constructed using SSE instructions. A GPU version based on SIFT is also implemented with parallel processing using CUDA. The GPU-parallel descriptor achieves a speed-up of up to five times over the CPU-parallel descriptor, but it shows lower performance than the CPU version. The CPU version also achieves a speed-up of four and a half times over the original SIFT while maintaining robust performance.

Polymeric Membrane Modules for Substituting the $CO_2$ Absorption Column in the DME Plant Process (DME 플랜트 $CO_2$흡수탑 대체용 고분자 분리막 모듈)

  • Chung, Jong-Tae;Lee, Choong-Seop;Koh, Hyung-Chul;Ha, Seong-Yong;Nam, Sang-Yong;Jo, Won-Jun;Baek, Young-Soon
    • Membrane Journal / v.22 no.2 / pp.142-154 / 2012
  • To remove $CO_2$ in the DME plant process, we investigated a composite membrane with rubbery polymers as the separation layer and its $CO_2$/$H_2$ separation performance. Hollow fiber membranes for the supporting layer were prepared by the solution spinning method. With PDMS as the separation layer, the composite membranes showed $CO_2$ permeation rates of over 300 GPU (gas permeation units) and a minimum $CO_2/H_2$ selectivity of 4.3; with PEBAX as the separation layer, they showed $CO_2$ permeation rates of over 120 GPU and a minimum $CO_2/H_2$ selectivity of 5.

PDF Version 1.4-1.6 Password Cracking in CUDA GPU Environment (PDF 버전 1.4-1.6의 CUDA GPU 환경에서 암호 해독 최적 구현)

  • Kim, Hyun Jun;Eum, Si Woo;Seo, Hwa Jeong
    • KIPS Transactions on Computer and Communication Systems / v.12 no.2 / pp.69-76 / 2023
  • Hundreds of thousands of passwords are lost or forgotten every year, making the necessary information unavailable to legitimate owners or authorized law enforcement personnel. To recover such a password, a password cracking tool is required. Using GPUs instead of CPUs for password cracking makes it possible to process quickly the large amount of computation required during recovery. This paper optimizes password cracking on GPUs using CUDA, focusing on decryption of PDF versions 1.4-1.6, currently the most widely used. The techniques used include eliminating unnecessary operations in the MD5 algorithm, implementing 32-bit word integration in the RC4 algorithm, and using shared memory. In addition, autotuning was used to search for the numbers of blocks and threads, which affect performance. As a result, we achieved throughputs of 31,460 kp/s (kilo passwords per second) and 66,351 kp/s with block size 65,536 and thread size 96 on the RTX 3060 and RTX 3090, respectively, improving throughput by 22.5% and 15.2%, respectively, over hashcat, the cracking tool that achieves the highest throughput.
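
The abstract names MD5 and RC4 as the two primitives being optimized. The sketch below shows a plain RC4 implementation and a toy candidate-testing loop built from MD5 and RC4. The verifier here is a made-up stand-in: it does not follow the PDF standard security handler's key derivation, and `toy_verifier`, `crack`, and the marker string are hypothetical names and values used only to show the shape of the per-candidate work.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher: key scheduling, then XOR with the generated keystream."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:     # pseudo-random generation algorithm
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def toy_verifier(password: bytes, marker: bytes = b"hypothetical-marker") -> bytes:
    """Toy check value: RC4 of a fixed marker under an MD5-derived key (NOT the PDF scheme)."""
    key = hashlib.md5(password).digest()
    return rc4(key, marker)


def crack(candidates, target: bytes):
    """Return the first candidate whose toy check value matches target, else None."""
    for pw in candidates:
        if toy_verifier(pw) == target:
            return pw
    return None
```

Each candidate is tested independently of the others, which is why this kind of search maps naturally onto many GPU threads. For example, `crack([b"1234", b"letmein"], toy_verifier(b"letmein"))` returns `b"letmein"`.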