• Title/Abstract/Keyword: CUDA

Search results: 295 items (processing time: 0.029 sec)

H.264/AVC Fast Intra Mode Decision Using GPGPU Parallel Programming

  • 최성준;한기훈;유영수
    • 한국방송∙미디어공학회:학술대회논문집 / 한국방송공학회 2011 Fall Conference / pp.110-112 / 2011
  • Research on GPGPU computing, which applies the parallelism and computational power of GPUs to general engineering problems, has been very active in recent years. Video compression involves many algorithms that repeat identical operations over large amounts of pixel data, making it an excellent application area for fast parallel computation on a GPGPU. H.264/AVC is the most recent international video compression standard and is widely used in the market, having been adopted by many product lines and services. In this paper, we apply GPGPU parallel programming to the intra prediction mode decision process of H.264/AVC and propose a method that speeds up the mode decision. CUDA C was used for data-parallel processing on the GPU, and the CPU-side computation was implemented in C. By deciding the intra prediction modes for an entire frame in parallel on the GPU, the time required for this step was reduced. Experimental results show a speedup of about 2.8 times on Full-HD video when the prediction modes are decided in parallel on the GPU. If GPGPU parallel programming is further applied not only to intra prediction but also to other algorithms with repeated operations, relieving the computational burden of the encoder, we expect the development of fast real-time video encoders to become much easier.
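
The paper itself ships no source code; the sketch below is only a hypothetical illustration of the per-block parallelization it describes, with one CUDA thread evaluating all nine intra 4x4 modes of one block by SAD. `predict_pixel` is a stand-in, not the real H.264 predictor set.

```cuda
#include <cstdint>
#include <climits>

// Placeholder predictor: a real encoder would implement all nine H.264
// intra 4x4 directional predictors here; this stub just copies the left
// neighbor (or a mid-gray default at the frame edge).
__device__ int predict_pixel(const uint8_t *f, int w, int px, int py, int mode)
{
    (void)mode;
    return (px > 0) ? f[py * w + (px - 1)] : 128;
}

// One thread per 4x4 block: evaluate the SAD cost of each candidate mode
// and keep the cheapest, so the whole frame is decided in parallel.
__global__ void intraModeDecision(const uint8_t *frame, int width, int height,
                                  int *bestMode, int *bestCost)
{
    int bx = blockIdx.x * blockDim.x + threadIdx.x;   // 4x4-block column
    int by = blockIdx.y * blockDim.y + threadIdx.y;   // 4x4-block row
    if (bx >= width / 4 || by >= height / 4) return;

    int minCost = INT_MAX, minMode = 0;
    for (int mode = 0; mode < 9; ++mode) {            // nine intra 4x4 modes
        int cost = 0;
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x) {
                int px = bx * 4 + x, py = by * 4 + y;
                int pred = predict_pixel(frame, width, px, py, mode);
                cost += abs((int)frame[py * width + px] - pred);  // SAD
            }
        if (cost < minCost) { minCost = cost; minMode = mode; }
    }
    int idx = by * (width / 4) + bx;
    bestMode[idx] = minMode;
    bestCost[idx] = minCost;
}
```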


Analysis of Morton Code Conversion for 32-Bit IEEE 754 Floating Point Variables

  • 박태정
    • 디지털콘텐츠학회 논문지 / Vol. 17, No. 3 / pp.165-172 / 2016
  • In GPU-based parallel processing, Morton codes play an increasingly important role in nearest neighbor search over large-scale data, and their applications continue to grow. This paper reviews the existing method proposed by Tero Karras, which converts 3D geometric information in the $[0,1]^3$ space held in float variables into a 32-bit unsigned int Morton code, and analyzes its geometric meaning. Based on this analysis, we propose a conversion algorithm to a 64-bit unsigned long long Morton code that realizes higher resolution. When implemented on a GPU, the proposed algorithm achieves a performance improvement of roughly 1000 times over execution on a CPU.
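
For reference, the 32-bit conversion discussed above is the widely circulated routine from Tero Karras's work (published in NVIDIA's "Thinking Parallel" series). The 64-bit expansion appended below follows the standard 21-bit-per-axis interleaving pattern; it is a sketch of the higher-resolution idea, not necessarily the paper's exact algorithm.

```cuda
// Expands a 10-bit integer into 30 bits by inserting 2 zeros after each bit
// (Karras, "Thinking Parallel, Part III: Tree Construction on the GPU").
__device__ unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Maps a point in [0,1]^3 to a 30-bit Morton code (10 bits per axis).
__device__ unsigned int morton3D(float x, float y, float z)
{
    x = fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    y = fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    z = fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}

// 21-bit-per-axis expansion for a 63-bit Morton code: the standard shift-and-
// mask pattern behind the higher-resolution variant the paper proposes.
__device__ unsigned long long expandBits64(unsigned long long v)
{
    v &= 0x1FFFFFULL;                          // keep 21 input bits
    v = (v | v << 32) & 0x1F00000000FFFFULL;
    v = (v | v << 16) & 0x1F0000FF0000FFULL;
    v = (v | v <<  8) & 0x100F00F00F00F00FULL;
    v = (v | v <<  4) & 0x10C30C30C30C30C3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}
```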

GPU-accelerated Lattice Boltzmann Simulation for the Prediction of Oil Slick Movement in Ocean Environment

  • 하솔;구남국;노명일
    • 한국CDE학회논문집 / Vol. 18, No. 6 / pp.399-406 / 2013
  • This paper describes a new simulation technique for advection-diffusion phenomena over the sea surface using the lattice Boltzmann method (LBM), capable of predicting oil dispersion from tankers. The LBM is used to solve the pollutant transport problem within the framework of the ocean environment. The sea space is represented by lattices, where each lattice node holds the information on oil transport. Since dispersed oil (i.e., oil droplets) at sea is transported by convection due to waves, buoyancy, and turbulent diffusion, conservation of mass and a number of physical oil transport rules were built into the prediction model. Because the LBM is formulated on uniform lattices with simple local rules, it is easily accelerated by parallel mechanisms such as GPU computing. The proposed LBM model is used to simulate a simple pollution event involving 10,000 kL of oil. The simulation results indicate that the GPU-accelerated LBM is 6 times faster than the version without the GPU.
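
The paper gives no code; to show why the LBM maps so naturally onto a GPU, here is a hypothetical D2Q9 BGK collision kernel with one thread per lattice node. The paper's actual model layers oil-specific transport rules on top of an update of this kind.

```cuda
// One LBM collision step: each thread owns one lattice node. f/fNew hold the
// 9 distribution functions per node; tau is the BGK relaxation time.
__global__ void lbmCollide(const float *f, float *fNew, int nx, int ny, float tau)
{
    // D2Q9 lattice velocities and weights
    const int   cx[9] = {0, 1, 0,-1, 0, 1,-1,-1, 1};
    const int   cy[9] = {0, 0, 1, 0,-1, 1, 1,-1,-1};
    const float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                         1.f/36, 1.f/36, 1.f/36, 1.f/36};

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int n = y * nx + x;

    // Macroscopic density and velocity recovered from the distributions
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < 9; ++i) {
        float fi = f[n * 9 + i];
        rho += fi; ux += fi * cx[i]; uy += fi * cy[i];
    }
    ux /= rho; uy /= rho;

    // BGK relaxation toward the local equilibrium distribution
    float usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        float cu  = cx[i] * ux + cy[i] * uy;
        float feq = w[i] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
        fNew[n * 9 + i] = f[n * 9 + i] - (f[n * 9 + i] - feq) / tau;
    }
}
```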

Workload Characteristics-based L1 Data Cache Switching-off Mechanism for GPUs

  • Do, Thuan Cong;Kim, Gwang Bok;Kim, Cheol Hong
    • 한국컴퓨터정보학회논문지 / Vol. 23, No. 10 / pp.1-9 / 2018
  • Modern graphics processing units (GPUs) have become one of the most attractive platforms for exploiting high thread-level parallelism with the support of programming tools such as CUDA and OpenCL. Recent GPUs have adopted a cache hierarchy to support irregular memory access patterns; however, the L1 data cache (L1D) exhibits poor efficiency in the GPU. This paper shows that the L1D does not always affect applications positively in terms of performance and energy efficiency; for many applications, using the L1D even harms GPU performance. Our proposed technique exploits the characteristics of the currently executing application to predict the performance impact of the L1D on the GPU, and then decides whether to keep using the cache for that application or not. Our experimental results show that the proposed technique improves GPU performance by 9.4% and saves up to 52.1% of the power consumption in the L1D.
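
The switch-off mechanism itself is a hardware proposal, but CUDA already exposes a software analogue that makes the idea concrete: per-load cache operators. A hypothetical streaming kernel using the `__ldcg()` intrinsic (cache at L2 only, skipping L1; compiling with `-Xptxas -dlcm=cg` applies the same policy to all global loads) might look like this:

```cuda
// Streaming copy that bypasses the L1 data cache: each element is touched
// exactly once, so caching it in L1 would only pollute the cache -- the
// situation the paper detects dynamically. Requires compute capability 3.5+.
__global__ void streamCopy(const float *__restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldcg(&in[i]);   // .cg load: cache in L2 only, skip L1
}
```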

High Speed SD-OCT System Using GPU Accelerated Mode for in vivo Human Eye Imaging

  • Cho, Nam Hyun;Jung, Unsang;Kim, Suhwan;Jung, Woonggyu;Oh, Junghwan;Kang, Hyun Wook;Kim, Jeehyun
    • Journal of the Optical Society of Korea / Vol. 17, No. 1 / pp.68-72 / 2013
  • We developed an SD-OCT (Spectral Domain Optical Coherence Tomography) system which uses a GPU (Graphics Processing Unit) for processing. The image size from the SD-OCT system is 1024×512 and the speed is 110 frames/sec in real time. K-domain linearization, FFT (Fast Fourier Transform), and log scaling were included in the GPU processing. The signal processing time was about 62 ms using a CPU and 1.6 ms using a GPU, which is 39 times faster. We performed an in-vivo retinal scan and reconstructed a 3D visualization based on C-scan images. As a result, there were minimal motion artifacts, and we confirmed that tomograms of blood vessels, the optic nerve, and the optic disc are clearly identified. According to the results of this study, this SD-OCT can be applied to real-time 3D display technology, particularly as an auxiliary instrument for eye surgery in ophthalmology.
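
The abstract gives only the pipeline stages, but the FFT stage maps directly onto cuFFT's batched 1-D transforms. A minimal sketch under those assumptions (1024-sample spectral lines, 512 A-scans per frame, matching the quoted image size; k-linearization and log scaling left as noted):

```cuda
#include <cufft.h>

// Transform one SD-OCT frame: 512 spectral lines of 1024 samples each,
// processed as a single batched 1-D FFT on the GPU.
void processFrame(cufftComplex *d_spectra /* device buffer, 1024*512 */)
{
    const int lineLen = 1024, numLines = 512;

    cufftHandle plan;
    cufftPlan1d(&plan, lineLen, CUFFT_C2C, numLines);   // batched 1-D FFT
    cufftExecC2C(plan, d_spectra, d_spectra, CUFFT_FORWARD);
    cufftDestroy(plan);
    // A small follow-up kernel would take 20*log10(|X|) per sample to
    // produce the displayed B-scan intensity image.
}
```

In a real-time system the plan would of course be created once and reused across frames rather than rebuilt per call.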

A Study on Performance Improvement of Distributed Computing Framework Using GPU

  • 송주영;공용준;심탁길;신의섭;성기진
    • 한국정보처리학회:학술대회논문집 / 한국정보처리학회 2012 Spring Conference / pp.499-502 / 2012
  • With the arrival of the big data analytics era, there is a growing demand for solving problems that combine the characteristics of large data volumes and compute-intensive operations. For large-scale data processing, various distributed file systems and distributed/parallel computing technologies are already widely used, and compute-intensive processing is also becoming commonplace thanks to advances in GPGPU technology. However, handling problems with both characteristics requires resolving many constraints. As an alternative, this paper proposes a scheme that couples the distributed computing framework Hadoop MapReduce with CUDA, NVIDIA's GPU parallel computing architecture, and presents the performance improvement obtained when the scheme is applied to dense matrix operations.
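
The paper applies the Hadoop-CUDA combination to dense matrix operations. As a hypothetical sketch of the GPU side, the kind of kernel a map task could hand its partition to (invoked from Java, e.g., via JNI; the abstract does not specify the bridging mechanism) is a plain dense multiply:

```cuda
// Naive dense matrix multiply C = A * B for n x n row-major matrices,
// one output element per thread. Tiling/shared memory omitted for brevity.
__global__ void matMul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}
```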

A Design and Implementation of Software Defined Radio for Rapid Prototyping of GNSS Receiver

  • Park, Kwi Woo;Yang, Jin-Mo;Park, Chansik
    • Journal of Positioning, Navigation, and Timing / Vol. 7, No. 4 / pp.189-203 / 2018
  • In this paper, a Software Defined Radio (SDR) architecture was designed and implemented for rapid prototyping of GNSS receivers. The proposed SDR can receive various GNSS and direct-sequence spread spectrum (DSSS) signals without software modification through expanded input parameters containing information about the desired signal. The input parameters include code information, center frequency, message format, etc. To receive various signals under parameter control, a correlator, a data-bit extractor, and a receiver channel were designed around the expanded input parameters. In the navigation signal processing, the pseudorange was measured based on Coordinated Universal Time (UTC), and the appropriate navigation message decoder was selected by the message format given in the input parameters, so that the receiver position can be calculated even when the SDR is set up for various GNSS combinations. To validate the proposed SDR, the software was implemented in C++ and GPU-based CUDA C with a USRP front-end. Experiments confirmed that changing the input parameters allows GPS, GLONASS, and BDS satellite signals to be received. The position precision of the implemented SDR was below 5 m (circular error probable, CEP) for all scenarios, which means the implemented SDR operated normally. The implemented SDR can serve a variety of fields by allowing prototyping of various GNSS signals simply by changing the input parameters.
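
As an illustration of the parameter-driven GPU correlator described above, here is a hypothetical CUDA sketch in which each thread tests one code-phase hypothesis. `prnChip()` stands in for whichever replica generator (GPS C/A, GLONASS, BDS, ...) the input parameters select; carrier wipe-off and the quadrature channel are omitted.

```cuda
// Stand-in replica generator: a real receiver would produce the PRN chip
// sequence chosen by the input parameters (GPS C/A, GLONASS, BDS, ...).
__device__ float prnChip(int chipIdx) { return (chipIdx & 1) ? 1.0f : -1.0f; }

// One thread per code-phase hypothesis: correlate the incoming samples
// against the local replica and report the correlation power.
__global__ void correlate(const float *samples, int numSamples,
                          float codeFreq, float sampleFreq, float *power)
{
    int phase = blockIdx.x * blockDim.x + threadIdx.x;  // code-phase bin
    float acc = 0.0f;
    for (int s = 0; s < numSamples; ++s) {
        int chip = (int)(s * codeFreq / sampleFreq) + phase;  // replica index
        acc += samples[s] * prnChip(chip);
    }
    power[phase] = acc * acc;   // squared in-phase correlation
}
```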

Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

  • Choi, HyeonSeong;Kim, Youngrang;Lee, Jaehwan;Kim, Yoonhee
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 3 / pp.911-931 / 2021
  • Recently, most cloud services use the Docker container environment to provide their services. However, there has been no research evaluating the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the performance of the parameter-server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze two separate multi-GPU resource allocation policies: allocating a single GPU to each Docker container, and allocating multiple GPUs to each Docker container. We also experiment with the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open-source MPI libraries, and NCCL, which is NVIDIA's collective communication library for multi-GPU settings. In the parameter-server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. Also, we show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to other libraries.
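
For concreteness, a minimal single-process sketch of the NCCL All-reduce pattern the paper benchmarks, summing a buffer across four local GPUs. Error checking is omitted, and the container placement the paper varies is orthogonal to this API usage.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main()
{
    const int nGPUs = 4, count = 1 << 20;        // 1M floats per GPU
    ncclComm_t comms[4];
    int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comms, nGPUs, devs);         // one rank per local GPU

    float *buf[4];
    cudaStream_t streams[4];
    for (int i = 0; i < nGPUs; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclGroupStart();                            // fuse the per-GPU calls
    for (int i = 0; i < nGPUs; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nGPUs; ++i) {            // wait and clean up
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```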

GPU-Based ECC Decode Unit for Efficient Massive Data Reception Acceleration

  • Kwon, Jisu;Seok, Moon Gi;Park, Daejin
    • Journal of Information Processing Systems / Vol. 16, No. 6 / pp.1359-1371 / 2020
  • When transmitting and receiving large amounts of data, reliable communication is crucial for the normal operation of a device and for preventing abnormal operations caused by errors. Therefore, this paper assumes that an error correction code (ECC) that can detect and correct errors by itself is used in an environment where massive data is received sequentially. Because an embedded system has limited resources, such as a low-performance processor or a small memory, it requires efficient operation of applications. In this paper, we propose an accelerated ECC-decoding technique that uses the graphics processing unit (GPU) built into the embedded system when receiving a large amount of data. In the matrix-vector multiplication that underlies the Hamming code used for the ECC operation, the matrix is expressed in compressed sparse row (CSR) format and a sparse matrix-vector product is used. The multiplication is performed in a GPU kernel, and the Hamming code computation is also accelerated so that the ECC operation can be performed in parallel. The proposed technique is implemented with CUDA on a GPU-embedded target board, the NVIDIA Jetson TX2, and compared with the execution time on the CPU.
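
A hypothetical sketch of the CSR-based sparse product described above: because a Hamming parity-check matrix contains only 0/1 entries and arithmetic is over GF(2), each row's dot product reduces to XORing the codeword bits selected by the row's column indices.

```cuda
// Syndrome computation s = H * c over GF(2), with the parity-check matrix H
// stored in CSR form (rowPtr/colIdx only; the values are implicitly 1).
// One thread per parity-check row.
__global__ void hammingSyndrome(const int *rowPtr, const int *colIdx,
                                const unsigned char *codeword,
                                unsigned char *syndrome, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    unsigned char s = 0;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        s ^= codeword[colIdx[j]];   // mod-2 accumulation is just XOR
    syndrome[row] = s;              // nonzero syndrome => error detected
}
```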

Development of Stair Climbing Robot for Delivery Based on Deep Learning

  • 문기일;이승현;추정필;오연우;이상순
    • 반도체디스플레이기술학회지 / Vol. 21, No. 4 / pp.121-125 / 2022
  • This paper deals with the development of a deep-learning-based robot that recognizes various types of stairs and performs the mission of climbing to a target floor. The overall motion sequence of the robot runs on the ROS robot operating system, and the stair shapes required for the motion sequence are detected through fast object recognition using YOLOv4 with CUDA-accelerated computation. Using ROS installed on a Jetson Nano, a system was built that supports communication between heterogeneous hardware boards, an Arduino DUE and an OpenCM 9.04, and controls the robot's movement by consolidating the received sensor data. In addition, the web server for robot control was built as a ROS web server, and the flow chart and basic ROS communication were designed so that the robot can be controlled from a computer or a smartphone via message passing.