• Title/Summary/Keyword: GPGPU

Search Results: 200

Spark Framework Based on a Heterogenous Pipeline Computing with OpenCL (OpenCL을 활용한 이기종 파이프라인 컴퓨팅 기반 Spark 프레임워크)

  • Kim, Daehee;Park, Neungsoo
    • The Transactions of The Korean Institute of Electrical Engineers
    • /
    • v.67 no.2
    • /
    • pp.270-276
    • /
    • 2018
  • Apache Spark is one of the high-performance in-memory computing frameworks for big-data processing. Recently, general-purpose computing on graphics processing units (GPGPU) has been adopted in the Apache Spark framework to improve its performance. Previous Spark-GPGPU frameworks focus on overcoming the implementation difficulties that result from the differences between the computation environments of GPGPU and the Spark framework. In this paper, we propose a Spark framework based on heterogeneous pipeline computing with OpenCL to further improve performance. The proposed framework overlaps the Java-to-native memory copies on the CPU with the CPU-GPU communications (DMA) and the GPU kernel computations to hide CPU idle time. In addition, the CPU-GPU communication buffers are implemented as switching dual buffers, which shrink the mapped memory region and thereby reduce the memory-mapping overhead. Experimental results showed that the proposed Spark framework based on heterogeneous pipeline computing with OpenCL was up to 2.13 times faster than the previous Spark framework using OpenCL.
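
  The core of the pipeline is overlapping the host-side copies with DMA and kernel execution through switching dual buffers. The paper implements this inside Spark with OpenCL; the sketch below only illustrates the same dual-buffer overlap with CUDA streams and a toy kernel, so the chunk size, buffer layout, and the square kernel are illustrative assumptions rather than the authors' code.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Toy kernel standing in for a map stage offloaded to the GPU.
    __global__ void square(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];
    }

    int main() {
        const int chunk = 1 << 20, nChunks = 8;
        float *hIn, *hOut;                               // pinned host staging buffers
        cudaMallocHost(&hIn,  sizeof(float) * chunk * nChunks);
        cudaMallocHost(&hOut, sizeof(float) * chunk * nChunks);
        for (int i = 0; i < chunk * nChunks; ++i) hIn[i] = (float)i;

        float *dIn[2], *dOut[2];                         // switching dual device buffers
        cudaStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&dIn[b],  sizeof(float) * chunk);
            cudaMalloc(&dOut[b], sizeof(float) * chunk);
            cudaStreamCreate(&s[b]);
        }

        for (int c = 0; c < nChunks; ++c) {
            int b = c & 1;                               // alternate buffers/streams
            // While stream b copies and computes chunk c, the other stream is
            // still busy with chunk c-1, so DMA and kernel work overlap.
            cudaMemcpyAsync(dIn[b], hIn + (size_t)c * chunk,
                            sizeof(float) * chunk, cudaMemcpyHostToDevice, s[b]);
            square<<<(chunk + 255) / 256, 256, 0, s[b]>>>(dIn[b], dOut[b], chunk);
            cudaMemcpyAsync(hOut + (size_t)c * chunk, dOut[b],
                            sizeof(float) * chunk, cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        printf("hOut[3] = %.1f\n", hOut[3]);             // expect 9.0
        return 0;
    }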

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun;Choi, Minsu;Kim, Kyung Ki
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.14 no.4
    • /
    • pp.391-406
    • /
    • 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support to handle irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology to semi-automatically find the best mapping of the memory accesses present in a serial loop nest to the underlying data-parallel architecture, based on a comprehensive static memory access pattern analysis. To that end we present a simple, yet powerful, mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work group size within a large design space. To evaluate the effectiveness of our methodology, we report execution speedups for selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language, OpenCL, targeting the NVIDIA GT200 architecture.
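
  The key observation is that the thread-to-data mapping chosen for a serial loop nest decides whether a warp's memory accesses coalesce. The paper's experiments use OpenCL on the GT200; the CUDA fragment below is only a minimal illustration of the access-pattern contrast the methodology reasons about, with a row-major matrix and the two mappings written out by hand.

    #include <cuda_runtime.h>

    // Matrix A is stored row-major, dimensions N x N.

    // Uncoalesced mapping: thread i walks row i, so at each loop step the
    // threads of a warp touch addresses N floats apart (one memory
    // transaction per thread).
    __global__ void rowSumUncoalesced(const float* A, float* out, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float s = 0.f;
        for (int j = 0; j < N; ++j) s += A[i * N + j];
        out[i] = s;
    }

    // Coalesced mapping: thread i walks column i, so at each loop step the
    // threads of a warp touch consecutive addresses (one transaction per warp).
    __global__ void colSumCoalesced(const float* A, float* out, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float s = 0.f;
        for (int j = 0; j < N; ++j) s += A[j * N + i];
        out[i] = s;
    }

  The two kernels compute row sums and column sums respectively; choosing which loop index becomes the thread index (or, equivalently, transposing the data layout) is exactly the kind of mapping decision the methodology automates.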

A Benchmark Suite for Data Race Detection Technique in GPGPU Programs (GPGPU 프로그램의 자료경합 탐지기법을 위한 벤치마크 모음)

  • Lee, Keonpyo;Choi, Eu-Teum;Jun, Yong-Kee
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2019.01a
    • /
    • pp.7-8
    • /
    • 2019
  • A data race is a concurrency error that can occur when two or more threads access the same shared memory without proper synchronization and at least one of the accesses is a write. Data races cause nondeterministic results that the programmer did not intend, and in programs that require high reliability, such as avionics software, they can produce fatal errors leading to loss of life and property. Data race detection techniques are used to detect and fix such problems in advance. However, data races in GPGPU programs involve a more complex execution structure than in CPU concurrent programs, with many variables such as the thread and memory hierarchy, scheduling, and synchronization methods, so applying a detection technique to real-world programs and validating it while accounting for these variables takes considerable effort. This paper presents a benchmark suite consisting of synthetic programs with four patterns that represent data races in real-world programs, in which the thread and memory hierarchy, thread structure, memory usage, and synchronization method can be specified at execution time.
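
  As a minimal illustration of the kind of pattern such a benchmark exercises (not one of the paper's four patterns), the CUDA program below races on a shared-memory cell because the barrier between the write and the reads is missing; the kernel and buffer names are illustrative.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Classic GPGPU data race: threads in a block write and then read a
    // shared-memory cell without a barrier in between, so each read may
    // observe either the old or the new value depending on scheduling.
    __global__ void sharedMemRace(int* out) {
        __shared__ int cell;
        if (threadIdx.x == 0) cell = 1;      // write by thread 0
        // __syncthreads();                  // the missing barrier that would fix the race
        out[threadIdx.x] = cell;             // racy read by every thread
    }

    int main() {
        int* d;
        cudaMalloc(&d, 64 * sizeof(int));
        cudaMemset(d, 0, 64 * sizeof(int));
        sharedMemRace<<<1, 64>>>(d);
        int h[64];
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("thread 63 read %d (nondeterministic without the barrier)\n", h[63]);
        cudaFree(d);
        return 0;
    }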


Implementation of handwritten digit recognition CNN structure using GPGPU and Combined Layer (GPGPU와 Combined Layer를 이용한 필기체 숫자인식 CNN구조 구현)

  • Lee, Sangil;Nam, Kihun;Jung, Jun Mo
    • The Journal of the Convergence on Culture Technology
    • /
    • v.3 no.4
    • /
    • pp.165-169
    • /
    • 2017
  • CNN (Convolutional Neural Network) is one of the machine learning algorithms that shows superior performance in image recognition and classification. CNN is structurally simple, but it involves a large amount of computation and takes a lot of time. In this paper we therefore parallelize the convolution layer, the pooling layer, and the fully connected layer, which consume most of the processing time of CNN, using the SIMT (Single Instruction Multiple Thread) structure of GPGPU (General-Purpose computing on Graphics Processing Units). We also improve performance by reducing the number of memory accesses: the output of the convolution layer is used directly by the pooling layer instead of being stored first. We use the MNIST dataset to verify the design and confirm that the proposed CNN structure is 12.38% better than the existing structure.
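
  A minimal CUDA sketch of the combined-layer idea, assuming a 28x28 MNIST input, one 3x3 kernel, and 2x2 max pooling (all sizes illustrative): each thread produces one pooled pixel and evaluates the four underlying convolution outputs on the fly, so no intermediate convolution map is written to memory.

    #include <cuda_runtime.h>

    #define IN_W   28
    #define K      3
    #define CONV_W (IN_W - K + 1)     // 26
    #define POOL_W (CONV_W / 2)       // 13

    // "Combined layer": convolution and 2x2 max pooling fused in one kernel.
    __global__ void convPoolFused(const float* in, const float* weight,
                                  float bias, float* pooled) {
        int px = blockIdx.x * blockDim.x + threadIdx.x;  // pooled column
        int py = blockIdx.y * blockDim.y + threadIdx.y;  // pooled row
        if (px >= POOL_W || py >= POOL_W) return;

        float best = -1e30f;
        for (int dy = 0; dy < 2; ++dy)                   // the 2x2 pooling window
            for (int dx = 0; dx < 2; ++dx) {
                int cx = px * 2 + dx, cy = py * 2 + dy;  // conv-output coordinate
                float acc = bias;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        acc += in[(cy + ky) * IN_W + (cx + kx)] * weight[ky * K + kx];
                best = fmaxf(best, acc);                 // max pooling, no intermediate store
            }
        pooled[py * POOL_W + px] = best;
    }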

Performance Enhancement of Scaling Filter and Transcoder using CUDA (CUDA를 활용한 스케일링 필터 및 트랜스코더의 성능향상)

  • Han, Jae-Geun;Ko, Young-Sub;Suh, Sung-Han;Ha, Soon-Hoi
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.4
    • /
    • pp.507-511
    • /
    • 2010
  • In this paper, we propose to enhance the performance of a software transcoder by using a GPGPU for the scaling filters. Video transcoding is a technique that translates a video file into another video file with a different coding algorithm and/or a different frame size. Its demand increases as more multimedia devices with different specifications coexist in our daily life. Since transcoding is computationally intensive, a software transcoder that runs on a CPU takes a long processing time. In this paper, we achieve significant speed-up by parallelizing the scaling filter on a GPGPU, which provides significantly larger computation power. Through extensive experiments with various video clips of different sizes and with various scaling filter options, we verified that the enhanced transcoder achieves a 36% performance improvement with the default option, and up to 101% with a certain option.
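
  A sketch of the parallelization, assuming a bilinear filter on one 8-bit luma plane with one thread per output pixel; the transcoder supports several scaling filter options, so the bilinear choice and the parameter names here are illustrative.

    #include <cuda_runtime.h>

    // Each thread computes one output pixel, so the whole frame scales in parallel.
    __global__ void scaleBilinear(const unsigned char* src, int srcW, int srcH,
                                  unsigned char* dst, int dstW, int dstH) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= dstW || y >= dstH) return;

        float sx = (x + 0.5f) * srcW / dstW - 0.5f;   // source coordinate
        float sy = (y + 0.5f) * srcH / dstH - 0.5f;
        int x0 = max(0, (int)floorf(sx)), y0 = max(0, (int)floorf(sy));
        int x1 = min(srcW - 1, x0 + 1),   y1 = min(srcH - 1, y0 + 1);
        float fx = fmaxf(0.f, sx - x0),   fy = fmaxf(0.f, sy - y0);

        // Interpolate horizontally on the two neighboring rows, then vertically.
        float top = src[y0 * srcW + x0] * (1 - fx) + src[y0 * srcW + x1] * fx;
        float bot = src[y1 * srcW + x0] * (1 - fx) + src[y1 * srcW + x1] * fx;
        dst[y * dstW + x] = (unsigned char)(top * (1 - fy) + bot * fy + 0.5f);
    }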

A Method for Group Mobility Model Construction and Model Representation from Positioning Data Set Using GPGPU (GPGPU에 기반하는 위치 정보 집합에서 집단 이동성 모델의 도출 기법과 그 표현 기법)

  • Song, Ha Yoon;Kim, Dong Yup
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.6 no.3
    • /
    • pp.141-148
    • /
    • 2017
  • The recent advancement of mobile devices enables users to collect sequences of their positions using positioning technology, and research on positioning and location information is therefore quite active. Individual mobility models based on positioning and time data have already been established, while a group mobility model has not. In this research, we study the group mobility model, an extension of the individual mobility model, and the process of establishing it. Based on previous research that builds a group mobility model from two individual mobility models, a group mobility model combining more than two individual models is established, and the transition pattern of the model is represented as a Markov chain. Considering real applications, the computing time to establish a group mobility model from huge positioning data sets is drastically improved by using GPGPU compared to traditional multicore systems.
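
  A minimal CUDA sketch of the GPGPU step, assuming the positions have already been discretized into state indices: each thread handles one consecutive state pair and accumulates the Markov-chain transition counts with atomic adds. The state encoding and the row normalization (done on the host afterwards to obtain transition probabilities) are illustrative assumptions.

    #include <cuda_runtime.h>

    // counts is an nStates x nStates matrix of transition counters, zeroed
    // before launch; states holds the discretized position sequence.
    __global__ void countTransitions(const int* states, int len,
                                     unsigned int* counts, int nStates) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= len - 1) return;
        int from = states[t], to = states[t + 1];
        atomicAdd(&counts[from * nStates + to], 1u);   // one pair per thread
    }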

A Study of How to Improve Execution Speed of Grabcut Using GPGPU (GPGPU를 이용한 Grabcut의 수행 속도 개선 방법에 관한 연구)

  • Kim, Ji-Hoon;Park, Young-Soo;Lee, Sang-Hun
    • Journal of Digital Convergence
    • /
    • v.12 no.11
    • /
    • pp.379-386
    • /
    • 2014
  • In this paper, we propose a method to efficiently improve the processing speed of the GrabCut algorithm by processing its data on the GPU (Graphics Processing Unit). GrabCut is an algorithm with excellent object detection performance. The existing GrabCut algorithm splits the image into a foreground region and a background region, and each region is assigned to K clusters; the assignment is then repeated until the result gradually improves. The drawback of the GrabCut algorithm is the time consumed by this repeated clustering. We therefore process the repeated clustering operations in parallel using GPGPU (General-Purpose computing on Graphics Processing Units) to effectively improve the processing speed. The proposed method reduced the execution time of the algorithm by about 95.58% on average.
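
  A minimal CUDA sketch of the parallelized clustering step, assuming RGB pixels and K color centroids; the centroid update and the rest of the GrabCut pipeline stay on the host, and the names and the squared-Euclidean metric are illustrative.

    #include <cuda_runtime.h>

    // Each thread assigns one pixel to the nearest of K color clusters.
    __global__ void assignClusters(const float3* pixels, int nPixels,
                                   const float3* centroids, int K, int* label) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nPixels) return;
        float3 p = pixels[i];
        float bestDist = 1e30f;
        int best = 0;
        for (int k = 0; k < K; ++k) {
            float dx = p.x - centroids[k].x;
            float dy = p.y - centroids[k].y;
            float dz = p.z - centroids[k].z;
            float d = dx * dx + dy * dy + dz * dz;     // squared color distance
            if (d < bestDist) { bestDist = d; best = k; }
        }
        label[i] = best;
    }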

The Statistical Analysis of Differential Probability Using GPGPU Technology (GPGPU 기술을 활용한 차분 확률의 통계적 분석)

  • Jo, Eunji;Kim, Seong-Gyeom;Hong, Deukjo;Sung, Jaechul;Hong, Seokhie
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.29 no.3
    • /
    • pp.477-489
    • /
    • 2019
  • In this paper, we experimentally verify the expected differential probability under the Markov cipher assumption and the distribution of the differential probability. First, we validate the expected differential probability of the 6-round lightweight block cipher PRESENT under the Markov cipher assumption by analyzing the empirical differential probability. Second, we demonstrate that even though the expected differential probability under the Markov cipher assumption seems valid, the empirical distribution does not follow the well-known distribution of the differential probability; this result was deduced from 4-round GIFT. Finally, in order to analyze whether the key schedule affects this mismatch, we collect results while changing the XOR positions of the round keys in GIFT. The results show that the key schedule is not the only factor affecting the mismatch. Leveraging GPGPU technology, the data collection process can be performed about 157 times faster than using a CPU only.
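
  A toy CUDA sketch of the data-collection idea: for a fixed input difference, threads enumerate inputs in parallel and atomically build the histogram of output differences. The sketch uses a single PRESENT S-box rather than the paper's 6-round PRESENT or 4-round GIFT, so it only illustrates the parallel counting scheme, not the reported experiments.

    #include <cuda_runtime.h>
    #include <cstdio>

    // The 4-bit PRESENT S-box.
    __constant__ unsigned char SBOX[16] =
        {0xC,0x5,0x6,0xB,0x9,0x0,0xA,0xD,0x3,0xE,0xF,0x8,0x4,0x7,0x1,0x2};

    // Each thread evaluates one input pair (x, x ^ inDiff) and counts the
    // resulting output difference.
    __global__ void diffCount(unsigned int inDiff, unsigned int* hist) {
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x >= 16) return;
        unsigned int outDiff = SBOX[x] ^ SBOX[x ^ inDiff];
        atomicAdd(&hist[outDiff], 1u);
    }

    int main() {
        unsigned int* dHist;
        cudaMalloc(&dHist, 16 * sizeof(unsigned int));
        cudaMemset(dHist, 0, 16 * sizeof(unsigned int));
        diffCount<<<1, 16>>>(0x1, dHist);        // input difference 0x1
        unsigned int h[16];
        cudaMemcpy(h, dHist, sizeof(h), cudaMemcpyDeviceToHost);
        for (int d = 0; d < 16; ++d)
            if (h[d]) printf("outDiff %x : %u / 16\n", d, h[d]);
        cudaFree(dHist);
        return 0;
    }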

Thread Block Scheduling for GPGPU based on Fine-Grained Resource Utilization (상세 자원 이용률에 기반한 병렬 가속기용 스레드 블록 스케줄링)

  • Bahn, Hyokyung;Cho, Kyungwoon
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.22 no.5
    • /
    • pp.49-54
    • /
    • 2022
  • With the recent widespread adoption of general-purpose GPUs (GPGPUs) in cloud systems, maximizing resource utilization through multitasking on a GPGPU has become an important issue. In this article, we show that resource allocation based on classifying workloads as compute-bound or memory-bound is not sufficient with respect to resource utilization, and present a new thread block scheduling policy for GPGPU that makes use of the fine-grained resource utilization of each workload. Unlike previous approaches, the proposed policy reduces scheduling overhead by separating profiling from scheduling, and maximizes resource utilization by co-locating workloads with different bottleneck resources. Through simulations under various virtual machine scenarios, we show that the proposed policy improves GPGPU throughput by 130.6% on average and up to 161.4%.
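
  A host-side sketch of the co-location idea, assuming an earlier profiling pass has attached per-workload compute and memory-bandwidth utilizations: compute-heavy and memory-heavy kernels are greedily paired so that co-scheduled thread blocks stress different bottleneck resources. The data structure and the greedy rule are illustrative assumptions, not the paper's exact policy.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Workload {
        const char* name;
        float computeUtil;   // profiled utilization of compute units (0..1)
        float memUtil;       // profiled utilization of memory bandwidth (0..1)
    };

    int main() {
        std::vector<Workload> pending = {
            {"matmul",    0.9f, 0.3f},
            {"stream",    0.2f, 0.8f},
            {"stencil",   0.5f, 0.7f},
            {"reduction", 0.7f, 0.4f},
        };
        // Sort by (computeUtil - memUtil): compute-heavy kernels at the front,
        // memory-heavy kernels at the back, then pair the two ends.
        std::sort(pending.begin(), pending.end(),
                  [](const Workload& a, const Workload& b) {
                      return (a.computeUtil - a.memUtil) > (b.computeUtil - b.memUtil);
                  });
        for (size_t i = 0, j = pending.size() - 1; i < j; ++i, --j)
            printf("co-locate %s with %s\n", pending[i].name, pending[j].name);
        return 0;
    }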

Efficient Thread Allocation Method of Convolutional Neural Network based on GPGPU (GPGPU 기반 Convolutional Neural Network의 효율적인 스레드 할당 기법)

  • Kim, Mincheol;Lee, Kwangyeob
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology
    • /
    • v.7 no.10
    • /
    • pp.935-943
    • /
    • 2017
  • CNN (Convolutional Neural Network), which among neural network learning algorithms is widely used for image classification and speech recognition, has been continuously developed into ever higher-performance structures, which makes it difficult to use in embedded systems with limited resources. Therefore, we use GPGPU (General-Purpose computing on Graphics Processing Units) with pre-learned weights to perform the computation on the GPU, but limitations still remain. Since CNN performs simple and repetitive operations, the computation speed varies greatly depending on how threads are allocated and utilized on the SIMT (Single Instruction Multiple Thread) based GPGPU. To address this, threads that would otherwise be left idle while performing the convolution and pooling operations are instead used for the computation of the following feature maps and kernels, which increases the operation speed.
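
  A minimal CUDA sketch of the allocation idea, assuming a 28x28 input, 3x3 kernels, and 8 output feature maps (all illustrative): the global thread index is flattened over (feature map, row, column), so threads that would otherwise sit idle after finishing one small feature map simply continue with the next one.

    #include <cuda_runtime.h>

    #define IN_W   28
    #define K      3
    #define OUT_W  (IN_W - K + 1)          // 26
    #define N_MAPS 8                       // number of output feature maps

    // One launch covers all feature maps; no thread is reserved per map.
    __global__ void convAllMaps(const float* in, const float* weights,
                                const float* bias, float* out) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int total = N_MAPS * OUT_W * OUT_W;
        if (idx >= total) return;

        int m  = idx / (OUT_W * OUT_W);            // which feature map
        int oy = (idx / OUT_W) % OUT_W;            // output row
        int ox = idx % OUT_W;                      // output column

        const float* w = weights + m * K * K;      // this map's 3x3 kernel
        float acc = bias[m];
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                acc += in[(oy + ky) * IN_W + (ox + kx)] * w[ky * K + kx];
        out[idx] = fmaxf(acc, 0.f);                // ReLU
    }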