• Title/Summary/Keyword: integrated GPU

Search Result 17, Processing Time 0.025 seconds

The Need of Cache Partitioning on Shared Cache of Integrated Graphics Processor between CPU and GPU (내장형 GPU 환경에서 CPU-GPU 간의 공유 캐시에서의 캐시 분할 방식의 필요성)

  • Sung, Hanul;Eom, Hyeonsang;Yeom, HeonYoung
    • KIISE Transactions on Computing Practices
    • /
    • v.20 no.9
    • /
    • pp.507-512
    • /
    • 2014
  • Recently, Distributed computing processing begins using both CPU(Central processing unit) and GPU(Graphic processing unit) to improve the performance to overcome darksilicon problem which cannot use all of the transistors because of the electric power limitation. There is an integrated graphics processor that CPU and GPU share memory and Last level cache(LLC). But, There is no LLC access rules between CPU and GPU, so if GPU and CPU processes run together at the same time, performance of both processes gets worse because of the contention on the LLC. This Paper gives evidence to prove the need of the Cache Partitioning and is mentioned about the cache partitioning design using page coloring to allocate the L3 Cache space only for the GPU process to guarantee GPU process performance.

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.5
    • /
    • pp.133-138
    • /
    • 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, its use is becoming more common, but GPU programming is still a big obstacle in terms of productivity. In particular, due to the difficulty of managing host memory and GPU memory separately, research is being actively conducted in terms of convenience and performance, and various CPU-GPU memory transfer programming methods are suggested. Meanwhile, recently many SoC (System on a Chip) products such as Apple M1 and NVIDIA Tegra that bundle CPU, GPU, and integrated memory into one large silicon package are emerging. In this study, data between CPU and GPU devices are used in such an integrated memory device and performance-related research is conducted during transmission. It shows different characteristics from the existing environment in which the host memory and GPU memory in the CPU are separated. Here, we want to compare performance by CPU-GPU data transmission method in NVIDIA SoC chips, which are integrated memory devices, and NVIDIA SMX-based V100 GPU devices. For the experimental workload for performance comparison, a two-dimensional matrix transposition example frequently used in HPC applications was used. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method for each GPU device, the transfer performance difference between page-locked memory and pageable memory, overall performance comparison, and performance comparison by workload size. Through this experiment, it was confirmed that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC chip by supporting I/O cache consistency.

Memory Delay Comparison between 2D GPU and 3D GPU (2차원 구조 대비 3차원 구조 GPU의 메모리 접근 효율성 분석)

  • Jeon, Hyung-Gyu;Ahn, Jin-Woo;Kim, Jong-Myon;Kim, Cheol-Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.17 no.7
    • /
    • pp.1-11
    • /
    • 2012
  • As process technology scales down, the number of cores integrated into a processor increases dramatically, leading to significant performance improvement. Especially, the GPU(Graphics Processing Unit) containing many cores can provide high computational performance by maximizing the parallelism. In the GPU architecture, the access latency to the main memory becomes one of the major reasons restricting the performance improvement. In this work, we analyze the performance improvement of the 3D GPU architecture compared to the 2D GPU architecture quantitatively and investigate the potential problems of the 3D GPU architecture. In general, memory instructions account for 30% of total instructions, and global/local memory instructions constitutes 60% of total memory instructions. Therefore, the performance of the 3D GPU is expected to be improved significantly compared to the 2D GPU by reducing the delay of memory instructions. However, according to our experimental results, the 3D architecture improves the GPU performance only by 2% compared to the 2D architecture due to the memory bottleneck, since the performance reduction due to memory bottleneck in the 3D GPU architecture increases by 245% compared to the 2D architecture. This paper provides the guideline for suitable memory design by analyzing the efficiency of the memory architecture in 3D GPU architecture.

Accelerating Medical Image Processing on Integrated GPU Using OpenCL (OpenCL을 이용한 내장형 GPU에서의 의학영상처리 가속화)

  • Kim, Beom-Jun;Shin, Byeong-seok
    • Journal of the Korea Computer Graphics Society
    • /
    • v.23 no.2
    • /
    • pp.1-10
    • /
    • 2017
  • A variety of filters are applied to improve the quality of noise and low resolution medical images. This is necessary to reduce the radiation dose of the patient and to improve the utilization of the conventional spherical imaging equipment. In the conventional method, it is common to perform filtering using the CPU of the PC. However, it is difficult to produce results in real time by applying various calculations and filters to high-resolution human images using only the CPU performance of a PC used in a hospital. In this paper, we analyze the structure and performance of Intel integrated GPU in CPU and propose a method to perform image filtering using OpenCL parallel processing function. By applying complex filters with high computational complexity to medical images, high quality images can be generated in real time.

Implementation of Integrated CPU-GPU for Efficient Uniform Memory Access Method and Verification System (CPU-GPU간 긴밀성을 위한 효율적인 공유메모리 접근 방법과 검증 시스템 구현)

  • Park, Hyun-moon;Kwon, Jinsan;Hwang, Tae-ho;Kim, Dong-Sun
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.2
    • /
    • pp.57-65
    • /
    • 2016
  • In this paper, we propose a system for efficient use of shared memory between CPU and GPU. The system, called Fusion Architecture, assures consistency of the shared memory and minimizes cache misses that frequently occurs on Heterogeneous System Architecture or Unified Virtual Memory based systems. It also maximizes the performance for memory intensive jobs by efficient allocation of GPU cores. To test between architectures on various scenarios, we introduce the Fusion Architecture Analyzer, which compares OpenMP, OpenCL, CUDA, and the proposed architecture in terms of memory overhead and process time. As a result, Proposed fusion architectures show that the Fusion Architecture runs benchmarks 55% faster and reduces memory overheads by 220% in average.

Benchmark Results of a Radio Spectrometer Based on Graphics Processing Unit

  • Kim, Jongsoo;Wagner, Jan
    • The Bulletin of The Korean Astronomical Society
    • /
    • v.40 no.2
    • /
    • pp.44.1-44.1
    • /
    • 2015
  • We set up a project to make spectrometers for single dish observations of the Korean VLBI Network (KVN), a new future multi-beam receiver of the ASTE (Atacama Submillimeter Telescope Experiment), and the total power (TP) antennas of the Atacama Large Millimeter/submillimeter Array (ALMA). Traditionally, spectrometers based on ASIC (Application-Specific Integrated circuit) and FPGA (Field-Programmable Gate Array) have been used in radio astronomy. It is, however, that a Graphics Processing Unit (GPU) technology is now viable for spectrometers due to the rapid improvement of its performance. A high-resolution spectrometer should have the following functions: poly-phase filter, data-bit conversion, fast Fourier transform, and complex multiplication. We wrote a program based on CUDA (Compute Unified Device Architecture) for a GPU spectrometer. We measured its performance using two GPU cards, Titan X and K40m, from NVIDIA. A non-optimized GPU code can process a data stream of around 2 GHz bandwidth, which is enough for the KVN spectrometer and promising for the ASTE and ALMA TP spectrometers.

  • PDF

CPU-GPU2 Trigeneous Computing for Iterative Reconstruction in Computed Tomography

  • Oh, Chanyoung;Yi, Youngmin
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.5 no.4
    • /
    • pp.294-301
    • /
    • 2016
  • In this paper, we present methods to efficiently parallelize iterative 3D image reconstruction by exploiting trigeneous devices (three different types of device) at the same time: a CPU, an integrated GPU, and a discrete GPU. We first present a technique that exploits single instruction multiple data (SIMD) architectures in GPUs. Then, we propose a performance estimation model, based on which we can easily find the optimal data partitioning on trigeneous devices. We found that the performance significantly varies by up to 6.23 times, depending on how SIMD units in GPUs are accessed. Then, by using trigeneous devices and the proposed estimation models, we achieve optimal partitioning and throughput, which corresponds to a 9.4% further improvement, compared to discrete GPU-only execution.

Analysis on the Performance Impact of Partitioned LLC for Heterogeneous Multicore Processors (이종 멀티코어 프로세서에서 분할된 공유 LLC가 성능에 미치는 영향 분석)

  • Moon, Min Goo;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.15 no.2
    • /
    • pp.39-49
    • /
    • 2019
  • Recently, CPU-GPU integrated heterogeneous multicore processors have been widely used for improving the performance of computing systems. Heterogeneous multicore processors integrate CPUs and GPUs on a single chip where CPUs and GPUs share the LLC(Last Level Cache). This causes a serious cache contention problem inside the processor, resulting in significant performance degradation. In this paper, we propose the partitioned LLC architecture to solve the cache contention problem in heterogeneous multicore processors. We analyze the performance impact varying the LLC size of CPUs and GPUs, respectively. According to our simulation results, the bigger the LLC size of the CPU, the CPU performance improves by up to 21%. However, the GPU shows negligible performance difference when the assigned LLC size increases. In other words, the GPU is less likely to lose the performance when the LLC size decreases. Because the performance degradation due to the LLC size reduction in GPU is much smaller than the performance improvement due to the increase of the LLC size of the CPU, the overall performance of heterogeneous multicore processors is expected to be improved by applying partitioned LLC to CPUs and GPUs. In addition, if we develop a memory management technique that can maximize the performance of each core in the future, we can greatly improve the performance of heterogeneous multicore processors.

BenchGAD: An Integrated Testing Framework for GPU Accelerated Data-intensive Systems (BenchGAD: GPU 기반 데이터 집약 시스템의 효과적인 테스팅 프레임워크)

  • Gu, Sang-Un;Choi, Byeong-wook;Suh, Young-Kyoon
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2018.05a
    • /
    • pp.317-319
    • /
    • 2018
  • 최근 발생하는 데이터는 대용량이며, 형태가 다양하고 빠르게 생성되는 특징이 있다. 이러한 데이터는 CPU, 인메모리 기반인 기존의 데이터 처리 시스템에서 처리하는데 많은 시간이 소모된다. 이 문제를 해결하기 위해 GPU 기반 데이터 집약 시스템이 출현하기 시작했다. 하지만, 이러한 시스템의 성능을 종합적으로 측정하는 테스트 결과는 시스템마다 다른 기준으로 제공하고 있다. 이에 따라, 개발자 및 사용자는 성능 병목 현상을 탐색하고 해결하는 데 큰 어려움을 겪을 수 있다. 즉, 이러한 다른 기준으로는 개발자 및 사용자가 시스템의 통합적인 성능 비교 분석을 수행하기 힘들다. 이러한 문제를 해결하기 위해서, 본 논문은 원스탑 테스팅 프레임워크인 BenchGAD 를 제안하고자한다.

Low-power Scheduling Framework for Heterogeneous Architecture under Performance Constraint

  • Li, Junke;Guo, Bing;Shen, Yan;Li, Deguang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.5
    • /
    • pp.2003-2021
    • /
    • 2020
  • Today's computer systems are widely integrated with CPU and GPU to achieve considerable performance, but energy consumption of such system directly affects operational cost, maintainability and environmental problem, which has been aroused wide concern by researchers, computer architects, and developers. To cope with energy problem, we propose a task-scheduling framework to reduce energy under performance constraint by rationally allocating the tasks across the CPU and GPU. The framework first collects the estimated energy consumption of programs and performance information. Next, we use above information to formalize the scheduling problem as the 0-1 knapsack problem. Then, we elaborate our experiment on typical platform to verify proposed scheduling framework. The experimental results show that our proposed algorithm saves 14.97% energy compared with that of the time-oriented policy and yields 37.23% performance improvement than that of energy-oriented scheme on average.