• Title/Summary/Keyword: CPU-GPU

Search Result 344, Processing Time 0.028 seconds

Analysis of the CPU/GPU Temperature and Energy Efficiency depending on Executed Applications (응용프로그램 실행에 따른 CPU/GPU의 온도 및 컴퓨터 시스템의 에너지 효율성 분석)

  • Choi, Hong-Jun;Kang, Seung-Gu;Kim, Jong-Myon;Kim, Cheol-Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.17 no.5
    • /
    • pp.9-19
    • /
    • 2012
  • As the clock frequency increases, CPU performance improves continuously. However, power and thermal problems in the CPU become more serious as the clock frequency increases. For this reason, utilizing the GPU to reduce the workload of the CPU becomes one of the most popular methods in recent high-performance computer systems. The GPU is a specialized processor originally designed for graphics processing. Recently, the technologies such as CUDA which utilize the GPU resources more easily become popular, leading to the improved performance of the computer system by utilizing the CPU and GPU simultaneously in executing various kinds of applications. In this work, we analyze the temperature and the energy efficiency of the computer system where the CPU and the GPU are utilized simultaneously, to figure out the possible problems in upcoming high-performance computer systems. According to our experimentation results, the temperature of both CPU and GPU increase when the application is executed on the GPU. When the application is executed on the CPU, CPU temperature increases whereas GPU temperature remains unchanged. The computer system shows better energy efficiency by utilizing the GPU compared to the CPU, because the throughput of the GPU is much higher than that of the CPU. However, the temperature of the system tends to be increased more easily when the application is executed on the GPU, because the GPU consumes more power than the CPU.

A Performance Study on CPU-GPU Data Transfers of NVIDIA Tegra and Tesla GPUs (NVIDIA Tegra와 Tesla GPU에서의 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2021.11a
    • /
    • pp.39-42
    • /
    • 2021
  • 최근 HPC, 인공지능에서 GPU 성능이 향상되면서 사용이 보편화되고 있지만 GPU 프로그래밍은 난이도 측면에서 여전히 큰 장애물이다. 특히 호스트(host) 메모리와 GPU 메모리를 따로 관리해야 하는 어려움 때문에 편의성과 성능 측면에서 연구가 활발히 진행되고 있으며, 다양한 CPU-GPU 메모리 전송프로그래밍 방법들이 제시되고 있다. 본 연구는 NVIDIA Tegra 장치들과 NVIDIA SMX 기반 V100 GPU 카드에서 CPU-GPU 데이터 전송 기법별로 성능비교를 하고자 한다. 특히 NVIDIA Tegra 장치는 CPU와 GPU 통합메모리를 제공하고 있어서 CPU-GPU 메모리 전송방법의 관점에서 기존 GPU 장치와 다른 성능 특징을 보여준다. 성능비교를 위한 실험 워크로드는 HPC 응용프로그램에서 빈번하게 사용하는 2차원 행렬 전치 예제를 사용하였다. 실험을 통해 각 GPU 장치별로 CPU-GPU 메모리 전송 방법에 따른 GPU 커널 성능차이, 페이지 잠긴 메모리와 페이지 가능 메모리의 전송 성능차이, 마지막으로 전체 성능비교를 하였다.

A Performance Study on CPU-GPU Data Transfers of Unified Memory Device (통합메모리 장치에서 CPU-GPU 데이터 전송성능 연구)

  • Kwon, Oh-Kyoung;Gu, Gibeom
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.5
    • /
    • pp.133-138
    • /
    • 2022
  • Recently, as GPU performance has improved in HPC and artificial intelligence, its use is becoming more common, but GPU programming is still a big obstacle in terms of productivity. In particular, due to the difficulty of managing host memory and GPU memory separately, research is being actively conducted in terms of convenience and performance, and various CPU-GPU memory transfer programming methods are suggested. Meanwhile, recently many SoC (System on a Chip) products such as Apple M1 and NVIDIA Tegra that bundle CPU, GPU, and integrated memory into one large silicon package are emerging. In this study, data between CPU and GPU devices are used in such an integrated memory device and performance-related research is conducted during transmission. It shows different characteristics from the existing environment in which the host memory and GPU memory in the CPU are separated. Here, we want to compare performance by CPU-GPU data transmission method in NVIDIA SoC chips, which are integrated memory devices, and NVIDIA SMX-based V100 GPU devices. For the experimental workload for performance comparison, a two-dimensional matrix transposition example frequently used in HPC applications was used. We analyzed the following performance factors: the difference in GPU kernel performance according to the CPU-GPU memory transfer method for each GPU device, the transfer performance difference between page-locked memory and pageable memory, overall performance comparison, and performance comparison by workload size. Through this experiment, it was confirmed that the NVIDIA Xavier can maximize the benefits of integrated memory in the SoC chip by supporting I/O cache consistency.

The Need of Cache Partitioning on Shared Cache of Integrated Graphics Processor between CPU and GPU (내장형 GPU 환경에서 CPU-GPU 간의 공유 캐시에서의 캐시 분할 방식의 필요성)

  • Sung, Hanul;Eom, Hyeonsang;Yeom, HeonYoung
    • KIISE Transactions on Computing Practices
    • /
    • v.20 no.9
    • /
    • pp.507-512
    • /
    • 2014
  • Recently, Distributed computing processing begins using both CPU(Central processing unit) and GPU(Graphic processing unit) to improve the performance to overcome darksilicon problem which cannot use all of the transistors because of the electric power limitation. There is an integrated graphics processor that CPU and GPU share memory and Last level cache(LLC). But, There is no LLC access rules between CPU and GPU, so if GPU and CPU processes run together at the same time, performance of both processes gets worse because of the contention on the LLC. This Paper gives evidence to prove the need of the Cache Partitioning and is mentioned about the cache partitioning design using page coloring to allocate the L3 Cache space only for the GPU process to guarantee GPU process performance.

Evaluation of the Data Migration between CPU Memory and GPU Memory for a NVIDIA Pascal GPU Using Unified Memory (통합 메모리를 사용하는 NVIDIA 파스칼 GPU에서의 CPU 메모리와 GPU 메모리 간 데이터 통신 분석)

  • Shin, Philkyue;Hong, Seongsoo
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2018.07a
    • /
    • pp.7-10
    • /
    • 2018
  • 통합 메모리는 CPU 메모리와 GPU 메모리 간의 데이터 통신을 개발자에게 투명하게 내재적으로 수행하는 소프트웨어 런타임 환경으로 개발자에게 CPU 메모리와 GPU 메모리가 통합된 하나의 메모리로 보이게 해준다. 통합 메모리는 장점에도 불구하고 아직 널리 사용되지 못하고 있는데 그 이유는 내재적으로 수행되는 데이터 통신의 오버헤드가 큰 것으로 알려져 있기 때문이다. 하지만 이 데이터 통신이 구체적으로 어떻게 이루어지고 오버헤드는 어떻게 발생하는지 분석한 연구는 아직 존재하지 않는다. 우리는 NVIDIA 사의 최신 GPU 마이크로아키텍처 중 하나인 파스칼을 사용하는 GPU를 대상으로 하여, 통합 메모리를 사용할 시 데이터 통신이 이루어지는 조건과 GPU 응용의 수행시간에 데이터 통신이 끼치는 영향을 실험을 통해 분석한다. 실험 결과 통합 메모리의 오버헤드는 두 가지 원인 때문에 발생한다. 첫째, 통합 메모리를 사용하면 CPU 또는 GPU가 데이터에 접근할 때마다 이 데이터는 CPU 또는 GPU 메모리로 옮겨지고 옮겨진 데이터는 제거된다. 따라서 재사용할 데이터도 제거되어 추가적인 데이터 통신이 발생하고, 이 데이터 통신의 지연시간은 GPU 응용의 수행시간에 더해진다. 둘째, 통합 메모리를 사용하면 데이터 통신과 커널들이 서로 다른 스트림에 할당되어도 동시에 수행되지 못한다. 따라서 GPU 응용의 수행시간은 동시에 수행되던 데이터 통신과 커널의 수행시간만큼 증가한다.

  • PDF

Fast and Efficient Implementation of Neural Networks using CUDA and OpenMP (CUDA와 OPenMP를 이용한 빠르고 효율적인 신경망 구현)

  • Park, An-Jin;Jang, Hong-Hoon;Jung, Kee-Chul
    • Journal of KIISE:Software and Applications
    • /
    • v.36 no.4
    • /
    • pp.253-260
    • /
    • 2009
  • Many algorithms for computer vision and pattern recognition have recently been implemented on GPU (graphic processing unit) for faster computational times. However, the implementation has two problems. First, the programmer should master the fundamentals of the graphics shading languages that require the prior knowledge on computer graphics. Second, in a job that needs much cooperation between CPU and GPU, which is usual in image processing and pattern recognition contrary to the graphic area, CPU should generate raw feature data for GPU processing as much as possible to effectively utilize GPU performance. This paper proposes more quick and efficient implementation of neural networks on both GPU and multi-core CPU. We use CUDA (compute unified device architecture) that can be easily programmed due to its simple C language-like style instead of GPU to solve the first problem. Moreover, OpenMP (Open Multi-Processing) is used to concurrently process multiple data with single instruction on multi-core CPU, which results in effectively utilizing the memories of GPU. In the experiments, we implemented neural networks-based text extraction system using the proposed architecture, and the computational times showed about 15 times faster than implementation on only GPU without OpenMP.

Analysis on the Performance Impact of Partitioned LLC for Heterogeneous Multicore Processors (이종 멀티코어 프로세서에서 분할된 공유 LLC가 성능에 미치는 영향 분석)

  • Moon, Min Goo;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing
    • /
    • v.15 no.2
    • /
    • pp.39-49
    • /
    • 2019
  • Recently, CPU-GPU integrated heterogeneous multicore processors have been widely used for improving the performance of computing systems. Heterogeneous multicore processors integrate CPUs and GPUs on a single chip where CPUs and GPUs share the LLC(Last Level Cache). This causes a serious cache contention problem inside the processor, resulting in significant performance degradation. In this paper, we propose the partitioned LLC architecture to solve the cache contention problem in heterogeneous multicore processors. We analyze the performance impact varying the LLC size of CPUs and GPUs, respectively. According to our simulation results, the bigger the LLC size of the CPU, the CPU performance improves by up to 21%. However, the GPU shows negligible performance difference when the assigned LLC size increases. In other words, the GPU is less likely to lose the performance when the LLC size decreases. Because the performance degradation due to the LLC size reduction in GPU is much smaller than the performance improvement due to the increase of the LLC size of the CPU, the overall performance of heterogeneous multicore processors is expected to be improved by applying partitioned LLC to CPUs and GPUs. In addition, if we develop a memory management technique that can maximize the performance of each core in the future, we can greatly improve the performance of heterogeneous multicore processors.

GPU Based Incremental Connected Component Processing in Dynamic Graphs (동적 그래프에서 GPU 기반의 점진적 연결 요소 처리)

  • Kim, Nam-Young;Choi, Do-Jin;Bok, Kyoung-Soo;Yoo, Jae-Soo
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.6
    • /
    • pp.56-68
    • /
    • 2022
  • Recently, as the demand for real-time processing increases, studies on a dynamic graph that changes over time has been actively done. There is a connected components processing algorithm as one of the algorithms for analyzing dynamic graphs. GPUs are suitable for large-scale graph calculations due to their high memory bandwidth and computational performance. However, when computing the connected components of a dynamic graph using the GPU, frequent data exchange occurs between the CPU and the GPU during real graph processing due to the limited memory of the GPU. The proposed scheme utilizes the Weighted-Quick-Union algorithm to process large-scale graphs on the GPU. It supports fast connected components computation by applying the size to the connected component label. It computes the connected component by determining the parts to be recalculated and minimizing the data to be transmitted to the GPU. In addition, we propose a processing structure in which the GPU and the CPU execute asynchronously to reduce the data transfer time between GPU and CPU. We show the excellence of the proposed scheme through performance evaluation using real dataset.

A Study of solving the bottleneck between CPU and GPU (CPU와 GPU 간의 병목현상 해결에 관한 연구)

  • Lee, Jin-Ho;Cho, Han-Jin
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2020.07a
    • /
    • pp.3-4
    • /
    • 2020
  • 본 논문에서는 컴퓨팅 시스템에서 발생 할 수 있는, CPU와 GPU 간의 병목현상을 개선방안으로 통신 방식에 대해 비교 분석하였다. CPU와 GPU 간에 발생할 수 있는 병목현상의 해결방법으로, 두 구성 요소 간의 성능 구성 외의 통신방식을 개선 방법으로 PCIe와 NVLink를 비교하고, 성능 극대화 방안을 모색한다. NVLink 연결 방식의 통신 방식을 변경하였을 때 성능을 비교해 봄으로써 병목현상 해소 및 성능 향상에 우수한 결과를 낼 수 있다.

  • PDF

Parallel Processing of Multi-Core Processor and GPUs in Projection Step for Efficient Fluid Simulation (효율적인 유체 시뮬레이션을 위한 투영 단계에서의 멀티 코어 프로세서와 그래픽 프로세서의 병렬처리)

  • Kim, Sun-Tae;Jung, Hwi-Ryong;Hong, Jeong-Mo
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.6
    • /
    • pp.48-54
    • /
    • 2013
  • In these days, the state-of-art technologies employ the heterogeneous parallelization of CPU and GPU for fluid simulations in the field of computer graphics. In this paper, we present a novel CPU-GPU parallel algorithm that solves projection step of fluid simulation more efficiently than existing sequential CPU-GPU processing. Fluid simulation that requires high computational resources can be carried out efficiently by the proposed method.