• Title/Summary/Keyword: GPU Program


A Tool for On-the-fly Repairing of Atomicity Violation in GPU Program Execution

  • Lee, Keonpyo;Lee, Seongjin;Jun, Yong-Kee
    • Journal of the Korea Society of Computer and Information / v.26 no.9 / pp.1-12 / 2021
  • In this paper, we propose a tool called ARCAV (Atomatic Recovery of CUDA Atomicity violation) that automatically repairs atomicity violations in GPU (Graphics Processing Unit) programs. ARCAV monitors every barrier and memory access so that actual memory writes occur only at the end of the barrier region, or so that the program re-executes the barrier region. Existing methods only detect atomicity violations in GPU programs rather than repairing them, because GPU programs generally lack the lock and sleep instructions needed for repair. ARCAV is designed for the GPU execution model; it detects and repairs four patterns of atomicity violations that represent real-world cases, and it is independent of the memory hierarchy and thread configuration. Our experiments show that the performance of ARCAV is stable regardless of the number of threads or blocks. The overhead of ARCAV, evaluated on four real-world kernels, is a slowdown of 2.1x of native execution time on average.
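
The violation patterns ARCAV targets arise from unsynchronized read-modify-write sequences on shared data. The following is a minimal CUDA sketch, not taken from the paper, that shows the lost-update pattern on a hypothetical shared counter and the conventional repair with an atomic instruction; ARCAV's own barrier-based mechanism is different and is described in the paper.

// Minimal illustration (not from the ARCAV paper) of an atomicity violation
// on a shared counter, and the conventional fix with an atomic instruction.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void racy_increment(int *counter) {
    // Read-modify-write without atomicity: threads can read the same old
    // value and overwrite each other's updates (a lost-update violation).
    int old = *counter;
    *counter = old + 1;
}

__global__ void atomic_increment(int *counter) {
    // atomicAdd makes the read-modify-write indivisible, so every update
    // is preserved regardless of thread interleaving.
    atomicAdd(counter, 1);
}

int main() {
    int *d_counter, h_counter = 0;
    cudaMalloc(&d_counter, sizeof(int));

    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
    racy_increment<<<64, 256>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy result:   %d (expected %d)\n", h_counter, 64 * 256);

    h_counter = 0;
    cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
    atomic_increment<<<64, 256>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic result: %d (expected %d)\n", h_counter, 64 * 256);

    cudaFree(d_counter);
    return 0;
}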

Implementation of a Path-Integral Monte Carlo Simulation Using the CUDA Programming Environment (CUDA programming environment을 활용한 Path-Integral Monte Carlo Simulation의 구현)

  • Lee, Hwa-Young;Im, Eun-Jin
    • Proceedings of the Korea Society for Industrial Systems Conference / 2009.05a / pp.196-199 / 2009
  • With the increasing computational performance of the Graphics Processing Unit (GPU) and the development and spread of environments for general-purpose programming on the GPU, research applying GPUs to general computation is being actively pursued. GPUs used for such general computation include the nVidia Tesla and the AMD/ATI FireStream. Programming the GPU, a special-purpose computing device, for general computation requires a suitable development environment; the CUDA (Compute Unified Device Architecture) environment developed by nVidia is the development environment provided for programming its own GPUs. Because the CUDA environment is expected to share many characteristics with OpenCL (Open Computing Language), which is being discussed as an open standard for next-generation heterogeneous parallel programming, its importance is not limited to a particular GPU. In this paper, we present the results of parallelizing the Path Integral Monte Carlo method on an nVidia GPU using the CUDA development environment.
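
A CUDA Monte Carlo code of this kind typically gives each thread its own random-number stream and lets threads accumulate samples independently. The sketch below is not the paper's path-integral implementation; it is a generic per-thread sampling pattern using cuRAND, estimating pi only to keep the example self-contained.

// A minimal per-thread Monte Carlo pattern with cuRAND (illustrative only;
// not the paper's path-integral code). Each thread owns an RNG state and
// accumulates its own samples.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void mc_pi(unsigned long long seed, int samples_per_thread,
                      unsigned long long *hits) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);   // independent stream per thread

    unsigned long long local = 0;
    for (int i = 0; i < samples_per_thread; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) ++local;
    }
    atomicAdd(hits, local);              // combine per-thread results
}

int main() {
    const int blocks = 64, threads = 256, per_thread = 4096;
    unsigned long long *d_hits, h_hits = 0;
    cudaMalloc(&d_hits, sizeof(unsigned long long));
    cudaMemcpy(d_hits, &h_hits, sizeof(h_hits), cudaMemcpyHostToDevice);

    mc_pi<<<blocks, threads>>>(1234ULL, per_thread, d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof(h_hits), cudaMemcpyDeviceToHost);

    double total = (double)blocks * threads * per_thread;
    printf("pi ~ %f\n", 4.0 * h_hits / total);
    cudaFree(d_hits);
    return 0;
}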


GP-GPU based Parallelization for Urban Terrain Atmospheric Model CFD_NIMR (도시기상모델 CFD_NIMR의 GP-GPU 실행을 위한 병렬 프로그램의 구현)

  • Kim, Youngtae;Park, Hyeja;Choi, Young-Jeen
    • Journal of Internet Computing and Services / v.15 no.2 / pp.41-47 / 2014
  • In this paper, we implemented a CUDA Fortran parallel program to run the CFD_NIMR model, which simulates air diffusion over urban terrain, on GP-GPUs. A GP-GPU is a graphics processing unit in the form of a PCI card that serves as a general-purpose accelerator, performing a large volume of high-speed calculations at low cost and low power consumption. The GP-GPU version achieves a 15-fold speedup on an Nvidia Tesla C1060 GPU compared with an Intel XEON 2.0 GHz CPU. In addition, the GP-GPU program performs efficiently compared with an MPI parallel program on multiple CPUs. The proposed GP-GPU programming method is expected to be applicable to numerical models with a similar structure.

Implementation of parallel blocked LU decomposition program for utilizing cache memory on GP-GPUs (GP-GPU의 캐시메모리를 활용하기 위한 병렬 블록 LU 분해 프로그램의 구현)

  • Kim, Youngtae;Kim, Doo-Han;Yu, Myoung-Han
    • Journal of Internet Computing and Services / v.14 no.6 / pp.41-47 / 2013
  • GP-GPUs are general-purpose GPUs, originally designed for graphics processing, that perform numerical computation with large numbers of threads. Unlike typical cache memory, GP-GPUs provide cache in the form of shared memory that user programs can access directly. In this research, we implemented a parallel blocked LU decomposition program that utilizes this cache memory on GP-GPUs. The parallel blocked LU decomposition program, written in Nvidia CUDA C, runs 7~8 times faster than a non-blocked LU decomposition program in the same GP-GPU computing environment.
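
The key idiom the abstract refers to, using shared memory as a user-managed cache for blocks of data, is illustrated below with the classic tiled matrix multiply rather than the paper's LU code; the tile size and the assumption that the matrix dimension is a multiple of the tile are choices made for this sketch.

// Classic shared-memory tiling idiom (a tiled matrix multiply), shown only to
// illustrate __shared__ arrays as a user-managed cache; not the paper's
// blocked LU code. Assumes n is a multiple of TILE.
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current tiles.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k) // reuse each loaded element TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with tiles before overwriting
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 512;                 // multiple of TILE by assumption
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    matmul_tiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();           // each C element should equal 2 * n
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}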

Performance Evaluation of the GPU Architecture Executing Parallel Applications (병렬 응용프로그램 실행 시 GPU 구조에 따른 성능 분석)

  • Choi, Hong-Jun;Kim, Cheol-Hong
    • The Journal of the Korea Contents Association / v.12 no.5 / pp.10-21 / 2012
  • The role of the GPU has evolved from graphics-specific processing to general-purpose processing with the development of the unified shader core architecture. In particular, execution methods for general-purpose parallel applications on the GPU have been researched intensively, since the parallel hardware architecture can be utilized efficiently when such applications are executed. However, the current GPU architecture has limitations in executing general-purpose parallel applications, because the GPU is not yet specialized for general-purpose computing. To improve GPU performance for general-purpose parallel applications, the GPU architecture must evolve. In this work, we analyze GPU performance while varying the number of cores and the clock frequency. Our simulation results show that GPU performance improves by up to 125.8% as the number of cores increases and by up to 16.2% as the clock frequency increases. However, the improvement saturates even as the number of cores and the clock frequency continue to increase, because data cannot be supplied to the GPU fast enough due to the limited memory bandwidth. Consequently, to achieve high performance on the GPU, computational resources must be provisioned with the memory bandwidth in mind.
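
The saturation described above, where more cores or a higher clock stop helping once memory bandwidth becomes the bottleneck, can be reasoned about with a simple roofline-style estimate. The numbers in the sketch below are placeholders, not figures from the paper.

/* Back-of-the-envelope roofline estimate (illustrative numbers only):
 * attainable throughput is capped by whichever is smaller, the compute peak
 * or memory bandwidth times arithmetic intensity. */
#include <stdio.h>

int main(void) {
    double peak_gflops   = 1000.0;  /* hypothetical compute peak, GFLOP/s   */
    double bandwidth_gbs = 150.0;   /* hypothetical memory bandwidth, GB/s  */
    double intensity     = 2.0;     /* FLOPs performed per byte transferred */

    double mem_bound  = bandwidth_gbs * intensity;        /* GFLOP/s */
    double attainable = mem_bound < peak_gflops ? mem_bound : peak_gflops;

    printf("attainable: %.1f GFLOP/s (%s-bound)\n", attainable,
           mem_bound < peak_gflops ? "memory" : "compute");
    /* With intensity 2.0 the ceiling is 300 GFLOP/s: adding cores or clock
     * speed raises peak_gflops but leaves this memory-bound limit unchanged. */
    return 0;
}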

Analysis of the CPU/GPU Temperature and Energy Efficiency depending on Executed Applications (응용프로그램 실행에 따른 CPU/GPU의 온도 및 컴퓨터 시스템의 에너지 효율성 분석)

  • Choi, Hong-Jun;Kang, Seung-Gu;Kim, Jong-Myon;Kim, Cheol-Hong
    • Journal of the Korea Society of Computer and Information / v.17 no.5 / pp.9-19 / 2012
  • As the clock frequency increases, CPU performance improves continuously; however, power and thermal problems in the CPU become more serious. For this reason, utilizing the GPU to reduce the workload of the CPU has become one of the most popular approaches in recent high-performance computer systems. The GPU is a specialized processor originally designed for graphics processing. Recently, technologies such as CUDA, which make it easier to utilize GPU resources, have become popular, improving overall system performance by utilizing the CPU and the GPU simultaneously across various kinds of applications. In this work, we analyze the temperature and the energy efficiency of a computer system in which the CPU and the GPU are utilized simultaneously, to identify possible problems in upcoming high-performance computer systems. According to our experimental results, the temperatures of both the CPU and the GPU increase when an application is executed on the GPU; when it is executed on the CPU, the CPU temperature increases while the GPU temperature remains unchanged. The computer system shows better energy efficiency when utilizing the GPU, because the throughput of the GPU is much higher than that of the CPU. However, the system temperature tends to rise more easily when the application is executed on the GPU, because the GPU consumes more power than the CPU.

Analysis on the Active/Inactive Status of Computational Resources for Improving the Performance of the GPU (GPU 성능 저하 해결을 위한 내부 자원 활용/비활용 상태 분석)

  • Choi, Hongjun;Son, Dongoh;Kim, Jongmyon;Kim, Cheolhong
    • The Journal of the Korea Contents Association / v.15 no.7 / pp.1-11 / 2015
  • In recent high-performance computing systems, GPGPU has been widely used to process general-purpose applications as well as graphics applications, since the GPU provides computational resources optimized for massively parallel processing. Unfortunately, GPGPU does not fully exploit the computational resources of the GPU when executing general-purpose applications, because the applications cannot always be optimized for the GPU architecture. We therefore provide a research guideline for improving the performance of computing systems that use GPGPU. To this end, we analyze the factors that degrade GPU performance. To classify their causes clearly, GPU core status is divided into five states: fully active, partially active, idle, memory stall, and GPU core stall. Every state except fully active causes performance degradation. We evaluate the ratio of each state according to the characteristics of the benchmarks to find the specific reasons that degrade GPU performance. According to our simulation results, the partially active, idle, memory stall, and GPU core stall states are induced by computational resource underutilization, low parallelism, high memory request rates, and structural hazards, respectively.
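
For reference, the five-way classification above can be written down as a small data type. The identifiers below are paraphrases chosen for this listing, not names used in the paper or in any simulator.

/* The paper's five GPU-core states restated as a C enum; identifiers are
 * illustrative paraphrases, not names from the paper. */
enum gpu_core_status {
    FULLY_ACTIVE,     /* all resources busy: the only state with no penalty   */
    PARTIALLY_ACTIVE, /* caused by underutilized computational resources      */
    IDLE,             /* caused by low parallelism (too few ready warps/CTAs) */
    MEMORY_STALL,     /* caused by a high rate of memory requests             */
    CORE_STALL        /* caused by structural hazards in the pipeline         */
};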

Acceleration Hardware Technology of 3D Graphics (3D 그래픽스 가속 하드웨어 기술)

  • Cho, S.H.;Park, S.M.;Eum, N.W.
    • Electronics and Telecommunications Trends / v.22 no.5 / pp.69-77 / 2007
  • The remarkable growth of the 3D graphics industry has been built on advances in GPU technology. The GPU has moved beyond the traditional fixed-function pipeline to a programmable form, and its programmability and performance continue to improve steadily. Recent research addresses the imbalance of computational load inside the GPU and explores the use of GPU computing power in other application domains. Industry-standard APIs exist for developing 3D graphics applications on the GPU, and simplified profiles for mobile devices, which select only the essential functions of the desktop APIs, are also being defined. GPUs used in mobile devices are likewise evolving toward programmable architectures, and reducing power consumption is essential for their widespread adoption.

Analysis of GPU Performance and Memory Efficiency according to Task Processing Units (작업 처리 단위 변화에 따른 GPU 성능과 메모리 접근 시간의 관계 분석)

  • Son, Dong Oh;Sim, Gyu Yeon;Kim, Cheol Hong
    • Smart Media Journal / v.4 no.4 / pp.56-63 / 2015
  • Modern GPUs execute massively parallel computation by exploiting many GPU cores. The GPGPU architecture, one approach to exploiting the outstanding computational resources of the GPU, executes general-purpose applications as well as graphics applications effectively. In this paper, we investigate the impact of the number of CTAs (Cooperative Thread Arrays) per SM (Streaming Multiprocessor) on memory efficiency and performance, since analyzing this relation offers guidance to researchers who study the GPU to improve performance. Our simulation results show that most benchmarks improve in performance as the number of CTAs per SM increases. Some benchmarks, however, show no improvement, because the kernel generates only a few CTAs or not enough CTAs execute simultaneously. To classify the performance behavior more precisely, we also analyze the relations between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect our analysis to help improve parallelism and memory efficiency in GPGPU architectures.
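
On real hardware, the number of CTAs (thread blocks) that can reside on one SM for a given kernel can be queried through the CUDA occupancy API. The sketch below shows that query for a hypothetical kernel; the paper's analysis itself is simulator-based and does not rely on this API.

// Querying how many CTAs (thread blocks) of a kernel fit on one SM via the
// CUDA occupancy API. The kernel is a hypothetical placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int block_size = 256;
    int blocks_per_sm = 0;
    // Reports the maximum number of resident blocks (CTAs) per SM for this
    // kernel, given its register/shared-memory usage and the block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, dummy_kernel, block_size, 0 /* dynamic smem bytes */);

    printf("%s: %d CTAs of %d threads per SM (%d SMs total)\n",
           prop.name, blocks_per_sm, block_size, prop.multiProcessorCount);
    return 0;
}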

Three-dimensional Wave Propagation Modeling using OpenACC and GPU (OpenACC와 GPU를 이용한 3차원 파동 전파 모델링)

  • Kim, Ahreum;Lee, Jongwoo;Ha, Wansoo
    • Geophysics and Geophysical Exploration / v.20 no.2 / pp.72-77 / 2017
  • We calculated 3D frequency-domain and Laplace-domain wavefields using time-domain modeling followed by a Fourier or Laplace transform. We adopted OpenACC and the GPU for efficient parallel calculation. OpenACC makes it easy to use GPU accelerators by adding directives to conventional C, C++, and Fortran programs, so one does not have to learn a new GPGPU programming language such as CUDA or OpenCL to use the GPU. An OpenACC program allocates GPU memory, transfers data between the host CPU and GPU devices, and performs GPU operations automatically or according to user-defined directives. In numerical tests, we compared the performance of 3D wave propagation modeling programs using OpenACC and the GPU to that of a single-core CPU implementation. Results for a homogeneous model and the SEG/EAGE salt model show that the OpenACC programs are approximately 53 and 30 times faster, respectively, than the single-core CPU programs.
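
As the abstract notes, OpenACC offloads work by adding directives to ordinary C, C++, or Fortran code. The fragment below is a generic finite-difference-style update written for this listing, not the authors' 3D modeling code; it should compile with an OpenACC-capable compiler such as NVIDIA's nvc with the -acc flag.

/* A generic OpenACC offload pattern in plain C (a simple 1D wave-like update,
 * not the authors' 3D modeling code). The directives let the compiler manage
 * GPU memory and parallelize the loops; no CUDA is written by hand. */
#include <stdio.h>

#define N 1000000
#define STEPS 100

int main(void) {
    static float u[N], u_prev[N], u_next[N];
    for (int i = 0; i < N; ++i) { u[i] = 0.0f; u_prev[i] = 0.0f; }
    u[N / 2] = 1.0f;                       /* initial impulse */
    const float c = 0.5f;                  /* illustrative stability factor */

    /* Keep the arrays on the GPU for all time steps; copy u back at the end. */
    #pragma acc data copy(u) copyin(u_prev) create(u_next)
    for (int step = 0; step < STEPS; ++step) {
        #pragma acc parallel loop
        for (int i = 1; i < N - 1; ++i)
            u_next[i] = 2.0f * u[i] - u_prev[i]
                      + c * (u[i - 1] - 2.0f * u[i] + u[i + 1]);

        #pragma acc parallel loop
        for (int i = 1; i < N - 1; ++i) {  /* rotate time levels */
            u_prev[i] = u[i];
            u[i] = u_next[i];
        }
    }
    printf("u[N/2] after %d steps: %f\n", STEPS, u[N / 2]);
    return 0;
}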