• Title/Summary/Keyword: GPU memory


A Tool for On-the-fly Repairing of Atomicity Violation in GPU Program Execution

  • Lee, Keonpyo; Lee, Seongjin; Jun, Yong-Kee
    • Journal of the Korea Society of Computer and Information / v.26 no.9 / pp.1-12 / 2021
  • In this paper, we propose a tool called ARCAV (Automatic Recovery of CUDA Atomicity Violation) that automatically repairs atomicity violations in GPU (Graphics Processing Unit) programs. ARCAV monitors every barrier and memory access so that actual memory writes occur only at the end of a barrier region, or so that the program executes the barrier region again. Existing methods only detect atomicity violations in GPU programs rather than repairing them, because GPU programs generally do not support the lock and sleep instructions needed for repair. ARCAV is designed for the GPU execution model. It detects and repairs four patterns of atomicity violations that represent real-world cases, and it is independent of the memory hierarchy and thread configuration. Our experiments show that the performance of ARCAV is stable regardless of the number of threads or blocks. The overhead of ARCAV was evaluated on four real-world kernels; its slowdown is 2.1x, on average, relative to native execution time.
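
The abstract describes repairs at the level of barrier regions; no code accompanies it. As background only, the sketch below shows the classic read-modify-write defect that atomicity-violation tools target in CUDA, together with the conventional atomic repair. This is a minimal illustration, not ARCAV's deferred-write mechanism; kernel names and launch sizes are invented for the example.

```cuda
// Hypothetical illustration (not ARCAV's implementation): a classic
// read-modify-write atomicity violation in a CUDA kernel, and the
// conventional repair using an atomic instruction.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void racyIncrement(int *counter) {
    // Violation: every thread performs load -> add -> store
    // non-atomically, so concurrent updates can be lost.
    int v = *counter;   // load
    v = v + 1;          // modify
    *counter = v;       // store (may overwrite another thread's update)
}

__global__ void atomicIncrement(int *counter) {
    // Repaired: the read-modify-write executes as one indivisible
    // hardware operation.
    atomicAdd(counter, 1);
}

int main() {
    int *counter, host = 0;
    cudaMalloc(&counter, sizeof(int));

    cudaMemset(counter, 0, sizeof(int));
    racyIncrement<<<64, 256>>>(counter);
    cudaMemcpy(&host, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy:   %d (expected %d)\n", host, 64 * 256);

    cudaMemset(counter, 0, sizeof(int));
    atomicIncrement<<<64, 256>>>(counter);
    cudaMemcpy(&host, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d (expected %d)\n", host, 64 * 256);

    cudaFree(counter);
    return 0;
}
```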

Analyzing performance imbalance between virtual machines caused by excessive use of GPU memory in RPC-based GPU virtualization environments (RPC 기반 GPU 가상화 환경에서 GPU 메모리의 초과 사용 시 발생하는 가상머신 사이의 성능 불균형 문제 분석)

  • Kang, Jihun; Lee, Jaehak; Gil, Joon-Min
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.113-114 / 2019
  • In cloud environments, GPUs (Graphics Processing Units) are used to support high-performance computation in virtual machines. Because virtual machines use independent schedulers for fairness, virtual machines performing the same work are measured to have equal performance even when performance degradation occurs due to over-use of computing resources. For GPU computation, however, a hardware-based scheduler is used when multiple tasks run, and the hypervisor's First In First Out (FIFO) scheduling of virtual machine I/O means that fairness among virtual machines cannot be guaranteed. In this paper, we measure the performance of virtual machines in an environment where GPU memory is over-used and analyze the problems caused by the resulting performance imbalance.

An Efficient Graph Algorithm Processing Scheme using GPUs with Limited Memory (제한된 메모리를 가진 GPU를 이용한 효율적인 그래프 알고리즘 처리 기법)

  • Song, Sang-ho; Lee, Hyeon-byeong; Choi, Do-jin; Lim, Jong-tae; Bok, Kyoung-soo; Yoo, Jae-soo
    • The Journal of the Korea Contents Association / v.22 no.8 / pp.81-93 / 2022
  • Recently, research has been conducted on processing large-scale graphs using GPUs. To process a large graph on a GPU with limited memory, the graph must be divided into subgraphs, which are then scheduled for processing. In this paper, we propose an efficient graph algorithm processing scheme for GPU environments with limited memory and evaluate its performance. The proposed scheme consists of a graph segmentation method and a differential subgraph scheduling method. The graph segmentation method determines how a large graph is divided into subgraphs so that they can be processed efficiently by the GPU. The differential subgraph scheduling method schedules the subgraphs processed by the GPUs so as to reduce redundant transmission of repeatedly used data between the host and the GPUs. Various performance evaluations demonstrate the superiority of the proposed scheme.
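
No code is given in the abstract. The host-side sketch below illustrates the general idea of differential subgraph scheduling under our own assumptions: subgraphs already resident on the GPU are reused instead of re-transmitted. The `Subgraph` struct, the `ensureResident` helper, and the kernel are hypothetical, and a real scheduler would also have to evict subgraphs when GPU memory fills.

```cuda
// A minimal host-side sketch (not the paper's algorithm) of caching
// subgraphs on a memory-limited GPU to avoid redundant host-to-device
// transfers across iterations of an iterative graph algorithm.
#include <vector>
#include <cuda_runtime.h>

struct Subgraph {
    std::vector<int> edges;  // flattened edge list on the host
    int *d_edges = nullptr;  // device copy; nullptr when not resident
};

// Upload a subgraph only if it is not already resident on the GPU.
static int *ensureResident(Subgraph &sg) {
    if (sg.d_edges == nullptr) {
        size_t bytes = sg.edges.size() * sizeof(int);
        cudaMalloc(&sg.d_edges, bytes);
        cudaMemcpy(sg.d_edges, sg.edges.data(), bytes,
                   cudaMemcpyHostToDevice);  // the transfer we try to avoid
    }
    return sg.d_edges;
}

__global__ void processSubgraph(const int *edges, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* relax edges[i], etc. */ }
}

int main() {
    std::vector<Subgraph> parts(4);
    for (auto &p : parts) p.edges.assign(1 << 20, 1);

    for (int iter = 0; iter < 10; ++iter) {      // iterative algorithm
        for (auto &p : parts) {                  // schedule subgraphs
            int n = (int)p.edges.size();
            int *d = ensureResident(p);          // reuse cached copy if present
            processSubgraph<<<(n + 255) / 256, 256>>>(d, n);
        }
    }
    cudaDeviceSynchronize();
    for (auto &p : parts) cudaFree(p.d_edges);
    return 0;
}
```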

Analysis on the Active/Inactive Status of Computational Resources for Improving the Performance of the GPU (GPU 성능 저하 해결을 위한 내부 자원 활용/비활용 상태 분석)

  • Choi, Hongjun; Son, Dongoh; Kim, Jongmyon; Kim, Cheolhong
    • The Journal of the Korea Contents Association / v.15 no.7 / pp.1-11 / 2015
  • In recent high-performance computing systems, GPGPU has been widely used to process general-purpose applications as well as graphics applications, since the GPU can provide optimized computational resources for massively parallel processing. Unfortunately, GPGPU does not fully exploit the computational resources of the GPU when executing general-purpose applications, because such applications cannot always be optimized for the GPU architecture. We therefore provide a GPU research guideline for improving the performance of computing systems that use GPGPU. To accomplish this, we analyze the factors that degrade GPU performance. To classify their causes clearly, GPU core status is divided into five states: fully active, partially active, idle, memory stall, and GPU core stall. All states except fully active cause performance degradation. We evaluate the ratio of each GPU core state depending on the characteristics of the benchmarks in order to find the specific reasons that degrade GPU performance. According to our simulation results, the partially active, idle, memory stall, and GPU core stall states are induced by computational resource underutilization, low parallelism, high memory request rates, and structural hazards, respectively.
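
As a rough, assumed illustration of how some of these states arise in practice (this is not the paper's simulation setup), the kernels below tend to induce the memory stall, idle, and fully active states respectively.

```cuda
// Illustrative kernels and launches (assumptions, not the paper's
// benchmarks) that tend to induce three of the five states defined above.
#include <cuda_runtime.h>

__global__ void memoryBound(const float *in, float *out, int n) {
    // Global-memory traffic dominates: cores spend most cycles waiting
    // on memory -> "memory stall" state.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

__global__ void computeBound(float *out, int n) {
    // Long arithmetic chains keep the ALUs busy -> approaches the
    // "fully active" state when enough threads are launched.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (float)i;
        for (int k = 0; k < 4096; ++k) v = v * 1.000001f + 0.5f;
        out[i] = v;
    }
}

int main() {
    int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    memoryBound<<<(n + 255) / 256, 256>>>(in, out, n);

    // Low parallelism: a single small block leaves most SMs without
    // work, which corresponds to the "idle" state above.
    computeBound<<<1, 32>>>(out, 32);

    // Ample parallelism: the same kernel launched wide approaches the
    // "fully active" state.
    computeBound<<<(n + 255) / 256, 256>>>(out, n);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```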

Optimization of Color Format Conversion of WebCam Images Using the CUDA (CUDA를 이용한 웹캠 영상의 색상 형식 변환 최적화)

  • Kim, Jin-Woo; Jung, Yun-Hye; Park, Jin-Hong; Park, Yong-Jin; Han, Tack-Don
    • Journal of Korea Game Society / v.11 no.1 / pp.147-157 / 2011
  • Webcams do not memory-align image data, in order to reduce transmission time. Memory-unaligned image data is unsuitable for processing on the GPU, so we convert it to a color format suitable for optimized high-speed image processing. In this paper, we propose a technique that accelerates webcam color format conversion using NVIDIA CUDA. We propose optimizations of memory access and thread composition, and we evaluate memory and computing performance to verify the performance of the proposed scheme and its degree of optimization on a low-performance GPU. With the proposed optimizations, we achieve performance improvements of up to 68 percent.
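
The abstract does not specify the exact kernel; the sketch below shows one plausible form of such a conversion under our own assumptions: a YUYV (YUY2) webcam frame is converted to RGBA, with each thread reading one 4-byte macropixel as a uchar4 so that global-memory accesses stay aligned and coalesced. The YUYV input format and the BT.601-style coefficients are assumptions, not details from the paper.

```cuda
// A minimal color-conversion sketch (our assumptions, not the paper's
// exact kernel): each thread reads one 4-byte YUYV macropixel as a
// uchar4 (coalesced) and writes two RGBA pixels.
#include <cuda_runtime.h>

__device__ uchar4 yuvToRgba(float y, float u, float v) {
    // BT.601-style conversion; the coefficients are an assumption.
    float r = y + 1.402f * (v - 128.0f);
    float g = y - 0.344f * (u - 128.0f) - 0.714f * (v - 128.0f);
    float b = y + 1.772f * (u - 128.0f);
    return make_uchar4((unsigned char)fminf(fmaxf(r, 0.f), 255.f),
                       (unsigned char)fminf(fmaxf(g, 0.f), 255.f),
                       (unsigned char)fminf(fmaxf(b, 0.f), 255.f), 255);
}

__global__ void yuyvToRgba(const uchar4 *yuyv, uchar4 *rgba, int nPairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPairs) return;
    uchar4 p = yuyv[i];          // one aligned 4-byte read: Y0 U Y1 V
    rgba[2 * i]     = yuvToRgba(p.x, p.y, p.w);   // pixel 0 (Y0, U, V)
    rgba[2 * i + 1] = yuvToRgba(p.z, p.y, p.w);   // pixel 1 (Y1, U, V)
}
```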

Hardware-based Level Set Method for Fast Lung Segmentation and Visualization (빠른 폐 분할과 가시화를 위한 그래픽 하드웨어 기반 레벨-셋 방법)

  • Park Seong-Jin; Hong He-Len; Shin Yeong-Gil
    • Proceedings of the Korean Information Science Society Conference / 2006.06b / pp.268-270 / 2006
  • In this paper, we propose a level-set method that uses graphics hardware to rapidly segment objects in 3D volume images while interactively visualizing the segmentation process. First, we propose a memory management method for efficient computation inside the GPU: data is packed to match the GPU texture memory format, and the CPU main memory and GPU texture memory are managed together. Second, the level-set update inside the GPU is divided into nine cases, which increases computational efficiency. Third, we propose a fast graphics-hardware-based visualization method to interactively monitor the evolving front and to effectively assess the segmentation as parameters change. To evaluate the proposed method, we perform visual assessment on 3D lung CT image data and compare execution time against an existing software-based level-set method. The proposed method segments and simultaneously visualizes images faster than the software-based level-set method, making it well suited to data-intensive medical applications.
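
The paper's texture-memory packing and nine-case update are not reproduced here; as a generic point of reference under our own simplifying assumptions, the following kernel sketches one plain finite-difference level-set update step on a 2D slice.

```cuda
// A plain finite-difference sketch of one level-set update step
// (illustrative only; the paper packs data into texture memory and
// splits the update into nine cases, which is not shown here).
__global__ void levelSetStep(const float *phi, const float *speed,
                             float *phiNext, int w, int h, float dt) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;
    int i = y * w + x;
    // Central differences approximate |grad(phi)|.
    float dx = 0.5f * (phi[i + 1] - phi[i - 1]);
    float dy = 0.5f * (phi[i + w] - phi[i - w]);
    float grad = sqrtf(dx * dx + dy * dy);
    // Evolve the front along its normal with a per-voxel speed term.
    phiNext[i] = phi[i] - dt * speed[i] * grad;
}
```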


Device Virtualization Frameworks for Accelerating GPU Performance on Virtual Environments (가상화 환경에서 GPU 성능의 향상을 위한 장치 가상화 프레임워크)

  • Joo, Younghyun; Lee, Dongwoo; Eom, Young Ik
    • Proceedings of the Korea Information Processing Society Conference / 2013.05a / pp.86-87 / 2013
  • Thanks to recent interest and research in virtualization, virtual machines now provide processor and memory resources at near-native performance. However, device virtualization for graphics hardware such as the GPU remains less studied than other virtualization techniques and is an obstacle to image processing in virtualized environments. Image processing in virtualized environments relies on the conventional X Window System, which is optimized for 2D images: it shows limited performance on 3D images, and redundant memory copies within the virtual machine degrade performance further. The proposed device virtualization framework improves performance by eliminating these redundant memory copies. In this paper, we present this framework for improving GPU performance in virtualized environments and demonstrate its validity through evaluation.

An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU (Coloring이 적용된 Gauss-Seidel 해법을 통한 CPU와 GPU의 연산 효율에 관한 연구)

  • Yoon, Jong Seon; Jeon, Byoung Jin; Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.41 no.2 / pp.117-124 / 2017
  • The performance of the colored Gauss-Seidel solver on the CPU and GPU was investigated for two- and three-dimensional heat conduction problems using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU yielded good performance for small problems but deteriorated for large problems, once the total memory required for the computation exceeded the cache memory. In contrast, the GPU performed better as the mesh size increased because of latency hiding. Further, GPU computation with the colored Gauss-Seidel solver was approximately 7 times faster than that of a single CPU. Furthermore, on the GPU the colored Gauss-Seidel solver was found to be approximately twice as fast as the Jacobi solver.
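
As an illustration of the coloring idea the paper builds on (a standard red-black scheme, not necessarily the authors' exact discretization), the kernel below updates one color of a 2D heat-conduction grid per sweep; unknowns of the same color are mutually independent, so each sweep parallelizes safely on the GPU.

```cuda
// A minimal red-black (two-color) Gauss-Seidel sweep for steady 2D heat
// conduction on a uniform grid. Grid setup, boundary conditions, and the
// convergence check are omitted; this is a sketch, not the paper's code.
__global__ void coloredGaussSeidel(float *T, int nx, int ny, int color) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;   // update only one color per sweep
    int k = j * nx + i;
    // 5-point Laplacian stencil: same-color cells read only the other
    // color, so the in-place update is race-free.
    T[k] = 0.25f * (T[k - 1] + T[k + 1] + T[k - nx] + T[k + nx]);
}

// Host side: alternate the two colors until convergence.
// for (int it = 0; it < maxIter; ++it) {
//     coloredGaussSeidel<<<grid, block>>>(d_T, nx, ny, 0);  // "red"
//     coloredGaussSeidel<<<grid, block>>>(d_T, nx, ny, 1);  // "black"
// }
```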

A design of GPU container co-execution framework measuring interference among applications (GPU 컨테이너 동시 실행에 따른 응용의 간섭 측정 프레임워크 설계)

  • Kim, Sejin; Kim, Yoonhee
    • KNOM Review / v.23 no.1 / pp.43-50 / 2020
  • As the General-Purpose Graphics Processing Unit (GPGPU) has recently come to play an essential role in high-performance computing, several cloud service providers offer GPU services. Most cluster orchestration platforms in container-based cloud environments allocate an integer number of GPUs to each job and do not allow a GPU to be shared between jobs. In this case, resource utilization of a GPU node may be low if a job does not intensively require either many cores or a large amount of GPU memory. GPU virtualization brings opportunities to realize kernel concurrency and resource sharing. However, performance may vary depending on the characteristics of the applications running concurrently and the interference among them due to resource contention on a node. This paper proposes a GPU container co-execution framework, based on the Kubernetes container orchestration platform, that creates and executes multiple servers in order to measure the interference that can occur when GPU resources are shared. Performance changes under different scheduling policies were investigated by executing several jobs on the GPU. The results show that optimal scheduling is not possible when only GPU memory and compute resource usage are considered. The interference caused by co-execution among applications is measured using the framework.
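
The paper's framework operates at the Kubernetes container level; as a much smaller-scale analogue under our own assumptions, the sketch below times a kernel alone and co-executed with a contender in a second CUDA stream, showing the kind of runtime change that resource contention produces. All names and sizes are invented for the example.

```cuda
// A crude single-process sketch of co-execution interference (the paper's
// framework runs separate containers under Kubernetes; here two kernels
// merely share one GPU through CUDA streams).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = (float)i;
    for (int k = 0; k < iters; ++k) v = v * 1.000001f + 0.5f;
    out[i] = v;
}

// Times the elapsed wall time of one launch set, with or without a
// contender kernel sharing the GPU in a second stream.
static float timeLaunch(float *a, float *b, int n, bool coexec) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    busy<<<(n + 255) / 256, 256, 0, s1>>>(a, n, 2000);
    if (coexec)
        busy<<<(n + 255) / 256, 256, 0, s2>>>(b, n, 2000);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    int n = 1 << 22;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    printf("alone:       %.2f ms\n", timeLaunch(a, b, n, false));
    printf("co-executed: %.2f ms\n", timeLaunch(a, b, n, true));
    cudaFree(a); cudaFree(b);
    return 0;
}
```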

An Optimization Method for Hologram Generation on Multiple GPU-based Parallel Processing (다중 GPU기반 홀로그램 생성을 위한 병렬처리 성능 최적화 기법)

  • Kook, Joongjin
    • Smart Media Journal / v.8 no.2 / pp.9-15 / 2019
  • Since the computational cost of hologram generation grows rapidly with the size of the point cloud, parallel processing with the CUDA and/or OpenCL libraries on multiple GPUs has recently become popular. A CUDA kernel must be organized into threads, blocks, and grids appropriately, according to the number of cores and the memory size of the GPU. In multi-GPU environments, the work must additionally be distributed grid-by-grid, block-by-block, or thread-by-thread according to the number of GPUs. To evaluate the performance of CGH (Computer Generated Hologram) generation, we compared computation speed on a CPU, a single GPU, and multiple GPUs while gradually increasing the number of points in the point cloud from 10 to 1,000,000. We also present a memory structure design and a calculation method for CUDA-based parallel processing that accelerate CGH generation in multi-GPU environments.
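
No implementation details are quoted here; the sketch below shows one plausible multi-GPU distribution under our own assumptions: the point cloud is split evenly across devices, each GPU accumulates a partial hologram plane for its share of the points, and the host sums the partial planes. The point-wave accumulation and all names are illustrative, not the paper's optimized memory layout.

```cuda
// A simplified multi-GPU CGH distribution sketch (illustrative
// assumptions, not the paper's method).
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

struct Point { float x, y, z, amp; };

__global__ void accumulateCGH(const Point *pts, int nPts,
                              float *plane, int w, int h, float k) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;
    float sum = 0.f;
    for (int p = 0; p < nPts; ++p) {       // every point lights every pixel
        float dx = px - pts[p].x, dy = py - pts[p].y;
        float r = sqrtf(dx * dx + dy * dy + pts[p].z * pts[p].z);
        sum += pts[p].amp * cosf(k * r);   // real part of the point wave
    }
    plane[py * w + px] += sum;
}

void generateCGH(const std::vector<Point> &cloud, std::vector<float> &out,
                 int w, int h, float k) {
    int nGpu = 0;
    cudaGetDeviceCount(&nGpu);
    if (nGpu == 0) return;
    int chunk = ((int)cloud.size() + nGpu - 1) / nGpu;
    std::vector<float> partial(w * h);
    out.assign(w * h, 0.f);

    for (int g = 0; g < nGpu; ++g) {       // a thread per GPU would overlap this
        cudaSetDevice(g);
        int begin = g * chunk;
        int n = std::min(chunk, (int)cloud.size() - begin);
        if (n <= 0) break;
        Point *dPts; float *dPlane;
        cudaMalloc(&dPts, n * sizeof(Point));
        cudaMalloc(&dPlane, w * h * sizeof(float));
        cudaMemset(dPlane, 0, w * h * sizeof(float));
        cudaMemcpy(dPts, cloud.data() + begin, n * sizeof(Point),
                   cudaMemcpyHostToDevice);
        dim3 blk(16, 16), grd((w + 15) / 16, (h + 15) / 16);
        accumulateCGH<<<grd, blk>>>(dPts, n, dPlane, w, h, k);
        cudaMemcpy(partial.data(), dPlane, w * h * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (int i = 0; i < w * h; ++i) out[i] += partial[i];  // host-side sum
        cudaFree(dPts); cudaFree(dPlane);
    }
}
```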