http://dx.doi.org/10.3745/KTCCS.2022.11.10.333

GPU Resource Contention Management Technique for Simultaneous GPU Tasks in Container Environments that Share a GPU

Kang, Jihun (BK21 FOUR Computer Science Education Research Group, Korea University)
Publication Information
KIPS Transactions on Computer and Communication Systems, Vol.11, No.10, 2022, pp.333-344
Abstract
In a container-based cloud environment, multiple containers can share a graphics processing unit (GPU), and GPU sharing minimizes idle GPU time and improves resource utilization. However, unlike CPU or memory, a GPU cannot be logically multiplexed to give each user an isolated share of its computing resources. In addition, a container occupies GPU resources only while it is actually performing GPU operations, and because the timing and size of each container's GPU operations are not known in advance, its resource usage cannot be predicted. Since containers can use GPU resources without restriction at any point in time, and GPU tasks are handled as a black box inside the GPU, managing resource contention when multiple containers run GPU tasks simultaneously is very difficult. In this paper, we analyze the performance degradation caused by resource contention when multiple containers execute GPU tasks simultaneously, propose a container management technique that prevents it, and demonstrate the efficiency of the proposed technique through experiments.
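The abstract does not include implementation details, but the following minimal sketch illustrates one way a host-side manager could serialize concurrent GPU use across containers: it polls nvidia-smi for the PIDs of processes currently running GPU compute work, maps those PIDs to containers with docker top, and uses docker pause / docker unpause so that only one container at a time issues GPU work. The helper names (gpu_pids, container_pids, arbitrate), the FIFO policy, and the one-second polling interval are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (assumption, not the paper's implementation): serialize GPU
# use across containers by pausing each container that starts GPU work while
# another container already owns the GPU, and resuming them one at a time.
import subprocess
import time

def gpu_pids():
    # PIDs of processes currently running compute work on the GPU (via nvidia-smi).
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return {int(p) for p in out.split() if p.isdigit()}

def container_pids(container_id):
    # Host PIDs belonging to a container, listed by `docker top`.
    out = subprocess.run(["docker", "top", container_id, "-eo", "pid"],
                         capture_output=True, text=True, check=True).stdout
    return {int(p) for p in out.split() if p.isdigit()}  # drops the 'PID' header

def running_containers():
    # IDs of all containers currently up on the host.
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True).stdout
    return out.split()

def arbitrate(poll_interval=1.0):
    # Naive FIFO arbiter (assumption): one container uses the GPU at a time,
    # other containers that attempt GPU work are paused until their turn.
    owner, owner_seen_active, waiting = None, False, []
    while True:
        active = gpu_pids()
        if owner is not None:
            is_active = bool(container_pids(owner) & active)
            owner_seen_active = owner_seen_active or is_active
            if owner_seen_active and not is_active:
                owner = None                             # owner's GPU task has finished
        for c in running_containers():
            if c == owner or c in waiting:
                continue
            if container_pids(c) & active:
                if owner is None:
                    owner, owner_seen_active = c, True   # first GPU user runs freely
                else:
                    subprocess.run(["docker", "pause", c], check=True)
                    waiting.append(c)                    # contender is frozen until its turn
        if owner is None and waiting:
            owner, owner_seen_active = waiting.pop(0), False
            subprocess.run(["docker", "unpause", owner], check=True)
        time.sleep(poll_interval)

A real manager would also have to account for paused containers that keep holding GPU memory and for fairness across long-running tasks; those concerns are beyond this sketch.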
Keywords
HPC Cloud; Container; GPU Computing; GPU Sharing; Resource Race