Distributed In-Memory Caching Method for ML Workload in Kubernetes

  • Donghyun Yoon (Dept. of Computer Engineering, Korea National University of Transportation)
  • Seokil Song (Dept. of Computer Engineering, Korea National University of Transportation)
  • Received : 2022.12.17
  • Accepted : 2023.08.23
  • Published : 2023.08.31

Abstract

In this paper, we analyze the characteristics of machine learning workloads and, based on this analysis, propose a distributed in-memory caching technique that improves their performance. The core of a machine learning workload is model training, which is a computationally intensive task. Running machine learning workloads in a Kubernetes-based cloud environment that separates the computing framework from storage allows resources to be allocated effectively, but I/O must then travel over the network, which can introduce delays. We propose a distributed in-memory caching technique to improve the performance of machine learning workloads executed in such an environment. In particular, we propose a new method that precaches the data required by a machine learning workload into the distributed in-memory cache, designed around Kubeflow Pipelines, a Kubernetes-based machine learning pipeline management tool.

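To make the precaching idea concrete, the sketch below is our illustration, not the paper's implementation: a Kubeflow Pipelines definition in which a hypothetical `precache_dataset` component loads raw training files into an Apache Ignite distributed in-memory cache before the training step runs. The component names, the `ignite-service` address, the cache layout, and the volume-mounted `dataset_dir` are all assumptions.

```python
# Sketch only: a KFP v2 pipeline that precaches training data into a
# distributed in-memory cache (Apache Ignite assumed) before training.
from kfp import dsl


@dsl.component(packages_to_install=["pyignite"])
def precache_dataset(dataset_dir: str, cache_name: str) -> int:
    """Read raw training files and put them into a shared Ignite cache."""
    import os
    from pyignite import Client  # Apache Ignite thin client

    client = Client()
    # "ignite-service" is a hypothetical in-cluster Service name;
    # 10800 is Ignite's default thin-client port.
    client.connect("ignite-service", 10800)
    cache = client.get_or_create_cache(cache_name)

    # dataset_dir is assumed to be a mounted volume holding the raw files.
    count = 0
    for name in os.listdir(dataset_dir):
        with open(os.path.join(dataset_dir, name), "rb") as f:
            cache.put(name, f.read())  # key: file name, value: raw bytes
        count += 1
    client.close()
    return count


@dsl.component
def train_model(cache_name: str, num_files: int):
    """Placeholder training step; a real one would read from the cache."""
    print(f"training on {num_files} cached files from '{cache_name}'")


@dsl.pipeline(name="precache-then-train")
def pipeline(dataset_dir: str = "/data/train", cache_name: str = "train-set"):
    pre = precache_dataset(dataset_dir=dataset_dir, cache_name=cache_name)
    # Training consumes the precache step's output, so Kubeflow schedules
    # it only after precaching completes and reads can be served from memory.
    train_model(cache_name=cache_name, num_files=pre.output)
```

The key point of the sketch is the explicit data dependency between the two tasks: because `train_model` takes the precache step's output as an input, Kubeflow runs precaching first, and the training step's reads hit the in-memory cache instead of crossing the network to remote storage.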

Acknowledgement

This work was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 22AMDP-C161962-02).
