• Title/Summary/Keyword: Kubeflow


Comparative Analysis of Computation Times Based on the Number of Containers for CPU-Intensive Tasks in the Kubeflow Environment (Kubeflow 환경에서 CPU 집약적인 작업을 위한 컨테이너 수에 따른 연산 시간 비교 및 분석)

  • HyunSeung Jung;Taeshin Kang;Heonchang Yu;Jihun Kang
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.93-96 / 2023
  • As the demand for machine learning has grown, so has the demand for deploying machine learning workflows. Kubeflow makes machine learning deployment convenient, and Kubeflow Pipelines allows a single task to be distributed across multiple containers for computation. However, increasing the number of containers does not necessarily improve performance. Therefore, in this study, to analyze the causes that limit performance gains, we distributed a CPU-intensive task across multiple containers in Kubeflow and executed it. Comparing and analyzing the computation completion times by number of containers, we observed that computation speed improves as the number of containers increases, but past a certain point the speedup tapers off again. We analyzed the main cause to be that, due to resource limits, not all containers could be scheduled at the same time.
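
The setup described in the abstract maps naturally onto the fan-out construct in the Kubeflow Pipelines SDK. The following is a minimal sketch (kfp v2 syntax, not the paper's code) of splitting a CPU-intensive task into chunks that each run in their own container; the chunk count, the placeholder workload, and the 1-CPU limit per container are illustrative assumptions.

```python
# Minimal sketch (kfp v2 SDK), not the paper's code: fan a CPU-bound task out
# over several containers with ParallelFor.
from kfp import dsl


@dsl.component(base_image="python:3.11")
def cpu_intensive_chunk(chunk_id: int, iterations: int) -> float:
    """Run one CPU-bound chunk of the overall task and return its elapsed time."""
    import time
    start = time.time()
    total = 0
    for i in range(iterations):
        total += i * i  # placeholder CPU-bound work
    return time.time() - start


@dsl.pipeline(name="cpu-scaling-experiment")
def cpu_scaling_pipeline(iterations_per_chunk: int = 50_000_000):
    # Each loop item becomes its own container. If the cluster cannot schedule
    # all containers at once, some chunks must wait for resources, which matches
    # the cause of the diminishing speedup analyzed in the paper.
    with dsl.ParallelFor(items=[0, 1, 2, 3, 4, 5, 6, 7]) as chunk_id:
        task = cpu_intensive_chunk(chunk_id=chunk_id, iterations=iterations_per_chunk)
        task.set_cpu_limit("1")  # pin each container to one CPU core
```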

Dynamic Resource Adjustment Operator Based on Autoscaling for Improving Distributed Training Job Performance on Kubernetes (쿠버네티스에서 분산 학습 작업 성능 향상을 위한 오토스케일링 기반 동적 자원 조정 오퍼레이터)

  • Jeong, Jinwon;Yu, Heonchang
    • KIPS Transactions on Computer and Communication Systems / v.11 no.7 / pp.205-216 / 2022
  • One of the many tools used for distributed deep learning training is Kubeflow, which runs on Kubernetes, a container orchestration tool. TensorFlow jobs can be managed using the existing operator provided by Kubeflow. However, for distributed deep learning training jobs based on the parameter server architecture, the scheduling policy used by the existing operator does not consider the task affinity of the distributed training job and does not provide the ability to dynamically allocate or release resources. This can lead to long job completion times and low resource utilization. Therefore, in this paper we propose a new operator that efficiently schedules distributed deep learning training jobs to minimize job completion time and increase resource utilization. We implemented the new operator by modifying the existing operator and conducted experiments to evaluate its performance. The experimental results showed that our scheduling policy reduced the average job completion time by up to 84% and increased the average CPU utilization by up to 92%.
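
The dynamic allocation and release of resources that the proposed operator adds can be pictured as patching a TFJob's replica counts through the Kubernetes API. The sketch below is not the paper's operator; the namespace, job name, and scaling decision are illustrative assumptions.

```python
# Minimal sketch, not the proposed operator: adjust the Worker replica count of a
# Kubeflow TFJob by patching its custom resource through the Kubernetes API.
from kubernetes import client, config


def scale_tfjob_workers(name: str, namespace: str, replicas: int) -> None:
    """Patch a TFJob so its Worker replica count reflects a new scheduling decision."""
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    api = client.CustomObjectsApi()
    patch = {"spec": {"tfReplicaSpecs": {"Worker": {"replicas": replicas}}}}
    api.patch_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace=namespace,
        plural="tfjobs",
        name=name,
        body=patch,
    )


if __name__ == "__main__":
    # Hypothetical example: grow a parameter-server training job to 4 workers
    # when idle CPU capacity becomes available.
    scale_tfjob_workers(name="mnist-ps-training", namespace="kubeflow", replicas=4)
```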

A Study on Email Security through Proactive Detection and Prevention of Malware Email Attacks (악성 이메일 공격의 사전 탐지 및 차단을 통한 이메일 보안에 관한 연구)

  • Yoo, Ji-Hyun
    • Journal of IKEEE / v.25 no.4 / pp.672-678 / 2021
  • New malware continues to grow in volume and sophistication every year. Although various studies have examined executable files to diagnose malicious code, it is difficult to detect attacks that embed malicious code threats in emails by exploiting non-executable document files, malicious URLs, and malicious macros and JavaScript within documents. In this paper, we introduce a method of analyzing malicious code for email security through proactive detection and blocking of malicious email attacks, and propose an AI-based method for determining whether a non-executable document file is malicious. Among various algorithms, an efficient machine learning model is chosen, and an ML workflow system that uses Kubeflow to diagnose malicious code is proposed.
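
As a rough illustration of the AI-based maliciousness decision for non-executable documents, the sketch below trains a classifier on synthetic document features; the feature set, the random-forest choice, and the data are assumptions for illustration, not the paper's actual modeling.

```python
# Minimal sketch, assuming a tabular feature representation of documents
# (e.g. macro count, embedded JS size, URL count, object stream count).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.random((500, 4))                       # synthetic feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # synthetic labels: 1 = malicious, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```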

Distributed In-Memory Caching Method for ML Workload in Kubernetes (쿠버네티스에서 ML 워크로드를 위한 분산 인-메모리 캐싱 방법)

  • Dong-Hyeon Youn;Seokil Song
    • Journal of Platform Technology / v.11 no.4 / pp.71-79 / 2023
  • In this paper, we analyze the characteristics of machine learning workloads and, based on them, propose a distributed in-memory caching technique to improve their performance. The core of a machine learning workload is model training, which is a computationally intensive task. Performing machine learning workloads in a Kubernetes-based cloud environment in which the computing framework and storage are separated allows resources to be allocated effectively, but delays can occur because I/O must be performed over the network. In this paper, we propose a distributed in-memory caching technique to improve the performance of machine learning workloads performed in such an environment. In particular, we propose a new method of precaching the data required for machine learning workloads into the distributed in-memory cache by taking Kubeflow Pipelines, a Kubernetes-based machine learning pipeline management tool, into account.
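
A rough sketch of the precaching idea, under the assumption that the distributed in-memory cache is reachable as a Redis service and that the dataset is mounted at a local path: a warm-up component fills the cache before the training component runs, so training reads from memory instead of going through remote storage I/O. This is not the paper's system; the cache host, dataset path, and key scheme are illustrative.

```python
# Minimal sketch (kfp v2 SDK): warm a distributed in-memory cache (Redis assumed)
# in a dedicated pipeline step before the training step starts.
from kfp import dsl


@dsl.component(base_image="python:3.11", packages_to_install=["redis"])
def precache_dataset(cache_host: str, dataset_dir: str) -> int:
    """Read training files from storage and push them into the cache."""
    import os
    import redis
    r = redis.Redis(host=cache_host, port=6379)
    cached = 0
    for fname in os.listdir(dataset_dir):
        with open(os.path.join(dataset_dir, fname), "rb") as f:
            r.set(f"ds:{fname}", f.read())
        cached += 1
    return cached


@dsl.component(base_image="python:3.11", packages_to_install=["redis"])
def train_model(cache_host: str):
    """Training step that reads its input from the cache instead of remote storage."""
    import redis
    r = redis.Redis(host=cache_host, port=6379)
    for key in r.scan_iter("ds:*"):
        _ = r.get(key)  # placeholder: feed cached bytes into the actual training loop


@dsl.pipeline(name="precache-then-train")
def precache_pipeline(cache_host: str = "redis-cache", dataset_dir: str = "/mnt/dataset"):
    warm = precache_dataset(cache_host=cache_host, dataset_dir=dataset_dir)
    train = train_model(cache_host=cache_host)
    train.after(warm)  # training starts only after the cache has been warmed
```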
