Study on Accelerating Distributed ML Training in Orchestration

  • Su-Yeon Kim (Graduate School of Smart Convergence, Kwangwoon University)
  • Seok-Jae Moon (Graduate School of Smart Convergence, Kwangwoon University)
  • Received : 2024.07.18
  • Accepted : 2024.07.29
  • Published : 2024.09.30

Abstract

As the size of the data and models used in machine learning training continues to grow, training on a single server is becoming increasingly difficult. Consequently, distributed machine learning, which spreads the computational load across multiple machines, is growing in importance. However, several issues affecting the performance of distributed machine learning remain unresolved, including communication overhead, inter-node synchronization, data imbalance and bias, and resource management and scheduling. In this paper, we propose ParamHub, a system that uses orchestration to accelerate training. After the first iteration, the system monitors the performance of each node and reallocates resources to slow nodes, thereby speeding up training. This ensures that resources are directed to the nodes that need them, maximizing overall resource utilization and allowing all nodes to progress at a uniform pace, which shortens total training time. Furthermore, the approach improves the system's scalability and flexibility, making it applicable to clusters of various sizes.
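
The abstract does not include implementation details, so the following is a minimal, hypothetical Python sketch of the core idea it describes: measure each worker's first-iteration time, identify stragglers, and propose extra resources for them so the orchestrator can rebalance the cluster. All names here (NodeStats, find_stragglers, plan_reallocation, the 1.2x straggler tolerance) are illustrative assumptions, not the authors' ParamHub code.

```python
# Illustrative sketch only: straggler detection after the first iteration and a
# proposed resource reallocation plan. Names and thresholds are assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class NodeStats:
    node_id: str
    iteration_time_s: float  # wall-clock time of the first training iteration
    cpu_limit: float         # CPU cores currently allocated to this worker


def find_stragglers(stats: List[NodeStats], tolerance: float = 1.2) -> List[NodeStats]:
    """Return nodes whose first-iteration time exceeds `tolerance` x the cluster mean."""
    avg = mean(s.iteration_time_s for s in stats)
    return [s for s in stats if s.iteration_time_s > tolerance * avg]


def plan_reallocation(stats: List[NodeStats], extra_cpu: float = 1.0) -> Dict[str, float]:
    """Propose new CPU limits: each detected straggler gets `extra_cpu` more cores."""
    plan = {s.node_id: s.cpu_limit for s in stats}
    for s in find_stragglers(stats):
        plan[s.node_id] = s.cpu_limit + extra_cpu
    return plan


if __name__ == "__main__":
    first_iteration = [
        NodeStats("worker-0", 12.1, 4.0),
        NodeStats("worker-1", 11.8, 4.0),
        NodeStats("worker-2", 19.5, 4.0),  # slow node: well above the mean
    ]
    print(plan_reallocation(first_iteration))
    # e.g. {'worker-0': 4.0, 'worker-1': 4.0, 'worker-2': 5.0}
```

In an actual orchestrated cluster, the proposed limits would then be applied by the orchestrator (for example, by updating container resource requests), a step this sketch deliberately omits.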

Keywords

Acknowledgement

This paper was supported by the Kwangwoon University Research Grant of 2024.
