Study on Accelerating Distributed ML Training in Orchestration

  • Su-Yeon Kim (Graduate School of Smart Convergence, Kwangwoon University)
  • Seok-Jae Moon (Graduate School of Smart Convergence, Kwangwoon University)
  • Received : 2024.07.18
  • Accepted : 2024.07.29
  • Published : 2024.09.30

Abstract

As the size of the data and models used in machine learning training continues to grow, training on a single server is becoming increasingly difficult. Consequently, distributed machine learning, which spreads the computational load across multiple machines, is growing in importance. However, several issues affecting the performance of distributed machine learning remain unresolved, including communication overhead, inter-node synchronization, data imbalance and bias, and resource management and scheduling. In this paper, we propose ParamHub, a system that uses orchestration to accelerate training. After the first iteration, the system monitors the performance of each node and reallocates resources to slow nodes, thereby speeding up training. This ensures that resources are directed to the nodes that need them, maximizing overall resource utilization and allowing all nodes to progress at a uniform pace, which shortens total training time. Furthermore, the approach improves the system's scalability and flexibility, making it applicable to clusters of various sizes.
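
The abstract does not include implementation details, so the following is a minimal, hypothetical Python sketch of the core idea it describes: measure each worker's first-iteration time, identify stragglers, and propose extra resources for them so the orchestrator can rebalance the cluster. All names here (NodeStats, find_stragglers, plan_reallocation, the 1.2x straggler tolerance) are illustrative assumptions, not the authors' ParamHub code.

```python
# Illustrative sketch only: straggler detection after the first iteration and a
# proposed resource reallocation plan. Names and thresholds are assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class NodeStats:
    node_id: str
    iteration_time_s: float  # wall-clock time of the first training iteration
    cpu_limit: float         # CPU cores currently allocated to this worker


def find_stragglers(stats: List[NodeStats], tolerance: float = 1.2) -> List[NodeStats]:
    """Return nodes whose first-iteration time exceeds `tolerance` x the cluster mean."""
    avg = mean(s.iteration_time_s for s in stats)
    return [s for s in stats if s.iteration_time_s > tolerance * avg]


def plan_reallocation(stats: List[NodeStats], extra_cpu: float = 1.0) -> Dict[str, float]:
    """Propose new CPU limits: each detected straggler gets `extra_cpu` more cores."""
    plan = {s.node_id: s.cpu_limit for s in stats}
    for s in find_stragglers(stats):
        plan[s.node_id] = s.cpu_limit + extra_cpu
    return plan


if __name__ == "__main__":
    first_iteration = [
        NodeStats("worker-0", 12.1, 4.0),
        NodeStats("worker-1", 11.8, 4.0),
        NodeStats("worker-2", 19.5, 4.0),  # slow node: well above the mean
    ]
    print(plan_reallocation(first_iteration))
    # e.g. {'worker-0': 4.0, 'worker-1': 4.0, 'worker-2': 5.0}
```

In an actual orchestrated cluster, the proposed limits would then be applied by the orchestrator (for example, by updating container resource requests), a step this sketch deliberately omits.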

Keywords

Acknowledgement

This paper was supported by the Kwangwoon University Research Grant of 2024.
