Acknowledgement
This work was supported by the KwangWoon University Research Grant of 2024.