Design of a ParamHub for Machine Learning in a Distributed Cloud Environment

  • Su-Yeon Kim (Graduate School of Smart Convergence, Kwangwoon University)
  • Seok-Jae Moon (Graduate School of Smart Convergence, Kwangwoon University)
  • Received : 2024.04.01
  • Accepted : 2024.04.20
  • Published : 2024.05.31

Abstract

As big data models grow in size, distributed training is becoming essential for large-scale machine learning tasks. In this paper, we propose ParamHub, an agent-based hub for distributed data training. During training, the agent uses the provided data to adjust the conditions of the model, such as its structure, learning algorithm, hyperparameters, and bias, in order to minimize the error between the model's predictions and the actual values. It also operates autonomously, collecting and updating data in a distributed environment, which reduces the load-balancing burden that arises in a centralized system. Moreover, communication between agents coordinates resource management and the learning process, enabling efficient management of distributed data and resources. This approach improves the scalability and stability of distributed machine learning systems while remaining flexible enough to be applied in a variety of learning environments.
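
The paper describes ParamHub at the architectural level, so the following is a minimal sketch, under assumptions of our own, of how a hub-style parameter exchange might look in Python: each worker pulls a snapshot of the shared parameters, computes a gradient on its own data shard, and pushes it back to the hub, which averages the pending gradients and applies a single update. The names ParamHub, pull, push, apply, and worker_gradient, the averaged-gradient SGD step, and the linear-regression toy data are illustrative, not the authors' implementation.

# Minimal sketch of a hub-style parameter exchange (illustrative, not the
# authors' implementation): workers pull parameters, push gradients, and
# the hub averages the pending gradients and applies one SGD step.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class ParamHub:
    """Holds the shared parameters and applies aggregated updates."""
    params: Dict[str, np.ndarray]
    lr: float = 0.1
    _pending: List[Dict[str, np.ndarray]] = field(default_factory=list)

    def pull(self) -> Dict[str, np.ndarray]:
        # Workers fetch a copy of the current parameter snapshot.
        return {k: v.copy() for k, v in self.params.items()}

    def push(self, grads: Dict[str, np.ndarray]) -> None:
        # Workers submit locally computed gradients.
        self._pending.append(grads)

    def apply(self) -> None:
        # Average the pending gradients and take one gradient-descent step.
        if not self._pending:
            return
        for k in self.params:
            avg = np.mean([g[k] for g in self._pending], axis=0)
            self.params[k] -= self.lr * avg
        self._pending.clear()

def worker_gradient(params, x, y):
    # Squared-error gradient (up to a constant factor) for y ~ x.w + b.
    err = x @ params["w"] + params["b"] - y
    return {"w": x.T @ err / len(y), "b": np.array([err.mean()])}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hub = ParamHub(params={"w": np.zeros(3), "b": np.zeros(1)})
    true_w = np.array([1.0, -2.0, 0.5])
    # Two "workers", each holding its own data shard.
    shards = [(x, x @ true_w + 3.0) for x in
              (rng.normal(size=(64, 3)) for _ in range(2))]
    for _ in range(200):
        snapshot = hub.pull()
        for x, y in shards:                     # each worker uses its own shard
            hub.push(worker_gradient(snapshot, x, y))
        hub.apply()                             # hub aggregates and updates
    print(hub.params["w"], hub.params["b"])     # close to [1.0, -2.0, 0.5], [3.0]

In an actual distributed deployment the pull and push calls would be remote calls between agents rather than in-process method calls, but the pull/compute/push/apply cycle is the same.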

Acknowledgement

This paper was supported by the Kwangwoon University Research Grant of 2024.
