Acknowledgement
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) (IITP-2018-0-01405). This work was also supported by an IITP grant funded by the Korea government (MSIT) in 2022 (IITP-2019-0-01343).
References
- T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," ACM Computing Surveys (CSUR), Vol.52, No.4, pp.1-43, 2019.
- P. Goyal, et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
- X. Jia, et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
- C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
- S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
- C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, "Measuring the effects of data parallelism on neural network training," arXiv preprint arXiv:1811.03600, 2018.
- M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola, "Parameter server for distributed machine learning," In Big Learning NIPS Workshop, 2013.
- Kubeflow 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org.
- TensorFlow Operator 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org/docs/components/training/tftraining/#installing-tensorflow-operator.
- S. Li, et al., "PyTorch distributed: Experiences on accelerating data parallel training," arXiv preprint arXiv:2006.15704, 2020.
- H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, "GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server," In Proceedings of the Eleventh European Conference on Computer Systems, 2016.
- Kubernetes 2021, accessed 1 September 2021 [Internet], https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler.
- Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo, "Optimus: An efficient dynamic resource scheduler for deep learning clusters," In Proceedings of ACM EuroSys, 2018.
- Operator 2021, accessed 1 September 2021 [Internet], https://cloud.redhat.com/learn/topics/operators.
- J. Gu, "Tiresias: A GPU cluster manager for distributed deep learning," In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
- R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. "Multi-resource packing for cluster schedulers," In ACM SIGCOMM Computer Communication Review, Vol.44, No.4, pp.455-466, 2014. https://doi.org/10.1145/2740070.2626334
- Y. Chen, "Convolutional neural network for sentence classification," MS thesis, University of Waterloo, 2015.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Prometheus 2021, accessed 1 September 2021 [Internet], https://prometheus.io.
- Grafana 2021, accessed 1 September 2021 [Internet], https://grafana.com.
- J. Geng, D. Li, and S. Wang, "Accelerating distributed machine learning by smart parameter server," In Proceedings of the 3rd Asia-Pacific Workshop on Networking, 2019.
- E. Gebremeskel, "Analysis and comparison of distributed training techniques for deep neural networks in a dynamic environment," 2018.
- W. Xiao, "Gandiva: Introspective cluster scheduling for deep learning," In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.
- Y. Bao, Y. Peng, C. Wu, and Z. Li, "Online job scheduling in distributed machine learning clusters," In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, 2018.
- M. Khalil-Hani and S. Liew, "A-SDLM: An asynchronous stochastic learning algorithm for fast distributed learning," In 13th Australasian Symposium on Parallel and Distributed Computing, 2015.