1 |
Kubernetes 2021, accessed 1 September 2021 [Internet], https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler.
|
2 |
T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," In ACM Computing Surveys (CSUR), Vol.52, No.4, pp.1-43, 2019.
|
3 |
P. Goyal, et al., "Accurate, large minibatch sgd: Training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
|
4 |
X. Jia, et al., "Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
|
5 |
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, "Measuring the effects of data parallelism on neural network training," arXiv preprint arXiv:1811.03600, 2018.
|
6 |
TensorFlow Operator 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org/docs/components/training/tftraining/#installing-tensorflow-operator.
|
7 |
J. Gu, et al., "Tiresias: A GPU cluster manager for distributed deep learning," In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
|
8 |
R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, "Multi-resource packing for cluster schedulers," In ACM SIGCOMM Computer Communication Review, Vol.44, No.4, pp.455-466, 2014.
|
9 |
K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
|
10 |
K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
|
11 |
S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
|
12 |
C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
|
13 |
Grafana 2021, accessed 1 September 2021 [Internet], https://grafana.com.
|
14 |
J. Geng, D. Li, and S. Wang, "Accelerating distributed machine learning by smart parameter server," In Proceedings of the 3rd Asia-Pacific Workshop on Networking, 2019.
|
15 |
Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo, "Optimus: An efficient dynamic resource scheduler for deep learning clusters," In Proceedings of ACM EuroSys, 2018.
|
16 |
S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
|
17 |
M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. J. Smola, "Parameter server for distributed machine learning," In Big Learning NIPS Workshop, 2013.
|
18 |
Kubeflow 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org.
|
19 |
S. Li, et al., "Pytorch distributed: Experiences on accelerating data parallel training," arXiv preprint arXiv:2006.15704, 2020.
|
20 |
E. Gebremeskel, "Analysis and comparison of distributed training techniques for deep neural networks in a dynamic environment," 2018.
|
21 |
W. Xiao, et al., "Gandiva: Introspective cluster scheduling for deep learning," In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.
|
22 |
Operator 2021, accessed 1 September 2021 [Internet], https://cloud.redhat.com/learn/topics/operators.
|
23 |
Y. Bao, Y. Peng, C. Wu, and Z. Li, "Online job scheduling in distributed machine learning clusters," In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, 2018.
|
24 |
M. Khalil-Hani and S. Liew, "A-sdlm: an asynchronous stochastic learning algorithm for fast distributed learning," In 13th Australasian Symposium on Parallel and Distributed Computing, 2015.
|
25 |
H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, "Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server," In Proceedings of the Eleventh European Conference on Computer Systems, 2016.
|
26 |
Y. Chen, "Convolutional neural network for sentence classification," MS thesis, University of Waterloo, 2015.
|
27 |
Prometheus 2021, accessed 1 September 2021 [Internet], https://prometheus.io.
|