http://dx.doi.org/10.3837/tiis.2021.03.006

Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment  

Choi, HyeonSeong (Korea Aerospace University)
Kim, Youngrang (Korea Aerospace University)
Lee, Jaehwan (Korea Aerospace University)
Kim, Yoonhee (Sookmyung Women's University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.15, no.3, 2021, pp. 911-931
Abstract
Recently, most cloud services use Docker containers to provide their services. However, there has been little research evaluating the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the performance of the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Furthermore, we analyze two multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open-source MPI libraries, and NCCL, which is NVIDIA's collective communication library for multi-GPU settings. In the parameter server architecture, we show that using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. We also show that using NCCL in the All-reduce architecture reduces communication latency by up to 93% compared to the other libraries.
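The All-reduce architecture mentioned above averages gradients directly across GPU workers without a central parameter server. As an illustrative sketch only (the abstract does not include the authors' training scripts), the following shows how such an All-reduce setup is commonly expressed with Horovod on TensorFlow, where the collective operations can be backed by NCCL or an MPI library such as OpenMPI or MPICH; the framework choice and hyperparameters here are assumptions, not the paper's exact configuration.

    # Hedged sketch of an All-reduce training setup (Horovod + TensorFlow assumed);
    # not the authors' actual experiment code.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per worker; NCCL or MPI carries the collectives

    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        # Single-GPU-per-container policy: each worker process sees one local GPU.
        # A multi-GPU-per-container policy would expose several devices here instead.
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
    # DistributedOptimizer averages gradients with an All-reduce on every step.
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    callbacks = [
        # Start all workers from identical weights (broadcast from rank 0).
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    # model.fit(train_dataset, epochs=..., callbacks=callbacks)

Such a script would typically be launched with one process per GPU, e.g. via horovodrun or mpirun, either across several single-GPU containers or inside one container that has been granted multiple GPUs, which corresponds to the two resource allocation policies compared in the paper.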
Keywords
Docker; Collective Communication; Distributed Deep Learning; Multi-GPU