http://dx.doi.org/10.3745/KTCCS.2021.10.7.191

Hybrid All-Reduce Strategy with Layer Overlapping for Reducing Communication Overhead in Distributed Deep Learning  

Kim, Daehyun (Smart F&C Business Division, LG CNS)
Yeo, Sangho (Department of Artificial Intelligence, Ajou University)
Oh, Sangyoon (Department of Software, Ajou University)
Publication Information
KIPS Transactions on Computer and Communication Systems, Vol.10, No.7, 2021, pp.191-198
Abstract
Since training datasets have grown large and models have become deeper to achieve high accuracy, training a deep neural network requires a great deal of computation and takes too long on a single node. Distributed deep learning has therefore been proposed to reduce training time by distributing the computation across multiple nodes. In this study, we propose a hybrid allreduce strategy that accounts for the characteristics of each layer, together with a communication-computation overlapping technique, for the synchronization step of distributed deep learning. Because a convolution layer has fewer parameters than a fully-connected layer and is located in the upper part of the network, only a short overlapping time is available, so convolution layers are synchronized with butterfly allreduce. Fully-connected layers, in contrast, are synchronized with ring allreduce. Empirical experiments on PyTorch show that the proposed scheme reduces training time by up to 33% compared to the baseline PyTorch.
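The layer-wise overlap described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical example (not the authors' implementation) using PyTorch's torch.distributed API: per-parameter backward hooks launch an asynchronous all-reduce as soon as a layer's gradient is ready, so communication overlaps with the remaining backward computation. The helper names and the convolution/fully-connected routing are illustrative assumptions; torch.distributed only exposes a generic all_reduce whose underlying algorithm (e.g., ring) is chosen by the backend, so the butterfly-versus-ring distinction from the paper is indicated only in comments.

```python
# Minimal sketch (assumption: the process group is already initialized with
# dist.init_process_group). Not the authors' code; it only illustrates how
# per-layer asynchronous all-reduce can overlap with backpropagation.
import torch
import torch.nn as nn
import torch.distributed as dist


def make_grad_hook(pending):
    """Launch an async all-reduce the moment this parameter's gradient is ready."""
    def hook(grad):
        work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, grad))
        return grad
    return hook


def register_layerwise_sync(model, pending):
    # In the paper, convolution layers would use butterfly allreduce and
    # fully-connected layers ring allreduce; with torch.distributed both map
    # to the backend's generic all_reduce here, so the split is only annotated.
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            for p in module.parameters():
                if p.requires_grad:
                    p.register_hook(make_grad_hook(pending))


def wait_and_average(pending, world_size):
    # Block until all outstanding reductions finish, then average the gradients.
    for work, grad in pending:
        work.wait()
        grad.div_(world_size)
    pending.clear()


# Usage per training iteration (sketch):
#   pending = []
#   register_layerwise_sync(model, pending)   # once, before training
#   loss.backward()                           # hooks fire layer by layer
#   wait_and_average(pending, dist.get_world_size())
#   optimizer.step()
```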
Keywords
Distributed Deep Learning; Synchronization; Layer Overlapping; Allreduce