Hybrid All-Reduce Strategy with Layer Overlapping for Reducing Communication Overhead in Distributed Deep Learning

  • Received : 2020.12.28
  • Accepted : 2021.03.02
  • Published : 2021.07.31

Abstract

As training datasets grow larger and models become deeper to achieve high accuracy, deep neural network training requires a large amount of computation and takes too long on a single node. Distributed deep learning has therefore been proposed to reduce training time by spreading the computation across multiple nodes. In this study, we propose a hybrid all-reduce strategy that considers the characteristics of each layer, together with a communication/computation overlapping technique for parameter synchronization in distributed deep learning. Because a convolution layer has fewer parameters than a fully-connected layer and is located in the upper part of the model, only a short overlapping time is available; butterfly all-reduce is therefore used to synchronize convolution layers. Fully-connected layers, in contrast, are synchronized using ring all-reduce. Empirical results with the proposed scheme on PyTorch show that the proposed method reduces training time by up to 33% compared to the baseline PyTorch.
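The choice between the two collectives can be motivated with a standard alpha-beta communication cost model (a sketch for intuition, not taken from the paper; the link parameters below are hypothetical): ring all-reduce uses 2(p-1) small messages and is bandwidth-optimal, while butterfly (recursive-doubling) all-reduce uses only log2(p) steps and so has lower latency for small gradients such as those of convolution layers.

```python
import math

def ring_allreduce_cost(p, n, alpha, beta):
    # Ring all-reduce: 2(p-1) steps (reduce-scatter + all-gather),
    # each step moving a chunk of n/p elements.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n * beta

def butterfly_allreduce_cost(p, n, alpha, beta):
    # Butterfly (recursive doubling): log2(p) steps,
    # each step exchanging the full n-element vector.
    steps = math.log2(p)
    return steps * alpha + steps * n * beta

# Hypothetical link parameters: alpha = per-message latency (s),
# beta = per-element transfer time (s/element), p = number of nodes.
alpha, beta, p = 5e-6, 1e-9, 8

small = 10_000        # e.g. a convolution layer's parameter count
large = 50_000_000    # e.g. a fully-connected layer's parameter count

# Small (convolution-like) tensors favor the low-latency butterfly.
assert butterfly_allreduce_cost(p, small, alpha, beta) < ring_allreduce_cost(p, small, alpha, beta)
# Large (fully-connected-like) tensors favor the bandwidth-optimal ring.
assert ring_allreduce_cost(p, large, alpha, beta) < butterfly_allreduce_cost(p, large, alpha, beta)
```

Under this model the crossover point depends on p, alpha, and beta, which is consistent with the paper's layer-aware choice of collective rather than a single scheme for the whole model.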

Distributed deep learning requires a synchronization step for the local parameters updated on each node. For effective parameter synchronization, this study proposes an all-reduce communication/computation overlapping technique that considers the characteristics of each layer. The parameter synchronization of an upper layer can be overlapped with computation until the next propagation of the lower layers begins. In a typical deep learning model for image classification, the upper layers are convolution layers and the lower layers are fully-connected layers. Since a convolution layer has fewer parameters than a fully-connected layer and is located in the upper part of the model, its allowed overlapping time is short, so it is effective to use butterfly all-reduce, which shortens network latency. When a longer overlap is allowed, ring all-reduce, which makes better use of network bandwidth, is used instead. To verify the proposed method, we applied it to the PyTorch platform, built an experimental environment on top of it, and evaluated performance over varying batch sizes. The experiments show that the proposed scheme shortens training time by up to 33% compared to the baseline PyTorch.
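The overlapping idea described above can be illustrated with a toy scheduler (a sketch for intuition, not the paper's implementation): during backward propagation, layers finish their gradients one by one, and each layer's all-reduce may start as soon as its gradient is ready and the network link is free, hiding communication behind the remaining backward computation.

```python
def training_step_time(backward_times, comm_times, overlap=True):
    """Wall-clock time of one step under a single shared network link.

    Layers are listed in the order their gradients become ready during
    backward propagation (output side first). Times are in arbitrary units.
    """
    if not overlap:
        # Sequential baseline: all backward computation, then all communication.
        return sum(backward_times) + sum(comm_times)
    t = 0.0          # compute-side wall clock
    comm_free = 0.0  # time at which the network link becomes free
    for bwd, comm in zip(backward_times, comm_times):
        t += bwd                    # finish this layer's backward pass
        start = max(t, comm_free)   # its all-reduce starts once the link is free
        comm_free = start + comm
    # The step ends when both computation and communication have drained.
    return max(t, comm_free)

# Hypothetical two-layer model: a fully-connected layer (gradient ready
# first, heavy communication) followed by convolution layers (long
# backward, light communication).
print(training_step_time([1.0, 4.0], [3.0, 1.0], overlap=False))  # 9.0
print(training_step_time([1.0, 4.0], [3.0, 1.0], overlap=True))   # 6.0
```

In this example the fully-connected layer's 3-unit all-reduce is fully hidden behind the 4-unit convolution backward pass, shrinking the step from 9 to 6 time units; the convolution layers' own gradients, ready last, have almost no computation left to hide behind, which is why a low-latency collective suits them.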

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1D1A1B07043858) and by the ITRC (Information Technology Research Center) support program (IITP-2021-2018-0-01431) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) funded by the Ministry of Science and ICT.
