Multi-communication layered HPL model and its application to GPU clusters

  • Kim, Young Woo (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Oh, Myeong-Hoon (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Park, Chan Yeol (Center for Development of Supercomputing System, Korea Institute of Science and Technology Information)
  • Received : 2020.10.13
  • Accepted : 2021.02.25
  • Published : 2021.06.01

Abstract

High-Performance Linpack (HPL) is among the most popular benchmarks for evaluating the capabilities of computing systems and has served as a standard for comparing their performance since the early 1980s. In the initial system-design stage, it is critical to estimate a system's capabilities quickly and accurately. However, the original HPL mathematical model, which assumes a single core and a single communication layer, yields varying accuracy for modern processors and accelerators comprising large numbers of cores. To reduce the performance-estimation gap between the HPL model and an actual system, we propose a mathematical model for multi-communication layered HPL. The effectiveness of the proposed model is evaluated by applying it to a GPU cluster and to well-known large systems. The results reveal a performance difference of 1.1% on a single GPU, while the GPU cluster and the well-known large systems show average differences of 5.5% and 4.1%, respectively. Compared with the original HPL model, the proposed multi-communication layered HPL model provides performance estimates within a few seconds and with a smaller error range, from the processor/accelerator level to the large-system level.
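For context, the sketch below illustrates the classical single-communication-layer HPL estimate that the proposed model refines. The operation count 2/3·N³ + 2·N² is the standard HPL figure; the problem size, peak rate, and sustained-efficiency values are illustrative assumptions, not results from the paper.

```python
# Minimal sketch (not the paper's model) of the classical single-layer HPL
# estimate.  The flop count 2/3*N^3 + 2*N^2 is the standard HPL operation
# count; the problem size, peak rate, and efficiencies are assumptions.

def hpl_flops(n: int) -> float:
    """Approximate floating-point operations for an N x N HPL run."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def estimated_runtime_s(n: int, peak_gflops: float, efficiency: float) -> float:
    """Single-layer estimate: compute time scaled by one assumed
    sustained-to-peak efficiency that folds all communication into a single term."""
    return hpl_flops(n) / (peak_gflops * 1e9 * efficiency)


if __name__ == "__main__":
    n = 100_000            # problem size N (assumed for illustration)
    peak = 4_700.0         # GFLOPS; roughly the FP64 peak of a PCIe P100 (assumption)
    for eff in (0.6, 0.7, 0.8):   # assumed sustained efficiencies
        t = estimated_runtime_s(n, peak, eff)
        rmax_tflops = hpl_flops(n) / t / 1e12
        print(f"eff={eff:.0%}: ~{t:,.0f} s, Rmax ≈ {rmax_tflops:.2f} TFLOPS")
```

Roughly speaking, the multi-communication layered model described in the abstract replaces this single folded-in efficiency term with separate terms for the different communication layers of the system, which is what narrows the estimation error from the accelerator level up to the large-system level.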

Acknowledgement

We give special thanks to Hyungon Ryu and Simon See at NVAITC and Wan Seo at NVIDIA for their technical support.
