Multi-communication layered HPL model and its application to GPU clusters

  • Kim, Young Woo (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Oh, Myeong-Hoon (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Park, Chan Yeol (Center for Development of Supercomputing System, Korea Institute of Science and Technology Information)
  • Received : 2020.10.13
  • Accepted : 2021.02.25
  • Published : 2021.06.01

Abstract

High-Performance Linpack (HPL) is among the most popular benchmarks for evaluating the capabilities of computing systems and has served as a standard for comparing their performance since the early 1980s. In the initial system-design stage, it is critical to estimate a system's capabilities quickly and accurately. However, the original HPL mathematical model, which assumes a single core and a single communication layer, yields varying accuracy for modern processors and accelerators comprising large numbers of cores. To reduce the performance-estimation gap between the HPL model and an actual system, we propose a mathematical model for multi-communication layered HPL. The effectiveness of the proposed model is evaluated by applying it to a GPU cluster and to well-known large systems. The results reveal a performance difference of 1.1% on a single GPU, while the GPU cluster and the well-known large systems show average differences of 5.5% and 4.1%, respectively. Compared with the original HPL model, the proposed multi-communication layered HPL model provides performance estimates within a few seconds and with a smaller error range, from the processor/accelerator level to the large-system level.
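For context, the sketch below illustrates the classical single-communication-layer HPL estimate that the proposed model refines. The operation count 2/3·N³ + 2·N² is the standard HPL figure; the problem size, peak rate, and sustained-efficiency values are illustrative assumptions, not results from the paper.

```python
# Minimal sketch (not the paper's model) of the classical single-layer HPL
# estimate.  The flop count 2/3*N^3 + 2*N^2 is the standard HPL operation
# count; the problem size, peak rate, and efficiencies are assumptions.

def hpl_flops(n: int) -> float:
    """Approximate floating-point operations for an N x N HPL run."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def estimated_runtime_s(n: int, peak_gflops: float, efficiency: float) -> float:
    """Single-layer estimate: compute time scaled by one assumed
    sustained-to-peak efficiency that folds all communication into a single term."""
    return hpl_flops(n) / (peak_gflops * 1e9 * efficiency)


if __name__ == "__main__":
    n = 100_000            # problem size N (assumed for illustration)
    peak = 4_700.0         # GFLOPS; roughly the FP64 peak of a PCIe P100 (assumption)
    for eff in (0.6, 0.7, 0.8):   # assumed sustained efficiencies
        t = estimated_runtime_s(n, peak, eff)
        rmax_tflops = hpl_flops(n) / t / 1e12
        print(f"eff={eff:.0%}: ~{t:,.0f} s, Rmax ≈ {rmax_tflops:.2f} TFLOPS")
```

Roughly speaking, the multi-communication layered model described in the abstract replaces this single folded-in efficiency term with separate terms for the different communication layers of the system, which is what narrows the estimation error from the accelerator level up to the large-system level.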

Acknowledgement

We give special thanks to Hyungon Ryu and Simon See at NVAITC and Wan Seo at NVIDIA for their technical support.
