http://dx.doi.org/10.4218/etrij.2020-0393

Multi-communication layered HPL model and its application to GPU clusters  

Kim, Young Woo (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
Oh, Myeong-Hoon (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
Park, Chan Yeol (Center for Development of Supercomputing System, Korea Institute of Science and Technology Information)
Publication Information
ETRI Journal, vol. 43, no. 3, 2021, pp. 524-537
Abstract
High-performance Linpack (HPL) is among the most popular benchmarks for evaluating the capabilities of computing systems and has served as a standard for comparing system performance since the early 1980s. In the initial system-design stage, it is critical to estimate the capabilities of a system quickly and accurately. However, the original HPL mathematical model, which assumes a single core and a single communication layer, yields varying accuracy for modern processors and accelerators comprising large numbers of cores. To reduce the performance-estimation gap between the HPL model and an actual system, we propose a mathematical model for multi-communication layered HPL. The effectiveness of the proposed model is evaluated by applying it to a GPU cluster and to well-known large systems. The results show a difference of 1.1% between the estimated and measured performance on a single GPU; for the GPU cluster and the well-known large systems, the average differences are 5.5% and 4.1%, respectively. Compared with the original HPL model, the proposed multi-communication layered HPL model provides performance estimates within a few seconds and with a smaller error range, from the processor/accelerator level to the large-system level.
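To make the baseline concrete, the sketch below illustrates the kind of single-communication-layer estimate the original HPL model relies on: the standard HPL operation count (2/3·N³ + 2·N²) divided by aggregate compute rate, plus one lumped communication term with a single bandwidth/latency pair. This is an illustrative assumption-laden sketch, not the paper's multi-communication layered model; all parameter names and values (peak rate, bandwidth, latency, communication volume) are placeholders chosen for demonstration.

```python
# Minimal sketch of a single-communication-layer HPL estimate (illustrative only).
# The constants used in __main__ are placeholders, not measurements from the paper.

def hpl_flops(n: int) -> float:
    """Operation count HPL uses when reporting performance: 2/3*N^3 + 2*N^2."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def estimate_hpl_time(n: int, nb: int, p: int, q: int,
                      peak_gflops_per_proc: float,
                      bw_gb_s: float, latency_us: float) -> float:
    """Compute time plus one lumped communication term.

    All data movement is charged to a single layer with one bandwidth/latency
    pair -- exactly the simplification a multi-layer model would refine by
    using separate parameters per layer (e.g., intra-node vs. inter-node).
    """
    procs = p * q
    compute_s = hpl_flops(n) / (procs * peak_gflops_per_proc * 1e9)
    # Crude communication volume: each of the N/NB panel steps moves on the
    # order of N*NB doubles across the process grid (placeholder estimate).
    steps = n // nb
    bytes_per_step = 8.0 * n * nb
    comm_s = steps * (latency_us * 1e-6 + bytes_per_step / (bw_gb_s * 1e9))
    return compute_s + comm_s

if __name__ == "__main__":
    n, nb, p, q = 100_000, 256, 4, 4
    t = estimate_hpl_time(n, nb, p, q,
                          peak_gflops_per_proc=500.0,  # placeholder per-process peak
                          bw_gb_s=12.0,                # placeholder link bandwidth
                          latency_us=2.0)              # placeholder latency
    print(f"estimated time: {t:.1f} s, "
          f"estimated rate: {hpl_flops(n) / t / 1e12:.2f} Tflop/s")
```

In a multi-communication layered variant, the single `comm_s` term would be split into per-layer terms (for example, PCIe/NVLink within a node and InfiniBand between nodes), each with its own bandwidth and latency, which is the direction the paper's model takes.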
Keywords
GPU cluster; GPU model; HPL; Linpack; mathematical model; multi-communication layered model;