Acknowledgement
We give special thanks to Hyungon Ryu and Simon See at NVAITC and Wan Seo at NVIDIA for their technical support.