1. J. J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK benchmark: Past, present and future, Tech. Rep., University of Tennessee, 2001.
2. S. Thomas, Network fabrics: Cray Aries, Zuse Institute Berlin, 2017, available at https://support.hlrn.de/twiki/pub/NewsCenter/ParProgWorkshopFall2017/03_Networks_Cray_Aries.pdf [last accessed Sept. 25, 2020].
3. T. Davies et al., High performance LINPACK benchmark: A fault tolerant implementation without checkpointing, in Proc. Int. Conf. Supercomput. (Tucson, AZ, USA), May 2011, pp. 162-171.
4. J. J. Dongarra and J. Langou, The problem with the LINPACK benchmark 1.0 matrix generator, Int. J. High Perform. Comput. Appl. 23 (2009), 5-13.
5. G. Quintana-Orti, X. Sun, and C. H. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM J. Sci. Comput. 19 (1998), 1486-1494.
6. W. Zhang, J. Fan, and M. Chen, Efficient determination of block size NB for parallel LINPACK test, in Proc. IASTED Int. Conf. Parallel Distrib. Comput. Syst. (Las Vegas, NV, USA), Nov. 2004.
7. M. Fatica, Accelerating LINPACK with CUDA on heterogeneous clusters, in Proc. Workshop GPGPU (Washington, DC, USA), Mar. 2009, pp. 46-51.
8. D. Rohr, J. de Cuveland, and V. Lindenstruth, A model for weak scaling to many GPUs at the basis of the LINPACK benchmark, in Proc. IEEE Int. Conf. Cluster Comput. (Taipei, Taiwan), Sept. 2016, pp. 192-202.
9. D. Rohr, M. Kretz, and M. Bach, CALDGEMM and HPL, Tech. Rep., Dec. 2010, available at http://code.compeng.unifrankfurt.de/attachments/10/techreport.pdf [last accessed September 2020].
10. T. Cornebize et al., Emulating high performance LINPACK on a commodity server at the scale of a supercomputer, hal-01654804, 2017.
11. Z. Jia et al., Dissecting the NVIDIA Volta GPU architecture via microbenchmarking, arXiv preprint arXiv:1804.06826, 2018.
12. A. Goldhammer and A. Ayer Jr., Understanding performance of PCI Express systems, Xilinx WP350 (v1.2), Sept. 4, 2014.
13. J. Vienne et al., Performance analysis and evaluation of InfiniBand FDR and 40GigE RoCE on HPC and cloud computing systems, in Proc. IEEE Symp. High-Perform. Interconnects (Santa Clara, CA, USA), Aug. 2012, pp. 48-55.
14. R. Milan et al., D2.2: Report on the ExaNoDe architecture design guidelines, ExaNoDe, Tech. Rep., 2016, available at https://exanode.eu/wp-content/uploads/2017/04/D2.2.pdf [last accessed September 2020].
15. H. Nakamura et al., Thorough analysis of PCIe Gen3 communication, in Proc. Int. Conf. ReConFigurable Comput. FPGAs (Cancun, Mexico), Dec. 2017, pp. 1-6.
16. R. Neugebauer et al., Understanding PCIe performance for end host networking, in Proc. 2018 Conf. ACM Special Interest Group Data Commun. (Budapest, Hungary), Aug. 2018, pp. 327-341.
17. A. Li et al., Tartan: Evaluating modern GPU interconnect via a multi-GPU benchmark suite, in Proc. 2018 IEEE Int. Symp. Workload Charact. (Raleigh, NC, USA), Sept. 2018, pp. 191-202.
18. HPC Advisory Council, Interconnect analysis: 10GigE and InfiniBand in high performance computing, Tech. Rep., 2009.
19. TechPowerUp, NVIDIA Tesla P100 PCIe 16 GB, available at https://www.techpowerup.com/gpu-specs/tesla-p100-pcie-16-gb.c2888 [last accessed Sept. 25, 2020].
20. M. Sindi, HowTo - High Performance Linpack (HPL), Tech. Rep., Center for Research Computing, University of Notre Dame, Jan. 2009.
21. FUGAKU system, RIKEN, Japan, available at https://www.r-ccs.riken.jp/en/fugaku/project [last accessed Sept. 25, 2020].
22. SUMMIT system, Oak Ridge National Laboratory, Oak Ridge, available at https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/ [last accessed Sept. 25, 2020].
23. Sierra system, Lawrence Livermore National Laboratory, Livermore, available at https://computing.llnl.gov/computers/sierra [last accessed Sept. 25, 2020].
24. HPC5 system, Eni, available at https://www.eni.com/en-IT/operations/green-data-center-hpc5.html [last accessed Sept. 25, 2020].
25. SELENE system, NVIDIA, Santa Clara, CA, available at https://blogs.nvidia.com/blog/2020/06/22/top500-isc-supercomputing/ [last accessed Sept. 25, 2020].
26. MARCONI100 system, CINECA, Bologna, available at https://www.hpc.cineca.it/hardware/marconi100 [last accessed Sept. 25, 2020].
27. S. N. Kandadai and X. He, Performance of HPC applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM, Armonk, NY, USA, 2007.
28. A. Kazi, D2.5: Report on the HPC application bottlenecks of the state-of-the-art HPC platforms, ExaNoDe, Tech. Rep., 2016, available at https://exanode.eu/wp-content/uploads/2017/04/%20D2.5.pdf [last accessed September 2020].
29. DGX SuperPOD, NVIDIA, Santa Clara, CA, available at https://developer.nvidia.com/blog/dgx-superpod-world-recordsupercomputing-enterprise/ [last accessed Sept. 25, 2020].
30. W. Feng, Analyzing MPI performance over 10-Gigabit Ethernet, J. Parallel Distrib. Comput. 65 (2005), 1253-1260.
31. E. Phillips and M. Fatica, A CUDA implementation of the high performance conjugate gradient benchmark, in Proc. Int. Workshop PMBS, vol. 8966, Springer, Cham, Switzerland, 2014, pp. 68-84.
32. C. Pearson et al., Evaluating characteristics of CUDA communication primitives on high-bandwidth interconnects, in Proc. 2019 ACM/SPEC Int. Conf. Perform. Eng. (Mumbai, India), Apr. 2019, pp. 209-218.
33. J. J. Dongarra and G. W. Stewart, LINPACK working note no. 15: LINPACK, a package for solving linear systems, Tech. Rep. ANL-82-30, W-31-109-Eng-38, Springfield, VA, USA, 1982.
34. J. Dongarra, The LINPACK benchmark: An explanation, in Supercomputing, vol. 297, Springer, Berlin, Heidelberg, 1988, pp. 456-474.
35. TOP500, The List, available at https://www.top500.org/ [last accessed Sept. 25, 2020].
36. T. Nguyen and S. B. Baden, LU factorization: Towards hiding communication overheads with a lookahead-free algorithm, in Proc. IEEE Int. Conf. Cluster Comput. (Chicago, IL, USA), Sept. 2015, pp. 394-397.
37. D. Zivanovic et al., Main memory in HPC: Do we need more or could we live with less?, ACM Trans. Archit. Code Optim. 14 (2017), 1-26.
38. Y. Ajima et al., The Tofu interconnect D, in Proc. IEEE Int. Conf. Cluster Comput. (Belfast, UK), Sept. 2018, pp. 646-654.
39. J. Razzaq et al., Performance characterization of multiprocessors and accelerators using micro-benchmarks, Int. J. Adv. Syst. Measure. 9 (2016), no. 1-2, 77-90.
40. A. Li et al., Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect, IEEE Trans. Parallel Distrib. Syst. 31 (2019), no. 1, 94-110.
41. D. De Sensi, S. Di Girolamo, and T. Hoefler, Mitigating network noise on dragonfly networks through application-aware routing, in Proc. Int. Conf. High Perform. Comput., Netw., Stor. Analysis (Denver, CO, USA), Nov. 2019, pp. 1-32.
42. L. Jenkins, Networks for high-performance computing, available at https://louisjenkinscs.github.io/survey/Networks_for_HighPerformance_Computing.pdf [last accessed September 2020].
43. PIZ DAINT system, CSCS, available at https://www.cscs.ch/computers/piz-daint/ [last accessed Sept. 25, 2020].