Page 1: PERFORMANCE GUIDE TESLA V100 - One Stop Systems · 2018. 7. 23. · CPU Server: Dual Xeon Gold 6140@2.30GHz, GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8X v100 config

May 2018

TESLA V100 PERFORMANCE GUIDE


TESLA V100
The Fastest and Most Productive GPU for AI and HPC

Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth


WEAK NODES
Lots of Nodes Interconnected with Vast Network Overhead

STRONG NODES
Few Lightning-Fast Nodes with Performance of Hundreds of Weak Nodes


CUSTOMER REFERENCE POINT IS SERVER NODE

Server Node They’ve Known for 20 Years

2x CPU: $4K

Memory: $1K

Interconnect NIC: $1K

Misc: $1K

Server Cost: $7K

Server Node with GPUs

Server: $7K

4x V100 GPU: $28K

GPU Server Cost: $35K

CUSTOMERS WANT TO COMPARE NODE COST
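The node-cost comparison above is a simple sum of component prices. A minimal sketch of that arithmetic follows, using only the prices quoted on the slide (in thousands of USD); the dictionary and function names are illustrative, not part of the guide.

```python
# Component prices in thousands of USD, copied from the slide.
# The names below are hypothetical and exist only for this illustration.
CPU_NODE_PARTS_K = {
    "2x CPU": 4,
    "Memory": 1,
    "Interconnect NIC": 1,
    "Misc": 1,
}
FOUR_V100_GPUS_K = 28  # price quoted on the slide for 4x V100


def cpu_node_cost_k(parts=CPU_NODE_PARTS_K):
    """Cost of the CPU-only server node: the sum of its components."""
    return sum(parts.values())


def gpu_node_cost_k(parts=CPU_NODE_PARTS_K, gpus_k=FOUR_V100_GPUS_K):
    """Cost of the GPU server node: the same base server plus the GPUs."""
    return cpu_node_cost_k(parts) + gpus_k


print(f"CPU server: ${cpu_node_cost_k()}K")
print(f"GPU server: ${gpu_node_cost_k()}K")
```

The two totals reproduce the slide's $7K and $35K figures.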


HOW TESLA SAVES MONEY

GPU NODE: $35K + $5K (core networking & OEM margins); total: $40K
CPU NODE: $7K + $2K (core networking & OEM margins); total: $9K

AMBER: 95 CPU nodes to match 1 GPU node; $855K spend on CPU nodes; $815K saved with GPU node (95% savings)
GTC: 16 CPU nodes to match 1 GPU node; $144K spend on CPU nodes; $104K saved with GPU node (72% savings)
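The savings figures on this slide follow from the per-node totals ($9K per CPU node, $40K per GPU node) and the number of CPU nodes needed to match one GPU node. The sketch below redoes that arithmetic; the function name and its defaults are assumptions made for illustration, while the inputs come from the slide.

```python
def savings_k(cpu_nodes_needed, cpu_node_total_k=9, gpu_node_total_k=40):
    """Return (CPU spend, amount saved, fraction saved), in thousands of USD.

    cpu_nodes_needed is the number of CPU nodes that match one GPU node.
    The default totals are the per-node figures quoted on the slide.
    """
    cpu_spend = cpu_nodes_needed * cpu_node_total_k
    saved = cpu_spend - gpu_node_total_k
    return cpu_spend, saved, saved / cpu_spend


for app, nodes in {"AMBER": 95, "GTC": 16}.items():
    spend, saved, fraction = savings_k(nodes)
    print(f"{app}: ${spend}K on CPU nodes, ${saved}K saved ({fraction:.0%})")
```

This reproduces the $855K / $815K (95%) and $144K / $104K (72%) figures quoted for AMBER and GTC.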


#1 IN AI AND HPC ACCELERATION

550+ GPU-ACCELERATED APPLICATIONS
All Top 15 Apps Are Accelerated

EVERY DEEP LEARNING FRAMEWORK ACCELERATED

DEFINING THE NEXT GIANT WAVE IN HPC
OAK RIDGE SUMMIT: US’s next fastest supercomputer; 200+ Petaflop HPC; 3+ Exaflop of AI
ABCI Supercomputer (AIST): Japan’s fastest AI supercomputer
Piz Daint: Europe’s fastest supercomputer


V100 PCIE PERFORMANCE SUMMARY
Performance Equivalency of Single GPU Server vs SKYLAKE CPU Servers

LIFE SCIENCE: AMBER
V100 GPU Node Outperforms 190 CPU Nodes

MATERIAL SCIENCE: QUANTUM ESPRESSO
V100 GPU Node Outperforms 12 CPU Nodes

PHYSICS: MILC
V100 GPU Node Outperforms 25 CPU Nodes

GEOPHYSICS: SPECFEM3D
V100 GPU Node Outperforms 80 CPU Nodes

[Charts: 1 Server with GPUs vs # of CPU Servers, one per application]

For benchmark details and input models, see the following slides in the guide.


MOLECULAR DYNAMICS


AMBER Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

AMBER
Molecular Dynamics
Suite of programs to simulate molecular dynamics on biomolecules

VERSION: 16.12
ACCELERATED FEATURES: PMEMD Explicit Solvent & GB; Explicit & Implicit Solvent, REMD, aMD
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: http://ambermd.org/gpus

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 48 CPU servers (30x)
1 server, 4x V100 GPUs: 95 CPU servers (59x)
1 server, 8x V100 GPUs: 202 CPU servers (126x)

SKL/V100 #’s may change

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8x V100 config
CUDA Version: CUDA 9.0.176; Dataset: PME-Cellulose_NVE
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.
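The CPU-node-equivalence method stated on this slide (measured results up to 8 CPU nodes, linear scaling beyond that) can be sketched as below. This is one plausible reading of the method, not the guide's actual procedure: the function name is hypothetical, the throughput numbers are made up for illustration, and the extrapolation assumes each additional node beyond the largest measured run contributes the same per-node throughput as that run.

```python
import math


def cpu_nodes_to_match(gpu_throughput, measured):
    """Estimate how many CPU nodes match one GPU server.

    measured maps a CPU node count to the throughput measured at that
    count (for example, runs on 1, 2, 4 and 8 nodes). Within the measured
    range, the smallest measured count that reaches the GPU throughput is
    returned. Beyond it, throughput is assumed to scale linearly at the
    per-node rate of the largest measured run.
    """
    for nodes in sorted(measured):
        if measured[nodes] >= gpu_throughput:
            return nodes
    largest = max(measured)
    per_node = measured[largest] / largest
    return math.ceil(gpu_throughput / per_node)


# Hypothetical throughputs (arbitrary units); not taken from the guide.
measured_cpu = {1: 1.0, 2: 1.9, 4: 3.6, 8: 6.8}

print(cpu_nodes_to_match(3.0, measured_cpu))   # within the measured range
print(cpu_nodes_to_match(59.0, measured_cpu))  # extrapolated beyond 8 nodes
```

Because real multi-node CPU runs rarely scale perfectly, extrapolating linearly from the 8-node measurement tends to favor the CPU side, which makes the resulting node counts a conservative estimate under this reading.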


HOOMD-Blue Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

HOOMD-Blue
Molecular Dynamics
Particle dynamics package written from the ground up for GPUs

VERSION: 2.2.2
ACCELERATED FEATURES: CPU & GPU versions available
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://codeblue.umich.edu/hoomd-blue/index.html

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 34 CPU servers (30x)
1 server, 4x V100 GPUs: 43 CPU servers (38x)
1 server, 8x V100 GPUs: 79 CPU servers (70x)

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8x V100 config
CUDA Version: CUDA 9.0.176; Dataset: microsphere
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


LAMMPS Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

LAMMPS
Molecular Dynamics
Classical molecular dynamics package

VERSION: 2018
ACCELERATED FEATURES: Lennard-Jones, Gay-Berne, Tersoff, many more potentials
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://lammps.sandia.gov/index.html

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 7 CPU servers (6x)
1 server, 4x V100 GPUs: 11 CPU servers (10x)
1 server, 8x V100 GPUs: 20 CPU servers (18x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8x V100 config
CUDA Version: CUDA 9.0.176; Dataset: Atomic-Fluid Lennard-Jones 2.5 Cutoff
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


NAMD Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

NAMD
Molecular Dynamics
Designed for high-performance simulation of large molecular systems

VERSION: 2.13
ACCELERATED FEATURES: Full electrostatics with PME and most simulation features
SCALABILITY: Up to 100M atom capable, multi-GPU, scales to 2x P100
MORE INFORMATION: http://www.ks.uiuc.edu/Research/namd/

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 11 CPU servers (9x)
1 server, 4x V100 GPUs: 14 CPU servers (12x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.176; Dataset: STMV
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


MATERIALS SCIENCE


Quantum Espresso Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

Quantum Espresso
Material Science (Quantum Chemistry)
An open-source suite of computer codes for electronic structure calculations and materials modeling at the nanoscale.

VERSION: 6.2.1
ACCELERATED FEATURES: Linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.quantum-espresso.org

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 7 CPU servers (4x)
1 server, 4x V100 GPUs: 13 CPU servers (8x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.176; Dataset: AUSURF112-jR
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


PHYSICS


Chroma Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

Chroma
Physics
Lattice Quantum Chromodynamics (LQCD)

VERSION: 2018
ACCELERATED FEATURES: Wilson-clover fermions, Krylov solvers, Domain-decomposition
SCALABILITY: Multi-GPU
MORE INFORMATION: http://jeffersonlab.github.io/chroma/

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 6 CPU servers (6x)
1 server, 4x V100 GPUs: 13 CPU servers (16x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.176; Dataset: szscl21_24_128
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


GTC Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

GTC
Physics
GTC is used for Gyrokinetic Particle Simulation of Turbulent Transport in Burning Plasmas.

VERSION: 2017
ACCELERATED FEATURES: Push, shift, and collision
SCALABILITY: Multi-GPU
MORE INFORMATION: www.nvidia.com/gtc-p

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 9 CPU servers (9x)
1 server, 4x V100 GPUs: 18 CPU servers (16x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.176; Dataset: mpi#proc.in
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


MILC Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

MILC
Physics
Lattice Quantum Chromodynamics (LQCD) codes simulate how elemental particles are formed and bound by the “strong force” to create larger particles like protons and neutrons

VERSION: 2018
ACCELERATED FEATURES: Staggered fermions, Krylov solvers, Gauge-link fattening
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/milc

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 15 CPU servers (14x)
1 server, 4x V100 GPUs: 28 CPU servers (27x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: Dual Xeon Gold 6140 @ 2.30GHz with Tesla V100 PCIe (16GB)
CUDA Version: CUDA 9.0.176; Dataset: MILC APEX Medium
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


QUDA Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

QUDA
Physics
A library for Lattice Quantum Chromodynamics on GPUs

VERSION: 2017
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: www.nvidia.com/quda

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 38 CPU servers (29x)
1 server, 4x V100 GPUs: 68 CPU servers (51x)
1 server, 8x V100 GPUs: 89 CPU servers (68x)

No change

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8x V100 config
CUDA Version: CUDA 9.0.103; Dataset: Dslash Wilson-Clover; Precision: Single; Gauge Compression/Recon: 12; Problem Size: 32x32x32x64
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.
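The QUDA speedups quoted on this slide (29x, 51x, 68x) grow more slowly than the GPU count. A small sketch of how multi-GPU scaling efficiency could be derived from such figures is shown below. It assumes the three speedups correspond to the 2x, 4x and 8x V100 servers in that order, and it normalizes against the smallest configuration; the function name is hypothetical.

```python
# Speedup vs one CPU server, keyed by GPU count. The pairing of the
# slide's 29x / 51x / 68x with 2, 4 and 8 GPUs is an assumption.
quda_speedups = {2: 29, 4: 51, 8: 68}


def scaling_efficiency(speedups):
    """Per-GPU speedup relative to the smallest GPU configuration.

    A value of 1.0 means each GPU contributes as much speedup as a GPU
    in the smallest configuration; lower values mean diminishing returns.
    """
    base_gpus = min(speedups)
    base_per_gpu = speedups[base_gpus] / base_gpus
    return {
        gpus: (speedup / gpus) / base_per_gpu
        for gpus, speedup in speedups.items()
    }


for gpus, eff in sorted(scaling_efficiency(quda_speedups).items()):
    print(f"{gpus}x V100: {eff:.0%} of the per-GPU speedup at 2x")
```

The same function can be applied to the speedup figures on any of the other application slides to compare how well each code scales within a node.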


GEOSCIENCE


RTM Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

RTM
Geoscience (Oil & Gas)
Reverse time migration (RTM) modeling is a critical component in the seismic processing workflow of oil and gas exploration

VERSION: 2017
ACCELERATED FEATURES: Batch algorithm
SCALABILITY: Multi-GPU and Multi-Node

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 6 CPU servers (5x)
1 server, 4x V100 GPUs: 11 CPU servers (10x)
1 server, 8x V100 GPUs: 21 CPU servers (20x)

Speed-up slight change, rounded up old raw data

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server w/ V100 PCIe or V100 SXM2 on 8x V100 config
CUDA Version: CUDA 9.0.103; Dataset: TTI RX 2pass mgpu
To arrive at CPU node equivalence, we use linear scaling to scale beyond 1 node.


SPECFEM3D Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

SPECFEM3D
Geoscience
Simulates seismic wave propagation

VERSION: 2.0.2
SCALABILITY: Multi-GPU and Single-Node
MORE INFORMATION: https://geodynamics.org/cig/software/specfem3d_globe/

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 28 CPU servers (27x)
1 server, 4x V100 GPUs: 54 CPU servers (52x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.176; Dataset: four_material_simple_model
To arrive at CPU node equivalence, we use linear scaling to scale beyond 1 node.


ENGINEERING


SIMULIA Abaqus Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

SIMULIA Abaqus
Engineering
Simulation tool for analysis of structures

VERSION: 2017
ACCELERATED FEATURES: Direct Sparse Solver; AMS Eigen Solver; Steady-state Dynamics Solver
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/simulia-abaqus

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 5 CPU servers (2.2x)
1 server, 4x V100 GPUs: 10 CPU servers (3.0x)

Speed-up slight change, rounded up old raw data

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 7.5; Dataset: LS-EPP-Combined-WC-Mkl (RR)
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


ANSYS Fluent Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

ANSYS Fluent
Engineering
General purpose software for the simulation of fluid dynamics

VERSION: 18
ACCELERATED FEATURES: Pressure-based Coupled Solver and Radiation Heat Transfer
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.nvidia.com/ansys-fluent

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 3 CPU servers (2x)
1 server, 4x V100 GPUs: 4 CPU servers (3x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 6.0; Dataset: Water Jacket
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


COMPUTATIONAL FINANCE


STAC-A2 Benchmark Performance Results
8 x V100 GPU Server vs Dual Skylake Platinum 8180 Server

STAC-A2
Computational Finance
Financial risk management benchmark created by leading global banks working with the Securities Technology Analysis Center (STAC), used to assess financial compute solutions.

VERSION: STAC-A2 (Beta 2)
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: www.STACresearch.com/nvidia

[Chart: Throughput, Latency, and Energy Efficiency. Values shown: 8.9x, 6.2x, 2.9x]

System Configuration: GPU Server SUT ID: NVDA171020 | STAC-A2 Pack for CUDA (Rev D) | GPU Server: 8 x NVIDIA Tesla V100 (Volta) GPU, 2 x Intel® Xeon E5-2680 v4 @ 2.4 GHz | CUDA Version 9.0 | Compared to: CPU Server SUT ID: INTC170920 | STAC-A2 Pack for Intel Composer XE (Rev K) | 2 x 28-Core Intel Xeon Platinum 8180 @ 2.5GHz
“Throughput” is STAC-A2.β2.HPORTFOLIO.SPEED, “Latency” is STAC-A2.β2.GREEKS.WARM, and “Energy Efficiency” is STAC-A2.β2.ENERG_EFF.
“STAC” and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center, LLC.


BENCHMARKS


Cloverleaf Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

Cloverleaf
Benchmark – Mini-App
Hydrodynamics

VERSION: 1.3
ACCELERATED FEATURES: Lagrangian-Eulerian explicit hydrodynamics mini-application
SCALABILITY: Multi-Node (MPI)
MORE INFORMATION: http://uk-mac.github.io/CloverLeaf/

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 19 CPU servers (25x)
1 server, 4x V100 GPUs: 22 CPU servers (29x)

No change

CPU Server: Dual Xeon E5-2690 v4 @ 2.6GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.103; Dataset: bm32
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


HPCG Performance Equivalency
Single GPU Server vs Multiple Broadwell CPU-Only Servers

HPCG
Benchmark
Exercises computational and data access patterns that closely match a broad set of important HPC applications

VERSION: 3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: http://www.hpcg-benchmark.org/index.html

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 12 CPU servers (19x)
1 server, 4x V100 GPUs: 23 CPU servers (22x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.103; Dataset: 256x256x256 local size
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


Linpack Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

Linpack
Benchmark
Measures floating point computing power

VERSION: 2.1
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU and Multi-Node
MORE INFORMATION: https://www.top500.org/project/linpack/

# of CPU-Only Servers matched:
1 server, 2x V100 GPUs: 6 CPU servers
1 server, 4x V100 GPUs: 11 CPU servers

Speed up vs CPU server: 9x, 18x, 22x

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.103; Dataset: HPL.dat
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


MiniFE Performance Equivalency
Single GPU Server vs Multiple Skylake CPU-Only Servers

MiniFE
Benchmark – Mini-App
Finite Element Analysis

VERSION: 0.3
ACCELERATED FEATURES: All
SCALABILITY: Multi-GPU
MORE INFORMATION: https://mantevo.org/about/applications/

# of CPU-Only Servers matched, with speedup vs CPU server:
1 server, 2x V100 GPUs: 8 CPU servers (7x)
1 server, 4x V100 GPUs: 15 CPU servers (15x)

CPU Server: Dual Xeon Gold 6140 @ 2.30GHz; GPU Servers: same CPU server w/ V100 PCIe
CUDA Version: CUDA 9.0.103; Dataset: 350x350x350
To arrive at CPU node equivalence, we use measured benchmarks with up to 8 CPU nodes, then use linear scaling to scale beyond 8 nodes.


DEEP LEARNING


Caffe2 Deep Learning Framework
Training on 8 x V100 GPU Server vs 8 x P100 GPU Server

Caffe2
Deep Learning
A popular, GPU-accelerated Deep Learning framework developed at UC Berkeley and Facebook

VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU
MORE INFORMATION: research.fb.com/downloads/caffe2/

Server with 8x V100 16GB PCIe: 2.0x Avg. Speedup
Server with 8x V100 SXM2 16GB: 2.2x Avg. Speedup
SXM2 up to 27% faster than PCIe

CPU Server: Dual Xeon E5-2698 [email protected], GPU servers as shown
Ubuntu 14.04.5; CUDA Version: CUDA 9.0.176; NCCL 2.0.5, CuDNN 7.0.2.43; Driver 384.66
Data set: ImageNet; Batch sizes: Inception V3 and ResNet-50: 64 for P100 SXM2, 128 for Tesla V100; Seq2Seq: 192; VGG16: 96


NVIDIA TENSORRT 3
Massive Throughput and Amazing Efficiency at Low Latency

CNN - IMAGES

[Chart: ResNet-50 Throughput. Y axis: Images/Sec (Target 7ms latency). Bars: CPU + Caffe, P100 + TensorRT, P4 + TensorRT, V100 + TensorRT. Latency label shown: 17ms]

[Chart: GoogleNet Throughput. Y axis: Images/Sec (Target 7ms latency). Bars: CPU + Caffe, P100 + TensorRT, P4 + TensorRT, V100 + TensorRT. Latency labels shown: 8ms, 7ms, 7ms]

CPU throughput is based on measured inference throughput on a Broadwell-based Xeon E5-2690 v4 CPU, doubled to reflect Intel’s stated claim that the Xeon Scalable Processor will deliver 2x the performance of Broadwell-based Xeon CPUs on Deep Learning Inference.
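The CPU baseline on this slide is not a direct measurement on a Xeon Scalable processor: it is a measured Broadwell throughput doubled, following Intel's stated 2x claim. The sketch below shows how that adjustment affects a GPU-vs-CPU throughput ratio. The function names are hypothetical and the throughput numbers are invented for illustration; none of them are taken from the guide.

```python
def adjusted_cpu_baseline(measured_broadwell_ips, claimed_gain=2.0):
    """Scale a measured Broadwell throughput (images/sec) by the claimed
    generational gain to stand in for a newer CPU."""
    return measured_broadwell_ips * claimed_gain


def gpu_vs_cpu_ratio(gpu_ips, measured_broadwell_ips, claimed_gain=2.0):
    """Throughput ratio of the GPU against the adjusted CPU baseline."""
    return gpu_ips / adjusted_cpu_baseline(measured_broadwell_ips, claimed_gain)


# Hypothetical throughputs in images/sec; not values from the charts.
example_gpu_ips = 5000.0
example_broadwell_ips = 100.0

print(gpu_vs_cpu_ratio(example_gpu_ips, example_broadwell_ips))
print(gpu_vs_cpu_ratio(example_gpu_ips, example_broadwell_ips, claimed_gain=1.0))
```

Passing `claimed_gain=1.0` gives the ratio against the raw Broadwell measurement, which is twice the ratio obtained with the doubled baseline.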