Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019

Posted: 28-Jun-2020

TRANSCRIPT

Page 1:

Jan 2018

TESLA PLATFORM

Page 2: A NEW ERA OF COMPUTING

1995: PC INTERNET (WinTel, Yahoo!), 1 billion PC users
2005: MOBILE-CLOUD (iPhone, Amazon AWS), 2.5 billion mobile users
2015: AI & IOT (Deep Learning, GPU), 100s of billions of devices

Page 3: NVIDIA, "THE AI COMPUTING COMPANY"

Artificial Intelligence · Computer Graphics · GPU Computing

Page 4: RISE OF GPU COMPUTING

GPU-computing performance has grown 1.5X per year, on track for 1000X by 2025; single-threaded CPU performance, once also 1.5X per year, now improves only 1.1X per year. [Chart: performance on a log scale from 10^2 to 10^7, 1980 to 2020]

The full stack: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp.

Page 5: ELEVEN YEARS OF GPU COMPUTING

2006: CUDA launched
2008: World's first GPU Top500 system
2010: Fermi, world's first HPC GPU
2012: Oak Ridge deploys world's fastest supercomputer w/ GPUs; AlexNet beats expert code by a huge margin using GPUs
2014: Stanford builds AI machine using GPUs
2017: Top 13 greenest supercomputers powered by NVIDIA GPUs

Milestones along the way: world's first atomic model of the HIV capsid; world's first 3-D mapping of the human genome; discovery of how H1N1 mutates to resist drugs; Google outperforms humans on ImageNet; a GPU-trained AI machine beats the world champion in Go.

Page 6: TESLA PLATFORM - World's Leading Data Center Platform for Accelerating HPC and AI

The stack, bottom to top:
- TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, system OEMs, cloud
- NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, DeepStream SDK, NCCL, cuBLAS, cuSPARSE), ComputeWorks (CUDA C/C++, Fortran)
- INDUSTRY FRAMEWORKS & TOOLS: deep learning frameworks, ecosystem tools
- APPLICATIONS: internet services; enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense); HPC (450+ applications)

Page 7: MOST ADOPTED PLATFORM FOR ACCELERATING HPC

500+ GPU-accelerated applications; all top 15 HPC apps accelerated: VASP, AMBER, NAMD, GROMACS, Gaussian, Simulia Abaqus, WRF, OpenFOAM, ANSYS, LS-DYNA, BLAST, LAMMPS, ANSYS Fluent, Quantum Espresso, GAMESS

14X growth in GPU developers: 45,000 (2012) to 615,000 (2017)

Defining the next giant wave in HPC:
- Oak Ridge Summit: the US's next fastest supercomputer, 200+ petaflops of HPC and 3+ exaflops of AI
- ABCI Supercomputer (AIST): Japan's fastest AI supercomputer
- Piz Daint: Europe's fastest supercomputer

Page 8: MOST ADOPTED PLATFORM FOR ACCELERATING AI

Every deep learning framework accelerated

25X growth in companies engaged: 1,500 (2014) to 39,637 (2017)

Available everywhere: cloud services, systems, desktops

Page 9: TESLA PLATFORM FOR HPC

Page 10: ARCHITECTING MODERN DATACENTERS - BIG INEFFICIENCIES WITH CPU NODES

Single GPU server 3.5x faster than the largest CPU data center: in an AMBER simulation of CRISPR, nature's tool for genome editing, 1 node with 4x V100 GPUs outperforms 48 CPU nodes of the Comet supercomputer. [Chart: ns/day vs. number of CPUs, 0 to 100]

AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms. CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB.

Page 11: WEAK NODES vs. STRONG NODES

WEAK NODES: lots of nodes interconnected with vast network overhead
STRONG NODES: few lightning-fast nodes with the performance of hundreds of weak nodes
[Diagram: server racks and network fabric]

Page 12: ARCHITECTING MODERN DATACENTERS

- Strong-core CPU for sequential code
- Volta: 5,120 CUDA cores
- 125 TFLOPS Tensor Core
- NVLink for strong scaling

Page 13: 70% OF THE WORLD'S SUPERCOMPUTING WORKLOAD ACCELERATED

Intersect360 Research, Nov 2017, "HPC Application Support for GPU Computing"

Top 15 HPC applications (VASP, AMBER, NAMD, GROMACS, Gaussian, Simulia Abaqus, WRF, OpenFOAM, ANSYS, LS-DYNA, BLAST, LAMMPS, ANSYS Fluent, Quantum Espresso, GAMESS) among 500+ accelerated applications.

Page 14: GPU-ACCELERATED HPC APPLICATIONS - 500+ APPLICATIONS

- MFG, CAD & CAE (111 apps), including: ANSYS Fluent, Abaqus SIMULIA, AutoCAD, CST Studio Suite
- Life Sciences (50+ apps), including: Gaussian, VASP, AMBER, HOOMD-blue, GAMESS
- Data Science & Analytics (23 apps), including: MapD, Kinetica, Graphistry
- Deep Learning (32 apps), including: Caffe2, MXNet, TensorFlow
- Media & Entertainment (142 apps), including: DaVinci Resolve, Premiere Pro CC, Redshift Renderer
- Physics (20 apps), including: QUDA, MILC, GTC-P
- Oil & Gas (17 apps), including: RTM, SPECFEM3D
- Safety & Security (15 apps), including: Cylance, FaceControl, Syndex Pro
- Tools & Management (15 apps), including: Bright Cluster Manager, HPCToolkit, Vampir
- Federal & Defense (13 apps), including: ArcGIS Pro, ENVI, SocetGXP
- Climate & Weather (4 apps), including: COSMO, GALES, WRF
- Computational Finance (16 apps), including: O-Quant Options Pricing, MUREX, MISYS

Page 15: DEEP LEARNING COMES TO HPC

Pipeline: SIMULATION (FP64/FP32) -> TRAINING (FP32/FP16) -> INFERENCE (FP16/INT8) -> REGRESSION TESTING (FP16/INT8), with training sets, regression sets, new data, and errors flowing between the stages.

Page 16: AI ACCELERATES SCIENTIFIC DISCOVERY

- UIUC & NCSA, astrophysics: 5,000X faster LIGO signal processing
- U. Florida & UNC, drug discovery: 300,000X faster molecular energetics prediction
- SLAC, astrophysics: gravitational lensing analysis from weeks to 10 ms
- U.S. DoE, particle physics: 33% more accurate neutrino detection
- Princeton & ITER, clean energy: 50% higher accuracy for fusion sustainment
- U. Pitt, drug discovery: 35% higher accuracy for protein scoring

Page 17: ONE PLATFORM BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE

The Tesla platform accelerates AI and HPC alike, with CUDA as the common foundation.

Page 18: DRAMATICALLY MORE FOR YOUR MONEY

Equal throughput with fewer racks (~30 kW per rack): a single rack of 36 CPUs + 72 V100s ($0.8M) matches
- VASP: 5 racks, 360 CPUs ($2.0M)
- RTM: 14 racks, 1152 CPUs ($6.0M)
- ResNet-50 (DL training): 22 racks, 1764 CPUs ($9.2M)

Save up to $8M with each GPU-accelerated rack.

The budget also shifts toward compute: a traditional data center spends 39% on compute servers and 61% on non-compute (rack, cabling, infrastructure, networking), while the smaller, efficient GPU build spends 85% on compute servers and 15% on non-compute. Source: Microsoft Research cost model for traditional data center costs.

Page 19: DATA CENTER SAVINGS FOR MIXED WORKLOADS - 5X Better HPC TCO for Same Throughput

Mixed workload: materials science (VASP), life sciences (AMBER), physics (MILC), deep learning (ResNet-50)

160 self-hosted servers at 96 kW vs. 12 accelerated servers (4x V100 GPUs each) at 20 kW: same throughput, 1/3 the cost, 1/4 the space, 1/5 the power.

Page 20: TESLA V100 - The Fastest and Most Productive GPU for AI and HPC

- Volta architecture: most productive GPU
- Tensor Core: 125 programmable TFLOPS for deep learning
- Improved SIMT model: new algorithms
- Volta MPS: inference utilization
- Improved NVLink & HBM2: efficient bandwidth

Page 21: VOLTA TO FUEL SUMMIT - Next Milestone in AI Supercomputing

- Performance leadership: 200 PF, 10X perf over Titan's 20 PF
- AI exascale today: 3+ EFLOPS of Tensor ops
- Accelerated science, 5-10X application perf over Titan: ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, XGC

Page 22: BREAKTHROUGH EFFICIENCY ON THE PATH TO EXASCALE

Ahead of the curve: top GPU systems in the Green500 list, measured GFLOPS per watt (with NVIDIA projections for V100), against the 33 GF/W exascale goal:
- Eurotech Aurora (K20): 3.2
- Tsubame-KFC (K20X): 4.4
- Tsubame-KFC (K80): 5.3
- SaturnV (P100): 9.5
- Tsubame 3 (P100): 14.1
- V100: projected higher still

13/13 greenest supercomputers powered by Tesla P100: TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX Saturn V, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE
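To put the 33 GF/W exascale goal in perspective, the facility power implied by a given efficiency is simple arithmetic (the 1-exaflop target is the usual exascale benchmark; the helper function here is ours, not from the deck):

```python
def power_mw(flops, gflops_per_watt):
    """Facility power in megawatts to sustain `flops` at a given efficiency."""
    watts = flops / (gflops_per_watt * 1e9)
    return watts / 1e6

# One exaflop at the 33 GF/W goal from the slide: about 30 MW
print(round(power_mw(1e18, 33), 1))

# The same exaflop at Tsubame 3's measured 14.1 GF/W: about 71 MW
print(round(power_mw(1e18, 14.1), 1))
```

This is why the efficiency curve, not just peak FLOPS, decides whether an exascale machine fits in a realistic power envelope.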

Page 23: POWER OF GPU COMPUTING PLATFORM - Delivered Value Grows Over Time

AMBER performance (ns/day), Cellulose NVE dataset:
- K20 (2013): AMBER 12, CUDA 4
- K40 (2014): AMBER 14, CUDA 4
- K80 (2015): AMBER 14, CUDA 6
- P100 (2016): AMBER 16, CUDA 8
- V100 (2017): AMBER 16, CUDA 9
[Chart: throughput rising toward ~60 ns/day across the generations]

GoogLeNet training performance (images/sec), ImageNet dataset:
- 8X K80 (2014): cuDNN 2, CUDA 6
- 8X Maxwell (2015): cuDNN 4, CUDA 7
- DGX-1 (2016): cuDNN 6, CUDA 8, NCCL 1.6
- DGX-1V (2017): cuDNN 7, CUDA 9, NCCL 2
[Chart: throughput rising toward ~12,000 images/sec]

Page 24: TESLA PLATFORM FOR AI

Page 25: AI REVOLUTIONIZING OUR WORLD

Search, assistants, translation, recommendations, shopping, photos… Detecting, diagnosing, and treating diseases. Powering breakthroughs in agriculture, manufacturing, EDA.

Page 26: NEURAL NETWORK COMPLEXITY IS EXPLODING - Bigger and More Compute Intensive

- Image (GOP x bandwidth): 350X growth from AlexNet through GoogleNet, ResNet-50, Inception-v2, and Inception-v4 (2011-2017)
- Speech (GOP x bandwidth): 30X growth from DeepSpeech through DeepSpeech 2 and DeepSpeech 3 (2013-2018)
- Translation (GOP x bandwidth): 10X growth across GNMT, OpenNMT, and MoE (2014-2018)

Page 27: PLATFORM BUILT FOR AI - Delivering 125 TFLOPS of DL Performance with Volta

The Volta Tensor Core is a 4x4 matrix processing array computing D[FP32] = A[FP16] * B[FP16] + C[FP32], optimized for deep learning.

Volta-optimized cuDNN feeds it: matrix data optimization builds dense matrices of tensor compute, and tensor-op conversion turns FP32 framework data into tensor-op data. All major frameworks are supported.
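The contract D[FP32] = A[FP16] * B[FP16] + C[FP32] can be imitated in NumPy to see what mixed precision means numerically (a sketch of the data types only, not of the hardware path; the actual Tensor Core performs the FP16 products and FP32 accumulation inside a 4x4 array each clock):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 4)).astype(np.float16)   # FP16 input matrix A
b = rng.random((4, 4)).astype(np.float16)   # FP16 input matrix B
c = rng.random((4, 4)).astype(np.float32)   # FP32 accumulator C

# Inputs are stored in FP16, but the multiply-accumulate is carried out
# in FP32, so precision is not lost while summing products.
d = a.astype(np.float32) @ b.astype(np.float32) + c

assert d.dtype == np.float32 and d.shape == (4, 4)
```

Storing weights and activations in FP16 halves memory traffic, while FP32 accumulation keeps training numerically stable; that combination is what the 125 TFLOPS figure measures.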

Page 28: GPU DEEP LEARNING IS A NEW COMPUTING MODEL - TRAINING

Training, in the datacenter, takes billions of trillions of operations; GPUs train larger models and accelerate time to market, and trained models are then deployed to devices.

Page 29: REVOLUTIONARY AI PERFORMANCE - 3X Faster DL Training Performance

3X reduction in time to train over P100. Relative time to train (LSTM): neural machine translation for 13 epochs, German -> English, WMT15 subset, CPU = 2x Xeon E5-2699 v4:
- 2X CPU: 15 days
- 1X P100: 18 hours
- 1X V100: 6 hours

Over 80X DL training performance in 3 years: GoogleNet training speedup vs. 1x K80 with cuDNN 2:
- Q1 '15: 1x K80, cuDNN 2
- Q3 '15: 4x M40, cuDNN 3
- Q2 '16: 8x P100, cuDNN 6
- Q2 '17: 8x V100, cuDNN 7

Page 30: NVIDIA GPUS POWER WORLD'S FASTEST DEEP LEARNING PERFORMANCE

Time to train ResNet-50 (ImageNet dataset, trained for 90 epochs):
- Facebook, June '17, 256 Tesla P100: 60 mins
- IBM, Aug '17, 256 Tesla P100: 48 mins
- Preferred Networks, Nov '17, 1024 Tesla P100: 15 mins

Page 31: GPU DEEP LEARNING IS A NEW COMPUTING MODEL - DATACENTER INFERENCING

Datacenters serve 10s of billions of image, voice, and video queries per day; GPU inference delivers fast response and maximizes datacenter throughput.

Page 32: NVIDIA TENSORRT - PROGRAMMABLE INFERENCE ACCELERATOR

One TensorRT runtime targets the whole deployment range: Tesla V100, Tesla P4, DRIVE PX 2, Jetson TX2, and NVIDIA DLA.

Page 33: NVIDIA TENSORRT 3 - World's Fastest Inference Platform

Images: ResNet-50 throughput at a 7 ms latency target [chart: images/sec, 0 to 6,000]
- CPU + TensorFlow: 14 ms latency
- V100 + TensorFlow: 7 ms
- V100 + TensorRT: 7 ms, at the highest throughput

Translation: OpenNMT throughput at a 200 ms latency target [chart: sentences/sec, 0 to 600]
- CPU + Torch: 280 ms latency
- V100 + Torch: 153 ms
- V100 + TensorRT: 117 ms, at the highest throughput

Page 34: NVIDIA PLATFORM SAVES DATA CENTER COSTS - Game Changing Inference Performance

Inference workload: image recognition using ResNet-50 at 45,000 images/sec. 160 CPU servers draw 65 kW; 1 HGX server draws 3 kW. Same throughput, 1/4 the space, 1/22 the power.

Page 35: GPU-ACCELERATED INFERENCE

- iFLYTEK: speech recognition
- Valossa: video intelligence
- Microsoft Bing: visual search

Page 36: TESLA PRODUCT FAMILY

Page 37: END-TO-END PRODUCT FAMILY

- HYPERSCALE HPC (deep learning training & inference): Tesla V100 for training & inference; Tesla P4 for the most efficient inference & transcoding
- STRONG-SCALE HPC (HPC and DL workloads scaling to multiple GPUs): Tesla V100 with NVLink
- MIXED-APPS HPC (HPC workloads with a mix of CPU and GPU work): Tesla V100 with PCIe
- FULLY INTEGRATED SUPERCOMPUTER (fully integrated deep learning solution): DGX-1 Server, DGX Station

Page 38: OPTIMIZED FOR DATACENTER EFFICIENCY

MaxQ: 75% of the performance at half the power, for 30% more performance in a rack. ResNet-50 training (computer vision):
- V100 @ MaxP: a 13 kW rack fits 4 nodes of 8x V100, for 1X rack throughput
- V100 @ MaxQ: a 13 kW rack fits 7 nodes of 8x V100, for 1.3X rack throughput
[Chart: DL perf and DL perf/watt vs. GPU power, 50-300 W, marking the max-performance and max-efficiency operating points]

Page 39: TESLA V100

              For NVLink Servers                              For PCIe Servers
Core          5120 CUDA cores, 640 Tensor cores               5120 CUDA cores, 640 Tensor cores
Compute       7.8 TF DP ∙ 15.7 TF SP ∙ 125 TF DL              7 TF DP ∙ 14 TF SP ∙ 112 TF DL
Memory        HBM2: 900 GB/s ∙ 16 GB                          HBM2: 900 GB/s ∙ 16 GB
Interconnect  NVLink (up to 300 GB/s) + PCIe Gen3 (32 GB/s)   PCIe Gen3 (up to 32 GB/s)
Power         300 W                                           250 W
Available     Now                                             Now
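The compute figures follow from core counts, per-clock throughput, and clock rate; assuming a boost clock of roughly 1.53 GHz for the NVLink part (the clock is not stated on the slide), the peaks can be reproduced:

```python
def peak_tflops(units, flops_per_unit_per_clock, clock_ghz):
    """Peak TFLOPS = execution units x FLOPs per unit per clock x clock (GHz) / 1000."""
    return units * flops_per_unit_per_clock * clock_ghz / 1000.0

clock = 1.53  # GHz: assumed boost clock of the SXM2 (NVLink) V100

# FP32: 5120 CUDA cores, 2 FLOPs per clock (one fused multiply-add)
print(round(peak_tflops(5120, 2, clock), 1))   # ~15.7 TF SP

# FP64 runs at half the FP32 rate
print(round(peak_tflops(5120, 1, clock), 1))   # ~7.8 TF DP

# Tensor Cores: 640 units, each doing 64 FMAs = 128 FLOPs per clock
print(round(peak_tflops(640, 128, clock), 1))  # ~125.3 TF DL
```

The PCIe column's lower numbers follow the same arithmetic at the part's lower clock within its 250 W envelope.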

Page 40: TESLA PLATFORM FOR CLOUD PROVIDERS

Page 41: CLOUD GPU DEMAND OUTSTRIPS SUPPLY

AWS launches P2 instance (Q3 2016): "P2 instance is one of the fastest growing instance in AWS history." - Andrew Jassy, AWS CEO, re:Invent 2016

Azure launches N-Series preview (Q4 2016): "We've had thousands of customers participate in the N-Series preview since we launched it back in August." - Corey Sanders, Director of Compute, Azure

Page 42: GLOBAL CSP OFFERINGS

Compute:
- AWS: P3, up to 8X V100 SXM2 (N. Virginia, Oregon, Ireland, and Tokyo only), https://aws.amazon.com/ec2/instance-types/p3/; P2, up to 8X K80 physical cards, https://aws.amazon.com/ec2/instance-types/p2/
- Google: GPU server, up to 4X K80; up to 4X P100 PCIe (public beta available), https://cloud.google.com/gpu/
- IBM: GPU server, up to 2X K80 or 1X P100 PCIe (bare-metal), https://www.ibm.com/cloud-computing/bluemix/gpu-computing
- Azure: NC series, up to 2X K80; NCv2 & ND series, up to 4X P100 PCIe / 4X P40 (US West 2 region only), https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series
- Oracle: X7 shape, up to 2X P100 (bare-metal and VM; Ashburn region only, Frankfurt to come in Jan 2018), https://cloud.oracle.com/infrastructure/compute

Virtual workstation:
- AWS: G3, M60
- Google: GPU server, P100 PCIe vWS (private alpha available; public beta Jan '18)
- IBM: GPU server, up to 2X M60 or 2X M10, https://www.ibm.com/cloud-computing/bluemix/gpu-computing
- Azure: GPU server, M60, https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series

Virtual PC:
- AWS: GPU server, up to 4X K520 physical cards
- IBM: GPU server, M60 or M10; VMware Horizon Air vPC launch in Jan, https://www.ibm.com/cloud-computing/bluemix/gpu-computing

Page 43: NVIDIA GPU CLOUD - AI and HPC Everywhere, For Everyone

NVIDIA GPU Cloud integrates GPU-optimized deep learning frameworks, HPC apps, runtimes, libraries, and OS into a ready-to-run container, available at no charge.

- Innovate in minutes, not weeks: removes all the DIY complexity of DL and HPC software integration
- Cross platform: containers run locally on DGX systems and TITAN PCs, or on cloud service provider GPU instances
- Always up to date: monthly updates by NVIDIA to ensure maximum performance

Page 44: NVIDIA GPU CLOUD - SIMPLIFYING AI & HPC

Deep learning · HPC apps · HPC visualization

Page 45: NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS

A comprehensive catalog of deep learning software: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), DIGITS, MXNet, PyTorch, TensorFlow, Theano, Torch, and CUDA (base-level container for developers).

New: NVIDIA TensorRT inference accelerator with ONNX support.

Page 46: HPC APPS COMING TO NVIDIA GPU CLOUD

Page 47: NVIDIA GPU CLOUD FOR HPC VISUALIZATION - Unified Visualization for Large Data Sets

- ParaView with NVIDIA IndeX: large-scale volumetric rendering
- ParaView with NVIDIA OptiX: physically accurate ray tracing, production-quality images
- ParaView with NVIDIA Holodeck: seamless integration with ParaView

Early access now: sign up at nvidia.com/gpu-cloud

Page 48: TESLA PLATFORM FOR DEVELOPERS

Page 49:

Page 50: HOW GPU ACCELERATION WORKS

Application code is split between the two processors: the compute-intensive functions (roughly 5% of the code) run on the GPU, while the rest of the sequential code runs on the CPU.
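How much that 5% of code is worth depends on how much runtime it consumes; Amdahl's law bounds the overall speedup. A sketch (the 5%-of-code figure is from the slide; the 80%-of-runtime share and 20x GPU factor are illustrative assumptions):

```python
def amdahl_speedup(accelerated_fraction, s):
    """Overall speedup when a fraction of runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / s)

# Suppose the compute-intensive 5% of the code accounts for 80% of runtime
# and the GPU runs it 20x faster:
print(round(amdahl_speedup(0.80, 20), 2))   # 4.17x overall

# Even with an infinitely fast GPU, the remaining serial 20% caps gains at 5x:
print(round(amdahl_speedup(0.80, 1e9), 2))  # ~5.0x
```

This is why the strong-node argument earlier in the deck matters: the serial remainder, not the accelerated kernel, sets the ceiling.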

Page 51: GPU ACCELERATED LIBRARIES - "Drop-in" Acceleration for Your Applications

- Deep learning: cuDNN, TensorRT, DeepStream SDK
- Linear algebra: cuBLAS, cuSPARSE, cuSOLVER
- Parallel algorithms: nvGRAPH, NCCL, cuRAND
- Signal, image & video: cuFFT, NVIDIA NPP, Video Codec SDK
- Plus the CUDA math library

Page 52: CUDA TOOLKIT 9

Unleashes the power of Volta, optimized for:
- Tensor Cores
- Second-generation NVLink
- HBM2 stacked memory

Cooperative thread groups: flexible thread groups and efficient parallel algorithms; synchronize across thread blocks in a single GPU or across multiple GPUs.

Faster libraries:
- GEMM optimizations for RNNs (cuBLAS)
- >20x faster image processing (NPP)
- FFT optimizations across various sizes (cuFFT)

Developer tools & platform updates: 1.3x faster compiling, new OS and compiler support, unified memory profiling, NVLink visualization.

Page 53: WHAT IS OPENACC

OpenACC is a directives-based programming approach to parallel computing, designed for performance and portability on CPUs and accelerators for HPC (OpenPOWER, Sunway, x86 CPU & Xeon Phi, NVIDIA GPU, PEZY-SC).

Add a simple compiler directive:

main() {
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}

Read more at www.openacc.org

Page 54: OPENACC: EASY ONBOARD TO GPU COMPUTING - A Widely Adopted Directives Model for Parallel Programming

Simple. Powerful. Portable. Targets include POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD, and PEZY-SC.

AWE Hydrodynamics CloverLeaf mini-app (bm32 data set), speedup vs. a single Haswell core: PGI OpenACC matches Intel/IBM OpenMP on multicore hardware (about 10x on Broadwell, 11x on POWER8), while the same directives reach far higher on GPUs. [Chart: speedups of 77x, 120x, and 158x, topping out on Volta V100]

Adopted by key HPC codes:
- 5 CAAR codes: GTC, XGC, ACME, FLASH, LSDalton
- 3 of the top 5 HPC apps: ANSYS Fluent, VASP, Gaussian
- 2017 Gordon Bell finalist: CAM-SE on TaihuLight

Page 55:

- LSDalton (quantum chemistry): 12X speedup in 1 week
- Numeca (CFD): 10X faster kernels, 2X faster app
- PowerGrid (medical imaging): 40 days to 2 hours
- INCOMP3D (CFD): 3X speedup
- NekCEM (computational electromagnetics): 2.5X speedup, 60% less energy
- COSMO (climate/weather): 40X speedup, 3X energy efficiency
- CloverLeaf (CFD): 4X speedup with a single CPU/GPU code
- MAESTRO & CASTRO (astrophysics): 4.4X speedup with 4 weeks of effort

Page 56: OPENACC RESOURCES

Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

- Resources: https://www.openacc.org/resources
- Success stories: https://www.openacc.org/success-stories
- Events: https://www.openacc.org/events
- Compilers and tools (free compilers available): https://www.openacc.org/tools

Page 57: NVIDIA DEEP LEARNING SDK - High Performance GPU-Acceleration for Deep Learning

Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications:
- High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
- Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
- Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs

"We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver." - Frédéric Bastien, Team Lead (Theano), MILA

developer.nvidia.com/deep-learning-software

Page 58: NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)

Multi-GPU and multi-node collective communication primitives. developer.nvidia.com/nccl

- High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
- Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
- Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
- Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more

Multi-GPU: NVLink, PCIe. Multi-node: InfiniBand verbs, IP sockets. Automatic topology detection.

Near-linear multi-node scaling with NCCL 2: Microsoft Cognitive Toolkit scaling performance (images/sec), NVIDIA DGX-1 + cuDNN 6 (FP32), ResNet-50, batch size 64. [Chart: throughput rising 216.9 -> 843.5 -> 1,684.8 -> 3,281.1 -> 6,569.6 images/sec as the GPU count scales toward 32]
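The primitive behind that scaling curve is all-reduce: every GPU ends up with the sum of every GPU's gradients. The ring algorithm commonly used for it can be illustrated in plain Python (a toy model of the communication pattern, not the NCCL API; real NCCL chunks the buffer and runs reduce-scatter plus all-gather phases):

```python
def ring_allreduce(values):
    """Toy ring all-reduce on one scalar per rank: each step, every rank
    receives its left neighbor's running sum and adds its own contribution.
    After n-1 steps, all ranks hold the global sum."""
    n = len(values)
    acc = list(values)
    for _ in range(n - 1):
        acc = [acc[(i - 1) % n] + values[i] for i in range(n)]
    return acc

# Four "GPUs" each contribute a gradient value; all end with the same sum
print(ring_allreduce([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]
```

Because each rank only ever talks to its ring neighbors, per-rank traffic stays constant as ranks are added, which is why the measured scaling above stays near-linear.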

Page 59: NVIDIA DIGITS - Interactive Deep Learning GPU Training System

developer.nvidia.com/digits

Interactive deep learning training application for engineers and data scientists:
- Simplify deep neural network training with an interactive interface to train, validate, and visualize results
- Built-in workflows for image classification, object detection, and image segmentation
- Improve model accuracy with pre-trained models from the DIGITS Model Store
- Faster time to solution with multi-GPU acceleration

Page 60: NVIDIA cuDNN - Deep Learning Primitives

developer.nvidia.com/cudnn

High-performance building blocks for deep learning frameworks:
- Drop-in acceleration for widely used deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow, Theano, and others
- Accelerates industry-vetted deep learning algorithms such as convolutions, LSTM RNNs, fully connected, and pooling layers
- Fast deep learning training performance tuned for NVIDIA GPUs

"NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time." - Evan Shelhamer, Lead Caffe Developer, UC Berkeley

Deep learning training performance [chart: images/sec, 0 to 12,000]: 8x K80 (cuDNN 2) -> 8x Maxwell (cuDNN 4) -> DGX-1 (cuDNN 6, NCCL 1.6) -> DGX-1V (cuDNN 7, NCCL 2)
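The convolution primitive cuDNN accelerates can be written naively in a few lines of NumPy, a reference sketch of the math (orders of magnitude slower than the tuned GPU kernels, but it defines what they compute):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive valid-mode 2-D cross-correlation: the core op behind conv layers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output element is the dot product of a patch with the kernel
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))          # a 2x2 box filter
print(conv2d(img, k))        # each output = sum of a 2x2 patch
```

Libraries replace these loops with GEMM-based, FFT-based, or Winograd kernels selected per layer shape, which is where the generational speedups in the chart come from.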

Page 61: NVIDIA TensorRT 3 - Programmable Inference Accelerator

Compiler for optimized neural networks: a trained neural network goes into TensorRT and comes out as a compiled & optimized network.

Optimizations applied:
- Weight & activation precision calibration
- Layer & tensor fusion
- Kernel auto-tuning
- Dynamic tensor memory
- Multi-stream execution
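Of those optimizations, precision calibration is the easiest to sketch: mapping FP32 activations onto INT8 requires choosing a scale. A toy symmetric max-calibration illustrates the idea (TensorRT's actual calibrator instead minimizes information loss over a calibration data set):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric max-calibration: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.array([0.05, -1.2, 0.7, 3.1, -2.4], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = q.astype(np.float32) * scale   # dequantize to check the error

# Round-trip error is bounded by half a quantization step
assert np.all(np.abs(x - x_hat) <= scale / 2 + 1e-6)
```

Running the bulk of inference in INT8 (or FP16) while keeping errors within such bounds is what makes the V100/P4 throughput figures earlier in the deck possible.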

Page 62: