Post on 31-Jul-2020
TRANSCRIPT
Timothy Lanfear
ENDURING DIFFERENTIATION
2
WHERE ARE WE?
3
LIFE AFTER DENNARD SCALING
[Chart: 40 Years of Microprocessor Trend Data, 1980–2020. Transistor counts (thousands) keep climbing while single-threaded performance slows from 1.5X per year to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010–2015 collected by K. Rupp.]
4
GPU-ACCELERATED PERFORMANCE
[Chart: AMBER performance (ns/day) on a single GPU server, rising each generation: K20 in 2013 (AMBER 12, CUDA 4), K40 in 2014 (AMBER 14, CUDA 5), K80 in 2015 (AMBER 14, CUDA 6), P100 in 2016 (AMBER 16, CUDA 8); 0–40 ns/day scale.]
[Chart: GoogleNet training performance (images/s), rising from 8x K80 in 2014 (cuDNN 2, CUDA 6) through 8x Maxwell in 2015 (cuDNN 4, CUDA 7) and DGX-1 in 2016 (cuDNN 6, CUDA 8, NCCL 1.6) to DGX-1V in 2017 (cuDNN 7, CUDA 9, NCCL 2); 0–12,000 i/s scale.]
5
TESLA PLATFORM ADVANTAGE
Delivered value grows over time
Life Sciences (AMBER): 10X perf in 8 years. Oil & Gas (RTM): 6X perf in 6 years.
TESLA VALUE: $15-20M COST SAVINGS DELIVERED
AMBER performance: nanoseconds per day delivered on a single server with GPUs and CPUs
6
GPU-ACCELERATED EFFICIENCY
On Track To Meet Exascale Goal
[Chart: GFLOPS per watt of top GPU systems in Green500 lists, with NVIDIA projections for V100: Eurotech Aurora (K20) 3.2, TSUBAME-KFC (K20X) 4.4, TiTech W780I (K80) 5.3, SaturnV (P100) 9.5, TSUBAME 3.0 (P100) 14.1, and V100 projected toward the 35 GF/W exascale goal.]
13/13 Greenest Supercomputers Powered by Tesla P100
TSUBAME 3.0
Kukai
AIST AI Cloud
RAIDEN GPU subsystem
Piz Daint
Wilkes-2
GOSAT-2 (RCF2)
DGX Saturn V
Reedbush-H
JADE
Facebook Cluster
Cedar
DAVIDE
7
HOW ARE WE DOING THIS?
• What are the most important dimensions of our differentiation?
• Why are GPUs so much more efficient than CPUs?
• How can we continue scaling performance/efficiency as Moore’s Law fades?
• Why can’t competitors replicate GPU efficiency, performance, scaling, etc., with
lots of weak CPU cores? (e.g., Intel KNC/KNL/KNM)
• How is optimizing GPUs for AI affecting their suitability for HPC?
And, is our differentiation sustainable?
8
ENERGY EFFICIENCY
9
[Figure: energy cost of operations on a 28 nm IC with a 20 mm die. A 64-bit DP operation costs 20 pJ; a 256-bit access to an 8 kB SRAM costs 50 pJ; moving 256 bits on-chip costs 26 pJ to 256 pJ depending on distance, up to 1000 pJ across the full 20 mm die; an efficient off-chip link costs 500 pJ per 256 bits; a DRAM read/write costs 16000 pJ.]
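A quick sanity check on the figure's numbers shows why communication, not arithmetic, dominates the energy budget (a sketch; the per-operation energies are taken directly from the figure above):

```python
# Energy per operation on a 28 nm IC, in picojoules (figures from the slide).
E_DP_FLOP = 20      # one 64-bit double-precision operation
E_SRAM_256B = 50    # 256-bit access to an 8 kB SRAM
E_WIRE_20MM = 1000  # moving 256 bits across the 20 mm die
E_OFFCHIP = 500     # 256 bits over an efficient off-chip link
E_DRAM = 16000      # 256-bit DRAM read/write

# Equivalent number of DP operations per data movement:
print(E_DRAM / E_DP_FLOP)       # 800.0: one DRAM access costs 800 flops
print(E_WIRE_20MM / E_DP_FLOP)  # 50.0: crossing the die costs 50 flops
print(E_SRAM_256B / E_DP_FLOP)  # 2.5: a local SRAM access is comparatively cheap
```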
COMPUTATION VERSUS COMMUNICATIONS
10
CPU: 126 pJ/flop (SP)
Optimized for latency; deep cache hierarchy (Broadwell E5 v4, 14 nm)
GPU: 28 pJ/flop (SP)
Optimized for throughput; explicit management of on-chip memory (Pascal, 16 nm)
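Energy per flop translates directly into an efficiency ceiling: one watt is 10^12 pJ per second, so sustainable GFLOPS per watt is simply the reciprocal of the pJ/flop figure. A small sketch using the two numbers above:

```python
def gflops_per_watt(pj_per_flop):
    # 1 W = 1e12 pJ/s, so flop/s per watt = 1e12 / (pJ per flop).
    return (1e12 / pj_per_flop) / 1e9

cpu_sp = gflops_per_watt(126)  # Broadwell E5 v4, single precision
gpu_sp = gflops_per_watt(28)   # Pascal, single precision
print(round(cpu_sp, 1))  # 7.9 GFLOPS/W
print(round(gpu_sp, 1))  # 35.7 GFLOPS/W, a ~4.5x efficiency advantage
```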
11
HOW IS POWER SPENT IN A CPU?
Out-of-order, high-performance core (Alpha 21264; Natarajan [2003]):
clock + pins 45%, register file 14%, fetch 11%, rename 10%, data supply 5%, ALU 4%
In-order, embedded core (Dally [2008], embedded in-order CPU):
instruction supply 42%, clock + control logic 24%, data supply 17%, issue 11%, register file 11%, ALU 6%
12
Overhead: 985 pJ; payload arithmetic: 15 pJ
Source: Azizi [PhD 2010]
13
SIMPLER CORES = ENERGY EFFICIENCY
14
Overhead: 15 pJ; payload arithmetic: 15 pJ
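The two breakdowns above reduce to a single figure of merit: what fraction of each instruction's energy goes into useful arithmetic. A minimal sketch with the slide's numbers:

```python
def payload_fraction(overhead_pj, payload_pj):
    # Fraction of per-instruction energy spent on actual arithmetic.
    return payload_pj / (overhead_pj + payload_pj)

# Out-of-order CPU: 985 pJ of overhead per 15 pJ of arithmetic.
print(payload_fraction(985, 15))  # 0.015, i.e. ~1.5% efficient
# Simpler throughput core: overhead comparable to the payload.
print(payload_fraction(15, 15))   # 0.5, i.e. 50% efficient
```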
15
THROUGHPUT PROCESSORS
16
RISE OF LEAKAGE
17
FREQUENCY VS. LEAKAGE
Source: Gordon Moore, Intel; IEEE
19
OPTIMIZED FOR DATACENTER EFFICIENCY
40% More Performance in a Rack
V100 Max Performance (computer vision): 13 kW rack, 4 nodes of 8x V100, 13 ResNet-50 networks trained per day
V100 Max Efficiency (computer vision): 13 kW rack, 7 nodes of 8x V100, 18 ResNet-50 networks trained per day
ResNet-50 training; Max Efficiency run with V100 @ 160 W | V100 performance measured on pre-production hardware.
80% Perf at Half the Power
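The rack arithmetic behind these two claims can be checked from the node counts and throughputs quoted above (a sketch using only the slide's figures):

```python
# 13 kW rack configurations from the slide.
maxperf_nodes, maxperf_nets = 4, 13  # 8x V100 nodes at full power
maxeff_nodes, maxeff_nets = 7, 18    # 8x V100 nodes with V100 @ 160 W

# "40% more performance in a rack": whole-rack throughput ratio.
rack_gain = maxeff_nets / maxperf_nets
# "80% perf at half the power": per-node throughput ratio.
per_node = (maxeff_nets / maxeff_nodes) / (maxperf_nets / maxperf_nodes)
print(round(rack_gain, 2))  # 1.38
print(round(per_node, 2))   # 0.79
```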
20
SP ENERGY EFFICIENCY @ 28 NM
[Chart: single-precision GFLOPS per watt rising across the Fermi, Kepler, and Maxwell generations; 0–25 GFLOPS/W scale.]
21
HETEROGENEOUS COMPUTING
22
OPTIMIZING SERIAL/PARALLEL EXECUTION
Application code splits across both processors: parallel work, the majority of operations, runs on the GPU; serial work, the system and sequential operations, runs on the CPU.
23
TWO TYPES OF ACCELERATORS
Heterogeneous computing model: complementary processors work together, a CPU optimized for serial tasks plus a GPU accelerator optimized for parallel tasks.
Many-Weak-Cores (MWC) model: a single CPU core type handles both serial and parallel work, as in Xeon Phi (and others) with many weak serial cores.
24
EXTENSIBILITY
25
NVLINK: A MEMORY FABRIC, NOT A NETWORK
DGX-1: 8 NVLink-Connected GPUs
26
LATENCY HIDING FOR LOAD/STORE/ATOMICS
Where are the NICs? There are no NICs.
27
STRONG SCALING
[Chart: AlexNet training speed-up vs. one Knights Landing server, for 1–128 KNL servers against 1x DGX-1 (0x–40x scale). One GPU-accelerated server, the NVIDIA DGX-1, is faster than 128 Knights Landing servers.]
28
[Chart: LAMMPS (molecular dynamics) speed-up vs. one KNL server, for up to 32 KNL nodes. A single GPU-accelerated server wins: 2x P100 is 13x faster, 4x P100 is 21x faster, and 8x P100 is 35x faster; an 8x Tesla P100 PCIe server beats 32 KNL servers.]
[Chart: GTC-P (plasma turbulence) speed-up vs. one KNL server, for up to 64 KNL nodes. The 2x and 4x P100 configurations scale up to 8x P100 at 8x faster; an 8x Tesla P100 PCIe server beats 64 KNL servers.]
STRONG SCALING
29
[Chart: LQCD (high-energy physics) Gflop/s, SATURNV DGX servers vs. the SuperMUC supercomputer (0–144 CPU nodes): 1x DGX-1 delivers 8K Gflop/s, 2x DGX-1 15K Gflop/s, and 4x DGX-1 20K Gflop/s, while the CPU nodes reach 2K, 3K, 5K, and 7K Gflop/s. QUDA version 0.9beta using double-half mixed precision; DDalphaAMG using double-single.]
CPU servers needed to match the performance of SATURNV: 2,300 for S3D (discovering new fuel for engines); 3,800 for SPECFEM3D (simulating earthquakes).
STRONG SCALING
30
NEW TENSOR CORE
New CUDA TensorOp instructions and data formats
4×4 matrix processing array: D_FP32 = A_FP16 × B_FP16 + C_FP32
Optimized for deep learning
Activation inputs, weight inputs, output results
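A minimal pure-Python sketch of the TensorOp semantics above (not hardware code): FP16 rounding of the A and B inputs is emulated with the standard library's half-precision `struct` format, and Python floats stand in for the FP32 accumulator.

```python
import struct

def to_fp16(x):
    # Round a float to IEEE 754 half precision and back,
    # emulating FP16 storage of the A and B operands.
    return struct.unpack('e', struct.pack('e', x))[0]

def tensor_op(A, B, C):
    # One 4x4 step of D = A_FP16 x B_FP16 + C_FP32:
    # products of rounded FP16 inputs, accumulated at higher precision.
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]  # FP32 accumulator input
            for k in range(n):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

A = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # activation inputs
B = [[0.5] * 4 for _ in range(4)]             # weight inputs
C = [[1.0] * 4 for _ in range(4)]             # accumulator inputs
print(tensor_op(A, B, C)[0][0])  # 6.0: 1 + 0.5*(1+2+3+4)
```

On real hardware this is one warp-level instruction; the point of the sketch is only the data flow: inputs rounded to FP16, sums kept at full precision.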
31
TESLA PLATFORM
32
TESLA IS A PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI
TESLA GPU & SYSTEMS
NVIDIA SDK
INDUSTRY FRAMEWORKS & TOOLS
APPLICATIONS
ECOSYSTEM TOOLS
Applications: HPC (+450 applications), frameworks, internet services
Enterprise applications: manufacturing, automotive, healthcare, finance, retail, defense, …
NVIDIA SDK: Deep Learning SDK and ComputeWorks, including cuDNN, cuBLAS, cuSPARSE, NCCL, TensorRT, and the DeepStream SDK
Tesla GPU and systems: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, delivered via system OEMs and cloud
33
MULTIPLE GROWTH MARKETS
Growth markets for the Tesla platform: HPC, AI training, AI inference, video transcoding, and desktop virtualization (10M users; 40 years of video uploaded per day)
34
CONCLUSION
35
PASCAL TO VOLTA
• Area: ~600 mm2 → ~800 mm2 (~33% more area)
• Process: small improvement from Pascal to Volta (a few percent)
• Clocks: similar dynamic range, power limited
• Memory BW (sustained): 50% improvement
• Communications (NVLink): 160 GB/s → 300 GB/s (almost double!)
• AI (Tensor Cores): ~20 TFLOPS → 120 TFLOPS (~6x!)
Architecture with Technology
36
REVOLUTIONARY PERFORMANCE FOR HPC AND AI
Single Platform for Data Science and Computational Science
3X AI Performance in 1 Year: LSTM (neural machine translation) training time falls from 15 days on 2x CPU to 18 hours on 1x P100 and 6 hours on 1x V100.
1.5X HPC Performance in 1 Year: V100 performance normalized to P100 is 1.8x for Molecular Dynamics (VMD), 1.5x for Physics (QUDA), 1.6x for Seismic (RTM), and 1.5x for STREAM.
NMT training for 13 epochs | German -> English, WMT15 subset | System config: CPU = 2x Xeon E5-2699 v4 w/ P100s or V100s
QUDA, RTM, STREAM | System config: 2x Xeon E5-2690 v4, 2.6 GHz, w/ 1x Tesla P100 or V100
VMD | System config: Xeon E5-2698 v3 w/ 1x Tesla P100 and E5-2697A v4 w/ 1x Tesla V100 | V100 measured on pre-production hardware
37
GPU PERFORMANCE COMPARISON

                         P100           V100            Ratio
Training acceleration    10 TFLOPS      120 TFLOPS      12x
Inference acceleration   21 TFLOPS      120 TFLOPS      6x
FP64/FP32                5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 bandwidth           720 GB/s       900 GB/s        1.2x
NVLink bandwidth         160 GB/s       300 GB/s        1.9x
L2 cache                 4 MB           6 MB            1.5x
L1 caches                1.3 MB         10 MB           7.7x
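Given the absolute P100 and V100 numbers in the table, the ratio column can be reproduced mechanically; a quick sketch (row labels paraphrased from the table):

```python
# (P100, V100) pairs from the comparison table.
specs = {
    "Training acceleration (TFLOPS)": (10, 120),
    "Inference acceleration (TFLOPS)": (21, 120),
    "FP64 (TFLOPS)": (5, 7.5),
    "HBM2 bandwidth (GB/s)": (720, 900),
    "NVLink bandwidth (GB/s)": (160, 300),
    "L2 cache (MB)": (4, 6),
    "L1 caches (MB)": (1.3, 10),
}
for name, (p100, v100) in specs.items():
    print(f"{name}: {v100 / p100:.1f}x")  # e.g. NVLink: 300/160 = 1.9x
```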
38
GPU TRAJECTORY
[Chart: the microprocessor trend data, 1980–2020, extended with GPU-computing performance growing at 1.5X per year, versus 1.1X per year for single-threaded performance, and projecting 1000X by 2025. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010–2015 collected by K. Rupp.]
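The 1000X-by-2025 projection is simply compound growth at the chart's 1.5X-per-year rate; a quick check:

```python
import math

rate = 1.5  # GPU-computing performance growth per year (from the chart)
# Years of 1.5X/year growth needed for a cumulative 1000X:
years = math.log(1000) / math.log(rate)
print(round(years, 1))    # 17.0: about 17 years
print(round(rate ** 17))  # 985: i.e. roughly 1000X over that span
```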
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE