Turing Architecture and CUDA 10 New Features · 2018. 11. 26.

Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features


TRANSCRIPT

Page 1: Turing Architecture and CUDA 10 New Features · 2018. 11. 26. · Turing Architecture and CUDA 10 New Features. Turing Architecture RT Core Turing MPS Inference Accelerated, Graphics

Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10

New Features

Page 2

Turing Architecture

Inference Accelerated, Graphics Reinvented, Volta’s Programmability

● New SM Architecture

● Multi-Precision Tensor Core

● Turing MPS

● RT Core

Page 3

ANNOUNCING TESLA T4: WORLD’S MOST ADVANCED INFERENCE GPU

Universal Inference Acceleration

320 Turing Tensor cores

2,560 CUDA cores

65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS

16GB | 320GB/s

Page 4

Turing TU102 SM

Per-SM resource | TU102
INT32 | 64
FP32 | 64
Tensor Cores | 8
RT Core | 1
Register File | 256 KB
L1 and shmem | 96 KB
Max threads | 1024
Compute Capability | 7.5 (sm_75)

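Page 5

Tensor Core

$$
\underbrace{D}_{\text{FP32 (FP16)}} =
\underbrace{\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}}_{A:\ \text{FP16}}
\underbrace{\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}}_{B:\ \text{FP16}}
+
\underbrace{\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}}_{C:\ \text{FP32 (FP16)}}
$$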

Page 6

Multi-Precision Tensor Core

Input Precision | Output | Volta (Ops/Cycle/SM) | Turing (Ops/Cycle/SM)
FP16 | FP16 or FP32 | 1024 | 1024
INT8 | INT32 | NA | 2048
INT4 | INT32 | NA | 4096
BOOL | INT32 | NA | 16384
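
The per-SM rates in this table can be related to the Tesla T4 headline figures with simple arithmetic. The sketch below is only an estimate under stated assumptions: 40 SMs follows from the earlier slides (320 Tensor Cores at 8 per SM), but the 1.59 GHz boost clock is an assumption that does not appear in the slides, and `peak_tera_ops` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdio>

// Peak throughput in tera-operations per second:
// SM count x (ops per cycle per SM) x (clock in GHz) / 1000.
// Hypothetical helper for illustration only.
double peak_tera_ops(int sm_count, int ops_per_cycle_per_sm, double clock_ghz) {
    assert(sm_count > 0 && ops_per_cycle_per_sm > 0 && clock_ghz > 0.0);
    return sm_count * static_cast<double>(ops_per_cycle_per_sm) * clock_ghz / 1000.0;
}

// Prints estimated Tesla T4 peaks for the three precisions quoted on the T4 slide.
void print_t4_estimates() {
    const int sm_count = 320 / 8;   // 320 Tensor Cores, 8 per SM (from the slides)
    const double clock_ghz = 1.59;  // ASSUMPTION: boost clock, not stated in the slides
    std::printf("FP16: %.1f TFLOPS\n", peak_tera_ops(sm_count, 1024, clock_ghz));
    std::printf("INT8: %.1f TOPS\n", peak_tera_ops(sm_count, 2048, clock_ghz));
    std::printf("INT4: %.1f TOPS\n", peak_tera_ops(sm_count, 4096, clock_ghz));
}
```

Under these assumptions the estimates land close to the quoted 65 FP16 TFLOPS, 130 INT8 TOPS, and 260 INT4 TOPS, each precision step doubling the rate.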


Page 8

Multi-Process Service (MPS)

Turing MPS: Inherits Volta’s enhanced MPS architecture

• Reduced launch latency

• Improved launch throughput

• Improved quality of service with scheduler partitioning

[Figure: TURING MULTI-PROCESS SERVICE. Labels: CPU Processes A, B, C; CUDA MULTI-PROCESS SERVICE CONTROL; Hardware Accelerated Work Submission; Hardware Isolation; GPU Execution of A, B, C on Turing]

Page 9

RT Core

Accelerate Ray Tracing with Turing RT Cores

● Bounding Volume Hierarchy (BVH) traversal

● Ray/Triangle Intersection

● 10+ Giga Rays/Sec

Available in NVIDIA OptiX

● Single-ray shader programming model using C++

● AI Accelerated rendering

● Free for Commercial-Use

Page 10

CUDA 10 Key Features

TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, …

CUDA PLATFORM: CUDA Graphs, Warp Matrix, …

LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling

DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute

[Image label: Scientific Computing]

Page 11

Asynchronous Task Graphs

Enable Execution Optimization when Workflow is Known Up-Front

● DL Inference

● Loop & Function Offload

● Deep Neural Network Training

● HPC Simulation

● Linear Algebra

Page 12

CUDA Graphs

New Model for Submitting CUDA Work

Any CUDA stream can be mapped to a graph

[Figure: two panels. "CUDA Work in Streams" shows kernels A, B, C, D, E, X, and Y issued into streams with Wait dependencies between them. "Graph of Dependencies" shows the same work as graph nodes A, B, X, C, D, E, and Y leading to End.]

Page 13

Definition of A CUDA Graph

Sequence of operations, connected by dependencies.

Operations (Nodes) are one of:

Kernel Launch: CUDA kernel running on GPU

CPU Function Call: Callback function on CPU

Memcpy/Memset: Data management

Sub-Graph: Graphs are hierarchical

[Figure: example graph with nodes A, B, X, C, D, E, Y, and End]
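
To make "operations connected by dependencies" concrete, here is a minimal host-only sketch in plain C++. It does not use the CUDA runtime; the `Node` type and `execution_order` function are hypothetical names chosen for illustration. It derives one valid execution order with Kahn's topological sort, which is the kind of ordering information a graph carries up front.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// A node is an operation name plus the indices of the nodes it depends on.
struct Node {
    std::string name;
    std::vector<std::size_t> deps;
};

// Returns one valid execution order using Kahn's algorithm.
// The result is shorter than graph.size() if the dependencies contain a cycle.
std::vector<std::string> execution_order(const std::vector<Node>& graph) {
    std::vector<std::size_t> pending(graph.size());
    std::vector<std::vector<std::size_t>> dependents(graph.size());
    for (std::size_t i = 0; i < graph.size(); ++i) {
        pending[i] = graph[i].deps.size();
        for (std::size_t d : graph[i].deps) {
            assert(d < graph.size());
            dependents[d].push_back(i);
        }
    }

    std::queue<std::size_t> ready;  // nodes whose dependencies are all satisfied
    for (std::size_t i = 0; i < graph.size(); ++i) {
        if (pending[i] == 0) ready.push(i);
    }

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::size_t n = ready.front();
        ready.pop();
        order.push_back(graph[n].name);
        for (std::size_t m : dependents[n]) {
            if (--pending[m] == 0) ready.push(m);
        }
    }
    return order;
}
```

Nodes that sit in the ready queue at the same time have no ordering between them, which is where a graph exposes concurrency that a single stream would serialize.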

Page 14

Repeated Graph Execution

Generated Once, Launched Repeatedly

for (int i = 0; i < 1000; i++) {
    launch_graph(G);  // pseudocode; the runtime call used later in this deck is cudaGraphLaunch(instance, stream)
}

[Figure: example graph G with nodes A, B, X, C, D, E, Y, and End]

Page 15

Execution Optimization

Launch Latency & Overhead Reduction

● A predefined graph launches any number of kernels in a single operation

● Especially benefits short-running kernels

[Figure: timeline comparison. Without a graph, the CPU issues a separate Launch for each of A, B, C, D, and E, and the GPU runs A B C D E. With a graph, the CPU performs Build Graph and Launch Graph and is then idle (CPU Idle) while the GPU runs A B C D E.]

Page 16

CUDA Stream to A Graph

Construct a graph from normal CUDA stream syntax

// Start by initiating stream capture
// (the stream is passed by value; CUDA 10.1 and later also take a capture mode argument)
cudaStreamBeginCapture(stream1);
A<<< ..., stream1 >>>();
cudaEventRecord(e1, stream1);
B<<< ..., stream1 >>>();
// stream2 joins the capture by waiting on an event recorded in stream1
cudaStreamWaitEvent(stream2, e1, 0);
C<<< ..., stream2 >>>();
cudaEventRecord(e2, stream2);
// stream1 waits for C before launching D
cudaStreamWaitEvent(stream1, e2, 0);
D<<< ..., stream1 >>>();
// Now convert the stream to a graph
cudaStreamEndCapture(stream1, &graph);

[Figure: stream1 holds A, B, Wait, D; stream2 holds Wait, C. The resulting graph has A at the top, B and C depending on A, and D depending on both B and C.]

Page 17

Capture External Work

Stream Capture extends into Library Calls

// Start by initiating stream capture
cudaStreamBeginCapture(stream);
// Captures my kernel launches, recursing into library calls
X<<< ..., stream >>>();
libraryCall(stream); // Launches A, B, C, D
Z<<< ..., stream >>>();
// Now convert the stream to a graph
cudaStreamEndCapture(stream, &graph);

[Figure: the library call contributes a sub-graph (A, then B and C, then D), which is inserted between X and Z. The resultant graph is X, then A, then B and C, then D, then Z.]

Page 18

Construct Graph Explicitly

[Figure: graph from a framework, with A at the top, B and C depending on A, and D depending on both B and C]

// Define graph of work + dependencies
// (argument lists are abbreviated on the slide; in the CUDA 10 runtime API,
// kernel nodes are added with cudaGraphAddKernelNode)
cudaGraphCreate(&graph, 0);
cudaGraphAddNode(graph, kernel_a, {}, ...);
cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);
// Instantiate graph and apply optimizations (CUDA 10 signature)
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
// Launch executable graph 100 times
for (int i = 0; i < 100; i++)
    cudaGraphLaunch(instance, stream);

Page 19

Graph Execution Semantics

Graph Work can be ordered with other work in the stream

void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream) {
    A<<< 256, 256, 0, stream >>>();              // Kernel launch
    cudaGraphLaunch(i1, stream);                 // Graph1 launch
    cudaStreamAddCallback(stream, cpu, NULL, 0); // CPU callback; CPU_Func is assumed to match cudaStreamCallback_t
    cudaGraphLaunch(i2, stream);                 // Graph2 launch
    cudaStreamSynchronize(stream);
}

[Figure: a single stream containing, in order, kernel A, Graph1, the CPU callback, and Graph2]

Page 20

Graph Execution Semantics

Graphs ONLY use the stream for the start/end dependencies

Branches in graph still execute concurrently even though graph is launched into a stream

[Figure: a stream containing kernel A, a graph (nodes A, B, X, C, D, E, Y, End), and the CPU callback]

Page 21

How to Access Tensor Cores

Warp Matrix Multiply-Accumulate (WMMA) API in CUDA C++

● Specialized matrix load, matrix multiply and accumulate, matrix store

● Turing (sm_75) WMMA supports 8-bit integer operations

Supported shapes (D = A x B + C):

WMMA 16x16x16 | D: 16x16 | A: 16x16 | B: 16x16 | C: 16x16
WMMA 32x8x16 | D: 32x8 | A: 32x16 | B: 16x8 | C: 32x8
WMMA 8x32x16 | D: 8x32 | A: 8x16 | B: 16x32 | C: 8x32

Page 22

How Tensor Core is Used

A large matrix multiplication can be divided into a set of 16x16 matrix products, which are assigned to Tensor Cores

[Figure: matrices A, B, and C partitioned into 16 x 16 tiles]
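
The decomposition can be sketched on the CPU in plain C++. This is only an illustration of the tiling idea under simplifying assumptions (square row-major matrices whose size is a multiple of 16, `float` everywhere rather than FP16 inputs); `tile_mma` and `tiled_matmul` are hypothetical names, and each `tile_mma` call stands in for the 16x16x16 multiply-accumulate that a Tensor Core operation would perform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16;

// One 16x16x16 multiply-accumulate: the C tile at (row0, col0) is updated with
// the product of the A tile at (row0, k0) and the B tile at (k0, col0).
void tile_mma(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& c, std::size_t n,
              std::size_t row0, std::size_t col0, std::size_t k0) {
    for (std::size_t i = 0; i < kTile; ++i) {
        for (std::size_t j = 0; j < kTile; ++j) {
            float acc = c[(row0 + i) * n + (col0 + j)];
            for (std::size_t k = 0; k < kTile; ++k) {
                acc += a[(row0 + i) * n + (k0 + k)] * b[(k0 + k) * n + (col0 + j)];
            }
            c[(row0 + i) * n + (col0 + j)] = acc;
        }
    }
}

// Computes C += A * B for n x n row-major matrices as a set of tile products.
void tiled_matmul(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n) {
    assert(n % kTile == 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t row0 = 0; row0 < n; row0 += kTile) {
        for (std::size_t col0 = 0; col0 < n; col0 += kTile) {
            for (std::size_t k0 = 0; k0 < n; k0 += kTile) {
                tile_mma(a, b, c, n, row0, col0, k0);
            }
        }
    }
}
```

Each output tile is independent of the others, so on a GPU the two outer loops would map naturally onto warps, with only the loop over `k0` carrying a dependency through the accumulator.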

Page 23

FP16 WMMA Example

#include <mma.h>
using namespace nvcuda;

__device__ void tensor_op_16_16_16(half *a, half *b, float *c) {
    // Tensor Core input/output fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, …> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, …> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float, …> c_frag;
    // Load input matrices into input fragments
    wmma::load_matrix_sync(a_frag, a, …);
    wmma::load_matrix_sync(b_frag, b, …);
    // Initialize output fragment
    wmma::fill_fragment(c_frag, 0.0f);
    // Tensor Core computation
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // Store output matrix into memory
    wmma::store_matrix_sync(c, c_frag, …);
}

Page 24

Turing INT8 WMMA Example

// 8-bit WMMA fragments use signed char (or unsigned char) elements with an int accumulator
__device__ void tensor_op_16_16_16(signed char *a, signed char *b, int *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, …> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, …> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int, …> c_frag;
    wmma::load_matrix_sync(a_frag, a, …);
    wmma::load_matrix_sync(b_frag, b, …);
    wmma::fill_fragment(c_frag, 0);  // integer zero for the int accumulator
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, …);
}
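
The INT8 example pairs 8-bit inputs with a 32-bit accumulator. A host-side reference in plain C++ shows why, as a sketch rather than anything taken from the slides; `dot_s8_s32` is a hypothetical name. A single product such as 127 x 127 = 16129 already exceeds the 8-bit range, and a 16-term sum of such products reaches 258064.

```cpp
#include <cassert>
#include <cstdint>

// Dot product of two int8 vectors of length n, accumulated in 32 bits.
// Each operand is widened before multiplying so that neither the product
// nor the running sum is truncated to 8 bits.
std::int32_t dot_s8_s32(const std::int8_t* a, const std::int8_t* b, int n) {
    assert(n >= 0);
    std::int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

One element of the 16x16 output tile corresponds to one such length-16 dot product added to the matching accumulator element.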

Page 25

Experimental sub-byte WMMA

Support experimental 4-bit/1-bit Operations with 32-bit output

● Access via special namespace: nvcuda::wmma::experimental

namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}

Page 26

Binary Tensor Core Example

Concept

▪ Train neural networks on lower-precision data: faster compute, lower memory size

▪ Reduce data to a positive / negative sign value, which fits in a single bit (1 = positive, 0 = negative)

▪ 1-bit weight and activation calculations are based only on the sign of the data

Ref: Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, 2016

https://arxiv.org/pdf/1602.02830.pdf
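
With the encoding above (bit 1 for +1, bit 0 for -1), a dot product of sign vectors reduces to an XOR followed by a population count, which mirrors the `bmmaBitOpXOR` and `bmmaAccumulateOpPOPC` operations listed on the previous slide. The plain C++ sketch below is a host-side illustration, not the WMMA API; `binary_dot` and `reference_dot` are hypothetical names, and the final mapping `n - 2 * mismatches` is a step the caller applies to the raw popcount.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Dot product of two n-element {+1, -1} vectors packed one bit per element
// (bit = 1 means +1, bit = 0 means -1), for 1 <= n <= 32.
// XOR marks positions where the signs differ; each such position contributes -1
// and every other position contributes +1, so the result is n - 2 * mismatches.
int binary_dot(std::uint32_t a, std::uint32_t b, int n) {
    assert(n >= 1 && n <= 32);
    const std::uint32_t mask = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    const int mismatches = static_cast<int>(std::bitset<32>((a ^ b) & mask).count());
    return n - 2 * mismatches;
}

// Straightforward reference: expand each bit to +1 or -1 and multiply.
int reference_dot(std::uint32_t a, std::uint32_t b, int n) {
    assert(n >= 1 && n <= 32);
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        const int x = ((a >> i) & 1u) ? 1 : -1;
        const int y = ((b >> i) & 1u) ? 1 : -1;
        sum += x * y;
    }
    return sum;
}
```

The packed form replaces n multiplications with one XOR and one popcount per 32 elements, which is the source of the high BOOL throughput quoted earlier in the deck.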

Page 27

Turing WMMA API Summary

Category | Input Precision | Output | Supported Sizes | Max Ops/Clock/SM
Native Types | half * | half or float | 16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16 | 1024
Native Types | char, unsigned char | integer (int32) | 16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16 | 2048
Experimental | precision::u4 (4-bit unsigned), precision::s4 (4-bit signed) | integer (int32) | 8 x 8 x 32 | 4096
Experimental | precision::b1 (1-bit) | integer (int32) | 8 x 8 x 128 | 16384

* Also available on Volta sm_70. Note: WMMA requires recompilation for Turing sm_75 for peak performance

Page 28

CUTLASS 1.1

Collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM)

● Turing/CUDA 10 optimized: supports 8-bit, 4-bit, and 1-bit integers

● Includes detailed documentation and various examples

● Exhibits performance comparable to cuBLAS

Page 29

CUDA 10 Math Libraries

TURING: Turing optimized GEMMs & GEMM extensions for Tensor Cores; Turing architecture-optimized libraries

PERFORMANCE: Large FFT & 16-GPU Strong Scaling; Symmetric Eigensolver & Cholesky Performance; cuSPARSE Sparse-Dense Matrix Multiply Performance

NEW ALGORITHMS AND APIs: GPU-accelerated hybrid JPEG decoding; FP16 & INT8 GEMMs for TensorRT Inference

COMPATIBILITY & RELEASE CADENCE: Faster & Independent Library Releases; Library and CUDA compatibility with enterprise drivers

[Image labels: DEEP LEARNING, Scientific Computing]

Page 30

cuBLAS 10

Includes Turing-optimized mixed-precision GEMMs with Tensor Cores

Page 31

DL Inference Test on T4

Page 32

cuFFT 10

Strong scaling on multi-GPU systems such as NVIDIA’s DGX

[Chart: cuFFT 9.2 vs. cuFFT 10.0, with a linear trendline for cuFFT 10.0, on 2, 4, 8, and 16 GPUs; vertical axis from 0 to 20000 (axis title not recoverable)]

cuFFT (10.0 and 9.2) using 3D C2C FFT 1024 size on DGX-2

Page 33

cuSolver 10

Improved performance with new implementations for

Cholesky factorization

Symmetric & Generalized Symmetric Eigensolver

QR factorization

Up to 44x Faster on Symmetric Eigensolver (DSYEVD)

[Chart: Time (s), axis 0 to 30, vs. Matrix Size (4096, 8192) for MKL2018, CUDA 9.2, and CUDA 10.0. Data labels recovered from the extraction: 1.1, 15.8, 18.0, 0.9, 3.6; their assignment to series is not recoverable.]

Benchmarks use 2 x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs

Page 34

Nsight Product Family

Nsight Systems: System-wide application algorithm tuning

Nsight Compute: CUDA Kernel Profiling and Debugging

Nsight Graphics: Graphics Shader Profiling and Debugging

Page 35

Nsight Systems

System-wide Performance Analysis

Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more

Locate Optimization Opportunities: CUDA & OpenGL APIs, Unified Memory transfers, User Annotations using NVTX

Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events on laptops, Container support, Minimum user privileges

Page 36

[Screenshot annotations: Processes and threads; CUDA and OpenGL API trace; Multi-GPU; Kernel and memory transfer activities; cuDNN and cuBLAS trace; Thread/core migration; Thread state]

Page 37

Nsight Compute

Next Generation Kernel Profiler

Interactive CUDA API debugging and kernel profiling

Fast Data Collection

Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)

Command Line, Standalone, IDE Integration

Platform Support

OS: Linux (x86, ARM), Windows

GPUs: Pascal, Volta, Turing

[Screenshot annotations: Kernel Profile Comparisons with Baseline; Metric Data; Source Correlation]

Page 38

SEOUL | NOVEMBER 7 - 8, 2018

www.nvidia.com/ko-kr/ai-conference/