Turing Architecture and CUDA 10 New Features · 2018. 11. 26.
Minseok Lee, Developer Technology Engineer, NVIDIA
Turing Architecture and CUDA 10
New Features
Turing Architecture
Inference Accelerated, Graphics Reinvented, Volta’s Programmability
● New SM Architecture
● Multi-Precision Tensor Core
● Turing MPS
● RT Core
ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s
Turing TU102 SM
Per-SM resource | TU102
INT32 | 64
FP32 | 64
Tensor Cores | 8
RT Core | 1
Register File | 256 KB
L1 and shmem | 96 KB
Max threads | 1024
Compute Capability | 7.5 (sm_75)
Tensor Core
\[ D = A \times B + C \]
A and B are 4×4 FP16 matrices; C and D are 4×4 matrices in FP32 (or FP16), with elements A_{i,j}, B_{i,j}, C_{i,j} for 0 ≤ i, j ≤ 3.
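As a rough CPU reference for the D = A × B + C operation on 4×4 matrices, the following sketch may help. It is not how the hardware is programmed; `Mat4` and `mma4x4` are hypothetical names, and `float` stands in for both the FP16 inputs and the FP32 accumulator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// 4x4 row-major matrix. float stands in for FP16 and FP32 alike (assumption).
using Mat4 = std::array<std::array<float, 4>, 4>;

// CPU reference for one Tensor Core step: D = A * B + C on 4x4 matrices.
Mat4 mma4x4(const Mat4& a, const Mat4& b, const Mat4& c) {
    Mat4 d{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float acc = c[i][j];  // start from the accumulator input C
            for (std::size_t k = 0; k < 4; ++k) {
                acc += a[i][k] * b[k][j];  // 4 multiply-adds per output element
            }
            d[i][j] = acc;
        }
    }
    return d;
}
```

Each of the 16 output elements takes 4 multiply-adds, so one such step amounts to 64 fused multiply-add operations.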
Multi-Precision Tensor Core
Input Precision | Output | Volta (Ops/Cycle/SM) | Turing (Ops/Cycle/SM)
FP16 | FP16 or FP32 | 1024 | 1024
INT8 | INT32 | NA | 2048
INT4 | INT32 | NA | 4096
BOOL | INT32 | NA | 16384
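The per-SM rates above can be related to the Tesla T4 headline figures with a little arithmetic. The sketch below makes two assumptions that are not on the slides: that a T4 SM carries 8 Tensor Cores like the TU102 SM listed earlier (giving 320 / 8 = 40 SMs), and that the clock is about 1.59 GHz. The function names are hypothetical.

```cpp
#include <cassert>

// SM count implied by a Tensor Core total and a per-SM Tensor Core count.
int sm_count_from_tensor_cores(int tensor_cores, int tensor_cores_per_sm) {
    assert(tensor_cores_per_sm > 0 && tensor_cores % tensor_cores_per_sm == 0);
    return tensor_cores / tensor_cores_per_sm;
}

// Peak rate in tera-ops per second:
//   SMs * (ops per cycle per SM) * clock in GHz / 1000.
double peak_tera_ops(int sm_count, int ops_per_cycle_per_sm, double clock_ghz) {
    assert(sm_count > 0 && ops_per_cycle_per_sm > 0 && clock_ghz > 0.0);
    return sm_count * static_cast<double>(ops_per_cycle_per_sm) * clock_ghz / 1000.0;
}
```

Under those assumptions, `peak_tera_ops(40, 1024, 1.59)` is about 65, and the INT8 and INT4 rates (2048 and 4096 ops/cycle/SM) come out near 130 and 260, in line with the 65 / 130 / 260 figures quoted for T4.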
Multi-Process Service (MPS)
Turing MPS:
Inherits Volta’s enhanced MPS architecture
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
TURING MULTI-PROCESS SERVICE
• Hardware Accelerated Work Submission
• Hardware Isolation
[Figure: CPU Processes A, B, C pass through CUDA MULTI-PROCESS SERVICE CONTROL to GPU Execution of A, B, C on Turing.]
RT Core
Accelerate Ray Tracing with Turing RT Cores
● Bounding Volume Hierarchy (BVH) traversal
● Ray/Triangle Intersection
● 10+ Giga Rays/Sec
Available in NVIDIA OptiX
● Single-ray shader programming model using C++
● AI Accelerated rendering
● Free for Commercial-Use
CUDA 10 Key Features
TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, …
CUDA PLATFORM: CUDA Graphs, Warp Matrix, …
LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute
Scientific Computing
Asynchronous Task Graphs
Enable Execution Optimization when Workflow is Known Up-Front
● DL Inference
● Loop & Function offload
● Deep Neural Network Training
● HPC Simulation
● Linear Algebra
CUDA Graphs
New Model for Submitting CUDA Work
Any CUDA stream can be mapped to a graph
[Figure: "CUDA Work in Streams" (kernels A, B, C, D, E, X, Y separated by Wait operations) shown next to the equivalent "Graph of Dependencies" over the same nodes, terminating at End.]
Definition of A CUDA Graph
Sequence of operations, connected by dependencies.
Operations (Nodes) are one of:
Kernel Launch | CUDA kernel running on GPU
CPU Function Call | Callback function on CPU
Memcpy/Memset | Data management
Sub-Graph | Graphs are hierarchical
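To make "operations connected by dependencies" concrete, here is a small host-only sketch that models nodes with dependency lists and derives one valid execution order. It is a toy model, not the CUDA graph API: `NodeKind`, `Node`, and `execution_order` are hypothetical names, and the ordering uses a standard topological sort (Kahn's algorithm).

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Node kinds named after the operation types listed on the slide.
enum class NodeKind { KernelLaunch, CpuFunctionCall, Memcpy, Memset, SubGraph };

struct Node {
    std::string name;
    NodeKind kind;
    std::vector<std::size_t> deps;  // indices of nodes that must finish first
};

// Returns one valid execution order, or an empty vector if there is a cycle.
std::vector<std::string> execution_order(const std::vector<Node>& nodes) {
    const std::size_t n = nodes.size();
    std::vector<std::size_t> pending(n, 0);               // unmet dependencies
    std::vector<std::vector<std::size_t>> dependents(n);  // reverse edges
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t d : nodes[i].deps) {
            assert(d < n);
            dependents[d].push_back(i);
            ++pending[i];
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i) {
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::size_t i = ready.front();
        ready.pop();
        order.push_back(nodes[i].name);
        for (std::size_t s : dependents[i]) {
            if (--pending[s] == 0) ready.push(s);  // all inputs done
        }
    }
    if (order.size() != n) order.clear();  // a cycle left some nodes unreached
    return order;
}
```

Nodes whose dependencies are all satisfied at the same time (for example two nodes that both depend only on the first one) have no ordering between them, which is what leaves a runtime free to run them concurrently.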
Repeated Graph Execution
Generated Once, Launched Repeatedly
for(int i=0; i<1000; i++) {
    launch_graph( G );
}
Execution Optimization
Launch Latency & Overhead Reduction
● Predefined graph launches any number of kernels in one single operation
● Especially benefits short-running kernels
[Figure: timeline. Without a graph, a separate Launch precedes each of kernels A, B, C, D, E. With a graph, Build Graph and Launch Graph are followed by A B C D E running back to back while the CPU is idle.]
CUDA Stream to A Graph
Construct a graph from normal CUDA stream syntax
// Start by initiating stream capture
// (CUDA 10.0 signature; CUDA 10.1 and later also take a capture mode)
cudaStreamBeginCapture(stream1);
A<<< ..., stream1 >>>();
cudaEventRecord(e1, stream1);
B<<< ..., stream1 >>>();
cudaStreamWaitEvent(stream2, e1, 0);   // stream2 joins the capture after A
C<<< ..., stream2 >>>();
cudaEventRecord(e2, stream2);
cudaStreamWaitEvent(stream1, e2, 0);   // stream1 waits for C before D
D<<< ..., stream1 >>>();
// Now convert the stream to a graph
cudaStreamEndCapture(stream1, &graph);
[Figure: stream1 runs A, B, Wait, D and stream2 runs Wait, C; in the resulting graph, A feeds B and C, which both feed D.]
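One way to see how the stream calls above turn into graph edges is a toy host-only model of capture. This is only a sketch under simplifying assumptions, not how the CUDA runtime implements capture; `CaptureModel` and its methods are hypothetical names, and streams, events, and kernels are identified by plain strings.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: stream-ordered launches plus event record/wait calls are
// converted into dependency edges between kernels.
class CaptureModel {
public:
    // A launch depends on everything the stream currently depends on:
    // the previous kernel in the stream and any events waited on since.
    void launch(const std::string& kernel, const std::string& stream) {
        std::vector<std::string>& deps = pending_[stream];
        for (const std::string& d : deps) edges_.emplace_back(d, kernel);
        deps.assign(1, kernel);  // later work in this stream follows this kernel
    }
    // An event snapshots what the stream depends on at the time of the record.
    void record(const std::string& event, const std::string& stream) {
        events_[event] = pending_[stream];
    }
    // After a wait, the next launch in the stream also depends on the event.
    void wait(const std::string& stream, const std::string& event) {
        auto it = events_.find(event);
        assert(it != events_.end());  // the event must have been recorded
        std::vector<std::string>& deps = pending_[stream];
        deps.insert(deps.end(), it->second.begin(), it->second.end());
    }
    const std::vector<std::pair<std::string, std::string>>& edges() const {
        return edges_;
    }

private:
    std::map<std::string, std::vector<std::string>> pending_;
    std::map<std::string, std::vector<std::string>> events_;
    std::vector<std::pair<std::string, std::string>> edges_;
};
```

Replaying the slide's sequence (launch A, record e1, launch B in stream1; wait e1, launch C, record e2 in stream2; wait e2, launch D in stream1) yields the edges A→B, A→C, B→D, and C→D, which is the diamond-shaped graph in the figure.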
Capture External Work
Stream Capture extends into Library Calls
// Start by initiating stream capture
// (CUDA 10.0 signature; CUDA 10.1 and later also take a capture mode)
cudaStreamBeginCapture(stream);
// Captures my kernel launches and recurses into library calls
X<<< ..., stream >>>();
libraryCall(stream); // Launches A, B, C, D
Z<<< ..., stream >>>();
// Now convert the stream to a graph
cudaStreamEndCapture(stream, &graph);
[Figure: the library call contributes a graph in which A feeds B and C, which both feed D; inserting it between X and Z gives the resultant graph.]
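Construct Graph Explicitly
[Figure: graph from framework, in which A feeds B and C, which both feed D.]
// Define graph of work + dependencies
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);
cudaGraphNode_t a, b, c, d;
// pA..pD: cudaKernelNodeParams for kernel_a..kernel_d, assumed to be filled in elsewhere
cudaGraphAddKernelNode(&a, graph, nullptr, 0, &pA);
cudaGraphAddKernelNode(&b, graph, &a, 1, &pB);
cudaGraphAddKernelNode(&c, graph, &a, 1, &pC);
cudaGraphNode_t bc[] = { b, c };
cudaGraphAddKernelNode(&d, graph, bc, 2, &pD);
// Instantiate graph and apply optimizations (CUDA 10 signature)
cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);
// Launch executable graph 100 times
for(int i=0; i<100; i++)
    cudaGraphLaunch(instance, stream);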
Graph Execution Semantics
Graph Work can be ordered with other work in the stream
// CPU_Func is assumed to be an alias of cudaStreamCallback_t
void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream) {
    A <<< 256, 256, 0, stream >>>();                 // Kernel launch
    cudaGraphLaunch(i1, stream);                     // Graph1 launch
    cudaStreamAddCallback(stream, cpu, nullptr, 0);  // CPU callback (userData, flags)
    cudaGraphLaunch(i2, stream);                     // Graph2 launch
    cudaStreamSynchronize(stream);
}
[Figure: stream containing kernel A, Graph1, the CPU callback, and Graph2, in that order.]
Graph Execution Semantics
Graphs ONLY use the stream for the start/end dependencies
Branches in graph still execute concurrently even though graph is launched into a stream
[Figure: stream containing kernel A, a graph (nodes A, B, C, D, E, X, Y, End), and the CPU callback.]
How to Access Tensor Cores
Warp Matrix Multiply-Accumulate (WMMA) API in CUDA C++
● Specialized matrix load, matrix multiply and accumulate, matrix store
● Turing (sm_75) WMMA supports 8-bit integer operation
WMMA 16x16x16: D (16x16) = A (16x16) × B (16x16) + C (16x16)
WMMA 32x8x16: D (32x8) = A (32x16) × B (16x8) + C (32x8)
WMMA 8x32x16: D (8x32) = A (8x16) × B (16x32) + C (8x32)
How Tensor Core is Used
A large matrix multiplication can be divided into a set of 16x16 matrix products, which are assigned to Tensor Cores
[Figure: matrices A, B, and C partitioned into 16 x 16 tiles.]
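The decomposition into 16x16 products can be sketched on the CPU as follows. This is an illustration of the tiling idea only, with plain loops in place of Tensor Core instructions; `tiled_gemm` and `naive_gemm` are hypothetical names, and the matrices are assumed square, row-major, and a multiple of 16 in size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16;  // tile edge used on the slide

// C += A * B for n x n row-major matrices, computed tile by tile.
void tiled_gemm(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n) {
    assert(n % kTile == 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t ti = 0; ti < n; ti += kTile) {          // tile row of C
        for (std::size_t tj = 0; tj < n; tj += kTile) {      // tile column of C
            for (std::size_t tk = 0; tk < n; tk += kTile) {  // shared dimension
                // One 16x16x16 product: the unit of work handed to Tensor Cores.
                for (std::size_t i = 0; i < kTile; ++i) {
                    for (std::size_t j = 0; j < kTile; ++j) {
                        float acc = c[(ti + i) * n + (tj + j)];
                        for (std::size_t k = 0; k < kTile; ++k) {
                            acc += a[(ti + i) * n + (tk + k)] *
                                   b[(tk + k) * n + (tj + j)];
                        }
                        c[(ti + i) * n + (tj + j)] = acc;
                    }
                }
            }
        }
    }
}

// Untiled reference with the same result, used for comparison.
void naive_gemm(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Each output tile accumulates n / 16 tile products, and different output tiles are independent of one another, which is what allows them to be distributed across warps and SMs.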
FP16 WMMA Example
__device__ void tensor_op_16_16_16(half *a, half *b, float *c){
    // Tensor Core Input/Output Fragment
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, …> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, …> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float, …> c_frag;
    // Load Input Matrix into Input Fragment
    wmma::load_matrix_sync(a_frag, a, …);
    wmma::load_matrix_sync(b_frag, b, …);
    // Initialize Output Fragment
    wmma::fill_fragment(c_frag, 0.0f);
    // Tensor Core Computation
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // Store Output Matrix into Memory
    wmma::store_matrix_sync(c, c_frag, …);
}
Turing INT8 WMMA Example
// 8-bit signed inputs are declared as signed char in the WMMA API
__device__ void tensor_op_16_16_16(signed char *a, signed char *b, int *c){
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, …> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, …> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int, …> c_frag;
    wmma::load_matrix_sync(a_frag, a, …);
    wmma::load_matrix_sync(b_frag, b, …);
    wmma::fill_fragment(c_frag, 0);   // integer accumulator, so an integer zero
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, …);
}
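The pairing of 8-bit inputs with a 32-bit integer accumulator can be illustrated with a short CPU reference. This is only a sketch of the arithmetic, not the WMMA API; `dot_int8` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dot product of two int8 vectors accumulated in int32, mirroring the
// INT8 input / INT32 output pairing of the integer Tensor Core path.
std::int32_t dot_int8(const std::int8_t* a, const std::int8_t* b, std::size_t k) {
    assert(a != nullptr && b != nullptr);
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < k; ++i) {
        // Widen before multiplying so the product is formed in 32 bits.
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

With k = 16 and every element equal to 127, the result is 16 × 127 × 127 = 258064, which already exceeds the range of a 16-bit integer; this is one reason a wide accumulator is needed.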
Experimental sub-byte WMMA
Support experimental 4-bit/1-bit Operations with 32-bit output
● Access via special namespace: nvcuda::wmma::experimental
namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}
Binary Tensor Core Example
Concept
▪ Train neural networks on lower-precision data: faster compute, lower memory size
▪ Reduce data to positive / negative sign value – can fit in single bit (1 = +ve, 0 = -ve)
▪ 1-bit weight & activation calculations based only on sign of data
Ref: Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, 2016
https://arxiv.org/pdf/1602.02830.pdf
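To show how sign-only data maps onto XOR and population count, here is a minimal CPU sketch. It assumes the encoding described above (bit 1 for +1, bit 0 for −1) and a vector length of 32; `popcount32`, `pack_signs`, and `binary_dot32` are hypothetical helper names, and the conversion from a popcount to a signed dot product is left to the caller in this model.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits in x (clears the lowest set bit on each iteration).
int popcount32(std::uint32_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Pack 32 values of +1 / -1 into one word: bit i is 1 for +1 and 0 for -1.
std::uint32_t pack_signs(const int* v) {
    std::uint32_t bits = 0;
    for (int i = 0; i < 32; ++i) {
        assert(v[i] == 1 || v[i] == -1);
        if (v[i] == 1) bits |= (1u << i);
    }
    return bits;
}

// Dot product of two packed +1 / -1 vectors of length 32.
// Equal bits contribute +1 and differing bits contribute -1, so
// dot = (32 - differing) - differing = 32 - 2 * popcount(a XOR b).
int binary_dot32(std::uint32_t a, std::uint32_t b) {
    return 32 - 2 * popcount32(a ^ b);
}
```

A full 32-element multiply-accumulate therefore reduces to one XOR and one population count, which is why 1-bit operation rates can be so much higher than those of wider types.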
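Turing WMMA API Summary
Type class | Input Precision | Output | Supported Sizes | Max Ops/Clock/SM
Native Types | half * | half or float | 16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16 | 1024
Native Types | char | integer (int32) | 16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16 | 2048
Native Types | unsigned char | integer (int32) | 16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16 | 2048
Experimental | precision::u4 (4-bit unsigned) | integer (int32) | 8 x 8 x 32 | 4096
Experimental | precision::s4 (4-bit signed) | integer (int32) | 8 x 8 x 32 | 4096
Experimental | precision::b1 (1-bit) | integer (int32) | 8 x 8 x 128 | 16384
* Also available on Volta sm_70. Note: WMMA requires recompilation for Turing sm_75 for peak performance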
CUTLASS 1.1
Collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM)
● Turing/CUDA 10-optimized: supports 8-bit, 4-bit, and 1-bit integers
● Includes detailed documentation and various examples
● Exhibits performance comparable to cuBLAS
CUDA 10 Math Libraries
PERFORMANCE
● Large FFT & 16-GPU Strong Scaling
● Symmetric Eigensolver & Cholesky Performance
● cuSPARSE Sparse-Dense Matrix Multiply Performance
NEW ALGORITHMS AND APIs
● GPU-accelerated hybrid JPEG decoding
● FP16 & INT8 GEMMs for TensorRT Inference
COMPATIBILITY & RELEASE CADENCE
● Faster & Independent Library Releases
● Library and CUDA compatibility with enterprise drivers
TURING
● Turing optimized GEMMs, & GEMM extensions for Tensor Cores
● Turing architecture-optimized libraries
DEEP LEARNING
Scientific Computing
cuBLAS 10
Includes Turing-optimized mixed-precision GEMMs with Tensor Cores
DL Inference Test on T4
cuFFT 10
Strong scaling on multi-GPU systems such as NVIDIA’s DGX
[Chart: cuFFT 9.2 vs. cuFFT 10.0, with a linear trend line for cuFFT 10.0, on 2, 4, 8, and 16 GPUs; y-axis from 0 to 20000.]
cuFFT (10.0 and 9.2) using 3D C2C FFT 1024 size on DGX-2
cuSolver 10
Improved performance with new implementations for
● Cholesky factorization
● Symmetric & Generalized Symmetric Eigensolver
● QR factorization
Up to 44x Faster on Symmetric Eigensolver (DSYEVD)
Benchmarks use 2 x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs
[Chart: Time (s) vs. Matrix Size (4096, 8192) for MKL2018, CUDA 9.2, and CUDA 10.0; y-axis from 0 to 30; bar labels 1.1, 15.8, 18.0, 0.9, 3.6.]
Nsight Product Family
Nsight Systems: System-wide application algorithm tuning
Nsight Compute: CUDA Kernel Profiling and Debugging
Nsight Graphics: Graphics Shader Profiling and Debugging
Nsight Systems
System-wide Performance Analysis
Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, Unified Memory transfers, User Annotations using NVTX
Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events on laptops, Container support, Minimum user privileges
[Screenshot callouts: Processes and threads; CUDA and OpenGL API trace; Multi-GPU; Kernel and memory transfer activities; cuDNN and cuBLAS trace; Thread/core migration; Thread state]
Nsight Compute
Next Generation Kernel Profiler
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux (x86, ARM), Windows
GPUs: Pascal, Volta, Turing
[Screenshot callouts: Kernel Profile Comparisons with Baseline; Metric Data; Source Correlation]
SEOUL | NOVEMBER 7 - 8, 2018
www.nvidia.com/ko-kr/ai-conference/