Page 1:

Andrew Kerr, May 21, 2020

DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100

Page 2:

2

CUTLASS Team

Andrew Kerr, Haicheng Wu, Manish Gupta, Duane Merrill, Pradeep Ramani

Contributors

Mostafa Hagog, Timothy Costa, Alan Kaatz, John Tran, Stephen Jones, Kyrylo Perelygin, Luke Durant, Piotr Majcher, Paul Springer, Markus Hohnerbach

Acknowledgements

Joel McCormack, Julien Demouth, Olivier Giroux, Bryce Lelbach, Cris Cecka

ACKNOWLEDGEMENTS

Page 3:

3

Overview

NVIDIA Ampere Architecture and CUTLASS 2.2

Tensor Cores on NVIDIA Ampere Architecture

Accelerated matrix operations

Efficient data movements for Tensor Cores

Strategies for maximizing performance

CUTLASS on NVIDIA A100

Optimal CUDA C++ templates for Tensor Cores

AGENDA

Page 4:

4

OVERVIEW

Page 5:

5

NVIDIA AMPERE ARCHITECTURE

New and Faster Tensor Core Operations

▪ Floating-point Tensor Core operations 8x and 16x faster than F32 CUDA Cores

▪ Integer Tensor Core operations 32x and 64x faster than F32 CUDA Cores

▪ New IEEE double-precision Tensor Cores 2x faster than F64 CUDA Cores

Additional Data Types and Modes

▪ Bfloat16, double, Tensor Float 32

Asynchronous copy

▪ Copy directly into shared memory – deep software pipelines

Many additional new features – see “Inside NVIDIA Ampere Architecture”

NVIDIA A100

Page 6:

6

PROGRAMMING NVIDIA AMPERE ARCHITECTURE

Deep Learning and Math Libraries using Tensor Cores (with CUDA kernels under the hood)

• cuDNN, cuBLAS, cuTENSOR, cuSOLVER, cuFFT, cuSPARSE

• “CUDNN V8: New Advances in Deep Learning Acceleration” (GTC 2020 - S21685)

• “How CUDA Math Libraries Can Help you Unleash the Power of the New NVIDIA A100 GPU” (GTC 2020 – S21681)

• “Inside the Compilers, Libraries and Tools for Accelerated Computing” (GTC 2020 – S21766)

CUDA C++ Device Code

• CUTLASS, CUDA Math API, CUB, Thrust, libcu++

[Diagram: software stack: GPU-accelerated application → CUDA-accelerated math libraries with host-side API → CUDA device code → GPU]

Page 7:

7

PROGRAMMING NVIDIA AMPERE ARCHITECTURE

This is a talk for CUDA programmers working in CUDA C++

[Diagram: software stack: GPU-accelerated application → CUDA-accelerated math libraries with host-side API → CUDA device code → GPU]

Page 8:

8

CUTLASS

https://github.com/NVIDIA/cutlass

CUDA C++ Templates for Deep Learning and Linear Algebra

CUTLASS Preview Release (CUDA 9.1)
CUTLASS 1.0 (CUDA 9.2)
CUTLASS 1.3 – native NVIDIA V100 Tensor Cores (CUDA 10.1)
CUTLASS 2.0 – native NVIDIA Turing Tensor Cores (CUDA 10.2)
CUTLASS 2.2 – NVIDIA A100 (CUDA 11)

Page 9:

9

CUTLASS

CUTLASS 2.2: optimal performance on NVIDIA Ampere Architecture

• Higher throughput Tensor Cores: more than 2x speedup for all data types

• New floating-point types: bfloat16, Tensor Float 32, double

• Deep software pipelines with cp.async: efficient and latency tolerant

CUTLASS 2.1

• Planar complex: complex-valued GEMMs with batching options targeting Volta and Turing Tensor Cores

• BLAS-style host side API

CUTLASS 2.0: significant refactoring using modern C++11 programming

• Efficient: particularly for Turing Tensor Cores

• Tensor Core programming model: reusable components for linear algebra kernels in CUDA

• Documentation, profiling tools, reference implementations, SDK examples, more..

What’s new?

https://github.com/NVIDIA/cutlass

Page 10:

10

CUTLASS PERFORMANCE ON NVIDIA AMPERE ARCHITECTURE

CUTLASS 2.2 - CUDA 11 Toolkit – NVIDIA A100

[Figure: three panels of GFLOP/s vs. GEMM K at m=3456, n=4096: Mixed Precision Floating Point (Tensor Core BF16/F16 and TF32 vs. CUDA Core F32), Double Precision Floating Point (Tensor Core F64 vs. CUDA Core F64), and Mixed Precision Integer (Tensor Core INT4 and INT8 vs. CUDA Core INT8). Annotated speedups: 2x, 5.7x, 7.7x, 13x, and 13.8x.]

Page 11:

11

TENSOR CORES ON NVIDIA AMPERE ARCHITECTURE

Page 12:

12

Matrix operations: D = op(A, B) + C

▪ Matrix multiply-add

▪ XOR-POPC

Input Data types: A, B

▪ half, bfloat16, Tensor Float 32, double, int8, int4, bin1

Accumulation Data Types: C, D

▪ half, float, int32_t, double

WHAT ARE TENSOR CORES?

Page 13:

13

WHAT ARE TENSOR CORES?

Matrix operations: D = op(A, B) + C

▪ Matrix multiply-add

▪ XOR-POPC

M-by-N-by-K matrix operation

▪ Warp-synchronous, collective operation

▪ 32 threads within warp collectively hold A, B, C, and D operands
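To make the warp-collective model concrete before the PTX-level examples on the following slides, here is a minimal sketch (not from the deck) using the CUDA C++ wmma API in <mma.h>. The 16x16x16 shape, the row/column layouts, and the kernel name are illustrative assumptions; the fragment types hide exactly how the 32 threads of the warp share the A, B, and accumulator operands.

#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 matrix multiply-accumulate.
// Every thread of the warp must execute these calls together.
__global__ void warp_mma_sketch(half const *A, half const *B, float *C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_A;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_B;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_C;

  wmma::fill_fragment(frag_C, 0.0f);       // clear the accumulator
  wmma::load_matrix_sync(frag_A, A, 16);   // leading dimension = 16
  wmma::load_matrix_sync(frag_B, B, 16);
  wmma::mma_sync(frag_C, frag_A, frag_B, frag_C);
  wmma::store_matrix_sync(C, frag_C, 16, wmma::mem_row_major);
}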

Page 14:

14

NVIDIA AMPERE ARCHITECTURE - TENSOR CORE OPERATIONS

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

PTX | Data Types (A * B + C) | Shape | Speedup on NVIDIA A100 (vs F32 CUDA Cores) | Speedup on Turing* (vs F32 Cores) | Speedup on Volta* (vs F32 Cores)
mma.sync.m16n8k16 / mma.sync.m16n8k8 | F16 * F16 + F16, F16 * F16 + F32, BF16 * BF16 + F32 | 16-by-8-by-16 / 16-by-8-by-8 | 16x | 8x | 8x
mma.sync.m16n8k8 | TF32 * TF32 + F32 | 16-by-8-by-8 | 8x | N/A | N/A
mma.sync.m8n8k4 | F64 * F64 + F64 | 8-by-8-by-4 | 2x | N/A | N/A
mma.sync.m16n8k32 / mma.sync.m8n8k16 | S8 * S8 + S32 | 16-by-8-by-32 / 8-by-8-by-16 | 32x | 16x | N/A
mma.sync.m16n8k64 | S4 * S4 + S32 | 16-by-8-by-64 | 64x | 32x | N/A
mma.sync.m16n8k256 | B1 ^ B1 + S32 | 16-by-8-by-256 | 256x | 128x | N/A

* Instructions with equivalent functionality for Turing and Volta differ in shape from the NVIDIA Ampere Architecture in several cases.

Page 15:

15

Warp-wide Tensor Core operation: 8-by-8-by-128b

TENSOR CORE OPERATION: FUNDAMENTAL SHAPE

Page 16:

16

mma.sync.aligned (via inline PTX)

int32_t  D[2];
uint32_t const A;
uint32_t const B;
int32_t  const C[2];

// Example targets 8-by-8-by-16 Tensor Core operation
asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
    " { %0, %1 }, "
    " %2, "
    " %3, "
    " { %4, %5 }; "
    : "=r"(D[0]), "=r"(D[1])
    : "r"(A), "r"(B),
      "r"(C[0]), "r"(C[1])
);

8-by-8-by-16

S8 * S8 + S32

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

Page 17:

17

Warp-wide Tensor Core operation: 16-by-8-by-128b

EXPANDING THE M DIMENSION

Page 18:

18

mma.sync.aligned (via inline PTX)

float D[4];
uint32_t const A[2];
uint32_t const B;
float const C[4];

// Example targets 16-by-8-by-8 Tensor Core operation
asm("mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
    " { %0, %1, %2, %3 }, "
    " { %4, %5 }, "
    " %6, "
    " { %7, %8, %9, %10 }; "
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
    : "r"(A[0]), "r"(A[1]),
      "r"(B),
      "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
);

16-by-8-by-8

F16 * F16 + F32

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

Page 19:

19

Warp-wide Tensor Core operation: 16-by-8-by-256b

EXPANDING THE K DIMENSION

Page 20:

20

mma.sync.aligned (via inline PTX)

float D[4];
uint32_t const A[4];
uint32_t const B[2];
float const C[4];

// Example targets 16-by-8-by-16 Tensor Core operation
asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    " { %0, %1, %2, %3 }, "
    " { %4, %5, %6, %7 }, "
    " { %8, %9 }, "
    " { %10, %11, %12, %13 }; "
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
);

16-by-8-by-16

F16 * F16 + F32

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

Page 21:

21

mma.sync.aligned (via inline PTX)

int32_t D[4];
uint32_t const A[4];
uint32_t const B[2];
int32_t const C[4];

// Example targets 16-by-8-by-32 Tensor Core operation
asm("mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 "
    " { %0, %1, %2, %3 }, "
    " { %4, %5, %6, %7 }, "
    " { %8, %9 }, "
    " { %10, %11, %12, %13 }; "
    : "=r"(D[0]), "=r"(D[1]), "=r"(D[2]), "=r"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "r"(C[0]), "r"(C[1]), "r"(C[2]), "r"(C[3])
);

16-by-8-by-32

S8 * S8 + S32

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

Page 22:

22

mma.sync.aligned (via inline PTX)

uint32_t D[2];        // two registers needed (vs. four)
uint32_t const A[4];
uint32_t const B[2];
uint32_t const C[2];  // two registers needed (vs. four)

// Example targets 16-by-8-by-16 Tensor Core operation
asm("mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    " { %0, %1 }, "
    " { %2, %3, %4, %5 }, "
    " { %6, %7 }, "
    " { %8, %9 }; "
    : "=r"(D[0]), "=r"(D[1])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "r"(C[0]), "r"(C[1])
);

16-by-8-by-16

HALF-PRECISION : F16 * F16 + F16

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends


Page 23:

23

mma.sync.aligned (via inline PTX)

uint64_t D[2];        // two 64-bit accumulators
uint64_t const A;     // one 64-bit element for the A operand
uint64_t const B;     // one 64-bit element for the B operand
uint64_t const C[2];  // two 64-bit accumulators

// Example targets 8-by-8-by-4 Tensor Core operation
asm("mma.sync.aligned.m8n8k4.row.col.f64.f64.f64.f64 "
    " { %0, %1 }, "
    " %2, "
    " %3, "
    " { %4, %5 }; "
    : "=l"(D[0]), "=l"(D[1])
    : "l"(A), "l"(B),
      "l"(C[0]), "l"(C[1])
);

8-by-8-by-4

DOUBLE-PRECISION: F64 * F64 + F64

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends

Page 24:

24

cutlass::arch::Mma

/// Matrix multiply-add operation
template <
  /// Size of the matrix product (concept: GemmShape)
  typename Shape,
  /// Number of threads participating
  int kThreads,
  /// Data type of A elements
  typename ElementA,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA,
  /// Data type of B elements
  typename ElementB,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB,
  /// Element type of C matrix
  typename ElementC,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC,
  /// Inner product operator
  typename Operator
>
struct Mma;

m-by-n-by-k

CUTLASS: wraps PTX in template

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h

Page 25:

25

cutlass::arch::Mma

__global__ void kernel() {

  // arrays containing logical elements
  Array<half_t, 8> A;
  Array<half_t, 4> B;
  Array<float, 4> C;

  // define the appropriate matrix operation
  arch::Mma< GemmShape<16, 8, 16>, 32, ... > mma;

  // in-place matrix multiply-accumulate
  mma(C, A, B, C);

  ...
}

16-by-8-by-16

CUTLASS: wraps PTX in template

https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/arch/mma_sm80.h

Page 26:

26

EFFICIENT DATA MOVEMENT FOR TENSOR CORES

Page 27:

27

CUDA example

__global__ void tensor_core_example_8x8x16(
    int32_t *D,
    uint32_t const *A,
    uint32_t const *B,
    int32_t const *C) {

  // Compute the coordinates of accesses to A and B matrices
  int outer = threadIdx.x / 4;   // m or n dimension
  int inner = threadIdx.x % 4;   // k dimension

  // Compute the coordinates for the accumulator matrices
  int c_row = threadIdx.x / 4;
  int c_col = 2 * (threadIdx.x % 4);

  // Compute linear offsets into each matrix
  int ab_idx = outer * 4 + inner;
  int cd_idx = c_row * 8 + c_col;

  // Issue Tensor Core operation
  asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
      " { %0, %1 }, "
      " %2, "
      " %3, "
      " { %4, %5 }; "
      : "=r"(D[cd_idx]), "=r"(D[cd_idx + 1])
      : "r"(A[ab_idx]), "r"(B[ab_idx]),
        "r"(C[cd_idx]), "r"(C[cd_idx + 1])
  );
}

HELLO WORLD: TENSOR CORES

Map each thread to coordinates of the matrix operation

Load inputs from memory

Perform the matrix operation

Store the result to memory

Page 28:

28

CUDA example: the same tensor_core_example_8x8x16 kernel shown on the previous slide.

PERFORMANCE IMPLICATIONS

Load A and B inputs from memory: 2 x 4B per thread

Perform one Tensor Core operation: 2048 flops per warp

2048 flops require 256 B of loaded data (8 x 8 x 16 = 1024 multiply-adds = 2048 flops per warp, and 32 threads x 8 B = 256 B)

➔ 8 flops/byte

NVIDIA A100 Specifications:

• 624 TFLOP/s (INT8)

• 1.6 TB/s (HBM2)

➔ 400 flops/byte

8 flops/byte * 1.6 TB/s ➔ 12 TFLOP/s

This kernel is global memory bandwidth limited.
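The same arithmetic, written out as a small compile-time sketch (the constants are the figures quoted on this slide; the variable names are illustrative):

// Each warp-wide m8n8k16 operation performs 8 x 8 x 16 = 1024 multiply-adds = 2048 flops,
// while the warp loads 32 threads x (4 B of A + 4 B of B) = 256 B from global memory.
constexpr double flops_per_mma   = 2.0 * 8 * 8 * 16;                 // 2048 flops
constexpr double bytes_per_mma   = 32.0 * (4 + 4);                   // 256 B
constexpr double flops_per_byte  = flops_per_mma / bytes_per_mma;    // 8 flops/byte

constexpr double hbm_bandwidth   = 1.6e12;                           // NVIDIA A100: 1.6 TB/s
constexpr double peak_int8_flops = 624e12;                           // NVIDIA A100: 624 TFLOP/s (INT8)
constexpr double machine_balance = peak_int8_flops / hbm_bandwidth;  // ~400 flops/byte
constexpr double bound_flops     = flops_per_byte * hbm_bandwidth;   // ~12.8 TFLOP/s, the ~12 TFLOP/s bound above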

Page 29:

29

FEEDING THE DATA PATH

Efficient storing and loading through Shared Memory

Tiled, hierarchical model: reuse data in Shared Memory and in Registers

See CUTLASS GTC 2018 talk for more details about this model.

Page 30:

30

FEEDING THE DATA PATH

Move data from Global Memory to Tensor Cores as efficiently as possible

• Latency-tolerant pipeline from Global Memory

• Conflict-free Shared Memory stores

• Conflict-free Shared Memory loads

Page 31:

31

ASYNCHRONOUS COPY: EFFICIENT PIPELINES

New NVIDIA Ampere Architecture feature: cp.async

• Asynchronous copy directly from Global to Shared Memory

• See “Inside the NVIDIA Ampere Architecture” for more details (GTC 2020 – S21730)

Enables efficient software pipelines

• Minimizes data movement: L2 ➔ L1 ➔ RF ➔ SMEM becomes L2 ➔ SMEM

• Saves registers: RF no longer needed to hold the results of long-latency load instructions

• Indirection: fetch several stages in advance for greater latency tolerance

[Diagram: circular buffer in Shared Memory: committed stage, SMEM write pointer, and several cp.async copies in flight ahead of the ld.shared reads.]
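A minimal sketch of such a pipeline (an illustration, not CUTLASS source) using the CUDA 11 asynchronous-copy primitives declared in <cuda_pipeline.h>; the stage count, tile size, and kernel name are assumptions made up for the example:

#include <cuda_pipeline.h>

constexpr int kStages    = 3;      // assumed number of stages in the circular buffer
constexpr int kTileBytes = 4096;   // assumed bytes staged per tile

__global__ void pipelined_mainloop_sketch(char const *gmem, int num_tiles) {
  // Circular buffer of kStages tiles; 16 B alignment for 16 B cp.async transfers.
  __shared__ __align__(16) char smem[kStages][kTileBytes];

  // Prime the pipeline: issue cp.async copies for the first kStages tiles
  // (assumes gmem is 16-byte aligned).
  for (int s = 0; s < kStages; ++s) {
    for (int off = threadIdx.x * 16; off < kTileBytes; off += blockDim.x * 16) {
      __pipeline_memcpy_async(&smem[s][off], gmem + s * kTileBytes + off, 16);
    }
    __pipeline_commit();                   // one commit per stage
  }

  for (int i = 0; i < num_tiles; ++i) {
    __pipeline_wait_prior(kStages - 1);    // oldest in-flight stage has landed in SMEM
    __syncthreads();

    // ... consume smem[i % kStages] here with ldmatrix / mma.sync ...

    __syncthreads();

    // Refill the stage just consumed with the tile kStages ahead
    // (assumes gmem holds num_tiles + kStages tiles).
    for (int off = threadIdx.x * 16; off < kTileBytes; off += blockDim.x * 16) {
      __pipeline_memcpy_async(&smem[i % kStages][off],
                              gmem + (i + kStages) * kTileBytes + off, 16);
    }
    __pipeline_commit();
  }
}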

Page 32:

32

FEEDING THE DATA PATH

Move data from Global Memory to Tensor Cores as efficiently as possible

• Latency-tolerant pipeline from Global Memory

• Conflict-free Shared Memory stores

• Conflict-free Shared Memory loads

Page 33:

33

GLOBAL MEMORY TO TENSOR CORES

[Diagram: Global Memory → Shared Memory (via cp.async) → Tensor Cores, tiled along the M and K dimensions]

Page 34:

34

LDMATRIX: FETCH TENSOR CORE OPERANDS

PTX instruction to load a matrix from Shared Memory

[Diagram: Shared Memory → Tensor Cores]

Each thread supplies a pointer to 128b row of data in Shared Memory

Each 128b row is broadcast to groups of four threads

(potentially different threads than the one supplying the pointer)

Data matches arrangement of inputs to Tensor Core operations

Page 35:

35

LDMATRIX: PTX INSTRUCTION

Each thread supplies a pointer to 128b row of data in Shared Memory

Each 128b row is broadcast to groups of four threads

(potentially different threads than the one supplying the pointer)

Data matches arrangement of inputs to Tensor Core operations

PTX instruction to load a matrix from SMEM

// Inline PTX assembly for ldmatrix

uint32_t R[4];
uint32_t smem_ptr;

asm volatile ("ldmatrix.sync.aligned.x4.m8n8.shared.b16 "
              "{%0, %1, %2, %3}, [%4]; "
    : "=r"(R[0]), "=r"(R[1]), "=r"(R[2]), "=r"(R[3])
    : "r"(smem_ptr)
);
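The slide does not show how smem_ptr is formed. One way to produce it (an assumption here, not part of the deck) is to convert a generic Shared Memory pointer into the 32-bit shared-state-space address that ldmatrix expects, for example with the __cvta_generic_to_shared builtin available in recent CUDA toolkits (or equivalent cvta.to.shared inline PTX):

// Hypothetical 8x8 tile of 16-bit elements in Shared Memory (illustration only).
__shared__ __align__(16) uint16_t tile[8][8];

// Each of the 32 threads supplies a pointer to one 128-bit row; the row
// selection shown here is deliberately simplified.
uint32_t smem_ptr = static_cast<uint32_t>(
    __cvta_generic_to_shared(&tile[threadIdx.x % 8][0]));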

Page 36:

36

GLOBAL MEMORY TO TENSOR CORES

[Diagram: Global Memory → Shared Memory (via cp.async) → Tensor Cores (via ldmatrix)]

Page 37:

37

NVIDIA AMPERE ARCHITECTURE – SHARED MEMORY BANK TIMING

Bank conflicts between threads in the same phase

4B words are accessed in 1 phase

8B words are accessed in 2 phases:

• Process addresses of the first 16 threads in a warp

• Process addresses of the second 16 threads in a warp

16B words are accessed in 4 phases:

• Each phase processes 8 consecutive threads of a warp

Slide borrowed from: Guillaume Thomas-Collignon and Paulius Micikevicius. "Volta Architecture and performance optimization.” GTC 2018.

http://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf

128 bit access size

Phase 0: T0 .. T7

Phase 1: T8 .. T15

Phase 2: T16 .. T23

Phase 3: T24 .. T31

Page 38:

38

GLOBAL MEMORY TO TENSOR CORES

[Diagram: Global Memory → Shared Memory (cp.async) → Registers (ldmatrix), with a bank conflict on either the store to or the load from Shared Memory]

Page 39:

39

GLOBAL TO SHARED MEMORY

Load (128 bits per thread)

Store (128 bits per thread)

Permuted Shared Memory layout

XOR function maps thread index to Shared Memory location
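As an illustration of that idea, here is a simplified sketch (an assumption, not the exact CUTLASS layout function): the column of each 128-bit vector within a row is permuted by XOR with the row index, so the eight threads of one access phase touch eight different banks on both the store and the load.

// Offsets are in units of 128-bit (16 B) vectors; each Shared Memory row holds 8 of them.
// 'row' and 'col' are hypothetical tile coordinates of one thread's vector.
__device__ int permuted_offset(int row, int col) {
  return row * 8 + (col ^ (row % 8));   // XOR swizzle: conflict-free stores and loads
}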

Page 40:

40

GLOBAL TO SHARED MEMORY

Phase 0: T0 .. T7

Phase 1: T8 .. T15

Phase 2: T16 .. T23

Phase 3: T24 .. T31

Load (128 bits per thread)

Store (128 bits per thread)

Page 41:

41

GLOBAL TO SHARED MEMORY

Phase 0: T0 .. T7

Phase 1: T8 .. T15

Phase 2: T16 .. T23

Phase 3: T24 .. T31

Load (128 bits per thread)

Store (128 bits per thread)

Page 42:

42

GLOBAL TO SHARED MEMORY

Phase 0: T0 .. T7

Phase 1: T8 .. T15

Phase 2: T16 .. T23

Phase 3: T24 .. T31

Load (128 bits per thread)

Store (128 bits per thread)

Page 43:

43

GLOBAL TO SHARED MEMORY

Phase 0: T0 .. T7

Phase 1: T8 .. T15

Phase 2: T16 .. T23

Phase 3: T24 .. T31

Load (128 bits per thread)

Store (128 bits per thread)

Page 44:

44

FEEDING THE DATA PATH

Move data from Global Memory to Tensor Cores as efficiently as possible

• Latency-tolerant pipeline from Global Memory

• Conflict-free Shared Memory stores

• Conflict-free Shared Memory loads

Page 45:

45

LOADING FROM SHARED MEMORY TO REGISTERS

Page 46:

46

LOADING FROM SHARED MEMORY TO REGISTERS

Page 47:

47

LOADING FROM SHARED MEMORY TO REGISTERS

Page 48:

48

LOADING FROM SHARED MEMORY TO REGISTERS

Page 49:

49

ADVANCING TO NEXT K GROUP

K=0..15    K=16..31

Page 50:

50

ADVANCING TO NEXT K GROUP

smem_ptr = row_idx * 8 + column_idx;
smem_ptr = smem_ptr ^ 2;

K=0..15 K=16..31

Page 51:

51

LOADING FROM SHARED MEMORY TO REGISTERS

Phase 0, K=16..31

Page 52:

52

LOADING FROM SHARED MEMORY TO REGISTERS

Phase 1, K=16..31

Page 53:

53

LOADING FROM SHARED MEMORY TO REGISTERS

Phase 2

K=16..31

Page 54:

54

LOADING FROM SHARED MEMORY TO REGISTERS

Phase 3

K=16..31

Page 55:

55

CUTLASS

CUDA C++ Templates as an Optimal Abstraction Layer for Tensor Cores

• Latency-tolerant pipeline from Global Memory

• Conflict-free Shared Memory stores

• Conflict-free Shared Memory loads

Page 56:

56

CUTLASS: OPTIMAL ABSTRACTION FOR TENSOR CORES

using Mma = cutlass::gemm::warp::DefaultMmaTensorOp<
  GemmShape<64, 64, 16>,
  half_t, LayoutA,   // GEMM A operand
  half_t, LayoutB,   // GEMM B operand
  float, RowMajor    // GEMM C operand
>;

__shared__ ElementA smem_buffer_A[Mma::Shape::kM * GemmK];
__shared__ ElementB smem_buffer_B[Mma::Shape::kN * GemmK];

// Construct iterators into SMEM tiles
Mma::IteratorA iter_A({smem_buffer_A, lda}, thread_id);
Mma::IteratorB iter_B({smem_buffer_B, ldb}, thread_id);

Mma::FragmentA frag_A;
Mma::FragmentB frag_B;
Mma::FragmentC accum;

Mma mma;

accum.clear();

#pragma unroll 1
for (int k = 0; k < GemmK; k += Mma::Shape::kK) {

  iter_A.load(frag_A);   // Load fragments from A and B matrices
  iter_B.load(frag_B);

  ++iter_A; ++iter_B;    // Advance along GEMM K to next tile in A
                         // and B matrices

  // Compute matrix product
  mma(accum, frag_A, frag_B, accum);
}

[Diagram: Shared Memory → Tensor Cores, warp-level matrix multiply]

Page 57:

57

CUTLASS: OPTIMAL ABSTRACTION FOR TENSOR CORES

(Same warp-level mainloop as on the previous slide; the annotations below call out its pieces.)

Tile Iterator Constructors:

Initialize pointers into permuted Shared Memory buffers

Fragments:

Register-backed arrays holding each thread’s data

Warp-level matrix multiply:

Decomposes a large matrix multiply into Tensor Core operations

Tile Iterator:

load() - Fetches data from permuted Shared Memory buffers

operator++() - advances to the next logical matrix in SMEM

Page 58:

58

CUTLASS ON NVIDIA A100

Page 59:

59

CUTLASS RELATIVE PERFORMANCE TO CUBLAS

CUTLASS 2.2 – CUDA 11 Toolkit – NVIDIA A100

[Figure: CUTLASS throughput as a percentage of cuBLAS for NN, NT, TN, and TT layouts across DGEMM, IGEMM, SGEMM, TensorOp (f16), TensorOp (f32), and TensorOp (TF32); configurations range from roughly 80% to 99% of cuBLAS.]

Page 60:

60

CUTLASS RELATIVE PERFORMANCE TO CUBLAS

CUTLASS 2.2 – CUDA 11 Toolkit – Three generations of GPU architectures

[Figure: CUTLASS throughput relative to cuBLAS on TitanV, 2080Ti, and A100 for NN, NT, TN, and TT layouts across DGEMM, IGEMM, SGEMM, TensorOp (f16), TensorOp (f32), and TensorOp (TF32).]

Page 61:

61

ARBITRARY PROBLEM SIZE

CUTLASS Templates Cover the Design Space

CUTLASS 2.2 – NVIDIA A100 – Tensor Cores: F16 * F16 + F32

[Figure: GFLOP/s vs. GEMM K for kernels specialized for 128b, 64b, 32b, and 16b memory alignment, with a reference curve labeled "CUDA 10.2 and before".]

Page 62:

62

CONCLUSION

Page 63:

63

CONCLUSION: NVIDIA A100 IS FAST AND PROGRAMMABLE

Tensor Cores on NVIDIA A100 in CUDA

• Order of magnitude speedup for matrix computations

• Programmable in CUDA via mma.sync with zero overhead

• Kernel design can avoid memory bottlenecks

• CUDA 11 Toolkit capable of near-peak performance

CUTLASS 2.2: May 2020

• Open source CUDA C++ template library for CUDA development

• Reusable building blocks for utilizing Tensor Cores on NVIDIA GPUs

• Near-optimal performance on NVIDIA Ampere Architecture

Try it out! https://github.com/NVIDIA/cutlass

Page 64:

64

REFERENCES

NVIDIA Ampere Architecture:

“Inside the NVIDIA Ampere Architecture” (GTC 2020 – S21730)

“NVIDIA Ampere Architecture In-Depth” (blog post)

“CUDA New Features and Beyond” (GTC 2020 – S21760)

“Tensor Core Performance on NVIDIA GPUs” (GTC 2020 – S21929)

“Inside the Compilers, Libraries and Tools for Accelerated Computing” (GTC 2020 - S21766)

CUTLASS

https://github.com/NVIDIA/cutlass (open source software, New BSD license)

GTC 2018 and GTC 2019 talks: GEMM structure and Volta Tensor Cores

CUTLASS Parallel For All blog post
