Page 1

Better Performance at Lower Occupancy

Vasily Volkov

UC Berkeley

September 22, 2010

Page 2

It is common to recommend:
• running more threads per multiprocessor
• running more threads per thread block

Motivation: this is the only way to hide latencies

• But…

2

Prologue

Page 3

Faster codes run at lower occupancy:

Batch of 1024-point complex-to-complex FFTs, single precision:

                     CUFFT 2.2     CUFFT 2.3
  Threads per block  256           64            4x smaller thread blocks
  Occupancy (G80)    33%           17%           2x lower occupancy
  Performance (G80)  45 Gflop/s    93 Gflop/s    2x higher performance

Multiplication of two large matrices, single precision (SGEMM):

                     CUBLAS 1.1    CUBLAS 2.0
  Threads per block  512           64            8x smaller thread blocks
  Occupancy (G80)    67%           33%           2x lower occupancy
  Performance (G80)  128 Gflop/s   204 Gflop/s   1.6x higher performance

Maximizing occupancy, you may lose performance

Page 4

Two common fallacies:

‒ multithreading is the only way to hide latency on GPU

‒ shared memory is as fast as registers

4

Page 5

This talk

I. Hide arithmetic latency using fewer threads

II. Hide memory latency using fewer threads

III. Run faster by using fewer threads

IV. Case study: matrix multiply

V. Case study: FFT

5

Page 6

Part I:

Hide arithmetic latency using fewer threads

6

Page 7

x = a + b; // takes ≈20 cycles to execute
y = a + c; // independent, can start anytime
(stall)
z = x + d; // dependent, must wait for completion

Arithmetic latency

Latency: time required to perform an operation
‒ ≈20 cycles for arithmetic; 400+ cycles for memory
‒ Can’t start a dependent operation for this time
‒ Can hide it by overlapping with other operations

7

Page 8

Arithmetic throughput

8

Latency is often confused with throughput

‒ E.g. “arithmetic is 100x faster than memory: an arithmetic instruction costs 4 cycles per warp (G80), hence a memory operation costs 400 cycles”

‒ But one of these numbers is a rate, the other is a time

Throughput: how many operations complete per cycle

‒ Arithmetic: 1.3 Tflop/s = 480 ops/cycle (op=multiply-add)

‒ Memory: 177 GB/s ≈ 32 ops/cycle (op=32-bit load)

Page 9

Hide latency = do other operations while waiting

• Will run faster

• But not faster than the peak

• How to get the peak?

9

Page 10

Use Little’s law

10

Needed parallelism = Latency x Throughput

Page 11

Arithmetic parallelism in numbers

11

GPU model    Latency (cycles)   Throughput (cores/SM)   Parallelism (operations/SM)
G80-GT200    ≈24                8                       ≈192
GF100        ≈18                32                      ≈576
GF104        ≈18                48                      ≈864

(latency varies between different types of ops)

Can’t get 100% throughput with less parallelism

‒ Not enough operations in flight = idle cycles
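As a quick cross-check of Little’s law against the GF100 row of the table (recomputed here, not from the slides):

\[ \text{needed parallelism} = \text{latency} \times \text{throughput} \approx 18~\text{cycles} \times 32~\tfrac{\text{ops}}{\text{cycle}} \approx 576~\tfrac{\text{ops}}{\text{SM}} \]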

Page 12

Thread-level parallelism (TLP)

12

It is usually recommended to use threads to supply the needed parallelism, e.g. 192 threads per SM on G80:

thread 1:  x = x + a;  x = x + b;  x = x + c;  ...
thread 2:  y = y + a;  y = y + b;  y = y + c;  ...
thread 3:  z = z + a;  z = z + b;  z = z + c;  ...
thread 4:  w = w + a;  w = w + b;  w = w + c;  ...

4 independent operations in flight (one per thread)

Page 13

Instruction-level parallelism (ILP)

13

But you can also use parallelism among instructions in a single thread:

one thread:
  x = x + a;  y = y + a;  z = z + a;  w = w + a;
  x = x + b;  y = y + b;  z = z + b;  w = w + b;
  ...

4 independent operations within a single thread’s instruction stream

Page 14

You can use both ILP and TLP on GPU

This applies to all CUDA-capable GPUs. E.g. on G80: ‒ Get ≈100% peak with 25% occupancy if no ILP

‒ Or with 8% occupancy, if 3 operations from each thread can be concurrently processed

On GF104 you must use ILP to get >66% of peak! ‒ 48 cores/SM, one instruction is broadcast across 16 cores

‒ So, must issue 3 instructions per cycle

‒ But have only 2 warp schedulers

‒ Instead, it can issue 2 instructions per warp in the same cycle

14

Page 15

Let’s check it experimentally

Do many arithmetic instructions with no ILP:

15

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
}

Choose large N_ITERATIONS and suitable UNROLL

Ensure a, b and c are in registers and a is used later

Run 1 block (use 1 SM), vary block size
‒ See what fraction of the per-SM peak (1.3 Tflop/s / 15 SMs ≈ 89.6 Gflop/s) we get
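The slide shows only the loop body; a minimal, self-contained version of the benchmark kernel might look like the sketch below (the kernel name, constants, and the way the result is kept live are assumptions, not the author’s code):

__global__ void peak_fma_no_ilp( float *out, float b, float c )
{
    float a = threadIdx.x;          // one accumulator per thread, kept in a register
    #pragma unroll 128
    for( int i = 0; i < (1<<20); i++ )
        a = a * b + c;              // each FMA depends on the previous one: no ILP
    out[threadIdx.x] = a;           // store the result so the loop is not optimized away
}

// Launch one block (one SM) and sweep the block size, e.g.:
//   peak_fma_no_ilp<<< 1, threadsPerBlock >>>( d_out, 1.000001f, 0.000001f );
// Gflop/s = 2 * threadsPerBlock * 2^20 / elapsed time, compared against the
// per-SM peak of ≈89.6 Gflop/s.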

Page 16

Experimental result (GTX480)

[Figure: fraction of peak vs. threads per SM; per-SM peak = 89.6 Gflop/s]

No ILP: need 576 threads to get 100% utilization

Page 17

Introduce instruction-level parallelism

Try ILP=2: two independent instructions per thread

17

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
}

If multithreading were the only way to hide latency on the GPU, we would get the same performance curve as before

Page 18

GPUs can hide latency using ILP

[Figure: fraction of peak vs. threads per SM]

ILP=2: need 320 threads to get 100% utilization

Page 19

Add more instruction-level parallelism

ILP=3: triples of independent instructions

19

#pragma unroll UNROLL
for( int i = 0; i < N_ITERATIONS; i++ )
{
    a = a * b + c;
    d = d * b + c;
    e = e * b + c;
}

How far can we push it?

Page 20

Have more ILP – need fewer threads

[Figure: fraction of peak vs. threads per SM]

ILP=3: need 256 threads to get 100% utilization

Page 21

Unfortunately, doesn’t scale past ILP=4

[Figure: fraction of peak vs. threads per SM]

ILP=4: need 192 threads to get 100% utilization

Page 22

Summary: can hide latency either way

[Figure: fraction of peak vs. thread parallelism (0–1024 threads per SM) at fixed instruction parallelism (ILP=1)]
[Figure: fraction of peak vs. instruction parallelism (ILP 0–6) at fixed thread parallelism (12.5% occupancy)]

Page 23

Applies to other GPUs too, e.g. to G80:

[Figure: fraction of peak vs. thread parallelism (0–512 threads per SM) at fixed instruction parallelism (ILP=1)]
[Figure: fraction of peak vs. instruction parallelism (ILP 0–6) at fixed thread parallelism (8% occupancy)]

Page 24

Fallacy: Increasing occupancy is the only way to improve latency hiding

– No, increasing ILP is another way.

24

Page 25

Fallacy: Occupancy is a metric of utilization

– No, it’s only one of the contributing factors.

25

Page 26

Fallacy: “To hide arithmetic latency completely, multiprocessors should be running at least 192 threads on devices of compute capability 1.x (…) or, on devices of compute capability 2.0, as many as 384 threads” (CUDA Best Practices Guide)

– No, it is doable with 64 threads per SM on G80–GT200 and with 192 threads on GF100.

26

Page 27

Part II:

Hide memory latency using fewer threads

27

Page 28

Hiding memory latency

Apply same formula but for memory operations:

28

             Latency            Throughput         Parallelism
Arithmetic   ≈18 cycles         32 ops/SM/cycle    576 ops/SM
Memory       < 800 cycles (?)   < 177 GB/s         < 100 KB

Needed parallelism = Latency x Throughput

So, hide memory latency = keep 100 KB in flight

‒ Less if the kernel is compute bound (needs fewer GB/s)
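A rough sanity check of the 100 KB figure, assuming the GTX480’s ≈1.4 GHz core clock (the clock is an assumption; the slide gives only the totals):

\[ \frac{177~\text{GB/s}}{1.4~\text{GHz}} \approx 126~\tfrac{\text{B}}{\text{cycle}}, \qquad 126~\tfrac{\text{B}}{\text{cycle}} \times 800~\text{cycles} \approx 100~\text{KB} \]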

Page 29

Again, there are multiple ways to hide latency
‒ Use multithreading to get 100 KB in flight

‒ Use instruction parallelism (more fetches per thread)

‒ Use bit-level parallelism (use 64/128-bit fetches)

Do more work per thread – need fewer threads ‒ Fetch 4B/thread – need 25 000 threads

‒ Fetch 100 B/thread – need 1 000 threads

29

How many threads is 100 KB?

Page 30

Copy one float per thread:

30

__global__ void memcpy( float *dst, float *src )
{
    int block = blockIdx.x + blockIdx.y * gridDim.x;
    int index = threadIdx.x + block * blockDim.x;
    float a0 = src[index];
    dst[index] = a0;
}

Empirical validation

Run many blocks, allocate shared memory dynamically to control occupancy
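The deck doesn’t show the launch; one way to vary occupancy without touching the kernel is to request unused dynamic shared memory at launch. A sketch (gpu_copy, grid, threads and the byte count are illustrative names and values, not the author’s code):

// 'gpu_copy' stands in for the copy kernel above (renamed so it does not clash
// with the C library's memcpy). GF100 has 48 KB of shared memory per SM by
// default, so reserving e.g. 12 KB of unused dynamic shared memory per block
// caps the SM at 4 resident blocks, i.e. lowers the achievable occupancy.
size_t ballastBytes = 12 * 1024;                        // sweep this to sweep occupancy
gpu_copy<<< grid, threads, ballastBytes >>>( dst, src );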

Page 31

Copying 1 float per thread (GTX480)

[Figure: fraction of peak bandwidth vs. occupancy; peak = 177.4 GB/s]

Must maximize occupancy to hide latency?

Page 32

32

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 2 * iblock * blockDim.x;
    float a0 = src[index];
    // no latency stall
    float a1 = src[index+blockDim.x];
    // stall
    dst[index] = a0;
    dst[index+blockDim.x] = a1;
}

Do more parallel work per thread

Note, threads don’t stall on memory access – Only on data dependency

Page 33

Copying 2 float values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

Can get away with lower occupancy now

Page 34

34

__global__ void memcpy( float *dst, float *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 4 * iblock * blockDim.x;
    float a[4];  // allocated in registers
    for(int i=0;i<4;i++) a[i]=src[index+i*blockDim.x];
    for(int i=0;i<4;i++) dst[index+i*blockDim.x]=a[i];
}

Do more parallel work per thread

Note, local arrays are allocated in registers if possible

Page 35

Copying 4 float values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

Mere 25% occupancy is sufficient. How far can we go?

Page 36

Copying 8 float values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

Page 37

Copying 8 float2 values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

Page 38

Copying 8 float4 values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

87% of pin bandwidth at only 8% occupancy!
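The float4 kernel itself isn’t listed in the deck; following the same pattern as the 4-float version above, it presumably looks something like this sketch (kernel name assumed):

__global__ void memcpy_float4( float4 *dst, float4 *src )
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 8 * iblock * blockDim.x;
    float4 a[8];                                        // held in registers
    for( int i = 0; i < 8; i++ ) a[i] = src[index + i*blockDim.x];
    for( int i = 0; i < 8; i++ ) dst[index + i*blockDim.x] = a[i];
}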

Page 39

Copying 14 float4 values per thread

[Figure: fraction of peak bandwidth vs. occupancy]

84% of peak at 4% occupancy

Page 40

Two ways to hide memory latency

[Figure: fraction of peak bandwidth vs. occupancy]
[Figure: fraction of peak bandwidth vs. bytes per thread (0–256)]

Page 41

Fallacy: “Low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation” (CUDA Best Practices Guide)

– We’ve just seen 84% of the peak at a mere 4% occupancy. Note that this is above the 71% that cudaMemcpy achieves at best.

41

Page 42

Fallacy: “In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (…) to the number of instructions with off-chip memory operands is low.” (CUDA Programming Guide)

– No, we’ve seen 87% of the memory peak with only 4 warps per SM in a memory-intensive kernel.

42

Page 43

Part III:

Run faster by using fewer threads

43

Page 44

44

Fewer threads = more registers per thread

Registers per thread:
‒ GF100: 20 at 100% occupancy, 63 at 33% occupancy — 3x more
‒ GT200: 16 at 100% occupancy, ≈128 at 12.5% occupancy — 8x more

Is using more registers per thread better?

[Figure: the 32768 registers per SM can be split as more threads or as more registers per thread]

Page 45

Only registers are fast enough to get the peak

Consider a*b+c: 2 flops, 12 bytes in, 4 bytes out

This is 8.1 TB/s for 1.3 Tflop/s!

Registers can accommodate it. Can shared memory?

‒ 4 B × 32 banks × 15 SMs × half of 1.4 GHz = 1.3 TB/s only

[Figure: operands a, b, c stream in @ 8.1 TB/s → a*b+c @ 1.3 Tflop/s → results stream out @ 2.7 TB/s]
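Where those numbers come from (a reconstruction assuming the GTX480’s ≈1.345 Tflop/s single-precision peak, i.e. ≈672 G multiply-adds per second):

\[ 672~\tfrac{\text{G FMA}}{\text{s}} \times 12~\tfrac{\text{B in}}{\text{FMA}} \approx 8.1~\tfrac{\text{TB}}{\text{s}}, \qquad 672~\tfrac{\text{G FMA}}{\text{s}} \times 4~\tfrac{\text{B out}}{\text{FMA}} \approx 2.7~\tfrac{\text{TB}}{\text{s}} \]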

Page 46

Bandwidth needed vs bandwidth available

46

[Figure: bandwidth needed to get the peak ≈ 8 TB/s (registers are at least this fast) vs. shared memory ≈ 1.3 TB/s (6x less) vs. global memory 177 GB/s (another 7.6x less)]

Page 47

Fallacy: “In fact, for all threads of a warp, accessing the shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads..” (CUDA Programming Guide)

– No, shared memory bandwidth is 6x lower than register bandwidth on Fermi. (3x before Fermi.)

47

Page 48

Running fast may require low occupancy

• Must use registers to run close to the peak

• The larger the bandwidth gap, the more data must come from registers

• This may require many registers = low occupancy

This often can be accomplished by computing multiple outputs per thread

48

Page 49

More data is local to a thread in registers

‒ may need fewer shared memory accesses

Fewer threads, but more parallel work in thread

‒ So, low occupancy should not be a problem

Compute multiple outputs per thread

[Figure: computing a 4x4 block of outputs with 16 threads at 1 output/thread, 8 threads at 2 outputs/thread, or 4 threads at 4 outputs/thread]

Page 50

From Tesla to Fermi: regression?

The gap between shared memory and arithmetic throughput has increased:

‒ G80-GT200: 16 banks vs 8 thread processors (2:1)

‒ GF100: 32 banks vs 32 thread processors (1:1)

‒ GF104: 32 banks vs 48 thread processors (2:3)

Using fast register memory could help. But instead, register use is restricted:

‒ G80-GT200: up to ≈128 registers per thread

‒ Fermi: up to ≈64 registers per thread

50

Page 51

Part IV:

Case study: matrix multiply

51

Page 52

Baseline: matrix multiply in CUDA SDK

• I’ll show very specific steps for SDK 3.1, GTX480
• Original code shows 137 Gflop/s
• First few changes:
  – Use larger matrices, e.g. 1024x1024 (matrixMul.cu)
    • “uiWA = uiHA = uiWB = uiHB = uiWC = uiHC = 1024;”
    • Get 240 Gflop/s
  – Remove “–maxrregcount 32” (or increase to 63)
    • Not important now, but will matter later
  – Increase BLOCK_SIZE to 32 (matrixMul.h)
    • Must add #pragma unroll (see next slide); 242 Gflop/s

52

Page 53

Matrix multiply example in SDK

53

float Csub = 0;
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];
    __syncthreads();
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
}
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;

Page 54

Baseline performance
• One output per thread so far
• 242 Gflop/s
  – 2 flops per 2 shared memory accesses = 4 B/flop
  – So, bound by shared memory bandwidth to 336 Gflop/s
  – We’ll approach 500 Gflop/s in a few slides
• 21 registers per thread (sm_20)
• 67% occupancy
• But only 1 block fits per SM
  – Can’t overlap global memory access with arithmetic

54
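Where the 336 Gflop/s bound comes from (using the shared-memory bandwidth worked out on the earlier slide, ≈1.34 TB/s on GTX480):

\[ \frac{4~\text{B} \times 32~\text{banks} \times 15~\text{SMs} \times 0.7~\text{GHz}}{4~\text{B/flop}} = \frac{1344~\text{GB/s}}{4~\text{B/flop}} = 336~\text{Gflop/s} \]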

Page 55

Two outputs per thread (I)

In the new code we use 2x smaller thread blocks – But same number of blocks

matrixMul.cu:

55

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE/2);  // 32x16
dim3 grid(uiWC / BLOCK_SIZE, uiHC / BLOCK_SIZE);

2x fewer threads, but 2x more work per thread:

Page 56

Two outputs per thread (II)

Define 2 outputs and do 2x more loads

float Csub[2] = {0,0};  // array is allocated in registers
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];
    AS(ty+16, tx) = A[a + wA * (ty+16) + tx];
    BS(ty+16, tx) = B[b + wB * (ty+16) + tx];
    __syncthreads();

Page 57

Two outputs per thread (III)

Do 2x more flops and stores

57

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Csub[0] += AS(ty, k) * BS(k, tx);
        Csub[1] += AS(ty+16, k) * BS(k, tx);
    }
    __syncthreads();
}
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub[0];
C[c + wB * (ty+16) + tx] = Csub[1];

Page 58

• Now 341 Gflop/s — 1.4x speedup
  – Already above the 336 Gflop/s bound
• 28 registers
  – 2x more work with only 1.3x more registers
• Now 2 thread blocks fit per SM
  – Because there are fewer threads per block (1536 threads max per SM)
  – Now we can overlap global memory access with arithmetic
  – This is one reason for the speedup
• Same 67% occupancy

Two outputs per thread: performance

Page 59

Shared memory traffic is now lower
• Data fetched from shared memory is now reused:

59

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Csub[0] += AS(ty, k) * BS(k, tx);
    Csub[1] += AS(ty+16, k) * BS(k, tx);
}

• Now 3 B/flop in shared memory accesses

• New bound: 448 Gflop/s
  – We’ll surpass this too

Page 60

Apply same idea again

Shrink thread blocks by another factor of 2:

60

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE/4);  // 32x8
dim3 grid(uiWC / BLOCK_SIZE, uiHC / BLOCK_SIZE);

Four outputs per thread (I)

Page 61

Four outputs per thread (II)

61

float Csub[4] = {0,0,0,0};  // array is in registers
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];
    AS(ty+8, tx) = A[a + wA * (ty+8) + tx];
    BS(ty+8, tx) = B[b + wB * (ty+8) + tx];
    AS(ty+16, tx) = A[a + wA * (ty+16) + tx];
    BS(ty+16, tx) = B[b + wB * (ty+16) + tx];
    AS(ty+24, tx) = A[a + wA * (ty+24) + tx];
    BS(ty+24, tx) = B[b + wB * (ty+24) + tx];
    __syncthreads();

Page 62

Four outputs per thread (III)

62

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Csub[0] += AS(ty, k) * BS(k, tx);
        Csub[1] += AS(ty+8, k) * BS(k, tx);
        Csub[2] += AS(ty+16, k) * BS(k, tx);
        Csub[3] += AS(ty+24, k) * BS(k, tx);
    }
    __syncthreads();
}
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub[0];
C[c + wB * (ty+8) + tx] = Csub[1];
C[c + wB * (ty+16) + tx] = Csub[2];
C[c + wB * (ty+24) + tx] = Csub[3];

Page 63

• Now 427 Gflop/s — 1.76x speedup vs. baseline!
  – Because we access shared memory even less
• 41 registers
  – Only ≈2x more registers
  – So, ≈2x fewer registers per thread block
• 50% occupancy — 1.33x lower
  – Better performance at lower occupancy
• 3 thread blocks per SM
  – Because fewer registers per thread block

63

Four outputs per thread: performance

Page 64

• Now 485 Gflop/s — 2x speedup vs. baseline!
  – Only 2.25 B/flop — 1.8x lower
• 63 registers — 3x more
  – But do 8x more work!
• 33% occupancy — 2x lower
  – Better performance at lower occupancy
• 4 thread blocks per SM

64

Eight outputs per thread: performance
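The deck doesn’t list the eight-output code; following the same pattern as the two- and four-output versions (dim3 threads(BLOCK_SIZE, BLOCK_SIZE/8), i.e. 32x4, with the AS/BS tile macros from the SDK code above), the accumulation loop presumably becomes something like this sketch:

float Csub[8] = {0,0,0,0,0,0,0,0};          // eight accumulators in registers
// ... tile loads of As/Bs (rows ty, ty+4, ..., ty+28) and __syncthreads() as before ...
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
{
    float b = BS(k, tx);                    // one Bs value reused by all eight FMAs
    #pragma unroll
    for (int j = 0; j < 8; ++j)
        Csub[j] += AS(ty + 4*j, k) * b;     // ty runs 0..3, so ty+4*j covers rows 0..31
}

Nine 4-byte shared-memory loads feed sixteen flops per k, which matches the 2.25 B/flop quoted above.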

Page 65

MAGMA BLAS — up to 838 Gflop/s

‒ 36 outputs per thread

‒ 0.67 B/flop only — 6x lower

‒ 33% occupancy

‒ 2 thread blocks per SM

65

How much faster can we get?

Page 66

GFLOPS go up, occupancy goes down

[Figure: occupancy vs. outputs per thread (1, 2, 4, 8, 36)]
[Figure: Gflop/s vs. outputs per thread (1, 2, 4, 8, 36)]

Page 67

Register use goes up, smem traffic down

[Figure: shared memory traffic (B/flop) vs. outputs per thread (1, 2, 4, 8, 36)]
[Figure: registers per thread vs. outputs per thread (1, 2, 4, 8, 36)]

Page 68

Part V:

Case Study: FFT

68

Page 69

Mapping Cooley-Tukey to GPU

69

• Cooley-Tukey splits large FFT into smaller FFTs

• Assume FFT fits into thread block

• Small FFTs are done in registers

• Shuffles are done using shared memory

Page 70

Fewer threads – lower shared memory traffic

[Figure: a 16-point FFT dataflow mapped to threads — 2 outputs/thread: 8 threads, 3 shuffles; 4 outputs/thread: 4 threads, 1 shuffle; 8 outputs/thread: 2 threads, 1 shuffle; 16 outputs/thread: 1 thread, no shuffles]

Page 71

71

__global__ void FFT1024( float2 *dst, float2 *src ){
    float2 a[2]; int tid = threadIdx.x;
    __shared__ float smem[1024];
    load<2>( a, src+tid+1024*blockIdx.x, 512 );
    FFT2( a );
    #pragma unroll
    for( int i = 0; i < 9; i++ ) {
        int k = 1<<i;
        twiddle<2>( a, tid/k, 1024/k );
        transpose<2>( a, &smem[tid+(tid&~(k-1))], k, &smem[tid], 512 );
        FFT2( a );
    }
    store<2>( a, dst+tid+1024*blockIdx.x, 512 );
}

Two outputs per thread
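FFT2, load, twiddle, transpose and store are helper routines from the author’s code that the slides don’t reproduce. For orientation only, a radix-2 butterfly on two complex values held in registers can be as simple as the sketch below (the real helper’s signature may differ):

// In-register radix-2 butterfly: (a0, a1) -> (a0 + a1, a0 - a1)
__device__ inline void FFT2( float2 *a )
{
    float2 t = a[0];
    a[0] = make_float2( t.x + a[1].x, t.y + a[1].y );
    a[1] = make_float2( t.x - a[1].x, t.y - a[1].y );
}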

Page 72

72

__global__ void FFT1024( float2 *dst, float2 *src ){
    float2 a[16]; int tid = threadIdx.x;
    __shared__ float smem[1024];
    load<16>( a, src+tid+1024*blockIdx.x, 64 );
    FFT4( a, 4, 4, 1 );  // four FFT4
    twiddle<4>( a, threadIdx.x, 1024, 4 );
    transpose<4>( a, &smem[tid*4], 1, &smem[tid], 64, 4 );
    #pragma unroll
    for( int i = 2; i < 10-4; i += 4 ) {
        int k = 1<<i;
        FFT16( a );
        twiddle<16>( a, threadIdx.x/k, 1024/k );
        transpose<16>( a, &smem[tid+15*(tid&~(k-1))], k, &smem[tid], 64 );
    }
    FFT16( a );
    store<16>( a, dst+tid+1024*blockIdx.x, 64 );
}

Sixteen outputs per thread

Page 73

GFLOPS go up, occupancy goes down

[Figure: Gflop/s vs. outputs per thread (2, 4, 8, 16)]
[Figure: occupancy vs. outputs per thread (2, 4, 8, 16)]

Page 74

Summary

• Do more parallel work per thread to hide latency with fewer threads

• Use more registers per thread to access slower shared memory less

• Both may be accomplished by computing multiple outputs per thread

74

Page 75

Compute more outputs per thread

[Figure: occupancy vs. outputs per thread (1, 2, 4, 8, 16)]
[Figure: Gflop/s vs. outputs per thread (1, 2, 4, 8, 16)]