cache performance iicr4bd/3330/s2017/... · 3/30/2017  · a[k*114+j] isat 10 0000 0000 0100...

106
Cache Performance II 1

Upload: others

Post on 17-Nov-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

Cache Performance II

1

Page 2: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag valid tag data data1 10 1 00 00 11 AA BB

1 11 1 01 B4 B5 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

2

Page 3: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag valid tag data data1 10 1 00 00 11 AA BB

1 11 1 01 B4 B5 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

2

Page 4: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag valid tag data data1 10 1 00 00 11 AA BB

1 11 1 01 B4 B5 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

2

Page 5: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

Tag-Index-Offset formulas

m memory addreses bits (Y86-64: 64)

E number of blocks per set (“ways”)

S = 2s number of sets

s (set) index bits

B = 2b block size

b (block) offset bits

t = m − (s + b) tag bits

C = B × S × E cache size (excluding metadata)3

Page 6: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

Tag-Index-Offset exercisem memory addreses bits (Y86-64: 64)E number of blocks per set (“ways”)S = 2s number of setss (set) index bitsB = 2b block sizeb (block) offset bitst = m − (s + b) tag bitsC = B × S × E cache size (excluding metadata)

My desktop:L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocksL2 Cache: 256 KB, 4 blocks/set, 64 byte blocksL3 Cache: 8 MB, 16 blocks/set, 64 byte blocks

Divide the address 0x34567 into tag, index, offset foreach cache.

4

Page 7: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)

block offset bits b = 6blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 8: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)block offset bits b = 6

blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 9: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)block offset bits b = 6blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 10: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)block offset bits b = 6blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 11: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)block offset bits b = 6blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 12: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O exercise: L1quantity value for L1block size (given) B = 64Byte

B = 2b (b: block offset bits)block offset bits b = 6blocks/set (given) E = 8cache size (given) C = 32KB = E × B × S

S = C

B × E(S: number of sets)

number of sets S = 32KB64Byte × 8 = 64

S = 2s (s: set index bits)set index bits s = log2(64) = 6

5

Page 13: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O resultsL1 L2 L3

sets 64 1024 8192block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

6

Page 14: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

7

Page 15: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

7

Page 16: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L1:bits 6-11 (L1 set): 01 0101 = 0x15bits 12- (L1 tag): 0x34

7

Page 17: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L1:bits 6-11 (L1 set): 01 0101 = 0x15bits 12- (L1 tag): 0x34

7

Page 18: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L2:bits 6-15 (set for L2): 01 0001 0101 = 0x115bits 16-: 0x3

7

Page 19: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L2:bits 6-15 (set for L2): 01 0001 0101 = 0x115bits 16-: 0x3

7

Page 20: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

T-I-O: splittingL1 L2 L3

block offset bits 6 6 6set index bits 6 10 13tag bits (the rest)

0x34567: 3 4 5 6 70011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L3:bits 6-18 (set for L3): 0 1101 0001 0101 =0xD15bits 18-: 0x0

7

Page 21: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j */for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

8

Page 22: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j*/for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i * N + k] * A[k * N + j]; 9

Page 23: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j*/for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i * N + k] * A[k * N + j]; 9

Page 24: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

performance

0 100 200 300 400 500 600N

0.0

0.2

0.4

0.6

0.8

1.0

1.2 billions of instructionsk innerk outer

0 100 200 300 400 500 600N

0.0

0.2

0.4

0.6

0.8

1.0 billions of cyclesk innerk outer

10

Page 25: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

alternate view 1: cycles/instruction

0 100 200 300 400 500 600N

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 cycles/instruction

11

Page 26: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

alternate view 2: cycles/operation

0 100 200 300 400 500 600N

1.0

1.5

2.0

2.5

3.0

3.5 cycles/multiply or add

12

Page 27: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

loop orders and locality

loop body: Bij+ = AikAkj

kij order: Bij, Akj have spatial locality

kij order: Aik has temporal locality

… better than …

ijk order: Aik has spatial locality

ijk order: Bij has temporal locality

13

Page 28: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

loop orders and locality

loop body: Bij+ = AikAkj

kij order: Bij, Akj have spatial locality

kij order: Aik has temporal locality

… better than …

ijk order: Aik has spatial locality

ijk order: Bij has temporal locality

13

Page 29: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j*/for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i * N + k] * A[k * N + j]; 14

Page 30: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j*/for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i * N + k] * A[k * N + j]; 14

Page 31: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix squaring

Bij =n∑

k=1Aik × Akj

/* version 1: inner loop is k, middle is j*/for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)for (int k = 0; k < N; ++k)

B[i*N+j] += A[i * N + k] * A[k * N + j];

/* version 2: outer loop is k, middle is i */for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i * N + k] * A[k * N + j]; 14

Page 32: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

L1 misses

0 100 200 300 400 500 600N

0

20

40

60

80

100

120

140 read misses/1K instructionsk innerk outer

15

Page 33: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

L1 miss detail (1)

0 50 100 150 200N

0

20

40

60

80

100

120

140

matrix smallerthan L1 cache

read misses/1K instruction

16

Page 34: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

L1 miss detail (2)

0 50 100 150 200N

0

20

40

60

80

100

120

140

matrix smallerthan L1 cache

N = 93; 93 * 11 210

N = 114; 114 * 9 210

N = 27

read misses/1K instruction

17

Page 35: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

addressesA[k*114+j] is at 10 0000 0000 0100A[k*114+j+1] is at 10 0000 0000 1000A[(k+1)*114+j] is at 10 0011 1001 0100A[(k+2)*114+j] is at 10 0101 0101 1100…A[(k+9)*114+j] is at 11 0000 0000 1100

recall: 6 index bits, 6 block offset bits (L1)

18

Page 36: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

addressesA[k*114+j] is at 10 0000 0000 0100A[k*114+j+1] is at 10 0000 0000 1000A[(k+1)*114+j] is at 10 0011 1001 0100A[(k+2)*114+j] is at 10 0101 0101 1100…A[(k+9)*114+j] is at 11 0000 0000 1100

recall: 6 index bits, 6 block offset bits (L1)

18

Page 37: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

conflict misses

powers of two — lower order bits unchanged

A[k*93+j] and A[(k+11)*93+j]:1023 elements apart (4092 bytes; 63.9 cache blocks)

64 sets in L1 cache: usually maps to same set

A[k*93+(j+1)] will not be cached (next i loop)

even if in same block as A[k*93+j]

19

Page 38: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

L2 misses

0 100 200 300 400 500 600N

0

2

4

6

8

10 L2 misses/1K instructionsijkkij

20

Page 39: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

systematic approach (1)

for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i*N+k] * A[k*N+j];

goal: get most out of each cache missif N is larger than the cache:miss for Bij — 1 comptuationmiss for Aik — N computationsmiss for Akj — 1 computationeffectively caching just 1 element

21

Page 40: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

‘flat’ 2D arrays and cache blocks

A[N]A[0]

Ax0 AxN

22

Page 41: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: kij order

Ax0 AxN

Aik

Ak0 to AkN

Bi0 to BiN

Akj

Bij

for all k: for all i: for all j: Bij+ = Aik × Akj

N calculations for Aik

1 for Akj, Bij

Aik reused in innermost loop (over j)definitely cached (plus rest of cache block)

Akj reused in next middle loop (over i)cached only if entire row fitsBij reused in next outer loop

probably not still in cache next time(but, at least some spatial locality)

23

Page 42: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: kij order

Ax0 AxN

Aik

Ak0 to AkN

Bi0 to BiN

Akj

Bij

for all k: for all i: for all j: Bij+ = Aik × Akj

N calculations for Aik

1 for Akj, Bij

Aik reused in innermost loop (over j)definitely cached (plus rest of cache block)

Akj reused in next middle loop (over i)cached only if entire row fitsBij reused in next outer loop

probably not still in cache next time(but, at least some spatial locality)

23

Page 43: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: kij order

Ax0 AxN

Aik

Ak0 to AkN

Bi0 to BiN

Akj

Bij

for all k: for all i: for all j: Bij+ = Aik × Akj

N calculations for Aik

1 for Akj, Bij

Aik reused in innermost loop (over j)definitely cached (plus rest of cache block)

Akj reused in next middle loop (over i)cached only if entire row fitsBij reused in next outer loop

probably not still in cache next time(but, at least some spatial locality)

23

Page 44: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: kij order

Ax0 AxN

Aik

Ak0 to AkN

Bi0 to BiN

Akj

Bij

for all k: for all i: for all j: Bij+ = Aik × Akj

N calculations for Aik

1 for Akj, Bij

Aik reused in innermost loop (over j)definitely cached (plus rest of cache block)

Akj reused in next middle loop (over i)cached only if entire row fits

Bij reused in next outer loopprobably not still in cache next time(but, at least some spatial locality)

23

Page 45: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: kij order

Ax0 AxN

Aik

Ak0 to AkN

Bi0 to BiN

Akj

Bij

for all k: for all i: for all j: Bij+ = Aik × Akj

N calculations for Aik

1 for Akj, Bij

Aik reused in innermost loop (over j)definitely cached (plus rest of cache block)

Akj reused in next middle loop (over i)cached only if entire row fits

Bij reused in next outer loopprobably not still in cache next time(but, at least some spatial locality)

23

Page 46: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

inefficiencies

if N is large enough that a row doesn’t fit in cache:keeping one block in cacheaccessing one element of that block

if N is small enough that a row does fit in cache:keeping one block in cache for Aik, using one elementkeeping row in cache for Akj, using N times

24

Page 47: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

systematic approach (2)

for (int k = 0; k < N; ++k) {for (int i = 0; i < N; ++i) {

Aik loaded once in this loop (N2 times):for (int j = 0; j < N; ++j)

Bij, Akj loaded each iteration (if N big):B[i*N+j] += A[i*N+k] * A[k*N+j];

N3 multiplies, N3 adds

about 1 load per operation

25

Page 48: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

a transformation

for (int kk = 0; kk < N; kk += 2)for (int k = kk; k < kk + 2; ++k)

for (int i = 0; i < N; i += 2)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i*N+k] * A[k*N+j];

split the loop over k — should be exactly the same(assuming even N)

26

Page 49: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

a transformation

for (int kk = 0; kk < N; kk += 2)for (int k = kk; k < kk + 2; ++k)

for (int i = 0; i < N; i += 2)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i*N+k] * A[k*N+j];

split the loop over k — should be exactly the same(assuming even N)

26

Page 50: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking

for (int kk = 0; kk < N; kk += 2)/* was here: for (int k = kk; k < kk + 2; ++k) */for (int i = 0; i < N; i += 2)

for (int j = 0; j < N; ++j)for (int k = kk; k < kk + 2; ++k)

B[i*N+j] += A[i*N+k] * A[k*N+j];

now reorder split loop

27

Page 51: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking

for (int kk = 0; kk < N; kk += 2)/* was here: for (int k = kk; k < kk + 2; ++k) */for (int i = 0; i < N; i += 2)

for (int j = 0; j < N; ++j)for (int k = kk; k < kk + 2; ++k)

B[i*N+j] += A[i*N+k] * A[k*N+j];

now reorder split loop

27

Page 52: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */B[i*N+j] += A[i*N+kk] * A[kk*N+j];B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];

}}

}

28

Page 53: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */B[i*N+j] += A[i*N+kk] * A[kk*N+j];B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];

}}

}

Temporal locality in Bijs

28

Page 54: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */B[i*N+j] += A[i*N+kk] * A[kk*N+j];B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];

}}

}

More spatial locality in Aik

28

Page 55: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking – expanded

for (int kk = 0; kk < N; kk += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */B[i*N+j] += A[i*N+kk] * A[kk*N+j];B[i*N+j] += A[i*N+kk+1] * A[(kk+1)*N+j];

}}

}

Still have good spatial locality in Akj, Bij

28

Page 56: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

improvement in read misses

0 100 200 300 400 500 600N

0

5

10

15

20read misses/1K instructions of unblocked

blocked (kk+=2)unblocked

29

Page 57: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking (2)

same thing for i in addition to k?

for (int kk = 0; kk < N; kk += 2) {for (int ii = 0; ii < N; ii += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */for (int k = kk; k < kk + 2; ++k)

for (int i = 0; i < ii + 2; ++i)B[i*N+j] += A[i*N+k] * A[k*N+j];

}}

}

30

Page 58: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking — expanded

for (int k = 0; k < N; k += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */Bi+0,j += Ai+0,k+0 * Ak+0,j

Bi+0,j += Ai+0,k+1 * Ak+1,j

Bi+1,j += Ai+1,k+0 * Ak+0,j

Bi+1,j += Ai+1,k+1 * Ak+1,j}

}}

Now Akj reused in inner loop — more calculationsper load!

31

Page 59: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

simple blocking — expanded

for (int k = 0; k < N; k += 2) {for (int i = 0; i < N; i += 2) {for (int j = 0; j < N; ++j) {

/* process a "block": */Bi+0,j += Ai+0,k+0 * Ak+0,j

Bi+0,j += Ai+0,k+1 * Ak+1,j

Bi+1,j += Ai+1,k+0 * Ak+0,j

Bi+1,j += Ai+1,k+1 * Ak+1,j}

}}

Now Akj reused in inner loop — more calculationsper load!

31

Page 60: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage (better)

Aik to Ai+1,k+1

Ak0to

Ak+1,N

Bi0 to Bi+1,N

more temporal locality:N calculations for each Aik

2 calculations for each Bij (for k, k + 1)2 calculations for each Akj (for k, k + 1)

more spatial locality:calculate on each Ai,k and Ai,k+1 togetherboth in same cache block — same amount of cache loads

32

Page 61: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage (better)

Aik to Ai+1,k+1

Ak0to

Ak+1,N

Bi0 to Bi+1,N

more temporal locality:N calculations for each Aik

2 calculations for each Bij (for k, k + 1)2 calculations for each Akj (for k, k + 1)

more spatial locality:calculate on each Ai,k and Ai,k+1 togetherboth in same cache block — same amount of cache loads

32

Page 62: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {

with I by K block of A cached:for (int jj = 0; jj < N; jj += J) {

with K by J block of A, I by J block of B cached:for i, j, k in I by J by K block:

B[i * N + j] += A[i * N + k]* A[k * N + j];

}}

}

33

Page 63: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {

with I by K block of A cached:for (int jj = 0; jj < N; jj += J) {

with K by J block of A, I by J block of B cached:for i, j, k in I by J by K block:

B[i * N + j] += A[i * N + k]* A[k * N + j];

}}

}

Bij used K times for one miss

33

Page 64: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {

with I by K block of A cached:for (int jj = 0; jj < N; jj += J) {

with K by J block of A, I by J block of B cached:for i, j, k in I by J by K block:

B[i * N + j] += A[i * N + k]* A[k * N + j];

}}

}

Aik used > J times for one miss

33

Page 65: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {

with I by K block of A cached:for (int jj = 0; jj < N; jj += J) {

with K by J block of A, I by J block of B cached:for i, j, k in I by J by K block:

B[i * N + j] += A[i * N + k]* A[k * N + j];

}}

}

Akj used I times for one miss

33

Page 66: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

generalizing cache blocking

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {

with I by K block of A cached:for (int jj = 0; jj < N; jj += J) {

with K by J block of A, I by J block of B cached:for i, j, k in I by J by K block:

B[i * N + j] += A[i * N + k]* A[k * N + j];

}}

}

catch: IK + KJ + IJ elements must fit in cache

33

Page 67: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

keeping values in cache

can’t explicitly ensure values are kept in cache

…but reusing values effectively does thiscache will try to keep recently used values

cache optimization idea: choose what’s in the cache

34

Page 68: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: block

Aik block(I × K)

Akj block(K × J) Bij block

(I × J)

inner loop keeps “blocks” from A, B in cache

Bij calculation uses strips from AK calculations for one load (cache miss)Aik calculation uses strips from A, BJ calculations for one load (cache miss)

(approx.) KIJ fully cached calculationsfor KI + IJ + KJ loads(assuming everything stays in cache)

35

Page 69: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: block

Aik block(I × K)

Akj block(K × J) Bij block

(I × J)

inner loop keeps “blocks” from A, B in cache

Bij calculation uses strips from AK calculations for one load (cache miss)

Aik calculation uses strips from A, BJ calculations for one load (cache miss)

(approx.) KIJ fully cached calculationsfor KI + IJ + KJ loads(assuming everything stays in cache)

35

Page 70: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: block

Aik block(I × K)

Akj block(K × J) Bij block

(I × J)

inner loop keeps “blocks” from A, B in cacheBij calculation uses strips from AK calculations for one load (cache miss)

Aik calculation uses strips from A, BJ calculations for one load (cache miss)

(approx.) KIJ fully cached calculationsfor KI + IJ + KJ loads(assuming everything stays in cache)

35

Page 71: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

array usage: block

Aik block(I × K)

Akj block(K × J) Bij block

(I × J)

inner loop keeps “blocks” from A, B in cacheBij calculation uses strips from AK calculations for one load (cache miss)Aik calculation uses strips from A, BJ calculations for one load (cache miss)

(approx.) KIJ fully cached calculationsfor KI + IJ + KJ loads(assuming everything stays in cache)

35

Page 72: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking efficiency

load I × K elements of Aik:do > J multiplies with each

load K × J elements of Akj:do I multiplies with each

load I × J elements of Bij:do K adds with each

bigger blocks — more work per load!

catch: IK + KJ + IJ elements must fit in cache

36

Page 73: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking rule of thumb

fill the most of the cache with useful data

and do as much work as possible from that

example: my desktop 32KB L1 cache

I = J = K = 48 uses 482 × 3 elements, or 27KB.

assumption: conflict misses aren’t important

37

Page 74: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

view 2: divide and conquer

partial_square(float *A, float *B,int startI, int endI, ...) {

for (int i = startI; i < endI; ++i) {for (int j = startJ; j < endJ; ++j) {

...}square(float *A, float *B, int N) {for (int ii = 0; ii < N; ii += BLOCK)...

/* segment of A, B in use fits in cache! */partial_square(

A, B,ii, ii + BLOCK,jj, jj + BLOCK, ...);

}38

Page 75: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking ugliness — fringe

fullblock

partialblock

39

Page 76: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking ugliness — fringe

for (int kk = 0; kk < N; kk += K) {for (int ii = 0; ii < N; ii += I) {for (int jj = 0; jj < N; jj += J) {

for (int k = kk; k < min(kk+K, N) ; ++k) {// ...

}}

}}

40

Page 77: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking ugliness — fringe

for (kk = 0; kk + K <= N; kk += K) {for (ii = 0; ii + I <= N; ii += I) {for (jj = 0; jj + J <= N; ii += J) {

// ...}for (; jj < N; ++jj) {

// handle remainder}

}for (; ii < N; ++ii) {// handle remainder

}}for (; kk < N; ++kk) {// handle remainder

} 41

Page 78: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking and miss rate

0 100 200 300 400 500 600N

0.000.010.020.030.040.050.060.070.080.09 read misses/multiply or add

blockedunblocked

42

Page 79: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

what about performance?

0 100 200 300 400 500 600N

0.0

0.5

1.0

1.5

2.0 cycles per multiply/add [less optimized loop]

unblockedblocked

0 200 400 600 800 1000N

0.0

0.1

0.2

0.3

0.4

0.5 cycles per multiply/add [optimized loop]

unblockedblocked

43

Page 80: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

performance for big sizes

0 2000 4000 6000 8000 10000N

0.0

0.2

0.4

0.6

0.8

1.0matrix inL3 cache

cycles per multiply or addunblockedblocked

44

Page 81: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

optimized loop???

performance difference wasn’t visible at small sizes

until I optimized arithmetic in the loop

(by supplying better options to GCC)

1: loading Bi,j through Bi,j+7 with one instruction

2: doing adds and multiplies with less instructions

3: simplifying address computations

but… how can that make cache blocking better???

45

Page 82: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

optimized loop???

performance difference wasn’t visible at small sizes

until I optimized arithmetic in the loop

(by supplying better options to GCC)

1: loading Bi,j through Bi,j+7 with one instruction

2: doing adds and multiplies with less instructions

3: simplifying address computations

but… how can that make cache blocking better???45

Page 83: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

overlapping loads and arithmetic

time

load load load

multiply

add

multiply multiply multiply multiply

add add add

speed of load might not matter if these are slower

46

Page 84: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

register reuse

for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i)for (int j = 0; j < N; ++j)

B[i*N+j] += A[i*N+k] * A[k*N+j];// optimize into:for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i) {float Aik = A[i*N+k]; // hopefully keep in register!

// faster than even cache hit!for (int j = 0; j < N; ++j)

B[i*N+j] += Aik * A[k*N+j];}

}

can compiler do this for us?47

Page 85: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

can compiler do register reuse?

Not easily — What if A = B?

for (int k = 0; k < N; ++k)for (int i = 0; i < N; ++i) {// want to preload A[i*N+k] here!for (int j = 0; j < N; ++j) {

// but if A = B, modifying here!B[i*N+j] += A[i*N+k] * A[k*N+j];

}}

}

48

Page 86: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

Automatic register reuse

Compiler would need to generate overlap check:if ((B > A + N * N || B < A) &&

(B + N * N > A + N * N ||B + N * N < A)) {

for (int k = 0; k < N; ++k) {for (int i = 0; i < N; ++i) {

float Aik = A[i*N+k];for (int j = 0; j < N; ++j) {

B[i*N+j] += Aik * A[k*N+j];}

}}

} else { /* other version */ }

49

Page 87: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

“register blocking”

for (int k = 0; k < N; ++k) {for (int i = 0; i < N; i += 2) {

float Ai0k = A[(i+0)*N + k];float Ai1k = A[(i+1)*N + k];for (int j = 0; j < N; j += 2) {

float Akj0 = A[k*N + j+0];float Akj1 = A[k*N + j+1];B[(i+0)*N + j+0] += Ai0k * Akj0;B[(i+1)*N + j+0] += Ai1k * Akj0;B[(i+0)*N + j+1] += Ai0k * Akj1;B[(i+1)*N + j+1] += Ai1k * Akj1;

}}

}

50

Page 88: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache blocking: summary

reorder calculation to reduce cache misses:

make explicit choice about what is in cache

perform calculations in cache-sized blocks

get more spatial and temporal localitytemporal locality — reuse values in manycalculations

before they are replaced in the cache

spatial locality — use adjacent values in calculationsbefore cache block is replaced

51

Page 89: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

avoiding conflict misses

problem — array is scattered throughout memory

observation: 32KB cache can store 32KB contiguousarray

contiguous array is split evenly among sets

solution: copy block into contiguous array

52

Page 90: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

avoiding conflict misses (code)

process_block(ii, jj, kk) {float B_copy[I * J];/* pseudocode for loop to save space */for i = ii to ii + I, j = jj to jj + J:

B_copy[i * J + j] = B[i * N + j];for i = ii to ii + I, j = jj to jj + J, k:

B_copy[i * J + j] += A[k * N + j] * A[i * N + k];for all i, j:

B[i * N + j] = B_copy[i * J + j];}

53

Page 91: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

prefetching

processors detect sequential access patterns

e.g. accessing memory address 0, 8, 16, 24, …?

processor will prefetch 32, 48, etc.

another way to take advantage of spatial locality

part of why miss rate is so low

54

Page 92: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum

int sum1(int matrix[4][8]) {int sum = 0;for (int i = 0; i < 4; ++i) {

for (int j = 0; j < 8; ++j) {sum += matrix[i][j];

}}

}

access pattern:matrix[0][0], [0][1], [0][2], …, [1][0] …

55

Page 93: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum: spatial locality

[0][0] iter. 0 miss[0][1] iter. 1 hit (same block as before)[0][2] iter. 2 miss[0][3] iter. 3 hit (same block as before)[0][4] iter. 4 miss[0][5] iter. 5 hit[0][6] iter. 6 …[0][7] iter. 7[1][0] iter. 8[1][1] iter. 9… …

matrix in memory (4 bytes/row)

8-bytecache block?

56

Page 94: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum: spatial locality

[0][0] iter. 0 miss[0][1] iter. 1 hit (same block as before)[0][2] iter. 2 miss[0][3] iter. 3 hit (same block as before)[0][4] iter. 4 miss[0][5] iter. 5 hit[0][6] iter. 6 …[0][7] iter. 7[1][0] iter. 8[1][1] iter. 9… …

matrix in memory (4 bytes/row)8-byte

cache block?

56

Page 95: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum: spatial locality

[0][0] iter. 0 miss[0][1] iter. 1 hit (same block as before)[0][2] iter. 2 miss[0][3] iter. 3 hit (same block as before)[0][4] iter. 4 miss[0][5] iter. 5 hit[0][6] iter. 6 …[0][7] iter. 7[1][0] iter. 8[1][1] iter. 9… …

matrix in memory (4 bytes/row)8-byte

cache block?

56

Page 96: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

block size and spatial locality

larger blocks — exploit spatial locality

… but larger blocks means fewer blocks for same size

less good at exploiting temporal locality

57

Page 97: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

alternate matrix sum

int sum2(int matrix[4][8]) {int sum = 0;// swapped loop orderfor (int j = 0; j < 8; ++j) {

for (int i = 0; i < 4; ++i) {sum += matrix[i][j];

}}

}

access pattern:matrix[0][0], [1][0], [2][0], …, [0][1], …

58

Page 98: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum: bad spatial locality

[0][0] iter. 0[0][1] iter. 4[0][2] iter. 8[0][3] iter. 12[0][4] iter. 16[0][5] iter. 20[0][6] iter. 24[0][7] iter. 28[1][0] iter. 1[1][1] iter. 5… …

matrix in memory (4 bytes/row)

8-bytecache block?

miss unless value notevicted for 4 iterations

59

Page 99: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

matrix sum: bad spatial locality

[0][0] iter. 0[0][1] iter. 4[0][2] iter. 8[0][3] iter. 12[0][4] iter. 16[0][5] iter. 20[0][6] iter. 24[0][7] iter. 28[1][0] iter. 1[1][1] iter. 5… …

matrix in memory (4 bytes/row)8-byte

cache block?miss unless value notevicted for 4 iterations

59

Page 100: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

data cache miss rates:Cache size direct-mapped 2-way 8-way fully assoc.1KB 8.63% 6.97% 5.63% 5.34%2KB 5.71% 4.23% 3.30% 3.05%4KB 3.70% 2.60% 2.03% 1.90%16KB 1.59% 0.86% 0.56% 0.50%64KB 0.66% 0.37% 0.10% 0.001%128KB 0.27% 0.001% 0.0006% 0.0006%

Data: Cantin and Hill, “Cache Performance for SPEC CPU2000 Benchmarks”http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/ 60

Page 101: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

data cache miss rates:Cache size direct-mapped 2-way 8-way fully assoc.1KB 8.63% 6.97% 5.63% 5.34%2KB 5.71% 4.23% 3.30% 3.05%4KB 3.70% 2.60% 2.03% 1.90%16KB 1.59% 0.86% 0.56% 0.50%64KB 0.66% 0.37% 0.10% 0.001%128KB 0.27% 0.001% 0.0006% 0.0006%

Data: Cantin and Hill, “Cache Performance for SPEC CPU2000 Benchmarks”http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/ 60

Page 102: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

is LRU always better?

least recently used exploits temporal locality

61

Page 103: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

making LRU look bad

* = least recently useddirect-mapped (2 sets) fully-associative (1 set)

read 0 miss: mem[0]; — miss: mem[0], —*read 1 miss: mem[0]; mem[1] miss: mem[0]*, mem[1]read 3 miss: mem[0]; mem[3] miss: mem[3], mem[1]*read 0 hit: mem[0]; mem[3] miss: mem[3]*, mem[0]read 2 miss: mem[2]; mem[3] miss: mem[2], mem[0]*read 3 hit: mem[2]; mem[3] miss: mem[2]*, mem[3]read 1 hit: mem[2]; mem[1] hit: mem[1], mem[3]*read 2 hit: mem[2]; mem[1] miss: mem[1]*, mem[2]

62

Page 104: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag data valid tag data1 10 00 11 1 00 AA BB

1 11 B4 B5 1 01 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

63

Page 105: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag data valid tag data1 10 00 11 1 00 AA BB

1 11 B4 B5 1 01 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

63

Page 106: Cache Performance IIcr4bd/3330/S2017/... · 3/30/2017  · A[k*114+j] isat 10 0000 0000 0100 A[k*114+j+1] isat 10 0000 0000 1000 A[(k+1)*114+j] isat 10 0011 1001 0100 A[(k+2)*114+j]

cache operation (associative)

valid tag data valid tag data1 10 00 11 1 00 AA BB

1 11 B4 B5 1 01 33 44

10011 1

index

=

=

tag

AND

AND

OR is hit? (1)

offset

data(B5)

63