

Cache (Memory) Performance Optimization


Improving Cache Performance

Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:
• reduce the miss rate (e.g., larger cache)
• reduce the miss penalty (e.g., L2 cache)
• reduce the hit time

The simplest design strategy: build the largest primary cache that neither slows down the clock nor adds pipeline stages.


RAM-tag 2-way Set-Associative Cache

[Figure: RAM-tag 2-way set-associative cache. The address splits into a t-bit tag and a k-bit index; the index selects one set from each of two tag/data RAMs, two comparators (=) check the stored tags against the t-bit address tag, and a MUX steers the data word from the matching way to the output.]


CAM-tag 2-way Set-Associative Cache

[Figure: CAM-tag 2-way set-associative cache. The stored tags are held in CAM cells, so the incoming t-bit tag is compared against every tag in the selected set in parallel; the match lines demux directly into the data array to read out the data word.]


Causes for Cache Misses

• Compulsory: first reference to a block, a.k.a. cold-start misses - misses that would occur even with an infinite cache

• Capacity: the cache is too small to hold all the data the program needs - misses that would occur even under a perfect placement & replacement policy

• Conflict: misses caused by collisions due to the block-placement strategy - misses that would not occur with full associativity


Block-level Optimizations

• Tags are too large, i.e., too much overhead
– Simple solution: larger blocks, but the miss penalty could be large
• Sub-block placement
– A valid bit is added to units smaller than the full block, called sub-blocks
– Only read a sub-block on a miss
– If a tag matches, is the word in the cache?

[Figure: three cached blocks with tags 100, 300, and 204; each block carries one valid bit per sub-block (e.g., 1 1 1, 1 0 1, 0 0 0), so only some sub-blocks of a block may be present.]

The main reason for sub-block placement is to reduce tag overhead.


Write Alternatives

- Writes take two cycles in the memory stage: one cycle for the tag check plus one cycle for the data write if it hits

- Design a data RAM that can perform a read and a write in one cycle; restore the old value after a tag miss

- Hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check
- Need to bypass from the write buffer if a read matches the write buffer


Pipelining Cache Writes (Alpha 21064)

[Figure: Alpha 21064 write pipeline. The address splits into a t-bit tag, a k-bit index, and a b-bit block offset; the tag RAM and the data RAM (2^k lines) are separate, the comparator produces HIT, and a delayed write enable (WE) with a bypass path lets a store's data write overlap the next access's tag check.]

Writes usually take longer than reads because the tags have to be checked before writing the data. First, tags and data are split, so they can be addressed independently.


Prefetching

• Speculate on future instruction and data accesses and fetch them into cache(s)
– Instruction accesses are easier to predict than data accesses
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
• What types of misses does prefetching affect?

Prefetching reduces compulsory misses; it can increase conflict and capacity misses.


Issues in Prefetching

• Usefulness - should produce hits
• Timeliness - not late and not too early
• Cache and bandwidth pollution

[Figure: CPU with register file (RF), split L1 instruction and data caches, and a unified L2 cache; prefetched data flows from the L2 into the L1 caches.]


Hardware Prefetching of Instructions

• Instruction prefetch in the Alpha AXP 21064
– Fetch two blocks on a miss: the requested block and the next consecutive block
– The requested block is placed in the cache, and the next block in an instruction stream buffer

[Figure: CPU with RF, L1 instruction cache, and unified L2 cache; on a miss, the requested block goes into the L1 instruction cache and the prefetched block into a 4-block stream buffer.]

Need to check the stream buffer to see if the requested block is in there. Never more than one 32-byte block in the stream buffer.


Hardware Data Prefetching

• One Block Lookahead (OBL) scheme
– Initiate a prefetch for block b+1 when block b is accessed
– Why is this different from doubling the block size?
• Prefetch-on-miss:
– Prefetch b+1 upon a miss on b
• Tagged prefetch:
– A tag bit for each memory block
– Tag bit = 0 signifies that a block is demand-fetched, or that a prefetched block is referenced for the first time
– A prefetch for b+1 is initiated only if the tag bit = 0 on b

The HP PA 7200 uses OBL prefetching. Tagged prefetch is twice as effective as prefetch-on-miss in reducing miss rates.


Comparison of Variant OBL Schemes

[Figure: prefetch-on-miss vs. tagged prefetch accessing contiguous blocks. With prefetch-on-miss, a miss on block b fetches b and prefetches b+1, but the hit on the prefetched b+1 triggers nothing, so the stream misses again on b+2. With tagged prefetch, the first reference to a prefetched block (its tag bit going from 1 to 0) prefetches the next block, so the prefetcher stays ahead of a sequential stream.]


Software Prefetching

for(i = 0; i < N; i++) {
    fetch(&a[i + 1]);
    fetch(&b[i + 1]);
    SUM = SUM + a[i] * b[i];
}

• What property do we require of the cache for prefetching to work?

The cache should be non-blocking, or lockup-free. By that we mean that the processor can proceed while the prefetched data is being fetched, and the caches continue to supply instructions and data while waiting for the prefetched data to return.


Software Prefetching Issues

• Timing is the biggest issue, not predictability
– If you prefetch very close to when the data is required, you might be too late
– Prefetch too early, and you cause pollution
– Estimate how long it will take for the data to come into L1, so we can set P appropriately
– Why is this hard to do?

for(i = 0; i < N; i++) {
    fetch(&a[i + P]);
    fetch(&b[i + P]);
    SUM = SUM + a[i] * b[i];
}

Don't ignore the cost of the prefetch instructions.


Compiler Optimizations

• Restructuring code affects the data block access sequence
– Group data accesses together to improve spatial locality
– Reorder data accesses to improve temporal locality
• Prevent data from entering the cache
– Useful for variables that are only accessed once
• Kill data that will never be used
– Streaming data exploits spatial locality but not temporal locality

Miss-rate reduction without any hardware changes: the hardware designer's favorite solution.


Loop Interchange

for(j = 0; j < N; j++) {
    for(i = 0; i < M; i++) {
        x[i][j] = 2 * x[i][j];
    }
}

for(i = 0; i < M; i++) {
    for(j = 0; j < N; j++) {
        x[i][j] = 2 * x[i][j];
    }
}

What type of locality does this improve?


Loop Fusion

for(i = 0; i < N; i++)
    for(j = 0; j < M; j++)
        a[i][j] = b[i][j] * c[i][j];

for(i = 0; i < N; i++)
    for(j = 0; j < M; j++)
        d[i][j] = a[i][j] * c[i][j];

for(i = 0; i < N; i++)
    for(j = 0; j < M; j++) {
        a[i][j] = b[i][j] * c[i][j];
        d[i][j] = a[i][j] * c[i][j];
    }

What type of locality does this improve?


Blocking

for(i = 0; i < N; i++)
    for(j = 0; j < N; j++) {
        r = 0;
        for(k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

[Figure: access patterns of x, y, and z in the i/j/k index space, shaded to show elements not touched, old accesses, and new accesses.]


Blocking

for(jj = 0; jj < N; jj = jj + B)
    for(kk = 0; kk < N; kk = kk + B)
        for(i = 0; i < N; i++)
            for(j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for(k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

What type of locality does this improve?

[Figure: blocked access patterns of x, y, and z.]

y benefits from spatial locality; z benefits from temporal locality.


CPU-Cache Interaction (5-stage pipeline)

[Figure: 5-stage pipeline with primary caches. The PC addresses the primary instruction cache in fetch; its hit? signal gates PCen and injects a nop into the IR on a miss. The primary data cache sits in the memory stage (addr from the ALU, wdata/rdata, write enable), with refill data arriving from lower levels of the memory hierarchy and miss signals going to the memory controller.]

Stall the entire CPU on a data cache miss.


Memory (DRAM) Performance

• Upon a cache miss
– 4 clocks to send the address
– 24 clocks for the access time per word
– 4 clocks to send a word of data
• Latency worsens with increasing block size

A 1 Gb DRAM has a 50-100 ns access time and needs refreshing.

A 4-word block needs 128 or 116 clocks: 128 for a dumb memory.


Memory Organizations

[Figure: three organizations. (1) One-word-wide memory: CPU, cache, bus, and memory are all one word wide. (2) Wide memory: the bus and memory are widened to move multiple words at once. (3) Interleaved memory: several one-word-wide banks (bank 0, bank 1, ...) share the bus.]

The Alpha AXP 21064 uses a 256-bit-wide memory and cache.


Interleaved Memory

• Banks are often 1 word wide
• Send an address to all the banks
• How long to get 4 words back?

[Figure: CPU and cache connected over a single bus to banks B0-B3; bank i contains all words whose address modulo N is i.]

4 + 24 + 4 x 4 = 44 clocks from interleaved memory.


Independent Memory

• Send an address to all the banks
• How long to get 4 words back?

[Figure: CPU with a non-blocking cache; each of banks B0-B3 has its own bus, and bank i contains all words whose address modulo N is i.]

4 + 24 + 4 = 32 clocks from main memory for 4 words.


Memory Bank Conflicts

Consider a 128-bank memory in the NEC SX/3, where each bank can service independent requests:

int x[256][512];
for(j = 0; j < 512; j++)
    for(i = 0; i < 256; i++)
        x[i][j] = 2 * x[i][j];

Consider column k, i.e., x[i][k] for 0 <= i < 256. The address of each element in the column is i*512 + k. Where do these addresses go?


Bank Assignment

[Figure: four banks B0-B3; consecutive addresses 0, 1, 2, 3 go to banks B0, B1, B2, B3, and each bank stores its words at successive addresses within the bank.]

Bank number = Address MOD NumBanks
Address within bank = Address DIV NumBanks

Should randomize the address-to-bank mapping. Could use an odd/prime number of banks. Problem?

The bank number for i*512 + k should depend on i: we can pick more significant bits of the address.
