EENG449b/Savvides Lec 18.1 4/13/04 April 13, 2004 Prof. Andreas Savvides Spring 2004 http://www.eng.yale.edu/courses/ eeng449bG EENG 449bG/CPSC 439bG Computer Systems Lecture 18 Memory Hierarchy Design Part II


EENG449b/SavvidesLec 18.1

4/13/04

April 13, 2004

Prof. Andreas Savvides

Spring 2004

http://www.eng.yale.edu/courses/eeng449bG

EENG 449bG/CPSC 439bG Computer Systems

Lecture 18

Memory Hierarchy Design Part II


Q1: Where can a Block Be Placed in a Cache?


Set Associativity

• Direct mapped = one-way set associative

• Fully associative = set associative with 1 set

• Most popular cache configurations in today’s processors

– Direct mapped, 2-way set associative, 4-way set associative


Examples

• 32 KB cache for a byte addressable processor, 32-bit address space. Which bits of the address are used for the tag, index and byte-within-block for the following configuration:

a) 8-byte block size, direct mapped

8-byte block size => 3 bits for byte-within-block

32 KB / 8 B = 4 K blocks in the cache => need 12 bits to index

32 bits – (12 + 3) bits = 17 bits remaining => need 17 bits for every tag

Address fields (bit 31 is the most significant):

tag: bits 31–15 | index: bits 14–3 | byte-within-block: bits 2–0


Examples

• 32 KB cache for a byte addressable processor, 32-bit address space. Which bits of the address are used for the tag, index and byte-within-block for the following configuration:

a) 4-byte block size, 8-way set associative

4-byte block size => 2 bits for byte-within-block

32 KB / (4 B x 8) = 1 K sets in the cache => need 10 bits to index

32 bits – (10 + 2) bits = 20 bits remaining => need 20 bits for every tag

Address fields (bit 31 is the most significant):

tag: bits 31–12 | index: bits 11–2 | byte-within-block: bits 1–0
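The two address breakdowns can be checked with a short calculation. This is a minimal sketch, not part of the lecture: the function name `address_fields` is hypothetical, and it assumes the cache size, block size and number of sets are all powers of two.

```python
def address_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a byte-addressable cache.

    Assumes cache_bytes, block_bytes and the resulting number of sets
    are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 32 KB, 8-byte blocks, direct mapped (1 way)
print(address_fields(32 * 1024, 8, 1))   # (17, 12, 3)
# 32 KB, 4-byte blocks, 8-way set associative
print(address_fields(32 * 1024, 4, 8))   # (20, 10, 2)
```

A direct-mapped cache is passed as `ways=1`, which matches the earlier observation that direct mapped is one-way set associative.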


Q2: How is a block found if it is in the cache?

Address = Tag | Index | Block offset

• Block offset: selects the desired data from the block

• Index: selects the set

• Tag: compared against for a hit

• If cache size remains the same, increasing associativity increases the number of blocks per set => decreases index size and increases tag size


Examples

• Processor contains a 16 word, direct mapped cache, with a 4 word block size. Which of the following addresses will hit in the cache?

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

4-word block size => 2 bits for word-within-block

16 words / (4 words) = 4 blocks in the cache => need 2 bits to index

6 bits – (2 + 2) bits = 2 bits remaining => need 2 bits for every tag

Cache contents after the first seven accesses (0 through 14); accesses 1, 5 and 6 hit:

Index 0 (words 0–3): 0 1 2 3
Index 1 (words 4–7): 4 5 6 7
Index 2 (words 8–11): 8 9 10 11
Index 3 (words 12–15): 12 13 14 15


Examples

• Same cache and address sequence as on the previous slide. Access 32: index 0, tag differs => miss

Index 0: 32 33 34 35 (block replacement: words 0–3 are evicted)
Index 1: 4 5 6 7
Index 2: 8 9 10 11
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 4 (second time): index 1, tag matches => hit, no replacement

Index 0: 32 33 34 35
Index 1: 4 5 6 7
Index 2: 8 9 10 11
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 16: index 0, tag differs => miss

Index 0: 16 17 18 19 (block replacement: words 32–35 are evicted)
Index 1: 4 5 6 7
Index 2: 8 9 10 11
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 23: index 1, tag differs => miss

Index 0: 16 17 18 19
Index 1: 20 21 22 23 (block replacement: words 4–7 are evicted)
Index 2: 8 9 10 11
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 24: index 2, tag differs => miss

Index 0: 16 17 18 19
Index 1: 20 21 22 23
Index 2: 24 25 26 27 (block replacement: words 8–11 are evicted)
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 35: index 0, tag differs => miss

Index 0: 32 33 34 35 (block replacement: words 16–19 are evicted)
Index 1: 20 21 22 23
Index 2: 24 25 26 27
Index 3: 12 13 14 15


Examples

• Same cache and address sequence. Access 12: index 3, tag matches => hit, no replacement

Index 0: 32 33 34 35
Index 1: 20 21 22 23
Index 2: 24 25 26 27
Index 3: 12 13 14 15
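The direct-mapped walkthrough can be replayed in a few lines. This is a minimal sketch under the slide's parameters (4 blocks of 4 words, word addresses); the function name `simulate_direct_mapped` is hypothetical and not from the lecture.

```python
def simulate_direct_mapped(addresses, num_blocks=4, block_words=4):
    """Return a list of (address, 'hit' or 'miss') for a direct-mapped cache."""
    tags = [None] * num_blocks                 # tag held by each block; None = empty
    results = []
    for addr in addresses:
        block_number = addr // block_words     # drop the word-within-block bits
        index = block_number % num_blocks      # low bits of the block number
        tag = block_number // num_blocks       # remaining high bits
        if tags[index] == tag:
            results.append((addr, "hit"))
        else:
            tags[index] = tag                  # fill an empty block or replace the old one
            results.append((addr, "miss"))
    return results

sequence = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
hits = [addr for addr, outcome in simulate_direct_mapped(sequence) if outcome == "hit"]
print(hits)  # [1, 5, 6, 4, 12]
```

The printed list agrees with the slide sequence: the only hits are 1, 5, 6, the second access to 4, and 12.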


Examples

• Processor contains a 16 word, 4-way set associative cache, with a 1 word block size. Which of the following addresses will hit in the cache? (LRU Replacement)

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

1-word block size => 0 bits for word-within-block

16 words / (4 ways x 1 word) = 4 sets in the cache => need 2 bits to index

6 bits – (2 + 0) bits = 4 bits remaining => need 4 bits for every tag

Cache contents after the accesses 0 through 23 (only the second access to 4 hits):

Set 0: 0 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23


Examples

• Same 4-way set associative cache (LRU replacement) and address sequence as on the previous slide. Access 24: set 0 is full => miss; the least recently used block, 0, is replaced

Set 0: 24 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23

Block replacement


Examples

• Same 4-way set associative cache (LRU replacement) and address sequence. Access 35: miss; set 3 still has a free way, so no block is replaced

Set 0: 24 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23 35


Examples

• Same 4-way set associative cache (LRU replacement) and address sequence. Access 12: set 0 is full => miss; the least recently used block is now 32 (4 was reused after 32 was loaded), so 32 is replaced

Set 0: 24 4 12 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23 35

Block replacement
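The LRU walkthrough can be replayed the same way. The sketch below is illustrative only: `simulate_lru` is a hypothetical name, and it uses Python's `collections.OrderedDict` to keep each set ordered from least to most recently used, which is one of several ways to model LRU.

```python
from collections import OrderedDict

def simulate_lru(addresses, num_sets=4, ways=4):
    """Return (hits, evictions) for a set-associative cache with 1-word blocks.

    evictions is a list of (evicted_address, incoming_address) pairs.
    """
    sets = [OrderedDict() for _ in range(num_sets)]   # per set: tag -> True, LRU first
    hits, evictions = [], []
    for addr in addresses:
        index = addr % num_sets        # low bits select the set
        tag = addr // num_sets         # remaining bits form the tag
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            hits.append(addr)
            continue
        if len(lines) == ways:         # set is full: evict the LRU block
            old_tag, _ = lines.popitem(last=False)
            evictions.append((old_tag * num_sets + index, addr))
        lines[tag] = True
    return hits, evictions

sequence = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
hits, evictions = simulate_lru(sequence)
print(hits)       # [4]
print(evictions)  # [(0, 24), (32, 12)]
```

The two evictions match the two "Block replacement" slides: 24 replaces 0, and 12 replaces 32.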


Address Breakdown

• Physical address is 44 bits wide: 38-bit block address and 6-bit offset

• Calculating cache index size:

$2^{\text{index}} = \dfrac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \dfrac{65{,}536}{64 \times 2} = 512 = 2^{9}$

• Blocks are 64 bytes so offset needs 6 bits

• Tag size = 38 – 9 = 29 bits


How to Improve Cache Performance?

Four main categories of optimizations

1. Reducing miss penalty
- multilevel caches, critical word first, read misses before writes, merging write buffers, and victim caches

2. Reducing miss rate
- larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations

3. Reducing the miss penalty or miss rate via parallelism
- non-blocking caches, hardware prefetching, and compiler prefetching

4. Reducing the time to hit in the cache
- small and simple caches, avoiding address translation, pipelined cache access

$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$

Last week Today


Reducing miss rate

• Way Prediction:

– Perform tag comparison with a single block in every set

» Fewer comparisons -> simpler hardware -> faster clock

• Pseudoassociative Caches:

– Access proceeds as in a direct-mapped cache for a hit

– If a miss, compare to a second entry for a match, where the second entry can be found fast


Reducing miss rate

• Compiler Optimization:

– Loop Interchange & Blocking:

» Loop interchange: exchange the nesting of the loops to make the code access the data in the order it is stored

    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

Becomes

    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

» Blocking: maximize the use of the data before replacing it
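To see why the loop order matters, one can count misses for both orders in a toy cache model. This is a sketch under assumptions that are not in the lecture: the array is stored row-major, the cache is direct mapped with 256 blocks of 8 array elements each, and `count_misses` is a hypothetical helper.

```python
def count_misses(order, rows=5000, cols=100, num_blocks=256, block_elems=8):
    """Count misses of a toy direct-mapped cache while touching every x[i][j] once.

    order == "j_outer" models the original loop nest (j outer, i inner);
    any other value models the interchanged nest (i outer, j inner).
    """
    cached = [None] * num_blocks      # block number held at each index; None = empty
    misses = 0
    if order == "j_outer":
        accesses = ((i, j) for j in range(cols) for i in range(rows))
    else:
        accesses = ((i, j) for i in range(rows) for j in range(cols))
    for i, j in accesses:
        block = (i * cols + j) // block_elems   # row-major element address -> block
        index = block % num_blocks
        if cached[index] != block:
            cached[index] = block
            misses += 1
    return misses

original = count_misses("j_outer")
interchanged = count_misses("i_outer")
print(original, interchanged)
```

In this model the interchanged nest walks memory sequentially, so it misses once per block (500,000 / 8 = 62,500 misses), while the original nest strides through memory 100 elements at a time and misses far more often.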


Reducing miss Penalty

• Methods include:

1) Multi-level caches

L2 Equations:

$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$

$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$

$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$

Definitions:

– Local Miss Rate: misses in this cache divided by the total number of accesses to this cache ($\text{Miss Rate}_{L2}$)

– Global Miss Rate: misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}$)
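The L2 equations translate directly into code. In the sketch below the function names and all numeric values (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle L2 miss penalty) are made-up illustration values, not figures from the lecture.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, miss_penalty_l2):
    """Average memory access time for a two-level cache hierarchy."""
    # Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    # AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU memory accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2

# Hypothetical numbers, in clock cycles and fractions.
amat = amat_two_level(hit_l1=1, miss_rate_l1=0.04, hit_l2=10,
                      local_miss_rate_l2=0.5, miss_penalty_l2=100)
print(round(amat, 2))                            # 3.4
print(round(global_miss_rate_l2(0.04, 0.5), 4))  # 0.02
```

With these assumed numbers the local L2 miss rate looks high (50%), but the global miss rate is only 2%, because L2 sees only the accesses that already missed in L1.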


Reducing miss Penalty

• Methods include:

2) Critical word first and early restart

– Don’t wait for the full block to be loaded before restarting the CPU

» Early restart – as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution

» Critical word first – request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block

– Very useful with large blocks

– Spatial locality problem: we often want the next sequential word soon, so not always a benefit (early restart)


Reducing miss Penalty

• Methods include:

3) Prioritize read misses over writes

» Write buffers can cause RAW conflicts with main memory reads on cache misses

» Simply waiting for the write buffer to empty might increase the read miss penalty by 50%

» Check write buffer contents before the read: if there is no conflict, let the memory access continue

» Write back?

• Read miss may require write of dirty blocks

• Normal: write dirty block to memory and then do the read

• Instead, copy the dirty block to the write buffer, do the read and then do the write

• CPU stalls less since it can restart as soon as the read completes


Reducing miss Penalty

4) Merging the write buffer

– CPU stalls if the write-back buffer is full

– The buffer may contain an entry matching the address written to

– If so, the writes are merged
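The merging rule above can be sketched as a small data structure. Everything here is an illustrative assumption rather than the lecture's design: the class name `MergingWriteBuffer`, the entry count, the four-word entry width, and the use of word addresses.

```python
class MergingWriteBuffer:
    """Toy write buffer that merges writes falling into an existing entry."""

    def __init__(self, entries=4, words_per_entry=4):
        self.entries = entries
        self.words_per_entry = words_per_entry
        self.buffer = {}   # entry (block) address -> {word offset: value}

    def write(self, word_addr, value):
        """Return 'merged', 'allocated', or 'stall' for one CPU write."""
        block = word_addr // self.words_per_entry
        offset = word_addr % self.words_per_entry
        if block in self.buffer:
            # An entry already covers this address: merge instead of using a new entry.
            self.buffer[block][offset] = value
            return "merged"
        if len(self.buffer) < self.entries:
            self.buffer[block] = {offset: value}
            return "allocated"
        return "stall"     # buffer full and no matching entry: the CPU must wait

buf = MergingWriteBuffer(entries=2)
outcomes = [buf.write(100, 1), buf.write(101, 2), buf.write(200, 3),
            buf.write(300, 4), buf.write(103, 5)]
print(outcomes)  # ['allocated', 'merged', 'allocated', 'stall', 'merged']
```

The sketch omits draining entries to memory; it only shows that merging lets writes to an already-buffered block proceed even when every entry is occupied.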


Reducing miss Penalty

5) Victim caches:

• How to get the hit time of direct-mapped yet still avoid conflict misses?

• Add buffer to place data discarded from the cache

• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4KB direct mapped data cache
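A victim cache can be modelled as a small fully associative buffer beside a direct-mapped cache. The sketch below is a toy model under stated assumptions, not Jouppi's design in detail: the name `simulate_victim_cache`, the FIFO replacement inside the victim buffer, and the sizes are all hypothetical.

```python
from collections import OrderedDict

def simulate_victim_cache(block_addresses, num_blocks=4, victim_entries=4):
    """Return (main_hits, victim_hits, memory_fetches) for a stream of block addresses."""
    main = [None] * num_blocks     # block held at each direct-mapped index
    victim = OrderedDict()         # evicted blocks, oldest first
    main_hits = victim_hits = fetches = 0
    for block in block_addresses:
        index = block % num_blocks
        if main[index] == block:
            main_hits += 1
            continue
        if block in victim:
            del victim[block]      # found among recent victims: swap it back in
            victim_hits += 1
        else:
            fetches += 1           # true miss: go to the next memory level
        evicted = main[index]
        main[index] = block
        if evicted is not None:
            victim[evicted] = True             # discarded block goes to the victim buffer
            if len(victim) > victim_entries:
                victim.popitem(last=False)     # drop the oldest victim
    return main_hits, victim_hits, fetches

# Blocks 0 and 4 map to the same index and keep evicting each other.
print(simulate_victim_cache([0, 4, 0, 4, 0, 4]))                     # (0, 4, 2)
print(simulate_victim_cache([0, 4, 0, 4, 0, 4], victim_entries=0))   # (0, 0, 6)
```

In this ping-pong pattern the victim buffer turns four of the six conflict misses into fast victim hits, which is the effect the slide describes.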