
Page 1: CMPE 421 Parallel Computer Architecture, PART 4: Caching with Associativity

Page 2: Fully Associative Cache

Reducing cache misses by more flexible placement of blocks:
- Instead of direct mapping, we allow any memory block to be placed in any cache slot.
- Many different memory addresses map to each slot; use any available entry to store memory elements.
- Remember: direct mapped caches are more rigid; cache data goes exactly where the index says, even if the rest of the cache is empty.
- In a fully associative cache, nothing gets "thrown out" until the cache is completely full.

Costs:
- It is harder to check for a hit (hit time will increase).
- Requires much more hardware (a comparator for each cache slot).
- Each tag is a complete block address (no index bits are used).

Page 3: Fully Associative Cache

- Must compare the tags of all entries in parallel to find the desired block (if there is a hit); a direct mapped cache only needs to look in one place.
- No conflict misses, only capacity misses.
- Practical only for caches with a small number of blocks, since the parallel search increases hardware cost (a sketch of the lookup follows).
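To make the parallel tag check concrete, here is a minimal behavioral sketch (illustrative Python, not from the slides; in real hardware all comparisons happen simultaneously):

```python
# Behavioral sketch of a fully associative lookup (illustrative only).
# Each entry is (valid, tag, data); the tag is the complete block address.

def fully_associative_lookup(entries, block_address):
    """Return the cached data on a hit, or None on a miss."""
    for valid, tag, data in entries:          # one comparator per entry in hardware
        if valid and tag == block_address:    # the full block address is the tag
            return data                       # hit
    return None                               # miss: no entry matched

# Example: a tiny 4-entry cache with one valid block
cache = [(1, 0b101101, "word@0b101101"), (0, 0, None), (0, 0, None), (0, 0, None)]
print(fully_associative_lookup(cache, 0b101101))  # hit
print(fully_associative_lookup(cache, 0b000001))  # miss -> None
```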

Page 4: Fully Associative Cache (diagram)

Page 5: Direct Mapped vs. Fully Associative

(Diagram comparing the two organizations; each entry holds V, Tag, and Data fields.)

Direct mapped: Address = Tag | Index | Block offset. The cache has 16 indexed entries (0 through 15); each address has only one possible location.

Fully associative: Address = Tag | Block offset. There is no index; a block may be stored in any entry.

Page 6: Trade-off

Fully associative is much more flexible, so the miss rate will be lower. Direct mapped requires less hardware (cheaper) and will also be faster. This is a trade-off of miss rate vs. hit time.

Therefore we might compromise to find the best solution between the direct mapped cache and the fully associative cache. We can provide more flexibility without going to a fully associative placement policy: for each memory location, provide a small number of cache slots that can hold the memory element. This is much more flexible than direct mapped, but requires less hardware than fully associative.

SOLUTION: Set associative

Page 7: Set Associative Cache

- A fixed number of locations where each block can be placed.
- N-way set associative means there are N places (slots) where each block can be placed.
- Divide the cache into a number of sets, each of size N "ways" (N-way set associative).
- A memory block therefore maps to a unique set (specified by the index field) and can be placed in any way of that set, so there are N choices.
- In a set associative cache, a memory block maps to
  (Block address) modulo (Number of sets in the cache)
- Remember that in a direct mapped cache the position of a memory block is given by
  (Block address) modulo (Number of cache blocks)

A sketch of this address decomposition follows.
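As a concrete illustration of the mapping above, here is a minimal sketch (Python, with made-up cache geometry; not taken from the slides) that splits an address into tag, set index, and block offset:

```python
# Sketch: address decomposition for a set-associative cache.
# The geometry passed in (sets, block size) is an illustrative assumption.

def decompose(address, num_sets, block_size_bytes):
    """Split a byte address into (tag, set_index, block_offset)."""
    offset_bits = block_size_bytes.bit_length() - 1   # log2(block size), power of 2
    index_bits = num_sets.bit_length() - 1            # log2(number of sets)
    block_offset = address & (block_size_bytes - 1)
    block_address = address >> offset_bits
    set_index = block_address % num_sets              # (block address) mod (# sets)
    tag = block_address >> index_bits
    return tag, set_index, block_offset

# Example: 8 sets of 16-byte blocks (e.g., a 2-way cache with 16 blocks total)
print(decompose(0x1234, num_sets=8, block_size_bytes=16))  # -> (36, 3, 4)
```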

Page 8: A Compromise

(Diagram: 2-way and 4-way organizations; each entry holds V, Tag, and Data fields.)

2-way set associative: Address = Tag | Index | Block offset. Eight indexed sets (0 through 7); each address has two possible locations with the same index. One fewer index bit than direct mapped: half the indexes.

4-way set associative: Address = Tag | Index | Block offset. Four indexed sets (0 through 3); each address has four possible locations with the same index. Two fewer index bits: one quarter the indexes.

Page 9: Range of Set Associative Caches

Address fields: Tag | Index | Block offset | Byte offset. The tag is used for the tag compare, the index selects the set, and the block offset selects the word in the block. The index is the set number and is used to determine which set the block can be placed in.

Decreasing associativity leads toward direct mapped (only one way, smaller tags); increasing associativity leads toward fully associative (only one set, where the tag is all the bits except the block and byte offsets).

Page 10: Range of Set Associative Caches

For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets. It decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

Address fields: Tag | Index | Block offset | Byte offset (a numeric sketch follows).
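To check this relationship numerically, a small sketch (Python; the 64 KB cache, 32-byte blocks, and 32-bit addresses are assumed values, not from the slides):

```python
# Sketch: index and tag widths vs. associativity for a fixed-size cache.
from math import log2

ADDR_BITS = 32
CACHE_BYTES = 64 * 1024       # assumed total cache size
BLOCK_BYTES = 32              # assumed block size
num_blocks = CACHE_BYTES // BLOCK_BYTES
offset_bits = int(log2(BLOCK_BYTES))

for ways in (1, 2, 4, 8):
    num_sets = num_blocks // ways                      # doubling ways halves the sets
    index_bits = int(log2(num_sets))                   # index shrinks 1 bit per doubling
    tag_bits = ADDR_BITS - index_bits - offset_bits    # tag grows 1 bit per doubling
    print(f"{ways}-way: {num_sets} sets, index={index_bits} bits, tag={tag_bits} bits")
```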

Page 11: Set Associative Cache

(Diagram: a set associative cache with 2 sets of one-word blocks, mapped against a 16-word main memory with addresses 0000xx through 1111xx. The two low-order address bits define the byte in the 32-bit word.)

Q1: How do we find it? Use the next low-order memory address bit to determine which cache set: (block address) modulo (# sets in the cache).

Q2: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.

The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.

Page 12: Set Associative Cache Organization

FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.

Page 13: Set Associative Cache Organization

This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct mapped caches working in parallel.

This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address. If no tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit, and we select the data from the way where the tag match occurs (sketched in code below).

This is simple enough. What are its disadvantages?
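A minimal behavioral sketch of this lookup (illustrative Python; real hardware compares the four tags in parallel and muxes out the matching data):

```python
# Behavioral sketch of a 4-way set-associative lookup (illustrative only).
# cache_sets[index] is a list of 4 entries; each entry is (valid, tag, data).

WAYS = 4

def four_way_lookup(cache_sets, index, tag):
    """Return the data on a hit, or None on a miss."""
    selected_set = cache_sets[index]             # the index selects a set
    for valid, entry_tag, data in selected_set:  # 4 comparators, conceptually parallel
        if valid and entry_tag == tag:
            return data                          # mux selects the matching way's data
    return None                                  # no tag matched: cache miss
```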

Page 14: N-way Set Associative Cache versus Direct Mapped Cache

An N-way set associative cache will also be slower than a direct mapped cache because:
- N comparators vs. 1.
- There is extra MUX delay for the data.
- The data comes AFTER the hit/miss decision and way selection.

In a direct mapped cache, the cache block is available BEFORE the hit/miss decision, so it is possible to assume a hit and continue, recovering later if it was a miss.

Page 15: Remember the Example for Direct Mapping (ping-pong effect)

Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked not valid). Words 0 and 4 map to the same cache block, so each reference evicts the other: every access is a miss.

8 requests, 8 misses. This ping-pong effect is due to conflict misses: two memory locations map into the same cache block.

Page 16: Solution: Use a Set Associative Cache

Consider the same reference string 0 4 0 4 0 4 0 4, again starting with an empty cache. In a 2-way set associative cache, words 0 and 4 map to the same set but can occupy different ways, so after the first two misses every reference hits: miss, miss, hit, hit, hit, hit, hit, hit.

8 requests, 2 misses. This solves the ping-pong effect in the direct mapped cache, since two memory locations that map into the same cache set can now co-exist! (A small simulation follows.)
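To reproduce both results, a small sketch (Python; the 4-block total cache size is an assumed configuration consistent with the example):

```python
# Sketch: replay the reference string 0 4 0 4 0 4 0 4 against a direct
# mapped cache and a 2-way set associative cache of the same total size
# (4 one-word blocks), with LRU replacement within each set.

def simulate(refs, num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set holds up to `ways` tags
    misses = 0
    for block in refs:
        s, tag = block % num_sets, block // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: refresh LRU position
        else:
            misses += 1
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict the least recently used way
        sets[s].append(tag)                # most recently used goes last
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print("direct mapped:  ", simulate(refs, num_sets=4, ways=1), "misses")  # 8
print("2-way set assoc:", simulate(refs, num_sets=2, ways=2), "misses")  # 2
```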

Page 17: Set Associative Example

Address fields: Tag (3 to 5 bits) | Index (1 to 3 bits) | Block offset (2 bits) | Byte offset (2 bits).

Reference string (10-bit binary addresses): 0100111000, 1100110100, 0100111100, 0110110000, 1100111000. Start with an empty cache, all blocks initially marked as not valid.

Direct mapped (8 entries, 3-bit index): every reference maps to index 011, with tags 010, 110, 010, 011, 110. Result: miss, miss, miss, miss, miss.

2-way set associative (4 sets, 2-bit index): every reference maps to set 11, with tags 0100, 1100, 0100, 0110, 1100. Result: miss, miss, hit, miss, miss.

4-way set associative (2 sets, 1-bit index): every reference maps to set 1, with tags 01001, 11001, 01001, 01101, 11001. Result: miss, miss, hit, miss, hit.

Page 18: New Performance Numbers

Miss rates for the DEC 3100 (a MIPS machine), with separate 64KB instruction/data caches:

Benchmark  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc        Direct         2.0%                   1.7%            1.9%
gcc        2-way          1.6%                   1.4%            1.5%
gcc        4-way          1.6%                   1.4%            1.5%
spice      Direct         0.3%                   0.6%            0.4%
spice      2-way          0.3%                   0.6%            0.4%
spice      4-way          0.3%                   0.6%            0.4%

Page 19: Benefits of Set Associative Caches

The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

(Plot: miss rate vs. associativity for 1-way, 2-way, 4-way, and 8-way caches of 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.)

The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Page 20: Benefits of Set Associative Caches

As cache size grows, the relative improvement from associativity increases only slightly. Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases, and the absolute improvement in miss rate from associativity shrinks significantly.

Page 21: Cache Block Replacement Policy

Policies for deciding which block to replace when a new entry comes in:

- Random replacement: hardware randomly selects a cache entry and throws it out.
- First In First Out (FIFO): equally fair / equally unfair to all frames.
- Least Recently Used (LRU): uses the idea of temporal locality to select the entry that has not been accessed recently. Additional bit(s) are required in each cache entry to track access order; they must be updated on each access, and all must be scanned on a replacement. For a two-way set associative cache, one bit suffices for LRU replacement.

A common approach is a pseudo-LRU strategy. Example of a simple pseudo-LRU implementation (assume 64 fully associative entries): a hardware replacement pointer points to one cache entry. Whenever an access is made to the entry the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer. (See the sketch after this list.)
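A minimal sketch of that pointer scheme (illustrative Python, not a hardware model):

```python
# Sketch of the pointer-based pseudo-LRU scheme described above,
# for 64 fully associative entries (illustrative only).

NUM_ENTRIES = 64

class PseudoLRU:
    def __init__(self):
        self.pointer = 0               # entry that would be replaced next

    def on_access(self, entry):
        # If the accessed entry is the current replacement candidate, it is
        # clearly not least recently used: move the pointer to the next entry.
        if entry == self.pointer:
            self.pointer = (self.pointer + 1) % NUM_ENTRIES
        # Otherwise: do not move the pointer.

    def victim(self):
        return self.pointer            # entry chosen for replacement on a miss

plru = PseudoLRU()
plru.on_access(0)                      # pointer moves off the accessed entry
print(plru.victim())                   # -> 1
```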

Page 22: Source of Cache Misses

                 Direct Mapped   N-way Set Associative   Fully Associative
Cache size       Big             Medium                  Small
Compulsory miss  Same            Same                    Same
Conflict miss    High            Medium                  Zero
Capacity miss    Low(er)         Medium                  High

Page 23: Designing a Cache

Design change           Effect on miss rate              Negative performance effect
Increase size           Decreases capacity misses        May increase access time
Increase associativity  Decreases conflict misses        May increase access time
Increase block size     May decrease compulsory misses   May increase miss penalty;
                                                         may increase capacity misses

Note: if you are running billions of instructions, compulsory misses are insignificant.

Page 24: Key Cache Design Parameters

                            L1 typical    L2 typical
Total size (blocks)         250 to 2000   4000 to 250,000
Total size (KB)             16 to 64      500 to 8000
Block size (B)              32 to 64      32 to 128
Miss penalty (clocks)       10 to 25      100 to 1000
Miss rate (global for L2)   2% to 5%      0.1% to 2%

Page 25: Two Machines' Cache Parameters

                  Intel P4                                AMD Opteron
L1 organization   Split I$ and D$                         Split I$ and D$
L1 cache size     8KB for D$, 96KB for trace cache (~I$)  64KB for each of I$ and D$
L1 block size     64 bytes                                64 bytes
L1 associativity  4-way set assoc.                        2-way set assoc.
L1 replacement    ~LRU                                    LRU
L1 write policy   write-through                           write-back
L2 organization   Unified                                 Unified
L2 cache size     512KB                                   1024KB (1MB)
L2 block size     128 bytes                               64 bytes
L2 associativity  8-way set assoc.                        16-way set assoc.
L2 replacement    ~LRU                                    ~LRU
L2 write policy   write-back                              write-back

Page 26: Where can a block be placed/found?

Scheme             # of sets                                Blocks per set
Direct mapped      # of blocks in cache                     1
Set associative    (# of blocks in cache) / associativity   Associativity (typically 2 to 16)
Fully associative  1                                        # of blocks in cache

Scheme             Location method                          # of comparisons
Direct mapped      Index                                    1
Set associative    Index the set; compare the set's tags    Degree of associativity
Fully associative  Compare all blocks' tags                 # of blocks

Page 27: Multilevel Caches

A two-level cache structure allows the primary cache (L1) to focus on reducing hit time to yield a shorter clock cycle, while the second-level cache (L2) focuses on reducing the penalty of long memory access times.

Compared to the cache of a single-cache machine, L1 on a multilevel-cache machine is usually smaller, has a smaller block size, and has a higher miss rate; L2 is often larger, with a larger block size, and its access time is less critical.
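As a worked illustration of this division of labor, here is the average memory access time (AMAT) with two cache levels (all latencies and miss rates below are assumed values, not from the slides):

```python
# Sketch: two-level AMAT calculation with illustrative numbers.
l1_hit_time = 1        # clocks; L1 is optimized for hit time
l1_miss_rate = 0.05    # fraction of accesses that miss in L1
l2_hit_time = 10       # clocks; penalty to reach L2
l2_miss_rate = 0.02    # local L2 miss rate (fraction of L1 misses that also miss)
mem_penalty = 200      # clocks to main memory; L2 exists to hide this cost

amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_penalty)
print(f"AMAT = {amat:.2f} clocks")   # 1 + 0.05 * (10 + 0.02 * 200) = 1.70
```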