Caches: Cache Writes, DRAM Configurations, Performance, Associative Caches, Multi-level Caches
TRANSCRIPT
Outline
• Cache writes
• DRAM configurations
• Performance
• Associative caches
• Multi-level caches
Direct-mapped Cache Example
Block size = 4 words, word size = 4 bytes; the address divides into Tag | Index | Block Offset | Byte Offset. The cache starts with M[64-79], M[208-223], and M[32-47] valid; the fourth line is not valid.

Reference Stream:  Hit/Miss
• 0b01001000  H  (M[64-79] is already cached)
• 0b00010100  M  (M[16-31] replaces M[208-223])
• 0b00111000  M  (M[48-63] fills the invalid line)
• 0b00010000  H  (M[16-31], just brought in)

Final contents: M[64-79], M[16-31], M[32-47], M[48-63]
Cache Writes
• There are multiple copies of the data lying around: L1 cache, L2 cache, DRAM
• Do we write to all of them?
• Do we wait for the write to complete before the processor can proceed?
Do we write to all of them?
• Write-through: write to all levels of the hierarchy
• Write-back: write to the lower level only when the cache line gets evicted from the cache
– Creates inconsistent data: different values for the same item in cache and DRAM
– Inconsistent data in the highest cache level is referred to as dirty
– If all the copies match, they are clean
– The old data is stale
Write-Through
[Diagram: CPU → L1 → L2 Cache → DRAM; the store sw $3, 0($5) writes every level.]

Write-Back
[Diagram: CPU → L1 → L2 Cache → DRAM; the same store sw $3, 0($5) writes only L1.]
Write-through vs Write-back
• Which performs the write faster?
– Write-back: it only writes the L1 cache
• Which has faster evictions from a cache?
– Write-through: no write of data involved, just overwrite the tag
• Which causes more bus traffic?
– Write-through: DRAM is written on every store; write-back only writes on eviction
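The traffic comparison above can be sketched in Python. This is an illustrative model, not from the lecture: the store stream, cache geometry, and function names are assumptions, and the write-back model counts only dirty-line evictions as DRAM writes.

```python
# Hypothetical sketch: count DRAM write traffic for the same store stream
# under write-through vs write-back policies.

def write_through_traffic(stores):
    # Every store propagates all the way down to DRAM.
    return len(stores)

def write_back_traffic(stores, num_lines=4, block_bytes=16):
    # DRAM is written only when a dirty line is evicted.
    cache = {}          # index -> (tag, dirty)
    dram_writes = 0
    for addr in stores:
        block = addr // block_bytes
        index, tag = block % num_lines, block // num_lines
        old = cache.get(index)
        if old is not None and old[0] != tag and old[1]:
            dram_writes += 1          # evicting a dirty line costs a DRAM write
        cache[index] = (tag, True)    # write and mark the line dirty
    return dram_writes

stores = [0, 4, 8, 64, 0, 64]        # repeated stores, two conflicting blocks
print(write_through_traffic(stores))  # 6: one DRAM write per store
print(write_back_traffic(stores))     # 3: only dirty evictions reach DRAM
```

Repeated stores to the same cached block cost nothing extra under write-back, which is exactly why it generates less bus traffic.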
Does the processor wait for the write?
• Write buffer: an intermediate queue for pending writes
– Any loads must check the write buffer in parallel with the cache access
– Buffer values are more recent than cache values
Challenge
• DRAM is designed for density, not speed
• DRAM is slower than the bus
• We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow
• Widening anything increases the cost by quite a bit
Narrow Configuration
[Diagram: CPU — Cache — Bus — DRAM.]
• Given: 1 clock cycle request; 15 cycles/word DRAM latency; 1 cycle/word bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/word × 8 words + 1 cycle/word × 8 words = 129 cycles
Wide Configuration
[Diagram: CPU — Cache — Bus — DRAM.]
• Given: 1 clock cycle request; 15 cycles/2 words DRAM latency; 1 cycle/2 words bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/2 words × 8 words = 65 cycles
Interleaved Configuration
[Diagram: CPU — Cache — Bus — two DRAM banks.]
• Given: 1 clock cycle request; 15 cycles/word DRAM latency; 1 cycle/word bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/word × 8 words = 69 cycles
– The two banks overlap their accesses, so DRAM effectively delivers 2 words per 15 cycles, while the bus still carries one word at a time
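The three miss-penalty calculations can be reproduced in a few lines. The constants come from the slides; the variable names are illustrative.

```python
# Miss penalty for an 8-word block under the three memory configurations:
# 1-cycle request, 15-cycle DRAM access, 1-cycle bus transfer per unit moved.

BLOCK = 8               # words per cache block
REQ, DRAM, BUS = 1, 15, 1

# Narrow: DRAM and bus each move one word at a time.
narrow = REQ + DRAM * BLOCK + BUS * BLOCK

# Wide: DRAM and bus both move 2 words per access.
wide = REQ + DRAM * (BLOCK // 2) + BUS * (BLOCK // 2)

# Interleaved: two DRAM banks overlap (2 words per 15 cycles),
# but the bus is still one word wide.
interleaved = REQ + DRAM * (BLOCK // 2) + BUS * BLOCK

print(narrow, wide, interleaved)  # 129 65 69
```

Interleaving gets most of the wide configuration's benefit without widening the (expensive) bus.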
Recent DRAM trends
• Fewer, bigger DRAMs
• New bus protocols (RAMBUS)
• Small DRAM caches (page mode)
• SDRAM (synchronous DRAM): one request and a length nets several continuous responses
Performance
• Execution Time = (CPU cycles + Memory-stall cycles) × clock cycle time
• Memory-stall cycles
  = (accesses/program) × (misses/access) × (cycles/miss)
  = (accesses/program) × miss rate × miss penalty
  = (instructions/program) × (misses/instruction) × miss penalty
Example 1
• instruction cache miss rate: 2%
• data cache miss rate: 3%
• miss penalty: 50 cycles
• ld/st instructions are 25% of instructions
• CPI with perfect cache is 2.3
• How much faster is the computer with a perfect cache?
Example 1
• misses/instr = (I-accesses/instr × I miss rate) + (D-accesses/instr × D miss rate)
• = 1 × 0.02 + 0.25 × 0.03 = 0.02 + 0.0075 = 0.0275
• Memory-stall cycles = I × 0.0275 × 50 = 1.375 I
• ExecT = (CPU CPI × I + MemCycles) × Clk
• = (2.3 I + 1.375 I) × C = 3.675 I C
• speedup = 3.675 I C / 2.3 I C = 1.6
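Example 1's arithmetic can be checked directly; the values are the ones given above, and the variable names are illustrative.

```python
# Verifying Example 1: speedup of a perfect cache over the real one.

i_mr, d_mr = 0.02, 0.03      # instruction / data cache miss rates
ldst_frac = 0.25             # loads+stores per instruction
penalty = 50                 # miss penalty in cycles
base_cpi = 2.3               # CPI with a perfect cache

misses_per_instr = 1 * i_mr + ldst_frac * d_mr   # every instr is an I-access
stall_cpi = misses_per_instr * penalty           # memory-stall cycles per instr
speedup = (base_cpi + stall_cpi) / base_cpi

print(round(misses_per_instr, 4))   # 0.0275
print(round(stall_cpi, 3))          # 1.375
print(round(speedup, 2))            # 1.6
```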
Example 2
• Double the clock rate from Example 1. What is the ideal speedup when taking the memory system into account?
• How long is the miss penalty now? 100 cycles (the DRAM time is unchanged, but each cycle is half as long)
• Memory cycles = I × 0.0275 × 100 = 2.75 I
• Exec = (2.3 I + 2.75 I) × clk = 5.05 I (C/2) = 2.525 I C
• speedup = old/new = 3.675 I C / 2.525 I C ≈ 1.5
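The doubled-clock case is a one-line change from Example 1; this sketch hardcodes the Example 1 results so it stands alone.

```python
# Example 2: clock rate doubles, so the 50-cycle miss penalty becomes 100 cycles.

misses_per_instr = 0.0275          # from Example 1
base_cpi = 2.3
stall_cpi = misses_per_instr * 100           # 2.75 stall cycles per instruction
new_cpi = base_cpi + stall_cpi               # 5.05

# Old time: 3.675 * I * C.  New time: 5.05 * I * (C / 2) = 2.525 * I * C.
speedup = 3.675 / (new_cpi / 2)
print(round(speedup, 2))                     # 1.46 (the slide rounds to 1.5)
```

Doubling the clock does not double performance: miss penalties, measured in cycles, grow with the clock rate.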
Direct-mapped Cache Example (conflicts)
Block size = 2 words, word size = 4 bytes; address fields: Tag | Index | Block Offset | Byte Offset. The cache starts with M[160-167], M[72-79], and M[16-23] valid; the fourth line is not valid.

Reference Stream:  Hit/Miss
• 0b00111000  M  (M[56-63] fills the invalid line)
• 0b00011100  M  (M[24-31] evicts M[56-63]; same index)
• 0b00111000  M  (M[56-63] evicts M[24-31] right back)
• 0b00011000  M  (M[24-31] again, evicted on the previous reference)

Every reference misses: the addresses keep mapping to the same index and evicting each other.
Problem
• Conflicting addresses cause high miss rates
Solution
• Relax the direct-mapping
• Allow each address to be mapped into 2 or 4 locations (a set)
Cache Configurations
[Diagram: three layouts, each block with Data, Tag, and Valid fields.]
• Direct-Mapped
• 2-way Associative: each set has two blocks
• Fully Associative: all addresses map to the same set
2-way Set Associative Cache Example
Block size = 2 words, word size = 4 bytes; address fields: Tag | Index | Block Offset | Byte Offset. Each index now selects a set of two blocks.

Reference Stream:  Hit/Miss
• 0b00111000  M
• 0b00011100  H
• 0b00111000  H
• 0b00011000  H

Unlike the direct-mapped case, the conflicting blocks can now live in the same set, so after the first miss every reference hits.
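The contrast between the two examples can be shown with a tiny cache model. This is a sketch under assumptions: both caches start cold (the slides preload some blocks, so the exact H/M pattern differs), both hold 4 blocks of 8 bytes, and the 2-way set uses LRU replacement.

```python
# Replay the slides' reference stream through a direct-mapped and a
# 2-way set-associative cache of the same total size (4 blocks, 8-byte blocks).

def simulate(addrs, num_sets, ways, block_bytes=8):
    sets = [[] for _ in range(num_sets)]   # each set: up to `ways` tags, LRU order
    results = []
    for a in addrs:
        block = a // block_bytes
        s, tag = block % num_sets, block // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)
            sets[s].append(tag)            # move to most-recently-used position
            results.append('H')
        else:
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict the least recently used tag
            sets[s].append(tag)
            results.append('M')
    return results

stream = [0b00111000, 0b00011100, 0b00111000, 0b00011000]  # bytes 56, 28, 56, 24
print(simulate(stream, num_sets=4, ways=1))  # direct-mapped: ['M', 'M', 'M', 'M']
print(simulate(stream, num_sets=2, ways=2))  # 2-way: ['M', 'M', 'H', 'H']
```

With one block per set, 56 and 28 keep evicting each other; with two blocks per set they coexist, and the conflict misses disappear.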
Implementation
[Diagram: the byte address (e.g. 0b100100100) splits into Tag | Index | Byte Offset. The index selects a set; the tag is compared (=) against each block's tag in the set in parallel. A match on a valid block signals Hit, and MUXes select the matching block's data, then the requested word via the block offset.]
Performance Implications
• Increasing associativity increases hit rate
• Increasing associativity increases access time
• Increasing associativity has no effect on miss penalty
Example: 2-way associative
[2-way associative cache, 2 sets, block size = 2 words.]
Reference Stream: 0b1001000 M, 0b0011100, 0b1001000, 0b0111000
The blocks for 0b1001000 and 0b0011100 map to the same set and fill both of its blocks; 0b1001000 then hits. The final reference, 0b0111000, also maps to that set, which is now full: one of the two resident blocks must be replaced.
Which block to replace?
• 0b1001000 — it entered the cache first: FIFO (First In, First Out)
• 0b0011100 — it has gone longer since last use: LRU (Least Recently Used)
• Random
Replacement Algorithms
• LRU & FIFO are simple conceptually, but implementation is difficult at high associativity
• LRU & FIFO must be approximated with high associativity
• Random is sometimes better than approximated LRU/FIFO
• Tradeoff between accuracy and implementation cost
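Exact LRU for one set is easy in software, which makes the hardware tradeoff concrete: the ordering state below is what a highly associative cache would have to track per set. A sketch, with illustrative names:

```python
# Exact LRU for a single cache set, using insertion order as recency order.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()          # oldest entry = least recently used

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting LRU if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used tag
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in [7, 3, 7, 5, 3]])
# [False, False, True, False, False] — 5 evicts 3 (the LRU), so 3 misses again
```

Maintaining this full recency order per set is what becomes expensive in hardware as the number of ways grows, which is why real designs approximate it or fall back to random.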
[Diagram: L1 → L2 Cache → DRAM; from the L1 cache's perspective, everything below it is "Memory".]
L1's miss penalty contains the access of L2, and possibly the access of DRAM!
Multi-level Caches
• Base CPI 1.0, 500 MHz clock
• Main memory: 100 cycles; L2: 10 cycles
• L1 miss rate per instruction: 5%
• With L2, 2% of instructions go to DRAM
• What is the speedup with the L2 cache?
• There is a typo in the book for this example!
Multi-level Caches
• CPI = 1 + memory stalls / instruction
• CPIold = 1 + 5% miss/instr × 100 cycles/miss = 1 + 5 = 6 cycles/instr
• CPInew = 1 + L2% × L2penalty + Mem% × MemPenalty
• = 1 + 5% × 10 + 2% × 100 = 3.5
• = 1 + (5−2)% × 10 + 2% × (10 + 100) = 3.5 (equivalently)
• Speedup = 6 / 3.5 = 1.7
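The multi-level arithmetic above checks out in code; the values are from the slide, the names are illustrative.

```python
# Verifying the multi-level cache speedup.

base_cpi = 1.0
l2_lat, mem_lat = 10, 100        # L2 and main-memory latencies in cycles
l1_mr, dram_mr = 0.05, 0.02      # misses per instruction: in L1; reaching DRAM

cpi_old = base_cpi + l1_mr * mem_lat                      # 6.0 — no L2
# Every L1 miss pays the L2 access; the 2% that also miss in L2 pay DRAM too.
cpi_new = base_cpi + l1_mr * l2_lat + dram_mr * mem_lat   # 3.5 — with L2
print(cpi_old, cpi_new, round(cpi_old / cpi_new, 2))      # 6.0 3.5 1.71
```

Note the second formulation on the slide, 1 + 3% × 10 + 2% × 110, is the same total: it just charges the 2% of references that go to DRAM for their L2 lookup (10) plus memory (100) explicitly.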
• DO GROUPWORK NOW
Summary
• Direct-mapped
– simple
– fast access time
– marginal hit rate
• Variable block size
– still simple
– fast access time
– higher hit rate by exploiting spatial locality

Summary
• Associative caches
– increase the access time
– increase the hit rate
– associativity above 8 has little to no gain
• Multi-level caches
– increase the worst-case miss penalty (because you waste time accessing another cache)
– reduce the average miss penalty (because so many misses are caught and handled quickly)