Caches: Cache Writes, DRAM Configurations, Performance, Associative Caches, Multi-level Caches
TRANSCRIPT
Outline
• Cache writes
• DRAM configurations
• Performance
• Associative caches
• Multi-level caches
Direct-mapped Cache Example
Block size = 4 words, word size = 4 bytes; the address divides into Tag | Index | Block Offset | Byte Offset. The cache starts with M[64-79], M[208-223], and M[32-47] valid; the fourth line is not valid.

Reference Stream:  Hit/Miss
• 0b01001000  H  (M[64-79] is already cached)
• 0b00010100  M  (M[16-31] replaces M[208-223])
• 0b00111000  M  (M[48-63] fills the invalid line)
• 0b00010000  H  (M[16-31], just brought in)

Final contents: M[64-79], M[16-31], M[32-47], M[48-63]
Cache Writes
• There are multiple copies of the data lying around: L1 cache, L2 cache, DRAM
• Do we write to all of them?
• Do we wait for the write to complete before the processor can proceed?
Do we write to all of them?
• Write-through: write to all levels of the hierarchy
• Write-back: write to the lower level only when the cache line gets evicted from the cache
– Creates inconsistent data: different values for the same item in cache and DRAM
– Inconsistent data in the highest cache level is referred to as dirty
– If all the copies match, they are clean
– The old data is stale
Write-Through
[Diagram: CPU → L1 → L2 Cache → DRAM; the store sw $3, 0($5) writes every level.]

Write-Back
[Diagram: CPU → L1 → L2 Cache → DRAM; the same store sw $3, 0($5) writes only L1.]
Write-through vs Write-back
• Which performs the write faster?
– Write-back: it only writes the L1 cache
• Which has faster evictions from a cache?
– Write-through: no write of data involved, just overwrite the tag
• Which causes more bus traffic?
– Write-through: DRAM is written on every store; write-back only writes on eviction
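The traffic comparison above can be sketched in Python. This is an illustrative model, not from the lecture: the store stream, cache geometry, and function names are assumptions, and the write-back model counts only dirty-line evictions as DRAM writes.

```python
# Hypothetical sketch: count DRAM write traffic for the same store stream
# under write-through vs write-back policies.

def write_through_traffic(stores):
    # Every store propagates all the way down to DRAM.
    return len(stores)

def write_back_traffic(stores, num_lines=4, block_bytes=16):
    # DRAM is written only when a dirty line is evicted.
    cache = {}          # index -> (tag, dirty)
    dram_writes = 0
    for addr in stores:
        block = addr // block_bytes
        index, tag = block % num_lines, block // num_lines
        old = cache.get(index)
        if old is not None and old[0] != tag and old[1]:
            dram_writes += 1          # evicting a dirty line costs a DRAM write
        cache[index] = (tag, True)    # write and mark the line dirty
    return dram_writes

stores = [0, 4, 8, 64, 0, 64]        # repeated stores, two conflicting blocks
print(write_through_traffic(stores))  # 6: one DRAM write per store
print(write_back_traffic(stores))     # 3: only dirty evictions reach DRAM
```

Repeated stores to the same cached block cost nothing extra under write-back, which is exactly why it generates less bus traffic.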
Does the processor wait for the write?
• Write buffer: an intermediate queue for pending writes
– Any loads must check the write buffer in parallel with the cache access
– Buffer values are more recent than cache values
Challenge
• DRAM is designed for density, not speed
• DRAM is slower than the bus
• We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow
• Widening anything increases the cost by quite a bit
Narrow Configuration
[Diagram: CPU — Cache — Bus — DRAM.]
• Given: 1 clock cycle request; 15 cycles/word DRAM latency; 1 cycle/word bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/word × 8 words + 1 cycle/word × 8 words = 129 cycles
Wide Configuration
[Diagram: CPU — Cache — Bus — DRAM.]
• Given: 1 clock cycle request; 15 cycles/2 words DRAM latency; 1 cycle/2 words bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/2 words × 8 words = 65 cycles
Interleaved Configuration
[Diagram: CPU — Cache — Bus — two DRAM banks.]
• Given: 1 clock cycle request; 15 cycles/word DRAM latency; 1 cycle/word bus latency
• If a cache block is 8 words, what is the miss penalty of an L2 cache miss?
• 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/word × 8 words = 69 cycles
– The two banks overlap their accesses, so DRAM effectively delivers 2 words per 15 cycles, while the bus still carries one word at a time
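The three miss-penalty calculations can be reproduced in a few lines. The constants come from the slides; the variable names are illustrative.

```python
# Miss penalty for an 8-word block under the three memory configurations:
# 1-cycle request, 15-cycle DRAM access, 1-cycle bus transfer per unit moved.

BLOCK = 8               # words per cache block
REQ, DRAM, BUS = 1, 15, 1

# Narrow: DRAM and bus each move one word at a time.
narrow = REQ + DRAM * BLOCK + BUS * BLOCK

# Wide: DRAM and bus both move 2 words per access.
wide = REQ + DRAM * (BLOCK // 2) + BUS * (BLOCK // 2)

# Interleaved: two DRAM banks overlap (2 words per 15 cycles),
# but the bus is still one word wide.
interleaved = REQ + DRAM * (BLOCK // 2) + BUS * BLOCK

print(narrow, wide, interleaved)  # 129 65 69
```

Interleaving gets most of the wide configuration's benefit without widening the (expensive) bus.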
Recent DRAM trends
• Fewer, bigger DRAMs
• New bus protocols (RAMBUS)
• Small DRAM caches (page mode)
• SDRAM (synchronous DRAM): one request and a length nets several continuous responses
Performance
• Execution Time = (CPU cycles + Memory-stall cycles) × clock cycle time
• Memory-stall cycles
  = (accesses/program) × (misses/access) × (cycles/miss)
  = (accesses/program) × miss rate × miss penalty
  = (instructions/program) × (misses/instruction) × miss penalty
Example 1
• instruction cache miss rate: 2%
• data cache miss rate: 3%
• miss penalty: 50 cycles
• ld/st instructions are 25% of instructions
• CPI with perfect cache is 2.3
• How much faster is the computer with a perfect cache?
Example 1
• misses/instr = (I-accesses/instr × I miss rate) + (D-accesses/instr × D miss rate)
• = 1 × 0.02 + 0.25 × 0.03 = 0.02 + 0.0075 = 0.0275
• Memory-stall cycles = I × 0.0275 × 50 = 1.375 I
• ExecT = (CPU CPI × I + MemCycles) × Clk
• = (2.3 I + 1.375 I) × C = 3.675 I C
• speedup = 3.675 I C / 2.3 I C = 1.6
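Example 1's arithmetic can be checked directly; the values are the ones given above, and the variable names are illustrative.

```python
# Verifying Example 1: speedup of a perfect cache over the real one.

i_mr, d_mr = 0.02, 0.03      # instruction / data cache miss rates
ldst_frac = 0.25             # loads+stores per instruction
penalty = 50                 # miss penalty in cycles
base_cpi = 2.3               # CPI with a perfect cache

misses_per_instr = 1 * i_mr + ldst_frac * d_mr   # every instr is an I-access
stall_cpi = misses_per_instr * penalty           # memory-stall cycles per instr
speedup = (base_cpi + stall_cpi) / base_cpi

print(round(misses_per_instr, 4))   # 0.0275
print(round(stall_cpi, 3))          # 1.375
print(round(speedup, 2))            # 1.6
```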
Example 2
• Double the clock rate from Example 1. What is the ideal speedup when taking the memory system into account?
• How long is the miss penalty now? 100 cycles (the DRAM time is unchanged, but each cycle is half as long)
• Memory cycles = I × 0.0275 × 100 = 2.75 I
• Exec = (2.3 I + 2.75 I) × clk = 5.05 I (C/2) = 2.525 I C
• speedup = old/new = 3.675 I C / 2.525 I C ≈ 1.5
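The doubled-clock case is a one-line change from Example 1; this sketch hardcodes the Example 1 results so it stands alone.

```python
# Example 2: clock rate doubles, so the 50-cycle miss penalty becomes 100 cycles.

misses_per_instr = 0.0275          # from Example 1
base_cpi = 2.3
stall_cpi = misses_per_instr * 100           # 2.75 stall cycles per instruction
new_cpi = base_cpi + stall_cpi               # 5.05

# Old time: 3.675 * I * C.  New time: 5.05 * I * (C / 2) = 2.525 * I * C.
speedup = 3.675 / (new_cpi / 2)
print(round(speedup, 2))                     # 1.46 (the slide rounds to 1.5)
```

Doubling the clock does not double performance: miss penalties, measured in cycles, grow with the clock rate.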
Direct-mapped Cache Example (conflicts)
Block size = 2 words, word size = 4 bytes; address fields: Tag | Index | Block Offset | Byte Offset. The cache starts with M[160-167], M[72-79], and M[16-23] valid; the fourth line is not valid.

Reference Stream:  Hit/Miss
• 0b00111000  M  (M[56-63] fills the invalid line)
• 0b00011100  M  (M[24-31] evicts M[56-63]; same index)
• 0b00111000  M  (M[56-63] evicts M[24-31] right back)
• 0b00011000  M  (M[24-31] again, evicted on the previous reference)

Every reference misses: the addresses keep mapping to the same index and evicting each other.
Problem
• Conflicting addresses cause high miss rates
Solution
• Relax the direct-mapping
• Allow each address to be mapped into 2 or 4 locations (a set)
Cache Configurations
[Diagram: three layouts, each block with Data, Tag, and Valid fields.]
• Direct-Mapped
• 2-way Associative: each set has two blocks
• Fully Associative: all addresses map to the same set
2-way Set Associative Cache Example
Block size = 2 words, word size = 4 bytes; address fields: Tag | Index | Block Offset | Byte Offset. Each index now selects a set of two blocks.

Reference Stream:  Hit/Miss
• 0b00111000  M
• 0b00011100  H
• 0b00111000  H
• 0b00011000  H

Unlike the direct-mapped case, the conflicting blocks can now live in the same set, so after the first miss every reference hits.
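The contrast between the two examples can be shown with a tiny cache model. This is a sketch under assumptions: both caches start cold (the slides preload some blocks, so the exact H/M pattern differs), both hold 4 blocks of 8 bytes, and the 2-way set uses LRU replacement.

```python
# Replay the slides' reference stream through a direct-mapped and a
# 2-way set-associative cache of the same total size (4 blocks, 8-byte blocks).

def simulate(addrs, num_sets, ways, block_bytes=8):
    sets = [[] for _ in range(num_sets)]   # each set: up to `ways` tags, LRU order
    results = []
    for a in addrs:
        block = a // block_bytes
        s, tag = block % num_sets, block // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)
            sets[s].append(tag)            # move to most-recently-used position
            results.append('H')
        else:
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict the least recently used tag
            sets[s].append(tag)
            results.append('M')
    return results

stream = [0b00111000, 0b00011100, 0b00111000, 0b00011000]  # bytes 56, 28, 56, 24
print(simulate(stream, num_sets=4, ways=1))  # direct-mapped: ['M', 'M', 'M', 'M']
print(simulate(stream, num_sets=2, ways=2))  # 2-way: ['M', 'M', 'H', 'H']
```

With one block per set, 56 and 28 keep evicting each other; with two blocks per set they coexist, and the conflict misses disappear.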
Implementation
[Diagram: the byte address (e.g. 0b100100100) splits into Tag | Index | Byte Offset. The index selects a set; the tag is compared (=) against each block's tag in the set in parallel. A match on a valid block signals Hit, and MUXes select the matching block's data, then the requested word via the block offset.]
Performance Implications
• Increasing associativity increases hit rate
• Increasing associativity increases access time
• Increasing associativity has no effect on miss penalty
Example: 2-way associative
[2-way associative cache, 2 sets, block size = 2 words.]
Reference Stream: 0b1001000 M, 0b0011100, 0b1001000, 0b0111000
The blocks for 0b1001000 and 0b0011100 map to the same set and fill both of its blocks; 0b1001000 then hits. The final reference, 0b0111000, also maps to that set, which is now full: one of the two resident blocks must be replaced.
Which block to replace?
• 0b1001000 — it entered the cache first: FIFO (First In, First Out)
• 0b0011100 — it has gone longer since last use: LRU (Least Recently Used)
• Random
Replacement Algorithms
• LRU & FIFO are simple conceptually, but implementation is difficult at high associativity
• LRU & FIFO must be approximated with high associativity
• Random is sometimes better than approximated LRU/FIFO
• Tradeoff between accuracy and implementation cost
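Exact LRU for one set is easy in software, which makes the hardware tradeoff concrete: the ordering state below is what a highly associative cache would have to track per set. A sketch, with illustrative names:

```python
# Exact LRU for a single cache set, using insertion order as recency order.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()          # oldest entry = least recently used

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting LRU if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark as most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used tag
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in [7, 3, 7, 5, 3]])
# [False, False, True, False, False] — 5 evicts 3 (the LRU), so 3 misses again
```

Maintaining this full recency order per set is what becomes expensive in hardware as the number of ways grows, which is why real designs approximate it or fall back to random.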
[Diagram: L1 → L2 Cache → DRAM; from the L1 cache's perspective, everything below it is "Memory".]
L1's miss penalty contains the access of L2, and possibly the access of DRAM!
Multi-level Caches
• Base CPI 1.0, 500 MHz clock
• Main memory: 100 cycles; L2: 10 cycles
• L1 miss rate per instruction: 5%
• With L2, 2% of instructions go to DRAM
• What is the speedup with the L2 cache?
• There is a typo in the book for this example!
Multi-level Caches
• CPI = 1 + memory stalls / instruction
• CPIold = 1 + 5% miss/instr × 100 cycles/miss = 1 + 5 = 6 cycles/instr
• CPInew = 1 + L2% × L2penalty + Mem% × MemPenalty
• = 1 + 5% × 10 + 2% × 100 = 3.5
• = 1 + (5−2)% × 10 + 2% × (10 + 100) = 3.5 (equivalently)
• Speedup = 6 / 3.5 = 1.7
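The multi-level arithmetic above checks out in code; the values are from the slide, the names are illustrative.

```python
# Verifying the multi-level cache speedup.

base_cpi = 1.0
l2_lat, mem_lat = 10, 100        # L2 and main-memory latencies in cycles
l1_mr, dram_mr = 0.05, 0.02      # misses per instruction: in L1; reaching DRAM

cpi_old = base_cpi + l1_mr * mem_lat                      # 6.0 — no L2
# Every L1 miss pays the L2 access; the 2% that also miss in L2 pay DRAM too.
cpi_new = base_cpi + l1_mr * l2_lat + dram_mr * mem_lat   # 3.5 — with L2
print(cpi_old, cpi_new, round(cpi_old / cpi_new, 2))      # 6.0 3.5 1.71
```

Note the second formulation on the slide, 1 + 3% × 10 + 2% × 110, is the same total: it just charges the 2% of references that go to DRAM for their L2 lookup (10) plus memory (100) explicitly.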
• DO GROUPWORK NOW
Summary
• Direct-mapped
– simple
– fast access time
– marginal hit rate
• Variable block size
– still simple
– fast access time
– higher hit rate by exploiting spatial locality

Summary
• Associative caches
– increase the access time
– increase the hit rate
– associativity above 8 has little to no gain
• Multi-level caches
– increase the worst-case miss penalty (because you waste time accessing another cache)
– reduce the average miss penalty (because so many misses are caught and handled quickly)