CMP 301A Computer Architecture 1
Lecture 2
Outline
Direct mapped caches: Reading and writing policies
Measuring cache performance
Improving cache performance
Enhancing main memory performance
Flexible placement of blocks: Associativity
Multilevel caches
Read and Write Policies
Cache reads are much easier to handle than cache writes: an instruction cache is much easier to design than a data cache
Cache writes: how do we keep the data in the cache and in memory consistent?
Two write options:
Write Through: write to the cache and to memory at the same time.
Isn’t memory too slow for this?
Write Back: write to the cache only; write the cache block back to memory when that block is replaced on a cache miss.
Needs a “dirty” bit for each cache block
Greatly reduces the memory bandwidth requirement
Control can be complex
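As a toy comparison of the two policies, the sketch below (class names and the single-block structure are invented for illustration, not from the lecture) counts memory-bus write transactions when the processor stores repeatedly to the same block:

```python
# Sketch: counting memory writes under write-back vs write-through.

class WriteBackCache:
    """One-block 'cache' with a dirty bit, just to show the policy."""
    def __init__(self):
        self.tag = None      # which block is currently cached
        self.data = None
        self.dirty = False   # set on write; dirty block is flushed on eviction
        self.mem_writes = 0  # memory-bus write transactions

    def write(self, tag, value):
        if self.tag is not None and self.tag != tag and self.dirty:
            self.mem_writes += 1          # write back the evicted dirty block
        self.tag, self.data, self.dirty = tag, value, True

class WriteThroughCache:
    def __init__(self):
        self.tag = None
        self.data = None
        self.mem_writes = 0

    def write(self, tag, value):
        self.tag, self.data = tag, value
        self.mem_writes += 1              # every write also goes to memory

wb, wt = WriteBackCache(), WriteThroughCache()
for _ in range(100):                      # 100 stores to the same block
    wb.write(0, 42)
    wt.write(0, 42)
print(wb.mem_writes, wt.mem_writes)       # 0 100
```

The 100-to-0 gap is the bandwidth reduction the slide refers to: write-back pays only when a dirty block is evicted.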
Write Buffer for Write Through
A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer’s nightmare:
Store frequency (w.r.t. time) → 1 / DRAM write cycle: write buffer saturation
Problem: the write buffer may hold the updated value of a location needed by a read miss!
(Diagram: Processor → Cache → Write Buffer → DRAM)
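The read-miss hazard above can be shown with a small FIFO sketch (entry count from the slide; all function names and the address are invented for illustration): a read miss must search the buffer before going to DRAM, or it returns stale data.

```python
from collections import deque

WRITE_BUFFER_ENTRIES = 4        # typical size from the slide
write_buffer = deque()          # FIFO of (address, value) pairs awaiting DRAM
dram = {}

def drain_one():
    """Memory controller retires the oldest buffered write to DRAM."""
    addr, value = write_buffer.popleft()
    dram[addr] = value

def cpu_write(addr, value):
    if len(write_buffer) == WRITE_BUFFER_ENTRIES:
        drain_one()             # buffer saturated: the processor must stall
    write_buffer.append((addr, value))

def read_miss(addr):
    # Search newest-to-oldest for a pending write to this address;
    # skipping this check would return stale data from DRAM.
    for a, v in reversed(write_buffer):
        if a == addr:
            return v
    return dram.get(addr, 0)

cpu_write(0x100, 7)
print(read_miss(0x100))   # 7 — found in the write buffer, not in DRAM
```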
Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 causes a miss.
Do we read in the rest of the block (bytes 2, 3, ..., 31)?
Yes: Write Allocate
No: Write Not Allocate
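A minimal sketch of the two miss behaviors, assuming 32-byte blocks (the byte-granularity write and the data structures are simplifications invented for illustration):

```python
# Write allocate: on a write miss, fetch the whole block into the cache,
# then write into it. Write not-allocate: update memory only, cache unchanged.
BLOCK_SIZE = 32
cache = {}                  # block number -> 32-byte bytearray
memory = bytearray(1024)

def write_allocate(addr, byte):
    block = addr // BLOCK_SIZE
    if block not in cache:                     # write miss
        base = block * BLOCK_SIZE
        # read in the rest of the block (bytes the CPU did not write)
        cache[block] = bytearray(memory[base:base + BLOCK_SIZE])
    cache[block][addr % BLOCK_SIZE] = byte     # write into the cache

def write_not_allocate(addr, byte):
    block = addr // BLOCK_SIZE
    if block in cache:                         # write hit: update the cache
        cache[block][addr % BLOCK_SIZE] = byte
    memory[addr] = byte                        # and always update memory

write_allocate(0x0, 0xAB)
print(0 in cache)          # True — the whole block was brought in
```

Write allocate typically pairs with write-back caches, write not-allocate with write-through.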
(Diagram: a direct-mapped cache with 32-byte blocks. Each entry holds a valid bit, a cache tag (e.g. 0x00), and cache data bytes 0–31; the address is split into Cache Tag, Cache Index, and Byte Select fields.)
Measuring cache performance
Impact of cache misses on performance
Suppose a processor executes at Clock Rate = 1 GHz (1 ns per cycle), with an ideal (no misses) CPI = 1.1 and an instruction mix of 50% arith/logic, 30% ld/st, 20% control.
Suppose that 10% of memory operations (involving data) incur a 100-cycle miss penalty.
Suppose that 1% of instructions incur the same miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 cycles/instr.
      + (0.30 Data_Mops/instr. × 0.10 miss/Data_Mop × 100 cycles/miss)
      + (1 Inst_Mop/instr. × 0.01 miss/Inst_Mop × 100 cycles/miss)
    = (1.1 + 3.0 + 1.0) cycles/instr. = 5.1 cycles/instr.
78% of the time the processor is stalled waiting for memory!
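The arithmetic can be checked directly; the values below are the lecture's own numbers:

```python
# Reproducing the lecture's CPI calculation.
ideal_cpi = 1.1
ld_st_fraction = 0.30          # data memory ops per instruction
data_miss_rate = 0.10          # 10% of data memory ops miss
inst_miss_rate = 0.01          # 1% of instruction fetches miss
miss_penalty = 100             # cycles per miss

data_stalls = ld_st_fraction * data_miss_rate * miss_penalty   # 3.0
inst_stalls = 1.0 * inst_miss_rate * miss_penalty              # 1.0
cpi = ideal_cpi + data_stalls + inst_stalls

print(round(cpi, 2))                               # 5.1
print(round((data_stalls + inst_stalls) / cpi, 2)) # 0.78 -> 78% stalled
```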
Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty
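The formula is a one-liner; the numbers in the example call are assumed for illustration, not from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, all times in cycles."""
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle penalty.
print(amat(1, 0.05, 100))    # 6.0 cycles
```

Each of the three improvement targets corresponds to one term: hit time is the additive base, and miss rate and miss penalty multiply each other, so halving either halves the stall component.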
Enhancing main memory performance
Increasing memory and bus width: transfer more words every clock cycle, but requires a lot of wiring
Using an interleaved memory organization: reduces access time with less wiring
Double Data Rate (DDR) DRAMs
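A sketch of how interleaving maps addresses, assuming low-order interleaving across 4 banks (the bank count and helper names are assumptions for illustration): consecutive words land in different banks, so sequential accesses can overlap.

```python
NUM_BANKS = 4

def bank_of(word_addr):
    """Low-order interleaving: bank = word address mod number of banks."""
    return word_addr % NUM_BANKS

def offset_in_bank(word_addr):
    """Row within the selected bank."""
    return word_addr // NUM_BANKS

# Eight consecutive words cycle through all four banks:
print([bank_of(a) for a in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```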
Flexible placement of blocks: Associativity
(Diagram: an 8-block cache above a 32-block memory. Block 12 can be placed: anywhere in a fully associative cache; anywhere in set 0 (12 mod 4) of a 2-way set associative cache; only into block 4 (12 mod 8) of a direct mapped cache.)
Flexible placement of blocks: Associativity
A Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index, i.e. N direct mapped caches operating in parallel.
Example: two-way set associative cache:
The Cache Index selects a “set” from the cache
The two tags in the set are compared in parallel
Data is selected based on the tag comparison result
(Diagram: two-way set associative cache. The Cache Index selects one set; the address tag is compared against both ways’ tags in parallel; the compare results are ORed to produce Hit and drive a mux (Sel1, Sel0) that selects the Cache Block.)
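The lookup datapath can be sketched in a few lines; the cache geometry below (4 sets, 32-byte blocks) is assumed for illustration, not from the lecture:

```python
# Sketch of a 2-way set associative lookup: index selects a set, both ways'
# tags are compared in parallel, and the OR of the comparisons gives Hit.
NUM_SETS = 4          # -> 2 index bits
BLOCK_SIZE = 32       # -> 5 byte-select bits

# sets[i] holds two ways; each way is (valid, tag, data)
sets = [[(False, None, None), (False, None, None)] for _ in range(NUM_SETS)]

def split(addr):
    """Split an address into (tag, index, byte_select) fields."""
    byte_select = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_SETS
    tag = addr // (BLOCK_SIZE * NUM_SETS)
    return tag, index, byte_select

def lookup(addr):
    tag, index, _ = split(addr)
    hits = [valid and t == tag for (valid, t, _) in sets[index]]
    return any(hits)           # OR of the per-way comparator outputs

sets[1][0] = (True, 3, b"block data")               # preload set 1, way 0, tag 3
addr = 3 * NUM_SETS * BLOCK_SIZE + 1 * BLOCK_SIZE   # tag 3, index 1
print(lookup(addr), lookup(addr + BLOCK_SIZE))      # True False
```

A direct mapped cache is the 1-way special case (one comparator, no mux); adding ways shrinks the index field and widens the tag.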
And Yet Another Extreme Example: Fully Associative
Fully associative cache: push the set associative idea to its limit!
Forget about the Cache Index
Compare the Cache Tags of all cache entries in parallel
Example: with 32-byte blocks (and 32-bit addresses), we need N 27-bit comparators
By definition: conflict misses = 0 for a fully associative cache
(Diagram: fully associative cache. Each entry holds a valid bit, a 27-bit cache tag, and 32 bytes of data; the address has only Cache Tag and Byte Select fields, and every entry’s tag is compared in parallel.)
Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
• LRU cache state must be updated on every access
• true implementation only feasible for small sets (2-way)
• First-In, First-Out (FIFO), a.k.a. Round-Robin
• used in highly associative caches
• Not-Most-Recently Used (NMRU)
• FIFO with an exception for the most-recently used block or blocks
Replacement only happens on misses
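A minimal sketch of LRU for a single set (the way count and the access sequence are invented for illustration):

```python
# LRU replacement in one 4-way set: on a hit the block moves to the
# most-recently-used end; on a miss with a full set, the LRU block is evicted.
WAYS = 4
set_blocks = []           # tags ordered from LRU (front) to MRU (back)

def access(tag):
    """Returns True on a hit. Replacement only happens on misses."""
    if tag in set_blocks:
        set_blocks.remove(tag)
        set_blocks.append(tag)     # refresh: now most recently used
        return True
    if len(set_blocks) == WAYS:
        set_blocks.pop(0)          # evict the least recently used block
    set_blocks.append(tag)
    return False

for t in [1, 2, 3, 4]:
    access(t)          # fill the set: LRU order is [1, 2, 3, 4]
access(1)              # hit: 1 becomes MRU -> [2, 3, 4, 1]
access(5)              # miss: evicts 2 (the LRU), not 1
print(set_blocks)      # [3, 4, 1, 5]
```

Note how the order must be updated on every access, hit or miss, which is why true LRU is only practical for small set sizes.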