CSE 141 Carro
Cache Memory
Technology Trends
DRAM
Year Size Cycle Time
1980 64 Kb 250 ns
1983 256 Kb 220 ns
1986 1 Mb 190 ns
1989 4 Mb 165 ns
1992 16 Mb 145 ns
1995 64 Mb 120 ns
           Capacity           Speed (latency)
Logic:     2x in 3 years      2x in 3 years
DRAM:      4x in 3 years      2x in 10 years
Disk:      4x in 3 years      2x in 10 years
           1000:1!            2:1!
Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM Memory Gap (latency), 1980-2000, log scale 1 to 1000. CPU performance ("Moore's Law") grows ~60%/yr (2X every 1.5 years), while DRAM grows ~9%/yr (2X every 10 years); the processor-memory performance gap grows about 50% per year.]
Impact on Performance
• Suppose a processor executes a program with
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a miss penalty of 50 cycles
• CPI = ideal CPI + average stalls per instruction
• Impact of data misses:
  = 1.1 cycles/instr + 0.30 data mops/instr x 0.10 misses/data mop x 50 cycles/miss
  = 1.1 cycles + 1.5 cycles = 2.6 cycles
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Chart: CPI breakdown -- ideal CPI 1.1, data-miss stalls 1.5, instruction-miss stalls 0.5]
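To double-check the arithmetic above, here is a minimal Python sketch using exactly the rates and penalties assumed in the example:

    # CPI impact of data-cache misses, using the numbers assumed above.
    ideal_cpi     = 1.1
    ldst_per_inst = 0.30   # 30% of instructions are loads/stores
    miss_rate     = 0.10   # 10% of memory operations miss
    miss_penalty  = 50     # cycles per miss

    data_stalls = ldst_per_inst * miss_rate * miss_penalty   # 1.5 cycles/instr
    cpi = ideal_cpi + data_stalls                             # 2.6 cycles/instr
    print(cpi, data_stalls / cpi)      # 2.6, ~0.58: stalled 58% of the time

    # A 1% instruction miss rate would add 0.01 * 50 = 0.5 cycles to the CPI.
    print(0.01 * miss_penalty)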
Why hierarchy works
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address (0 to 2^n - 1): at any instant, references cluster around a few small regions of the address space.]
Memory Locality
• Memory hierarchies take advantage of memory locality.
• Memory locality is the principle that future memory accesses are near past accesses.
• Memories take advantage of two types of locality:
  – Temporal locality -- near in time => we will often access the same data again very soon
  – Spatial locality -- near in space/distance => our next access is often very close to our last access (or recent accesses).
This sequence of addresses exhibits both temporal and spatial locality:
1, 2, 3, 1, 2, 3, 8, 8, 47, 9, 10, 8, 8, ...
Locality and caching
• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories.
• If it works, we get the illusion of SRAM access time with disk capacity.
  SRAM access times are 2-25 ns at a cost of $100 to $250 per Mbyte.
  DRAM access times are 60-120 ns at a cost of $5 to $10 per Mbyte.
  Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per Mbyte.
A typical memory hierarchy
[Figure: CPU -> on-chip cache -> off-chip cache -> main memory -> disk. Levels close to the CPU are small and expensive per bit; levels further away are big and cheap per bit.]
• So then, where are my program and data??
Cache Fundamentals
• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where the data is not found in the cache.
• hit time -- the time to access the cache.
• miss penalty -- the time to move data from the further level to the closer one, then to the CPU.
• hit ratio -- the percentage of accesses in which the data is found in the cache.
• miss ratio = 1 - hit ratio
[Figure: CPU <-> lowest-level cache <-> next-level memory/cache]
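Putting these definitions together gives the average access time; a minimal Python sketch with purely illustrative (assumed) numbers:

    # Average access time from the definitions above (illustrative values only).
    hit_time     = 1      # cycles to access the cache
    miss_ratio   = 0.05   # 1 - hit ratio
    miss_penalty = 50     # cycles to fetch the data from the next level

    average_access_time = hit_time + miss_ratio * miss_penalty
    print(average_access_time)   # 1 + 0.05 * 50 = 3.5 cycles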
Cache Fundamentals, cont.
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- cache that only holds instructions.
• data cache -- cache that only holds data.
• unified cache -- cache that holds both.
Caching Issues
On a memory access -
• How do I know if this is a hit or miss?
On a cache miss -
• where to put the new data?
• what data to throw out?
• how to remember what data this is?
First example of a Cache
• Our first example:
  – block size is one word of data
  – "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be.
e.g., lots of items at the lower level share locations in the upper level.
Direct Mapped Cache
• Mapping: address is modulo the number of blocks in the cache
[Figure: an 8-block direct-mapped cache with indices 000-111; memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 each map to the cache block selected by their low-order 3 bits.]
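A minimal Python sketch of this modulo mapping, using the block addresses from the figure:

    # Direct-mapped placement: cache index = block address mod number of blocks.
    NUM_BLOCKS = 8   # the 8-entry cache from the figure (indices 000-111)

    for addr in [0b00001, 0b00101, 0b01001, 0b01101,
                 0b10001, 0b10101, 0b11001, 0b11101]:
        index = addr % NUM_BLOCKS      # same as keeping the low-order 3 bits
        print(f"block address {addr:05b} -> cache index {index:03b}")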
Direct Mapped Cache
• For MIPS:
[Figure: a 32-bit address (bit positions 31..0) split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0..1023), each holding a valid bit, a 20-bit tag, and 32 bits of data; Hit is asserted when the entry is valid and its stored tag equals (=) the address tag.]
What kind of locality are we taking advantage of?
Why must we have the valid bit?
How many bits in a cache, anyway?
• Suppose a direct-mapped cache with 64 KB of data, one-word blocks, and 32-bit addresses.
• 64 KB -> 16K words = 2^14 words; with one-word blocks, that is 2^14 blocks.
• Each block has 32 bits of data, plus a tag (32 - 14 - 2 = 16 bits), plus a valid bit. So:
• 2^14 x (32 + 16 + 1) = 2^14 x 49 = 784 x 2^10 bits = 784 Kbits
• Another way: 98 KB of storage for 64 KB of data -- about 1.5 times larger.
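The same count as a short Python sketch:

    # Total storage for a 64 KB direct-mapped cache with one-word (4-byte) blocks.
    num_blocks = (64 * 1024) // 4      # 2**14 blocks
    data_bits  = 32                    # one word of data per block
    tag_bits   = 32 - 14 - 2           # address bits minus index minus byte offset
    valid_bits = 1

    total_bits = num_blocks * (data_bits + tag_bits + valid_bits)
    print(total_bits)                  # 802816 bits = 784 Kbits
    print(total_bits / 8 / 1024)       # 98.0 KB of storage for 64 KB of data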
What happens in a cache miss?
• Read: send the address to main memory and wait
• Write: if we write only in the cache, we have an inconsistency!
• If we always write to cache AND memory, we do a write-through
• what is our penalty?
• Solution 1: write buffer
• Solution 2: write back
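A toy Python sketch of the two write policies (the single cached block and the addresses are made up purely for illustration):

    # Toy illustration of write-through vs. write-back for one cached word.
    memory = {0x100: 7}                                  # backing memory
    cache  = {"addr": 0x100, "data": 7, "dirty": False}  # one-block "cache"

    def write_through(value):
        cache["data"] = value
        memory[cache["addr"]] = value    # memory stays consistent
        # (a write buffer would let the CPU continue while this completes)

    def write_back(value):
        cache["data"] = value
        cache["dirty"] = True            # memory is stale until eviction

    def evict():
        if cache["dirty"]:               # write-back copies the block out now
            memory[cache["addr"]] = cache["data"]
            cache["dirty"] = False

    write_back(42)
    print(memory[0x100])   # still 7: the inconsistency mentioned above
    evict()
    print(memory[0x100])   # 42: consistent again once the block is written back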
Direct Mapped Cache and Locality
• Taking advantage of spatial locality:
[Figure: a 32-bit address split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). The index selects one of 4K entries, each holding a valid bit (V), a 16-bit tag, and a 128-bit (four-word) data block; on a hit, the block offset drives a mux that selects the requested 32-bit word.]
What about misses now?
• Read misses: get a full block from main memory
• Writes: if we write only one word, in what state is the rest of the block?
[Figure: a multi-word block in which only one word is written -- the rest might belong to someone else!]
• Solution: compare the tag; on a miss, read the block first, and then write! Cost?
Block Size and Miss Rate
[Figure: miss rate (0%-40%) vs. block size (4 to 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]
The effect of block size
[Figure: three sketches vs. block size -- (1) miss penalty grows with block size; (2) miss rate first drops (exploits spatial locality) and then rises once there are too few blocks (compromises temporal locality); (3) average access time eventually rises because of the increased miss penalty and miss rate.]
As block size increases there will be fewer blocks, and not all of the data in a block is used! The overall effect is a strong increase in access time.
Performance, that’s what we want!
• CPU time = (CPU execution clock cycles + Memory stall cycles) x Cycle time
• Memory stall cycles = Instructions/Program x Misses/Instruction x Miss penalty
Example: assume an instruction cache miss rate of 2%, a data cache miss rate of 4%, a machine with CPI = 2, and a miss penalty of 40 cycles. Assume a program mix with 36% loads and stores, and compare this machine with one that has a perfect cache.
Continuing the example I
The number of cycles it costs us in instruction misses is:
  I x 2% x 40 = 0.8 I    (2% = instruction miss rate, 40 = number of cycles a miss takes; we increase the total cycles by this amount)
Data miss cycles:
  I x 36% x 4% x 40 = 0.576 I    (36% = fraction of loads/stores, 4% = data miss rate)
Total increase in cycles: (0.8 + 0.576) I = 1.376 I
The CPI considering memory stalls is 2 + 1.376 = 3.376
CPIstall / CPIperfect = 3.376 / 2 = 1.69
Continuing the example II
Repeat the problem with CPI = 1 (thanks to a better pipeline, for example).
The CPI considering memory stalls is 1 + 1.376 = 2.376
CPIstall / CPIperfect = 2.376 / 1 = 2.38 -- we are losing performance!
Time spent in stalls:
  1.376 / 3.376 = 41% in the first case
  1.376 / 2.376 = 58% in the second case!
Continuing the example III
Repeat the problem with CPI = 2, but suppose a CPU with double the clock frequency (thanks to technology improvements, for example).
If we do not change the memory hierarchy, the cost in cycles of a miss moves from 40 to 80!
  I x 2% x 80 = 1.6 I    (we increase the total cycles by this amount)
  Data miss cycles: I x 36% x 4% x 80 = 1.152 I
  Total increase in cycles: (1.6 + 1.152) I = 2.752 I
The CPI considering memory stalls is 2 + 2.752 = 4.752
CPInewclock / CPIoldclock = 4.752 / 3.376 = 1.41
We have doubled the clock (and hence the power) but not the performance...
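The three scenarios above, recomputed in one short Python sketch (same miss rates, instruction mix, and penalties as in the example):

    # Memory-stall CPI for the three scenarios of the example.
    def cpi_with_stalls(base_cpi, miss_penalty,
                        inst_miss=0.02, ldst_frac=0.36, data_miss=0.04):
        inst_stalls = inst_miss * miss_penalty              # I-cache stalls/instr
        data_stalls = ldst_frac * data_miss * miss_penalty  # D-cache stalls/instr
        return base_cpi + inst_stalls + data_stalls

    cpi_1 = cpi_with_stalls(base_cpi=2, miss_penalty=40)  # 2 + 0.8 + 0.576 = 3.376
    cpi_2 = cpi_with_stalls(base_cpi=1, miss_penalty=40)  # 1 + 1.376       = 2.376
    cpi_3 = cpi_with_stalls(base_cpi=2, miss_penalty=80)  # 2 + 1.6 + 1.152 = 4.752

    print(cpi_1, cpi_1 / 2)        # 1.69x slower than the perfect cache
    print(cpi_2, cpi_2 / 1)        # 2.38x slower than the perfect cache
    print(cpi_3, cpi_3 / cpi_1)    # 1.41x: clock doubled, performance did not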
To remember!
• A cache miss has the same effect as a wrong branch prediction:
• THERE IS AN INCREASE OF THE CPI
• This is the key to HOW to compare machines. All the rest is just common sense.
Extreme Example: single big line
• Cache Size = 4 bytes, Block Size = 4 bytes
  – Only ONE entry in the cache
• If an item is accessed, it is likely that it will be accessed again soon
  – But it is unlikely that it will be accessed again immediately!!!
  – The next access will likely be a miss again: we continually load data into the cache but discard (force out) it before it is used again
  – Worst nightmare of a cache designer: the Ping-Pong Effect
• Conflict Misses are misses caused by:
  – Different memory locations mapped to the same cache index
  – Solution 1: make the cache size bigger
  – Solution 2: multiple entries for the same Cache Index
[Figure: the one-entry cache -- a valid bit, a cache tag, and a 4-byte data block (Byte 0 .. Byte 3).]
Another Extreme Example: Fully Associative
• Fully Associative Cache
  – Forget about the Cache Index
  – Compare the Cache Tags of all cache entries in parallel
  – Example: Block Size = 32 B, so we need N 27-bit comparators
• By definition: Conflict Miss = 0 for a fully associative cache
[Figure: a fully associative cache -- the address splits into a 27-bit cache tag (bits 31-5) and a byte select (e.g., 0x01). Every entry holds a valid bit, a 27-bit tag, and a 32-byte block, and all stored tags are compared against the address tag in parallel.]
A Two-way Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct-mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared in parallel
  – Data is selected based on the tag comparison result
[Figure: two direct-mapped banks side by side. The cache index selects one set (a valid bit, cache tag, and cache block in each bank); the address tag is compared against both stored tags in parallel, the two compare results are ORed to form Hit, and a mux (Sel1/Sel0) selects the matching bank's cache block.]
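A minimal Python sketch of the lookup path just described (the geometry and field layout are illustrative, not any particular machine's design):

    # Two-way set-associative lookup: the index selects a set, both tags are checked.
    NUM_SETS, BLOCK_SIZE = 256, 32      # illustrative geometry (8 KB per way)

    # Two "ways"; each set entry is (valid, tag, block data).
    ways = [[(False, 0, None)] * NUM_SETS for _ in range(2)]

    def lookup(addr):
        index = (addr // BLOCK_SIZE) % NUM_SETS      # which set
        tag   = addr // (BLOCK_SIZE * NUM_SETS)      # remaining upper bits
        for way in ways:                             # compared "in parallel" in hardware
            valid, stored_tag, data = way[index]
            if valid and stored_tag == tag:
                return data                          # hit: the mux selects this way
        return None                                  # miss

    # Install a block in way 0, as a refill would, then look it up again.
    addr = 0x1234
    ways[0][(addr // BLOCK_SIZE) % NUM_SETS] = (True, addr // (BLOCK_SIZE * NUM_SETS), "block")
    print(lookup(addr))    # "block": a hit in way 0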
Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if it was a miss.
• How do we know what to remove from the cache?
A Summary on Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): the first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Capacity:
  – Cache cannot contain all the blocks accessed by the program
  – Solution: increase cache size
• Invalidation: another process (e.g., I/O) updates memory
Sources of Cache Misses Answer
                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Conflict Miss       High            Medium                  Zero
Capacity Miss       Low             Medium                  High
Invalidation Miss   Same            Same                    Same

Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant.
Accessing a Sample Cache
• 64 KB cache, direct-mapped, 32-byte cache block size
[Figure: a 32-bit address split into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), and a 5-bit offset within the 32-byte block. 64 KB / 32 bytes = 2K cache blocks/sets (0..2047); each entry holds a valid bit, a 16-bit tag, and 256 bits of data. The stored tag is compared (=) with the address tag to give hit/miss, and a word-offset mux selects the requested 32-bit word.]
Accessing a Sample Cache
• 32 KB cache, 2-way set-associative, 16-byte block size
[Figure: a 32-bit address split into an 18-bit tag, a 10-bit index, and a 4-bit offset within the 16-byte block. 32 KB / 16 bytes / 2 ways = 1K cache sets (0..1023); each set holds two (valid, tag, data) entries, and both stored tags are compared (=) with the address tag in parallel to give hit/miss, with a word offset selecting the requested word.]
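A small Python sketch showing how the tag/index/offset widths of both sample caches fall out of the geometry (the helper function is ours, not part of any toolchain):

    # Address-field widths for the two sample caches above (32-bit addresses).
    from math import log2

    def field_widths(cache_bytes, block_bytes, ways, addr_bits=32):
        offset = int(log2(block_bytes))            # byte-within-block bits
        sets   = cache_bytes // block_bytes // ways
        index  = int(log2(sets))                   # set-selection bits
        tag    = addr_bits - index - offset        # whatever is left
        return tag, index, offset

    print(field_widths(64 * 1024, 32, ways=1))   # (16, 11, 5): 2K blocks
    print(field_widths(32 * 1024, 16, ways=2))   # (18, 10, 4): 1K sets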
Cache Associativity
[Figure: miss rate (0%-15%) vs. associativity (one-way, two-way, four-way, eight-way), one curve per cache size from 1 KB to 128 KB.]
Cache Miss Components
[Figure: miss rate per type vs. cache size (1 to 128 KB) for one-way, two-way, four-way, and eight-way associativity, with the capacity-miss component shown separately.]
LRU replacement algorithms
• only needed for associative caches
• requires one bit for 2-way set-associative, 8 bits for 4-way, 24 bits for 8-way.
• can be emulated with log n bits (NMRU)
• can be emulated with use bits for highly associative caches (like page tables)
Caches in Current Processors
• Often DM at highest level (closest to CPU), associative further away
• split I and D caches close to the processor (for throughput rather than miss rate), unified further away.
• write-through and write-back both common, but never write-through all the way to memory.
• 32-byte cache lines very common (but getting larger -- 64, 128)
• Non-blocking
  – the processor doesn't stall on a miss, but only on the use of the data that missed (if even then)
  – this means the cache must be able to handle multiple outstanding accesses.
Key Points
• Caches give the illusion of a large, cheap memory with the access time of a fast, expensive memory.
• Caches take advantage of memory locality, specifically temporal locality and spatial locality.
• Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time and so maximize performance.
• Cache misses increase the CPI.
Virtual Memory
The magician’s approach to hardware
The problem:
• Our computer has 32 Kbytes of main memory. How can we:
a) run programs that use more than 32 Kbytes?
   Divide the program into chunks that fit;
   Let the user worry about how to bring each chunk from disk to memory at the right time.
b) allow multiple users to use our computer?
   Pay someone to look at each program and do the above tasks, treating the multiple users' programs as one huge single program (this is true!).
Virtual Memory
• It’s just another level in the cache/memory hierarchy
• Virtual memory is the name of the technique that allows us to view main memory as a cache of a larger memory space (on disk).
[Figure: CPU -- cache ($) -- memory -- disk. The CPU/cache/memory levels are managed by caching; the memory/disk level is managed by virtual memory.]
Virtual Memory
• is just caching, but uses different terminology:

  cache        VM
  block        page
  cache miss   page fault
  address      virtual address
  index        physical address (sort of)
Virtual Memory
• What happens if another program running on the processor uses the same addresses that yours does?
• What happens if your program uses addresses that don't exist in the machine?
• What happens to "holes" in the address space your program uses?
• So, virtual memory provides
  – performance (through the caching effect)
  – protection
  – ease of programming/compilation
  – efficient use of memory
Virtual Memory
• is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.
[Figure: a page table indexed by virtual page number; each entry holds a valid bit and a physical page number or disk address. Valid entries (1) point to pages in physical memory; invalid entries (0) point to pages held in disk storage.]
What makes VM different than memory caches
• MUCH higher miss penalty (millions of cycles)! If it is not in memory, it is on the disk!
• Therefore:
  – large pages [the equivalent of a cache line] (4 KB to MBs)
  – associative mapping of pages (typically fully associative)
  – software handling of misses (since we have time)
  – write-through is not an option, only write-back
• substitution (replacement) policy: LRU
Mapping virtual to physical address
• Page size = 2^12 = 4 KB
• There are 4x fewer physical pages than virtual pages!
[Figure: a 32-bit virtual address splits into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0). Translation maps the virtual page number to an 18-bit physical page number (bits 29-12); the page offset is unchanged, giving a 30-bit physical address.]
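A minimal Python sketch of that split and translation; the page-table contents are made-up example values:

    # Virtual-to-physical translation with 4 KB pages (offset = low 12 bits).
    PAGE_BITS = 12
    PAGE_SIZE = 1 << PAGE_BITS

    page_table = {0x12345: 0x00678}      # hypothetical VPN -> PPN mapping

    def translate(vaddr):
        vpn    = vaddr >> PAGE_BITS          # virtual page number (bits 31-12)
        offset = vaddr & (PAGE_SIZE - 1)     # page offset (bits 11-0), unchanged
        ppn    = page_table[vpn]             # would be a page fault if absent
        return (ppn << PAGE_BITS) | offset

    print(hex(translate(0x12345ABC)))        # 0x678abc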
Virtual Memory mapping
[Figure: the virtual addresses of two different processes both map onto one set of physical addresses, with some pages out on disk. We can share code or data! Where have you seen this?]
Address translation via the page table
[Figure: the virtual address = (virtual page number, page offset). A page table register points to the page table; the virtual page number indexes the table, and the selected entry holds a valid bit and a physical page number. The physical address = (physical page number, page offset).]
• all page mappings are in the page table, so hit/miss is determined solely by the valid bit (i.e., no tag)
• if we do it in software, where do we store the address translation table?
Actual (somewhat) hardware
[Figure: the page table register points at the page table. A 32-bit virtual address is split into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset; the virtual page number indexes the page table, whose entries each hold a valid bit and an 18-bit physical page number (if the valid bit is 0, the page is not present in memory). The physical page number is concatenated with the page offset to form the physical address.]
Notice: what do we have to do to save the context?
Making Address Translation Fast
• A cache for address translations: the translation-lookaside buffer (TLB)
[Figure: the TLB caches a few recently used page-table entries, each with a valid bit, a tag (the virtual page number), and the physical page address. On a TLB miss the full page table is consulted; its entries point either to physical memory (valid) or to disk storage (page not resident).]
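A minimal Python sketch of the resulting lookup order (TLB first, then the page table); the tiny dictionaries are purely illustrative:

    # TLB-then-page-table translation for 4 KB pages.
    PAGE_BITS = 12

    tlb        = {}                      # virtual page number -> physical page number
    page_table = {0x00042: 0x00007}      # hypothetical resident page

    def translate(vaddr):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & 0xFFF
        if vpn in tlb:                                # TLB hit: fast path
            ppn = tlb[vpn]
        elif vpn in page_table:                       # TLB miss: walk the page table
            ppn = tlb[vpn] = page_table[vpn]          # refill the TLB
        else:
            raise LookupError("page fault")           # page not resident in memory
        return (ppn << PAGE_BITS) | offset

    print(hex(translate(0x42ABC)))   # 0x7abc (first access misses the TLB)
    print(hex(translate(0x42DEF)))   # 0x7def (now a TLB hit)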
TLBs and caches
[Figure: a 32-bit virtual address = 20-bit virtual page number + 12-bit page offset. The TLB (valid, dirty, tag, physical page number) translates the virtual page number into a 20-bit physical page number on a TLB hit; the resulting physical address is then split into a 16-bit physical address tag, a 14-bit cache index, and a 2-bit byte offset for the data cache, which signals a cache hit and delivers 32-bit data.]
Memory system of the DECstation 3100
[Flowchart: a virtual address goes to the TLB. On a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address. For a read, try to read the data from the cache: on a cache miss, stall; on a cache hit, deliver the data to the CPU. For a write, check the write access bit: if it is off, raise a write protection exception; if it is on, write the data into the cache, update the tag, and put the data and the address into the write buffer.]
Modern systems: nightmare!

Characteristic      Intel Pentium Pro                           PowerPC 604
Virtual address     32 bits                                     52 bits
Physical address    32 bits                                     32 bits
Page size           4 KB, 4 MB                                  4 KB, selectable, and 256 MB
TLB organization    A TLB for instructions and a TLB for data   A TLB for instructions and a TLB for data
                    Both four-way set associative               Both two-way set associative
                    Pseudo-LRU replacement                      LRU replacement
                    Instruction TLB: 32 entries                 Instruction TLB: 128 entries
                    Data TLB: 64 entries                        Data TLB: 128 entries
                    TLB misses handled in hardware              TLB misses handled in hardware

Characteristic      Intel Pentium Pro                           PowerPC 604
Cache organization  Split instruction and data caches           Split instruction and data caches
Cache size          8 KB each for instructions/data             16 KB each for instructions/data
Cache associativity Four-way set associative                    Four-way set associative
Replacement         Approximated LRU replacement                LRU replacement
Block size          32 bytes                                    32 bytes
Write policy        Write-back                                  Write-back or write-through
Virtual Memory Key Points - I
• How does virtual memory provide:
  – protection?
  – sharing?
  – performance?
  – the illusion of a large main memory?
• Virtual memory requires twice as many memory accesses, so we cache page table entries in the TLB.
• Three things can go wrong on a memory access: cache miss, TLB miss, page fault.
Virtual Memory Key Points - II
• Processor speeds continue to increase very fast -- much faster than either DRAM or disk access times
• Design challenge: dealing with this growing disparity
• Trends:
  – synchronous SRAMs (provide a burst of data)
  – redesign DRAM chips to provide higher bandwidth or processing
  – restructure code to increase locality
  – use prefetching (make the cache visible to the ISA)