EECE476: Computer Architecture
Lecture 25: Chapter 7, Memory and Caches
The University of British Columbia EECE 476 © 2005 Guy Lemieux
Motivation for Caches: CPU vs. Memory Performance Gap
Memory is getting slower relative to CPU speeds (log scale!)
Goal: Make memory faster!
Importance of Cache Memory: Fast CPUs are Mostly Cache!

[Annotated die photo: 64 kB Data Cache, 64 kB Instr. Cache, Load/Store, Execution Unit, Fetch/Scan/Align, Micro-code, Bus Unit, HyperTransport, DDR Memory Interface, 1 MB Unified Instruction/Data Level 2 Cache, Floating-Point Unit, Memory Controller]

Total Area: 193 mm²
• 42% 1 MB L2 Cache, 4% Instr. Cache, 4% Data Cache (50% is cache)
• 13% HyperTransport, 10% DDR Memory (23% is I/O)
• 6% Fetch/Scan/etc., 4% Mem Controller, 4% FPU, 3% Exec Units, 2% Bus Unit (only 20% is actually CPU!)
Main Memory

• What to use for Main Memory?
  – SRAM
  – DRAM
  – SDRAM
  – RAMBUS
  – FLASH
  – Disk
Memory Technology

• SRAM: Static RAM
  – 6 transistors per bit
    • Expensive
  – Transistors configured as 2 inverters in a loop
    • Stable: positive feedback holds the value strongly (static)
    • Actively drives the bit value along bitlines to sense amps
  – Fast: can tune transistors and sense amps
    • Used to make cache memory!
• DRAM: Dynamic RAM
  – 1 transistor per bit
    • Inexpensive
  – Transistor holds charge (C)
    • Loses charge/value when driving the bitline (dynamic)
    • Transistor leaks charge over time (dynamic)
    • Must recharge transistor periodically (including after a data read)
  – Slow
    • Transistors are tiny and hold a small charge
    • Sense amps must detect a tiny change in voltage
[Diagrams: 6T SRAM cell with word (row select) line, bit/bit-bar lines, and cross-coupled inverters holding a 0/1; 1T DRAM cell with word (row select) line, a single bit line, and storage capacitor C]
Memory Technology

• SDRAM: Synchronous DRAM (not Static DRAM!)
  – New, around 1995-1996
  – Like DRAM, but pipelined (needs a clock!)
    • Pipeline register on Address inputs
    • Pipeline register on Data outputs
    • Sometimes additional registers in-between!
  – Multiple clock cycles to get data
    • Latency: CL = 2, 2.5, or 3 cycles
  – SDR vs. DDR
    • Single data rate: one data word transferred per clock cycle (SDR)
    • Double data rate: two data words per clock cycle (DDR, both edges)
  – Clock rate
    • SDR: PC100, PC133 is 100 MHz, 133 MHz
    • DDR: PC266, PC333, PC400 is 133 MHz, 167 MHz, 200 MHz
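As a rough sanity check on these module speeds, peak transfer rate follows from the clock rate and the transfers per cycle. A minimal sketch, assuming a standard 64-bit (8-byte) memory bus, which the slide does not state:

```python
# Peak-bandwidth sketch for SDR vs. DDR modules.
# The 8-byte bus width is an assumption, not slide material.
def peak_mb_per_s(clock_mhz, transfers_per_cycle, bus_bytes=8):
    # MB/s = millions of cycles/s * transfers/cycle * bytes/transfer
    return clock_mhz * transfers_per_cycle * bus_bytes

print(peak_mb_per_s(133, 1))  # SDR PC133 at 133 MHz -> 1064 MB/s
print(peak_mb_per_s(200, 2))  # DDR PC400 at 200 MHz, both edges -> 3200 MB/s
```

Doubling the transfers per cycle is why DDR roughly doubles bandwidth at the same clock rate.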
Memory Technology

• RAMBUS
  – New, around SDRAM time
  – More complex than SDR or DDR SDRAM
  – Faster clock rates (800 MHz!)
    • Fancy signaling on the circuit board
    • Narrow data width (16 bits)
    • Difficult to get working
    • Must license technology from Rambus Inc.
    • Rambus lawyers are costly, $$
  – Longer latency (e.g., ten cycles)
  – Overall memory speed higher (not by a lot!)
  – Only used on high-end server PCs (too costly)
Memory Technology

• FLASH Memory
  – Different beast: non-volatile
    • Keeps its data even when the power is turned off!
  – 1 transistor per bit (sometimes 0.5)
    • Very cheap
  – Operation
    • Trap charge in the floating (disconnected) gate of the transistor (tunneling)
    • Floating gate keeps the transistor turned on or off
    • Not leaky like DRAM
  – Not suitable for main memory
    • Physically wears out with use (100,000 writes)
    • Writes are very slow, reads are slow (70 ns)
Memory Technology Trends

• Semiconductor manufacturing processes
  – SRAM & logic: compatible
  – DRAM & logic: incompatible
  – FLASH memory = logic process + extra masks + some tweaking
• Impact on CPU
  – On-chip SRAM feasible
    • Can get FAST memory! (but at high cost)
  – On-chip DRAM possible, but unlikely
    • Cannot get BIG memory
  – On-chip FLASH may be feasible
    • Can store some non-volatile information
Memory Technology Trends
Memory is getting slower relative to CPU speeds (log scale!)
Recent Impact of Memory Speed

• 1996
  – 100 MHz CPU clock rate (10 ns)
  – 80 ns memory access time
  – Memory read: 8 CPU clock cycles
  – Add 8 pipeline stages just to access data memory?
    • DF+DS+DT+DF+DF+DS+DS+DE ?
• 2003
  – 3 GHz CPU clock rate (0.33 ns = 330 ps)
  – PC400 DDR (200 MHz, or 5 ns)
  – Memory read: 5 ns × 2 cycles = 10 ns = 30 CPU clock cycles
  – Add 30 pipeline stages? Impossible to keep up!
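The arithmetic above can be checked directly: a memory read's latency in CPU cycles is the memory access time divided by the CPU cycle time. A small sketch using the slide's numbers:

```python
# Memory latency expressed in CPU clock cycles, as on the slide.
def mem_latency_in_cpu_cycles(mem_ns, cpu_mhz):
    cpu_cycle_ns = 1000.0 / cpu_mhz      # ns per CPU clock cycle
    return mem_ns / cpu_cycle_ns

print(mem_latency_in_cpu_cycles(80, 100))    # 1996: 80 ns at 100 MHz -> 8.0 cycles
print(mem_latency_in_cpu_cycles(10, 3000))   # 2003: 10 ns at 3 GHz  -> 30.0 cycles
```

The gap widened from 8 to 30 cycles even though the raw memory access time fell from 80 ns to 10 ns.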
Memory Technology (1997)

Memory Technology   Access Time        Cost/MB
SRAM                5-25 ns            $100-$250
SDRAM               50-60 ns           $10-$20   (today: cheaper than DRAM)
DRAM                60-120 ns          $5-$10
Disk                10-20 million ns   $0.10-$0.20
Cache Memory

• Problem:
  – SRAM fast, but costly
  – DRAM cheap, but slow
• Solution: Cache
  – Small SRAM memory
  – Holds frequently-used data
  – Logically, insert it between the CPU and main memory
• The Memory Hierarchy is born
  – Generally, use cheaper/bigger/slower memory as you move farther away from the CPU
• Question: How to access the cache SRAM?
Memory Hierarchy

[Figure: pyramid of multiple levels of memory, with the CPU at the top above Level 1, Level 2, ..., Level n; access time increases with distance from the CPU, and the size of the memory grows at each level]
Memory Hierarchy

[Figure: hierarchy from CPU registers through SRAM, SDRAM, and disk and/or tape; moving down, size goes from smallest to biggest, cost ($/bit) from highest to lowest, and speed from fastest to slowest]
Accessing a Cache

• Cache: "hide" in French; a safe place to hide things
• Important concept: transparent to user/software!
  – Wish to speed up ALL programs
    • Do not want to rewrite old programs
    • Do not want to write programs to specifically use the cache
• How to hide? Need a general cache management policy
  – The CPU manages the cache itself (NOT managed by software)
  – Load data
    • If the data is in the cache, retrieve it from the cache
    • Else, retrieve it from main memory and put a copy in the cache
  – Store data (write-through, no-alloc-on-write policy)
    • If the data is in the cache, write to that cache location and to memory
    • Else, write the data to memory only
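The load and store rules above can be sketched as a small model. This is illustrative only: plain dicts stand in for the cache and main memory, and capacity limits and eviction are ignored for clarity:

```python
# Sketch of the slide's policy: loads allocate on a miss,
# stores are write-through with no allocation on a write miss.
class WriteThroughCache:
    def __init__(self):
        self.cache = {}    # address -> data (capacity ignored here)
        self.memory = {}   # stand-in for main memory

    def load(self, addr):
        if addr in self.cache:               # hit: retrieve from the cache
            return self.cache[addr]
        data = self.memory.get(addr, 0)      # miss: retrieve from main memory
        self.cache[addr] = data              # ...and put a copy in the cache
        return data

    def store(self, addr, data):
        if addr in self.cache:               # hit: update the cache copy...
            self.cache[addr] = data
        self.memory[addr] = data             # ...and always write memory

c = WriteThroughCache()
c.store(0x40, 7)          # write miss: memory updated, nothing cached
print(0x40 in c.cache)    # False (no-alloc-on-write)
print(c.load(0x40))       # 7, fetched from memory and now cached
```

Because every store also writes memory, memory never holds stale data under this policy; the price is memory traffic on every write.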
Using a Cache

• Problems
  – Finding the existing location for data in the cache?
  – Finding a new location for new cache data?
  – Cache is full?
    • Finding a location that is no longer needed
    • Must evict data presently in the cache
• Various solutions
  – Different styles of caches!
Associative Cache

• Choosing a location
  – An associative cache is very flexible
  – New data: any location
  – Find existing data: must search all locations
  – Difficult, but not impossible
    • CAM: content-addressable memory
      – Searches all locations (addresses) in "1 cycle"
      – Reports the "match" location
      – The match location holds the data
• Cache is full?
  – Must throw out old data
  – Need a replacement or eviction policy
Associative Cache: Replacement Policies

• Associative cache is full? Possible replacement policies:
  – Ideal
    • Non-causal: cannot predict what the CPU will do in the future!
    • CPU architects use simulation to find the performance of an ideal cache
  – Least Frequently Accessed
    • Count # of accesses, evict the location accessed the least
    • Problem: you will always choose to evict NEW DATA (its count starts at zero)
  – Least Recently Used (LRU)
    • Timestamp every time you use data in the cache
    • The location with the oldest timestamp is evicted
  – Pseudo-LRU
    • Periodically "age" the contents of the cache
    • Flag data every time it is used
    • A location with "aged" status is evicted
  – RANDOM works too!
    • (LRU or pseudo-LRU is slightly better, so is commonly used)
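LRU can be sketched compactly for a small fully-associative cache. A minimal model, not a hardware design: an `OrderedDict` plays the role of the timestamps, with "move to end" marking an entry as most recently used:

```python
# LRU eviction sketch for a small fully-associative cache.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()           # least recently used entry first

    def access(self, tag, data=None):
        if tag in self.lines:
            self.lines.move_to_end(tag)      # refresh this entry's "timestamp"
            return self.lines[tag]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used
        self.lines[tag] = data
        return data

c = LRUCache(capacity=2)
c.access('A', 1)
c.access('B', 2)
c.access('A')        # touch A, so B becomes least recently used
c.access('C', 3)     # cache full: B is evicted, A survives
print(list(c.lines)) # ['A', 'C']
```

Real hardware avoids full timestamps (too costly per line), which is exactly why the pseudo-LRU approximation above exists.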
Direct-mapped Cache

• Choosing a location
  – Much more restrictive than an associative cache
  – New data: one eligible location
  – Find existing data: search one location only
  – Location: use the lower bits of the data address
  – Easy to use SRAM, fast access!
• Cache is full? Replacement is easy...
  – Only one eligible location
  – Must evict the old data there
Direct-mapped Cache

[Figure: an 8-location direct-mapped cache (locations 000-111) below a row of memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101. Each address in memory maps to only one location in a direct-mapped cache; the lowest 3 bits of the address determine the location, so addresses ending in 001 map to location 001 and addresses ending in 101 map to location 101]
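The figure's mapping can be reproduced in a couple of lines: with 8 cache locations, the location is simply the address modulo 8, i.e., its lowest 3 bits. A sketch using the addresses from the slide:

```python
# Direct-mapped placement: the lowest 3 bits of the address select
# one of the 8 cache locations.
NUM_LOCATIONS = 8                            # 2**3

def location(addr):
    return addr % NUM_LOCATIONS              # same as addr & 0b111

for a in [0b00001, 0b00101, 0b01001, 0b01101,
          0b10001, 0b10101, 0b11001, 0b11101]:
    print(f"address {a:05b} -> cache location {location(a):03b}")
# addresses ending in 001 land in location 001; those ending in 101 in 101
```

This is why a direct-mapped lookup is so fast: the location is computed with no search at all.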
Direct-mapped Cache

[Figure: direct-mapped cache datapath. The 32-bit memory address (bits 31..0) splits into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit Byte offset (bits 1-0). The Index selects one of 1024 entries (0, 1, 2, ..., 1021, 1022, 1023), each holding a Valid bit (V), a 20-bit Tag, and 32 bits of Data; comparing the stored tag against the address tag produces the Hit signal]
Cache size: 1024 locations × 4 data bytes each = 4 kB of cache data

Overhead: 1024 locations × 21 bits (Tag + V) = 2.625 kB of tag bits (more than 50% overhead!)
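The size and overhead figures above follow mechanically from the address split. A short sketch of the arithmetic for this 1024-entry, 4-byte-per-line cache with 32-bit addresses:

```python
# Size/overhead arithmetic for the direct-mapped cache on the slide.
entries    = 1024                      # selected by 10 index bits
line_bytes = 4                        # 2 byte-offset bits
addr_bits  = 32
tag_bits   = addr_bits - 10 - 2        # 20 tag bits per entry

data_kb     = entries * line_bytes / 1024            # 4.0 kB of data
overhead_kb = entries * (tag_bits + 1) / 8 / 1024    # +1 valid bit -> 2.625 kB

print(data_kb, overhead_kb)            # 4.0 2.625
print(overhead_kb / data_kb)           # ~0.66: more than 50% overhead
```

Small lines are what make the overhead so large here; with bigger lines (more data bytes per tag), the same 21 bits of bookkeeping would be amortized over far more data.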