CSE 502: Computer Architecture
Memory Hierarchy & Caches
This Lecture
• Memory Hierarchy
• Caches / SRAM
• Cache Organization and Optimizations
[Graph: performance vs. year, 1985 to 2010, log scale (1 to 10000)]
Motivation
• Want memory to appear:
– As fast as CPU
– As large as required by all of the running applications
[Diagram: Processor connected to Memory]
Storage Hierarchy
• Make common case fast:
– Common: temporal & spatial locality
– Fast: smaller, more expensive memory
[Hierarchy annotations: registers and caches are controlled by hardware, the lower levels by software (the OS); levels get larger, cheaper, and use bigger transfers toward the bottom, and get faster with more bandwidth toward the top]
Registers
Caches (SRAM)
Memory (DRAM)
[SSD? (Flash)]
Disk (Magnetic Media)
What is S(tatic)RAM vs D(ynamic)RAM?
Caches
• An automatically managed hierarchy
• Break memory into blocks (several bytes) and transfer data to/from cache in blocks
– spatial locality
• Keep recently accessed blocks
– temporal locality
[Diagram: Core, $ (cache), Memory]
Cache Terminology
• block (cache line): minimum unit that may be cached
• frame: cache storage location to hold one block
• hit: block is found in the cache
• miss: block is not found in the cache
• miss ratio: fraction of references that miss
• hit time: time to access the cache
• miss penalty: time to replace block on a miss
Cache Example
• Address sequence from core (assume 8-byte lines):
0x10000 → Miss
0x10004 → Hit
0x10120 → Miss
0x10008 → Miss
0x10124 → Hit
0x10004 → Hit
Memory lines brought in: 0x10000 (…data…), 0x10120 (…data…), 0x10008 (…data…)
Final miss ratio is 50%
AMAT (1/2)
• Very powerful tool to estimate performance
• If…
cache hit is 10 cycles (core to L1 and back)
memory access is 100 cycles (core to mem and back)
• Then…
at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
at 10% miss ratio, avg. access: 0.9×10 + 0.1×100 = 19
at 1% miss ratio, avg. access: 0.99×10 + 0.01×100 ≈ 11
AMAT (2/2)
• Generalizes nicely to any-depth hierarchy
• If…
L1 cache hit is 5 cycles (core to L1 and back)
L2 cache hit is 20 cycles (core to L2 and back)
memory access is 100 cycles (core to mem and back)
• Then…
at 20% miss ratio in L1 and 40% miss ratio in L2 …
avg. access: 0.8×5 + 0.2×(0.6×20 + 0.4×100) ≈ 14
Memory Organization (1/3)
[Diagram: Processor containing Registers, L1 I-Cache, L1 D-Cache, I-TLB, D-TLB, and L2 Cache; below it, L3 Cache (LLC) and Main Memory (DRAM)]
Memory Organization (2/3)
[Diagram: Core 0 and Core 1, each with its own Registers, L1 I-Cache, L1 D-Cache, I-TLB, D-TLB, and L2 Cache; shared L3 Cache (LLC) and Main Memory (DRAM)]
Multi-core replicates the top of the hierarchy
Memory Organization (3/3)
[Die photo: Intel Nehalem (3.3GHz, 4 cores, 2 threads per core); per core: 32K L1-I, 32K L1-D, 256K L2]
SRAM Overview
• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to cell involves over-powering storage inverters
[Diagram: “6T SRAM” cell; cross-coupled inverters (2T per inverter) holding complementary values, plus 2 access gates to bitlines b and b̄]
8-bit SRAM Array
[Diagram: a single wordline driving eight cells, one bitline pair per cell]
Fully-Associative Cache
• Keep blocks in cache frames
– data
– state (e.g., valid)
– address tag
[Diagram: address split as tag[63:6] | block offset[5:0]; every frame’s tag is compared in parallel (=), any match raises hit?, and a multiplexor selects the matching frame’s data]
What happens when the cache runs out of space?
The 3 C’s of Cache Misses
• Compulsory: Never accessed before
• Capacity: Accessed long ago and already replaced
• Conflict: Neither compulsory nor capacity (later today)
• Coherence: (To appear in multi-core lecture)
Cache Size
• Cache size is data capacity (don’t count tag and state)
– Bigger can exploit temporal locality better
– Not always better
• Too large a cache
– Smaller is faster, bigger is slower
– Access time may hurt critical path
• Too small a cache
– Limited temporal locality
– Useful data constantly replaced
[Graph: hit rate vs. capacity; hit rate climbs until capacity covers the working set size, then flattens]
Block Size
• Block size is the data that is
– Associated with an address tag
– Not necessarily the unit of transfer between hierarchies
• Too small a block
– Don’t exploit spatial locality well
– Excessive tag overhead
• Too large a block
– Useless data transferred
– Too few total blocks
– Useful data frequently replaced
[Graph: hit rate vs. block size; rises, then falls once blocks get too large]
8×8-bit SRAM Array
[Diagram: a 1-of-8 decoder drives one of eight wordlines; eight bitline pairs]
64×1-bit SRAM Array
[Diagram: a 1-of-8 row decoder drives one of eight wordlines; a 1-of-8 column mux selects one of eight bitline pairs]
Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison
[Diagram: address split as tag[63:16] | index[15:6] | block offset[5:0]; a decoder selects one frame (state, tag, data), a single comparator produces tag match (hit?), and a multiplexor selects the data]
Why take index bits out of the middle?
Cache Conflicts
• What if two blocks alias on a frame?
– Same index, but different tags
Address sequence (tag | index | block offset):
0xDEADBEEF 11011110101011011011111011101111
0xFEEDBEEF 11111110111011011011111011101111
0xDEADBEEF 11011110101011011011111011101111
• 0xDEADBEEF experiences a Conflict miss
– Not Compulsory (seen it before)
– Not Capacity (lots of other indexes available in cache)
Associativity (1/2)
• Fully-associative: block goes in any frame (all frames in 1 set)
• Direct-mapped: block goes in exactly one frame (1 frame per set)
• Set-associative: block goes in any frame in one set (frames grouped in sets)
[Diagram: 8 frames numbered 0-7; fully-associative is one 8-frame set, direct-mapped is 8 one-frame sets, 2-way set-associative is 4 sets (0-3) of 2 frames each]
• Where does block index 12 (b’1100) go?
Associativity (2/2)
• Larger associativity
– lower miss rate (fewer conflicts)
– higher power consumption
• Smaller associativity
– lower cost
– faster hit time
[Graph: hit rate vs. associativity; diminishing returns beyond ~5 for L1-D]
N-Way Set-Associative Cache
[Diagram: address split as tag[63:15] | index[14:6] | block offset[5:0]; a decoder selects one set, each way stores state, tag, and data with a per-way comparator (=) and multiplexor, and a final multiplexor picks the hitting way (hit?)]
Note the additional bit(s) moved from index to tag
Associative Block Replacement
• Which block in a set to replace on a miss?
• Ideal replacement (Belady’s Algorithm)
– Replace block accessed farthest in the future
– Trick question: How do you implement it?
• Least Recently Used (LRU)
– Optimized for temporal locality (expensive for >2-way)
• Not Most Recently Used (NMRU)
– Track MRU, random select among the rest
• Random
– Nearly as good as LRU, sometimes better (when?)
Victim Cache (1/2)
• Associativity is expensive
– Performance from extra muxes
– Power from reading and checking more tags and data
• Conflicts are expensive
– Performance from extra misses
• Observation: Conflicts don’t occur in all sets
Victim Cache (2/2)
Provide “extra” associativity, but not for all sets
[Animation: an access sequence of blocks A B C D E and J K L M N mapping to two sets of a 4-way Set-Associative L1 Cache. ABCDE and JKLMN do not “fit” in a 4-way set-associative cache, so every access is a miss. Adding a small Fully-Associative Victim Cache catches the evicted blocks: the victim cache provides a “fifth way” so long as only four sets overflow into it at the same time. Can even provide 6th or 7th … ways.]
Parallel vs Serial Caches
• Tag and Data usually separate (tag is smaller & faster)
– State bits stored along with tags
• Valid bit, “LRU” bit(s), …
[Diagrams: in the parallel organization, tag compare (=), valid?, and the data read all proceed together, with the compare result gating hit? and the data mux; in the serial organization, the tag compare finishes first and its result enables the data read]
Parallel access to Tag and Data reduces latency (good for L1)
Serial access to Tag and Data reduces power (good for L2+)
Way-Prediction and Partial Tagging
• Sometimes even parallel tag and data is too slow
– Usually happens in high-associativity L1 caches
• If you can’t wait… guess
Way Prediction: guess based on history, PC, etc…
Partial Tagging: use part of tag to pre-select mux
[Diagrams: the prediction (or partial tag match) selects a way early so data can be read immediately; the full tag compare then confirms the guess (correct?)]
Guess and verify, extra cycle(s) if not correct
Physically-Indexed Caches
• Core requests are VAs
• Cache index is PA[15:6]
– VA passes through TLB
– D-TLB on critical path
• Cache tag is PA[63:16]
• If index size < page size
– Can use VA for index
[Diagram: Virtual Address split as virtual page[63:13] | page offset[12:0] and as tag[63:14] | index[13:6] | block offset[5:0]; index[6:0] comes straight from the page offset, while the D-TLB supplies the remaining physical index bit(s) and the physical tag for the comparators (=)]
Simple, but slow. Can we do better?
Virtually-Indexed Caches
• Core requests are VAs
• Cache index is VA[15:6]
• Cache tag is PA[63:16]
• Why not tag with VA?
– Cache flush on ctx switch
• Virtual aliases
– Ensure they don’t exist
– … or check all on miss
[Diagram: Virtual Address split as virtual page[63:13] | page offset[12:0] and as tag[63:14] | index[13:6] | block offset[5:0]; virtual index[7:0] is taken directly from the VA (one bit overlaps the virtual page number), while the D-TLB supplies the physical tag for the comparators (=)]
Fast, but complicated
Compute Infrastructure
• “rocks” cluster and several VMs
– Use your “Unix” login and password
• allv21.all.cs.sunysb.edu
• sbrocks.cewit.stonybrook.edu
– Once on rocks head node, run “condor_status”
• Pick a machine and use ssh to log into it (e.g., compute-4-4)
– Don’t use VM for Homework
• I already sent out details regarding software
– Including MARSS sources and path to boot image
– Go through
• http://marss86.org/~marss86/index.php/Getting_Started
• http://cloud.github.com/downloads/avadhpatel/marss/Marss_ISCA_2012_tutorial.pdf
Send email to list in case you encounter problems
Homework #1 (of 1) part 2
• Already have accurate & precise time measurement
• Extend your program to…
– Measure how many cache levels there are in the CPU
– Measure the hit time of each cache level
• Output should resemble this:
123..890 Loop: 2500000000ns
Cache Latencies: 2ns 6ns 20ns 150ns
Note: You must measure it, not read it from MSRs
Homework #1 Hints
• Instructions take time – use them wisely
– Use high optimization level in compiler (-O3)
– Check assembly to see instructions (use -S flag in compiler)
– Independent instructions run in parallel (don’t count time)
– Dependent instructions run serially (precisely count time)
– Use random walks in memory (sequential will trick you)
• Smallest linked-list walk:
void* list[100];
for (int y = 0; y < 100; ++y) list[y] = &list[(y+1) % 100];
void** x = (void**)&list[0];
x = (void**)*x; // mov (%rax),%rax
Warm-Up Project (1/3)
• Build an L1-D cache array
• Determine groups
– Email member list to Connor and CC me (by Feb 20)
• Must use SRAM<> module for storage
• Grading reminder:
Warm-up Project / Points
1 Port, Direct-Mapped, 16K: 10
2 Port, Direct-Mapped, 16K: 12
1 Port, Set-associative, 16K: 14
2 Port, DM, 64K, Pipelined: 16
2 Port, SA, 64K, Pipelined: 18
2 Port, SA, 64K, Pipel., Way-pred.: 20
Warm-Up Project (2/3)
L1-D module interface:
clk <bool>
reset <bool>
in_ena[#] <bool>
in_is_insert[#] <bool>
in_has_data[#] <bool>
in_needs_data[#] <bool>
in_addr[#] <uint<64>>
in_data[#] <bv<8*blocksz>>
in_update_state[#] <bool>
in_new_state[#] <uint<8>>
out_ready[#] <bool>
out_token[#] <uint<64>>
out_addr[#] <uint<64>>
out_state[#] <uint<8>>
out_data[#] <bv<8*blocksz>>
port_available[#] <bool>
Warm-Up Project (3/3)
SRAM<width, depthLog2, ports> interface:
clk <bool>
ena[#] <bool>
addr[#] <bv<depthLog2>>
we[#] <bool>
din[#] <bv<width>>
dout[#] <bv<width>>
Multiple Accesses per Cycle
• Need high-bandwidth access to caches
– Core can make multiple access requests per cycle
– Multiple cores can access LLC at the same time
• Must either delay some requests, or…
– Design SRAM with multiple ports
• Big and power-hungry
– Split SRAM into multiple banks
• Can result in delays, but usually not
Multi-Ported SRAMs
[Diagram: one cell with two wordlines (Wordline1, Wordline2) and two bitline pairs (b1/b̄1, b2/b̄2)]
Wordlines = 1 per port
Bitlines = 2 per port
Area = O(ports²)
Multi-Porting vs Banking
[Diagrams: a 4-ported SRAM array with four decoders, four sets of sense amps, and column muxing, versus four 1-ported banks, each with its own decoder, array, and sense amps]
• 4 ports: big (and slow), but guarantees concurrent access
• 4 banks, 1 port each: each bank small (and fast), but conflicts (delays) possible
How to decide which bank to go to?
Bank Conflicts
• Banks are address interleaved
– For block size b cache with N banks…
– Bank = (Address / b) % N
– Looks like just low-order bits of the index, but modern processors perform hashed cache indexing
• May randomize bank and index
• XOR some low-order tag bits with bank/index bits (why XOR?)
• Banking can provide high bandwidth
• But only if all accesses are to different banks
– For 4 banks, 2 accesses, chance of conflict is 25%
On-chip Interconnects (1/5)
• Today, (Core+L1+L2) = “core”
– (L3+I/O+Memory) = “uncore”
• How to interconnect multiple “core”s to “uncore”?
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Diagram: four cores with private caches on a shared bus to the LLC $ and Memory Controller]
On-chip Interconnects (2/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Diagram: crossbar connecting four cores to four LLC banks ($ Bank 0-3) and the Memory Controller]
Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core)
On-chip Interconnects (3/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Diagram: ring connecting four cores, four LLC banks ($ Bank 0-3), and the Memory Controller]
• 3 ports per switch
• Simple and cheap
• Can be bi-directional to reduce latency
Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core)
On-chip Interconnects (4/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Diagram: mesh of tiles, each combining a core, its private $, and an LLC bank ($ Bank 0-7), plus the Memory Controller]
• Up to 5 ports per switch
• Tiled organization combines core and cache
Tilera Tile64 (866MHz, 64 cores)
On-chip Interconnects (5/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus
[Diagram: torus of tiles (cores with $ and LLC Banks 0-7) plus the Memory Controller, with wrap-around links]
• 5 ports per switch
• Can be “folded” to avoid long links
Write Policies
• Writes are more interesting
– On reads, tag and data can be accessed in parallel
– On writes, needs two steps
– Is access time important for writes?
• Choices of Write Policies
– On write hits, update memory?
• Yes: write-through (higher bandwidth)
• No: write-back (uses Dirty bits to identify blocks to write back)
– On write misses, allocate a cache block frame?
• Yes: write-allocate
• No: no-write-allocate
Inclusion
• Core often accesses blocks not present on chip
– Should block be allocated in L3, L2, and L1?
• Called Inclusive caches
• Waste of space
• Requires forced evict (e.g., force evict from L1 on evict from L2+)
– Only allocate blocks in L1
• Called Non-inclusive caches (why not “exclusive”?)
• Must write back clean lines
• Some processors combine both
– L3 is inclusive of L1 and L2
– L2 is non-inclusive of L1 (like a large victim cache)
Parity & ECC
• Cosmic radiation can strike at any time
– Especially at high altitude
– Or during solar flares
• What can be done?
– Parity
• 1 bit to indicate if sum is odd/even (detects single-bit errors)
– Error Correcting Codes (ECC)
• 8-bit code per 64-bit word
• Generally SECDED (Single-Error-Correct, Double-Error-Detect)
• Detecting errors on clean cache lines is harmless
– Pretend it’s a cache miss
[Example: a 16-bit data word stored with its parity bit]
Homework #1 (of 1) part 3
• Extend your program to…
– Measure and report the block size of each cache
– Measure the number of sets in each cache
– Measure the associativity of each cache
– Compute the total capacity of each cache
• Output should resemble this:
123..890 Loop: 2500000000ns
Cache Latencies: 2ns 6ns 20ns 150ns
Caches: 64/256x4(64KB) 64/512x8(256KB) 128/8192x24(12MB)