CSE 502: Computer Architecture
Memory Hierarchy & Caches

Upload: apollo

Post on 11-Jan-2016



TRANSCRIPT

Page 1: CSE 502: Computer Architecture

CSE 502: Computer Architecture

Memory Hierarchy & Caches

Page 2: CSE 502: Computer Architecture

This Lecture
• Memory Hierarchy
• Caches / SRAM
• Cache Organization and Optimizations

Page 3: CSE 502: Computer Architecture

[Figure: performance (log scale, 1 to 10000) vs. year (1985 to 2010), showing the growing processor vs. memory performance gap]

Motivation

• Want memory to appear:
– As fast as CPU
– As large as required by all of the running applications

Processor

Memory

Page 4: CSE 502: Computer Architecture

Storage Hierarchy
• Make common case fast:
– Common: temporal & spatial locality
– Fast: smaller, more expensive memory

Registers
Caches (SRAM)
Memory (DRAM)
[SSD? (Flash)]
Disk (Magnetic Media)
Upper levels are controlled by hardware; lower levels by software (OS).
Moving down the hierarchy: larger, cheaper, bigger transfers. Moving up: faster, more bandwidth.

What is S(tatic)RAM vs D(ynamic)RAM?

Page 5: CSE 502: Computer Architecture

Caches

• An automatically managed hierarchy

• Break memory into blocks (several bytes) and transfer data to/from cache in blocks
– spatial locality

• Keep recently accessed blocks
– temporal locality

[Diagram: Core ↔ $ ↔ Memory]

Page 6: CSE 502: Computer Architecture

Cache Terminology

• block (cache line): minimum unit that may be cached
• frame: cache storage location to hold one block
• hit: block is found in the cache
• miss: block is not found in the cache
• miss ratio: fraction of references that miss
• hit time: time to access the cache
• miss penalty: time to replace block on a miss

Page 7: CSE 502: Computer Architecture

Cache Example
• Address sequence from core (assume 8-byte lines):
0x10000 → Miss
0x10004 → Hit
0x10120 → Miss
0x10008 → Miss
0x10124 → Hit
0x10004 → Hit
Memory blocks fetched: 0x10000 (…data…), 0x10120 (…data…), 0x10008 (…data…)
Final miss ratio is 50%

Page 8: CSE 502: Computer Architecture

AMAT (1/2)
• Very powerful tool to estimate performance

• If …
cache hit is 10 cycles (core to L1 and back)
memory access is 100 cycles (core to mem and back)

• Then …
at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
at 10% miss ratio, avg. access: 0.9×10 + 0.1×100 = 19
at 1% miss ratio, avg. access: 0.99×10 + 0.01×100 ≈ 11

Page 9: CSE 502: Computer Architecture

AMAT (2/2)
• Generalizes nicely to any-depth hierarchy

• If …
L1 cache hit is 5 cycles (core to L1 and back)
L2 cache hit is 20 cycles (core to L2 and back)
memory access is 100 cycles (core to mem and back)

• Then, at 20% miss ratio in L1 and 40% miss ratio in L2 …
avg. access: 0.8×5 + 0.2×(0.6×20 + 0.4×100) ≈ 14

Page 10: CSE 502: Computer Architecture

Memory Organization (1/3)
Processor: Registers, L1 I-Cache, L1 D-Cache, I-TLB, D-TLB, L2 Cache
L3 Cache (LLC)
Main Memory (DRAM)

Page 11: CSE 502: Computer Architecture

Memory Organization (2/3)
Core 0: Registers, L1 I-Cache, L1 D-Cache, I-TLB, D-TLB, L2 Cache
Core 1: Registers, L1 I-Cache, L1 D-Cache, I-TLB, D-TLB, L2 Cache
Shared: L3 Cache (LLC), Main Memory (DRAM)
Multi-core replicates the top of the hierarchy

Page 12: CSE 502: Computer Architecture

Memory Organization (3/3)
[Die photo: Intel Nehalem (3.3GHz, 4 cores, 2 threads per core) with per-core 32K L1-I, 32K L1-D, and 256K L2]

Page 13: CSE 502: Computer Architecture

SRAM Overview

• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to cell involves over-powering storage inverters

[Diagram: “6T SRAM” cell: 2T per inverter plus 2 access gates, with bitlines b and b̄]

Page 14: CSE 502: Computer Architecture

8-bit SRAM Array
[Diagram: one wordline enables a row of cells; data moves on the bitlines]

Page 15: CSE 502: Computer Architecture

Fully-Associative Cache

• Keep blocks in cache frames
– data
– state (e.g., valid)
– address tag

Address: tag[63:6] | block offset[5:0]
[Diagram: every frame’s tag is compared against the address tag in parallel (hit?); a multiplexor selects the matching frame’s data]

What happens when the cache runs out of space?

Page 16: CSE 502: Computer Architecture

The 3 C’s of Cache Misses

• Compulsory: never accessed before
• Capacity: accessed long ago and already replaced
• Conflict: neither compulsory nor capacity (later today)
• Coherence: (to appear in multi-core lecture)

Page 17: CSE 502: Computer Architecture

Cache Size
• Cache size is data capacity (don’t count tag and state)
– Bigger can exploit temporal locality better
– Not always better

• Too large a cache
– Smaller is faster, bigger is slower
– Access time may hurt critical path

• Too small a cache
– Limited temporal locality
– Useful data constantly replaced

[Plot: hit rate rises with capacity, flattening once capacity exceeds the working set size]

Page 18: CSE 502: Computer Architecture

Block Size
• Block size is the data that is
– Associated with an address tag
– Not necessarily the unit of transfer between hierarchies

• Too small a block
– Don’t exploit spatial locality well
– Excessive tag overhead

• Too large a block
– Useless data transferred
– Too few total blocks
• Useful data frequently replaced

[Plot: hit rate vs. block size peaks at an intermediate block size]

Page 19: CSE 502: Computer Architecture

8×8-bit SRAM Array
[Diagram: a 1-of-8 decoder drives one of 8 wordlines; the bitlines read out the selected 8-bit row]

Page 20: CSE 502: Computer Architecture

64×1-bit SRAM Array
[Diagram: a 1-of-8 decoder drives one of 8 wordlines; a column mux with its own 1-of-8 decoder selects 1 of 8 bitline pairs]

Page 21: CSE 502: Computer Architecture

Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison

Address: tag[63:16] | index[15:6] | block offset[5:0]
[Diagram: a decoder selects one frame by index; that frame’s tag is compared for a tag match (hit?), and a multiplexor selects the data]

Why take index bits out of the middle?

Page 22: CSE 502: Computer Architecture

Cache Conflicts
• What if two blocks alias on a frame?
– Same index, but different tags

Address sequence:
0xDEADBEEF 11011110101011011011111011101111
0xFEEDBEEF 11111110111011011011111011101111
0xDEADBEEF 11011110101011011011111011101111
(tag | index | block offset)

• 0xDEADBEEF experiences a Conflict miss
– Not Compulsory (seen it before)
– Not Capacity (lots of other indexes available in cache)

Page 23: CSE 502: Computer Architecture

Associativity (1/2)

• Fully-associative: block goes in any frame (all frames in 1 set)
• Direct-mapped: block goes in exactly one frame (1 frame per set)
• Set-associative: block goes in any frame in one set (frames grouped in sets)

[Diagram: 8 frames shown as 8 sets of 1 (direct-mapped), 4 sets of 2 (set-associative), and 1 set of 8 (fully-associative)]

• Where does block index 12 (b’1100) go?

Page 24: CSE 502: Computer Architecture

Associativity (2/2)
• Larger associativity
– lower miss rate (fewer conflicts)
– higher power consumption

• Smaller associativity
– lower cost
– faster hit time

[Plot: hit rate vs. associativity flattens around ~5 ways for an L1-D]

Page 25: CSE 502: Computer Architecture

N-Way Set-Associative Cache

Address: tag[63:15] | index[14:6] | block offset[5:0]
[Diagram: a decoder selects one set; each way’s tag in that set is compared in parallel (hit?), and multiplexors select the hitting way’s data]

Note the additional bit(s) moved from index to tag

Page 26: CSE 502: Computer Architecture

Associative Block Replacement
• Which block in a set to replace on a miss?
• Ideal replacement (Belady’s Algorithm)
– Replace block accessed farthest in the future
– Trick question: How do you implement it?

• Least Recently Used (LRU)
– Optimized for temporal locality (expensive for >2-way)

• Not Most Recently Used (NMRU)
– Track MRU, random select among the rest

• Random
– Nearly as good as LRU, sometimes better (when?)

Page 27: CSE 502: Computer Architecture

Victim Cache (1/2)

• Associativity is expensive
– Performance from extra muxes
– Power from reading and checking more tags and data

• Conflicts are expensive
– Performance from extra misses

• Observation: Conflicts don’t occur in all sets

Page 28: CSE 502: Computer Architecture

Victim Cache (2/2)
Provide “extra” associativity, but not for all sets

[Example: a 4-way set-associative L1 cache plus a small fully-associative victim cache. Access sequences ABCDE and JKLMN do not “fit” in a 4-way set-associative cache, so every access is a miss. The victim cache provides a “fifth way” so long as only four sets overflow into it at the same time. Can even provide 6th or 7th … ways.]

Page 29: CSE 502: Computer Architecture

Parallel vs Serial Caches
• Tag and Data usually separate (tag is smaller & faster)
– State bits stored along with tags
• Valid bit, “LRU” bit(s), …

[Diagram: parallel access to Tag and Data reduces latency (good for L1); serial access to Tag and Data, with data reads enabled by the tag match, reduces power (good for L2+)]

Page 30: CSE 502: Computer Architecture

Way-Prediction and Partial Tagging
• Sometimes even parallel tag and data is too slow
– Usually happens in high-associativity L1 caches

• If you can’t wait… guess
– Way Prediction: guess based on history, PC, etc…
– Partial Tagging: use part of the tag to pre-select the mux

Guess and verify; extra cycle(s) if not correct

Page 31: CSE 502: Computer Architecture

Physically-Indexed Caches

• Core requests are VAs

• Cache index is PA[15:6]
– VA passes through TLB
– D-TLB on critical path

• Cache tag is PA[63:16]

• If index size < page size
– Can use VA for index

Virtual address: virtual page[63:13] | page offset[12:0]
[Diagram: the D-TLB translates the virtual page number; the physical tag and the index bits above the page offset come from translation, while the low index bits come straight from the page offset]

Simple, but slow. Can we do better?

Page 32: CSE 502: Computer Architecture

Virtually-Indexed Caches

• Core requests are VAs
• Cache index is VA[15:6]
• Cache tag is PA[63:16]

• Why not tag with VA?
– Cache flush on ctx switch

• Virtual aliases
– Ensure they don’t exist
– … or check all on miss

Virtual address: virtual page[63:13] | page offset[12:0]
[Diagram: the virtual index selects the set in parallel with the D-TLB lookup; the physical tag is then compared. One index bit overlaps the virtual page number.]

Fast, but complicated

Page 33: CSE 502: Computer Architecture

Compute Infrastructure
• “rocks” cluster and several VMs
– Use your “Unix” login and password
• allv21.all.cs.sunysb.edu
• sbrocks.cewit.stonybrook.edu
– Once on rocks head node, run “condor_status”
• Pick a machine and use ssh to log into it (e.g., compute-4-4)
– Don’t use VM for Homework

• I already sent out details regarding software
– Including MARSS sources and path to boot image
– Go through
• http://marss86.org/~marss86/index.php/Getting_Started
• http://cloud.github.com/downloads/avadhpatel/marss/Marss_ISCA_2012_tutorial.pdf

Send email to the list in case you encounter problems

Page 34: CSE 502: Computer Architecture

Homework #1 (of 1) part 2
• Already have accurate & precise time measurement

• Extend your program to…
– Measure how many cache levels there are in the CPU
– Measure the hit time of each cache level

• Output should resemble this:
123..890 Loop: 2500000000ns
Cache Latencies: 2ns 6ns 20ns 150ns

Note: You must measure it, not read it from MSRs

Page 35: CSE 502: Computer Architecture

Homework #1 Hints
• Instructions take time – use them wisely
– Use high optimization level in compiler (-O3)
– Check assembly to see instructions (use -S flag in compiler)
– Independent instructions run in parallel (don’t count time)
– Dependent instructions run serially (precisely count time)
– Use random walks in memory (sequential will trick you)

• Smallest linked-list walk:
void* list[100];
for (int y = 0; y < 100; ++y) list[y] = &list[(y+1) % 100];
void* x = &list[0];
x = *(void**)x; // mov (%rax),%rax

Page 36: CSE 502: Computer Architecture

Warm-Up Project (1/3)
• Build an L1-D cache array
• Determine groups
– Email member list to Connor and CC me (by Feb 20)

• Must use SRAM<> module for storage
• Grading reminder:
Warm-up Project | Points
1 Port, Direct-Mapped, 16K | 10
2 Port, Direct-Mapped, 16K | 12
1 Port, Set-associative, 16K | 14
2 Port, DM, 64K, Pipelined | 16
2 Port, SA, 64K, Pipelined | 18
2 Port, SA, 64K, Pipel., Way-pred. | 20

Page 37: CSE 502: Computer Architecture

Warm-Up Project (2/3)
L1-D module ports:
clk<bool>, reset<bool>
in_ena[#]<bool>
in_is_insert[#]<bool>, in_has_data[#]<bool>, in_needs_data[#]<bool>
in_addr[#]<uint<64>>, in_data[#]<bv<8*blocksz>>
in_update_state[#]<bool>, in_new_state[#]<uint<8>>
out_ready[#]<bool>, out_token[#]<uint<64>>, out_addr[#]<uint<64>>
out_state[#]<uint<8>>, out_data[#]<bv<8*blocksz>>
port_available[#]<bool>

Page 38: CSE 502: Computer Architecture

Warm-Up Project (3/3)
SRAM<width, depthLog2, ports> module ports:
clk<bool>
ena[#]<bool>, addr[#]<bv<depthLog2>>, we[#]<bool>
din[#]<bv<width>>, dout[#]<bv<width>>

Page 39: CSE 502: Computer Architecture

Multiple Accesses per Cycle

• Need high-bandwidth access to caches
– Core can make multiple access requests per cycle
– Multiple cores can access LLC at the same time

• Must either delay some requests, or…
– Design SRAM with multiple ports
• Big and power-hungry
– Split SRAM into multiple banks
• Can result in delays, but usually not

Page 40: CSE 502: Computer Architecture

Multi-Ported SRAMs

[Diagram: each cell has Wordline1 with bitline pair b1/b̄1 and Wordline2 with bitline pair b2/b̄2]

Wordlines = 1 per port
Bitlines = 2 per port
Area = O(ports²)

Page 41: CSE 502: Computer Architecture

Multi-Porting vs Banking

[Diagram left: one SRAM array with 4 decoders, 4 sets of sense amps, and column muxing. Diagram right: 4 separate SRAM arrays, each with its own decoder.]

• 4 ports: big (and slow), but guarantees concurrent access
• 4 banks, 1 port each: each bank small (and fast), but conflicts (delays) possible

How to decide which bank to go to?

Page 42: CSE 502: Computer Architecture

Bank Conflicts
• Banks are address interleaved
– For block size b cache with N banks…
– Bank = (Address / b) % N

• More complicated than it looks (not just low-order bits of the index)
– Modern processors perform hashed cache indexing
• May randomize bank and index
• XOR some low-order tag bits with bank/index bits (why XOR?)

• Banking can provide high bandwidth
• But only if all accesses are to different banks
– For 4 banks, 2 accesses, chance of conflict is 25%

Page 43: CSE 502: Computer Architecture

On-chip Interconnects (1/5)
• Today, (Core+L1+L2) = “core”
– (L3+I/O+Memory) = “uncore”

• How to interconnect multiple “core”s to “uncore”?

• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus

[Diagram: four cores with private caches sharing a bus to the LLC and Memory Controller]

Page 44: CSE 502: Computer Architecture

On-chip Interconnects (2/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus

[Diagram: crossbar connecting four cores to four LLC banks and the Memory Controller]
Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core)

Page 45: CSE 502: Computer Architecture

On-chip Interconnects (3/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus

[Diagram: ring connecting four cores, four LLC banks, and the Memory Controller]
Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core)

• 3 ports per switch
• Simple and cheap
• Can be bi-directional to reduce latency

Page 46: CSE 502: Computer Architecture

On-chip Interconnects (4/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus

[Diagram: mesh of tiles, each combining a core, its cache, and an LLC bank, plus the Memory Controller]
Tilera Tile64 (866MHz, 64 cores)

• Up to 5 ports per switch
• Tiled organization combines core and cache

Page 47: CSE 502: Computer Architecture

On-chip Interconnects (5/5)
• Possible topologies
– Bus
– Crossbar
– Ring
– Mesh
– Torus

• 5 ports per switch
• Can be “folded” to avoid long links

[Diagram: torus of tiles, each combining a core, its cache, and an LLC bank, plus the Memory Controller]

Page 48: CSE 502: Computer Architecture

Write Policies
• Writes are more interesting
– On reads, tag and data can be accessed in parallel
– On writes, needs two steps
– Is access time important for writes?

• Choices of Write Policies
– On write hits, update memory?
• Yes: write-through (higher bandwidth)
• No: write-back (uses Dirty bits to identify blocks to write back)
– On write misses, allocate a cache block frame?
• Yes: write-allocate
• No: no-write-allocate

Page 49: CSE 502: Computer Architecture

Inclusion
• Core often accesses blocks not present on chip
– Should block be allocated in L3, L2, and L1?
• Called Inclusive caches
• Waste of space
• Requires forced evict (e.g., force evict from L1 on evict from L2+)
– Only allocate blocks in L1
• Called Non-inclusive caches (why not “exclusive”?)
• Must write back clean lines

• Some processors combine both
– L3 is inclusive of L1 and L2
– L2 is non-inclusive of L1 (like a large victim cache)

Page 50: CSE 502: Computer Architecture

Parity & ECC
• Cosmic radiation can strike at any time
– Especially at high altitude
– Or during solar flares

• What can be done?
– Parity
• 1 bit to indicate if sum is odd/even (detects single-bit errors)
– Error Correcting Codes (ECC)
• 8-bit code per 64-bit word
• Generally SECDED (Single-Error-Correct, Double-Error-Detect)

• Detecting errors on clean cache lines is harmless
– Pretend it’s a cache miss

Page 51: CSE 502: Computer Architecture

Homework #1 (of 1) part 3
• Extend your program to…
– Measure and report the block size of each cache
– Measure the number of sets in each cache
– Measure the associativity of each cache
– Compute the total capacity of each cache

• Output should resemble this:
123..890 Loop: 2500000000ns
Cache Latencies: 2ns 6ns 20ns 150ns
Caches: 64/256x4(64KB) 64/512x8(256KB) 128/8192x24(12MB)