
Computer Architecture 2015 – Caches1

Computer Architecture

Cache Memory

By Yoav Etsion and Dan Tsafrir
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz

Computer Architecture 2015 – Caches2

In the old days… EDVAC (Electronic Discrete Variable Automatic Computer)

The successor of ENIAC (the first general-purpose electronic computer)

Designed & built in 1944-1949 by Eckert & Mauchly (who also invented ENIAC), with John von Neumann

Unlike ENIAC, binary rather than decimal, and a “stored program” machine

Operational until 1961

Computer Architecture 2015 – Caches3

In the olden days… In 1945, Von Neumann wrote:

“…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory.”

Von Neumann & EDVAC

Computer Architecture 2015 – Caches4

In the olden days… Later, in 1946, he wrote:

“…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available……We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible”

Von Neumann & EDVAC

Computer Architecture 2015 – Caches5

Not so long ago… In 1994, in their paper

“Hitting the Memory Wall: Implications of the Obvious”,

William Wulf and Sally McKee said:

“We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs.

The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”

Computer Architecture 2015 – Caches6

Not so long ago…

[Figure: processor vs. DRAM performance, 1980–2000, log scale. DRAM improved ≈9% per year (2× in 10 years); CPUs improved ≈60% per year (2× in 1.5 years); the gap grew ≈50% per year.]

Computer Architecture 2015 – Caches7

More recently (2008)…

[Figure: “The memory wall in the multicore era” – performance in seconds (lower = slower) as a function of the number of processor cores, for a conventional architecture.]

Computer Architecture 2015 – Caches8

Memory Trade-Offs

Large (dense) memories are slow
Fast memories are small, expensive, and consume high power
Goal: give the processor the feeling that it has a memory which is large (dense), fast, low-power, and cheap
Solution: a hierarchy of memories

CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM)
Speed: Fastest → Slowest
Size: Smallest → Biggest
Cost: Highest → Lowest
Power: Highest → Lowest

Computer Architecture 2015 – Caches9

Typical levels in the memory hierarchy

Response time   Size                  Memory level
≈ 0.5 ns        ≈ 100 bytes           CPU registers
≈ 1 ns          ≈ 64 KB               L1 cache
≈ 30 ns         ≈ 8–32 MB             Last-level cache (LLC)
≈ 300 ns        ≈ 4 GB – 100s of GB   Main memory (DRAM)
W? R?           ≈ 128 GB              SSD
≈ 5 ms          ≈ 1–4 TB              Hard disk (SATA)

Computer Architecture 2015 – Caches10

Why Hierarchy Works: Locality

Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon
Example: code and variables in loops
Keep recently accessed data closer to the processor

Spatial Locality (Locality in Space): if an item is referenced, nearby items tend to be referenced soon
Example: scanning an array
Move contiguous blocks closer to the processor

Due to locality, a memory hierarchy is a good idea: we're going to use what we've just recently used, and we're going to use its immediate neighborhood (a short C example follows)
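A minimal C sketch (an illustration, not from the slides) of both kinds of locality: summing an array reuses the loop index and accumulator every iteration (temporal locality) and touches consecutive elements that share cache lines (spatial locality).

#include <stdio.h>

#define N 1024

int main(void)
{
    static int a[N];            /* contiguous in memory */
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++)
        sum += a[i];            /* sequential accesses: spatial locality;
                                   sum reused each iteration: temporal locality */
    printf("%ld\n", sum);
    return 0;
}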

Computer Architecture 2015 – Caches11

Programs with locality cache well ...

[Figure: memory address (one dot per access) vs. time; horizontal bands of dots show temporal locality, adjacent/diagonal bands show spatial locality.]

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Computer Architecture 2015 – Caches12

Memory Hierarchy: Terminology

For each memory level, define the following:
Hit: the data appears in that memory level
Hit Rate: the fraction of accesses found in that level
Hit Latency: the time to access that level (also includes the time to determine hit/miss)
Miss: the data needs to be retrieved from the next level
Miss Rate: 1 − Hit Rate
Miss Penalty: the time to bring in the missing info (replace a block) + the time to deliver the info to the accessor

Average memory access time:
t_effective = (Hit Latency × Hit Rate) + (Miss Penalty × Miss Rate)
            = (Hit Latency × Hit Rate) + (Miss Penalty × (1 − Hit Rate))

If the hit rate is close to 1, t_effective is close to the Hit Latency, which is generally what we want

Computer Architecture 2015 – Caches13

Effective Memory Access Time

The cache holds a subset of the memory – hopefully the subset that is being used now
That subset is known as the “working set”

Effective memory access time:
t_effective = (t_cache × Hit Rate) + (t_mem × (1 − Hit Rate))
t_mem includes the time it takes to detect a cache miss

Example: assume t_cache = 10 ns, t_mem = 100 ns

Hit Rate (%)    t_eff (ns)
0               100
50              55
90              19
99              10.9
99.9            10.1

As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
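A small C sketch (illustration only) that evaluates t_effective = t_cache × h + t_mem × (1 − h) for the example values above and reproduces the table.

#include <stdio.h>

int main(void)
{
    const double t_cache = 10.0, t_mem = 100.0;   /* ns, the slide's example values */
    const double hit_rate[] = { 0.0, 0.5, 0.9, 0.99, 0.999 };

    for (int i = 0; i < 5; i++) {
        double h = hit_rate[i];
        double t_eff = t_cache * h + t_mem * (1.0 - h);
        printf("hit rate %.1f%% -> t_eff = %.1f ns\n", h * 100.0, t_eff);
    }
    return 0;
}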

Computer Architecture 2015 – Caches14

Cache – main idea

The cache holds a small part of the entire memory, so we need to map parts of the memory into the cache

Main memory is (logically) partitioned into “blocks” or “lines” (or, when the info is cached, “cache lines”)
Typical block size is 32 or 64 bytes
Blocks are “aligned” in memory

The cache is partitioned into cache lines
Each cache line holds a block
Only a subset of the blocks is mapped to the cache at a given time
The cache views an address as: Block# | offset

Why use lines/blocks rather than words?

[Figure: memory blocks (e.g., blocks 42, 90, 92) mapped into cache lines; each cache line stores the block's data plus a tag identifying the block.]

Computer Architecture 2015 – Caches15

Cache Lookup

Cache hit
The block is stored in the cache – return data according to the block's offset

Cache miss
The block is not stored in the cache – do a cache line fill:
Fetch the block from the next level (may require a few cycles)
May need to evict another block from the cache, to make room for the new block

[Figure: the same memory/cache mapping diagram as on the previous slide.]

Computer Architecture 2015 – Caches16

Checking valid bit & tag

Initially the cache is empty
Need a “line valid” indication – a valid bit per line
A line may also be invalidated

[Figure: the address is split into Tag (bits 31–5) and Offset (bits 4–0); a hit is signaled only when the stored tag matches the address tag AND the line's valid bit is set, and the data array then supplies the line.]

(A small C sketch of this hit test follows.)
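A minimal C sketch (an illustration, not the slides' hardware design) of the hit test drawn above: a line hits only when its valid bit is set and its stored tag equals the address tag. The struct layout and line size are assumptions.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 32               /* assumed block size */

struct cache_line {
    bool     valid;                /* line valid bit     */
    uint32_t tag;                  /* stored tag bits    */
    uint8_t  data[LINE_SIZE];
};

/* The comparator output (stored tag == address tag) ANDed with the valid bit. */
bool line_hits(const struct cache_line *line, uint32_t addr_tag)
{
    return line->valid && line->tag == addr_tag;
}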

Computer Architecture 2015 – Caches17

Cache organization

Basic questions:
Associativity: where can we place a memory block in the cache?
Eviction policy: which cache line should be evicted on a miss?

Associativity:
Ideally, every memory block can go to any cache line – called a fully-associative cache; most flexible, but most expensive
Compromise: simpler designs, in which a block can only reside in a subset of the cache lines:
Direct-mapped cache
2-way set associative cache
N-way set associative cache

Computer Architecture 2015 – Caches18

Fully Associative Cache

An address is partitioned into: block number | offset within block
Each block may be mapped to any of the cache lines; lookup searches all lines
Each cache line has a tag; all tags are compared to the block# in parallel – needs a comparator per line
If one of the tags matches the block#, we have a hit; supply data according to the offset
Best hit rate, but most wasteful hardware – must be relatively small

[Figure: address fields Tag (= block#) | Offset; the address tag is compared against every entry of the tag array in parallel to produce hit and select the data line.]

Computer Architecture 2015 – Caches19

Direct Map Cache

Each memory block can only be mapped to a single cache line

Offset: the byte within the cache line

Set: the index into the “data array” and the “tag array”; for a given set (an index), only one of the blocks that map to this set can reside in the cache at a time

Tag: the remaining block bits are used as the tag; the tag uniquely identifies the memory block, so we must compare the tag stored in the tag array to the tag bits of the address

[Figure: address fields Tag (bits 31–14) | Set (bits 13–5) | Offset (bits 4–0); the Set field indexes the tag array and the data array (2^9 = 512 sets in this drawing), and the stored tag is compared with the address tag.]

Computer Architecture 2015 – Caches20

Direct Map Cache (cont)

Partition memory into slices; slice size = cache size
Partition each slice into blocks; block size = cache line size
The distance of a block from the start of its slice determines its position in the cache (its set)

Advantages:
Easy & fast hit/miss resolution
Easy & fast replacement algorithm
Lowest power

Disadvantage:
A line has only “one chance”; lines are replaced due to “conflict misses”
The organization with the highest miss rate

[Figure: memory divided into cache-size slices; the blocks at the same distance from the start of each slice all map to the same set X.]

Computer Architecture 2015 – Caches21

Direct Map Cache – Example

Line size: 32 bytes → 5 offset bits
Cache size: 16KB = 2^14 bytes
#lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512
#sets = #lines = 512 → #set bits = 9 (bits 5…13)
#tag bits = 32 − (#set bits + #offset bits) = 32 − (9+5) = 18 (bits 14…31)

Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
offset = 0x18
set = 0x0B3
tag = 0x048D1

[Figure: address fields Tag | Set | Offset (bits 31–14, 13–5, 4–0); the set indexes the tag array, and the stored tag is compared with the address tag to produce hit/miss.]

(A short C sketch of this bit-field arithmetic follows.)
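A small C sketch (illustration only) of the bit-field arithmetic in this example: it decomposes address 0x12345678 for a 16KB direct-mapped cache with 32-byte lines and prints the offset, set, and tag computed above.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5                      /* 32-byte lines          */
#define SET_BITS    9                      /* 16KB / 32B = 512 sets  */

int main(void)
{
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    /* Prints: offset=0x18 set=0xb3 tag=0x48d1 */
    printf("offset=0x%x set=0x%x tag=0x%x\n", offset, set, tag);
    return 0;
}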

Computer Architecture 2015 – Caches22

Direct map (tiny example)

Assume:
Memory size is 2^5 = 32 bytes, so we need a 5-bit address
A block is comprised of 4 bytes, thus there are exactly 8 blocks

Note:
Only 3 bits are needed to identify a block
The offset is used exclusively within a cache line; it is not used to locate the cache line

[Figure: the memory drawn as an 8×4 array of bytes – block index 000…111 (rows) by offset 00…11 (columns); example addresses 00001, 01110, and 11111 are marked.]

Computer Architecture 2015 – Caches23

Direct map (tiny example)

Further assume:
The size of our cache is 2 cache lines (=> need 2 = 5−2−1 tag bits)
The address divides like so: b4 b3 | b2 | b1 b0 = tag | set | offset

[Figure: the 8×4 memory array alongside a 2-line cache (a tag array of 2-bit tags b4b3 and a data array of 2×4 bytes); blocks with an even block index map to cache line 0, blocks with an odd block index map to cache line 1.]

Computer Architecture 2015 – Caches24

Direct map (tiny example)

Accessing address 0 0 0 1 0 (the byte marked “C”)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (00) | set (0) | offset (10)

[Figure: block 000, holding bytes A B C D, is brought into cache line 0; the tag array entry for set 0 becomes 00.]

Computer Architecture 2015 – Caches25

Direct map (tiny example)

Accessing address 0 1 0 1 0 (= Y)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (01) | set (0) | offset (10)

[Figure: block 010, holding bytes W X Y Z, replaces the previous contents of cache line 0; the tag array entry for set 0 becomes 01.]

Computer Architecture 2015 – Caches26

Direct map (tiny example)

Accessing address 1 0 0 1 0 (= Q)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (10) | set (0) | offset (10)

[Figure: block 100, holding bytes T R Q P, replaces the previous contents of cache line 0; the tag array entry for set 0 becomes 10.]

Computer Architecture 2015 – Caches27

Direct map (tiny example)

Accessing address 1 1 0 1 0 (= J)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (11) | set (0) | offset (10)

[Figure: block 110, holding bytes L K J I, replaces the previous contents of cache line 0; the tag array entry for set 0 becomes 11.]

Computer Architecture 2015 – Caches28

Direct map (tiny example)

Accessing address 0 0 1 1 0 (= B)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (00) | set (1) | offset (10)

[Figure: block 001, holding bytes D C B A, is brought into cache line 1; the tag array entry for set 1 becomes 00.]

Computer Architecture 2015 – Caches29

Direct map (tiny example)

Accessing address 0 1 1 1 0 (= Y)
The address divides like so: b4 b3 | b2 | b1 b0 = tag (01) | set (1) | offset (10)

[Figure: block 011, holding bytes W Z Y X, replaces the previous contents of cache line 1; the tag array entry for set 1 becomes 01.]

Computer Architecture 2015 – Caches30

Direct map (tiny example)

Now assume:
The size of our cache is 4 cache lines
The address divides like so: b4 | b3 b2 | b1 b0 = tag | set | offset

[Figure: block 100, holding bytes D C B A, maps to cache line 00; its tag array entry is b4 = 1.]

Computer Architecture 2015 – Caches31

Direct map (tiny example)

Now assume:
The size of our cache is 4 cache lines
The address divides like so: b4 | b3 b2 | b1 b0 = tag | set | offset

[Figure: block 000, holding bytes W Z Y X, maps to cache line 00; its tag array entry is b4 = 0.]

Computer Architecture 2015 – Caches32

2-Way Set Associative Cache

Each set holds two lines (way 0 and way 1)
Each block can be mapped into one of two lines in the appropriate set (the HW checks both ways in parallel)
The cache is effectively partitioned into two halves

Example:
Line size: 32 bytes
Cache size: 16KB
#lines: 512
#sets: 256
Offset bits: 5
Set bits: 8
Tag bits: 19

Address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
Offset: 1 1000 = 0x18 = 24
Set: 1011 0011 = 0xB3 = 179
Tag: 000 1001 0001 1010 0010 = 0x091A2

[Figure: address fields Tag (bits 31–13) | Set (bits 12–5) | Offset (bits 4–0); each of the two ways has its own tag array and cache storage, both indexed by the set.]

Computer Architecture 2015 – Caches33

2-Way Cache – Hit Decision

[Figure: address fields Tag (bits 31–13) | Set (bits 12–5) | Offset (bits 4–0); the set indexes the tag and data arrays of both ways, the two stored tags are compared with the address tag in parallel, and a MUX selects the data from the way that hit.]
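A minimal C sketch (an illustration under the 16KB, 32-byte-line, 2-way geometry of the example above) of the hit decision: the set indexes both ways, both stored tags are compared with the address tag, and the way that matches supplies the data (the MUX in the figure).

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 5
#define SET_BITS    8                          /* 256 sets, 2 ways  */
#define NUM_SETS    (1u << SET_BITS)
#define LINE_SIZE   (1u << OFFSET_BITS)

struct way {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

static struct way cache[NUM_SETS][2];          /* [set][way] */

/* Returns true on a hit and copies the requested byte to *out. */
bool lookup_2way(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr & (LINE_SIZE - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    for (int w = 0; w < 2; w++) {              /* the HW checks both ways in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *out = cache[set][w].data[offset]; /* select the data of the hitting way  */
            return true;
        }
    }
    return false;                              /* miss */
}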

Computer Architecture 2015 – Caches35

N-way set associative cache  Similar to the 2-way case; at the extreme, where every cache line is its own way, the cache is fully associative…

Computer Architecture 2015 – Caches36

Cache organization summary

Increasing set associativity:
Improves the hit rate
Increases power consumption
Increases access time

Strike a balance

Computer Architecture 2015 – Caches37

Cache Read Miss

On a read miss – perform a cache line fill:
Fetch the entire block that contains the missing data from memory
The block is fetched into the cache line fill buffer; the fetch may take a few bus cycles to complete
e.g., with a 64-bit (8-byte) data bus and a 32-byte cache line, the fill takes 4 bus cycles
Can stream (forward) the critical chunk into the core before the line fill ends

Once the entire block is fetched into the fill buffer, it is moved into the cache

Computer Architecture 2015 – Caches39

Cache Replacement Policy

Direct-mapped cache – easy:
A new block is mapped to a single line in the cache; the old line is evicted (re-written to memory if needed)

N-way set associative cache – harder:
Choose a victim from all the ways in the appropriate set – but which? Use a replacement algorithm

Example replacement policies:
Optimum (theoretical, postmortem, called “Belady”)
FIFO (First In First Out)
Random
LRU (Least Recently Used) – a decent approximation of Belady

Computer Architecture 2015 – Caches40

LRU Implementation

2 ways:
1 bit per set to mark the latest way accessed in the set
Evict the way not pointed to by the bit

k-way set associative LRU:
Requires a full ordering of way accesses
Algorithm: when way i is accessed
    x = counter[i]
    counter[i] = k-1
    for (j = 0 to k-1)
        if ((j != i) && (counter[j] > x)) counter[j]--;
When replacement is needed, evict the way with counter = 0
Expensive even for small k's, because it is invoked on every load/store
Needs a log2(k)-bit counter per line

Example (k = 4):
Initial state:   Way:  0 1 2 3   Count:  0 1 2 3
Access way 2:    Way:  0 1 2 3   Count:  0 1 3 2
Access way 0:    Way:  0 1 2 3   Count:  3 0 2 1

(A runnable C sketch of this counter scheme follows.)
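A runnable C sketch (illustration only) of the counter-based LRU above, with k assumed to be 4; it reproduces the example sequence on this slide.

#include <stdio.h>

#define K 4                               /* assumed associativity */

static int counter[K] = { 0, 1, 2, 3 };   /* initial state from the slide */

/* Update the recency counters when way i is accessed (k-1 = most recent). */
void touch(int i)
{
    int x = counter[i];
    counter[i] = K - 1;
    for (int j = 0; j < K; j++)
        if (j != i && counter[j] > x)
            counter[j]--;
}

/* The victim is the way whose counter is 0 (least recently used). */
int victim(void)
{
    for (int j = 0; j < K; j++)
        if (counter[j] == 0)
            return j;
    return 0;                             /* unreachable: counters stay a permutation */
}

int main(void)
{
    touch(2);                             /* counts become 0 1 3 2 */
    touch(0);                             /* counts become 3 0 2 1 */
    printf("victim = way %d\n", victim());   /* prints: victim = way 1 */
    return 0;
}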

Computer Architecture 2015 – Caches41

Pseudo LRU (PLRU)

In practice, it's sufficient to efficiently approximate LRU
Maintain k−1 bits, instead of k ∙ log2(k) bits

Assume k=4, and let's enumerate the ways' cache lines: we need 2 bits – cache line 00, cl-01, cl-10, and cl-11
Use a binary search tree to represent the 4 cache lines; set each of the 3 (= k−1) internal nodes to hold a bit variable: B0, B1, and B2
Whenever accessing a cache line b1b0, set each bit variable Bj on the path to that leaf to the corresponding cache-line bit bk
Can think of the bit value as: Bj = “the right side was referenced more recently”
Need to evict? Walk the tree as follows: go left if Bj = 1, go right if Bj = 0; evict the leaf you've reached (= the opposite direction relative to previous insertions)

[Figure: a binary tree with root B0, internal nodes B1 and B2, and leaves 00, 01, 10, 11 (the cache lines).]

(A C sketch for the k=4 case follows.)
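A small C sketch (illustration only) of tree-PLRU for k = 4: three bits B0, B1, B2 are updated on every access and walked in the opposite direction to pick a victim. It reproduces the example on the next slide.

#include <stdio.h>

/* Tree-PLRU for 4 ways: B0 at the root, B1 covers lines 00/01, B2 covers
 * lines 10/11. Bj = 1 means "the right side of node j was referenced more
 * recently". */
static int B0, B1, B2;

/* Record an access to cache line 'way' (0..3, i.e., bits b1b0). */
void plru_touch(int way)
{
    int b1 = (way >> 1) & 1, b0 = way & 1;
    B0 = b1;                       /* which half was touched           */
    if (b1 == 0) B1 = b0;          /* which line within the left half  */
    else         B2 = b0;          /* which line within the right half */
}

/* Choose a victim: go left if the bit is 1, right if it is 0
 * (i.e., toward the less recently referenced side). */
int plru_victim(void)
{
    int b1 = B0 ? 0 : 1;
    int b0 = (b1 == 0) ? (B1 ? 0 : 1) : (B2 ? 0 : 1);
    return (b1 << 1) | b0;
}

int main(void)
{
    /* The example from the next slide: access 3, 0, 2, 1. */
    plru_touch(3); plru_touch(0); plru_touch(2); plru_touch(1);
    printf("victim = %d\n", plru_victim());   /* prints: victim = 3 */
    return 0;
}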

Computer Architecture 2015 – Caches42

Pseudo LRU (PLRU) – Example

Access 3 (11), 0 (00), 2 (10), 1 (01) => the next victim is 3 (11), as expected

[Figure: the PLRU tree after each access; the B0, B1, B2 bits along the accessed path are updated, and the eviction walk (left on 1, right on 0) ends at leaf 11.]

Computer Architecture 2015 – Caches43

LRU vs. Random vs. FIFO

LRU: hardest
FIFO: easier; approximates LRU (evicts the oldest line rather than the least recently used)
Random: easiest

Results: misses per 1000 instructions in the L1 data cache, averaged across ten SPECint2000 / SPECfp2000 benchmarks (PLRU turns out rather similar to LRU)

          2-way                  4-way                  8-way
Size   LRU   Rand  FIFO       LRU   Rand  FIFO       LRU   Rand  FIFO
16K    114.1 117.3 115.5      111.7 115.1 113.1      109.0 111.8 110.4
64K    103.4 104.3 103.9      102.4 102.3 103.1       99.7 100.5 100.3
256K    92.2  92.1  92.5       92.1  92.1  92.5       92.1  92.1  92.5

Computer Architecture 2015 – Caches44

Effect of Cache on Performance

MPKI (misses per kilo-instruction): the average number of misses for every 1000 instructions
MPKI = memory accesses per kilo-instruction × miss rate

Memory stall cycles
= #memory accesses × miss rate × miss penalty cycles
= IC/1000 × MPKI × miss penalty cycles

CPU time
= (CPU execution cycles + memory stall cycles) × cycle time
= IC/1000 × (1000 × CPI_execution + MPKI × miss penalty cycles) × cycle time

(A worked numeric sketch follows.)
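A small C sketch that plugs assumed example numbers (not from the slides) into these formulas.

#include <stdio.h>

int main(void)
{
    /* Assumed example values (not from the slides). */
    double IC          = 1e9;     /* instructions executed           */
    double CPI_exec    = 1.0;     /* CPI ignoring memory stalls      */
    double accesses_pk = 400.0;   /* memory accesses per 1000 instr. */
    double miss_rate   = 0.02;    /* 2% miss rate                    */
    double penalty     = 100.0;   /* miss penalty in cycles          */
    double cycle_time  = 0.5e-9;  /* 0.5 ns clock cycle              */

    double mpki      = accesses_pk * miss_rate;                   /* = 8 */
    double stall_cyc = IC / 1000.0 * mpki * penalty;
    double cpu_time  = IC / 1000.0 *
                       (1000.0 * CPI_exec + mpki * penalty) * cycle_time;

    printf("MPKI = %.1f, stall cycles = %.0f, CPU time = %.3f s\n",
           mpki, stall_cyc, cpu_time);
    return 0;
}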

Computer Architecture 2015 – Caches45

Memory Update Policy on Writes

Write back: lazy writes to the next cache level; prefer the cache

Write through: immediately update the next cache level

Computer Architecture 2015 – Caches46

Write Back: Cheaper writes

Store operations that hit the cache:
Write only to the cache; the next cache level (or memory) is not accessed
The line is marked as “modified” or “dirty”
When evicted, the line is written to the next level only if it is dirty

Pros:
Saves memory accesses when a line is updated more than once
Attractive for multicore/multiprocessor

Cons:
On eviction, the entire line must be written to memory (there's no indication which bytes within the line were modified)
A read miss might require writing to memory (if the evicted line is dirty)

(A small struct sketch of the dirty-bit bookkeeping follows.)
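A minimal C sketch (illustration only; memory_write_block is a hypothetical stand-in for the next-level interface) of the dirty-bit bookkeeping described above: a store that hits only updates the cached copy and sets the dirty bit, and eviction writes the whole line back only if the line is dirty.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 32

struct line {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};

/* Hypothetical next-level interface, stubbed for the sketch. */
static void memory_write_block(uint32_t tag, const uint8_t *data)
{
    (void)tag; (void)data;        /* a real system would issue the bus write here */
}

/* Store that hits: update the cached copy only, mark the line dirty. */
void store_hit(struct line *l, uint32_t offset, uint8_t value)
{
    l->data[offset] = value;
    l->dirty = true;              /* memory is now stale for this line */
}

/* Eviction: the whole line goes to the next level, but only if dirty. */
void evict(struct line *l)
{
    if (l->valid && l->dirty)
        memory_write_block(l->tag, l->data);
    l->valid = false;
    l->dirty = false;
}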

Computer Architecture 2015 – Caches47

Write Through: Cheaper evictions

Stores that hit the cache:
Write to the cache, and write to the next cache level (or memory)
Need to write only the bytes that were changed – not the entire line – so less work per write

When evicting:
No need to write to the next cache level – lines are never dirty
Still need to throw stuff out, though

Use write buffers to mask waiting for the lower-level memory

Computer Architecture 2015 – Caches48

Write through: need a write buffer

A write buffer sits between the cache & memory:
The processor core writes data into the cache & the write buffer
The write buffer allows the processor to avoid stalling on writes

Works OK if the store frequency (in cycles) << the DRAM write cycle; otherwise the write buffer overflows no matter how big it is

Write combining: combine adjacent writes to the same location in the write buffer

Note: on a cache miss, need to look up the write buffer (or drain it)

[Figure: Processor → Cache and Write Buffer → DRAM.]

Computer Architecture 2015 – Caches49

Cache Write Miss

The processor is not waiting for the written data, so it continues to work

Option 1: Write allocate – fetch the line into the cache
Goes with the write-back policy, because with write back, write ops are quicker if the line is in the cache
Assumes more writes/reads to the cache line will be performed soon; hopes that subsequent accesses to the line hit the cache

Option 2: Write no-allocate – do not fetch the line into the cache
Goes with the write-through policy
Subsequent writes would update memory anyhow
(If reads occur, the first read will bring the line into the cache)

Computer Architecture 2015 – Caches50

WT vs. WB – Summary

                                   Write-Through                         Write-Back
Policy                             Data written to the cache block       Write data only to the cache;
                                   (if present) is also written to       update the lower level when a block
                                   lower-level memory                    falls out of the cache
Complexity                         Less                                  More
Can read misses produce writes?    No                                    Yes
Do repeated writes make it
to the lower level?                Yes                                   No
Upon write miss                    Write no-allocate                     Write allocate

Computer Architecture 2015 – Caches51

Write Buffers for WT – Summary

[Figure: Processor → Cache and Write Buffer → lower-level memory; the write buffer holds data awaiting write-through to the lower level.]

Q. Why a write buffer?
A. So the CPU doesn't stall.

Q. Why a buffer, why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the buffer on reads.

Computer Architecture 2015 – Caches52

Write-back vs. Write-through

Commercial processors favor write-back:
Write bursts to the same line are common
Simplifies management of multi-cores:
Data in two consecutive cache levels is inconsistent while a write is in flight
With write-through, this happens on every write

Computer Architecture 2015 – Caches53

Optimizing the Hierarchy

Computer Architecture 2015 – Caches54

Cache Line Size

A larger line size takes advantage of spatial locality
But too-big blocks may fetch unused data, while possibly evicting useful data → the miss rate goes up

A larger line size also means a larger miss penalty:
Longer time to fill the line (critical-chunk-first reduces the problem)
Longer time to evict

avgAccessTime = missPenalty × missRate + hitTime × (1 – missRate)

Computer Architecture 2015 – Caches55

Classifying Misses: 3 Cs

Compulsory:
First access to a block which is not in the cache; the block must be brought into the cache
Cache size does not matter
Solution: prefetching

Capacity:
The cache cannot contain all the blocks needed during program execution; blocks are evicted and later retrieved
Solution: increase cache size, stream buffers

Conflict:
Occurs in set-associative or direct-mapped caches when too many blocks map to the same set
Solution: increase associativity, victim cache

Computer Architecture 2015 – Caches56

3Cs in SPEC92

[Figure: miss rate per type (fraction, 0–0.14) vs. cache size (1KB–128KB) for 1-way, 2-way, 4-way, and 8-way associativity; the stacked areas show compulsory, capacity, and conflict misses.]

Computer Architecture 2015 – Caches57

Multi-ported Cache

An n-ported cache enables n accesses in parallel:
Parallelize cache accesses from different pipeline stages
Parallelize cache accesses in a super-scalar processor

But for n=2 it more than doubles the cache area, and the wire complexity also degrades access time

Can help: a “banked cache”
Addresses are divided among n banks
Can fetch data from k ≤ n different banks, in possibly different lines

Computer Architecture 2015 – Caches58

Separate Code / Data Caches

Parallelizes data access and instruction fetch

The code cache is a read-only cache:
No need to write the line back to memory when it is evicted
Simpler to manage

What about self-modifying code?
The I-cache “snoops” (= monitors) all write ops
Requires a dedicated snoop port: read the tag array + match the tag (otherwise snoops would stall fetch)
If the code cache contains the written address:
Invalidate the corresponding cache line
Flush the pipeline – it may contain stale code

Computer Architecture 2015 – Caches59

Last-level cache (LLC)

Either L2 or L3

The LLC is bigger, but has higher latency:
Reduces the L1 miss penalty – saves an access to memory
On modern processors, the LLC is located on-chip

Since the LLC contains L1, it needs to be significantly larger:
Data is replicated across the cache levels (fetching from the LLC to L1 replicates data)
E.g., if the LLC is only 2× L1, half of the LLC is duplicated in L1

The LLC is typically unified (code / data)

Computer Architecture 2015 – Caches60

Core 2 Duo Die Photo

[Die photo: the shared L2 cache occupies a large fraction of the die.]

(Core 2 Duo L2 size is up to 6MB; it is shared by the cores.)

Computer Architecture 2015 – Caches61

Ivy Bridge (L3, “last level” cache)

(32KB data + 32KB instruction L1 cache per core; 256KB L2 cache per core; and up to tens of MB of L3 cache shared by all cores)

Computer Architecture 2015 – Caches62

AMD Phenom II Six Core

Computer Architecture 2015 – Caches63

LLC: Inclusiveness

Data replication across cache levels presents a tradeoff: inclusive vs. non-inclusive caches

Inclusive: the LLC contains all data in the higher cache levels
Evicting a line from the LLC also evicts it from the higher levels
Pro: makes it easy to manage the cache hierarchy – the LLC serves as a coordination point
Con: wasted cache space

Non-inclusive: L1 may contain data not present in the LLC
Pro: better use of cache resources
Con: how do we know what data is in the caches?

This is a critical issue in multicore design: data coherency and consistency across the individual L1 caches

Computer Architecture 2015 – Caches64

LLC: Inclusiveness

Practicality wins – the LLC is typically inclusive:
All addresses in L1 are also contained in the LLC

LLC eviction process:
An address evicted from the LLC → snoop-invalidate it in L1
But the data in L1 might be newer than in L2 (when evicting a dirty line from L1, it is written to L2)
Thus, when evicting a line from L2 which is dirty in L1:
The snoop-invalidate to L1 generates a write from L1 to L2
The line is marked as modified in L2 → the line is written to memory

Computer Architecture 2015 – Caches65

Victim Cache

The load on a cache set may be non-uniform: some sets may have more conflict misses than others
Solution: allocate ways to sets dynamically

A victim buffer adds some associativity to direct-mapped caches:
A line evicted from the L1 cache is placed in the victim cache
If the victim cache is full → evict its LRU line
On an L1 cache lookup, in parallel, also search the victim cache

[Figure: a direct-mapped cache backed by a small fully-associative victim buffer.]

Computer Architecture 2015 – Caches66

Victim Cache

On a victim-cache hit:
The line is moved back into the cache
The evicted line is moved to the victim cache
Same access time as a cache hit

[Figure: a direct-mapped cache backed by a small fully-associative victim buffer.]

Computer Architecture 2015 – Caches67

Stream Buffers

Before inserting a new line into the cache, put the new line in a stream buffer
If the line is expected to be accessed again, move it from the stream buffer into the cache (e.g., if the line hits in the stream buffer)

Example: scanning a very large array (much larger than the cache)
Each item in the array is accessed just once
If the array elements are inserted into the cache, the entire cache will be thrashed
If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array lines into the cache

Computer Architecture 2015 – Caches68

Prefetching

Predict future memory accesses and fetch them from memory ahead of time

Instruction prefetching:
On a cache miss, prefetch sequential lines into stream buffers
Branch-predictor-directed prefetching: let the branch predictor run ahead

Data prefetching – predict future data accesses:
Next sequential (block prefetcher)
Stride
General pattern

Software prefetching:
The compiler injects special prefetching instructions (see the sketch below)
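As a concrete example of software prefetching, a small C sketch using the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 16 elements is an assumed tuning value, not something the slides specify.

/* Software prefetching sketch: ask the hardware to start fetching
 * a[i + DIST] while we are still working on a[i]. */
#define DIST 16                                      /* assumed prefetch distance */

long sum_with_prefetch(const long *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}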

Computer Architecture 2015 – Caches69

Prefetching

Prefetching can greatly improve performance… but it incurs high overheads!

Predictions are not 100% accurate:
Need to predict the correct address and make sure it arrives on time
Too early: the line may be evicted before it is used
Too late: the processor has to stall
Accuracy is closer to 50-60% in practice

Can waste memory bandwidth and power:
In some commodity processors, roughly 50% of the data brought from memory is never used, due to aggressive prefetching

Computer Architecture 2015 – Caches70

Critical Word First: Reduce the Miss Penalty

Don't wait for the full block to be loaded before restarting the CPU

Early restart:
As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution

Critical word first:
Request the missed word first from memory and send it to the CPU as soon as it arrives
Let the CPU continue execution while filling the rest of the words in the line
Also called wrapped fetch or requested word first

Example: Pentium, 8-byte bus, 32-byte cache line → 4 bus cycles to fill a line; fetching data from address 95H
[Figure: the four 8-byte chunks 80H-87H, 88H-8FH, 90H-97H, 98H-9FH are transferred starting with 90H-97H (which contains 95H), then 98H-9FH, then wrapping around to 80H-87H and 88H-8FH.]

Computer Architecture 2015 – Caches71

Non-Blocking Cache

Very important in OoO (out-of-order) processors

Hit under miss:
Allow cache hits while one miss is in progress
Another miss has to wait

Miss under miss, hit under multiple misses:
Allow hits and misses while other misses are in progress
The memory system must allow multiple pending requests
Manage a list of outstanding cache misses; when a miss is served and the data gets back, update the list

Pending operations are managed by MSHRs (Miss-Status Holding Registers)

Computer Architecture 2015 – Caches72

Compiler/Programmer Optimizations: Merging Arrays

Merge 2 arrays into a single array of compound elements

/* BEFORE: two sequential arrays */
int val[SIZE];
int key[SIZE];

/* AFTER: one array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality

Computer Architecture 2015 – Caches73

Compiler optimizations: Loop Fusion

Combine 2 independent loops that have the same looping and some overlapping variables
Assume each element of a is 4 bytes, a 32KB cache, and 32 B per line

for (i = 0; i < 10000; i++)
    a[i] = 1 / a[i];
for (i = 0; i < 10000; i++)
    sum = sum + a[i];

First loop: hits on 7/8 of the iterations
Second loop: the array is larger than the cache → same hit rate as in the 1st loop

Fuse the loops:

for (i = 0; i < 10000; i++) {
    a[i] = 1 / a[i];
    sum = sum + a[i];
}

First line: hits on 7/8 of the iterations
Second line: hits on all iterations

Computer Architecture 2015 – Caches74

Compiler Optimizations: Loop Interchange

Change the loop nesting to access data in the order it is stored in memory
A two-dimensional array in memory: x[0][0] x[0][1] … x[0][99] x[1][0] x[1][1] …

/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

Computer Architecture 2015 – Caches76

Summary: Cache and Performance

Reduce the cache miss rate:
Larger cache
Reduce compulsory misses: larger block size, HW prefetching (instructions, data), SW prefetching (data)
Reduce conflict misses: higher associativity, victim cache
Stream buffers: reduce cache thrashing
Compiler optimizations

Reduce the miss penalty:
Early restart and critical word first on a miss
Non-blocking caches (hit under miss, miss under miss)
2nd/3rd level cache

Reduce the cache hit time:
On-chip caches
Smaller cache (hit time increases with cache size)
Direct-mapped cache (hit time increases with associativity)

Bring frequently accessed data closer to the processor