caches spring 2016 may 4: caches · 2016. 5. 10. · caches spring 2016 cache performance metrics...

95
Spring 2016 Caches May 4: Caches 1

Upload: others

Post on 01-Sep-2020

24 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

May4:Caches

1

Page 2: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Roadmap

2

car *c = malloc(sizeof(car));c->miles = 100;c->gals = 17;float mpg = get_mpg(c);free(c);

Car c = new Car();c.setMiles(100);c.setGals(17);float mpg =

c.getMPG();

get_mpg:pushq %rbpmovq %rsp, %rbp...popq %rbpret

Java:C:

Assemblylanguage:

Machinecode:

01110100000110001000110100000100000000101000100111000010110000011111101000011111

Computersystem:

OS:

Memory&dataIntegers&floatsMachinecode&Cx86assemblyProcedures&stacksArrays&structsMemory&cachesProcessesVirtualmemoryMemoryallocationJavavs.C

Page 3: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

HowdoesexecutiontimegrowwithSIZE?

3

int array[SIZE];

int sum = 0;

for (int i = 0; i < 200000; i++) {

for (int j = 0; j < SIZE; j++) {

sum += array[j];

}

}

SIZE

TIME

Plot

Page 4: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

ActualData

4SIZE

Time

Page 5: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

5

Page 6: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Problem:lotsofwaitingonmemory6cycle:singlemachinestep(fixed-time)

Page 7: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Solution:caches7

Cache

cycle:singlemachinestep(fixed-time)

Page 8: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Cache💰¢ Ahiddenstoragespace

forprovisions,weapons,and/ortreasures

¢ Computermemorywithshortaccesstimeusedforthestorageoffrequentlyorrecentlyusedinstructionsordata(i-cacheandd-cache)§ Moregenerally:

usedtooptimizedatatransfersbetweenanysystemelementswithdifferentcharacteristics(networkinterfacecache,I/Ocache,etc.)

8

Page 9: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

GeneralCacheMechanics

9

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory • Larger,slower,cheapermemory.• Viewedaspartitionedinto“blocks”

or“lines”

Dataiscopiedinblock-sizedtransferunits

• Smaller,faster,moreexpensivememory.

• Cachesasubsetoftheblocks(a.k.a.lines)

Page 10: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

GeneralCacheConcepts:Hit

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:14

14Blockbisincache:Hit!

10

Page 11: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

GeneralCacheConcepts:Miss

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:12

Blockbisnotincache:Miss!

BlockbisfetchedfrommemoryRequest:12

12

12

12

Blockbisstoredincache•Placementpolicy:determineswherebgoes•Replacementpolicy:determineswhichblockgetsevicted(victim)

11

Page 12: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

12

Page 13: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

§ Whyisthisimportant?

block

13

Page 14: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

¢ Spatiallocality?

block

14

Page 15: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

¢ Spatiallocality:§ Itemswithnearbyaddressestend

tobereferencedclosetogetherintime

§ Howdocachestakeadvantageofthis?

block

block

15

Page 16: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:AnyLocality?sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

16

Page 17: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:AnyLocality?

¢ Data:§ Temporal:sum referencedineachiteration§ Spatial:arraya[] accessedinstride-1pattern

¢ Instructions:§ Temporal:cyclethroughlooprepeatedly§ Spatial:referenceinstructionsinsequence

¢ Beingabletoassessthelocalityofcodeisacrucialskillforaprogrammer

17

sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

Page 18: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

LocalityExample#1int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

18

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Page 19: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

LocalityExample#1

19

1:a[0][0]2:a[0][1]3:a[0][2]4:a[0][3]5:a[1][0]6:a[1][1]7:a[1][2]8:a[1][3]9:a[2][0]10:a[2][1]11:a[2][2]12:a[2][3]

stride-1

OrderAccessed

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemory

M=3,N=4

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

76isjustonepossible startingaddressofarraya

int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

Page 20: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

LocalityExample#2int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

20

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Page 21: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

1:a[0][0]2:a[1][0]3:a[2][0]4:a[0][1]

LocalityExample#2

21

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

5:a[1][1]6:a[2][1]7:a[0][2]8:a[1][2]9:a[2][2]10:a[0][3]11:a[1][3]12:a[2][3]

stride-N

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemoryOrderAccessed

M=3,N=4int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

Page 22: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

LocalityExample#3int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

22

Page 23: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

LocalityExample#3

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

2376 92 108

a[0][0][0]

a[0][0][1]

a[0][0][2]

a[0][0][3]

5

a[0][1][1]

a[0][1][2]

a[0][1][3]

0 5

a[0][2][2]

a[0][2][3]

a[0][2][1]

a[0][2][0]

a[0][1][0]

LayoutinMemory(forM=?,N=3,L=4)

a[1][0][0]

a[1][0][1]

a[1][0][2]

a[1][0][3]

5

a[1][1][1]

a[1][1][2]

a[1][1][3]

0 5

a[1][2][2]

a[1][2][3]

a[1][2][1]

a[1][2][0]

a[1][1][0]

a[2][0][0]

a[2][0][1]

a[2][0][2]

a[2][0][3]

5

a[2][1][1]

a[2][1][2]

a[2][1][3]

0 5

a[2][2][2]

a[2][2][3]

a[2][2][1]

a[2][2][0]

a[2][1][0] …

int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

Page 24: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Makememoryfastagain.¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

24

Page 25: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

25

cycle=singlefixed-timemachinestep

Page 26: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

§ Averageaccesstime:§ 97%hits:1cycle+0.03*100cycles=4cycles§ 99%hits:1cycle+0.01*100cycles=2cycles

¢ Thisiswhy“missrate”isusedinsteadof“hitrate”

26

cycle=singlefixed-timemachinestep

checkthecacheeverytime

Page 27: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CachePerformanceMetrics

¢ MissRate§ Fractionofmemoryreferencesnotfoundincache(misses/accesses)

=1- hitrate§ Typicalnumbers(inpercentages):

§ 3%- 10%forL1§ Canbequitesmall(e.g.,<1%)forL2,dependingonsize,etc.

¢ HitTime§ Timetodeliveralineinthecachetotheprocessor

§ Includestimetodeterminewhetherthelineisinthecache§ Typicalhittimes:4clockcyclesforL1;10clockcyclesforL2

¢ MissPenalty§ Additionaltimerequiredbecauseofamiss§ Typically50- 200cyclesformissinginL2&goingtomainmemory

(Trend:increasing!)27

Page 28: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Canwehavemorethanonecache?¢ Whywouldwewanttodothat?

28

Page 29: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

MemoryHierarchies¢ Somefundamentalandenduringpropertiesofhardwareand

softwaresystems:§ Fasterstoragetechnologiesalmostalwayscostmoreperbyteandhave

lowercapacity§ Thegapsbetweenmemorytechnologyspeedsarewidening

§ Truefor:registers↔cache,cache↔DRAM,DRAM↔disk,etc.§ Well-writtenprogramstendtoexhibitgoodlocality

¢ Thesepropertiescomplementeachotherbeautifully

¢ Theysuggestanapproachfororganizingmemoryandstoragesystemsknownasamemoryhierarchy

29

Page 30: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

May6

30

Page 31: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

AnExampleMemoryHierarchy

31

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

off-chipL2cache(SRAM)

Smaller,faster,costlierperbyte

<1ns

1ns

5-10ns

100ns

150,000ns

10,000,000ns(10ms)

1-150ms

SSD

Disk

5-10s

1-2min

15-30min

31days

66months =1.3years

1- 15years

Page 32: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

AnExampleMemoryHierarchy

32

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

Localdiskshold filesretrievedfromdisksonremotenetworkservers

Mainmemoryholdsdiskblocksretrievedfromlocaldisks

off-chipL2cache(SRAM)

L1cacheholdscachelinesretrievedfromL2cache

CPUregistershold wordsretrievedfromL1cache

L2cacheholdscachelinesretrievedfrommainmemory

Smaller,faster,costlierperbyte

Page 33: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

AnExampleMemoryHierarchy

33

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

off-chipL2cache(SRAM)

explicitlyprogram-controlled(e.g.refertoexactly%rax,%rbx)

Smaller,faster,costlierperbyte

programsees“memory”;hardwaremanagescaching

transparently

Page 34: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

MemoryHierarchies

¢ Fundamentalideaofamemoryhierarchy:§ Foreachlevelk,thefaster,smallerdeviceatlevelkservesasacachefor

thelarger,slowerdeviceatlevelk+1.

¢ Whydomemoryhierarchieswork?§ Becauseoflocality,programstendtoaccessthedataatlevelk more

oftenthantheyaccessthedataatlevelk+1.§ Thus,thestorageatlevelk+1canbeslower,andthuslargerand

cheaperperbit.

¢ BigIdea: Thememoryhierarchycreatesalargepoolofstoragethatcostsasmuchasthecheapstoragenearthebottom,butthatservesdatatoprogramsattherateofthefaststoragenearthetop.

34

Page 35: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

IntelCorei7CacheHierarchy

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

Blocksize:64bytesforallcaches.

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

35

Page 36: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization

§ Direct-mapped (sets; index+tag)§ Associativity(ways)§ Replacementpolicy§ Handlingwrites

¢ Programoptimizationsthatconsidercaches

36

Page 37: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheOrganization¢ Whereshoulddatagointhecache?

§ Weneedamappingfrommemoryaddressestospecificlocationsinthecachetomakecheckingthecacheforanaddressfast§ Otherwiseeachmemoryaccessrequires“searchingtheentirecache”(slow!)

§ Whatisadatastructurethatprovidesfastlookup?

37

Page 38: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Aside:HashTablesforFastLookup

38

0123456789

Insert:

52734

1002119

Page 39: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Aside:HashTablesforFastLookup

39

000001010011100101110111

01234567

Insert:

000001000101110011101010100111

Page 40: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Where should we put data in the cache?

40

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Data

Memory Cache

¢ How can we compute this mapping?

addressmodcachesizesameas:low-orderlog2(cachesize)bits

Page 41: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Where should we put data in the cache?

41

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Data

Memory Cache

Collision.Hmm..Thecachemightgetconfusedlater!Why?Andhowdowesolvethat?

Page 42: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Usetagstorecordwhichlocationiscached

42

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Tag Data00??0101

Memory Cache

tag=restofaddressbits

Page 43: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

What’sacacheblock?(orcacheline)

43

0123456789101112131415

ByteAddress

0123

Index

0

1

2

3

4

5

6

7

Block(line)number

block/linesize=?

typicalblock/linesizes:32bytes,64bytes

0000000100100011010001010110011110001001101010111100110111101111

(00)(01)(10)(11)

Tag

Page 44: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Apuzzle.¢ Whatcanyouinferfromthis:

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (12:miss),(13:hit),(14:miss)

44

blocksize>=2bytes blocksize<3bytes

Page 45: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Directmappedcache

¢ Directmapped:§ Eachmemoryaddresscan

bemappedtoexactlyoneindexinthecache.

§ Easytofindanaddress!Cheaptoimplement.

§ But…

¢ Whathappensifaprogramusesaddresses2, 6, 2, 6, 2,…?

45

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________

Tag

2and6conflict,restofcacheisunused

Page 46: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Minimizeconflicts!§ Morecomplicatedhardware,consumesmorepower,slower.

46

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________00100110

TagAccesses:2, 6, 2, 6, 2,…?

Page 47: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Morecomplicatedhardware,consumesmorepower,slower.

¢ Sowecombine thetwoideas:§ Eachaddressmapstoexactlyoneset.§ Eachsetcanhavemorethanoneway.

47

01234567

Set

0

1

2

3

Set

0

1

Set

1-way8 sets,

1 block each

2-way4 sets,

2 blocks each

4-way2 sets,

4 blocks each

0

Set

8-way1 set,

8 blocks

direct mapped fully associative

Page 48: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

NowhowdoIknowwheredatagoes?

48

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

Page 49: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

NowhowdoIknowwheredatagoes?

49

0123456789101112131415

ByteAddress

0123

Index

0

1

2

3

4

5

6

7

Block(line)number

0000000100100011010001010110011110001001101010111100110111101111

00011011

capacity: 8bytesblocksize: 2bytessets: 4ways: 1

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

Wherewould13(1101)bestored?

Tag

Page 50: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

50

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

k=? k=? k=?

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011011 0011

Set Tag Data01234567

1-way associativity8 sets, 1 block each

Set Tag Data

0

1

2

3

Set Tag Data

0

1

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Page 51: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

m-bit Address

k bits(m-k-4) bits4-bit Block

OffsetTag Index

k=3 k=2 k=1

51

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011

Set Tag Data01234567

1-way associativity8 sets, 1 block each

Set Tag Data

0

1

2

3

Set Tag Data

0

1

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Page 52: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Blockreplacement¢ Anyemptyblockinthecorrectsetmaybeusedforstoringdata.¢ Iftherearenoemptyblocks,whichoneshouldwereplace?

§ Obviousfordirect-mappedcaches,whataboutset-associative?§ Cachestypicallyusesomethingclosetoleastrecentlyused(LRU)

(hardwareusuallyimplements“notmostrecentlyused”)

52

01234567

Set

0

1

2

3

Set

0

1

Set

1-way associativity8 sets, 1 block each

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Page 53: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Anotherpuzzle.¢ Whatcanyouinferfromthis?

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (10:miss);(12:miss);(10:miss)

53

12isnotinthesameblockas10

12’sblockreplaced10’sblock

direct-mappedcachewith<=2sets

Page 54: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

GeneralCacheOrganization(S,E,B)E=linesperset(wesay“E-way”)

S=#sets

set

line

0 1 2 B-1tagv

validbitB=bytesofdatapercacheline(thedatablock)

cachesize:SxExBdatabytes

54

Page 55: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheReadE=linesperset

S=2s sets

0 1 2 B-1tagv

validbitB=2b bytesofdatapercacheline(thedatablock)

tbits sbits bbitsAddressofbyteinmemory:

tag setindex

blockoffset

databeginsatthisoffset

• Locateset•Checkifanyline insethasmatching tag•Yes+linevalid:hit• Locatedatastartingatoffset

55

Page 56: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:Direct-MappedCache(E=1)

S=2s sets

Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

findset

56

Page 57: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

blockoffset

tag

57

Page 58: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

int (4Bytes)ishere

blockoffset

Nomatch? Thenoldlinegetsevictedandreplaced

58

Thisiswhywewantalignment!

Page 59: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example(forE=1)int sum_array_rows(double a[16][16]){

int i, j;double sum = 0;

for (i = 0; i < 16; i++)for (j = 0; j < 16; j++)

sum += a[i][j];return sum;

}

32B=4doubles

Assume:cold(empty)cache3bitsforset,5bitsforoffset

aa...ayyy yxx xx000

int sum_array_cols(double a[16][16]){

int i, j;double sum = 0;

for (j = 0; j < 16; j++)for (i = 0; i < 16; i++)

sum += a[i][j];return sum;

}

Assumesum,i,j inregistersAddressofanalignedelementofa:aa...ayyyyxxxx000

59

0,0 0,1 0,2 0,3

0,4 0,5 0,6 0,7

0,8 0,9 0,a 0,b

0,c 0,d 0,e 0,f

1,0 1,1 1,2 1,3

1,4 1,5 1,6 1,7

1,8 1,9 1,a 1,b

1,c 1,d 1,e 1,f

32B=4doubles4missesperrowofarray

4*16=64misseseveryaccessamiss16*16=256misses

0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

4,0 4,1 4,2 4,3

0,0:aa...a000 000 000000,4:aa...a000 001 000001,0:aa...a000 100 000002,0:aa...a001 000 00000

Page 60: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example(forE=1)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

60

x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]

ifx andy havealignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=128

ifx andy haveunalignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=160

x[0] x[1] x[2] x[3]

y[0] y[1] y[2] y[3]

x[4] x[5] x[6] x[7]

y[4] y[5] y[6] y[7]

Inthisexample,cacheblocksare16bytes;8setsincacheHowmanyblockoffsetbits?Howmanysetindexbits?

Addressbits:ttt....tsssbbbbB=16=2b: b=4offsetbitsS=8=2s: s=3indexbits

0: 000....00000000128: 000....10000000160: 000....10100000

Page 61: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

findset

61

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

Page 62: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

compareboth

valid?+ match:yes=hit

blockoffset

tag

62

Page 63: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

valid?+ match:yes=hit

blockoffset

shortint (2Bytes)ishere

Nomatch?• Onelineinsetisselectedforevictionandreplacement• Replacementpolicies:random,leastrecentlyused(LRU),…

63

compareboth

Page 64: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example(forE=2)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

64

x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3]Ifx andy havealignedstartingaddresses,e.g.&x[0]=0,&y[0]=128,canstillfitbothbecausetwolinesineachset

x[4] x[5] x[6] x[7] y[4] y[5] y[6] y[7]

Page 65: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

TypesofCacheMisses:3C’s!¢ Cold(compulsory)miss

§ Occursonfirstaccesstoablock

¢ Conflictmiss§ Conflictmissesoccurwhenthecacheislargeenough,butmultipledata

objectsallmaptothesameslot§ e.g.,referencingblocks0,8,0,8,...couldmisseverytime

§ direct-mappedcacheshavemoreconflictmissesthann-wayset-associative (wheren>1)

¢ Capacitymiss§ Occurswhenthesetofactivecacheblocks(theworkingset)

islargerthanthecache(justwon’tfit,evenifcachewasfully-associative)§ Note:Fully-associative onlyhasColdandCapacitymisses

65

Page 66: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whatisthemainproblemwiththat?§ Whichcopiesshouldweupdate?§ Whatifit’sahit?Miss?

66

L1

L2Cache

L3Cache

MainMemory

Page 67: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whattodoonawrite-hit?§ Write-through:writeimmediatelytomemory,allcachesinbetween.§ Write-back:deferwritetomemoryuntillineisevicted(replaced)

§ Musttrackwhichcachelineshavebeenmodified(“dirtybit”)

¢ Whattodoonawrite-miss?§ Write-allocate(“fetchonwrite”):loadintocache,updatelineincache.

§ Goodifmorewritesorreadstothelocationfollow§ No-write-allocate(“writearound”):justwriteimmediatelytomemory.

¢ Typicalcaches:§ Write-back+Write-allocate,usually§ Write-through+No-write-allocate,occasionally

67

why?

Page 68: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Write-back,write-allocateexample

68

0xBEEFCache

Memory

U

0xCAFE

0xBEEF

0

T

U

dirtybit

tag(thereisonlyonesetinthistinycache,sothetagistheentireaddress!)

Inthisexamplewearesortofignoringblockoffsets.Hereablockholds2bytes(16bits,4hexdigits).

Normallyablockwouldbemuchbiggerandthustherewouldbemultipleitemsperblock.Whileonlyoneiteminthatblockwouldbewrittenatatime,theentirelinewouldbebroughtintocache.

ContentsofmemorystoredataddressU

Page 69: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Write-back,write-allocateexample

69

0xBEEFCache

Memory

U

0xCAFE

0xBEEF

0

T

U

mov 0xFACE, T

dirtybit

Step1:BringTintocache Step2:Write0xFACE tocacheonlyandsetdirtybit.

mov U,%rax

mov 0xFEED,T Writehit!Write0xFEED tocacheonly

1. WriteTbacktomemorysinceit’sdirty.

2. BringUintothecachesowecancopyitinto%rax

Page 70: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

70

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE, T

dirtybit0xCAFE 0

Step1:BringTintocache

Page 71: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

71

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE, T

dirtybit0xFACE 1

Step2:Write0xFACEtocacheonlyandsetdirtybit.

Page 72: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

72

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 10xFEED

Writehit!Write0xFEED to

cacheonly

Page 73: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

73

0xCAFECache

Memory

U

0xFEED

0xBEEF

T

U

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 00xBEEF

mov U,%rax

1.WriteTbacktomemorysinceitisdirty.

2.BringUintothecachesowecancopyitinto%rax

Page 74: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

BacktotheCorei7tolookatways

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

Block/linesize:64bytesforall.

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

74

slower,butmorelikelytohit

Page 75: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

May11¢ Reminders

§ Lab3duetonight@11:59pm§ HW3(mostlycacheproblems)dueFriday

75

Page 76: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Whereelseiscachingused?

76

Page 77: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

SoftwareCachesareMoreFlexible¢ Examples

§ Filesystembuffercaches,browsercaches,etc.§ Content-deliverynetworks(CDN):cachefortheInternet(e.g.Netflix)

¢ Somedesigndifferences§ Almostalwaysfully-associative

§ so,noplacementrestrictions§ indexstructureslikehashtablesarecommon(forplacement)

§ Morecomplexreplacementpolicies§ missesareveryexpensivewhendiskornetworkinvolved§ worththousandsofcyclestoavoidthem

§ Notnecessarilyconstrainedtosingle“block”transfers§ mayfetchorwrite-backinlargerunits,opportunistically

77

Page 78: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

OptimizationsfortheMemoryHierarchy¢ Writecodethathaslocality!

§ Spatial:accessdatacontiguously§ Temporal:makesureaccesstothesamedataisnottoofarapartintime

¢ Howcanyouachievelocality?§ Properchoiceofalgorithm§ Looptransformations

78

Page 79: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs

79

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

/* Compute average score for assignment ‘hw’ */void average_score(int hw) {

int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

student[0]

Memory

student[1]

student[2]

.scores[1]

.scores[1]

.scores[1]

Page 80: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs

1. students[0].scores[1] set:0,tag:000COLD MISS

2. students[1].scores[1] set:0,tag:001COLD MISS

3. students[2].scores[1] set:1,tag:001COLD MISS

80

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

Set Tag Data

0

1

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

0x60010

10 10000

9 9

9 7001

0x6031a

10 8001

Page 81: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs¢ CacheMissAnalysis

§ Accesses: int(4bytes)§ Stride: sizeof(Student)=24§ Cacheblocksize:16bytes§ Notemporallocality,soonlycoldmisses§ Missrate:100%

81

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

Cache

Page 82: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Example:ArrayofStructs vsStructofArrays

82

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

Memory students.names

students.scores[0]

students.scores[1]

Disclaimer:Thisisnotidealstyle-wise,justanoptionforoptimizing.

Page 83: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

1. students.scores[1][0] set:1,tag:100COLD MISS

2. students.scores[1][1] set:1,tag:101HIT

3. students.scores[1][2] set:1,tag:101HIT

83

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

0xb0 = 1011 0000

Set Tag Data

0

1

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

Page 84: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

84

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

¢ CacheMissAnalysis§ Accesses: int(4bytes)§ Stride:4bytes§ Cacheblocksize:16bytes§ 16bytes/block /4bytes/int =4ints/block§ Missrate:

§ 1COLDMISS/block§ 4ints/block – 1MISS/block =3HIT/block§ 75%HITrate

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

Cache

Page 85: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

85

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

Set Tag Data

0

1

double student_grade(int s) {int total = 0;for (int hw = 0; hw < 4; hw++) {

total += students.scores[s][hw];return total / (double)4;

}

Butdon’t forgetaboutotherusesofthisstruct/array…

WhatwouldtheMISSratebeforthis?

Page 86: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Example:MatrixMultiplication

a b

i

j

*c

=

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k;for (i = 0; i < n; i++)

for (j = 0; j < n; j++)for (k = 0; k < n; k++)

c[i*n + j] += a[i*n + k] * b[k*n + j];}

86

i

j

memoryaccesspattern?

Extraexample(skipping inclass)

Page 87: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann,not left-shiftedbyn)

¢ Firstiteration:§ n/8+n=9n/8misses

(omittingmatrixc)

§ Afterwardsincache:(schematic)

*=

n

*=8doubleswide

87

n/8misses

nmisses

eachitemincolumnindifferentcacheline

spatiallocality:chunksof8itemsinarowinsamecacheline

Extraexample(skipping inclass)

Page 88: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)

¢ Otheriterations:§ Again:

n/8+n=9n/8misses(omittingmatrixc)

¢ Totalmisses:§ 9n/8*n2 =(9/8)*n3

n

*=8wide

88onceperelement

Extraexample(skipping inclass)

Page 89: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

BlockedMatrixMultiplicationc = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k, i1, j1, k1;for (i = 0; i < n; i+=B)

for (j = 0; j < n; j+=B)for (k = 0; k < n; k+=B)

/* B x B mini matrix multiplications */for (i1 = i; i1 < i+B; i1++)

for (j1 = j; j1 < j+B; j1++)for (k1 = k; k1 < k+B; k1++)

c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];}

a b

i1

j1

*c

=

BlocksizeBxB89

Extraexample(skipping inclass)

Page 90: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ First(block)iteration:§ B2/8missesforeachblock§ 2n/B*B2/8=nB/4

(omittingmatrixc)

§ Afterwardsincache(schematic)

*=

*=

BlocksizeBxB

n/Bblocks

90

B2elementsperblock,8percacheline

n/Bblocksperrow,n/Bblockspercolumn

Extraexample(skipping inclass)

Page 91: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ Other(block)iterations:§ Sameasfirstiteration§ 2n/B*B2/8=nB/4

¢ Totalmisses:§ nB/4*(n/B)2 =n3/(4B)

*=

BlocksizeBxB

n/Bblocks

91

Extraexample(skipping inclass)

Page 92: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Summary¢ Noblocking: (9/8) *n3

¢ Blocking: 1/(4B) *n3

¢ IfB=8differenceis4*8*9/8=36x¢ IfB=16differenceis4*16*9/8=72x

¢ SuggestslargestpossibleblocksizeB,butlimit3B2 <C!

¢ Reasonfordramaticdifference:§ Matrixmultiplicationhasinherenttemporallocality:

§ Inputdata:3n2,computation2n3

§ EveryarrayelementusedO(n)times!§ Butprogramhastobewrittenproperly

92

Extraexample(skipping inclass)

Page 93: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

Cache-FriendlyCode¢ Programmercanoptimizeforcacheperformance

§ Howdatastructuresareorganized§ Howdataareaccessed

§ Nestedloopstructure§ Blocking(previousexample)isageneraltechnique

¢ Allsystemsfavor“cache-friendlycode”§ Gettingabsoluteoptimumperformanceisveryplatformspecific

§ Cachesizes,linesizes,associativities,etc.§ Cangetmostoftheadvantagewithgenericcode

§ Keepworkingsetreasonablysmall(temporallocality)§ Usesmallstrides(spatiallocality)§ Focusoninnerloopcode

93

Page 94: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

IntelCorei7CacheHierarchy

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

Blocksize:64bytesforallcaches.

94

Page 95: Caches Spring 2016 May 4: Caches · 2016. 5. 10. · Caches Spring 2016 Cache Performance Metrics ¢ Miss Rate § Fraction of memory references not found in cache (misses / accesses)

Spring 2016Caches

TheMemoryMountain

128m32m

8m2m

512k128k

32k0

2000

4000

6000

8000

10000

12000

14000

16000

s1s3

s5s7

s9s11

Size(bytes)

Readth

roughput(MB/s)

Stride(x8bytes)

Corei7Haswell2.1GHz32KBL1d-cache256KBL2cache8MBL3cache64Bblocksize

Slopesofspatiallocality

Ridgesoftemporallocality

L1

Mem

L2

L3

Aggressiveprefetching

95