caches spring 2016 may 4: caches · 2016. 5. 10. · caches spring 2016 cache performance metrics...

Spring 2016Caches

May4:Caches

Spring 2016Caches

Roadmap

car *c = malloc(sizeof(car));c->miles = 100;c->gals = 17;float mpg = get_mpg(c);free(c);

Car c = new Car();c.setMiles(100);c.setGals(17);float mpg =

c.getMPG();

get_mpg:pushq %rbpmovq %rsp, %rbp...popq %rbpret

Java:C:

Assemblylanguage:

Machinecode:

01110100000110001000110100000100000000101000100111000010110000011111101000011111

Computersystem:

Memory&dataIntegers&floatsMachinecode&Cx86assemblyProcedures&stacksArrays&structsMemory&cachesProcessesVirtualmemoryMemoryallocationJavavs.C

Spring 2016Caches

HowdoesexecutiontimegrowwithSIZE?

int array[SIZE];

int sum = 0;

for (int i = 0; i < 200000; i++) {

for (int j = 0; j < SIZE; j++) {

sum += array[j];

Spring 2016Caches

ActualData

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Problem:lotsofwaitingonmemory6cycle:singlemachinestep(fixed-time)

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Solution:caches7

cycle:singlemachinestep(fixed-time)

Spring 2016Caches

Cache💰¢ Ahiddenstoragespace

forprovisions,weapons,and/ortreasures

¢ Computermemorywithshortaccesstimeusedforthestorageoffrequentlyorrecentlyusedinstructionsordata(i-cacheandd-cache)§ Moregenerally:

usedtooptimizedatatransfersbetweenanysystemelementswithdifferentcharacteristics(networkinterfacecache,I/Ocache,etc.)

Spring 2016Caches

GeneralCacheMechanics

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory • Larger,slower,cheapermemory.• Viewedaspartitionedinto“blocks”

or“lines”

Dataiscopiedinblock-sizedtransferunits

• Smaller,faster,moreexpensivememory.

• Cachesasubsetoftheblocks(a.k.a.lines)

Spring 2016Caches

GeneralCacheConcepts:Hit

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:14

14Blockbisincache:Hit!

Spring 2016Caches

GeneralCacheConcepts:Miss

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:12

Blockbisnotincache:Miss!

BlockbisfetchedfrommemoryRequest:12

Blockbisstoredincache•Placementpolicy:determineswherebgoes•Replacementpolicy:determineswhichblockgetsevicted(victim)

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

Spring 2016Caches

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

§ Whyisthisimportant?

Spring 2016Caches

¢ Spatiallocality?

Spring 2016Caches

¢ Spatiallocality:§ Itemswithnearbyaddressestend

tobereferencedclosetogetherintime

§ Howdocachestakeadvantageofthis?

Spring 2016Caches

Example:AnyLocality?sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

Spring 2016Caches

Example:AnyLocality?

¢ Data:§ Temporal:sum referencedineachiteration§ Spatial:arraya[] accessedinstride-1pattern

¢ Instructions:§ Temporal:cyclethroughlooprepeatedly§ Spatial:referenceinstructionsinsequence

¢ Beingabletoassessthelocalityofcodeisacrucialskillforaprogrammer

sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

Spring 2016Caches

LocalityExample#1int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Spring 2016Caches

LocalityExample#1

1:a[0][0]2:a[0][1]3:a[0][2]4:a[0][3]5:a[1][0]6:a[1][1]7:a[1][2]8:a[1][3]9:a[2][0]10:a[2][1]11:a[2][2]12:a[2][3]

stride-1

OrderAccessed

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemory

M=3,N=4

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

76isjustonepossible startingaddressofarraya

int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

Spring 2016Caches

LocalityExample#2int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Spring 2016Caches

1:a[0][0]2:a[1][0]3:a[2][0]4:a[0][1]

LocalityExample#2

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

5:a[1][1]6:a[2][1]7:a[0][2]8:a[1][2]9:a[2][2]10:a[0][3]11:a[1][3]12:a[2][3]

stride-N

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemoryOrderAccessed

M=3,N=4int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

Spring 2016Caches

LocalityExample#3int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

Spring 2016Caches

LocalityExample#3

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

2376 92 108

a[0][0][0]

a[0][0][1]

a[0][0][2]

a[0][0][3]

a[0][1][1]

a[0][1][2]

a[0][1][3]

a[0][2][2]

a[0][2][3]

a[0][2][1]

a[0][2][0]

a[0][1][0]

LayoutinMemory(forM=?,N=3,L=4)

a[1][0][0]

a[1][0][1]

a[1][0][2]

a[1][0][3]

a[1][1][1]

a[1][1][2]

a[1][1][3]

a[1][2][2]

a[1][2][3]

a[1][2][1]

a[1][2][0]

a[1][1][0]

a[2][0][0]

a[2][0][1]

a[2][0][2]

a[2][0][3]

a[2][1][1]

a[2][1][2]

a[2][1][3]

a[2][2][2]

a[2][2][3]

a[2][2][1]

a[2][2][0]

a[2][1][0] …

int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

Spring 2016Caches

Makememoryfastagain.¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

cycle=singlefixed-timemachinestep

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

§ Averageaccesstime:§ 97%hits:1cycle+0.03*100cycles=4cycles§ 99%hits:1cycle+0.01*100cycles=2cycles

¢ Thisiswhy“missrate”isusedinsteadof“hitrate”

cycle=singlefixed-timemachinestep

checkthecacheeverytime

Spring 2016Caches

CachePerformanceMetrics

¢ MissRate§ Fractionofmemoryreferencesnotfoundincache(misses/accesses)

=1- hitrate§ Typicalnumbers(inpercentages):

§ 3%- 10%forL1§ Canbequitesmall(e.g.,<1%)forL2,dependingonsize,etc.

¢ HitTime§ Timetodeliveralineinthecachetotheprocessor

§ Includestimetodeterminewhetherthelineisinthecache§ Typicalhittimes:4clockcyclesforL1;10clockcyclesforL2

¢ MissPenalty§ Additionaltimerequiredbecauseofamiss§ Typically50- 200cyclesformissinginL2&goingtomainmemory

(Trend:increasing!)27

Spring 2016Caches

Canwehavemorethanonecache?¢ Whywouldwewanttodothat?

Spring 2016Caches

MemoryHierarchies¢ Somefundamentalandenduringpropertiesofhardwareand

softwaresystems:§ Fasterstoragetechnologiesalmostalwayscostmoreperbyteandhave

lowercapacity§ Thegapsbetweenmemorytechnologyspeedsarewidening

§ Truefor:registers↔cache,cache↔DRAM,DRAM↔disk,etc.§ Well-writtenprogramstendtoexhibitgoodlocality

¢ Thesepropertiescomplementeachotherbeautifully

¢ Theysuggestanapproachfororganizingmemoryandstoragesystemsknownasamemoryhierarchy

Spring 2016Caches

AnExampleMemoryHierarchy

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

off-chipL2cache(SRAM)

Smaller,faster,costlierperbyte

5-10ns

150,000ns

10,000,000ns(10ms)

1-150ms

1-2min

15-30min

31days

66months =1.3years

1- 15years

Spring 2016Caches

registers

mainmemory(DRAM)

Localdiskshold filesretrievedfromdisksonremotenetworkservers

Mainmemoryholdsdiskblocksretrievedfromlocaldisks

L1cacheholdscachelinesretrievedfromL2cache

CPUregistershold wordsretrievedfromL1cache

L2cacheholdscachelinesretrievedfrommainmemory

Spring 2016Caches

registers

mainmemory(DRAM)

explicitlyprogram-controlled(e.g.refertoexactly%rax,%rbx)

programsees“memory”;hardwaremanagescaching

transparently

Spring 2016Caches

MemoryHierarchies

¢ Fundamentalideaofamemoryhierarchy:§ Foreachlevelk,thefaster,smallerdeviceatlevelkservesasacachefor

thelarger,slowerdeviceatlevelk+1.

¢ Whydomemoryhierarchieswork?§ Becauseoflocality,programstendtoaccessthedataatlevelk more

oftenthantheyaccessthedataatlevelk+1.§ Thus,thestorageatlevelk+1canbeslower,andthuslargerand

cheaperperbit.

¢ BigIdea: Thememoryhierarchycreatesalargepoolofstoragethatcostsasmuchasthecheapstoragenearthebottom,butthatservesdatatoprogramsattherateofthefaststoragenearthetop.

Spring 2016Caches

IntelCorei7CacheHierarchy

L1d-cache

L1i-cache

L2unifiedcache

L1d-cache

L1i-cache

L2unifiedcache

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

Blocksize:64bytesforallcaches.

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization

§ Direct-mapped (sets; index+tag)§ Associativity(ways)§ Replacementpolicy§ Handlingwrites

¢ Programoptimizationsthatconsidercaches

Spring 2016Caches

CacheOrganization¢ Whereshoulddatagointhecache?

§ Weneedamappingfrommemoryaddressestospecificlocationsinthecachetomakecheckingthecacheforanaddressfast§ Otherwiseeachmemoryaccessrequires“searchingtheentirecache”(slow!)

§ Whatisadatastructurethatprovidesfastlookup?

Spring 2016Caches

Aside:HashTablesforFastLookup

0123456789

Insert:

1002119

Spring 2016Caches

Aside:HashTablesforFastLookup

000001010011100101110111

01234567

Insert:

000001000101110011101010100111

Spring 2016Caches

Where should we put data in the cache?

00011011

0000000100100011010001010110011110001001101010111100110111101111

Memory Cache

¢ How can we compute this mapping?

addressmodcachesizesameas:low-orderlog2(cachesize)bits

Spring 2016Caches

Where should we put data in the cache?

00011011

0000000100100011010001010110011110001001101010111100110111101111

Memory Cache

Collision.Hmm..Thecachemightgetconfusedlater!Why?Andhowdowesolvethat?

Spring 2016Caches

Usetagstorecordwhichlocationiscached

00011011

0000000100100011010001010110011110001001101010111100110111101111

Tag Data00??0101

Memory Cache

tag=restofaddressbits

Spring 2016Caches

What’sacacheblock?(orcacheline)

0123456789101112131415

ByteAddress

Block(line)number

block/linesize=?

typicalblock/linesizes:32bytes,64bytes

0000000100100011010001010110011110001001101010111100110111101111

(00)(01)(10)(11)

Spring 2016Caches

Apuzzle.¢ Whatcanyouinferfromthis:

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (12:miss),(13:hit),(14:miss)

blocksize>=2bytes blocksize<3bytes

Spring 2016Caches

Directmappedcache

¢ Directmapped:§ Eachmemoryaddresscan

bemappedtoexactlyoneindexinthecache.

§ Easytofindanaddress!Cheaptoimplement.

§ But…

¢ Whathappensifaprogramusesaddresses2, 6, 2, 6, 2,…?

00011011

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________

2and6conflict,restofcacheisunused

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Minimizeconflicts!§ Morecomplicatedhardware,consumesmorepower,slower.

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________00100110

TagAccesses:2, 6, 2, 6, 2,…?

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Morecomplicatedhardware,consumesmorepower,slower.

¢ Sowecombine thetwoideas:§ Eachaddressmapstoexactlyoneset.§ Eachsetcanhavemorethanoneway.

01234567

1-way8 sets,

1 block each

2-way4 sets,

2 blocks each

4-way2 sets,

4 blocks each

8-way1 set,

8 blocks

direct mapped fully associative

Spring 2016Caches

NowhowdoIknowwheredatagoes?

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

Spring 2016Caches

NowhowdoIknowwheredatagoes?

0123456789101112131415

ByteAddress

Block(line)number

0000000100100011010001010110011110001001101010111100110111101111

00011011

capacity: 8bytesblocksize: 2bytessets: 4ways: 1

m-bit Address

OffsetTag Index

Wherewould13(1101)bestored?

Spring 2016Caches

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

m-bit Address

OffsetTag Index

k=? k=? k=?

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011011 0011

Set Tag Data01234567

1-way associativity8 sets, 1 block each

Set Tag Data

2-way associativity4 sets, 2 blocks each

Spring 2016Caches

m-bit Address

k bits(m-k-4) bits4-bit Block

OffsetTag Index

k=3 k=2 k=1

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011

Set Tag Data01234567

Set Tag Data

Spring 2016Caches

Blockreplacement¢ Anyemptyblockinthecorrectsetmaybeusedforstoringdata.¢ Iftherearenoemptyblocks,whichoneshouldwereplace?

§ Obviousfordirect-mappedcaches,whataboutset-associative?§ Cachestypicallyusesomethingclosetoleastrecentlyused(LRU)

(hardwareusuallyimplements“notmostrecentlyused”)

01234567

Spring 2016Caches

Anotherpuzzle.¢ Whatcanyouinferfromthis?

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (10:miss);(12:miss);(10:miss)

12isnotinthesameblockas10

12’sblockreplaced10’sblock

direct-mappedcachewith<=2sets

Spring 2016Caches

GeneralCacheOrganization(S,E,B)E=linesperset(wesay“E-way”)

S=#sets

0 1 2 B-1tagv

validbitB=bytesofdatapercacheline(thedatablock)

cachesize:SxExBdatabytes

Spring 2016Caches

CacheReadE=linesperset

S=2s sets

0 1 2 B-1tagv

validbitB=2b bytesofdatapercacheline(thedatablock)

tbits sbits bbitsAddressofbyteinmemory:

tag setindex

blockoffset

databeginsatthisoffset

• Locateset•Checkifanyline insethasmatching tag•Yes+linevalid:hit• Locatedatastartingatoffset

Spring 2016Caches

Example:Direct-MappedCache(E=1)

S=2s sets

Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

findset

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

blockoffset

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

int (4Bytes)ishere

blockoffset

Nomatch? Thenoldlinegetsevictedandreplaced

Thisiswhywewantalignment!

Spring 2016Caches

Example(forE=1)int sum_array_rows(double a[16][16]){

int i, j;double sum = 0;

for (i = 0; i < 16; i++)for (j = 0; j < 16; j++)

sum += a[i][j];return sum;

32B=4doubles

Assume:cold(empty)cache3bitsforset,5bitsforoffset

aa...ayyy yxx xx000

int sum_array_cols(double a[16][16]){

int i, j;double sum = 0;

for (j = 0; j < 16; j++)for (i = 0; i < 16; i++)

sum += a[i][j];return sum;

Assumesum,i,j inregistersAddressofanalignedelementofa:aa...ayyyyxxxx000

0,0 0,1 0,2 0,3

0,4 0,5 0,6 0,7

0,8 0,9 0,a 0,b

0,c 0,d 0,e 0,f

1,0 1,1 1,2 1,3

1,4 1,5 1,6 1,7

1,8 1,9 1,a 1,b

1,c 1,d 1,e 1,f

32B=4doubles4missesperrowofarray

4*16=64misseseveryaccessamiss16*16=256misses

0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

4,0 4,1 4,2 4,3

0,0:aa...a000 000 000000,4:aa...a000 001 000001,0:aa...a000 100 000002,0:aa...a001 000 00000

Spring 2016Caches

Example(forE=1)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]

ifx andy havealignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=128

ifx andy haveunalignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=160

x[0] x[1] x[2] x[3]

y[0] y[1] y[2] y[3]

x[4] x[5] x[6] x[7]

y[4] y[5] y[6] y[7]

Inthisexample,cacheblocksare16bytes;8setsincacheHowmanyblockoffsetbits?Howmanysetindexbits?

Addressbits:ttt....tsssbbbbB=16=2b: b=4offsetbitsS=8=2s: s=3indexbits

0: 000....00000000128: 000....10000000160: 000....10100000

Spring 2016Caches

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

findset

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

compareboth

valid?+ match:yes=hit

blockoffset

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

valid?+ match:yes=hit

blockoffset

shortint (2Bytes)ishere

Nomatch?• Onelineinsetisselectedforevictionandreplacement• Replacementpolicies:random,leastrecentlyused(LRU),…

compareboth

Spring 2016Caches

Example(forE=2)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3]Ifx andy havealignedstartingaddresses,e.g.&x[0]=0,&y[0]=128,canstillfitbothbecausetwolinesineachset

x[4] x[5] x[6] x[7] y[4] y[5] y[6] y[7]

Spring 2016Caches

TypesofCacheMisses:3C’s!¢ Cold(compulsory)miss

§ Occursonfirstaccesstoablock

¢ Conflictmiss§ Conflictmissesoccurwhenthecacheislargeenough,butmultipledata

objectsallmaptothesameslot§ e.g.,referencingblocks0,8,0,8,...couldmisseverytime

§ direct-mappedcacheshavemoreconflictmissesthann-wayset-associative (wheren>1)

¢ Capacitymiss§ Occurswhenthesetofactivecacheblocks(theworkingset)

islargerthanthecache(justwon’tfit,evenifcachewasfully-associative)§ Note:Fully-associative onlyhasColdandCapacitymisses

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whatisthemainproblemwiththat?§ Whichcopiesshouldweupdate?§ Whatifit’sahit?Miss?

L2Cache

L3Cache

MainMemory

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whattodoonawrite-hit?§ Write-through:writeimmediatelytomemory,allcachesinbetween.§ Write-back:deferwritetomemoryuntillineisevicted(replaced)

§ Musttrackwhichcachelineshavebeenmodified(“dirtybit”)

¢ Whattodoonawrite-miss?§ Write-allocate(“fetchonwrite”):loadintocache,updatelineincache.

§ Goodifmorewritesorreadstothelocationfollow§ No-write-allocate(“writearound”):justwriteimmediatelytomemory.

¢ Typicalcaches:§ Write-back+Write-allocate,usually§ Write-through+No-write-allocate,occasionally

Spring 2016Caches

Write-back,write-allocateexample

0xBEEFCache

Memory

0xCAFE

0xBEEF

dirtybit

tag(thereisonlyonesetinthistinycache,sothetagistheentireaddress!)

Inthisexamplewearesortofignoringblockoffsets.Hereablockholds2bytes(16bits,4hexdigits).

Normallyablockwouldbemuchbiggerandthustherewouldbemultipleitemsperblock.Whileonlyoneiteminthatblockwouldbewrittenatatime,theentirelinewouldbebroughtintocache.

ContentsofmemorystoredataddressU

Spring 2016Caches

0xBEEFCache

Memory

0xCAFE

0xBEEF

mov 0xFACE, T

dirtybit

Step1:BringTintocache Step2:Write0xFACE tocacheonlyandsetdirtybit.

mov U,%rax

mov 0xFEED,T Writehit!Write0xFEED tocacheonly

1. WriteTbacktomemorysinceit’sdirty.

2. BringUintothecachesowecancopyitinto%rax

Spring 2016Caches

0xBEEFU 0

0xCAFECache

Memory

0xCAFE

0xBEEF

mov 0xFACE, T

dirtybit0xCAFE 0

Step1:BringTintocache

Spring 2016Caches

0xBEEFU 0

0xCAFECache

Memory

0xCAFE

0xBEEF

mov 0xFACE, T

dirtybit0xFACE 1

Step2:Write0xFACEtocacheonlyandsetdirtybit.

Spring 2016Caches

0xBEEFU 0

0xCAFECache

Memory

0xCAFE

0xBEEF

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 10xFEED

Writehit!Write0xFEED to

cacheonly

Spring 2016Caches

0xBEEFU 0

0xCAFECache

Memory

0xFEED

0xBEEF

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 00xBEEF

mov U,%rax

1.WriteTbacktomemorysinceitisdirty.

2.BringUintothecachesowecancopyitinto%rax

Spring 2016Caches

BacktotheCorei7tolookatways

L1d-cache

L1i-cache

L2unifiedcache

L1d-cache

L1i-cache

L2unifiedcache

Mainmemory

Processorpackage

Block/linesize:64bytesforall.

slower,butmorelikelytohit

Spring 2016Caches

May11¢ Reminders

§ Lab3duetonight@11:59pm§ HW3(mostlycacheproblems)dueFriday

Spring 2016Caches

Whereelseiscachingused?

Spring 2016Caches

SoftwareCachesareMoreFlexible¢ Examples

§ Filesystembuffercaches,browsercaches,etc.§ Content-deliverynetworks(CDN):cachefortheInternet(e.g.Netflix)

¢ Somedesigndifferences§ Almostalwaysfully-associative

§ so,noplacementrestrictions§ indexstructureslikehashtablesarecommon(forplacement)

§ Morecomplexreplacementpolicies§ missesareveryexpensivewhendiskornetworkinvolved§ worththousandsofcyclestoavoidthem

§ Notnecessarilyconstrainedtosingle“block”transfers§ mayfetchorwrite-backinlargerunits,opportunistically

Spring 2016Caches

OptimizationsfortheMemoryHierarchy¢ Writecodethathaslocality!

§ Spatial:accessdatacontiguously§ Temporal:makesureaccesstothesamedataisnottoofarapartintime

¢ Howcanyouachievelocality?§ Properchoiceofalgorithm§ Looptransformations

Spring 2016Caches

Example:ArrayofStructs

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

/* Compute average score for assignment ‘hw’ */void average_score(int hw) {

int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

student[0]

Memory

student[1]

student[2]

.scores[1]

Spring 2016Caches

Example:ArrayofStructs

1. students[0].scores[1] set:0,tag:000COLD MISS

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

Set Tag Data

0x60010

10 10000

9 7001

0x6031a

10 8001

Spring 2016Caches

Example:ArrayofStructs¢ CacheMissAnalysis

§ Accesses: int(4bytes)§ Stride: sizeof(Student)=24§ Cacheblocksize:16bytes§ Notemporallocality,soonlycoldmisses§ Missrate:100%

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

Spring 2016Caches

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Example:ArrayofStructs vsStructofArrays

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

Memory students.names

students.scores[0]

students.scores[1]

Disclaimer:Thisisnotidealstyle-wise,justanoptionforoptimizing.

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

1. students.scores[1][0] set:1,tag:100COLD MISS

2. students.scores[1][1] set:1,tag:101HIT

3. students.scores[1][2] set:1,tag:101HIT

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

students.scores[0]

students.scores[1]

0xb0 = 1011 0000

Set Tag Data

Spring 2016Caches

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

students.scores[0]

students.scores[1]

¢ CacheMissAnalysis§ Accesses: int(4bytes)§ Stride:4bytes§ Cacheblocksize:16bytes§ 16bytes/block /4bytes/int =4ints/block§ Missrate:

§ 1COLDMISS/block§ 4ints/block – 1MISS/block =3HIT/block§ 75%HITrate

Spring 2016Caches

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

students.scores[0]

students.scores[1]

Set Tag Data

double student_grade(int s) {int total = 0;for (int hw = 0; hw < 4; hw++) {

total += students.scores[s][hw];return total / (double)4;

Butdon’t forgetaboutotherusesofthisstruct/array…

WhatwouldtheMISSratebeforthis?

Spring 2016Caches

Example:MatrixMultiplication

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k;for (i = 0; i < n; i++)

for (j = 0; j < n; j++)for (k = 0; k < n; k++)

c[i*n + j] += a[i*n + k] * b[k*n + j];}

memoryaccesspattern?

Extraexample(skipping inclass)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann,not left-shiftedbyn)

¢ Firstiteration:§ n/8+n=9n/8misses

(omittingmatrixc)

§ Afterwardsincache:(schematic)

*=8doubleswide

n/8misses

nmisses

eachitemincolumnindifferentcacheline

spatiallocality:chunksof8itemsinarowinsamecacheline

Spring 2016Caches

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)

¢ Otheriterations:§ Again:

n/8+n=9n/8misses(omittingmatrixc)

¢ Totalmisses:§ 9n/8*n2 =(9/8)*n3

*=8wide

88onceperelement

Spring 2016Caches

BlockedMatrixMultiplicationc = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k, i1, j1, k1;for (i = 0; i < n; i+=B)

for (j = 0; j < n; j+=B)for (k = 0; k < n; k+=B)

/* B x B mini matrix multiplications */for (i1 = i; i1 < i+B; i1++)

for (j1 = j; j1 < j+B; j1++)for (k1 = k; k1 < k+B; k1++)

c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];}

BlocksizeBxB89

Spring 2016Caches

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ First(block)iteration:§ B2/8missesforeachblock§ 2n/B*B2/8=nB/4

(omittingmatrixc)

§ Afterwardsincache(schematic)

BlocksizeBxB

n/Bblocks

B2elementsperblock,8percacheline

n/Bblocksperrow,n/Bblockspercolumn

Spring 2016Caches

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ Other(block)iterations:§ Sameasfirstiteration§ 2n/B*B2/8=nB/4

¢ Totalmisses:§ nB/4*(n/B)2 =n3/(4B)

BlocksizeBxB

n/Bblocks

Spring 2016Caches

Summary¢ Noblocking: (9/8) *n3

¢ Blocking: 1/(4B) *n3

¢ IfB=8differenceis4*8*9/8=36x¢ IfB=16differenceis4*16*9/8=72x

¢ SuggestslargestpossibleblocksizeB,butlimit3B2 <C!

¢ Reasonfordramaticdifference:§ Matrixmultiplicationhasinherenttemporallocality:

§ Inputdata:3n2,computation2n3

§ EveryarrayelementusedO(n)times!§ Butprogramhastobewrittenproperly

Spring 2016Caches

Cache-FriendlyCode¢ Programmercanoptimizeforcacheperformance

§ Howdatastructuresareorganized§ Howdataareaccessed

§ Nestedloopstructure§ Blocking(previousexample)isageneraltechnique

¢ Allsystemsfavor“cache-friendlycode”§ Gettingabsoluteoptimumperformanceisveryplatformspecific

§ Cachesizes,linesizes,associativities,etc.§ Cangetmostoftheadvantagewithgenericcode

§ Keepworkingsetreasonablysmall(temporallocality)§ Usesmallstrides(spatiallocality)§ Focusoninnerloopcode

Spring 2016Caches

IntelCorei7CacheHierarchy

L1d-cache

L1i-cache

L2unifiedcache

L1d-cache

L1i-cache

L2unifiedcache

Mainmemory

Processorpackage

Blocksize:64bytesforallcaches.

Spring 2016Caches

TheMemoryMountain

128m32m

512k128k

Size(bytes)

Readth

roughput(MB/s)

Stride(x8bytes)

Corei7Haswell2.1GHz32KBL1d-cache256KBL2cache8MBL3cache64Bblocksize

Slopesofspatiallocality

Ridgesoftemporallocality

Aggressiveprefetching

caches spring 2016 may 4: caches · 2016. 5. 10. · caches spring 2016 cache performance metrics...

Documents

speicherhierarchie, caches, consistency models · cache cpu...

winter 2002cse 141 - cache improving memory with caches cpu...

university of washington today memory hierarchy, caches,...

reducing cache misses 5.1 introduction 5.2 the abcs of...

cache memories topics generic cache-memory organization...

cis429/529 cache basics 1 caches °why is caching needed?...

advanced systems lab - advanced computing laboratory ·...

selective gpu caches to eliminate cpu–gpu hw cache...

cache memories february 24, 2004 topics generic cache memory...

cpu caches and why you care - scott meyers · cpu caches...

1 seoul national university cache memories. 2 seoul national...

cpsc 312 cache memories slides source: bryant topics generic...

chapter 7 – large and fast: exploiting memory...

1 lecture 11: large cache design topics: large cache basics...

unique cache types - geocachealaska.org cache...beacon”...

cache memories topics generic cache memory organization...

cache memories may 5, 2008 topics generic cache memory...

caches and virtual memory:- plan - uppsala...

shared memory - caches, cache coherence and memory...

caches ii - courses.cs.washington.edu … · cache puzzle...