caches spring 2016 may 4: caches · 2016. 5. 10. · caches spring 2016 cache performance metrics...

Post on 01-Sep-2020

24 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Spring 2016Caches

May4:Caches

1

Spring 2016Caches

Roadmap

2

car *c = malloc(sizeof(car));c->miles = 100;c->gals = 17;float mpg = get_mpg(c);free(c);

Car c = new Car();c.setMiles(100);c.setGals(17);float mpg =

c.getMPG();

get_mpg:pushq %rbpmovq %rsp, %rbp...popq %rbpret

Java:C:

Assemblylanguage:

Machinecode:

01110100000110001000110100000100000000101000100111000010110000011111101000011111

Computersystem:

OS:

Memory&dataIntegers&floatsMachinecode&Cx86assemblyProcedures&stacksArrays&structsMemory&cachesProcessesVirtualmemoryMemoryallocationJavavs.C

Spring 2016Caches

HowdoesexecutiontimegrowwithSIZE?

3

int array[SIZE];

int sum = 0;

for (int i = 0; i < 200000; i++) {

for (int j = 0; j < SIZE; j++) {

sum += array[j];

}

}

SIZE

TIME

Plot

Spring 2016Caches

ActualData

4SIZE

Time

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

5

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Problem:lotsofwaitingonmemory6cycle:singlemachinestep(fixed-time)

Spring 2016Caches

Problem:Processor-MemoryBottleneck

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Buslatency/bandwidth

evolvedmuchslower

Core2Duo:Canprocessatleast256Bytes/cycle

Core2Duo:Bandwidth2Bytes/cycleLatency100-200cycles(30-60ns)

Solution:caches7

Cache

cycle:singlemachinestep(fixed-time)

Spring 2016Caches

Cache💰¢ Ahiddenstoragespace

forprovisions,weapons,and/ortreasures

¢ Computermemorywithshortaccesstimeusedforthestorageoffrequentlyorrecentlyusedinstructionsordata(i-cacheandd-cache)§ Moregenerally:

usedtooptimizedatatransfersbetweenanysystemelementswithdifferentcharacteristics(networkinterfacecache,I/Ocache,etc.)

8

Spring 2016Caches

GeneralCacheMechanics

9

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory • Larger,slower,cheapermemory.• Viewedaspartitionedinto“blocks”

or“lines”

Dataiscopiedinblock-sizedtransferunits

• Smaller,faster,moreexpensivememory.

• Cachesasubsetoftheblocks(a.k.a.lines)

Spring 2016Caches

GeneralCacheConcepts:Hit

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:14

14Blockbisincache:Hit!

10

Spring 2016Caches

GeneralCacheConcepts:Miss

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

7 9 14 3Cache

Memory

DatainblockbisneededRequest:12

Blockbisnotincache:Miss!

BlockbisfetchedfrommemoryRequest:12

12

12

12

Blockbisstoredincache•Placementpolicy:determineswherebgoes•Replacementpolicy:determineswhichblockgetsevicted(victim)

11

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

12

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

§ Whyisthisimportant?

block

13

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

¢ Spatiallocality?

block

14

Spring 2016Caches

WhyCachesWork¢ Locality: Programstendtousedataandinstructionswith

addressesnearorequaltothosetheyhaveusedrecently

¢ Temporallocality:§ Recentlyreferenceditemsarelikely

tobereferencedagaininthenearfuture

¢ Spatiallocality:§ Itemswithnearbyaddressestend

tobereferencedclosetogetherintime

§ Howdocachestakeadvantageofthis?

block

block

15

Spring 2016Caches

Example:AnyLocality?sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

16

Spring 2016Caches

Example:AnyLocality?

¢ Data:§ Temporal:sum referencedineachiteration§ Spatial:arraya[] accessedinstride-1pattern

¢ Instructions:§ Temporal:cyclethroughlooprepeatedly§ Spatial:referenceinstructionsinsequence

¢ Beingabletoassessthelocalityofcodeisacrucialskillforaprogrammer

17

sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

Spring 2016Caches

LocalityExample#1int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

18

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Spring 2016Caches

LocalityExample#1

19

1:a[0][0]2:a[0][1]3:a[0][2]4:a[0][3]5:a[1][0]6:a[1][1]7:a[1][2]8:a[1][3]9:a[2][0]10:a[2][1]11:a[2][2]12:a[2][3]

stride-1

OrderAccessed

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemory

M=3,N=4

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

76isjustonepossible startingaddressofarraya

int sum_array_rows(int a[M][N]){

int i, j, sum = 0;

for (i = 0; i < M; i++)for (j = 0; j < N; j++)

sum += a[i][j];

return sum;}

Spring 2016Caches

LocalityExample#2int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

20

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Spring 2016Caches

1:a[0][0]2:a[1][0]3:a[2][0]4:a[0][1]

LocalityExample#2

21

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

5:a[1][1]6:a[2][1]7:a[0][2]8:a[1][2]9:a[2][2]10:a[0][3]11:a[1][3]12:a[2][3]

stride-N

76 92 108

a[0][0]

a[0][1]

a[0][2]

a[0][3]

5a[1][1]

a[1][2]

a[1][3]

0 5a[2][2]

a[2][3]

a[2][1]

a[2][0]

a[1][0]

LayoutinMemoryOrderAccessed

M=3,N=4int sum_array_cols(int a[M][N]){

int i, j, sum = 0;

for (j = 0; j < N; j++)for (i = 0; i < M; i++)

sum += a[i][j];

return sum;}

Spring 2016Caches

LocalityExample#3int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

22

Spring 2016Caches

LocalityExample#3

¢ Whatiswrongwiththiscode?¢ Howcanitbefixed?

2376 92 108

a[0][0][0]

a[0][0][1]

a[0][0][2]

a[0][0][3]

5

a[0][1][1]

a[0][1][2]

a[0][1][3]

0 5

a[0][2][2]

a[0][2][3]

a[0][2][1]

a[0][2][0]

a[0][1][0]

LayoutinMemory(forM=?,N=3,L=4)

a[1][0][0]

a[1][0][1]

a[1][0][2]

a[1][0][3]

5

a[1][1][1]

a[1][1][2]

a[1][1][3]

0 5

a[1][2][2]

a[1][2][3]

a[1][2][1]

a[1][2][0]

a[1][1][0]

a[2][0][0]

a[2][0][1]

a[2][0][2]

a[2][0][3]

5

a[2][1][1]

a[2][1][2]

a[2][1][3]

0 5

a[2][2][2]

a[2][2][3]

a[2][2][1]

a[2][2][0]

a[2][1][0] …

int sum_array_3d(int a[M][N][L]){

int i, j, k, sum = 0;

for (i = 0; i < N; i++)for (j = 0; j < L; j++)

for (k = 0; k < M; k++)sum += a[k][i][j];

return sum;}

Spring 2016Caches

Makememoryfastagain.¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization¢ Programoptimizationsthatconsidercaches

24

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

25

cycle=singlefixed-timemachinestep

Spring 2016Caches

CostofCacheMisses¢ Hugedifferencebetweenahitandamiss

§ Couldbe100x,ifjustL1andmainmemory

¢ Wouldyoubelieve99%hitsistwiceasgoodas97%?§ Consider:

Cachehittimeof1cycleMisspenaltyof100cycles

§ Averageaccesstime:§ 97%hits:1cycle+0.03*100cycles=4cycles§ 99%hits:1cycle+0.01*100cycles=2cycles

¢ Thisiswhy“missrate”isusedinsteadof“hitrate”

26

cycle=singlefixed-timemachinestep

checkthecacheeverytime

Spring 2016Caches

CachePerformanceMetrics

¢ MissRate§ Fractionofmemoryreferencesnotfoundincache(misses/accesses)

=1- hitrate§ Typicalnumbers(inpercentages):

§ 3%- 10%forL1§ Canbequitesmall(e.g.,<1%)forL2,dependingonsize,etc.

¢ HitTime§ Timetodeliveralineinthecachetotheprocessor

§ Includestimetodeterminewhetherthelineisinthecache§ Typicalhittimes:4clockcyclesforL1;10clockcyclesforL2

¢ MissPenalty§ Additionaltimerequiredbecauseofamiss§ Typically50- 200cyclesformissinginL2&goingtomainmemory

(Trend:increasing!)27

Spring 2016Caches

Canwehavemorethanonecache?¢ Whywouldwewanttodothat?

28

Spring 2016Caches

MemoryHierarchies¢ Somefundamentalandenduringpropertiesofhardwareand

softwaresystems:§ Fasterstoragetechnologiesalmostalwayscostmoreperbyteandhave

lowercapacity§ Thegapsbetweenmemorytechnologyspeedsarewidening

§ Truefor:registers↔cache,cache↔DRAM,DRAM↔disk,etc.§ Well-writtenprogramstendtoexhibitgoodlocality

¢ Thesepropertiescomplementeachotherbeautifully

¢ Theysuggestanapproachfororganizingmemoryandstoragesystemsknownasamemoryhierarchy

29

Spring 2016Caches

May6

30

Spring 2016Caches

AnExampleMemoryHierarchy

31

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

off-chipL2cache(SRAM)

Smaller,faster,costlierperbyte

<1ns

1ns

5-10ns

100ns

150,000ns

10,000,000ns(10ms)

1-150ms

SSD

Disk

5-10s

1-2min

15-30min

31days

66months =1.3years

1- 15years

Spring 2016Caches

AnExampleMemoryHierarchy

32

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

Localdiskshold filesretrievedfromdisksonremotenetworkservers

Mainmemoryholdsdiskblocksretrievedfromlocaldisks

off-chipL2cache(SRAM)

L1cacheholdscachelinesretrievedfromL2cache

CPUregistershold wordsretrievedfromL1cache

L2cacheholdscachelinesretrievedfrommainmemory

Smaller,faster,costlierperbyte

Spring 2016Caches

AnExampleMemoryHierarchy

33

registers

on-chipL1cache(SRAM)

mainmemory(DRAM)

localsecondarystorage(localdisks)

Larger,slower,cheaperperbyte

remotesecondarystorage(distributedfilesystems, webservers)

off-chipL2cache(SRAM)

explicitlyprogram-controlled(e.g.refertoexactly%rax,%rbx)

Smaller,faster,costlierperbyte

programsees“memory”;hardwaremanagescaching

transparently

Spring 2016Caches

MemoryHierarchies

¢ Fundamentalideaofamemoryhierarchy:§ Foreachlevelk,thefaster,smallerdeviceatlevelkservesasacachefor

thelarger,slowerdeviceatlevelk+1.

¢ Whydomemoryhierarchieswork?§ Becauseoflocality,programstendtoaccessthedataatlevelk more

oftenthantheyaccessthedataatlevelk+1.§ Thus,thestorageatlevelk+1canbeslower,andthuslargerand

cheaperperbit.

¢ BigIdea: Thememoryhierarchycreatesalargepoolofstoragethatcostsasmuchasthecheapstoragenearthebottom,butthatservesdatatoprogramsattherateofthefaststoragenearthetop.

34

Spring 2016Caches

IntelCorei7CacheHierarchy

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

Blocksize:64bytesforallcaches.

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

35

Spring 2016Caches

Makingmemoryaccessesfast!¢ Cachebasics¢ Principleoflocality¢ Memoryhierarchies¢ Cacheorganization

§ Direct-mapped (sets; index+tag)§ Associativity(ways)§ Replacementpolicy§ Handlingwrites

¢ Programoptimizationsthatconsidercaches

36

Spring 2016Caches

CacheOrganization¢ Whereshoulddatagointhecache?

§ Weneedamappingfrommemoryaddressestospecificlocationsinthecachetomakecheckingthecacheforanaddressfast§ Otherwiseeachmemoryaccessrequires“searchingtheentirecache”(slow!)

§ Whatisadatastructurethatprovidesfastlookup?

37

Spring 2016Caches

Aside:HashTablesforFastLookup

38

0123456789

Insert:

52734

1002119

Spring 2016Caches

Aside:HashTablesforFastLookup

39

000001010011100101110111

01234567

Insert:

000001000101110011101010100111

Spring 2016Caches

Where should we put data in the cache?

40

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Data

Memory Cache

¢ How can we compute this mapping?

addressmodcachesizesameas:low-orderlog2(cachesize)bits

Spring 2016Caches

Where should we put data in the cache?

41

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Data

Memory Cache

Collision.Hmm..Thecachemightgetconfusedlater!Why?Andhowdowesolvethat?

Spring 2016Caches

Usetagstorecordwhichlocationiscached

42

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Tag Data00??0101

Memory Cache

tag=restofaddressbits

Spring 2016Caches

What’sacacheblock?(orcacheline)

43

0123456789101112131415

ByteAddress

0123

Index

0

1

2

3

4

5

6

7

Block(line)number

block/linesize=?

typicalblock/linesizes:32bytes,64bytes

0000000100100011010001010110011110001001101010111100110111101111

(00)(01)(10)(11)

Tag

Spring 2016Caches

Apuzzle.¢ Whatcanyouinferfromthis:

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (12:miss),(13:hit),(14:miss)

44

blocksize>=2bytes blocksize<3bytes

Spring 2016Caches

Directmappedcache

¢ Directmapped:§ Eachmemoryaddresscan

bemappedtoexactlyoneindexinthecache.

§ Easytofindanaddress!Cheaptoimplement.

§ But…

¢ Whathappensifaprogramusesaddresses2, 6, 2, 6, 2,…?

45

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________

Tag

2and6conflict,restofcacheisunused

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Minimizeconflicts!§ Morecomplicatedhardware,consumesmorepower,slower.

46

0000000100100011010001010110011110001001101010111100110111101111

MemoryAddress

________00100110

TagAccesses:2, 6, 2, 6, 2,…?

Spring 2016Caches

Associativity¢ Whatifwecouldstoredatainanyplaceinthecache?

§ Morecomplicatedhardware,consumesmorepower,slower.

¢ Sowecombine thetwoideas:§ Eachaddressmapstoexactlyoneset.§ Eachsetcanhavemorethanoneway.

47

01234567

Set

0

1

2

3

Set

0

1

Set

1-way8 sets,

1 block each

2-way4 sets,

2 blocks each

4-way2 sets,

4 blocks each

0

Set

8-way1 set,

8 blocks

direct mapped fully associative

Spring 2016Caches

NowhowdoIknowwheredatagoes?

48

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

Spring 2016Caches

NowhowdoIknowwheredatagoes?

49

0123456789101112131415

ByteAddress

0123

Index

0

1

2

3

4

5

6

7

Block(line)number

0000000100100011010001010110011110001001101010111100110111101111

00011011

capacity: 8bytesblocksize: 2bytessets: 4ways: 1

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

Wherewould13(1101)bestored?

Tag

Spring 2016Caches

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

50

m-bit Address

k bits(m-k-n) bitsn-bit Block

OffsetTag Index

k=? k=? k=?

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011011 0011

Set Tag Data01234567

1-way associativity8 sets, 1 block each

Set Tag Data

0

1

2

3

Set Tag Data

0

1

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Spring 2016Caches

m-bit Address

k bits(m-k-4) bits4-bit Block

OffsetTag Index

k=3 k=2 k=1

51

Exampleplacementinset-associativecaches¢ Wherewoulddatafromaddress0x1833 beplaced?¢ 0x1833 inbinaryis:

blocksize: 16bytescapacity: 8blocks0001 1000 0011 0011

Set Tag Data01234567

1-way associativity8 sets, 1 block each

Set Tag Data

0

1

2

3

Set Tag Data

0

1

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Spring 2016Caches

Blockreplacement¢ Anyemptyblockinthecorrectsetmaybeusedforstoringdata.¢ Iftherearenoemptyblocks,whichoneshouldwereplace?

§ Obviousfordirect-mappedcaches,whataboutset-associative?§ Cachestypicallyusesomethingclosetoleastrecentlyused(LRU)

(hardwareusuallyimplements“notmostrecentlyused”)

52

01234567

Set

0

1

2

3

Set

0

1

Set

1-way associativity8 sets, 1 block each

2-way associativity4 sets, 2 blocks each

4-way associativity2 sets, 4 blocks each

Spring 2016Caches

Anotherpuzzle.¢ Whatcanyouinferfromthis?

¢ Cachestartsempty¢ Access(addr: hit/miss)stream:

¢ (10:miss);(12:miss);(10:miss)

53

12isnotinthesameblockas10

12’sblockreplaced10’sblock

direct-mappedcachewith<=2sets

Spring 2016Caches

GeneralCacheOrganization(S,E,B)E=linesperset(wesay“E-way”)

S=#sets

set

line

0 1 2 B-1tagv

validbitB=bytesofdatapercacheline(thedatablock)

cachesize:SxExBdatabytes

54

Spring 2016Caches

CacheReadE=linesperset

S=2s sets

0 1 2 B-1tagv

validbitB=2b bytesofdatapercacheline(thedatablock)

tbits sbits bbitsAddressofbyteinmemory:

tag setindex

blockoffset

databeginsatthisoffset

• Locateset•Checkifanyline insethasmatching tag•Yes+linevalid:hit• Locatedatastartingatoffset

55

Spring 2016Caches

Example:Direct-MappedCache(E=1)

S=2s sets

Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

findset

56

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

blockoffset

tag

57

Spring 2016Caches

Example:Direct-MappedCache(E=1)Direct-mapped:OnelinepersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

int (4Bytes)ishere

blockoffset

Nomatch? Thenoldlinegetsevictedandreplaced

58

Thisiswhywewantalignment!

Spring 2016Caches

Example(forE=1)int sum_array_rows(double a[16][16]){

int i, j;double sum = 0;

for (i = 0; i < 16; i++)for (j = 0; j < 16; j++)

sum += a[i][j];return sum;

}

32B=4doubles

Assume:cold(empty)cache3bitsforset,5bitsforoffset

aa...ayyy yxx xx000

int sum_array_cols(double a[16][16]){

int i, j;double sum = 0;

for (j = 0; j < 16; j++)for (i = 0; i < 16; i++)

sum += a[i][j];return sum;

}

Assumesum,i,j inregistersAddressofanalignedelementofa:aa...ayyyyxxxx000

59

0,0 0,1 0,2 0,3

0,4 0,5 0,6 0,7

0,8 0,9 0,a 0,b

0,c 0,d 0,e 0,f

1,0 1,1 1,2 1,3

1,4 1,5 1,6 1,7

1,8 1,9 1,a 1,b

1,c 1,d 1,e 1,f

32B=4doubles4missesperrowofarray

4*16=64misseseveryaccessamiss16*16=256misses

0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

4,0 4,1 4,2 4,3

0,0:aa...a000 000 000000,4:aa...a000 001 000001,0:aa...a000 100 000002,0:aa...a001 000 00000

Spring 2016Caches

Example(forE=1)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

60

x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]

ifx andy havealignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=128

ifx andy haveunalignedstartingaddresses,

e.g.,&x[0]=0,&y[0]=160

x[0] x[1] x[2] x[3]

y[0] y[1] y[2] y[3]

x[4] x[5] x[6] x[7]

y[4] y[5] y[6] y[7]

Inthisexample,cacheblocksare16bytes;8setsincacheHowmanyblockoffsetbits?Howmanysetindexbits?

Addressbits:ttt....tsssbbbbB=16=2b: b=4offsetbitsS=8=2s: s=3indexbits

0: 000....00000000128: 000....10000000160: 000....10100000

Spring 2016Caches

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

findset

61

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

compareboth

valid?+ match:yes=hit

blockoffset

tag

62

Spring 2016Caches

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

E-waySet-AssociativeCache(Here:E=2)E=2:TwolinespersetAssume:cacheblocksize8bytes

tbits 0…01 100Addressofshortint:

valid?+ match:yes=hit

blockoffset

shortint (2Bytes)ishere

Nomatch?• Onelineinsetisselectedforevictionandreplacement• Replacementpolicies:random,leastrecentlyused(LRU),…

63

compareboth

Spring 2016Caches

Example(forE=2)float dotprod(float x[8], float y[8]){

float sum = 0;int i;

for (i = 0; i < 8; i++)sum += x[i]*y[i];

return sum;}

64

x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3]Ifx andy havealignedstartingaddresses,e.g.&x[0]=0,&y[0]=128,canstillfitbothbecausetwolinesineachset

x[4] x[5] x[6] x[7] y[4] y[5] y[6] y[7]

Spring 2016Caches

TypesofCacheMisses:3C’s!¢ Cold(compulsory)miss

§ Occursonfirstaccesstoablock

¢ Conflictmiss§ Conflictmissesoccurwhenthecacheislargeenough,butmultipledata

objectsallmaptothesameslot§ e.g.,referencingblocks0,8,0,8,...couldmisseverytime

§ direct-mappedcacheshavemoreconflictmissesthann-wayset-associative (wheren>1)

¢ Capacitymiss§ Occurswhenthesetofactivecacheblocks(theworkingset)

islargerthanthecache(justwon’tfit,evenifcachewasfully-associative)§ Note:Fully-associative onlyhasColdandCapacitymisses

65

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whatisthemainproblemwiththat?§ Whichcopiesshouldweupdate?§ Whatifit’sahit?Miss?

66

L1

L2Cache

L3Cache

MainMemory

Spring 2016Caches

Whataboutwrites?¢ Multiplecopiesofdataexist:

§ L1,L2,possiblyL3,mainmemory

¢ Whattodoonawrite-hit?§ Write-through:writeimmediatelytomemory,allcachesinbetween.§ Write-back:deferwritetomemoryuntillineisevicted(replaced)

§ Musttrackwhichcachelineshavebeenmodified(“dirtybit”)

¢ Whattodoonawrite-miss?§ Write-allocate(“fetchonwrite”):loadintocache,updatelineincache.

§ Goodifmorewritesorreadstothelocationfollow§ No-write-allocate(“writearound”):justwriteimmediatelytomemory.

¢ Typicalcaches:§ Write-back+Write-allocate,usually§ Write-through+No-write-allocate,occasionally

67

why?

Spring 2016Caches

Write-back,write-allocateexample

68

0xBEEFCache

Memory

U

0xCAFE

0xBEEF

0

T

U

dirtybit

tag(thereisonlyonesetinthistinycache,sothetagistheentireaddress!)

Inthisexamplewearesortofignoringblockoffsets.Hereablockholds2bytes(16bits,4hexdigits).

Normallyablockwouldbemuchbiggerandthustherewouldbemultipleitemsperblock.Whileonlyoneiteminthatblockwouldbewrittenatatime,theentirelinewouldbebroughtintocache.

ContentsofmemorystoredataddressU

Spring 2016Caches

Write-back,write-allocateexample

69

0xBEEFCache

Memory

U

0xCAFE

0xBEEF

0

T

U

mov 0xFACE, T

dirtybit

Step1:BringTintocache Step2:Write0xFACE tocacheonlyandsetdirtybit.

mov U,%rax

mov 0xFEED,T Writehit!Write0xFEED tocacheonly

1. WriteTbacktomemorysinceit’sdirty.

2. BringUintothecachesowecancopyitinto%rax

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

70

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE, T

dirtybit0xCAFE 0

Step1:BringTintocache

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

71

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE, T

dirtybit0xFACE 1

Step2:Write0xFACEtocacheonlyandsetdirtybit.

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

72

0xCAFECache

Memory

T

0xCAFE

0xBEEF

T

U

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 10xFEED

Writehit!Write0xFEED to

cacheonly

Spring 2016Caches

0xBEEFU 0

Write-back,write-allocateexample

73

0xCAFECache

Memory

U

0xFEED

0xBEEF

T

U

mov 0xFACE,T mov 0xFEED,T

dirtybit0xFACE 00xBEEF

mov U,%rax

1.WriteTbacktomemorysinceitisdirty.

2.BringUintothecachesowecancopyitinto%rax

Spring 2016Caches

BacktotheCorei7tolookatways

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

Block/linesize:64bytesforall.

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

74

slower,butmorelikelytohit

Spring 2016Caches

May11¢ Reminders

§ Lab3duetonight@11:59pm§ HW3(mostlycacheproblems)dueFriday

75

Spring 2016Caches

Whereelseiscachingused?

76

Spring 2016Caches

SoftwareCachesareMoreFlexible¢ Examples

§ Filesystembuffercaches,browsercaches,etc.§ Content-deliverynetworks(CDN):cachefortheInternet(e.g.Netflix)

¢ Somedesigndifferences§ Almostalwaysfully-associative

§ so,noplacementrestrictions§ indexstructureslikehashtablesarecommon(forplacement)

§ Morecomplexreplacementpolicies§ missesareveryexpensivewhendiskornetworkinvolved§ worththousandsofcyclestoavoidthem

§ Notnecessarilyconstrainedtosingle“block”transfers§ mayfetchorwrite-backinlargerunits,opportunistically

77

Spring 2016Caches

OptimizationsfortheMemoryHierarchy¢ Writecodethathaslocality!

§ Spatial:accessdatacontiguously§ Temporal:makesureaccesstothesamedataisnottoofarapartintime

¢ Howcanyouachievelocality?§ Properchoiceofalgorithm§ Looptransformations

78

Spring 2016Caches

Example:ArrayofStructs

79

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

/* Compute average score for assignment ‘hw’ */void average_score(int hw) {

int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

student[0]

Memory

student[1]

student[2]

.scores[1]

.scores[1]

.scores[1]

Spring 2016Caches

Example:ArrayofStructs

1. students[0].scores[1] set:0,tag:000COLD MISS

2. students[1].scores[1] set:0,tag:001COLD MISS

3. students[2].scores[1] set:1,tag:001COLD MISS

80

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

Set Tag Data

0

1

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

0x60010

10 10000

9 9

9 7001

0x6031a

10 8001

Spring 2016Caches

Example:ArrayofStructs¢ CacheMissAnalysis

§ Accesses: int(4bytes)§ Stride: sizeof(Student)=24§ Cacheblocksize:16bytes§ Notemporallocality,soonlycoldmisses§ Missrate:100%

81

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

Cache

Spring 2016Caches

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Example:ArrayofStructs vsStructofArrays

82

0x00 0x60010

0x08 10 10

0x10 9 8

0x18 0x6020b

0x20 9 9

0x28 9 7

0x30 0x6031a

0x38 10 8

0x40 5 6

... ...

students[0]

Memory

students[1]

students[2]

.scores[1]

.scores[1]

.scores[1]

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

typedef struct {char * name;int scores[4];

} Student;/* Array of structs */Student students[N];

Memory students.names

students.scores[0]

students.scores[1]

Disclaimer:Thisisnotidealstyle-wise,justanoptionforoptimizing.

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

1. students.scores[1][0] set:1,tag:100COLD MISS

2. students.scores[1][1] set:1,tag:101HIT

3. students.scores[1][2] set:1,tag:101HIT

83

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

0xb0 = 1011 0000

Set Tag Data

0

1

double average_score(int hw) {int total = 0;for (int s = 0; s < N; s++) {

total += students[s].scores[hw];return total / (double)N;

}

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

84

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

¢ CacheMissAnalysis§ Accesses: int(4bytes)§ Stride:4bytes§ Cacheblocksize:16bytes§ 16bytes/block /4bytes/int =4ints/block§ Missrate:

§ 1COLDMISS/block§ 4ints/block – 1MISS/block =3HIT/block§ 75%HITrate

capacity: 32bytesblocksize: 16bytessets: 2ways: 1

Cache

Spring 2016Caches

Example:ArrayofStructs vs StructofArrays

85

/* “Struct of arrays” */struct {

char * names[N];int scores[4][N];

} students;

0x00 0x60010

0x08 0x6020b

0x10 0x6031a

... ...

0x80 10 9

0x88 10 ?

... ...

0xb0 10 9

0xb8 8 ?

... ...

Memory students.names

students.scores[0]

students.scores[1]

Set Tag Data

0

1

double student_grade(int s) {int total = 0;for (int hw = 0; hw < 4; hw++) {

total += students.scores[s][hw];return total / (double)4;

}

Butdon’t forgetaboutotherusesofthisstruct/array…

WhatwouldtheMISSratebeforthis?

Spring 2016Caches

Example:MatrixMultiplication

a b

i

j

*c

=

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k;for (i = 0; i < n; i++)

for (j = 0; j < n; j++)for (k = 0; k < n; k++)

c[i*n + j] += a[i*n + k] * b[k*n + j];}

86

i

j

memoryaccesspattern?

Extraexample(skipping inclass)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann,not left-shiftedbyn)

¢ Firstiteration:§ n/8+n=9n/8misses

(omittingmatrixc)

§ Afterwardsincache:(schematic)

*=

n

*=8doubleswide

87

n/8misses

nmisses

eachitemincolumnindifferentcacheline

spatiallocality:chunksof8itemsinarowinsamecacheline

Extraexample(skipping inclass)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Matrixelementsaredoubles§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)

¢ Otheriterations:§ Again:

n/8+n=9n/8misses(omittingmatrixc)

¢ Totalmisses:§ 9n/8*n2 =(9/8)*n3

n

*=8wide

88onceperelement

Extraexample(skipping inclass)

Spring 2016Caches

BlockedMatrixMultiplicationc = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {

int i, j, k, i1, j1, k1;for (i = 0; i < n; i+=B)

for (j = 0; j < n; j+=B)for (k = 0; k < n; k+=B)

/* B x B mini matrix multiplications */for (i1 = i; i1 < i+B; i1++)

for (j1 = j; j1 < j+B; j1++)for (k1 = k; k1 < k+B; k1++)

c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];}

a b

i1

j1

*c

=

BlocksizeBxB89

Extraexample(skipping inclass)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ First(block)iteration:§ B2/8missesforeachblock§ 2n/B*B2/8=nB/4

(omittingmatrixc)

§ Afterwardsincache(schematic)

*=

*=

BlocksizeBxB

n/Bblocks

90

B2elementsperblock,8percacheline

n/Bblocksperrow,n/Bblockspercolumn

Extraexample(skipping inclass)

Spring 2016Caches

CacheMissAnalysis¢ Assume:

§ Cacheblock=64bytes=8doubles§ CachesizeC<<n(muchsmallerthann)§ Threeblocksfitintocache:3B2 <C

¢ Other(block)iterations:§ Sameasfirstiteration§ 2n/B*B2/8=nB/4

¢ Totalmisses:§ nB/4*(n/B)2 =n3/(4B)

*=

BlocksizeBxB

n/Bblocks

91

Extraexample(skipping inclass)

Spring 2016Caches

Summary¢ Noblocking: (9/8) *n3

¢ Blocking: 1/(4B) *n3

¢ IfB=8differenceis4*8*9/8=36x¢ IfB=16differenceis4*16*9/8=72x

¢ SuggestslargestpossibleblocksizeB,butlimit3B2 <C!

¢ Reasonfordramaticdifference:§ Matrixmultiplicationhasinherenttemporallocality:

§ Inputdata:3n2,computation2n3

§ EveryarrayelementusedO(n)times!§ Butprogramhastobewrittenproperly

92

Extraexample(skipping inclass)

Spring 2016Caches

Cache-FriendlyCode¢ Programmercanoptimizeforcacheperformance

§ Howdatastructuresareorganized§ Howdataareaccessed

§ Nestedloopstructure§ Blocking(previousexample)isageneraltechnique

¢ Allsystemsfavor“cache-friendlycode”§ Gettingabsoluteoptimumperformanceisveryplatformspecific

§ Cachesizes,linesizes,associativities,etc.§ Cangetmostoftheadvantagewithgenericcode

§ Keepworkingsetreasonablysmall(temporallocality)§ Usesmallstrides(spatiallocality)§ Focusoninnerloopcode

93

Spring 2016Caches

IntelCorei7CacheHierarchy

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core0

Regs

L1d-cache

L1i-cache

L2unifiedcache

Core3

L3unifiedcache(sharedbyallcores)

Mainmemory

Processorpackage

L1i-cacheandd-cache:32KB,8-way,Access:4cycles

L2unifiedcache:256KB,8-way,Access:11cycles

L3unifiedcache:8MB,16-way,Access:30-40cycles

Blocksize:64bytesforallcaches.

94

Spring 2016Caches

TheMemoryMountain

128m32m

8m2m

512k128k

32k0

2000

4000

6000

8000

10000

12000

14000

16000

s1s3

s5s7

s9s11

Size(bytes)

Readth

roughput(MB/s)

Stride(x8bytes)

Corei7Haswell2.1GHz32KBL1d-cache256KBL2cache8MBL3cache64Bblocksize

Slopesofspatiallocality

Ridgesoftemporallocality

L1

Mem

L2

L3

Aggressiveprefetching

95

top related