TRANSCRIPT
-
CS152 Computer Architecture and Engineering / CS252 Graduate Computer Architecture
Lecture 6 – Memory II
Krste Asanovic, Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
-
Last time in Lecture 6

§ Dynamic RAM (DRAM) is main form of main memory storage in use today
  – Holds values on small capacitors, needs refreshing (hence dynamic)
  – Slow multi-step access: precharge, read row, read column
§ Static RAM (SRAM) is faster but more expensive
  – Used to build on-chip memory for caches
§ Cache holds small set of values in fast memory (SRAM) close to processor
  – Need to develop search scheme to find values in cache, and replacement policy to make space for newly accessed locations
§ Caches exploit two forms of predictability in memory reference streams
  – Temporal locality: same location likely to be accessed again soon
  – Spatial locality: neighboring locations likely to be accessed soon
-
Recap: Replacement Policy

In an associative cache, which line from a set should be evicted when the set becomes full?
• Random
• Least-Recently Used (LRU)
  • LRU cache state must be updated on every access
  • True implementation only feasible for small sets (2-way)
  • Pseudo-LRU binary tree often used for 4-8 way
• First-In, First-Out (FIFO) a.k.a. Round-Robin
  • Used in highly associative caches
• Not-Most-Recently Used (NMRU)
  • FIFO with exception for most-recently used line or lines

This is a second-order effect. Why?
Replacement only happens on misses
-
Pseudo-LRU Binary Tree

§ For 2-way cache, on a hit, single LRU bit is set to point to other way
§ For 4-way cache, need 3 bits of state. On cache hit, on path down tree, set all bits to point to other half. On miss, bits say which way to replace

[Figure: binary tree of LRU bits selecting among Way 0 – Way 3]
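The 4-way scheme described above can be sketched in a few lines. This is a minimal illustrative model, not code from the lecture; the class and method names are made up.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set: 3 bits form a binary tree.
    bits[0] is the root (0 = left pair, 1 = right pair), bits[1] covers
    ways 0/1, bits[2] covers ways 2/3. Each bit points toward the half
    to replace next."""

    def __init__(self):
        self.bits = [0, 0, 0]

    def touch(self, way):
        """On a hit, set all bits on the path to point at the *other* half."""
        if way < 2:
            self.bits[0] = 1          # left half used -> root points right
            self.bits[1] = 1 - way    # point at the sibling way
        else:
            self.bits[0] = 0          # right half used -> root points left
            self.bits[2] = 3 - way

    def victim(self):
        """On a miss, follow the bits to find the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]       # way 0 or 1
        return 2 + self.bits[2]       # way 2 or 3
```

After touching ways 0, 1, 2, 3 in order the victim is way 0, matching true LRU; on other access sequences the tree approximation can diverge from true LRU, which is the price paid for needing only 3 bits instead of full ordering state.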
-
CPU-Cache Interaction (5-stage pipeline)

[Figure: 5-stage pipeline datapath. The primary instruction cache is probed with the PC in the fetch stage (inserting a bubble into IR on a miss); the primary data cache sits in the memory stage between MD1 and MD2, with address, write data, and write enable inputs. Cache refill data comes from lower levels of the memory hierarchy, via the memory control.]

Stall entire CPU on data cache miss
-
Improving Cache Performance

Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty

To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty

What is best cache design for 5-stage pipeline?
Biggest cache that doesn't increase hit time past 1 cycle (approx 8-32KB in modern technology)

[design issues more complex with deeper pipelines and/or out-of-order superscalar processors]
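The AMAT formula above is easy to put into code. The numbers in the comment are hypothetical, chosen only to show the units (cycles):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time (in cycles), per the slide's formula:
    AMAT = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: a 1-cycle hit, 5% miss rate, and 20-cycle miss
# penalty give AMAT = 1 + 0.05 * 20 = 2.0 cycles.
```

Note the formula makes the three improvement levers explicit: each term can be attacked independently, and a change that helps one term (e.g. a bigger cache lowering miss rate) may hurt another (hit time).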
-
Causes of Cache Misses: The 3 C's

Compulsory: first reference to a line (a.k.a. cold start misses)
  – misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under perfect replacement policy
Conflict: misses that occur because of collisions due to line-placement strategy
  – misses that would not occur with ideal full associativity
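The taxonomy above can be applied mechanically to an address trace: a miss is compulsory if the line was never referenced before, capacity if a fully-associative LRU cache of the same size would also miss, and conflict otherwise. The sketch below models a direct-mapped cache with one-word lines; the function name and trace format are illustrative, not from the lecture.

```python
from collections import OrderedDict

def classify_misses(trace, num_lines):
    """Classify the misses of a direct-mapped cache with `num_lines`
    one-word lines into the 3 C's, by comparing against an infinite
    cache (the `seen` set) and a same-size fully-associative LRU cache."""
    seen = set()
    direct = {}                        # index -> tag currently resident
    fully = OrderedDict()              # LRU-ordered resident lines
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for line in trace:
        index, tag = line % num_lines, line // num_lines
        fa_hit = line in fully
        if fa_hit:
            fully.move_to_end(line)    # refresh LRU position
        else:
            fully[line] = True
            if len(fully) > num_lines:
                fully.popitem(last=False)   # evict least-recently used
        if direct.get(index) != tag:        # direct-mapped miss
            if line not in seen:
                counts["compulsory"] += 1   # infinite cache would miss too
            elif not fa_hit:
                counts["capacity"] += 1     # full associativity also misses
            else:
                counts["conflict"] += 1     # placement collision only
            direct[index] = tag
        seen.add(line)
    return counts
```

For example, alternating between two lines that map to the same index (trace `[0, 4, 0, 4]` with 4 lines) yields two compulsory and two conflict misses, since a fully-associative cache of the same size would hold both lines.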
-
Effect of Cache Parameters on Performance

§ Larger cache size
  + reduces capacity and conflict misses
  - hit time will increase
§ Higher associativity
  + reduces conflict misses
  - may increase hit time
§ Larger line size
  + reduces compulsory and capacity (reload) misses
  - increases conflict misses and miss penalty
-
Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one extra cache size than can fit in Figure B.8.)
© 2018 Elsevier Inc. All rights reserved.
-
Recap: Line Size and Spatial Locality

A line is the unit of transfer between the cache and memory

[Figure: a 4-word line (Word 0 – Word 3), b = 2. The CPU address is split into a line address/tag (32−b bits) and an offset (b bits), where 2^b = line size a.k.a. block size (in bytes)]

Larger line size has distinct hardware advantages
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing line size?
Fewer lines => more conflicts. Can waste bandwidth.
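The address split shown in the figure is a pair of shift/mask operations. A minimal sketch (the function name is illustrative):

```python
def split_address(addr, b):
    """Split a 32-bit CPU address for a cache with 2**b-byte lines:
    the low b bits are the byte offset within the line, and the
    remaining (32 - b) bits form the line address (the tag in the
    figure's view, before any index bits are carved out)."""
    line_address = addr >> b
    offset = addr & ((1 << b) - 1)
    return line_address, offset

# With 4-byte lines (b = 2), address 0x1234 splits into line address
# 0x48D and offset 0; address 0x1237 has the same line address, offset 3.
```

Doubling the line size moves one bit from the line address to the offset, which is exactly why tag overhead shrinks as lines grow.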
-
Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a cache of different size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 (Gee et al. 1993).
-
Write Policy Choices

§ Cache hit:
  – write-through: write both cache & memory
    • Generally higher traffic but simpler pipeline & cache design
  – write-back: write cache only, memory is written only when the entry is evicted
    • A dirty bit per line further reduces write-back traffic
    • Must handle 0, 1, or 2 accesses to memory for each load/store
§ Cache miss:
  – no-write-allocate: only write to main memory
  – write-allocate (aka fetch-on-write): fetch into cache
§ Common combinations:
  – write-through and no-write-allocate
  – write-back with write-allocate
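The traffic difference between the two common combinations shows up even in a toy model. The sketch below uses a deliberately tiny one-line cache and made-up trace format; it is an illustration of the policies, not code from the lecture.

```python
def memory_writes(trace, write_back):
    """Count writes reaching main memory for a 1-line cache under the two
    common combinations: write-through + no-write-allocate (write_back=False)
    vs write-back + write-allocate (write_back=True).
    `trace` is a list of ("R" or "W", line) pairs."""
    resident, dirty, mem_writes = None, False, 0
    for op, line in trace:
        hit = (line == resident)
        if write_back:
            if not hit:                  # write-allocate: fetch on any miss
                if dirty:
                    mem_writes += 1      # write back the evicted dirty line
                resident, dirty = line, False
            if op == "W":
                dirty = True             # write cache only, mark dirty
        else:
            if op == "W":
                mem_writes += 1          # write-through: every store to memory
                # no-write-allocate: a write miss does not fill the cache
            elif not hit:
                resident = line          # read miss fills the cache
    if write_back and dirty:
        mem_writes += 1                  # final flush of the dirty line
    return mem_writes
```

Four back-to-back stores to the same line cost four memory writes under write-through but only one (the eventual write-back) under write-back with a dirty bit, which is the traffic reduction the slide refers to.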
-
Write Performance

[Figure: cache array for a 2^k-line cache. The address is split into tag (t bits), index (k bits), and offset (b bits); the index selects a line, the stored tag is compared against the address tag to produce HIT and select the data word or byte, and a write enable (WE) controls the data array]
-
Reducing Write Hit Time

Problem: Writes take two cycles in memory stage, one cycle for tag check plus one cycle for data write if hit

Solutions:
§ Design data RAM that can perform read and write in one cycle, restore old value after tag miss
§ Pipelined writes: Hold write data for store in single buffer ahead of cache, write cache data during next store's tag check
§ Fully-associative (CAM Tag) caches: Word line only enabled if hit
-
Pipelining Cache Writes

[Figure: tag and data arrays with a delayed-write-data and delayed-write-address buffer between them. The incoming store address and store data from the CPU are checked against the tags while the previous store's buffered data is written into the data array; load data is muxed back to the CPU on a load/store select]

Data from a store hit is written into data portion of cache during tag access of subsequent store
-
CS152 Administrivia

§ PS2 out on Wednesday, due Wednesday Feb 26
§ Monday Feb 17 is President's Day Holiday, no class!
§ Lab 1 due at start of class on Wednesday Feb 19
§ Friday's sections will review PS1 and solutions
-
CS252

CS252 Administrivia

§ Start thinking of class projects and forming teams of two
§ Proposal due Wednesday February 26th
-
Write Buffer to Reduce Read Miss Penalty

[Figure: CPU with register file and data cache, with a write buffer between the data cache and a unified L2 cache. The write buffer holds evicted dirty lines for a write-back cache OR all writes in a write-through cache]

Processor is not stalled on writes, and read misses can go ahead of write to main memory

Problem: Write buffer may hold updated value of location needed by a read miss
Simple solution: on a read miss, wait for the write buffer to go empty
Faster solution: Check write buffer addresses against read miss addresses, if no match, allow read miss to go ahead of writes, else, return value in write buffer
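The faster solution above amounts to an associative lookup over the buffered store addresses. A minimal sketch, assuming the buffer is a list of (address, value) entries in program order (the names and layout are illustrative):

```python
def read_miss_value(addr, write_buffer, memory):
    """On a read miss, compare the miss address against pending stores in
    the write buffer, newest first so the latest store to the address wins.
    On a match, forward the buffered value; otherwise the read can bypass
    the queued writes and go straight to memory."""
    for waddr, wval in reversed(write_buffer):  # newest store first
        if waddr == addr:
            return wval                         # forward from write buffer
    return memory[addr]                         # no match: safe to bypass
```

In hardware this comparison happens in parallel across all buffer entries (a CAM lookup), not sequentially as in this software model.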
-
Reducing Tag Overhead with Sub-Blocks

§ Problem: Tags are too large, i.e., too much overhead
  – Simple solution: Larger lines, but miss penalty could be large.
§ Solution: Sub-block placement (aka sector cache)
  – A valid bit added to units smaller than full line, called sub-blocks
  – Only read a sub-block on a miss
  – If a tag matches, is the word in the cache?

[Figure: three cache lines with tags 100, 300, and 204, each with per-sub-block valid bits]
-
Multilevel Caches

Problem: A memory cannot be large and fast
Solution: Increasing sizes of cache at each level

[Figure: CPU → L1$ → L2$ → DRAM]

Local miss rate = misses in cache / accesses to cache
Global miss rate = misses in cache / CPU memory accesses
Misses per instruction = misses in cache / number of instructions
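The three definitions above differ only in their denominators, which matters most for the L2: it is accessed only on L1 misses, so its local miss rate can look alarmingly high while its global miss rate stays small. A sketch with hypothetical counts (the function name is illustrative):

```python
def miss_metrics(cpu_accesses, l1_misses, l2_misses, instructions):
    """Apply the slide's definitions to an L1 backed by an L2.
    L2 accesses equal L1 misses, so local and global L2 miss rates
    use different denominators."""
    l1_miss_rate = l1_misses / cpu_accesses     # local == global for L1
    l2_local = l2_misses / l1_misses            # misses / accesses to L2
    l2_global = l2_misses / cpu_accesses        # misses / CPU memory accesses
    mpi = l2_misses / instructions              # misses per instruction
    return l1_miss_rate, l2_local, l2_global, mpi

# E.g. 1000 CPU accesses, 40 L1 misses, 10 L2 misses, 500 instructions:
# L1 miss rate 4%, L2 local 25%, L2 global 1%, 0.02 misses/instruction.
```

The 25% local versus 1% global figure in the example is why L2 miss rates should usually be quoted globally (or as misses per instruction) when comparing designs.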
-
Figure B.14 Miss rates versus cache size for multilevel caches. Second-level caches smaller than the sum of the two 64 KiB first-level caches make little sense, as reflected in the high miss rates. After 256 KiB the single cache is within 10% of the global miss rates. The miss rate of a single-level cache versus size is plotted against the local miss rate and global miss rate of a second-level cache using a 32 KiB first-level cache. The L2 caches (unified) were two-way set associative with LRU replacement. Each had split L1 instruction and data caches that were 64 KiB two-way set associative with LRU replacement. The block size for both L1 and L2 caches was 64 bytes. Data were collected as in Figure B.4.
-
Presence of L2 influences L1 design

§ Use smaller L1 if there is also L2
  – Trade increased L1 miss rate for reduced L1 hit time
  – Backup L2 reduces L1 miss penalty
  – Reduces average access energy
§ Use simpler write-through L1 with on-chip L2
  – Write-back L2 cache absorbs write traffic, doesn't go off-chip
  – At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
  – Simplifies coherence issues
  – Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
-
Figure B.15 Relative execution time by second-level cache size. The two bars are for different clock cycles for an L2 cache hit. The reference execution time of 1.00 is for an 8192 KiB second-level cache with a 1-clock-cycle latency on a second-level hit. These data were collected the same way as in Figure B.14, using a simulator to imitate the Alpha 21264.
-
Inclusion Policy

§ Inclusive multilevel cache:
  – Inner cache can only hold lines also present in outer cache
  – External coherence snoop access need only check outer cache
§ Exclusive multilevel caches:
  – Inner cache may hold lines not in outer cache
  – Swap lines between inner/outer caches on miss
  – Used in AMD Athlon with 64KB primary and 256KB secondary cache

Why choose one type or the other?
-
Itanium-2 On-Chip Caches (Intel/HP, 2002)

Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load + 2 store), single cycle latency
Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency
Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency
-
Power7 On-Chip Caches [IBM 2009]

32KB L1 I$/core
32KB L1 D$/core
3-cycle latency

256KB Unified L2$/core
8-cycle latency

32MB Unified Shared L3$
Embedded DRAM (eDRAM)
25-cycle latency to local slice
-
IBM z196 Mainframe Caches 2010

§ 96 cores (4 cores/chip, 24 chips/system)
  – Out-of-order, 5.2GHz
§ L1: 64KB I-$/core + 128KB D-$/core
§ L2: 1.5MB private/core (144MB total)
§ L3: 24MB shared/chip (eDRAM) (576MB total)
§ L4: 768MB shared/system (eDRAM)
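As a quick sanity check, the per-core and per-chip figures above multiply out to the quoted system totals:

```python
# Verify the z196 totals: 4 cores/chip x 24 chips, 1.5 MB of private L2
# per core, and 24 MB of shared L3 per chip.
cores_per_chip, chips = 4, 24
cores = cores_per_chip * chips        # 4 * 24 = 96 cores
l2_total_mb = 1.5 * cores             # 1.5 MB/core * 96 = 144 MB
l3_total_mb = 24 * chips              # 24 MB/chip * 24 = 576 MB
assert (cores, l2_total_mb, l3_total_mb) == (96, 144.0, 576)
```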
-
Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  – Arvind (MIT)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)