CS 61C: Great Ideas in Computer Architecture, Lecture 18: Parallel Processing – SIMD
TRANSCRIPT
-
CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
-
61C Survey
"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."
CS61c Lecture 18: Parallel Processing - SIMD
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
61C Topics so far…
• What we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
• What does this buy us?
− Promise: execution speed
− Let's check!
-
Reference Problem
• Matrix multiplication
− Basic operation in many engineering, data, and image processing tasks
− Image filtering, noise reduction, …
− Many closely related operations
§ E.g. stereo vision (project 4)
• dgemm
− double-precision floating-point matrix multiplication
-
Application Example: Deep Learning
• Image classification (cats…)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing
-
Matrices
• Square (or rectangular) N x N array of numbers
− Dimension N

$C = A \cdot B$
$c_{ij} = \sum_k a_{ik} b_{kj}$

[Figure: N x N matrix with row index i and column index j, each running from 0 to N-1]
-
Matrix Multiplication

$C = A \cdot B$, $c_{ij} = \sum_k a_{ik} b_{kj}$

[Figure: element $c_{ij}$ computed from row i of A and column j of B, summing over index k]
-
Reference: Python
• Matrix multiplication in Python

N     Python [Mflops]
32    5.4
160   5.5
480   5.4
960   5.3

• 1 Mflops = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N…) takes $2N^3$ flops
-
C
• c = a × b
• a, b, c are N x N matrices
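The C code on this slide did not survive the transcript. A minimal sketch of a scalar dgemm in the style of the P&H running example (column-major storage is an assumption, not confirmed by the slide):

```c
#include <stddef.h>

// Scalar dgemm sketch: C += A * B for n x n matrices of doubles,
// stored column-major (element (i,j) lives at index i + j*n).
void dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i + j*n];          // running sum for c_ij
            for (int k = 0; k < n; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}
```

The triple loop performs the $2N^3$ flops ($N^3$ multiplies, $N^3$ adds) counted on the previous slide.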
-
Timing Program Execution
-
C versus Python

N     C [Gflops]   Python [Gflops]
32    1.30         0.0054
160   1.30         0.0055
480   1.32         0.0054
960   0.91         0.0053

240x!
Which class gives you this kind of power?
We could stop here… but why? Let's do better!
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Why Parallel Processing?
• CPU clock rates are no longer increasing
− Technical & economic challenges
§ Advanced cooling technology too expensive or impractical for most applications
§ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
− Compare airlines:
§ Maximum speed limited by the speed of sound and economics
§ Use more and larger airplanes to increase throughput
§ And smaller seats…
-
Using Parallelism for Performance
• Two basic ways:
− Multiprogramming
§ run multiple independent programs in parallel
§ "Easy"
− Parallel computing
§ run one program faster
§ "Hard"
• We'll focus on parallel computing in the next few lectures
-
New-School Machine Structures
(It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz"
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack that harnesses parallelism to achieve high performance, from warehouse-scale computer and smartphone down through computer (cores, cache memory, input/output), core (instruction unit(s) and functional unit(s) computing A0+B0 … A3+B3), to logic gates]
-
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
− E.g. our trusted MIPS
[Figure: single processing unit]
This is what we did up to now in 61C.
-
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• SIMD computer exploits multiple data streams against a single instruction stream for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
Today's topic.
-
Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: instruction pool and data pool feeding multiple processing units (PUs)]
Topic of Lecture 19 and beyond.
-
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-instruction, single-data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance
This has few applications. Not covered in 61C.
-
Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
− Single program that runs on all processors of a MIMD
− Cross-processor execution coordination using synchronization primitives
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
SIMD – "Single Instruction Multiple Data"
-
SIMD Applications & Implementations
• Applications
− Scientific computing
§ Matlab, NumPy
− Graphics and video processing
§ Photoshop, …
− Big Data
§ Deep learning
− Gaming
− …
• Implementations
− x86
− ARM
− …
-
First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
-
x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism
http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf
-
CPU Specs (Bernhard's Laptop)

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
-
SIMD Registers
-
SIMD Data Types
-
SIMD Vector Mode
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly…
• x86
− Over 1000 instructions to learn…
− Green Book
• Can we use the compiler to generate all non-SIMD instructions?
-
x86 Intrinsics AVX Data Types
Intrinsics: direct access to registers & assembly from C
[Figure: AVX data types and the registers they map to]
-
Intrinsics AVX Code Nomenclature
-
x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
[Figure: intrinsics guide entry showing 4 parallel multiplies, 2 instructions per clock cycle (CPI = 0.5), and the corresponding assembly instruction]
-
Raw Double-Precision Throughput (Bernhard's Powerbook Pro)

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double flops                     24.8 Gflops

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
-
Vectorized Matrix Multiplication
[Figure: 4 elements of a column of C computed per iteration, summing over k; i advances by 4]
Inner loop:
for i …; i += 4
  for j …
-
"Vectorized" dgemm
-
Performance

N     Gflops
      scalar   avx
32    1.30     4.56
160   1.30     5.47
480   1.32     5.27
960   0.91     3.64

• 4x faster
• But still well below the 24.8 Gflops peak
-
We are flying…
• Survey:
• But… there is so much material to cover!
− Solution: targeted reading
− Weekly homework with integrated reading & lecture review
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
A trip to LA

Commercial airline:
Get to SFO & check-in   SFO → LAX   Get to destination
3 hours                 1 hour      3 hours
Total time: 7 hours

Supersonic aircraft:
Get to SFO & check-in   SFO → LAX   Get to destination
3 hours                 6 min       3 hours
Total time: 6.1 hours

Speedup:
Flying time: $S_{flight}$ = 60/6 = 10x
Trip time:   $S_{trip}$ = 7/6.1 = 1.15x
-
Amdahl's Law
• Get enhancement E for your new PC
− E.g. floating-point rocket booster
• E
− Speeds up some task (e.g. arithmetic) by factor $S_E$
− $F$ is the fraction of the program that uses this "task"

Execution time:
$T_0$ (no E):    | 1-F | F |
$T_E$ (with E):  | 1-F | F/$S_E$ |
(no speedup on the 1-F section; speedup $S_E$ on the F section)

Speedup:
$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$
-
Big Idea: Amdahl's Law
$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$
(1-F: part not sped up; F: part sped up)

Example: the execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

$S = \frac{1}{(1-0.5) + 0.5/2} = \frac{1}{0.75} = 1.33 \ll 2$
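As a quick check of the formula, a one-line helper (illustrative, not from the slides) reproduces the example above:

```c
// Amdahl's law: overall speedup when a fraction F of execution time
// is accelerated by a factor S_E (the remaining 1-F runs unchanged).
double amdahl(double F, double S_E) {
    return 1.0 / ((1.0 - F) + F / S_E);
}
```

`amdahl(0.5, 2.0)` gives the 1.33 of the example; `amdahl(0.95, 19.0)` gives the speedup of 10 used in the "engineering compromise" on the next slide.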
-
Maximum "Achievable" Speed-Up
Question: what is a reasonable number of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup:
$S_{max} = \frac{1}{(1-F) + F/S_E} \xrightarrow{S_E \to \infty} \frac{1}{1-F}$
$F = 95\% \Rightarrow S_{max} = 20$, but $S_E \to \infty$ !?

b) Reasonable "engineering" compromise: equal time in sequential and parallel code:
$1-F = \frac{F}{S_E} \Rightarrow S_E = \frac{F}{1-F} = \frac{0.95}{0.05} = 19$
Then $S = \frac{1}{0.05 + 0.05} = 10$
-
[Figure: speedup vs. number of processors for various parallel fractions]
If the portion of the program that can be parallelized is small, then the speedup is limited.
In this region, the sequential portion limits the performance.
500 processors for 19x; 20 processors for 10x.
-
Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
− Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
− Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
− Just one unit with twice the load of the others cuts speedup almost in half
-
Clickers/Peer Instruction
Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$

Answer  $S_E$
A       5
B       16
C       20
D       100
E       None of the above
-
Administrivia
• MT2 is
− Tuesday, November 1
− 3:30 - 5 pm
− see web for room assignments
• TA Review Session:
§ Sunday 10/30, 3:30 – 5 PM in 10 Evans
§ See Piazza
-
MT2 Topics
• Covers lecture material up to 10/20
− Caches
− not floating point
• Combinational logic including synthesis and truth tables
• FSMs
• Timing and timing diagrams
• Pipelining
• Datapath, hazards, stalls
• Performance (e.g. CPI, instructions per second, latency)
• Caches
• All topics covered in MT1
− Focus is new material, but do not be surprised by e.g. MIPS assembly
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Amdahl's Law applied to dgemm
• Measured dgemm performance
− Peak             5.5 Gflops
− Large matrices   3.6 Gflops
− Processor        24.8 Gflops
• Why are we not getting (close to) 25 Gflops?
− Something else (not the floating-point ALU) is limiting performance!
− But what? Possible culprits:
§ Cache
§ Hazards
§ Let's look at both!
-
Pipeline Hazards – dgemm
-
Loop Unrolling
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
• Unrolled code uses 4 registers
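The unrolled listing itself is missing from the transcript. The idea can be sketched on a scalar reduction for portability (the lecture unrolls the AVX loop; the function name and unit-stride layout here are assumptions): four independent accumulators remove the serial dependence on a single running sum, so the pipeline can overlap the multiplies instead of stalling.

```c
// Inner reduction of dgemm unrolled by 4: the 4 partial sums s0..s3
// are independent, hiding the floating-point add latency.
// n is assumed to be a multiple of 4.
double dot_unrolled(int n, const double *a, const double *b) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // the "4 registers"
    for (int k = 0; k < n; k += 4) {
        s0 += a[k]     * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    return (s0 + s1) + (s2 + s3);            // combine partial sums at the end
}
```

To verify the compiler's own unrolling, inspect the generated assembly (e.g. `gcc -O3 -S`) and count the independent accumulator registers.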
-
Performance

N     Gflops
      scalar   avx    unroll
32    1.30     4.56   12.95
160   1.30     5.47   19.70
480   1.32     5.27   14.50
960   0.91     3.64   6.91
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
− $F = 2 \times N^3$ ($N^3$ multiplies, $N^3$ adds)
• How many memory load/stores?
− $M = 3 \times N^2$ (for A, B, C)
• Many more floating-point operations than memory accesses
− $q = F/M = (2/3) \times N$
− Good, since arithmetic is faster than memory access
− Let's check the code…
-
But memory is accessed repeatedly
• Inner loop: $q = F/M = 1$! (2 loads and 2 floating-point operations)
-
Typical Memory Hierarchy
[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), secondary memory (disk or flash)]
Speed (cycles):  ½'s    1's    10's   100's-1000   1,000,000's
Size (bytes):    100's  10K's  M's    G's          T's
Cost/bit:        highest → lowest
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!
-
Sub-Matrix Multiplication, or: Beating Amdahl's Law
-
Blocking
• Idea:
− Rearrange code to use values loaded in cache many times
− Only "few" accesses to slow main memory (DRAM) per floating-point operation
− → throughput limited by FP hardware and cache, not slow DRAM
− P&H p. 556
-
Memory Access Blocking
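The blocked code on this slide is lost in the transcript. A sketch following the P&H p. 556 approach (column-major storage and n a multiple of BLOCKSIZE are assumptions): each BLOCKSIZE x BLOCKSIZE tile of A, B, and C is reused from cache many times before the code moves on.

```c
#define BLOCKSIZE 32

// Multiply one BLOCKSIZE x BLOCKSIZE tile: C tile += A tile * B tile.
// Column-major: element (i,j) of an n x n matrix is at index i + j*n.
static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j*n];
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

// Blocked dgemm: iterate over tiles so the working set fits in cache.
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

BLOCKSIZE is chosen so that three tiles (one each of A, B, C) fit in the fast cache; 32 doubles squared is 8 KB per tile, a plausible but illustrative choice.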
-
Performance

N     Gflops
      scalar   avx    unroll   blocking
32    1.30     4.56   12.95    13.80
160   1.30     5.47   19.70    21.79
480   1.32     5.27   14.50    20.17
960   0.91     3.64   6.91     15.82
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
And in Conclusion, …
• Approaches to parallelism
− SISD, SIMD, MIMD (next lecture)
• SIMD
− One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
− Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
− Serial sections limit speedup
− Cache
§ Blocking
− Hazards
§ Loop unrolling