CS 61C: Great Ideas in Computer Architecture, Lecture 18: Parallel Processing – SIMD
TRANSCRIPT
-
CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
-
61C Survey
"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."
CS61c Lecture 18: Parallel Processing - SIMD
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
61C Topics so far…
• What we learned:
1. Binary numbers
2. C
3. Pointers
4. Assembly language
5. Datapath architecture
6. Pipelining
7. Caches
8. Performance evaluation
9. Floating point
• What does this buy us?
− Promise: execution speed
− Let's check!
-
Reference Problem
• Matrix multiplication
− Basic operation in many engineering, data, and image processing tasks
− Image filtering, noise reduction, …
− Many closely related operations
§ E.g. stereo vision (project 4)
• dgemm
− double-precision floating-point matrix multiplication
-
Application Example: Deep Learning
• Image classification (cats…)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing
-
Matrices
• Square (or rectangular) N x N array of numbers
− Dimension N

$C = A \cdot B$
$c_{ij} = \sum_k a_{ik} b_{kj}$

[Figure: N x N matrix with row index i and column index j, each running from 0 to N-1]
-
Matrix Multiplication

$C = A \cdot B$, $c_{ij} = \sum_k a_{ik} b_{kj}$

[Figure: element $c_{ij}$ computed from row i of A and column j of B, summing over index k]
-
Reference: Python
• Matrix multiplication in Python

N     Python [Mflops]
32    5.4
160   5.5
480   5.4
960   5.3

• 1 Mflops = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N…) takes $2N^3$ flops
-
C
• c = a × b
• a, b, c are N x N matrices
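The C code on this slide did not survive the transcript. A minimal sketch of a scalar dgemm in the style of the P&H running example (column-major storage is an assumption, not confirmed by the slide):

```c
#include <stddef.h>

// Scalar dgemm sketch: C += A * B for n x n matrices of doubles,
// stored column-major (element (i,j) lives at index i + j*n).
void dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i + j*n];          // running sum for c_ij
            for (int k = 0; k < n; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}
```

The triple loop performs the $2N^3$ flops ($N^3$ multiplies, $N^3$ adds) counted on the previous slide.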
-
Timing Program Execution
-
C versus Python

N     C [Gflops]   Python [Gflops]
32    1.30         0.0054
160   1.30         0.0055
480   1.32         0.0054
960   0.91         0.0053

240x!
Which class gives you this kind of power?
We could stop here… but why? Let's do better!
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Why Parallel Processing?
• CPU clock rates are no longer increasing
− Technical & economic challenges
§ Advanced cooling technology too expensive or impractical for most applications
§ Energy costs are prohibitive
• Parallel processing is the only path to higher speed
− Compare airlines:
§ Maximum speed limited by the speed of sound and economics
§ Use more and larger airplanes to increase throughput
§ And smaller seats…
-
Using Parallelism for Performance
• Two basic ways:
− Multiprogramming
§ run multiple independent programs in parallel
§ "Easy"
− Parallel computing
§ run one program faster
§ "Hard"
• We'll focus on parallel computing in the next few lectures
-
New-School Machine Structures
(It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz"
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack that harnesses parallelism to achieve high performance, from warehouse-scale computer and smartphone down through computer (cores, cache memory, input/output), core (instruction unit(s) and functional unit(s) computing A0+B0 … A3+B3), to logic gates]
-
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
− E.g. our trusted MIPS
[Figure: single processing unit]
This is what we did up to now in 61C.
-
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• SIMD computer exploits multiple data streams against a single instruction stream for operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
Today's topic.
-
Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: instruction pool and data pool feeding multiple processing units (PUs)]
Topic of Lecture 19 and beyond.
-
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-instruction, single-data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance
This has few applications. Not covered in 61C.
-
Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
− Single program that runs on all processors of a MIMD
− Cross-processor execution coordination using synchronization primitives
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
SIMD – "Single Instruction Multiple Data"
-
SIMD Applications & Implementations
• Applications
− Scientific computing
§ Matlab, NumPy
− Graphics and video processing
§ Photoshop, …
− Big Data
§ Deep learning
− Gaming
− …
• Implementations
− x86
− ARM
− …
-
First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
-
x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism
http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf
-
CPU Specs (Bernhard's Laptop)

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
-
SIMD Registers
-
SIMD Data Types
-
SIMD Vector Mode
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly…
• x86
− Over 1000 instructions to learn…
− Green Book
• Can we use the compiler to generate all non-SIMD instructions?
-
x86 Intrinsics AVX Data Types
Intrinsics: direct access to registers & assembly from C
[Figure: AVX data types and the registers they map to]
-
Intrinsics AVX Code Nomenclature
-
x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
[Figure: intrinsics guide entry showing 4 parallel multiplies, 2 instructions per clock cycle (CPI = 0.5), and the corresponding assembly instruction]
-
Raw Double-Precision Throughput (Bernhard's Powerbook Pro)

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double flops                     24.8 Gflops

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
-
Vectorized Matrix Multiplication
[Figure: 4 elements of a column of C computed per iteration, summing over k; i advances by 4]
Inner loop:
for i …; i += 4
  for j …
-
"Vectorized" dgemm
-
Performance

N     Gflops
      scalar   avx
32    1.30     4.56
160   1.30     5.47
480   1.32     5.27
960   0.91     3.64

• 4x faster
• But still well below the 24.8 Gflops peak
-
We are flying…
• Survey:
• But… there is so much material to cover!
− Solution: targeted reading
− Weekly homework with integrated reading & lecture review
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
A trip to LA

Commercial airline:
Get to SFO & check-in   SFO → LAX   Get to destination
3 hours                 1 hour      3 hours
Total time: 7 hours

Supersonic aircraft:
Get to SFO & check-in   SFO → LAX   Get to destination
3 hours                 6 min       3 hours
Total time: 6.1 hours

Speedup:
Flying time: $S_{flight}$ = 60/6 = 10x
Trip time:   $S_{trip}$ = 7/6.1 = 1.15x
-
Amdahl's Law
• Get enhancement E for your new PC
− E.g. floating-point rocket booster
• E
− Speeds up some task (e.g. arithmetic) by factor $S_E$
− $F$ is the fraction of the program that uses this "task"

Execution time:
$T_0$ (no E):    | 1-F | F |
$T_E$ (with E):  | 1-F | F/$S_E$ |
(no speedup on the 1-F section; speedup $S_E$ on the F section)

Speedup:
$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$
-
Big Idea: Amdahl's Law
$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$
(1-F: part not sped up; F: part sped up)

Example: the execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

$S = \frac{1}{(1-0.5) + 0.5/2} = \frac{1}{0.75} = 1.33 \ll 2$
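As a quick check of the formula, a one-line helper (illustrative, not from the slides) reproduces the example above:

```c
// Amdahl's law: overall speedup when a fraction F of execution time
// is accelerated by a factor S_E (the remaining 1-F runs unchanged).
double amdahl(double F, double S_E) {
    return 1.0 / ((1.0 - F) + F / S_E);
}
```

`amdahl(0.5, 2.0)` gives the 1.33 of the example; `amdahl(0.95, 19.0)` gives the speedup of 10 used in the "engineering compromise" on the next slide.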
-
Maximum "Achievable" Speed-Up
Question: what is a reasonable number of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup:
$S_{max} = \frac{1}{(1-F) + F/S_E} \xrightarrow{S_E \to \infty} \frac{1}{1-F}$
$F = 95\% \Rightarrow S_{max} = 20$, but $S_E \to \infty$ !?

b) Reasonable "engineering" compromise: equal time in sequential and parallel code:
$1-F = \frac{F}{S_E} \Rightarrow S_E = \frac{F}{1-F} = \frac{0.95}{0.05} = 19$
Then $S = \frac{1}{0.05 + 0.05} = 10$
-
[Figure: speedup vs. number of processors for various parallel fractions]
If the portion of the program that can be parallelized is small, then the speedup is limited.
In this region, the sequential portion limits the performance.
500 processors for 19x; 20 processors for 10x.
-
Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
− Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
− Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
− Just one unit with twice the load of the others cuts speedup almost in half
-
Clickers/Peer Instruction
Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

$S = \frac{T_0}{T_E} = \frac{1}{(1-F) + F/S_E}$

Answer  $S_E$
A       5
B       16
C       20
D       100
E       None of the above
-
Administrivia
• MT2 is
− Tuesday, November 1
− 3:30 - 5 pm
− see web for room assignments
• TA Review Session:
§ Sunday 10/30, 3:30 – 5 PM in 10 Evans
§ See Piazza
-
MT2 Topics
• Covers lecture material up to 10/20
− Caches
− not floating point
• Combinational logic including synthesis and truth tables
• FSMs
• Timing and timing diagrams
• Pipelining
• Datapath, hazards, stalls
• Performance (e.g. CPI, instructions per second, latency)
• Caches
• All topics covered in MT1
− Focus is new material, but do not be surprised by e.g. MIPS assembly
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
Amdahl's Law applied to dgemm
• Measured dgemm performance
− Peak             5.5 Gflops
− Large matrices   3.6 Gflops
− Processor        24.8 Gflops
• Why are we not getting (close to) 25 Gflops?
− Something else (not the floating-point ALU) is limiting performance!
− But what? Possible culprits:
§ Cache
§ Hazards
§ Let's look at both!
-
Pipeline Hazards – dgemm
-
Loop Unrolling
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
• Unrolled code uses 4 registers
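The unrolled listing itself is missing from the transcript. The idea can be sketched on a scalar reduction for portability (the lecture unrolls the AVX loop; the function name and unit-stride layout here are assumptions): four independent accumulators remove the serial dependence on a single running sum, so the pipeline can overlap the multiplies instead of stalling.

```c
// Inner reduction of dgemm unrolled by 4: the 4 partial sums s0..s3
// are independent, hiding the floating-point add latency.
// n is assumed to be a multiple of 4.
double dot_unrolled(int n, const double *a, const double *b) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // the "4 registers"
    for (int k = 0; k < n; k += 4) {
        s0 += a[k]     * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    return (s0 + s1) + (s2 + s3);            // combine partial sums at the end
}
```

To verify the compiler's own unrolling, inspect the generated assembly (e.g. `gcc -O3 -S`) and count the independent accumulator registers.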
-
Performance

N     Gflops
      scalar   avx    unroll
32    1.30     4.56   12.95
160   1.30     5.47   19.70
480   1.32     5.27   14.50
960   0.91     3.64   6.91
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
− $F = 2 \times N^3$ ($N^3$ multiplies, $N^3$ adds)
• How many memory load/stores?
− $M = 3 \times N^2$ (for A, B, C)
• Many more floating-point operations than memory accesses
− $q = F/M = (2/3) \times N$
− Good, since arithmetic is faster than memory access
− Let's check the code…
-
But memory is accessed repeatedly
• Inner loop: $q = F/M = 1$! (2 loads and 2 floating-point operations)
-
Typical Memory Hierarchy
[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), secondary memory (disk or flash)]
Speed (cycles):  ½'s    1's    10's   100's-1000   1,000,000's
Size (bytes):    100's  10K's  M's    G's          T's
Cost/bit:        highest → lowest
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!
-
Sub-Matrix Multiplication, or: Beating Amdahl's Law
-
Blocking
• Idea:
− Rearrange code to use values loaded in cache many times
− Only "few" accesses to slow main memory (DRAM) per floating-point operation
− → throughput limited by FP hardware and cache, not slow DRAM
− P&H p. 556
-
Memory Access Blocking
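The blocked code on this slide is lost in the transcript. A sketch following the P&H p. 556 approach (column-major storage and n a multiple of BLOCKSIZE are assumptions): each BLOCKSIZE x BLOCKSIZE tile of A, B, and C is reused from cache many times before the code moves on.

```c
#define BLOCKSIZE 32

// Multiply one BLOCKSIZE x BLOCKSIZE tile: C tile += A tile * B tile.
// Column-major: element (i,j) of an n x n matrix is at index i + j*n.
static void do_block(int n, int si, int sj, int sk,
                     const double *A, const double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j*n];
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k*n] * B[k + j*n];
            C[i + j*n] = cij;
        }
}

// Blocked dgemm: iterate over tiles so the working set fits in cache.
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

BLOCKSIZE is chosen so that three tiles (one each of A, B, C) fit in the fast cache; 32 doubles squared is 8 KB per tile, a plausible but illustrative choice.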
-
Performance

N     Gflops
      scalar   avx    unroll   blocking
32    1.30     4.56   12.95    13.80
160   1.30     5.47   19.70    21.79
480   1.32     5.27   14.50    20.17
960   0.91     3.64   6.91     15.82
-
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
-
And in Conclusion, …
• Approaches to parallelism
− SISD, SIMD, MIMD (next lecture)
• SIMD
− One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
− Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
− Serial sections limit speedup
− Cache
§ Blocking
− Hazards
§ Loop unrolling