CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Krste Asanović & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c


61C Survey

"It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture."

Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy – blocking
• And in Conclusion, …

61C Topics so far…
• What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Datapath architecture
  6. Pipelining
  7. Caches
  8. Performance evaluation
  9. Floating point
• What does this buy us?
  − Promise: execution speed
  − Let's check!

Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g. stereo vision (project 4)
• dgemm
  − double-precision floating-point matrix multiplication

Application Example: Deep Learning
• Image classification (cats…)
• Pick "best" vacation photos
• Machine translation
• Clean up accents
• Fingerprint verification
• Automatic game playing

Matrices
• Square matrix of dimension N×N

[Figure: an N×N matrix with row index i and column index j, each running from 0 to N−1]

Matrix Multiplication

C = A * B, where C_ij = Σ_k (A_ik * B_kj)

[Figure: matrices A, B, and C; element C_ij is the dot product of row i of A and column j of B, both indexed by k]

Reference: Python
• Matrix multiplication in Python

N     Python [MFLOPS]
32    5.4
160   5.5
480   5.4
960   5.3

• 1 MFLOPS = 1 million floating-point operations (fadd, fmul) per second
• dgemm(N…) takes 2*N^3 flops

C
• c = a * b
• a, b, c are N×N matrices
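The C source on the original slide was lost in extraction. Below is a minimal sketch of the scalar triple loop being benchmarked, using the column-major indexing of the P&H presentation (function and variable names are illustrative, not the lecture's exact code):

#include <stddef.h>

// Naive dgemm sketch: C += A * B for column-major N x N matrices.
void dgemm_scalar(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double cij = C[i + j * n];               /* cij = C[i][j] */
            for (size_t k = 0; k < n; k++)
                cij += A[i + k * n] * B[k + j * n];  /* cij += A[i][k] * B[k][j] */
            C[i + j * n] = cij;
        }
    }
}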

Timing Program Execution
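The timing code on this slide was also lost. One standard way to time dgemm and convert to GFLOPS, sketched here with the POSIX monotonic clock (the lecture may have used a different timer):

#include <time.h>

// Elapsed wall-clock seconds between two clock_gettime() readings.
static double seconds_between(struct timespec t0, struct timespec t1) {
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

// Usage sketch:
//   struct timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   dgemm_scalar(n, A, B, C);
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   double gflops = 2.0 * n * n * n / seconds_between(t0, t1) / 1e9;  // 2*N^3 flops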

C versus Python

N     C [GFLOPS]   Python [GFLOPS]
32    1.30         0.0054
160   1.30         0.0055
480   1.32         0.0054
960   0.91         0.0053

240x!

Which class gives you this kind of power?
We could stop here… but why? Let's do better!

Why Parallel Processing?
• CPU clock rates are no longer increasing
  − Technical & economic challenges
    § Advanced cooling technology too expensive or impractical for most applications
    § Energy costs are prohibitive
• Parallel processing is the only path to higher speed
  − Compare airlines:
    § Maximum speed limited by the speed of sound and economics
    § Use more and larger airplanes to increase throughput
    § And smaller seats…

Using Parallelism for Performance
• Two basic ways:
  − Multiprogramming
    § run multiple independent programs in parallel
    § "Easy"
  − Parallel computing
    § run one program faster
    § "Hard"
• We'll focus on parallel computing in the next few lectures

New-School Machine Structures
(It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz"
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time
• Programming languages

[Figure: the hierarchy that harnesses parallelism to achieve high performance, from warehouse-scale computer and smartphone down through computer (cores, cache memory, input/output), core (instruction unit(s) and functional unit(s) computing A0+B0 … A3+B3), to logic gates; software above, hardware below]

Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines, e.g. our trusted RISC-V pipeline
• This is what we did up to now in 61C

[Figure: a single processing unit fed by one instruction stream and one data stream]

Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• SIMD computer exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
• Today's topic.

Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data
• MIMD architectures include multicore and Warehouse-Scale Computers
• Topic of Lecture 19 and beyond.

[Figure: multiple processing units (PUs), each drawing from a shared instruction pool and a shared data pool]

Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream
• Historical significance
• This has few applications. Not covered in 61C.

Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures; usually both are in the same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  − Single program that runs on all processors of a MIMD
  − Cross-processor execution coordination using synchronization primitives

SIMD – "Single Instruction Multiple Data"

SIMD Applications & Implementations
• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − RISC-V vector extensions

First SIMD Extensions: MIT Lincoln Labs TX-2, 1957

x86 SIMD Evolution
• New instructions
• New, wider, more registers
• More parallelism

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf

Laptop CPU Specs

$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

SIMD Registers

SIMD Data Types
(Now also AVX-512 available)

SIMD Vector Mode

Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly…
• x86 (not RISC-V, since no RISC-V vector hardware is available yet)
  − Over 1000 instructions to learn…
  − Green Book
• Can we use the compiler to generate all non-SIMD instructions?

x86 Intrinsics AVX Data Types

Intrinsics: direct access to registers & assembly from C

Intrinsics AVX Code Nomenclature

x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
• Each intrinsic corresponds directly to an assembly instruction
• E.g. the packed-double multiply intrinsic: 4 parallel multiplies, issued at 2 instructions per clock cycle (CPI = 0.5)
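The example code from these slides did not survive extraction. A minimal sketch of the intrinsics style, using the real AVX intrinsics _mm256_load_pd, _mm256_mul_pd, _mm256_add_pd, and _mm256_store_pd from immintrin.h (the function and array names here are illustrative):

#include <immintrin.h>
#include <stddef.h>

// c[i] += a[i] * b[i], four doubles at a time with 256-bit AVX registers.
// Assumes n is a multiple of 4 and the arrays are 32-byte aligned.
void multiply_add(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_load_pd(a + i);            // load 4 doubles
        __m256d vb = _mm256_load_pd(b + i);
        __m256d vc = _mm256_load_pd(c + i);
        vc = _mm256_add_pd(vc, _mm256_mul_pd(va, vb)); // 4 multiplies + 4 adds
        _mm256_store_pd(c + i, vc);                    // store 4 doubles
    }
}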

Raw Double-Precision Throughput

Characteristic                        Value
CPU                                   i7-5557U
Clock rate (sustained)                3.1 GHz
Instructions per clock (mul_pd)       2
Parallel multiplies per instruction   4
Peak double FLOPS                     24.8 GFLOPS

(3.1 GHz × 2 instructions/cycle × 4 multiplies/instruction = 24.8 GFLOPS)

Actual performance is lower because of overhead.
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Vectorized Matrix Multiplication

Inner loop:
  for i …; i += 4
    for j …

[Figure: matrices A, B, C with indices i, j, k; the i dimension is processed 4 elements at a time]

"Vectorized" dgemm
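The vectorized dgemm listing was lost; here is a sketch in the spirit of the P&H AVX version (column-major, n a multiple of 4, 32-byte-aligned arrays), which computes four elements of a column of C per pass through the k loop:

#include <immintrin.h>
#include <stddef.h>

// AVX dgemm sketch: C += A * B, 4 elements of C in flight at once.
void dgemm_avx(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < n; j++) {
            __m256d c0 = _mm256_load_pd(C + i + j * n);  // 4 elements of C
            for (size_t k = 0; k < n; k++)
                c0 = _mm256_add_pd(c0,                   // c0 += A[i..i+3][k] * B[k][j]
                       _mm256_mul_pd(_mm256_load_pd(A + i + k * n),
                                     _mm256_broadcast_sd(B + k + j * n)));
            _mm256_store_pd(C + i + j * n, c0);
        }
    }
}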

Performance

GFLOPS:
N     scalar   avx
32    1.30     4.56
160   1.30     5.47
480   1.32     5.27
960   0.91     3.64

• 4x faster
• But still << theoretical 25 GFLOPS!

A trip to LA

Commercial airline:
  Get to SFO & check-in: 3 hours   SFO → LAX: 1 hour   Get to destination: 3 hours
  Total time: 7 hours

Supersonic aircraft:
  Get to SFO & check-in: 3 hours   SFO → LAX: 6 min   Get to destination: 3 hours
  Total time: 6.1 hours

Speedup:
  Flying time: S_flight = 60/6 = 10x
  Trip time:   S_trip = 7/6.1 = 1.15x

Amdahl's Law
• Get enhancement E for your new PC
  − E.g. floating-point rocket booster
• E
  − Speeds up some task (e.g. arithmetic) by factor S_E
  − F is the fraction of the program that uses this "task"

Execution time:
  T_0 (no E):    [  1-F  ][   F   ]
  T_E (with E):  [  1-F  ][ F/S_E ]
                 no speedup  sped-up section

Speedup = T_0 / T_E = 1 / ((1-F) + F/S_E)

Big Idea: Amdahl's Law

Speedup = T_0 / T_E = 1 / ((1-F) + F/S_E)
  (1-F): part not sped up; F/S_E: part sped up

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the overall program speed-up?

T_0 / T_E = 1 / ((1-F) + F/S_E) = 1 / ((1 - 0.5) + 0.5/2) = 1.33 << 2

Maximum "Achievable" Speed-Up

Speedup = 1 / ((1-F) + F/S_E); as S_E → ∞, Speedup → 1 / (1-F)

Question: What is a reasonable # of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup:
   S_max = 1 / (1 - 0.95) = 20, but needs S_E = ∞!

b) Reasonable "engineering" compromise:
   Equal time in sequential and parallel code: (1-F) = F/S_E
   S_E = F / (1-F) = 0.95/0.05 = 19 (≈ 20 processors)
   S = 1 / (0.05 + 0.05) = 10

[Plot: speedup vs. number of processors. If the portion of the program that can be parallelized is small, then the speedup is limited; in that region the sequential portion limits the performance. About 20 processors yield a 10x speedup, while even 500 processors yield only 19x.]

Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  − Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
  − Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing the same amount of work
  − Just one unit with twice the load of the others cuts speedup almost in half

Peer Instruction

Suppose a program spends 80% of its time in a square-root routine. How much must you speed up square root to make the program run 5 times faster?

Speedup = 1 / ((1-F) + F/S_E)

RED:    5
GREEN:  20
ORANGE: 500
None of the above

Administrivia
• Midterm 2 is Tuesday during class time!
  − Review session: Saturday 10–12pm @ Hearst Annex A1
  − One double-sided crib sheet; we'll provide the RISC-V card
  − Bring your ID! We'll check it
  − Room assignments posted on Piazza
• Project 2 pre-lab due Friday
• HW4 due 11/3, but advisable to finish by the midterm
• Project 2-2 due 11/6, but also good to finish before MT2

Amdahl's Law applied to dgemm
• Measured dgemm performance
  − Peak            5.5 GFLOPS
  − Large matrices  3.6 GFLOPS
  − Processor       24.8 GFLOPS
• Why are we not getting (close to) 25 GFLOPS?
  − Something else (not the floating-point ALU) is limiting performance!
  − But what? Possible culprits:
    § Cache
    § Hazards
    § Let's look at both!

Pipeline Hazards – dgemm

Loop Unrolling
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
• 4 registers (see the sketch below)
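The unrolled listing was lost as well; here is a sketch in the spirit of the P&H version. The "4 registers" are the four independent AVX accumulators c[0]..c[3], which let multiply-adds from different iterations overlap in the pipeline instead of stalling on hazards:

#include <immintrin.h>
#include <stddef.h>

#define UNROLL 4

// Unrolled AVX dgemm sketch: 4 AVX accumulators = 16 elements of C in flight.
void dgemm_unroll(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i += UNROLL * 4) {
        for (size_t j = 0; j < n; j++) {
            __m256d c[UNROLL];
            for (int r = 0; r < UNROLL; r++)       // 4 independent accumulators
                c[r] = _mm256_load_pd(C + i + r * 4 + j * n);
            for (size_t k = 0; k < n; k++) {
                __m256d bb = _mm256_broadcast_sd(B + k + j * n);
                for (int r = 0; r < UNROLL; r++)   // the loop the compiler unrolls
                    c[r] = _mm256_add_pd(c[r],
                             _mm256_mul_pd(_mm256_load_pd(A + i + r * 4 + k * n), bb));
            }
            for (int r = 0; r < UNROLL; r++)
                _mm256_store_pd(C + i + r * 4 + j * n, c[r]);
        }
    }
}

One way to answer the slide's question: inspect the generated assembly (e.g. compile with gcc -S -O3 and look for the repeated multiply-add sequences).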

Performance

GFLOPS:
N     scalar   avx    unroll
32    1.30     4.56   12.95
160   1.30     5.47   19.70
480   1.32     5.27   14.50
960   0.91     3.64   6.91

FPU versus Memory Access
• How many floating-point operations does matrix multiply take?
  − F = 2 × N^3 (N^3 multiplies, N^3 adds)
• How many memory load/stores?
  − M = 3 × N^2 (for A, B, C)
• Many more floating-point operations than memory accesses
  − q = F/M = (2/3) × N
  − Good, since arithmetic is faster than memory access
  − Let's check the code…

But memory is accessed repeatedly
• Inner loop: each iteration does 2 loads (one element each of A and B) and 2 floating-point operations (multiply and add)
• So q = F/M = 1!
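The inner-loop code on the slide was not recovered; the statement in question is presumably the body of the naive dgemm's k loop shown earlier:

// 2 loads (A[i][k], B[k][j]) for 2 flops (one fmul, one fadd): q = F/M = 1,
// since cij stays in a register while A and B come from memory every iteration.
cij += A[i + k * n] * B[k + j * n];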

Typical Memory Hierarchy

[Figure: on-chip components (control, datapath, register file, instruction cache, data cache), then second-level cache (SRAM), third-level cache (SRAM), main memory (DRAM), and secondary memory (disk or flash).
  Speed (cycles): ½'s | 1's | 10's | 100's–1000 | 1,000,000's
  Size (bytes):   100's | 10K's | M's | G's | T's
  Cost/bit: highest on-chip, lowest in secondary memory]

• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to the fast cache!

Sub-Matrix Multiplication
or: Beating Amdahl's Law

Blocking
• Idea:
  − Rearrange code to use values loaded into the cache many times
  − Only "few" accesses to slow main memory (DRAM) per floating-point operation
  − → throughput limited by FP hardware and cache, not slow DRAM
  − P&H, RISC-V edition p. 465

Memory Access Blocking
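The blocked dgemm listing did not survive either; here is a scalar sketch of the blocking idea following the P&H presentation (the lecture's measured version also keeps the AVX and unrolling optimizations; BLOCKSIZE and the names are assumptions):

#include <stddef.h>

#define BLOCKSIZE 32

// Update one BLOCKSIZE x BLOCKSIZE tile: C[si..][sj..] += A[si..][sk..] * B[sk..][sj..]
static void do_block(size_t n, size_t si, size_t sj, size_t sk,
                     const double *A, const double *B, double *C) {
    for (size_t i = si; i < si + BLOCKSIZE; i++)
        for (size_t j = sj; j < sj + BLOCKSIZE; j++) {
            double cij = C[i + j * n];
            for (size_t k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

// Blocked dgemm: walk the matrices tile by tile so each tile of A, B, C
// stays in cache while it is reused. Assumes n is a multiple of BLOCKSIZE.
void dgemm_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t sj = 0; sj < n; sj += BLOCKSIZE)
        for (size_t si = 0; si < n; si += BLOCKSIZE)
            for (size_t sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}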

Performance

GFLOPS:
N     scalar   avx    unroll   blocking
32    1.30     4.56   12.95    13.80
160   1.30     5.47   19.70    21.79
480   1.32     5.27   14.50    20.17
960   0.91     3.64   6.91     15.82

And in Conclusion, …
• Approaches to parallelism
  − SISD, SIMD, MIMD (next lecture)
• SIMD
  − One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
  − Floating-point heavy → exploit Moore's law to make it fast
• Amdahl's Law:
  − Serial sections limit speedup
  − Cache
    § Blocking
  − Hazards
    § Loop unrolling