
CS 61C: Great Ideas in Computer Architecture

Lecture 19: Thread-Level Parallel Processing

Bernhard Boser & Randy Katz

http://inst.eecs.berkeley.edu/~cs61c

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Improving Performance

1. Increase clock rate fs
   − Reached practical maximum for today's technology
   − < 5 GHz for general-purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, "instruction-level parallelism"
3. Perform multiple tasks simultaneously
   − Multiple CPUs, each executing a different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web http requests over different computers
     § E.g. run ppt (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
   − High fs, SIMD, multiple parallel tasks

Today's Lecture

New-School Machine Structures (It's a bit more complicated!)

• Parallel Requests
  Assigned to computer, e.g., search "Katz"
• Parallel Threads
  Assigned to core, e.g., lookup, ads
• Parallel Instructions
  > 1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  > 1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages

[Figure: the software/hardware hierarchy, from warehouse-scale computer and smartphone down through computer, cores, memory (cache), and input/output, to instruction unit(s) and functional unit(s) (A0+B0 … A3+B3) and logic gates, with the theme "Harness Parallelism & Achieve High Performance". The core level is the focus of Project 4.]

Parallel Computer Architectures

• Several separate computers, with some means for communication (e.g. Ethernet)
• Massive array of computers, fast communication between processors
• Multi-core CPU: > 1 datapath in a single chip; cores share the L3 cache, memory, and peripherals. Example: the Hive machines
• GPU: "graphics processing unit"

Example: CPU with 2 Cores

[Diagram: two processor "cores", each with its own control and datapath (PC, registers, ALU). Both cores issue memory accesses (processor 0 and processor 1 memory accesses) to a shared memory (bytes) and to input/output through the I/O-memory interfaces.]

Multiprocessor Execution Model

• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest-level caches (e.g. 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often the 3rd-level cache
    § Often on the same silicon chip
    § But not a requirement
• Nomenclature
  − "Multiprocessor microprocessor"
  − Multicore processor
    § E.g. 4-core CPU (central processing unit)
    § Executes 4 different instruction streams simultaneously

Transition to Multicore

[Graph: sequential application performance over time levels off, motivating the transition to multicore.]

Multiprocessor Execution Model

• Shared memory
  − Each "core" has access to the entire memory in the processor
  − Special hardware keeps caches consistent
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o "Slow" memory shared by many "customers" (cores)
      o May become a bottleneck (Amdahl's Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of a single task between several cores
    § E.g. each performs part of a large matrix multiplication

Parallel Processing

• It's difficult!
• It's inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g. smartphones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.
    § motion processor in iPhone
    § GPU (graphics processing unit)
• Warehouse-scale computers
  − multiple "nodes"
    § "boxes" with several CPUs and disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits / Core   Cores * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128                  256                     4
2005     4          128                  512                     8
2007     6          128                  768                    12
2009     8          128                 1024                    16
2011    10          256                 2560                    40
2013    12          256                 3072                    48
2015    14          512                 7168                   112
2017    16          512                 8192                   128
2019    18         1024                18432                   288
2021    20         1024                20480                   320

MIMD (+2 cores / 2 years): 2.5X over 12 years
SIMD (2X / 4 years): 8X over 12 years
MIMD & SIMD combined: 20X over 12 years
20x in 12 years: 20^(1/12) = 1.28x → 28% per year, or 2x every 3 years!
IF (!) we can use it

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Programs Running on my Computer

PID TTY TIME CMD
220 ?? 0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ?? 0:10.60 /usr/sbin/distnoted agent
224 ?? 0:09.11 /usr/sbin/cfprefsd agent
229 ?? 0:04.71 /usr/sbin/usernoted
230 ?? 0:02.35 /usr/libexec/nsurlsessiond
232 ?? 0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ?? 0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ?? 0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ?? 0:49.72 /usr/libexec/secinitd
239 ?? 0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ?? 0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ?? 0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ?? 0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ?? 0:00.74 /System/Library/CoreServices/mapspushd
244 ?? 0:00.79 /usr/libexec/fmfd
246 ?? 0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ?? 0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ?? 0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ?? 0:04.81 /usr/libexec/secd
254 ?? 0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ?? 0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ?? 0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ?? 0:03.91 /usr/libexec/nsurlstoraged
274 ?? 0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ?? 0:00.09 /usr/sbin/pboard
283 ?? 0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ?? 0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ?? 0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ?? 0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ?? 0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ?? 0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ?? 0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ?? 0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
… 156 total at this moment

How does my laptop do this? Imagine doing 156 assignments all at the same time!

Threads

• Sequential flow of instructions that performs some task
  − Up to now we just called this a "program"
• Each thread has a
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each processor provides one (or more)
  − hardware threads (or harts) that actively execute instructions
  − Each core executes one "hardware thread"
• Operating system multiplexes multiple
  − software threads onto the available hardware threads
  − all threads except those mapped to hardware threads are waiting

Operating System Threads

Give the illusion of many "simultaneously" active threads

1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g. cache miss, user input, network access)
   b) Timer (e.g. switch the active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   i.  interrupting its execution
   ii. saving its registers and PC to memory
3. Start executing a different software thread by
   i.  loading its previously saved registers into a hardware thread's registers
   ii. jumping to its saved PC

Example: 4 Cores

[Diagram: a thread pool, i.e. the list of threads competing for the processor; the OS maps threads to cores and schedules the logical (software) threads. Cores 1–4 each actively run 1 program at a time.]

Multithreading

• Typical scenario:
  − Active thread encounters a cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run a different thread until the data is available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform the switch in ≪ 1000 cycles
• Can hardware help?
  − Moore's law: transistors are plenty

Hardware-Assisted Software Multithreading

[Diagram: a processor with 1 core but 2 hardware threads — a single control and datapath (ALU) with two copies of the PC (PC0, PC1) and two register sets (Registers0, Registers1), connected to memory, input, and output through the I/O-memory interfaces.]

• Two copies of PC and registers inside the processor hardware
• Looks like two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading: both threads may be active simultaneously

(Note: presented incorrectly in the lecture.)

Multithreading

• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Bernhard's Laptop

$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 3,145,728

• 2 cores
• 4 threads total
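Not from the slides: inside a program, the OpenMP runtime (introduced later in this lecture) can report the same counts. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
    // number of logical processors the OpenMP runtime sees (hw.logicalcpu above)
    printf("logical processors: %d\n", omp_get_num_procs());
    // how many threads a parallel region would use by default
    printf("default max threads: %d\n", omp_get_max_threads());
    return 0;
}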

Example: 6 Cores, 24 Logical Threads

[Diagram: a thread pool, i.e. the list of threads competing for the processor; the OS maps threads to cores and schedules the logical (software) threads. Cores 1–6 each hold Threads 1–4, i.e. 4 logical threads per core (hardware thread).]

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Languages Supporting Parallel Programming

ActorScript      Concurrent Pascal    JoCaml       Orc
Ada              Concurrent ML        Join         Oz
Afnix            Concurrent Haskell   Java         Pict
Alef             Curry                Joule        Reia
Alice            CUDA                 Joyce        SALSA
APL              E                    LabVIEW      Scala
Axum             Eiffel               Limbo        SISAL
Chapel           Erlang               Linda        SR
Cilk             Fortran 90           MultiLisp    Stackless Python
Clean            Go                   Modula-3     SuperPascal
Clojure          Io                   Occam        VHDL
Concurrent C     Janus                occam-π      XC

Which one to pick?

Why so many parallel programming languages?

• Piazza question:
  − Why "intrinsics"?
  − To Intel: fix your #()&$! compiler!
• It's happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o The general problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages

• The number of choices is an indication of
  − No universal solution
    § Needs are very problem-specific
  − E.g.
    § Scientific computing (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it's all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly "easy" to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP

Parallel Loops

• Serial execution:

  for (int i=0; i<100; i++) {
    …
  }

• Parallel execution: the iteration space is split into chunks, one per thread (a hand-written version of this split is sketched after the loops below):

  for (int i=0;  i<25;  i++) { … }
  for (int i=25; i<50;  i++) { … }
  for (int i=50; i<75;  i++) { … }
  for (int i=75; i<100; i++) { … }

Parallel for in OpenMP

  #include <omp.h>

  #pragma omp parallel for
  for (int i=0; i<100; i++) {
    …
  }

OpenMP Example

$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5

The iterations are handed out in chunks: thread 0 ran i = 0–2, thread 1 ran i = 3–5, thread 2 ran i = 6–7, and thread 3 ran i = 8–9.
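The printout above is consistent with a small test program along these lines (a sketch: the file name for.c and the print format come from the slide, the bound of 10 iterations is inferred from the output, the rest is an assumption):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // each iteration reports which thread executed it
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        printf("thread %d, i = %d\n", omp_get_thread_num(), i);
    }
    return 0;
}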

OpenMP

• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler directives: #pragma
  − Runtime library routines: #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g. same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model

• Fork-Join model
• OpenMP programs begin as a single process (master thread)
  − Sequential execution
• When a parallel region is encountered
  − Master thread "forks" into a team of parallel threads
  − Executed simultaneously
  − At the end of the parallel region, the parallel threads "join", leaving only the master thread
• The process repeats for each parallel region
  − Amdahl's law?
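A minimal fork-join sketch (my illustration, not taken from the slides): execution is sequential outside the parallel region, and a team of threads runs inside it.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("sequential: only the master thread runs here\n");
    #pragma omp parallel              // fork: a team of threads executes this block
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 // join: the team finishes, only the master continues
    printf("sequential again: back to the master thread\n");
    return 0;
}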

What Kind of Threads?

• OpenMP threads are operating system (software) threads
• The OS will multiplex the requested OpenMP threads onto the available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on the machine can also use the hardware threads!
• Be "careful" (?) when timing results for Project 4 (a timing sketch follows below)!
  − 5 AM?
  − Job queue?
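For timing, one common approach (a sketch under my own assumptions, not the Project 4 harness) is to wrap the region of interest with the OpenMP wall-clock timer omp_get_wtime():

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start = omp_get_wtime();   // wall-clock time in seconds
    #pragma omp parallel for
    for (long i = 0; i < 100000000; i++) {
        // ... work being timed ...
    }
    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %f seconds\n", elapsed);
    return 0;
}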

Example 2: Computing π

http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π

pi = 3.142425985001

• Resembles π, but not very accurate
• Let's increase num_steps and parallelize
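The code on this slide is only an image in the source. The printed value 3.142425985001 is what the standard midpoint-rule integration of 4/(1+x²) over [0,1] produces with num_steps = 10, so the program was presumably close to this sketch (adapted from the OpenMP tutorial linked above; the details are assumptions):

#include <stdio.h>

static long num_steps = 10;              // few steps, hence the poor accuracy

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;     // midpoint of interval i
        sum += 4.0 / (1.0 + x * x);      // integrand: the integral of 4/(1+x^2) on [0,1] is pi
    }
    double pi = step * sum;
    printf("pi = %.12f\n", pi);          // prints pi = 3.142425985001
    return 0;
}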

Parallelize (1) …

• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …

sum[0]   sum[1]

1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π
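The code is again an image in the source. Based on the sum[0]/sum[1] picture above and the trial run below (4 thread ids, iterations handed out cyclically), it was presumably close to this SPMD sketch with per-thread partial sums combined sequentially at the end; the constants and print statements are my assumptions:

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

static long num_steps = 10;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        sum[id] = 0.0;
        // cyclic distribution of iterations: i = id, id + nthreads, ...
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
            printf("i = %ld, id = %d\n", i, id);
        }
    }
    double pi = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)   // combine the partial sums sequentially
        pi += sum[t] * step;
    printf("pi = %.12f\n", pi);
    return 0;
}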

Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001

Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

Can We Parallelize Computing sum?

Summation inside the parallel section:
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And the value changes between runs?!
• What's going on?

Always looking for ways to beat Amdahl's Law … (one option, a reduction, is sketched below)
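One standard way to keep the summation inside the parallel section and still get a correct, deterministic result is OpenMP's reduction clause (not shown on this slide; the sketch is my addition): each thread accumulates a private copy of sum, and OpenMP adds the copies together at the join.

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    // reduction(+:sum): private per-thread partial sums, combined at the end of the region
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi = %.12f\n", step * sum);
    return 0;
}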

Your Turn

What are the possible values of *($s0) after executing this code by 2 concurrent threads?

  # *($s0) = 100
  lw   $t0, 0($s0)
  addi $t0, $t0, 1
  sw   $t0, 0($s0)

  Answer   *($s0)
  A        100 or 101
  B        101
  C        101 or 102
  D        100 or 101 or 102
  E        100 or 101 or 102 or 103

Your Turn

What are the possible values of *($s0) after executing this code by 2 concurrent threads?

  # *($s0) = 100
  lw   $t0, 0($s0)
  addi $t0, $t0, 1
  sw   $t0, 0($s0)

  Answer: C (101 or 102)

• 102 if the threads enter the code section sequentially
• 101 if both execute lw before either runs sw
  − one thread sees "stale" data

What's going on?

• The operation is really pi = pi + sum[id]
• What if more than one thread reads the current (same) value of pi, computes the sum, and stores the result back to pi?
• Each processor reads the same intermediate value of pi!
• The result depends on who gets there when
• A "race" → the result is not deterministic

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Synchronization

• Problem:
  − Limit access to a shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution: take turns
  − Only one person gets the microphone & talks at a time
  − Also good practice for classrooms, btw …

Locks

• Computers use locks to control access to shared resources
  − Serves the purpose of the microphone in the example
  − Also referred to as a "semaphore"
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks

  // wait for lock released
  while (lock != 0) ;
  // lock == 0 now (unlocked)

  // set lock
  lock = 1;

  // access shared resource ...  // e.g. pi
  // sequential execution! (Amdahl ...)

  // release lock
  lock = 0;

Lock Synchronization

  Thread 1                     Thread 2

  while (lock != 0) ;          while (lock != 0) ;
  lock = 1;                    lock = 1;
  // critical section          // critical section
  lock = 0;                    lock = 0;

• Thread 2 finds the lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you want, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!

Hardware Synchronization

• Solution:
  − Atomic read/write
  − Read & write in a single instruction
    § No other access permitted between the read and the write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for "linked" read and write
    § the write fails if the memory location has been "tampered" with after the linked read
    § MIPS uses this solution

MIPS Synchronization Instructions

• Load linked: ll $rt, off($rs)
  − Reads the memory location (like lw)
  − Also sets a (hidden) "link bit"
  − The link bit is reset if the memory location (off($rs)) is accessed
• Store conditional: sc $rt, off($rs)
  − Stores off($rs) = $rt (like sw)
  − Sets $rt = 1 (success) if the link bit is set
    § i.e. no (other) process accessed off($rs) since the ll
  − Sets $rt = 0 (failure) otherwise
  − Note: sc clobbers $rt, i.e. changes its value

Lock Synchronization

Broken synchronization:

  while (lock != 0) ;
  lock = 1;
  // critical section
  lock = 0;

Fix (the lock is at location $s1):

  Try:    addiu $t0, $zero, 1     # $t0 = 1 before calling ll: minimize time between ll and sc
          ll    $t1, 0($s1)
          bne   $t1, $zero, Try   # lock already held: spin
          sc    $t0, 0($s1)
          beq   $t0, $zero, Try   # try again if sc failed (another thread executed sc since the above ll)
  Locked:
          # critical section
  Unlock:
          sw    $zero, 0($s1)

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

OpenMP Locks
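The code on this slide is an image in the source. The OpenMP lock runtime routines it refers to are omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock; a minimal usage sketch (the counter example is my own):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int count = 0;
    omp_init_lock(&lock);                // create the lock (initially unlocked)
    #pragma omp parallel
    {
        omp_set_lock(&lock);             // acquire: blocks until the lock is free
        count++;                         // critical section: one thread at a time
        omp_unset_lock(&lock);           // release
    }
    omp_destroy_lock(&lock);
    printf("count = %d\n", count);       // equals the number of threads
    return 0;
}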

Synchronization in OpenMP

• Locks are typically used in libraries of higher-level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − see the online documentation
  − or the tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

OpenMP critical
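This slide's code is also an image. #pragma omp critical marks a block that only one thread may execute at a time; applied to the π example's shared update it might look like this sketch (my reconstruction, following the linked tutorial):

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        double sum = 0.0;                       // private partial sum
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        #pragma omp critical                    // only one thread at a time updates pi
        pi += sum * step;
    }
    printf("pi = %.12f\n", pi);
    return 0;
}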

The Trouble with Locks …

• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs the salt
  − Cook 2 grabs the pepper
  − Cook 1 notices s/he needs pepper
    § it's not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it's not there, so s/he waits
• A not-so-common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

And in Conclusion, …

• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction-level parallelism
    § Implemented in all high-performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread-level parallelism
    § Multicore processors
    § Supported by operating systems (OS)
    § Requires programmer intervention to exploit at the single-program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks
