
CS 61C: Great Ideas in Computer Architecture

Lecture 19: Thread-Level Parallel Processing

Bernhard Boser & Randy Katz

http://inst.eecs.berkeley.edu/~cs61c

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Improving Performance

1. Increase clock rate fs
   − Reached practical maximum for today's technology
   − < 5 GHz for general-purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, "instruction-level parallelism"
3. Perform multiple tasks simultaneously
   − Multiple CPUs, each executing a different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web http requests over different computers
     § E.g. run ppt (view lecture slides) and browser (youtube) simultaneously
4. Do all of the above:
   − High fs, SIMD, multiple parallel tasks

Today's Lecture

New-School Machine Structures (It's a bit more complicated!)

• Parallel Requests
  Assigned to computer, e.g., search "Katz"
• Parallel Threads
  Assigned to core, e.g., lookup, ads
• Parallel Instructions
  > 1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  > 1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages

[Figure: the software/hardware hierarchy, from warehouse-scale computer and smartphone down through computer, cores, memory (cache), and input/output, to instruction unit(s) and functional unit(s) (A0+B0 … A3+B3) and logic gates, with the theme "Harness Parallelism & Achieve High Performance". The core level is the focus of Project 4.]

Parallel Computer Architectures

• Several separate computers, with some means for communication (e.g. Ethernet)
• Massive array of computers, fast communication between processors
• Multi-core CPU: > 1 datapath in a single chip; cores share the L3 cache, memory, and peripherals. Example: the Hive machines
• GPU: "graphics processing unit"

Example: CPU with 2 Cores

[Diagram: two processor "cores", each with its own control and datapath (PC, registers, ALU). Both cores issue memory accesses (processor 0 and processor 1 memory accesses) to a shared memory (bytes) and to input/output through the I/O-memory interfaces.]

Multiprocessor Execution Model

• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest-level caches (e.g. 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often the 3rd-level cache
    § Often on the same silicon chip
    § But not a requirement
• Nomenclature
  − "Multiprocessor microprocessor"
  − Multicore processor
    § E.g. 4-core CPU (central processing unit)
    § Executes 4 different instruction streams simultaneously

Transition to Multicore

[Graph: sequential application performance over time levels off, motivating the transition to multicore.]

Multiprocessor Execution Model

• Shared memory
  − Each "core" has access to the entire memory in the processor
  − Special hardware keeps caches consistent
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o "Slow" memory shared by many "customers" (cores)
      o May become a bottleneck (Amdahl's Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of a single task between several cores
    § E.g. each performs part of a large matrix multiplication

Parallel Processing

• It's difficult!
• It's inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g. smartphones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.
    § motion processor in iPhone
    § GPU (graphics processing unit)
• Warehouse-scale computers
  − multiple "nodes"
    § "boxes" with several CPUs and disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits / Core   Cores * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128                  256                     4
2005     4          128                  512                     8
2007     6          128                  768                    12
2009     8          128                 1024                    16
2011    10          256                 2560                    40
2013    12          256                 3072                    48
2015    14          512                 7168                   112
2017    16          512                 8192                   128
2019    18         1024                18432                   288
2021    20         1024                20480                   320

MIMD (+2 cores / 2 years): 2.5X over 12 years
SIMD (2X / 4 years): 8X over 12 years
MIMD & SIMD combined: 20X over 12 years
20x in 12 years: 20^(1/12) = 1.28x → 28% per year, or 2x every 3 years!
IF (!) we can use it

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Programs Running on my Computer

PID TTY TIME CMD
220 ?? 0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ?? 0:10.60 /usr/sbin/distnoted agent
224 ?? 0:09.11 /usr/sbin/cfprefsd agent
229 ?? 0:04.71 /usr/sbin/usernoted
230 ?? 0:02.35 /usr/libexec/nsurlsessiond
232 ?? 0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ?? 0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ?? 0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ?? 0:49.72 /usr/libexec/secinitd
239 ?? 0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ?? 0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ?? 0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ?? 0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ?? 0:00.74 /System/Library/CoreServices/mapspushd
244 ?? 0:00.79 /usr/libexec/fmfd
246 ?? 0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ?? 0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ?? 0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ?? 0:04.81 /usr/libexec/secd
254 ?? 0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ?? 0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ?? 0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ?? 0:03.91 /usr/libexec/nsurlstoraged
274 ?? 0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ?? 0:00.09 /usr/sbin/pboard
283 ?? 0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ?? 0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ?? 0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ?? 0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ?? 0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ?? 0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ?? 0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ?? 0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer
… 156 total at this moment

How does my laptop do this? Imagine doing 156 assignments all at the same time!

Threads

• Sequential flow of instructions that performs some task
  − Up to now we just called this a "program"
• Each thread has a
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each processor provides one (or more)
  − hardware threads (or harts) that actively execute instructions
  − Each core executes one "hardware thread"
• Operating system multiplexes multiple
  − software threads onto the available hardware threads
  − all threads except those mapped to hardware threads are waiting

Operating System Threads

Give the illusion of many "simultaneously" active threads

1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g. cache miss, user input, network access)
   b) Timer (e.g. switch the active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   i.  interrupting its execution
   ii. saving its registers and PC to memory
3. Start executing a different software thread by
   i.  loading its previously saved registers into a hardware thread's registers
   ii. jumping to its saved PC

Example: 4 Cores

[Diagram: a thread pool, i.e. the list of threads competing for the processor; the OS maps threads to cores and schedules the logical (software) threads. Cores 1–4 each actively run 1 program at a time.]

Multithreading

• Typical scenario:
  − Active thread encounters a cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run a different thread until the data is available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform the switch in ≪ 1000 cycles
• Can hardware help?
  − Moore's law: transistors are plenty

Hardware-Assisted Software Multithreading

[Diagram: a processor with 1 core but 2 hardware threads — a single control and datapath (ALU) with two copies of the PC (PC0, PC1) and two register sets (Registers0, Registers1), connected to memory, input, and output through the I/O-memory interfaces.]

• Two copies of PC and registers inside the processor hardware
• Looks like two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading: both threads may be active simultaneously

(Note: presented incorrectly in the lecture.)

Multithreading

• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Bernhard's Laptop

$ sysctl -a | grep hw
hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 3,145,728

• 2 cores
• 4 threads total
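Not from the slides: inside a program, the OpenMP runtime (introduced later in this lecture) can report the same counts. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
    // number of logical processors the OpenMP runtime sees (hw.logicalcpu above)
    printf("logical processors: %d\n", omp_get_num_procs());
    // how many threads a parallel region would use by default
    printf("default max threads: %d\n", omp_get_max_threads());
    return 0;
}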

Example: 6 Cores, 24 Logical Threads

[Diagram: a thread pool, i.e. the list of threads competing for the processor; the OS maps threads to cores and schedules the logical (software) threads. Cores 1–6 each hold Threads 1–4, i.e. 4 logical threads per core (hardware thread).]

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Languages Supporting Parallel Programming

ActorScript      Concurrent Pascal    JoCaml       Orc
Ada              Concurrent ML        Join         Oz
Afnix            Concurrent Haskell   Java         Pict
Alef             Curry                Joule        Reia
Alice            CUDA                 Joyce        SALSA
APL              E                    LabVIEW      Scala
Axum             Eiffel               Limbo        SISAL
Chapel           Erlang               Linda        SR
Cilk             Fortran 90           MultiLisp    Stackless Python
Clean            Go                   Modula-3     SuperPascal
Clojure          Io                   Occam        VHDL
Concurrent C     Janus                occam-π      XC

Which one to pick?

Why so many parallel programming languages?

• Piazza question:
  − Why "intrinsics"?
  − To Intel: fix your #()&$! compiler!
• It's happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o The general problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages

• The number of choices is an indication of
  − No universal solution
    § Needs are very problem-specific
  − E.g.
    § Scientific computing (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it's all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly "easy" to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP

Parallel Loops

• Serial execution:

  for (int i=0; i<100; i++) {
    …
  }

• Parallel execution: the iteration space is split into chunks, one per thread (a hand-written version of this split is sketched after the loops below):

  for (int i=0;  i<25;  i++) { … }
  for (int i=25; i<50;  i++) { … }
  for (int i=50; i<75;  i++) { … }
  for (int i=75; i<100; i++) { … }

Parallel for in OpenMP

  #include <omp.h>

  #pragma omp parallel for
  for (int i=0; i<100; i++) {
    …
  }

OpenMP Example

$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5

The iterations are handed out in chunks: thread 0 ran i = 0–2, thread 1 ran i = 3–5, thread 2 ran i = 6–7, and thread 3 ran i = 8–9.
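The printout above is consistent with a small test program along these lines (a sketch: the file name for.c and the print format come from the slide, the bound of 10 iterations is inferred from the output, the rest is an assumption):

#include <stdio.h>
#include <omp.h>

int main(void) {
    // each iteration reports which thread executed it
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
        printf("thread %d, i = %d\n", omp_get_thread_num(), i);
    }
    return 0;
}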

OpenMP

• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler directives: #pragma
  − Runtime library routines: #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g. same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model

• Fork-Join model
• OpenMP programs begin as a single process (master thread)
  − Sequential execution
• When a parallel region is encountered
  − Master thread "forks" into a team of parallel threads
  − Executed simultaneously
  − At the end of the parallel region, the parallel threads "join", leaving only the master thread
• The process repeats for each parallel region
  − Amdahl's law?
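A minimal fork-join sketch (my illustration, not taken from the slides): execution is sequential outside the parallel region, and a team of threads runs inside it.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("sequential: only the master thread runs here\n");
    #pragma omp parallel              // fork: a team of threads executes this block
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 // join: the team finishes, only the master continues
    printf("sequential again: back to the master thread\n");
    return 0;
}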

What Kind of Threads?

• OpenMP threads are operating system (software) threads
• The OS will multiplex the requested OpenMP threads onto the available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on the machine can also use the hardware threads!
• Be "careful" (?) when timing results for Project 4 (a timing sketch follows below)!
  − 5 AM?
  − Job queue?
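For timing, one common approach (a sketch under my own assumptions, not the Project 4 harness) is to wrap the region of interest with the OpenMP wall-clock timer omp_get_wtime():

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start = omp_get_wtime();   // wall-clock time in seconds
    #pragma omp parallel for
    for (long i = 0; i < 100000000; i++) {
        // ... work being timed ...
    }
    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %f seconds\n", elapsed);
    return 0;
}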

Example 2: Computing π

http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π

pi = 3.142425985001

• Resembles π, but not very accurate
• Let's increase num_steps and parallelize
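The code on this slide is only an image in the source. The printed value 3.142425985001 is what the standard midpoint-rule integration of 4/(1+x²) over [0,1] produces with num_steps = 10, so the program was presumably close to this sketch (adapted from the OpenMP tutorial linked above; the details are assumptions):

#include <stdio.h>

static long num_steps = 10;              // few steps, hence the poor accuracy

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;     // midpoint of interval i
        sum += 4.0 / (1.0 + x * x);      // integrand: the integral of 4/(1+x^2) on [0,1] is pi
    }
    double pi = step * sum;
    printf("pi = %.12f\n", pi);          // prints pi = 3.142425985001
    return 0;
}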

Parallelize (1) …

• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …

sum[0]   sum[1]

1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π
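The code is again an image in the source. Based on the sum[0]/sum[1] picture above and the trial run below (4 thread ids, iterations handed out cyclically), it was presumably close to this SPMD sketch with per-thread partial sums combined sequentially at the end; the constants and print statements are my assumptions:

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

static long num_steps = 10;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        sum[id] = 0.0;
        // cyclic distribution of iterations: i = id, id + nthreads, ...
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
            printf("i = %ld, id = %d\n", i, id);
        }
    }
    double pi = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)   // combine the partial sums sequentially
        pi += sum[t] * step;
    printf("pi = %.12f\n", pi);
    return 0;
}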

Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001

Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

Can We Parallelize Computing sum?

Summation inside the parallel section:
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And the value changes between runs?!
• What's going on?

Always looking for ways to beat Amdahl's Law … (one option, a reduction, is sketched below)
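One standard way to keep the summation inside the parallel section and still get a correct, deterministic result is OpenMP's reduction clause (not shown on this slide; the sketch is my addition): each thread accumulates a private copy of sum, and OpenMP adds the copies together at the join.

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    // reduction(+:sum): private per-thread partial sums, combined at the end of the region
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi = %.12f\n", step * sum);
    return 0;
}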

Your Turn

What are the possible values of *($s0) after executing this code by 2 concurrent threads?

  # *($s0) = 100
  lw   $t0, 0($s0)
  addi $t0, $t0, 1
  sw   $t0, 0($s0)

  Answer   *($s0)
  A        100 or 101
  B        101
  C        101 or 102
  D        100 or 101 or 102
  E        100 or 101 or 102 or 103

Your Turn

What are the possible values of *($s0) after executing this code by 2 concurrent threads?

  # *($s0) = 100
  lw   $t0, 0($s0)
  addi $t0, $t0, 1
  sw   $t0, 0($s0)

  Answer: C (101 or 102)

• 102 if the threads enter the code section sequentially
• 101 if both execute lw before either runs sw
  − one thread sees "stale" data

What's going on?

• The operation is really pi = pi + sum[id]
• What if more than one thread reads the current (same) value of pi, computes the sum, and stores the result back to pi?
• Each processor reads the same intermediate value of pi!
• The result depends on who gets there when
• A "race" → the result is not deterministic

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

Synchronization

• Problem:
  − Limit access to a shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution: take turns
  − Only one person gets the microphone & talks at a time
  − Also good practice for classrooms, btw …

Locks

• Computers use locks to control access to shared resources
  − Serves the purpose of the microphone in the example
  − Also referred to as a "semaphore"
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks

  // wait for lock released
  while (lock != 0) ;
  // lock == 0 now (unlocked)

  // set lock
  lock = 1;

  // access shared resource ...  // e.g. pi
  // sequential execution! (Amdahl ...)

  // release lock
  lock = 0;

Lock Synchronization

  Thread 1                     Thread 2

  while (lock != 0) ;          while (lock != 0) ;
  lock = 1;                    lock = 1;
  // critical section          // critical section
  lock = 0;                    lock = 0;

• Thread 2 finds the lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you want, this problem has no solution, not even at the assembly level.
Unless we introduce new instructions, that is!

Hardware Synchronization

• Solution:
  − Atomic read/write
  − Read & write in a single instruction
    § No other access permitted between the read and the write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for "linked" read and write
    § the write fails if the memory location has been "tampered" with after the linked read
    § MIPS uses this solution

MIPS Synchronization Instructions

• Load linked: ll $rt, off($rs)
  − Reads the memory location (like lw)
  − Also sets a (hidden) "link bit"
  − The link bit is reset if the memory location (off($rs)) is accessed
• Store conditional: sc $rt, off($rs)
  − Stores off($rs) = $rt (like sw)
  − Sets $rt = 1 (success) if the link bit is set
    § i.e. no (other) process accessed off($rs) since the ll
  − Sets $rt = 0 (failure) otherwise
  − Note: sc clobbers $rt, i.e. changes its value

Lock Synchronization

Broken synchronization:

  while (lock != 0) ;
  lock = 1;
  // critical section
  lock = 0;

Fix (the lock is at location $s1):

  Try:    addiu $t0, $zero, 1     # $t0 = 1 before calling ll: minimize time between ll and sc
          ll    $t1, 0($s1)
          bne   $t1, $zero, Try   # lock already held: spin
          sc    $t0, 0($s1)
          beq   $t0, $zero, Try   # try again if sc failed (another thread executed sc since the above ll)
  Locked:
          # critical section
  Unlock:
          sw    $zero, 0($s1)

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

OpenMP Locks
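The code on this slide is an image in the source. The OpenMP lock runtime routines it refers to are omp_init_lock, omp_set_lock, omp_unset_lock, and omp_destroy_lock; a minimal usage sketch (the counter example is my own):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int count = 0;
    omp_init_lock(&lock);                // create the lock (initially unlocked)
    #pragma omp parallel
    {
        omp_set_lock(&lock);             // acquire: blocks until the lock is free
        count++;                         // critical section: one thread at a time
        omp_unset_lock(&lock);           // release
    }
    omp_destroy_lock(&lock);
    printf("count = %d\n", count);       // equals the number of threads
    return 0;
}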

Synchronization in OpenMP

• Locks are typically used in libraries of higher-level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − see the online documentation
  − or the tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

OpenMP critical
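This slide's code is also an image. #pragma omp critical marks a block that only one thread may execute at a time; applied to the π example's shared update it might look like this sketch (my reconstruction, following the linked tutorial):

#include <stdio.h>
#include <omp.h>

static long num_steps = 1000000;

int main(void) {
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        double sum = 0.0;                       // private partial sum
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        #pragma omp critical                    // only one thread at a time updates pi
        pi += sum * step;
    }
    printf("pi = %.12f\n", pi);
    return 0;
}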

The Trouble with Locks …

• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs the salt
  − Cook 2 grabs the pepper
  − Cook 1 notices s/he needs pepper
    § it's not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it's not there, so s/he waits
• A not-so-common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion …

And in Conclusion, …

• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction-level parallelism
    § Implemented in all high-performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread-level parallelism
    § Multicore processors
    § Supported by operating systems (OS)
    § Requires programmer intervention to exploit at the single-program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks
