

Communication Architectures for Parallel-Programming Systems

Raoul A.F. Bhoedjang


Advanced School for Computing and Imaging

This work was carried out in graduate school ASCI. ASCI dissertation series number 52.

Copyright © 2000 Raoul A.F. Bhoedjang.


VRIJE UNIVERSITEIT

Communication Architectures for Parallel-Programming Systems

ACADEMIC DISSERTATION

to obtain the degree of doctor at the Vrije Universiteit in Amsterdam, by authority of the rector magnificus prof.dr. T. Sminia, to be defended in public before the doctoral committee of the Faculty of Exact Sciences / Mathematics and Computer Science on Tuesday, 6 June 2000, at 15.45, in the main building of the university, De Boelelaan 1105

by

Raoul Angelo Felicien Bhoedjang

born in 's-Gravenhage


Promotoren: prof.dr.ir. H.E. Bal and prof.dr. A.S. Tanenbaum


Preface

Personal Acknowledgements

First and foremost, I am grateful to my thesis supervisors, Henri Bal and Andy Tanenbaum, for creating an outstanding research environment. I thank them both for their confidence in me and my work.

I am very grateful to all members, past and present, of Henri's Orca team: Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Thilo Kielmann, Koen Langendoen, Jason Maassen, Rob van Nieuwpoort, Aske Plaat, John Romein, Tim Ruhl, Ronald Veldema, Kees Verstoep, and Greg Wilson.

Henri does a great job running this team and working with him has been a great pleasure. His ability to get a job done, his punctuality, and his infinite reading capacity are greatly appreciated — by me, but by many others as well. Above all, Henri is a very nice person.

Tim Ruhl has been a wonderful friend and colleague throughout my years at the VU and I owe him greatly. We have spent countless hours hacking away at various projects, meanwhile discussing research and personal matters. Only a few of our pet projects made it into papers, but the papers that we worked on together are the best that I have (co-)authored.

I would like to thank John Romein for his work on Multigame (see Chapter 8), for teaching me about parallel game-tree searching, for his expert advice on computer hardware, for keeping the DAS cluster in good shape, and for generously donating peanuts when I needed them.

Koen Langendoen is the living proof of the rather disturbing observation that there is no significant correlation between a programmer's coding style and the correctness of his code. Koen is a great hacker and taught me a thing or two about persistence. I want to thank him for his work on Orca, Panda, a few other projects, and for his work as a member of my thesis committee.

The Orca group is blessed with the support of three highly competent research programmers: Rutger Hofman, Ceriel Jacobs, and Kees Verstoep. All three have contributed directly to the work described in this thesis.

I wish to thank Rutger for his work on Panda 4.0, for reviewing papers, for many useful discussions about communication protocols, and for lots of good espresso. I also apologize for all the boring microbenchmarks he performed so willingly.

There are computer programmers and there are artists. Ceriel, no doubt, belongs to the artists. I want to thank him for his fine work on Orca and Panda and for reviewing several of my papers.

Kees's work has been instrumental in completing the comparison of LFC implementations described in Chapter 8. During my typing diet, he performed a tremendous amount of work on these implementations. Not only did he write three of the five LFC implementations described in this thesis, but he also put many hours into measuring and analyzing the performance of various applications. Meanwhile, he debugged Myrinet hardware and kept the DAS cluster alive. Finally, Kees reviewed some of my papers and provided many useful comments on this thesis. It has been a great pleasure to work with him.

With a 128-node cluster of computers in your backyard, it is easy to forget about your desktop machine ... until you cannot reach your home directory. Fortunately, such moments have been scarce. Many thanks to the department's system administrators for keeping everything up and running. I am especially grateful to Ruud Wiggers, for quota of various kinds and great service.

I wish to thank Matty Huntjens for starting my graduate career by recommending me to Henri Bal. I greatly appreciate the way he works with the department's undergraduate students; the department is lucky to have such people.

I ate many dinners at the University's 'restaurant.' I will refrain from commenting on the food, but I did enjoy the company of Rutger Hofman, Philip Homburg, Ceriel Jacobs, and others.

My thesis committee — prof. dr. Zwaenepoel, prof. dr. Kaashoek, dr. Langendoen, dr. Epema, and dr. van Steen — went through some hard times; my move to Cornell rather reduced my pages-per-day rate. I would like to thank them all for reading my thesis and for their suggestions for improvement. I am particularly grateful to Frans Kaashoek whose comments and suggestions considerably improved the quality of this thesis.

I would also like to thank Frans for his hospitality during some of my visits to the US. I especially enjoyed my two-month visit to his research group at MIT. On that same occasion, Robbert van Renesse invited me to Cornell, my current employer. He, too, has been most hospitable on many occasions, and I would like to thank him for it.

The Netherlands Organization for Scientific Research (N.W.O.) supported part of my work financially through Henri Bal's Pionier grant.

Many of the important things I know, my parents taught me. I am sure my father would have loved to attend my thesis defense. My mother has been a great example of strength in difficult times.



Finally, I wish to thank Daphne for sharing her life with me. Most statements concerning two theses in one household are true. I am glad we survived.

Per-Chapter Contributions

The following paragraphs summarize the contributions made by others to the papers and software that each chapter of this thesis is based on. I would like to thank all for the time and effort they have invested.

Chapter 2's survey is based on an article co-authored by Tim Ruhl and Henri Bal [18].

Tim Ruhl and I designed and implemented the LFC communication system described in Chapters 3, 4, and 5. John Romein ported LFC to Linux. LFC is described in two papers co-authored by Tim Ruhl and Henri Bal; it is also one of the systems classified in the survey mentioned above.

Panda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl, and me. Rutger Hofman implemented Panda 4.0 on LFC from scratch, except for the threads package, OpenThreads, which was developed by Koen Langendoen. Earlier versions of Panda are described in two conference publications [19, 124]. The comparison of upcall models is based on an article by Koen Langendoen, me, and Henri Bal [88]. Panda's dynamic switching between polling and interrupts is described in a conference paper by Koen Langendoen, John Romein, me, and Henri Bal [89].

Chapter 7 discusses the implementation of four parallel-programming systems: Orca, Manta, CRL, and MPICH. Orca and Manta were both developed in the computer systems group at the Vrije Universiteit; CRL and MPICH were developed elsewhere.

Koen Langendoen and I implemented the continuation-based version of the Orca runtime system on top of Panda 2.0. The use of continuations in the runtime system is described in a paper co-authored by Koen Langendoen [14]. Ceriel Jacobs wrote the Orca compiler and ported the runtime system to Panda 3.0. Rutger Hofman ported the continuation-based runtime system to Panda 4.0. An extensive description and performance evaluation of the Orca system on Panda 3.0 are given in a journal paper by Henri Bal, me, Rutger Hofman, Ceriel Jacobs, Koen Langendoen, Tim Ruhl, and Frans Kaashoek [9].

Manta was developed by Jason Maassen, Rob van Nieuwpoort, and Ronald Veldema. Their work is described in a conference paper [99]. Rob helped me with the Manta performance measurements presented in Section 7.2.

CRL was developed at MIT by Kirk Johnson [73, 74]; I ported CRL to LFC. It could not have been much easier, because CRL is a really nice piece of software.

MPICH [61] is being developed jointly by Argonne National Laboratories and Mississippi State University. Rutger Hofman ported MPICH to Panda 4.0.

The comparison of different LFC implementations described in Chapter 8 was done in cooperation with Kees Verstoep, Tim Ruhl, and Rutger Hofman. Tim and I implemented the NrxImc and HrxHmc versions of LFC. Kees implemented IrxImc, IrxHmc, and NrxHmc and performed many of the measurements presented in Chapter 8. Rutger helped out with various Panda-related issues.

The parallel applications analyzed in Chapter 8 were developed by many persons. The Puzzle application and the parallel-programming system that it runs on, Multigame [122], were written by John Romein. Awari was developed by Victor Allis, Henri Bal, Hans Staalman, and Elwin de Waal. The linear equation solver (LEQ) was developed by Koen Langendoen. The MPI applications were written by Rutger Hofman (QR and ASP) and Thilo Kielmann (SOR). The CRL applications Barnes-Hut (Barnes), Fast Fourier Transform (FFT), and radix sort (radix) are all based on shared-memory programs from the SPLASH-2 suite [153]. Barnes-Hut was adapted for CRL by Kirk Johnson, FFT by Aske Plaat, and radix by me.


Contents

1 Introduction
  1.1 Parallel-Programming Systems
  1.2 Communication architectures for parallel-programming systems
  1.3 Problems
    1.3.1 Data-Transfer Mismatches
    1.3.2 Control-Transfer Mismatches
    1.3.3 Design Alternatives for User-Level Communication Architectures
  1.4 Contributions
  1.5 Implementation
    1.5.1 LFC Implementations
    1.5.2 Panda
    1.5.3 Parallel-Programming Systems
  1.6 Experimental Environment
    1.6.1 Cluster Position
    1.6.2 Hardware Details
  1.7 Thesis Outline

2 Network Interface Protocols
  2.1 A Basic Network Interface Protocol
  2.2 Data Transfers
    2.2.1 From Host to Network Interface
    2.2.2 From Network Interface to Network Interface
    2.2.3 From Network Interface to Host
  2.3 Address Translation
  2.4 Protection
  2.5 Control Transfers
  2.6 Reliability
  2.7 Multicast
  2.8 Classification of User-Level Communication Systems
  2.9 Summary


3 The LFC User-Level Communication System
  3.1 Programming Interface
    3.1.1 Addressing
    3.1.2 Packets
    3.1.3 Sending Packets
    3.1.4 Receiving Packets
    3.1.5 Synchronization
    3.1.6 Statistics
  3.2 Key Implementation Assumptions
  3.3 Implementation Overview
    3.3.1 Myrinet
    3.3.2 Operating System Extensions
    3.3.3 Library
    3.3.4 NI Control Program
  3.4 Packet Anatomy
  3.5 Data Transfer
  3.6 Host Receive Buffer Management
  3.7 Message Detection and Handler Invocation
  3.8 Fetch-and-Add
  3.9 Limitations
    3.9.1 Functional Limitations
    3.9.2 Security Violations
  3.10 Related Work
  3.11 Summary

4 Core Protocols for Intelligent Network Interfaces
  4.1 UCAST: Reliable Point-to-Point Communication
    4.1.1 Protocol Data Structures
    4.1.2 Sender-Side Protocol
    4.1.3 Receiver-Side Protocol
  4.2 MCAST: Reliable Multicast Communication
    4.2.1 Multicast Forwarding
    4.2.2 Multicast Tree Topology
    4.2.3 Protocol Data Structures
    4.2.4 Sender-Side Protocol
    4.2.5 Receiver-Side Protocol
    4.2.6 Deadlock
  4.3 RECOV: Deadlock Detection and Recovery
    4.3.1 Deadlock Detection
    4.3.2 Deadlock Recovery
    4.3.3 Protocol Data Structures
    4.3.4 Sender-Side Protocol
    4.3.5 Receiver-Side Protocol
  4.4 INTR: Control Transfer
  4.5 Related Work
    4.5.1 Reliability
    4.5.2 Interrupt Management
    4.5.3 Multicast
    4.5.4 FM/MC
  4.6 Summary

5 The Performance of LFC
  5.1 Unicast Performance
    5.1.1 Latency and Throughput
    5.1.2 LogGP Parameters
    5.1.3 Interrupt Overhead
  5.2 Fetch-and-Add Performance
  5.3 Multicast Performance
    5.3.1 Performance of the Basic Multicast Protocol
    5.3.2 Impact of Deadlock Recovery and Multicast Tree Shape
  5.4 Summary

6 Panda
  6.1 Overview
    6.1.1 Functionality
    6.1.2 Structure
    6.1.3 Panda's Main Abstractions
  6.2 Integrating Multithreading and Communication
    6.2.1 Multithread-Safe Access to LFC
    6.2.2 Transparently Switching between Interrupts and Polling
    6.2.3 An Efficient Upcall Implementation
  6.3 Stream Messages
  6.4 Totally-Ordered Broadcasting
  6.5 Performance
    6.5.1 Performance of the Message-Passing Module
    6.5.2 Performance of the Broadcast Module
    6.5.3 Other Performance Issues
  6.6 Related Work
    6.6.1 Portable Message-Passing Libraries
    6.6.2 Polling and Interrupts
    6.6.3 Upcall Models
    6.6.4 Stream Messages
    6.6.5 Totally-Ordered Broadcast
  6.7 Summary

7 Parallel-Programming Systems
  7.1 Orca
    7.1.1 Programming Model
    7.1.2 Implementation Overview
    7.1.3 Efficient Operation Transfer
    7.1.4 Efficient Implementation of Guarded Operations
    7.1.5 Performance
  7.2 Manta
    7.2.1 Programming Model
    7.2.2 Implementation
    7.2.3 Performance
  7.3 The C Region Library
    7.3.1 Programming Model
    7.3.2 Implementation
    7.3.3 Performance
  7.4 MPI — The Message Passing Interface
    7.4.1 Implementation
    7.4.2 Performance
  7.5 Related Work
    7.5.1 Operation Transfer
    7.5.2 Operation Execution
  7.6 Summary

8 Multilevel Performance Evaluation
  8.1 Implementation Overview
    8.1.1 The Implementations
    8.1.2 Commonalities
  8.2 Reliable Point-to-Point Communication
    8.2.1 The No-Retransmission Protocol (Nrx)
    8.2.2 The Retransmitting Protocols (Hrx and Irx)
    8.2.3 Latency Measurements
    8.2.4 Window Size and Receive Buffer Space
    8.2.5 Send Buffer Space
    8.2.6 Protocol Comparison
  8.3 Reliable Multicast
    8.3.1 Host-Level versus Interface-Level Packet Forwarding
    8.3.2 Acknowledgement Schemes
    8.3.3 Deadlock Issues
    8.3.4 Latency and Throughput Measurements
  8.4 Parallel-Programming Systems
    8.4.1 Communication Style
    8.4.2 Performance Issues
  8.5 Application Performance
    8.5.1 Performance Results
    8.5.2 All-pairs Shortest Paths (ASP)
    8.5.3 Awari
    8.5.4 Barnes-Hut
    8.5.5 Fast Fourier Transform (FFT)
    8.5.6 The Linear Equation Solver (LEQ)
    8.5.7 Puzzle
    8.5.8 QR Factorization
    8.5.9 Radix Sort
    8.5.10 Successive Overrelaxation (SOR)
  8.6 Classification
  8.7 Related Work
    8.7.1 Reliability
    8.7.2 Multicast
    8.7.3 NI Protocol Studies
  8.8 Summary

9 Summary and Conclusions
  9.1 LFC and NI Protocols
  9.2 Panda
  9.3 Parallel-Programming Systems
  9.4 Performance Impact of Design Decisions
  9.5 Conclusions
  9.6 Limitations and Open Issues

A Deadlock Issues
  A.1 Deadlock-Free Multicasting in LFC
  A.2 Overlapping Multicast Groups

B Abbreviations

Samenvatting (Summary in Dutch)

Curriculum Vitae


List of Figures

1.1 Communication layers studied in this thesis.
1.2 Structure of this thesis.
1.3 Myrinet switch topology.
1.4 Switches and cluster nodes.
1.5 Cluster node architecture.

2.1 Operation of the basic protocol.
2.2 Host-to-NI throughput using different data transfer mechanisms.
2.3 Design decisions for reliability.
2.4 Repeated send versus a tree-based multicast.

3.1 LFC's packet format.
3.2 LFC's data transfer path.
3.3 LFC's send path and data structures.
3.4 Two ways to transfer the payload and status flag of a network packet.
3.5 LFC's receive data structures.

4.1 Protocol data structures for UCAST.
4.2 Sender side of the UCAST protocol.
4.3 Receiver side of the UCAST protocol.
4.4 Host-level and interface-level multicast forwarding.
4.5 Protocol data structures for MCAST.
4.6 Sender-side protocol for MCAST.
4.7 Receive procedure for the multicast protocol.
4.8 MCAST deadlock scenario.
4.9 A binary tree.
4.10 Subtree covered by the deadlock recovery algorithm.
4.11 Deadlock recovery data structures.
4.12 Multicast forwarding in the deadlock recovery protocol.
4.13 Packet transmission in the deadlock recovery protocol.
4.14 Receive procedure for the deadlock recovery protocol.
4.15 Packet release procedure for the deadlock recovery protocol.


4.16 Timeout handling in INTR.
4.17 LFC's and FM/MC's buffering strategies.

5.1 LFC's unicast latency.
5.2 LFC's unicast throughput.
5.3 Fetch-and-add latencies under contention.
5.4 LFC's broadcast latency.
5.5 LFC's broadcast throughput.
5.6 Tree used to determine multicast gap.
5.7 LFC's broadcast gap.
5.8 Impact of deadlock recovery on broadcast throughput.
5.9 Impact of tree topology on broadcast throughput.
5.10 Impact of deadlock recovery on all-to-all throughput.

6.1 Different broadcast delivery orders.
6.2 Panda modules: functionality and dependencies.
6.3 Two Panda header stacks.
6.4 Recursive invocation of an LFC routine.
6.5 RPC example.
6.6 Simple remote read with active messages.
6.7 Message handler with locking.
6.8 Three different upcall models.
6.9 Message handler with blocking.
6.10 Fast thread switch to an upcall thread.
6.11 Application-to-application streaming with Panda's stream messages.
6.12 Panda's message-header formats.
6.13 Receiver-side organization of Panda's stream messages.
6.14 Panda's message-passing latency.
6.15 Panda's message-passing throughput.
6.16 Panda's broadcast latency.
6.17 Panda's broadcast throughput.

7.1 Definition of an Orca object type.
7.2 Orca process creation.
7.3 The structure of the Orca shared object system.
7.4 Decision process for executing an Orca operation.
7.5 Orca definition of an array of records.
7.6 Specialized marshaling code generated by the Orca compiler.
7.7 Data transfer paths in Orca.
7.8 Orca with popup threads.
7.9 Orca with single-threaded upcalls.


7.10 The continuations interface.
7.11 Roundtrip latencies for Orca, Panda, and LFC.
7.12 Roundtrip throughput for Orca, Panda, and LFC.
7.13 Broadcast latency for Orca, Panda, and LFC.
7.14 Broadcast throughput for Orca, Panda, and LFC.
7.15 Roundtrip latencies for Manta, Orca, Panda, and LFC.
7.16 Roundtrip throughput for Manta, Orca, Panda, and LFC.
7.17 CRL read and write miss transactions.
7.18 Write miss latency.
7.19 Write miss throughput.
7.20 MPI unicast latency.
7.21 MPI unicast throughput.
7.22 MPI broadcast latency.
7.23 MPI broadcast throughput.

8.1 LFC unicast latency.
8.2 Peak throughput for different window sizes.
8.3 LFC unicast throughput.
8.4 Host-level and interface-level multicast forwarding.
8.5 LFC multicast latency.
8.6 LFC multicast throughput.
8.7 Application-level impact of efficient broadcast support.
8.8 Data and packet rates of NrxImc on 64 nodes.
8.9 Normalized application execution times.
8.10 Classification of communication systems for Myrinet.

A.1 A binomial spanning tree.
A.2 Multicast trees for overlapping multicast groups.


List of Tables

1.1 Classification of parallel-programming systems.
1.2 Communication characteristics of parallel-programming systems.

2.1 Throughputs on a Pentium Pro/Myrinet cluster.
2.2 Summary of design choices in existing communication systems.

3.1 LFC's user interface.
3.2 Resources used by the NI control program.
3.3 LFC's packet tags.

5.1 Values of the LogGP parameters for LFC.
5.2 Deadlock recovery statistics.

6.1 Send and receive interfaces of Panda's system module.
6.2 Classification of upcall models.

7.1 A comparison of Orca and Manta.
7.2 PPS performance summary.

8.1 Five versions of LFC's reliability and multicast protocols.
8.2 Division of work in the LFC implementations.
8.3 Parameters used by the protocols.
8.4 LogP parameter values (in microseconds) for a 16-byte message.
8.5 Fetch-and-add latencies.
8.6 Application characteristics and timings.
8.7 Classification of applications.


Chapter 1

Introduction

This thesis is about communication support for parallel-programming systems. A parallel-programming system (PPS) is a system that aids programmers who wish to exploit multiple processors to solve a single, computationally intensive problem. PPSs come in many flavors, depending on the programming paradigm they provide and the type of hardware they run on. This thesis concentrates on PPSs for clusters of computers. By a cluster we mean a collection of off-the-shelf computers interconnected by an off-the-shelf network. Clusters are easy to build, easy to extend, scale to large numbers of processors, and are relatively inexpensive.

The performance of a PPS depends critically on efficient communication mechanisms. At present, user-level communication systems offer the best communication performance. Such systems bypass the operating system on all critical communication paths so that user programs can benefit from the high performance of modern network technology. This thesis concentrates on the interactions between PPSs and user-level communication systems. We focus on three questions:

1. Which mechanisms should a user-level communication system provide?

2. How should parallel-programming systems use the mechanisms provided?

3. How should these mechanisms be implemented?

The relevance of these questions follows from the following observations. First, the functionality provided by user-level communication systems varies considerably (see Chapter 2). Systems differ in the types of data transfer primitives they offer (point-to-point message, multicast message, or remote-memory access), the way incoming data is detected (polling or interrupts), and the way incoming data is handled (explicit receive, upcall, or popup thread).


Second, most user-level communication systems provide low-level interfaces that often do not match the needs of the developers of a PPS. This is not a problem as long as higher-level abstractions can be layered efficiently on top of those interfaces. Unfortunately, many systems ignore this issue (see Section 1.3).

Third, where systems do provide similar functionality, they implement it in different ways (see Chapter 2). Many systems, for example, use a programmable network interface (NI) to execute part of the communication protocol. However, systems differ greatly in the type and amount of work that they off-load to the NI. Some systems minimize the amount of work performed on the NI — on the account that the NI processor is much slower than the host processor — while others run a complete reliability protocol on the NI.

This chapter proceeds as follows. Section 1.1 discusses PPSs in more detail and explains which types of PPSs we address exactly. Section 1.2 discusses user-level architectures and introduces the different layers of communication software we study. Section 1.3 introduces the specific problems that this thesis addresses. Section 1.4 states our contributions towards solving these problems. Section 1.5 discusses the software components in which we have implemented our ideas. Section 1.6 describes the environment in which we developed and evaluated our solutions. Finally, Section 1.7 describes the structure of the thesis.

1.1 Parallel-Programming Systems

The use of parallelism in and between computers is by now the rule rather than the exception. Within a modern processor, we find a pipelined CPU with multiple functional units, nonblocking caches, and write buffers. Surprisingly, this intraprocessor parallelism is largely hidden from the programmer by instruction set architectures that provide a more or less sequential programming model. Where parallelism does become visible (e.g., in VLIW architectures), compilers hide it once more from most application programmers. This way, application programmers have enjoyed advances in computer architecture without having to change the way they write their programs.

Programmers who want to exploit parallelism between processors have not been so fortunate. In specific application domains, or for specific types of programs, compilers can automatically detect and exploit parallelism of a sufficiently large grain to employ multiple computers efficiently. In most cases, however, programmers must guide compilers by means of annotations or write an explicitly parallel program. Unfortunately, writing efficient parallel programs is notoriously difficult. At the algorithmic level, programmers must find a compromise between locality and load balance. To achieve both, nontrivial techniques such as data replication, data migration, work stealing, load balancing, etc., may have to be employed. At a lower level, programmers must deal with race conditions, message interrupts, message ordering, and memory consistency.

To simplify the programmer's task, many parallel-programming systems have been developed, each of which implements its own programming model. The most important programming models are:

- Shared memory

- Message passing

- Data-parallel programming

- Shared objects

In the shared-memory model, multiple threads share an address space and communicate by writing and reading shared-memory locations. Threads synchronize by means of atomic memory operations and higher-level primitives built upon these atomic operations (e.g., locks and condition variables). Multiprocessors directly support this programming model in hardware.

In the message-passing model, processes communicate by exchanging messages. Implementations of this model (e.g., PVM [137] and MPI [53]) usually provide several flavors of send and receive methods which differ, for example, in when the sender may continue and how the receiver selects the next message to receive.
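As a small illustration of this style (an example added here, not taken from the thesis), the following C fragment uses MPI's standard-mode send and a blocking receive; the receiver selects the message by source rank and tag:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Standard-mode send: may block until a matching receive is posted. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The receiver selects the message by source rank (0) and tag (0). */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Other send flavors (buffered, synchronous, nonblocking) change only when the sender may continue, not the basic exchange shown here.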

In the shared-memory and message-passing models, the programmer has to deal with multiple threads or processes. A data-parallel program has only one thread of control and in many ways resembles a sequential program. For efficiency, however, the programmer provides annotations that indicate which loops can be executed in parallel and how data must be distributed.

In the shared-object model [7], processes can share data, provided that this data is encapsulated in an object. The data is accessed by invoking methods of the object's class. While this model is not as widely used in practice as the previous models, it has received a fair amount of attention in the parallel-programming research community.

Parallel-programming models used to be identified with a specific type of parallel hardware. Today, most vendors of parallel hardware support multiple programming models. The shared-memory model, for example, is a more natural match to a multiprocessor than the message-passing model, yet multiprocessor vendors also provide implementations of message-passing standards such as MPI.

At present, two types of hardware dominate both the commercial marketplace and parallel-programming research in academia: multiprocessors and workstation clusters. In a multiprocessor, the hardware implements a single, shared memory that can be accessed by all processors through memory access instructions. In a workstation cluster, each processor has its own memory. Access to another processor's memory is not transparent and can only be achieved by some form of explicit message passing.

Programming model   Multiprocessor implementations   Cluster implementations
Shared memory       Pthreads [110]                   Shasta [125], Tempest [119], TreadMarks [78]
Message passing     MPI [53, 140]                    MPI [53, 61], PVM [137]
Data-parallel       OpenMP [42]                      HPF [81]
Shared objects      Java [60]                        Orca [7, 9], CRL [74], Java/RMI [99]

Table 1.1. Classification of parallel-programming systems.

All four programming models mentioned above have been implemented on both types of hardware (see Table 1.1). Implementations on cluster hardware tend to be more complex than multiprocessor implementations because efficient state sharing is more difficult on a cluster than on a multiprocessor. This thesis addresses only communication support for clusters. On clusters, models such as the shared-object model can be implemented only by software that hides the hardware's distributed nature.

All PPSs discussed in this thesis (except Multigame [122], which we discuss only briefly) require the programmer to write a parallel program. With some systems, however, this is easier than with others. The Orca shared-object system [7, 9], for example, is language-based. Programmers therefore need not write any code to marshal objects or parameters of operations because this code is either generated by the compiler or hidden in the language runtime system. In a library-based system such as MPI, in contrast, programmers must either write their own marshaling code or create type descriptors.

At a higher level, consider locality and load balancing. Distributed shared memory systems like CRL [74] and Orca replicate data transparently to optimize read accesses, thus improving locality. Systems such as Cilk [55] and Multigame use a work-stealing runtime system that automatically balances the load among all processors. An MPI programmer who wishes to replicate a data item must do so explicitly, without any help from the system to keep the replicas consistent. Similarly, MPI programmers have to implement their own load balancing schemes.

The conveniences of high-level programming systems do not come for free. In a recent study [97], for example, Lu et al. show that parallel applications written for the TreadMarks [78] distributed shared-memory system (DSM) send more data and more messages than equivalent message-passing programs. The reasons are that TreadMarks cannot combine data transfer and synchronization in a single message transfer, cannot transfer data from multiple memory pages in a single message, and suffers from false sharing and diff accumulation. (Diff accumulation is specific to the consistency protocol used by TreadMarks. The result of diff accumulation is that processors communicate to obtain old versions of shared data that have already been overwritten by newer versions.)

Orca provides a second example. Although Orca's shared-object programming model is closer to message passing than TreadMarks's shared-memory model, Orca programs also send more messages than is strictly needed. The current implementation of Orca [9], for example, broadcasts operations on a replicated object to all processors, even if the object is replicated on only a few processors. Also, operations on nonreplicated objects are always executed by means of remote procedure calls. Each such operation results in a request and a reply message, even if no reply is needed.

These examples illustrate that, with high-level PPSs, application programmers no longer control the implementation of high-level functions and must rely on generic implementations. This results in increased message rates and increased message sizes. In addition, high-level PPSs add message-processing overheads in the form of marshaling and message-handler dispatch. Consequently, efficient communication support is of prime importance to these systems.

1.2 Communication architectures for parallel-programming systems

To achieve efficient communication, we need both efficient hardware and efficient software. While high-bandwidth, low-latency, and scalable network hardware is still expensive, such hardware is available and can be used readily outside supercomputer cabinets. Examples of such hardware include GigaNet [130], Memory Channel [57], and Myrinet [27]. On these systems, network bandwidth is measured in Gigabits per second and network latency in microseconds. Most of these networks have low bit-error rates. Finally, network interfaces have become more flexible and are sometimes programmable.

In many ways, software is the main obstacle on the road to efficient communication. Software overhead comes in many forms: system calls, copying, interrupts, thread switches, etc. Part of the software problem has been solved by the introduction of user-level communication architectures which give user processes direct access to the network. In most user-level communication systems, users must initialize communication segments and channels by means of system calls. After this initialization phase, however, kernel mediation is no longer needed, or needed only in exceptional cases.

Fig. 1.1. Communication layers studied in this thesis. From top to bottom: parallel applications, the PPS-specific layer (compiler and/or runtime system), communication libraries, and network interface protocols, on top of the network hardware; the three layers between the applications and the hardware form the communication software.

Communication support for a PPS can be divided into three layers of software (see Figure 1.1). At the bottom, above the network hardware, we find network interface protocols. These protocols perform two functions: they control the network device and implement a low-level communication abstraction that is used by all higher layers.

Since NI protocols are often low-level, most (but not all) PPSs use a communication library that implements message abstractions and higher-level communication primitives (e.g., remote procedure call).

The top layer implements a PPS's programming model. In general, this layer consists of a compiler and a runtime system. Not all PPSs are based on a programming language, however, so not all PPSs use a compiler. In principle, PPSs based on a programming language do not need a runtime system: a compiler can generate all code that needs to be executed. In practice, though, PPSs always use some runtime support.

Since communication events frequently cut through all these layers, application-level performance is determined by the way these layers cooperate. In particular, high performance at one layer is of no use if this layer offers an inconvenient interface to higher layers. Our goal in this thesis is to find mechanisms and interfaces that work well at all layers.

1.3 Problems

This thesis addresses three problems:

1. The data transfer methods provided by user-level communication systems often do not match the needs of PPSs.

2. The control transfer methods provided by user-level communication systems do not match the needs of PPSs.


3. Design alternatives for reliable point-to-point and multicast implementations for modern networks are poorly understood.

1.3.1 Data-Transfer Mismatches

Achieving efficient data transfer at all layers in a PPS is hard. A data transfer problem that has received much attention is the high cost of memory copies that take place as data travels from one layer to another. Since redundant memory copies decrease application-to-application throughput (see Chapter 2), much user-level communication work has focused on avoiding unnecessary copies [13, 24, 32, 49]. Most solutions involve the use of DMA transfers between the user's address space and the NI. Unfortunately, these solutions cannot always be used effectively by PPSs.

At the sending side, some systems (e.g., U-Net [147] and Hamlyn [32]) can transfer data efficiently (using DMA) when it is stored in a special send segment in the sender's host memory. For a specific application it may be feasible to store the data that needs to be communicated (in a certain period of time) in a send segment. A PPS, however, would have to do this for all applications. If this is not possible, senders must first copy their data into the send segment. With this extra copy, the advantage of a low-level high-speed data transfer mechanism is lost.

To improve performance at the receiving side, several systems (e.g., Hamlyn and SHRIMP [24]) offer an interface that allows a sender to specify a destination address in a receiver's address space. This allows the communication system to move incoming data directly from the NI to its destination address. A message-passing system, on the other hand, would first copy the data to some network buffer and would later, when the receiver specifies a destination address, copy the data to its final destination. This may appear inefficient, but in many cases the sender simply does not know the destination address of the data that needs to be sent. In such cases, higher layers (e.g., PPSs) are forced to negotiate a destination address before the actual data transfer takes place. Such a negotiation may introduce up to two extra messages per data transfer and is therefore expensive for small data transfers. (Alternatively, the sender can transfer the data to a default buffer with a known address and have the receiver copy the data from that buffer. Such a scheme can be extended so that the receiver can post a destination buffer that replaces the default buffer [49]. If the destination buffer is posted in time, no extra copy is needed.)
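The alternative just sketched can be pictured with a small hypothetical C fragment (added for illustration; it is not the interface of any of the systems named above, and memcpy stands in for the NI-to-host transfer): deliver into a posted buffer when one is available, otherwise stage the data through the default buffer.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical receive-side buffering policy. */
    struct recv_state {
        void  *posted_buf;         /* destination posted by the receiver, or NULL */
        size_t posted_len;
        char   default_buf[4096];  /* default buffer with a well-known address */
        size_t staged_len;         /* bytes currently staged in default_buf */
    };

    void deliver_packet(struct recv_state *rs, const void *data, size_t len)
    {
        if (rs->posted_buf != NULL && len <= rs->posted_len) {
            memcpy(rs->posted_buf, data, len);   /* posted in time: no extra copy */
            rs->posted_buf = NULL;               /* buffer consumed */
        } else {
            memcpy(rs->default_buf, data, len);  /* staged: the receiver must copy
                                                    the data out later */
            rs->staged_len = len;
        }
    }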

The need to avoid copying should be balanced against the cost of managing asynchronous data transfer mechanisms and negotiating destination addresses. Also, not all communication overhead results from a lack of bandwidth. In their study of Split-C applications [102], Martin et al. found that applications are sensitive to send and receive overhead, but tolerate lower bandwidths fairly well.


A data transfer problem that has received less attention, but that is important in many PPSs, is the efficient transfer of data from a single sender to multiple receivers. Most existing user-level communication systems focus on high-speed data transfers from a single sender to a single receiver and do not support multicast or support it poorly. This is unfortunate, because multicast plays an important role in many PPSs.

MPICH [61], a widely used implementation of MPI, for example, uses multicast in its implementation of collective communication operations (e.g., reduce, gather, scatter). Collective communication operations are used in many parallel algorithms. Efficient multicast primitives have also proved their value in the implementation of DSMs such as Orca [9, 75] and Brazos [131], which use multicast to update replicated data.

The lack of efficient multicast primitives in user-level communication systems forces PPSs (or underlying communication libraries) to implement their own multicast on top of point-to-point primitives. This is frequently inefficient. Naive implementations let the sender send a single message to each receiver, which turns the sender into a serial bottleneck. Smarter implementations are based on spanning trees. In these implementations, the sender transmits a message to a few nodes (its children) which then forward the message to their children, and so on until all nodes have been reached. This strategy allows the message to travel between different sender-receiver pairs in parallel.
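A minimal sketch of such host-level spanning-tree forwarding follows. It is illustrative only: the binary-tree shape, the rank renumbering, and send_to() are assumptions made here for clarity, not the multicast protocol developed later in this thesis.

    /* Hypothetical spanning-tree multicast over a point-to-point send primitive.
     * Ranks are assumed to be renumbered so that the multicast source is rank 0
     * and the tree is a binary tree; send_to() is any reliable point-to-point send. */
    extern void send_to(int dest, const void *msg, unsigned len);  /* assumed */

    void multicast_forward(int my_rank, int nprocs, const void *msg, unsigned len)
    {
        int left  = 2 * my_rank + 1;   /* children of my_rank in the binary tree */
        int right = 2 * my_rank + 2;

        if (left < nprocs)
            send_to(left, msg, len);
        if (right < nprocs)
            send_to(right, msg, len);
        /* Transfers along different branches proceed in parallel; roughly
         * log2(nprocs) sequential forwarding steps reach all nodes. */
    }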

Even a tree-based multicast can be inefficient if it is layered on point-to-point primitives. At the forwarding nodes, data travels up from the NI to the host. If the host does not poll, forwarding will either be delayed or the data will be delivered by means of an expensive interrupt. To forward the data, the host has to reinject the data into the network. This data transfer is unnecessary, because the data was already present on the NI.

1.3.2 Control-Transfer Mismatches

Control transfer has two components: detecting incoming messages and executing handlers for incoming messages. Most user-level communication systems allow their clients to poll for incoming messages, because receiving a message through polling is much more efficient than receiving it through an interrupt. The problem for a PPS is to decide when to poll, especially if the processing of incoming messages is not always tied to an explicit receive call by the application. Again, for a specific application this need not be a hard problem, but getting it right for all applications is difficult.
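The following fragment (hypothetical names, added for illustration) shows the polling style; the difficulty described above is that net_poll() must be reached often enough, from the right places, in every application.

    extern void net_poll(void);        /* assumed: drain the network, run handlers */
    extern int  work_remaining(void);  /* assumed: application-specific */
    extern void do_some_work(void);    /* a bounded slice of application computation */

    void compute_loop(void)
    {
        while (work_remaining()) {
            do_some_work();
            net_poll();  /* cheap compared to an interrupt, but only effective if
                            the application reaches this point frequently enough */
        }
    }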

In DSMs, for example, processes do not interact with each other directly, but only through shared data items. When a process accesses a shared data item that is stored remotely, it typically sends an access request to the remote processor. The remote processor must respond to the request, even if none of its processes are accessing the data item. If the remote processor is not polling the network when the request arrives, the reply will be delayed. This can be solved by using interrupts, but these are expensive.

Once a message has been detected, it needs to be handled. An interesting problem is in which execution context messages should be handled. Allocating a thread to each message is conceptually clean, but potentially expensive, both in time and space. Moreover, dispatching messages to multiple threads can cause messages to be processed in a different order than they were received. Sometimes this is necessary; at other times, it should be avoided. Executing application code and message handlers in the same thread gives good performance, but can lead to deadlock when message handlers block.

1.3.3 Design Alternatives for User-Level Communication Architectures

The low-level communication protocols of user-level architectures differ substantially from traditional protocols such as TCP/IP. For example, many protocols now use the NI to implement reliability [37, 49], multicast [16, 56, 146], network mapping [100], performance monitoring [93, 103], and address translations and (remote) memory access [13, 24, 32, 51, 83, 126]. Current user-level communication systems make different design decisions in these areas. Specifically, different systems divide similar protocol tasks between the host and the NI in different ways. In some systems, for example, the NI is responsible for implementing reliable communication. In other systems, this task is left to the host processor.

The impact of different design choices on the performance of PPSs and applications has hardly been investigated. Such an investigation is difficult, because existing architectures implement different functionality and provide different programming interfaces. Moreover, many studies use only low-level microbenchmarks that ignore performance at higher layers [5].

1.4 Contributions

The main contributions of this thesis towards solving the problems presented in the previous section are as follows:

1. We show that a small set of simple, low-level communication mechanisms can be employed effectively to obtain efficient PPS implementations that do not suffer from the data and control-transfer mismatches described in the previous section. Some of these communication mechanisms rely on NI support (see below).

2. We have designed and implemented a new, efficient, and reliable multicast algorithm that uses the NI instead of the host to forward multicast traffic.

3. To gain insight into the tradeoffs involved in choosing between different NI protocol implementations, we have implemented a single user-level communication interface in multiple ways. These implementations differ in whether the host or the NI is responsible for reliability and multicast forwarding. Using these implementations, we have systematically investigated the impact of different design choices on the performance of higher-level systems (one communication library and multiple PPSs) and applications.

The communication mechanisms mentioned have been implemented in a new, user-level communication system called LFC. LFC and its mechanisms are summarized in Section 1.5 and described in detail in Chapters 3 to 5. On top of LFC we have implemented a communication library, Panda, that

- provides an efficient (stream) message abstraction

- transparently switches between using polling and interrupts to detect incoming messages

- implements an efficient totally-ordered broadcast primitive

Using LFC and Panda we have implemented and ported various PPSs (see Section 1.5). We describe how we modified some of these PPSs to benefit from LFC's and Panda's communication support.

1.5 Implementation

We have implemented our solutions in various communication systems (see Figure 1.2). These systems cover all communication layers shown in Figure 1.1: NI protocol, communication library, and PPS. The following section introduces the individual systems, layer by layer, bottom-up.

1.5.1 LFC Implementations

We have developed a new NI protocol called LFC [17, 18]. LFC provides reliable point-to-point and multicast communication and runs on Myrinet [27], a modern, switched, high-speed network.


Fig. 1.2. Structure of this thesis: parallel applications (Chapter 8) on top of the parallel-programming systems Orca, MPI, CRL, and Parallel Java (Chapter 7) and Multigame (Chapter 8), which are layered over Panda (Chapter 6) and LFC (Chapters 3-5), running on Myrinet (Chapter 1).

LFC has a low-level, packet-based interface. By exposing network packets, LFC allows its clients to avoid most redundant memory copies. Packets can be received through polling or interrupts and clients can dynamically switch between the two. LFC delivers incoming packets by means of upcalls, as in Active Messages [148].
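The packet-and-upcall style can be illustrated with a small C sketch. The names lfc_send and lfc_poll and the handler signature are placeholders chosen here for illustration; they are not meant to reproduce LFC's actual interface, which Chapter 3 defines.

    #include <string.h>

    extern void lfc_send(int dest, const void *data, unsigned len);  /* placeholder */
    extern void lfc_poll(void);                                      /* placeholder */

    static volatile int reply_arrived;

    /* Upcall run for each incoming packet, either from a poll or from an
     * interrupt; the client may inspect the packet in place, avoiding a copy. */
    void handle_packet(int src, void *pkt, unsigned len)
    {
        (void)src; (void)pkt; (void)len;
        reply_arrived = 1;
    }

    void request_block(int peer)
    {
        char req[64];
        strcpy(req, "read block 17");
        lfc_send(peer, req, sizeof req);   /* goes out as a single network packet */
        while (!reply_arrived)
            lfc_poll();                    /* polling triggers the upcall above */
    }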

We have implemented LFC's point-to-point and multicast primitives in five different ways. The implementations differ in their reliability assumptions and in how they divide protocol work between the NI and the host. Chapters 3 to 5 describe the most aggressive of these implementations in detail. This implementation exploits Myrinet's programmable NI to implement the following:

1. Reliable point-to-point communication.

2. An efficient, reliable multicast primitive.

3. A mechanism to reduce interrupt overhead (a polling watchdog).

4. A fetch-and-add primitive for global synchronization.

This implementation achieves reliable point-to-point communication by means of an NI-level flow control protocol that assumes that the network hardware is reliable. A simple extension of this flow control protocol allows us to implement an efficient, reliable, NI-level multicast. In this multicast implementation, packets need not travel to the host and back before they are forwarded. Instead, the NI recognizes multicast packets and forwards them to children in the multicast tree without host intervention.

The implementation described above performs much protocol work on the programmable NI and assumes that the network hardware does not drop or corrupt network packets. Our other implementations, described in Chapter 8, differ mainly in where and how they implement reliability and multicast forwarding. Our most conservative implementation, for example, assumes unreliable hardware and implements both reliability (including retransmission) and multicast forwarding on the host.

LFC has been used to implement or port several systems, including CRL [74], FastSockets [120], Parallel Java [99], MPICH [61], Multigame [122], Orca [9], Panda [19, 124], and TreadMarks [78]. Several of these systems are described in this thesis; they are introduced in the sections below.

1.5.2 Panda

Panda is a portable communication library that provides threads, messages, reliable message passing, Remote Procedure Call (RPC), and totally-ordered group communication. Using Panda, we have implemented various PPSs (see below).

To implement its abstractions efficiently, Panda exploits LFC's packet-based interface and its efficient multicast, fetch-and-add, and interrupt-management primitives. Panda uses LFC's packet interface to implement an efficient message abstraction which allows end-to-end pipelining of message data and which allows applications to delay message processing without copying message data. All incoming messages are handled by a single, dedicated thread, which solves part of the blocking-upcall problem. To automatically switch between polling and interrupts, Panda integrates thread management and communication such that the network is automatically polled when all threads are idle. Finally, Panda uses LFC's NI-level multicast and synchronization primitives to implement an efficient totally-ordered multicast primitive.
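The idle-time polling idea can be pictured with a small scheduler sketch (hypothetical names, added for illustration; Chapter 6 describes Panda's actual mechanism):

    extern int  runnable_threads(void);        /* assumed: number of ready threads */
    extern void run_next_thread(void);         /* assumed: thread switch */
    extern void net_poll(void);                /* assumed: deliver pending packets */
    extern void net_enable_interrupts(int on); /* assumed: toggle network interrupts */

    void scheduler_idle_loop(void)
    {
        for (;;) {
            if (runnable_threads() > 0) {
                net_enable_interrupts(1);  /* threads may compute for a long time,
                                              so interrupts guarantee progress */
                run_next_thread();
            } else {
                net_enable_interrupts(0);  /* all threads idle: poll instead,
                                              which is much cheaper */
                net_poll();
            }
        }
    }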

1.5.3 Parallel-Programming Systems

In this thesis, we use five PPSs to test our ideas: Orca, CRL, MPI, Manta, and Multigame.

Orca is a DSM system based on the notion of user-defined shared objects. Jointly, the Orca compiler and runtime system automatically replicate and migrate shared objects to improve locality. To perform operations on shared objects, Orca uses Panda's threads, RPC, and totally-ordered group communication.

Like Orca, CRL is a DSM system. CRL processes share chunks of memory which they can map into their address space. Applications must bracket their accesses to these regions by library calls so that the CRL runtime system can keep the shared regions in a consistent state. Unlike Orca, which updates replicated shared objects, CRL uses invalidation to maintain region-level coherence.

MPI is a message-passing standard. Parallel applications are often developed directly on top of MPI, but MPI is also used as a compiler target. Our implementation of MPI is based on MPICH [61] and Panda. MPICH is a portable and widely used public-domain implementation of MPI that is being developed jointly by Argonne National Laboratories and Mississippi State University.

PPS         Communication patterns   Message detection      # contexts
Orca        Roundtrip + broadcast    Polling + interrupts   1
CRL         Roundtrip                Polling + interrupts   0
MPI         One-way + broadcast      Polling                0
Manta       Roundtrip                Polling + interrupts   >= 1
Multigame   One-way                  Polling                0

Table 1.2. Communication characteristics of parallel-programming systems.

Manta is a PPS that allows Java [60] threads to communicate by invoking methods on shared objects, as in Orca. Superficially, Manta has simpler communication requirements than Orca, because Manta does not replicate shared objects and therefore requires only RPC-style communication. In reality, however, several features of Java that are not present in Orca — specifically, garbage collection and condition synchronization at arbitrary program points — lead to more complex interactions with the communication system.

Multigame is a declarative parallel game-playing system. Given the rules of a board game and a board evaluation function, Multigame automatically searches for good moves, using one of several search strategies (e.g., IDA* or alpha-beta). During a search, processors push search jobs to each other (using one-way messages). A job consists of (recursively) evaluating a board position. To avoid re-searching positions, positions are cached in a distributed hash table.

Not only do these PPSs cover a range of programming models, they also have different communication requirements. Table 1.2 summarizes the main communication characteristics of all five PPSs. (A more detailed discussion of these characteristics appears in Chapters 7 and 8.) The second column lists the most common type of communication pattern used in each PPS. The third column shows how each PPS detects incoming messages. All DSMs (Orca, CRL, and Manta) use a combination of polling and interrupts. The fourth column shows how many independent message-handler contexts can be active in each PPS. In CRL, MPI, and Multigame, all incoming messages are processed by handlers that run in the same context as the main computation (i.e., as procedure calls), so there are no independent handler contexts. In Orca, all handlers are run by a dedicated thread. Finally, Manta creates a new thread for each incoming message.


1.6 Experimental Environment

For most of the experiments described in this thesis, we use a cluster that consists of 128 computers, which are interconnected by a Myrinet [27] network. Before describing the hardware in detail, we first position our cluster architecture.

1.6.1 Cluster Position

The cluster used in this thesis is a compromise between a traditional supercomputer and LAN technology. Unlike a traditional supercomputer, the NI connects to the host's I/O bus and is therefore far away from the host processor. This is a typical organization for commodity hardware, but aggressive supercomputer architectures integrate the NI more tightly with the host system by placing it on the memory bus or integrating it with the cache controller. These architectures allow for very low network access latencies and simpler protection schemes [24, 85, 108]. In this thesis, however, we focus on architectures that use off-the-shelf hardware and rely on advanced software techniques to achieve efficient user-level network access. This type of architecture (i.e., with the NI residing on the host's I/O bus) is now also found in several commercial parallel machines (e.g., the IBM SP/2).

The cluster's network, Myrinet, is not as widely used as the ubiquitous Ethernet LAN. As a parallel-processing platform, however, Myrinet is used in many places, both in academia and industry.

Myrinet's NI contains a programmable, custom RISC processor and fast SRAM memory. The network consists of high-bandwidth links and switches; network packets are cut-through-routed through these links and switches. These features make Myrinet (and similar products) much more expensive than Ethernet, the traditional bus-based LAN technology. While Ethernet can be switched to obtain high bandwidth, most Ethernet NIs are not programmable.

Our main reason for using Myrinet is that its programmable NI enables experimentation with different protocols, mechanisms, and interfaces. This type of experimentation is not specific to Myrinet and has also been performed with other programmable NIs [34, 147]. The main problem is often the vendors' reluctance to release information that allows other parties to program their NIs. Myricom, the company that develops and sells Myrinet, does provide the documentation and tools that enable customers to program the NI.

1.6.2 Hardware Details

Each cluster node contains a single Siemens-Nixdorf D983 motherboard, an Intel 440FX PCI chipset, and an Intel Pentium Pro [70] processor. The nodes are connected via 32 8-port Myrinet switches, which are organized in a three-dimensional grid topology.


Fig. 1.3. Myrinet switch topology.

Fig. 1.4. Switches and cluster nodes. Each node has its own network interface which connects to a crossbar switch. Switches connect to network interfaces and other switches.

Figure 1.3 shows the switch topology. Figure 1.4 shows one vertical plane of the switch topology. Each switch connects to 4 cluster nodes and to neighboring switches. (Figure 1.4 shows only the connections to switches in the same vertical plane.) The cluster nodes can also communicate through a Fast Ethernet network (not shown in Figures 1.3 and 1.4). Fast Ethernet is used for process startup, file transfers, and terminal traffic.

The architecture of a single cluster node is illustrated in Figure 1.5. Each node contains a single Pentium Pro processor and 128 Mbyte of DRAM. The Pentium Pro is a 200 MHz, three-way superscalar processor. It has two on-chip first-level caches: an 8 Kbyte, 2-way set-associative data cache and an 8 Kbyte, 4-way set-associative instruction cache. In addition, a unified, 256 Kbyte, 4-way set-associative second-level cache is packaged along with the processor core. The cache line size is 32 bytes. The processor-memory bus is 64 bits wide and clocked at 66 MHz. The I/O bus is a 33 MHz, 32-bit PCI bus. The Myrinet NI is attached to the I/O bus.


Fig. 1.5. Cluster node architecture.

To obtain accurate timings in our performance measurements, we use the Pentium Pro's 64-bit timestamp counter [71]. The timestamp counter is a clock with clock-cycle (5 ns) resolution. A simple pseudo device driver makes this clock available to unprivileged users. The counter can be read (from user space) using a single instruction, so the use of this fine-grain clock imposes little overhead (approximately 0.17 µs).
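As an illustration, the counter is read with the rdtsc instruction. The small wrapper below is a typical way to do this from user space with GCC inline assembly; the function name and the cycles-to-microseconds conversion are ours, not part of the pseudo device driver mentioned above.

#include <stdint.h>

/* Read the Pentium Pro's 64-bit timestamp counter from user space. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Example use: on a 200 MHz processor each tick is 5 ns, so
 *   elapsed_us = (read_tsc() - start) * 0.005;
 * times a code fragment with sub-microsecond resolution. */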

Myrinet is a high-speed, switched LAN technology [27]. Switched technologies scale well to a large number of hosts, because bandwidth increases as more hosts and switches are added to the network. Unlike Ethernet, Myrinet provides no hardware support for multicast. General-purpose multicasting for wormhole-routed networks is a complex problem and an area of active research [136, 135].

Unlike traditional NIs, Myrinet's NI is programmable; it contains a custom processor which can be programmed in C or assembly. The processor, a 33 MHz LANai 4.1, is controlled by the LANai control program. The processor is effectively an order of magnitude slower than the 200 MHz, superscalar host processor.

The NI is equipped with 1 Mbyte of SRAM memory which holds both the code and the data for the control program. The currently available Myrinet NIs require that all network packets be staged through this memory, both at the sending and the receiving side. SRAM is fast, but expensive, so the amount of memory is relatively small. Other networks allow data to be transferred directly between host memory and the network (e.g., using memory-mapped FIFOs or DMA).

The NI contains three DMA engines. The host DMA engine transfers data between host memory and NI memory.


Host memory buffers that are accessed by this DMA engine should be pinned (i.e., marked unswappable) to prevent the operating system from paging these pages to disk during a DMA transfer. The send DMA engine transfers packets from NI memory to the outgoing network link; the receive DMA engine transfers incoming packets from the network link to NI memory. The DMA engines can run in parallel, but the NI's memory allows at most two memory accesses per cycle, one to or from the PCI bus and one to or from the NI processor, the send DMA engine, or the receive DMA engine.

Myrinet NIs connect to each other via 8-port crossbar switches and high-speed links (1.28 Gbit/s in each direction). The Myrinet hardware uses routing and switching techniques similar to those used in massively parallel processors (MPPs) such as the Thinking Machines CM-5, the Cray T3D and T3E, and the Intel Paragon. Network packets are cut-through-routed from a source to a destination network interface. Each packet starts with a sequence of routing bytes, one byte per switch on the path from source to destination. Each switch consumes one routing byte and uses the byte's value to decide on which output port the packet must be forwarded. Myrinet implements a hardware flow control protocol between pairs of communicating NIs which makes packet loss very unlikely. If all senders on the network agree on a deadlock-free routing scheme and the NIs' control programs remove incoming packets in a timely manner, then the network can be considered reliable (see also Section 3.9 and Chapter 8).

While in transit, a packet may become blocked, either because part of the path it follows is occupied by another packet, or because the destination NI fails to drain the network. In the first case, Myrinet will kill the packet if it remains blocked for more than 50 milliseconds. In the second case, the Myrinet hardware sends a reset signal to the destination NI after a timeout interval has elapsed. The length of this interval can be set in software. Both measures are needed to break deadlocks in the network. If the network did not do this, a malicious or faulty program could block another program's packets.

1.7 Thesis Outline

This chapter introduced our area of research, communication support for PPSs. We sketched the problems in this area and our approach to solving them. Chapter 2 surveys the main design issues for user-level communication architectures and shows that existing systems resolve these issues in widely different ways, which illustrates that the tradeoffs are still unclear. Chapter 3 describes the design and implementation of our most optimistic LFC implementation. Chapter 4 gives a detailed description of the NI-level protocols employed by this LFC implementation. Chapter 5 evaluates the implementation's performance. Chapter 6


describes Panda and its interactions with LFC. Chapter 7 describes the implementation of four PPSs: Orca, CRL, MPI, and Manta. We show how LFC and Panda enable efficient implementations of these systems and describe additional optimizations used within these systems. Chapter 8 compares the performance of five LFC implementations at multiple levels: the NI protocol level, the PPS level, and the application level. Finally, in Chapter 9, we draw conclusions.


Chapter 2

Network Interface Protocols

High-speed networks such as Myrinet offer great potential for communication-intensive applications. Unfortunately, traditional communication protocols such as TCP/IP are unable to realize this potential. In the common implementation of these protocols, all network access is through the operating system, which adds significant overheads to both the transmission path (typically a system call and a data copy) and the receive path (typically an interrupt and a data copy). In response to this performance problem, several user-level communication architectures have been developed that remove the operating system from the critical communication path [48, 147]. This chapter provides insight into the design issues for communication protocols for these architectures. We concentrate on issues that determine the performance and semantics of a communication system: data transfer, address translation, protection, control transfer, reliability, and multicast.

In this chapter, we use the following systems to illustrate the design issues: Active Messages II (AM-II) [37], BIP [118], Illinois Fast Messages (FM) [112, 113], FM/MC [10, 146], Hamlyn [32], PM [141, 142], U-Net [13, 147], VMMC [24], VMMC-2 [49, 50], and Trapeze [154]. All systems aim for high performance and all except Trapeze offer a user-level communication service. Interestingly, however, they differ significantly in how they resolve the different design issues. It is this variety that motivates our study.

This chapter is structured as follows. Section 2.1 explains the basic principles of NI protocols by describing a simple, unreliable, user-level protocol. Next, in Sections 2.2 to 2.7, we discuss the six protocol design issues that determine system performance and semantics. We illustrate these issues by giving performance data obtained on our Myrinet cluster.


Fig. 2.1. Operation of the basic protocol. The dashed arrows are buffer pointers. The numbered arrows represent the following steps. (1) Host copies user data into DMA area. (2) Host writes packet descriptor to send ring. (3) NI processor reads packet descriptor. (4) NI DMAs packet to NI memory. (5) Network transfer. (6) NI reads receive ring to find empty buffer in DMA area. (7) NI DMAs packet to DMA area. (8) Optional memory copy to user buffer by the host processor.

2.1 A Basic Network Interface Protocol

The goal of this section is to explain the basics of NI protocols and to introduce the most important design issues. To structure our discussion, we first describe the design of a simple, user-level NI protocol for Myrinet. The protocol ignores several important problems, which we address in subsequent sections.

To avoid the cost of kernel calls for each network access, the basic protocol maps all NI memory into user space. User processes write their send requests directly to NI memory, without operating system (OS) involvement. The basic protocol provides no protection, so the network device cannot be shared among multiple processes.

User processes invoke a simple send primitive to send a data packet. The basic protocol sends packets with a maximum payload of 256 bytes and requires that users fragment their data so that each fragment fits in a packet. The signature of the send primitive is as follows:

void send(int destination, void *data, unsigned size);

Send() performs two actions (see Figure 2.1). First, it copies the user data to a packet buffer in a special staging area in host memory (step 1). The NI will later fetch the packet from this DMA area by means of a DMA transfer. Unlike normal user pages, pages in the DMA area are never swapped to disk by the OS. By only DMAing to and from such pinned pages, the protocol avoids corruption of user memory and user messages due to paging activity of the OS.


Second, the host writes a send request into a descriptor in NI memory (step 2). These descriptors are stored in a circular buffer called the send ring. Send() stores the identifier of the destination machine, the size of the packet's payload, and the packet's offset in the DMA area into the next available descriptor in this send ring. To inform the NI of this event, send() also sets a DescriptorReady flag in the descriptor. This flag prevents races between the host and the NI: the NI will not read any other descriptor fields until this flag has been set. The descriptor is written using programmed I/O; since the descriptor is small, DMA would have a high overhead.

The NI repeatedly polls the DescriptorReady flag of the first descriptor in the send ring. As soon as this flag is set by the host, the NI reads the offset in the descriptor and adds it to the physical address of the start of the DMA area, resulting in the physical address of the packet (step 3). Next, the NI initiates a DMA transfer over the I/O bus to copy the packet's payload from host memory to NI memory (step 4). Subsequently, it reads the destination machine in the descriptor and looks up the route for the packet in a routing table. The route and a packet header that contains the packet's size are prepended to the packet. Finally, the NI starts a second DMA to transmit the packet (step 5).

When the sending NI detects that the network DMA for a given packet has completed, it sets a DescriptorFree flag to release the descriptor, and polls the next free descriptor in the ring. If the host wants to send a packet while no descriptor is available, it busy-waits by polling the DescriptorFree flag of the descriptor at the tail of the ring.
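To make the host side of this send path concrete, the sketch below spells it out in C. It is our illustration, not the basic protocol's actual source: the descriptor layout, the flag values, and the globals (send_ring, dma_area, ring_tail) are assumptions made for the example.

#include <string.h>

enum { DESC_FREE = 0, DESC_READY = 1 };          /* illustrative flag values        */

struct send_desc {                               /* hypothetical descriptor layout  */
    volatile unsigned flag;                      /* DESC_FREE or DESC_READY         */
    unsigned dest;                               /* destination machine identifier  */
    unsigned size;                               /* payload size in bytes           */
    unsigned offset;                             /* packet offset in the DMA area   */
};

extern volatile struct send_desc *send_ring;     /* send ring, mapped NI memory     */
extern char *dma_area;                           /* pinned DMA staging area         */
extern unsigned ring_tail;                       /* next descriptor to use          */
#define RING_SIZE 64
#define PKT_SIZE  256

void basic_send(int dest, const void *data, unsigned size)
{
    volatile struct send_desc *d = &send_ring[ring_tail];

    while (d->flag != DESC_FREE)                 /* busy-wait for a free descriptor */
        ;
    memcpy(dma_area + ring_tail * PKT_SIZE, data, size);  /* step 1: into DMA area  */
    d->dest   = dest;                            /* step 2: fill descriptor (PIO)   */
    d->size   = size;
    d->offset = ring_tail * PKT_SIZE;
    d->flag   = DESC_READY;                      /* written last; a real implementation
                                                    also needs a memory barrier here */
    ring_tail = (ring_tail + 1) % RING_SIZE;
}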

NIs use receive DMAs to store incoming packets in their memory. Each NI contains a receive ring with descriptors that point to free buffers in the host's DMA area. The NI uses the receive ring's descriptors to determine where (in host memory) to store incoming packets. After receiving a packet, the NI tries to acquire a descriptor from the receive ring (step 6). Each descriptor contains a flag bit that is used in a similar way as for the send ring. If no free host buffer is available, the packet is simply dropped. Otherwise, the NI starts a DMA to transfer the packet to host memory (step 7). Each host buffer also contains a flag that is set by the NI as the last part of its NI-to-host DMA transfer. The host can check if there is a packet available by polling the flag of the next unprocessed host receive buffer. Once the flag has been set, the receiving process can safely read the buffer and optionally copy its contents to a user buffer (step 8).

The data transfers in steps 1, 4, 5, 7, and 8 can all be performed concurrently. For example, if the host sends a long, multipacket message, one packet's network DMA can be overlapped with the next packet's host-to-NI DMA. Exploiting this concurrency is essential for achieving high throughput.

Since the delivery of network interrupts to user-level processes is expensive on current OSs, the basic protocol does not use interrupts, but requires users to poll for incoming messages.


A successful call to poll() results in the invocation of a user function, handle_packet(), that handles an inbound packet:

void poll(void);
void handle_packet(void *data, unsigned size);
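Based on the flag mechanism described earlier, a host-side implementation of poll() can be sketched as follows. Only poll() and handle_packet() come from the interface above; the buffer layout and the remaining names are assumptions made for the example.

struct recv_buf {                        /* hypothetical receive buffer layout    */
    volatile unsigned full;              /* set by the NI as the last DMA word    */
    unsigned size;                       /* payload size, filled in by the NI     */
    char data[256];                      /* maximum payload of the basic protocol */
};

extern struct recv_buf *recv_bufs;       /* receive buffers in the host DMA area  */
extern unsigned recv_head;               /* next unprocessed buffer               */
#define NUM_RECV_BUFS 64                 /* illustrative ring size                */

void poll(void)
{
    struct recv_buf *b = &recv_bufs[recv_head];

    if (!b->full)                        /* cheap check: usually a cache hit      */
        return;
    handle_packet(b->data, b->size);     /* step 8: hand the packet to the user   */
    b->full = 0;                         /* a real implementation would also hand
                                            the buffer back to the NI's receive ring */
    recv_head = (recv_head + 1) % NUM_RECV_BUFS;
}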

Our (unoptimized) implementation of the basic protocol achieves a one-way host-to-host latency of 11 µs and a throughput of 33 Mbyte/s (using 256-byte packets). For comparison, on the same hardware the highly optimized BIP system [118] achieves a minimum latency of 4 µs and can saturate the I/O bus (127 Mbyte/s). For large data transfers, BIP uses DMA, both at the sending and the receiving side. In contrast with the basic protocol, however, BIP does not stage data through DMA areas. Section 2.3 discusses different techniques to eliminate the use of DMA areas.

The basic protocol avoids all OS overhead, keeps the NI code simple, and uses little NI memory. It is clear, however, that the protocol has several shortcomings:

- All inbound and outbound network transfers are staged through a DMA area. For applications that need to send and receive from arbitrary locations, this introduces extra memory copies.

- The protocol provides no protection. If the basic protocol allowed multiple users to access the NI, these users could read and modify each other's data in NI memory. Users can even modify the NI's control program and use it to access any host memory location.

- The receiver-side control transfer mechanism, polling, is simple, but not always effective. For many applications it is difficult to determine a good polling rate. If the host polls too frequently, it will have a high overhead; if it polls too late, it will not reply quickly enough to incoming packets.

- The protocol is unreliable, even though the Myrinet hardware is highly reliable. If the senders send packets faster than the receiver can handle them, the receiving host will run out of buffer space and the NI will drop incoming packets.

- The protocol supports only point-to-point messages. Although multicast can be implemented on top of point-to-point messages, doing so may be inefficient. Multicast is an important service by itself and a fundamental component of collective communication operations such as those supported by the message-passing standard MPI.

Below, we will discuss these problems in more detail and look at better design alternatives.


2.2 Data Transfers

On Myrinet, at least three steps are needed to communicate a packet from one user process to another: the packet must be moved from the sender's memory to its NI (host-NI transfer), from this NI to the receiver's NI (NI-NI transfer), and then to the receiving process's address space (NI-host transfer). Network technologies that do not require data to be staged through NI memory use only two steps: host-to-network and network-to-host. Below, we discuss the Myrinet case, but most issues (programmed I/O versus DMA, pinning, alignment, and maximum packet size) also apply to the two-step case.

The data transfers have a significant impact on the latency and throughput obtained by a protocol, so optimizing them is essential for obtaining high performance. As shown in Figure 2.1, the basic protocol uses five data transfers to communicate a packet, because it stages all packets through DMA areas. Below, we discuss alternative designs for implementing the host-NI, NI-NI, and NI-host transfers.

2.2.1 From Host to Network Interface

On Myrinet, the host-to-NI transfer can use either DMA or Programmed I/O (PIO). Steenkiste gives a detailed description of both mechanisms [133]. With PIO, the host processor reads the data from host memory and writes it into NI memory, typically one or two words at a time, which results in many bus transactions. DMA uses special hardware (a DMA engine) to transfer the entire packet in large bursts and asynchronously, so that the data transfer can proceed in parallel with host computations. One thus might expect DMA to always outperform PIO. The optimal choice, however, depends on the type of host CPU and on the packet size. The Pentium Pro, for example, supports write combining buffers, which allow memory writes to the same cache line to be merged into a single 32-byte bus transaction. We can thus boost the performance of host-to-NI PIO transfers by applying write combining to NI memory. (This use of the Pentium Pro's write combining facility was suggested to us by the Fast Messages group of the University of Illinois at Urbana-Champaign [31].)

Figure 2.2 shows the throughput obtained by PIO (with and without write combining) and cache-coherent DMA, for copying data from a Pentium Pro to a Myrinet NI card. PIO with write combining quickly outperforms PIO without write combining. For buffer sizes up to 1024 bytes, PIO with write combining even outperforms DMA (which suffers from a startup cost).

For small messages, PIO with write combining is slower than PIO without write combining. This is due to an expensive extra instruction (a serializing instruction) that our benchmark executes after each host-to-NI memory copy with write combining.


Fig. 2.2. Host-to-NI throughput using different data transfer mechanisms (curves: DMA, DMA + on-demand pinning, PIO + write combining, DMA + memcpy, PIO).

We use this instruction to model the common situation in which a data copy to NI memory is followed by another write that signals the presence of the data. Clearly, the NI should not observe the latter write before the data copy has completed. With write combining, however, writes may be reordered and it is necessary to separate the data copy and the write that follows it by a serializing instruction.
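In code, the pattern is a copy into write-combining-mapped NI memory, a serializing or fencing instruction, and then the signalling write. The fragment below is our illustration; on the Pentium Pro a serializing instruction such as cpuid flushes the write-combining buffers (later IA-32 processors can use sfence instead), and the two NI-memory pointers are assumed to be set up elsewhere.

#include <string.h>

void send_with_write_combining(volatile char *ni_packet_buf,
                               volatile unsigned *ni_ready_flag,
                               const void *data, unsigned size)
{
    memcpy((void *)ni_packet_buf, data, size);   /* may be buffered and reordered */
    __asm__ __volatile__("cpuid"                 /* serialize: drain the write-   */
                         ::: "eax", "ebx",       /* combining buffers before the  */
                             "ecx", "edx",       /* flag write becomes visible    */
                             "memory");
    *ni_ready_flag = 1;                          /* the NI may now read the data  */
}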

In user-level communication systems, DMA transfers can be started either by a user process or by the network interface without any operating system involvement. Since DMA transfers are performed asynchronously, the operating system may decide to swap out the page that happens to be the source or destination of a running DMA transfer. If this happens, part of the destination of the transfer will be corrupted. To avoid this, operating systems allow applications to pin a limited number of pages in their address space. Pinned pages are never swapped out by the operating system. Unfortunately, pinning a page requires a system call and the amount of memory that can be pinned is limited by the available physical memory and by OS policies. In the best case, all pages that an application transfers data to or from need to be pinned only once. In this case the cost of pinning the pages can be amortized over many data transfers. In the worst case, an application transfers data to or from more pages than can be pinned simultaneously. In this case, two system calls are needed for every data transfer: one to unpin a previously pinned page and one to pin the page involved in the next data transfer. SHRIMP provides special hardware that allows user processes to start DMA transfers without pinning [25]. This user-level DMA mechanism, however, works only for host-initiated transfers, not for NI-initiated transfers, so pinning is still required at the receiving side.
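On a Unix-like OS, pinning is typically exposed through the mlock() and munlock() system calls; the sketch below pins the pages that back a DMA buffer. The page rounding and the function name are our own additions.

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pin the pages containing [buf, buf+len) so the OS cannot page them out
 * while a DMA transfer is in progress. Returns 0 on success, -1 on error. */
int pin_buffer(void *buf, size_t len)
{
    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)buf & ~(page - 1);       /* round down to a page */
    size_t    span  = ((uintptr_t)buf + len) - start;

    return mlock((void *)start, span);                    /* one system call      */
}

/* munlock() with the same arguments releases the pages again; a protocol that
 * pins on demand pays such a pair of calls whenever it runs out of pinnable pages. */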


NI protocols that use DMA often choose to copy the data into a reserved (and pinned) DMA area, which costs an extra memory copy and thus may decrease the throughput. Figure 2.2 shows that a DMA transfer preceded by a memory copy is consistently slower than PIO with write combining. On processors that do not support cache-coherent DMAs, the DMA area needs to be allocated in uncached memory, which also decreases performance. (Most modern CPUs, including the Pentium Pro, support cache-coherent DMA, however.) With PIO, pinning is not necessary. Even if the OS swapped out the page during the transfer, the host's next memory reference would generate a page fault, causing the OS to swap the page back in. In practice, many protocols use DMA; other protocols (AM-II, Hamlyn, BIP) use PIO for small messages and DMA for large messages. FM uses PIO for all messages.

With both DMA and PIO, data transfers between unaligned data buffers can be much slower than between aligned buffers. This problem is aggravated when the NI's DMA engines require that source and destination buffers be properly aligned. In that case, extra copying is needed to align unaligned buffers.

Another important design choice is the maximum packet size. Large packets yield better throughput, because per-packet overheads are incurred fewer times than with small packets. The throughput of our basic protocol, for example, increases from 33 Mbyte/s to 48 Mbyte/s by using 1024-byte instead of 256-byte packets. The choice of the maximum packet size is influenced by the system's page size, memory space considerations, and hardware restrictions.

2.2.2 From Network Interface to Network Interface

The NI-to-NI transfer traverses the Myrinet links and switches. All Myrinet protocols use the NI's DMA engine to send and receive network data. In theory, PIO could be used, but DMA transfers are always faster. To send or receive a data word by means of PIO, the processor must always move that data item through a processor register, which costs at least two cycles. A DMA engine can transfer one word per cycle. Moreover, using DMA frees the processor to do other work during data transfers.

To prevent network congestion, the receiving NI should extract incoming data from the network fast enough. On Myrinet, the hardware uses backpressure to stall the sending NI if the receiver does not extract data fast enough. To prevent deadlock, however, there is a time limit on the backpressure mechanism. If the receiver does not drain the network within a certain time period, the network hardware will reset the NI or truncate a blocked packet. Many Myrinet control programs deal with this real-time constraint by copying data fast enough to prevent resets. Other protocols avoid the problem by using a software flow control scheme, as we will discuss later.


2.2.3 From Network Interface to Host

The transfer from NI to host at the receiving side can again use either DMA or PIO. On Myrinet, however, only the host (not the NI) can use PIO, making DMA the method of choice for most protocols. Some systems (e.g., AM-II) use PIO on the host to receive small messages. For large messages, all protocols use DMA, because reads over the I/O bus are typically much slower than DMA transfers. Whether we use DMA or PIO, in both cases the bottleneck for the NI-to-host transfer is the I/O bus.

Using microbenchmarks, we measured the attainable throughput on the important data paths of our Pentium Pro/Myrinet cluster, using DMA and PIO. Table 2.1 summarizes the results and also gives the hardware bandwidth of the bottleneck component on each data path. For transfers between the host and the NI, the bottleneck is the 33.33 MHz PCI bus; both main memory and the memory bus allow higher throughputs. With DMA transfers, the microbenchmarks can almost saturate the I/O bus. Our throughput is slightly less than the I/O bus bandwidth because in our benchmarks the NI acknowledges every DMA transfer with a one-word NI-to-host DMA transfer. For network transfers from one NI's memory to another NI's memory, the bottleneck is the NIs' memory. The send and receive DMA engines can access at most one word per I/O bus clock cycle (i.e., 127 Mbyte/s). The bandwidth of the network links is higher, 153 Mbyte/s. Finally, for local host memory copies, the bottleneck component is main memory.

An interesting observation is that a local memory copy on a single Pentium Pro obtains a lower throughput than a remote memory copy over Myrinet (52 Mbyte/s versus 127 Mbyte/s). This problem is largely due to the poor memory write performance of the Pentium Pro [29]; on a 450 MHz Pentium III platform, we measured a memory-to-memory copy throughput of 157 Mbyte/s. Nevertheless, memory copies have an important impact on performance. For comparison, recall that the basic protocol achieves a throughput of only 33 Mbyte/s. The reason is that the basic protocol uses fairly small packets (256 bytes) and performs memory copies to and from DMA areas, which interferes with DMA transfers. Several systems (e.g., BIP and VMMC-2) can saturate the I/O bus: the key issue is to avoid the copying to and from DMA areas. The next section describes techniques to achieve this.

2.3 Address Translation

The use of DMA transfers between host and NI memory introduces two problems. First, most systems require that every host memory page involved in a DMA transfer be pinned to prevent the operating system from replacing that page. Pinning, however, requires an expensive system call which should be kept off the critical path.


Source        Destination   Method     Hardware    Measured
                                       bandwidth   throughput
Host memory   Host memory   PIO        170          52
Host memory   NI memory     PIO        127          25
                            PIO + WC   127          84
                            DMA        127         117
NI memory     Host memory   PIO        127           7
                            DMA        127         117
NI memory     NI memory     DMA        127         127

Table 2.1. Bandwidths and measured throughputs (in Mbyte/s) on a Pentium Pro/Myrinet cluster. WC means write combining.

The second problem is that, on most architectures, the NI's DMA engine needs to know the physical address of each page that it transfers data to or from. Operating systems, however, do not export virtual-to-physical mappings to user-level programs, so users normally cannot pass physical addresses to the NI. Even if they could, the NI would have to check those physical addresses, to ensure that users pass only addresses of pages that they have access to.

We consider three approaches to solve these problems. The first approach is to avoid all DMA transfers by using programmed I/O. Due to the high cost of I/O bus reads, however, this is only a realistic solution at the sending side.

The second approach, used by the basic protocol, requires that users copy their data into and out of special DMA areas (see Figure 2.1). This way, only the DMA areas need to be pinned. This is done once, when the application opens the device, and not during send and receive operations. The address translation problem is then solved as follows. The operating system allocates for each DMA area a contiguous chunk of physical memory and passes the area's physical base address to the NI. Users specify send and receive buffers by means of an offset in their DMA area. The NI only needs to add this offset to the area's base address to obtain the buffer's physical address. Several systems (e.g., AM-II, Hamlyn) use this approach, accepting the extra copying costs. As shown in Figure 2.2, however, the extra copy reduces throughput significantly.

In the third approach, the copying to and from DMA areas is eliminated by dynamically pinning and unpinning user pages so that DMA transfers can be performed directly to those pages. Systems that use this approach (e.g., VMMC-2, PM, and BIP) can track the 'DMA' curve in Figure 2.2. The main implementation problem is that the NI needs to know the current virtual-to-physical mappings of individual pages.


Since NIs are usually equipped with only a small amount of memory and since their processors are slow, they do not store information for every single virtual page. Some systems (e.g., BIP) provide a simple kernel module that translates virtual addresses to physical addresses. Users are responsible for pinning their pages and obtaining the physical addresses of these pages from the kernel module. The disadvantage of this approach is that the NI cannot check if the physical addresses it receives are valid and if they refer to pinned pages.

An alternative approach is to let the kernel and NI cooperate such that the NI can keep track of valid address translations (either in hardware or in software). Systems like VMMC-2 and U-Net/MM [13] (an extension of U-Net) let the NI cache a limited number of valid address translations which refer to pinned pages (this invariant must be maintained cooperatively by the NI and the operating system). This caching works well for applications that exhibit locality in the pages they use for sending and receiving data. When the translation of a user-specified address is found in the cache, the NI can access that address using a DMA transfer. In the case of a miss, special action must be taken. In U-Net/MM, for example, the NI generates an interrupt when it cannot translate an address. The kernel receives the interrupt, looks up the address in its page table, pins the page, and passes the translation to the NI.

In VMMC-2, address translations for user buffers are managed by a library. This user-level library maps users' virtual addresses to references to address translations which users can pass to the NI. The library creates these references by invoking a kernel module. This module translates virtual addresses, pins the corresponding pages, and stores the translations in a User-managed TLB (UTLB) in kernel memory.

To avoid invoking the operating system every time a reference is needed, the library maintains a user-level lookup data structure that keeps track of the addresses for which a valid UTLB entry exists. The library invokes the UTLB kernel module only when it cannot find the address in its lookup data structure. When the NI receives a reference, it can find the translation using a DMA transfer to the kernel module's data structure. To avoid such DMA transfers on the critical path, the NI maintains its own cache of references. The 'on-demand pinning' curve in Figure 2.2 shows the throughput obtained by a benchmark that imitates the miss behavior of a UTLB. To simulate a miss in its lookup data structure, the host invokes, for each page that is transferred, a system call to pin that page. To simulate a miss in the NI cache, the NI fetches a single word from host memory before it transfers the data.
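The NI-side cache of translations can be pictured as a small direct-mapped table indexed by virtual page number. The code below is our own illustration of the idea, not VMMC-2 or U-Net/MM source; the miss path is left abstract because it is exactly what distinguishes the systems discussed above (interrupt the kernel, or fetch the entry from the kernel's UTLB).

#include <stdint.h>

#define PAGE_SHIFT  12                   /* 4 Kbyte pages (assumption)           */
#define TLB_ENTRIES 256                  /* small, to fit in scarce NI memory    */

struct tlb_entry {
    uintptr_t vpage;                     /* virtual page number (0 = invalid)    */
    uintptr_t paddr;                     /* physical address of the pinned page  */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate a user virtual address to a physical address usable for DMA.
 * Returns 0 on a miss; the caller must then take the system-specific miss
 * action before retrying. */
uintptr_t ni_translate(uintptr_t vaddr)
{
    uintptr_t vpage = vaddr >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpage % TLB_ENTRIES];

    if (e->vpage != vpage)
        return 0;                                          /* miss               */
    return e->paddr | (vaddr & ((1u << PAGE_SHIFT) - 1));  /* hit: page + offset */
}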


2.4 Protection

Since user-level architectures give users direct access to the NI, the OS can no longer check every network transfer. Multiplexing and demultiplexing must now be performed by the NI and special measures are needed to preserve the integrity of the OS and to isolate the data streams of different processes from one another.

In a protected system, no user process should be given unrestricted access to NI memory. With unrestricted access to NI memory, a user process can modify the NI control program and use it to read or write any location in host memory. NI memory access must also be restricted to implement protected multiplexing and demultiplexing. In the basic protocol, for example, user processes directly write to NI memory to initialize send descriptors. If multiple processes shared the NI, one process could corrupt another process's send descriptors. Similarly, no process should be able to read incoming network packets that are destined for another process. The basic protocol prevents these problems by providing user-level network access to at most one user at a time, but this limitation is clearly undesirable in a multi-user and multiprogramming environment.

A straightforward solution to these problems is to use the virtual-memory system to give each user access to a different part of NI memory [48]. When the user opens the network device, the operating system maps such a part into the user's address space. Once the mapping has been established, all user accesses outside the mapped area will be trapped by the virtual-memory hardware. Users write their commands and network data to their own pages in NI memory. It is the responsibility of the NI to check each user's page for new requests and to process only legal requests.
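On a Unix system such a per-process mapping is typically established with mmap() on the network device. The fragment below sketches what a user-level library might do when it opens the device; the device name and the fixed 64 Kbyte window size are invented for the example, and the kernel driver is assumed to pick and enforce the slice of NI memory that each process may touch.

#include <fcntl.h>
#include <sys/mman.h>

#define NI_WINDOW_SIZE (64 * 1024)   /* per-process slice of NI memory (assumption) */

volatile char *map_ni_window(void)
{
    int fd = open("/dev/myrinet0", O_RDWR);   /* hypothetical device name */
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, NI_WINDOW_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (p == MAP_FAILED) ? NULL : (volatile char *)p;
}

Accesses outside the returned window fault in the usual way, so the virtual-memory hardware provides the protection without any check on the fast path.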

Since NI memory is typically small, only a limited number of processes can be given direct access to the NI this way. To solve this problem, AM-II virtualizes network endpoints in the following way. Part of NI memory acts as a cache for active communication endpoints; inactive endpoints are stored in host memory. When an NI receives a message for an inactive endpoint or when a process tries to send a message via an inactive endpoint, the NI and the operating system cooperate to activate the endpoint. Activation consists of moving the endpoint's state (send and receive buffers, protocol status) to NI memory, possibly replacing another endpoint which is then swapped out to host memory.

The sharing problem also exists on the host, for the DMA areas. To maintain protection, each user process needs its own DMA area. Since the use of a DMA area introduces an extra copy, some systems eliminate it and store address translations on the NI. VMMC-2 and U-Net/MM do this in a protected way, either by letting the kernel write the translations to the NI or by letting the NI fetch the translations from kernel memory. BIP, on the other hand, eliminates the DMA area, but does not maintain protection.


2.5 Control Transfers

The control transfer mechanism determines how a receiving host is notified of message arrivals. The options are to use interrupts, polling, or a combination of these.

Interrupts are notoriously expensive. With most operating systems, the time to deliver an interrupt (as a signal) to a user process even exceeds the network latency. On a 200 MHz Pentium Pro running Linux, dispatching an interrupt to a kernel interrupt handler costs approximately 8 µs. Dispatching to a user-level signal handler costs even more, approximately 17 µs. This exceeds the latency of our basic protocol (11 µs).

Given the high costs of interrupts, all user-level architectures support some form of polling. The goal of polling is to give the host a fast mechanism to check if a message has arrived. This check must be inexpensive, because it may be executed often. A simple approach is to let the NI set a flag in its memory and to let the host check this flag. This approach, however, is inefficient, since every poll now results in an I/O bus transfer. In addition, this polling traffic will slow down other I/O traffic, including network packet transfers between NI and host memory.

A very efficient solution is to use a special device status register that is shared between the NI and the host [107]. Current hardware, however, does not provide these shared registers.

On architectures with cache-coherent DMA, a practical solution is to let the NI write a flag in cached host memory (using DMA) when a message is available. This approach is used by our basic protocol. The host polls by reading its local memory; since polls are executed frequently, the flag will usually reside in the data cache, so failed polls are cheap and do not generate memory or I/O traffic. When the NI writes the flag, the host will incur a cache miss and read the flag's new value from memory. On a 200 MHz Pentium Pro, the scheme just described costs 5 nanoseconds for a failed poll (i.e., a cache hit) and 74 nanoseconds for a successful poll (i.e., a cache miss). For comparison, each poll in the simple scheme (i.e., an I/O bus transfer) costs 467 nanoseconds.

Even if the polling mechanism is efficient, polling is a mixed blessing. Inserting polls manually is tedious and error-prone. Several systems therefore use a compiler or a binary rewriting tool to insert polls in loops and functions [114, 125]. The problem of finding the right polling frequency remains, though [89]. In multiprocessor architectures this problem can be solved by dedicating one of the processors to polling and message handling.

Several systems (AM-II, FM/MC, Hamlyn, Trapeze, U-Net, VMMC, VMMC-2) support both interrupts and polling. Interrupts usually can be enabled or disabled by the receiver; sometimes the sender can also set a flag in each packet that determines whether an interrupt is to be generated when the packet arrives.


Fig. 2.3. Design decisions for reliability.

2.6 Reliability

Existing Myrinet protocols differ widely in the way they address reliability. Figure 2.3 shows the choices made by various systems. The most important choice is whether or not to assume that the network is reliable. Myrinet has a very low bit-error rate (less than 10^-15 on shielded cables up to 25 m long [27]). Consequently, the risk of a packet getting lost or corrupted is small enough to consider it a 'fatal event' (much like a memory parity error or an OS crash). Such events can be handled by higher-level software (e.g., using checkpointing), or else cause the application to crash.

Many Myrinet protocols indeed assume that the hardware is reliable, so let us look at these protocols first. The advantage of this approach is efficiency, because no retransmission protocol or time-out mechanism is needed. Even if the network is fully reliable, however, the software protocol may still drop packets due to lack of buffer space. In fact, this is the most common cause of packet loss. Each protocol needs communication buffers on both the host and the NI, and both are a scarce resource. The basic protocol described in Section 2.1, for example, will drop packets when it runs out of receive buffers. This problem can be solved in one of two ways: either recover from buffer overflow or prevent overflow from happening.

The first idea (recovery) is used in PM. The receiver simply discards incoming packets if it has no room for them. It returns an acknowledgement (ACK or NACK) to the sender to indicate whether or not it accepted the packet. A NACK indicates a packet was discarded; the sender will later retransmit that packet. This process continues until an ACK is received, in which case the sender can release the buffer space for the message.


This protocol is fairly simple; the main disadvantages are the extra acknowledgement messages and the increased network load when retransmissions occur.
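In pseudo-C, the sender side of this recovery scheme is a small per-packet loop. The sketch below only serves to make the control flow concrete: send_packet(), await_reply(), ACK, and NACK are invented names, and a real implementation overlaps retransmissions with other traffic instead of waiting synchronously.

enum { ACK, NACK };

extern void send_packet(int dest, const void *pkt, unsigned size);
extern int  await_reply(int dest);

void reliable_send(int dest, const void *pkt, unsigned size)
{
    for (;;) {
        send_packet(dest, pkt, size);    /* transmission; may be discarded       */
        if (await_reply(dest) == ACK)    /* receiver buffered the packet         */
            break;                       /* now safe to release the send buffer  */
        /* NACK: the receiver had no buffer space; retransmit the packet. */
    }
}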

A key property of the protocol is that it never drops acknowledgements. By assumption, Myrinet is reliable, so the network hardware always delivers acknowledgements to their destination. When an acknowledgement arrives, the receiver processes it completely before it accepts a new packet from the network, so at most one acknowledgement needs to be buffered. (Myrinet's hardware flow control ensures that pending network packets are not dropped.)

Unlike an acknowledgement, a data packet cannot always be processed completely once it has been received. A data packet needs to be transferred from the NI to the host, which requires several resources: at least a free host buffer and a DMA engine. If the wait time for one of these resources (e.g., a free host buffer) is unbounded, then the NI cannot safely wait for that resource without receiving pending packets, because the network hardware may then kill blocked packets.

The second approach is to prevent buffer overflow by using a flow control scheme that blocks the sender if the receiver is running out of buffer space. For large messages, BIP requires the application to deal with flow control by means of a rendezvous mechanism: the receiver must post a receive request and provide a buffer before a message may be sent. That is, a large-message send never completes before a receive has been posted. FM and FM/MC implement flow control using a host-level credit scheme. Before a host can send a packet, it needs to have a credit for the receiver; the credit represents a packet buffer in the receiver's memory. Credits can be handed out in advance by pre-allocating buffers for specific senders, but if a sender runs out of credits it must block until the receiver returns new credits.
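Sketched in C, the sender side of such a host-level credit scheme is little more than a counter per destination. The names (MAX_PROCS, send_packet(), the credit-return handler) are ours, and a real implementation would piggyback credit returns on other traffic rather than poll for them explicitly.

#define MAX_PROCS 128

extern void poll(void);                          /* drains the network           */
extern void send_packet(int dest, const void *data, unsigned size);

static int credits[MAX_PROCS];   /* one credit == one packet buffer reserved for
                                    this sender in the receiver's memory         */

void credit_send(int dest, const void *data, unsigned size)
{
    while (credits[dest] == 0)   /* out of credits: the receiver has no buffer   */
        poll();                  /* credit-return messages handled here refill credits[] */
    credits[dest]--;             /* consume one receive buffer at dest           */
    send_packet(dest, data, size);
}

/* The handler for a credit-return message from process src simply executes
 *   credits[src] += returned_credits;                                           */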

A host-level credit scheme prevents overflow of host buffers, but not of NI buffers, which are usually even scarcer (because NI memory is smaller than host memory). With some protocols, the NI temporarily stops receiving messages if the NI buffers overflow. Such protocols rely on Myrinet's hardware, link-level, flow control mechanism (backpressure) to stall the sender in such a case.

The protocols described so far implement a reliable interface by depending on the reliability of the hardware. Several other protocols do not assume the network to be reliable, and either present an unreliable programming interface or implement a retransmission protocol. U-Net and Trapeze present an unreliable interface and expect higher software layers (e.g., MPI, TCP) to retransmit lost messages. Other systems do provide a reliable interface, by implementing a timeout-retransmission mechanism, either on the host or the NI. The cost of setting timers and processing acknowledgements is modest, typically no more than a few microseconds.


Fig. 2.4. Repeated send versus a tree-based multicast.

2.7 Multicast

Multicasting occurs in many communication patterns, ranging from a straightforward broadcast to the more complicated all-to-all exchange [84]. Message-passing systems like MPI [45] directly support such patterns by means of collective communication services and thus rely on an efficient multicast implementation. In addition, various higher-level systems use multicasting to update replicated shared data [9].

Today's wormhole-routed networks, including Myrinet, do not support reliable multicasting at the hardware level; multicast in wormhole-switched networks is a hard problem and the subject of ongoing research [56, 128, 136, 135]. The simplest way to implement a multicast in software is to let the sender send a point-to-point message to each multicast destination. This solution is inefficient, because the point-to-point startup cost is incurred for every multicast destination; this cost includes the data copy to NI memory (possibly preceded by a copy to a DMA area). With some NI support, the repeated copying can be avoided by passing all multicast destinations to the NI, which then repeatedly transmits the same packet to each destination. Such a 'multisend' primitive is provided by PM.

Although more efficient than a repeated send on the host, a multisend still leaves the network interface as a serial bottleneck. A more efficient approach is to organize the sender and receivers into a multicast tree. The sender is the root of the tree and transmits each multicast packet to all its children (a subset of the receivers). These children, in turn, forward each packet to their children, and so on. Figure 2.4 contrasts the repeated-send and the tree-based multicast strategies. Tree-based protocols allow packets to travel in parallel along the branches of the tree and therefore usually have logarithmic rather than linear complexity.
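For a concrete picture, the children of a node in a simple binary multicast tree rooted at the sender can be computed as follows. This is a generic illustration of the tree idea, not the forwarding code of FM/MC or any of the other systems discussed here.

/* Compute the (at most two) children of process 'me' in a binary multicast
 * tree of 'nprocs' processes rooted at 'root'. Positions are taken relative
 * to the root, so every process applies the same rule. Returns the number
 * of children stored in children[]. */
int tree_children(int me, int root, int nprocs, int children[2])
{
    int pos = (me - root + nprocs) % nprocs;     /* my position in the tree       */
    int n = 0;
    int c;

    for (c = 2 * pos + 1; c <= 2 * pos + 2 && c < nprocs; c++)
        children[n++] = (c + root) % nprocs;     /* translate back to process ids */
    return n;
}

On arrival of a multicast packet, a node delivers the packet locally and forwards a copy to each child returned by tree_children(); with this shape the multicast completes in a logarithmic number of forwarding steps.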

Tree-based protocols can be implemented efficiently by performing the forwarding of multicast packets on the NI instead of on the host [56, 68, 80, 146]. Verstoep et al. implemented such an NI-level spanning tree multicast as an extension of the Illinois Fast Messages [113] substrate; the resulting system is called FM/MC (Fast Messages/MultiCast) [146].


An important issue in the design of a multicast protocol is flow control. Multicast flow control is more complex than point-to-point flow control, because the sender of a multicast packet needs to obtain buffer space on all receiving NIs and hosts. FM/MC uses a central buffer manager (implemented on one of the NIs) to keep track of the available buffer space at all receivers. This manager handles all requests for multicast buffers and allows senders to prefetch buffer space for future multicasts. An advantage of this scheme is that it avoids deadlock by acquiring buffer space in advance; a disadvantage is that it employs a central buffer manager. An alternative scheme is to acquire buffers on the fly: the sender of a multicast packet is responsible only for acquiring buffer space at its children in the multicast tree. A more detailed discussion of FM/MC is given in Section 4.5.4.

2.8 Classification of User-Level Communication Systems

Table 2.2 classifies ten user-level communication systems and shows how each system deals with the design issues discussed in this chapter. These systems all aim for high performance and all provide a lean, low-level, and more or less generic communication facility. All systems except VMMC and U-Net were developed first on a Myrinet platform.

Most systems implement a straightforward message-passing model. The systems differ mainly in their reliability and protection guarantees and their support for multicast. Several systems use active messages [148]. With active messages, the sender of a message specifies not only the message's destination, but also a handler function, which is invoked at the destination when the message arrives.

VMMC, VMMC-2, and Hamlyn provide virtual memory-mapped and sender-based communication instead of message passing. In both models the sender specifies where in the receiver's address space the communicated data must be deposited. This model allows data to be moved directly to its destination without any unnecessary copying.

In addition to the research projects listed in Table 2.2, industry has recently created a draft standard for user-level communication in cluster environments [51, 149]. Implementations of this Virtual Interface (VI) architecture have been constructed by UC Berkeley, GigaNet, Intel, and Tandem. GigaNet sells a hardware VI implementation. The others implement VI in device driver software, in NI firmware, or both. Speight et al. discuss the performance of one hardware and one software implementation [130].


System  | Data transfer (host-NI) | Address translation          | Protection            | Control transfer     | Reliability                                                   | Multicast support
AM-II   | PIO + DMA               | DMA areas                    | Yes                   | Polling + interrupts | Reliable. NI: alternating bit. Host: sliding window.          | No
FM      | PIO                     | DMA area (receive)           | No                    | Polling              | Reliable. Host-level credits.                                 | No
FM/MC   | PIO                     | DMA area (receive)           | No                    | Polling + interrupts | Reliable. Ucast: host-level credits. Mcast: NI-level credits. | Yes (on NI)
PM      | DMA                     | Software TLB on NI           | Yes (gang scheduling) | Polling              | Reliable. ACK/NACK protocol on NI.                            | Multiple sends
VMMC    | DMA                     | Software TLB on NI           | Yes                   | Polling + interrupts | Reliable. Exploits hardware backpressure.                     | No
VMMC-2  | DMA                     | UTLB in kernel, cached on NI | Yes                   | Polling + interrupts | Reliable.                                                     | No
Hamlyn  | PIO + DMA               | DMA areas                    | Yes                   | Polling + interrupts | Reliable. Exploits hardware backpressure.                     | No
Trapeze | DMA                     | DMA to page frames           | No                    | Polling + interrupts | Unreliable.                                                   | No
BIP     | PIO + DMA               | User translates              | No                    | Polling              | Reliable. Rendezvous and backpressure.                        | No
U-Net   | DMA                     | TLB on NI (U-Net/MM)         | Yes                   | Polling + interrupts | Unreliable.                                                   | No

Table 2.2. Summary of design choices in existing communication systems.


2.9 Summary

This chapter has discussed several design issues and tradeoffs for NI protocols, in particular:

- how to transfer data between hosts and NIs (DMA or programmed I/O);

- how to avoid copying to and from DMA areas by means of address translation techniques;

- how to achieve protected, user-level network access in a multi-user environment;

- how to transfer control (interrupts, polling, or both);

- how and where to implement reliability (application, hosts, NIs);

- whether to implement additional functionality on the NIs, such as multicast.

An interesting observation is that many novel techniques exploit the programmable NI processor (e.g., for address translation). Programmable NIs are flexible, which compensates for the lack of hardware support present in the more advanced interfaces used by MPPs. Eventually, hardware implementations may be more efficient, but the availability of a programmable NI has enabled fast and easy experimentation with different protocols for commodity network interfaces. As a result, the performance of these protocols has substantially increased. In combination with the economic advantages of commodity networks, this makes such networks a key technology for parallel cluster computing.

An efficient network interface protocol, however, does not suffice; what counts is application performance. As discussed in Section 1.1, most application programmers use a parallel-programming system to develop their applications. There is a large gap between the high-level abstractions provided by these programming systems and the low-level interfaces of the communication architectures discussed in this chapter. It is not a priori clear that all types of programming systems can be implemented efficiently on all of these low-level systems.


Chapter 3

The LFC User-Level Communication System

This chapter describes LFC, a new user-level communication system. LFC distinguishes itself from other systems by the way it divides protocol tasks between the host processor and the network interface (NI). In contrast with communication systems that minimize the amount of protocol code executed on the NI, LFC exploits the NI to implement flow control, to forward multicast traffic, to reduce the overhead of network interrupts, and to perform synchronization operations. This chapter describes LFC's user interface and gives an overview of LFC's implementation on the computer cluster described in Section 1.6. LFC's NI-level protocols are described separately in Chapter 4.

This chapter is structured as follows. Section 3.1 describes LFC's programming interface. Section 3.2 states the key assumptions that we made in LFC's implementation. Section 3.3 gives an overview of the main components of the implementation. Subsequent sections describe LFC's packet format (Section 3.4), data transfer mechanisms (Section 3.5), host buffer management (Section 3.6), the implementation of message detection and dispatch (Section 3.7), and the implementation of a fetch-and-add primitive (Section 3.8). Section 3.9 discusses limitations of the implementation. Finally, Section 3.10 discusses related work.

3.1 Programming Interface

LFC aims to support the development of parallel runtime systems rather than the development of parallel applications. Therefore, LFC provides an efficient, low-level interface rather than a high-level, user-friendly interface. In particular, LFC does not fragment large messages and does not provide a demultiplexing mechanism. As a result, the user of LFC has to write code for fragmenting, reassembling, and demultiplexing messages.


This interface requires extra programming effort, but at the same time offers high performance to all clients. The most important functions of the interface are listed in Table 3.1.

3.1.1 Addressing

LFC's addressing scheme is simple. Each participating process is assigned a unique process identifier, a number in the range [0, ..., P-1], where P is the number of participating processes. This number is determined at application startup time and remains fixed during the application's lifetime. LFC maps process identifiers to host addresses and network routes.

3.1.2 Packets

LFC clients use packets to send and receive data. Packets have a maximum size (1 Kbyte by default); messages larger than this size must be fragmented by the client. LFC deliberately does not provide a higher-level abstraction that allows clients to construct messages of arbitrary sizes. Such abstractions impose overhead and often introduce a copy at the receiving side when data arrives in packet buffers that are not contiguous in memory. Clients that need only sequential access to message fragments can avoid this copy by using stream messages [90]. An efficient implementation of stream messages on top of LFC is described in Section 6.3. Even with stream messages, however, some overhead remains in the form of procedure calls, auxiliary data structures, and header fields that some clients simply do not need. Our implementation of CRL, for example, is constructed directly on LFC's packet interface, without intermediate message abstractions.

LFC distinguishes between send packets and receive packets. Send packets reside in NI memory. Clients can allocate a send packet and store data into it using normal memory writes (i.e., using programmed I/O). At the receiving side, LFC uses the DMA area approach described in Section 2.1, so receive packets reside in pinned host memory.

Both send and receive packets form a scarce resource, for which only a limited amount of memory is available. Send packets are stored in NI memory, which is typically small: the memory sizes of current NIs range from a few hundred kilobytes to a few megabytes. Receive buffers are stored in pinned host memory. In a multiprogramming environment, where user processes compete for CPU cycles and memory, most operating systems do not allow a single user to pin large amounts of memory.


void *lfc_send_alloc(int upcalls_allowed)
    Allocates a send packet buffer in NI memory and returns a pointer to the packet's data part. Parameter upcalls_allowed has the same meaning in all functions. It indicates whether upcalls can be made when network draining occurs during execution of the function.

int lfc_group_create(unsigned *members, unsigned nmember)
    Creates a multicast group and returns a group identifier. The process identifiers of all nmember members are stored in members.

void lfc_ucast_launch(unsigned dest, void *pkt, unsigned size, int upcalls_allowed)
    Transmits the first size bytes of send packet pkt to process dest and transfers ownership of the packet to LFC.

void lfc_bcast_launch(void *pkt, unsigned size, int upcalls_allowed)
    Broadcasts the first size bytes of send packet pkt to all processes except the sender and transfers ownership of the packet to LFC.

void lfc_mcast_launch(int gid, void *pkt, unsigned size, int upcalls_allowed)
    Multicasts the first size bytes of send packet pkt to all members of group gid and transfers ownership of the packet to LFC.

void lfc_poll(void)
    Drains the network and invokes the client-supplied packet handler (lfc_upcall()) for each packet removed from the network.

void lfc_intr_disable(void)
void lfc_intr_enable(void)
    Disable and enable network interrupts. These functions must be called in pairs, with lfc_intr_disable() going first.

int lfc_upcall(unsigned src, void *pkt, unsigned size, lfc_upcall_flags_t flags)
    Lfc_upcall() is the client-supplied function that LFC invokes once per incoming packet. Pkt points to the data just received (size bytes) and flags indicates whether or not pkt is a multicast packet. If lfc_upcall() returns nonzero, the client becomes the owner of pkt and must later release pkt using lfc_packet_free(). Otherwise, LFC recycles pkt immediately.

void lfc_packet_free(void *pkt)
    Returns ownership of receive packet pkt to LFC.

unsigned lfc_fetch_and_add(unsigned var, int upcalls_allowed)
    Atomically increments the F&A variable identified by var and returns the value of var before the increment.

Table 3.1. LFC's user interface. Only the essential functions are shown.


3.1.3 Sending Packets

LFC provides three send routines: lfc_ucast_launch() sends a point-to-point packet, lfc_bcast_launch() broadcasts a packet to all processes, and lfc_mcast_launch() sends a packet to all processes in a multicast group. Each multicast group contains a fixed subset of all processes. Multicast groups are created once and for all at initialization time, so processes cannot join or leave a multicast group dynamically. For reasons described in Appendix A, multicast groups may not overlap.
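As an illustration (a sketch, not code from LFC), a client could set up a multicast group at initialization time roughly as follows; MAX_MEMBERS and nr_procs are hypothetical client-side names, and only lfc_group_create() itself comes from Table 3.1.

    unsigned members[MAX_MEMBERS];
    unsigned n = 0;
    int gid;

    for (unsigned p = 0; p < nr_procs; p += 2)
        members[n++] = p;                   /* process identifiers in [0, P-1] */

    gid = lfc_group_create(members, n);     /* group membership is fixed for the run */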

To send data, a client must perform three steps (a short sketch follows the list):

1. Allocate a send packet with lfc_send_alloc(). Lfc_send_alloc() returns a pointer to a free send packet in NI memory and transfers ownership of the packet from LFC to the client.

2. Fill the send packet, usually with a client-specific header and data.

3. Launch the packet with lfc_ucast_launch(), lfc_mcast_launch(), or lfc_bcast_launch(). These functions transmit the packet to one, multiple, or all destinations, respectively. LFC transmits as many bytes as indicated by the client. Ownership of the packet is transferred back to LFC. This implies that the allocation step (step 1) must be performed for each packet that is to be transmitted. No packet can be transmitted multiple times.
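A minimal unicast send therefore looks roughly as follows; the client-side names my_hdr_t, hdr, payload, len, and dest are hypothetical, and error handling is omitted.

    void *pkt = lfc_send_alloc(1);                        /* step 1: allocate (upcalls allowed) */

    memcpy(pkt, &hdr, sizeof(my_hdr_t));                  /* step 2: fill, using programmed I/O */
    memcpy((char *)pkt + sizeof(my_hdr_t), payload, len);

    lfc_ucast_launch(dest, pkt,                           /* step 3: launch; ownership of pkt */
                     sizeof(my_hdr_t) + len, 1);          /*         returns to LFC            */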

Steps 1 and 3 may block due to the finite number of send packet buffers and the finite length of the NI's command queue. To avoid deadlock, LFC continues to receive packets while it waits for resources (see Section 3.1.4).

The send procedure avoids intermediate copying: clients can gather data from various places (registers, different memory regions) and use programmed I/O to store that data directly into send packets. An alternative is to use DMA transfers, which consume fewer processor cycles and require fewer bus transfers. For message sizes up to 1 Kbyte, however, programmed I/O moves data from host memory to NI memory faster than the NI's DMA engine (see Figure 2.2). Furthermore, programmed I/O can easily move data from any virtual address in the client's address space to the NI, while DMA can be used only for pinned pages.

The send procedure separates packet allocation and packet transmission. This allows clients to hide the overhead of packet allocation. The CRL implementation on LFC, for example, allocates a new send packet immediately after it has sent a packet. Since the CRL runtime frequently waits for a reply message after sending a request message, it can often hide the cost of allocating a send packet.


3.1.4 Receiving Packets

LFC copies incoming network packets from the NI to receive packets in host memory. LFC delivers receive packets to the receiving process by means of an upcall [38]. An upcall is a function that is defined in one software layer and that is invoked from lower-level software layers. Upcalls are used to perform application-specific processing at times that cannot be predicted by the application. Since LFC does not know what a client aims to do with its incoming packets, the client must supply an upcall function. This function, named lfc_upcall(), is invoked by LFC each time a packet arrives; it transfers ownership of a receive packet from LFC to the client. Usually, this function either copies the packet or performs a computation using the packet's contents. (The computation can be as simple as incrementing a counter.) The upcall's return value determines whether or not ownership of the packet is returned to LFC. If the client keeps the packet, then the packet must later be released explicitly by calling lfc_packet_free(); otherwise LFC recycles the packet immediately. The main advantage of this interface is that it does not force the client to copy packets that cannot be processed immediately. Panda exploits this property in its implementation of stream messages (see Section 6.3).

Many systems based on active messages [148] invoke message handlers while draining the network. (Draining the network consists of moving packets from the network to host memory and is necessary to avoid congestion and deadlock.) The clients of such systems must be prepared to handle packets whenever draining can occur. Unfortunately, draining sometimes occurs at inconvenient times. Consider, for example, the case in which a message handler sends a message. If the system drains the network while sending this message, it may recursively activate the same message handler (for another incoming message). Since the system does not guarantee atomic handler execution, the programmer must protect global data against interleaved accesses by multiple handlers. Using a lock to turn the entire handler into a critical region leads to deadlock. The recursive invocation will block when it tries to enter the critical section occupied by the first invocation. The first invocation, however, cannot continue, because it is waiting for the recursive invocation to finish. Chapter 6 studies this upcall problem in more detail and compares several solutions.

LFC separates network draining and handler invocation [28]. Draining occurs whenever an LFC routine waits for resources (e.g., a free entry in the send queue), when the user polls, or when the NI generates an interrupt. During draining, however, LFC will invoke lfc_upcall() only if one of the following conditions is satisfied:

1. The client has not disabled network interrupts.


2. The client invokes lfc_poll().

3. The client invokes an LFC primitive with its upcalls_allowed parameter set to true (nonzero).

All LFC calls that potentially need to drain the network take an extra parameter named upcalls_allowed. If the client invokes a routine with upcalls_allowed set to false (zero), then LFC will drain the network if it needs to, but will not make any upcalls. If the parameter is true, then LFC will drain the network and make upcalls.
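The following fragment illustrates the intended use of upcalls_allowed; it is a sketch, not code from an LFC client. With the flag set to 0, the calls may still drain packets into host memory, but lfc_upcall() is not invoked until the client later polls.

    void *pkt = lfc_send_alloc(0);           /* may drain, but makes no upcalls */
    /* ... fill pkt ... */
    lfc_ucast_launch(dest, pkt, size, 0);    /* likewise: drain allowed, upcalls suppressed */

    lfc_poll();                              /* later: drain and deliver pending packets */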

Separating draining and handler invocation is no panacea. If a client enables draining but disables upcalls, LFC must buffer incoming network packets. To prevent LFC from running out of receive packets the client must implement flow control. This is crucial, because there is nothing LFC, or any communication system, can do against clients that send an unbounded amount of data and do not consume that data. The system will either run out of memory or deadlock because it stops draining the network. Since LFC's current clients do not implement flow control (at least, not for all their communication primitives), they cannot disable upcalls and must therefore always be prepared to handle incoming packets.

With the current interface, clients cannot specify that the network must be drained asynchronously, using interrupts, without LFC making upcalls. If a process enters a phase in which it cannot process upcalls and in which it does not invoke LFC primitives, then the network will not be drained. The MPI implementation described in Section 7.4, for example, does not use interrupts. Draining occurs only during the invocation of LFC primitives. Since these primitives are invoked only when the application invokes an MPI primitive, there is no guarantee that a process will frequently drain the network. If a process sends a large message to a process engaged in a long-running computation (i.e., a process that does not drain), then LFC's internal flow-control mechanism (see Section 4.1) will eventually stall the sender, even if the receiver has space to store incoming packets.

3.1.5 Synchronization

LFC provides an efficient fetch-and-add (F&A) primitive [127]. A fetch-and-add operation fetches the current value of a logically shared integer variable and then increments this variable. Since these actions are performed atomically, processes can use the F&A primitive to synchronize their actions (e.g., to obtain the next free slot in a shared queue).
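For example, a client could claim slots of a shared queue roughly as follows; QUEUE_HOME, QUEUE_SLOTS, store_item(), and item are hypothetical names introduced only for illustration.

    /* Claim the next slot of a logically shared queue. The F&A variable
       lives on the NI identified by QUEUE_HOME (illustrative name). */
    unsigned ticket = lfc_fetch_and_add(QUEUE_HOME, 1);   /* second argument: upcalls allowed */
    unsigned slot   = ticket % QUEUE_SLOTS;
    store_item(slot, item);                                /* hypothetical helper */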

LFC stores F&A variables in NI memory. The implementation provides one F&A variable per NI and initializes all F&A variables to zero. There is no function to set or reset F&A variables to a specific value, but such a function could easily be added.

Panda uses a single F&A variable to implement totally-ordered broadcasting (see Chapter 6). Totally-ordered broadcasting, in turn, is used by the Orca runtime system to update replicated objects in a consistent manner (see Section 7.1). Karamcheti et al. describe another interesting use of fetch-and-add in their paper on pull-based messaging [76]. In pull-based messaging senders do not push message data to receivers. Instead, a sender transmits only a message pointer to the receiver. The receiver pulls in the message data when it is ready to receive. Pull-based messaging reduces contention.

3.1.6 Statistics

Besides the functions listed in Table 3.1, LFC provides various statistics routines. LFC can be compiled such that it counts events such as packet transmissions, DMA transfers, etc. The statistics routines are used to reset, collect, and print statistics.

3.2 Key Implementation Assumptions

We have implemented LFC on the computer cluster described in Section 1.6. The implementation makes the following key assumptions:

1. The network hardware is reliable.

2. Multicast and broadcast are not supported in hardware.

3. Each host is equipped with an intelligent NI.

4. Interrupt processing is expensive.

Assumptions 1–3 pertain to network architecture. Among these three, the first assumption, reliable network hardware, is the most controversial. We assume that the network hardware neither drops nor corrupts network packets. With the exception of custom supercomputer networks (e.g., the CM-5 network [91]), however, network hardware is usually considered unreliable. This may be due to the network material (e.g., insufficient shielding from electrical interference) or to the network architecture (e.g., lack of flow control in the network switches).

The second assumption, no hardware multicast, is satisfied by most switched networks (e.g., ATM), but not by token rings and bus-based networks (e.g., Ethernet). We assume that multicast and broadcast services must be implemented in software. Efficient software multicast schemes use spanning tree protocols in which network nodes forward the messages they receive to their children in the spanning tree (see Section 2.7).

The third assumption, the presence of an intelligent NI, enables the execution of nontrivial protocol code on the NI. Intelligent NIs are used both in supercomputers (e.g., FLASH [85] and the IBM SP/2 [129]) and in cluster architectures (e.g., Myrinet and ATM clusters). NI-level protocols form an important part of LFC's implementation (see Chapter 4).

The fourth assumption, expensive interrupts, is related to host-processor architecture and operating system implementations. Modern processors have deep pipelines, which must be flushed when an interrupt is raised. In addition, commercial operating systems do not dispatch interrupts efficiently to user processes [143]. We therefore assume that interrupt processing is slow and that interrupts should be used only as a last resort.

3.3 Implementation Overview

LFC consists of a library that implements LFC's interface and a control program that runs on the NI and takes care of packet transmission, receipt, forwarding, and flow control. The control program implements the NI-level protocols described in Chapter 4. In addition, several device drivers are used to obtain efficient access to the network device. The library, the NI control program, and the device drivers are all written in C.

3.3.1 Myrinet

Myrinet implements hardware flow control on the network links between communicating NIs and has a very low bit-error rate [27]. If all NIs use a single, deadlock-free routing algorithm and are always ready to process incoming packets, then Myrinet will not drop any packets. In a single administrative domain of modest size, such as our Myrinet cluster, these conditions can be satisfied. Consequently, LFC assumes that Myrinet neither drops nor corrupts packets. The limitations of this approach are discussed in more detail in Section 3.9.

3.3.2 Operating System Extensions

All cluster nodes run the Linux operating system (Red Hat distribution 5.2, kernel version 2.0.36). To support user-level communication, we added several device drivers to the operating system.

Myricom's Myrinet device driver was modified so that at most one user at a time can open the Myrinet network device. When a user process opens the device, the driver maps the NI's memory contiguously into the user's address space, so the user process can read and write NI memory using normal load and store instructions (i.e., programmed I/O). The driver also processes all interrupts generated by the NI. Each time the driver receives an interrupt, it sends a SIGIO software signal to the process that opened the device.

At initialization time, LFC's library allocates a fixed number of receive packets from the client's heap. The pages on which these packets are stored are pinned by means of the mlock() system call. The client can specify the number of receive packets that the library is to allocate. To transfer data from its memory into the receive packets in host memory, the NI's control program needs the physical addresses of the receive packets. To obtain these addresses, we implemented a small pseudo device driver that translates a page's virtual address to the corresponding physical address. (A pseudo device driver is a kernel module with a device driver interface but without associated hardware.) LFC's library invokes this driver at initialization time to compute each packet's physical address.

Writes to device memory (e.g., NI memory) are normally not cached by the writing processor, because the data written to device memory is unlikely to be read back again by the processor. Moreover, in the case of write-back caching, the device will not observe the writes until they happen to be flushed from the cache. By disabling caching for device memory, however, performance is reduced because a bus transaction must be set up for each word that is written to the device. Write-back caches, in contrast, write complete cache lines to memory, which is beneficial when a large contiguous chunk of data is written.

To speed up writes to NI memory, LFC uses the Pentium Pro's ability to set the caching properties of specific virtual memory ranges [71]. A small pseudo device driver enables write combining for the virtual memory range to which the NI's memory is mapped. The Pentium Pro combines all writes to a write-combining memory region in on-processor write buffers. Writes are delayed until the write buffer is full or until a serializing instruction is issued. By aggregating writes to NI memory, the processor can transfer these writes in larger bursts to the I/O bus, which reduces the number of I/O bus arbitrations.

As discussed in Section 2.2, the use of write combining considerably speeds up programmed I/O data transfers from host memory to NI memory. The main disadvantage of applying write combining to a memory region is that writes to that region may be reordered. Two successive writes are executed in program order only if they are separated by a serializing instruction. When necessary, we use the Pentium Pro's atomic increment instruction to order writes.
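The fragment below sketches this ordering concern. It is an illustration only: the GCC builtin is used as a stand-in for the atomic increment mentioned above, and ni_pkt and ni_size_field are hypothetical pointers into the mapped, write-combined NI memory.

    static volatile int wc_order_dummy;

    memcpy(ni_pkt, data, size);               /* PIO writes; may be combined and reordered */

    /* A locked read-modify-write acts as the serializing instruction, so the
       payload becomes visible to the NI before the size/ready word is written. */
    __sync_fetch_and_add(&wc_order_dummy, 1);

    *ni_size_field = size;                    /* only now expose the packet */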

Page 68: Communication Architectures for Parallel-ProgrammingSystemsPanda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl,¨ and me. Rutger Hofman implemented Panda 4.0

46 TheLFC User-Level CommunicationSystem

Type       Resource             Description
Hardware   Send DMA engine      Transfers packets from NI memory to the network.
           Receive DMA engine   Transfers packets from the network to NI memory.
           Host/NI DMA engine   Mainly used to transfer packets from NI memory to host
                                memory. Also used to retrieve and update status information.
Software   Send credits         A send credit represents a data packet buffer in some NI's
                                memory. For each destination NI, there is a separate
                                resource queue.
           Host receive buffer  A receive buffer in host memory.

Table 3.2. Resources used by the NI control program.

3.3.3 Library

LFC's library implements all routines listed in Table 3.1. The library manages send and receive buffers, communicates with the NI control program, and delivers packets to clients. Host-NI communication takes place through shared variables in NI memory, through DMA transfers between host and NI memory, and through signals generated by the kernel in response to NI interrupts.

3.3.4 NI Control Program

The main task of the NI control program is to send and receive packets. Outgoing and incoming packets use various hardware and software resources as they travel through the NI. The control program acquires and activates these resources in the right order for each packet. If the control program cannot acquire the next resource that a packet needs, then it queues the packet on a resource-specific (software) resource queue. All resources listed in Table 3.2, except the host receive buffers, have an associated resource queue.

Three hardware resources are available for packet transfers: the send DMA engine, the receive DMA engine, and the host/NI DMA engine. These engines operate asynchronously: typically, the NI processor starts a DMA transfer and continues with other work, periodically checking if the DMA engine has finished its transfer. (Alternatively, a DMA engine can generate an interrupt when it has completed a transfer. LFC's control program, however, does not use interrupts.)

Software resources are used to implement flow control, both between communicating NIs and between an NI and its host. Send credits represent buffer space in some NI's memory and are used to implement NI-to-NI flow control (see Section 4.1). Before an NI can send a data packet to another NI, it must acquire a send credit for that NI. Similarly, before an NI can copy a data packet to host memory, it must obtain a host buffer. Host buffer management is described in Section 3.6.

To utilize all resources efficiently, LFC's control program is structured around resources rather than packets. The program's main loop polls several work queues and checks if resources are idle. When the control program finds a task on one of its work queues, it executes the task up to the point where the task needs a resource that is not available. The task is then appended to the appropriate resource queue. When the control program finds that a resource is idle, it checks the resource's queue for new work. If the queue is nonempty, a task is dequeued and the resource is put back to use again.

The resource-centric structure optimizes resource utilization. In particular, all DMA engines can be active simultaneously. By allowing the receive DMA of one packet to proceed concurrently with the host DMA of another packet, we pipeline packet transfers and obtain better throughput. A packet-centric program structure would guide a single packet completely through the NI before starting to work on another packet. With this approach, the control program uses at most one DMA engine at a time, leaving the other DMA engines idle.

The resource-centric approach increases latency, because packets are enqueued and dequeued each time they use some resource. To attack this problem, LFC's control program contains a fast path at the sending side, which skips all queueing operations if all resources that a packet needs are available. We also experimented with a receiver-side fast path, but found that the latency gains with this extra fast path were small. To avoid code duplication, we therefore removed this fast path.

The control program acts in response to three event types:

1. send requests

2. DMA transfer completions

3. timer expiration

The control program's main loop polls for these events and processes them.

Send requests are created by the host library in response to a client invocation of one of the packet launch routines. Each send request is stored (using programmed I/O) in a queue in NI memory. The control program's main loop polls this queue for new requests.
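A highly simplified sketch of this resource-centric main loop is shown below; the helper names (poll_send_queue(), dma_done(), handle_send_request(), and so on) are illustrative placeholders, not LFC's actual control-program code.

    for (;;) {
        SendDesc *req;

        while ((req = poll_send_queue()) != NULL)   /* event 1: host send requests */
            handle_send_request(req);               /* may park the task on a resource queue */

        if (dma_done(SEND_DMA))                     /* event 2: DMA completions */
            advance_packet(SEND_DMA);
        if (dma_done(RECV_DMA))
            advance_packet(RECV_DMA);
        if (dma_done(HOST_DMA))
            advance_packet(HOST_DMA);

        if (timer_expired())                        /* event 3: delayed-interrupt timer */
            maybe_raise_interrupt();

        restart_idle_resources();                   /* give queued work to idle DMA engines */
    }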

DMA transfers are used to send and receive packets and to move received packets to host memory. DMA transfer completion is signaled by means of bits in the NI's interrupt status register. When a packet transfer completes, the control program moves the packet to its next processing stage and checks if it can put the DMA engine to work again. The control program always restarts the receive DMA engine so that it is always prepared for incoming packets.

Myrinet's NI contains a timer with a granularity of 0.5 µs. The control program uses this timer to delay interrupts (see Section 4.4). Timer expiration is signaled by means of certain bits in the NI's interrupt status register.

[Figure 3.1. LFC's packet format: a header of 2-byte fields (Myrinet tag, LFC tag, source, size, credits, tree, deadlock channel) followed by a user data area of up to 1024 bytes.]

3.4 Packet Anatomy

LFC uses several packet types to carry user and control data across the network. As shown in Figure 3.1, each packet consists of a 12-byte header and a 1024-byte user data buffer. The buffer stores the data that a client wishes to transfer. LFC does not interpret the contents of this buffer. When LFC allocates a send packet (in lfc_send_alloc()) or passes a receive packet to lfc_upcall(), the client receives a pointer to the data buffer. Clients must neither read nor write the packet header, but this is not enforced.

The header's tag field consists of a 2-byte Myrinet tag. Every Myrinet communication system can obtain its own tag range from Myricom. This tag range is different from the tag ranges assigned to other registered Myrinet communication systems. NIs can use the Myrinet tag to recognize and discard misrouted packets from different communication systems. All LFC packets carry Myrinet tag 0x0450.

LFC uses an additional tag field to identify different types of LFC packets. Unicast and multicast packets, for example, use different tags, because the control program treats unicast and multicast packets differently; multicast packets are forwarded, whereas unicast packets are not. Table 3.3 lists the most important packet tags.

Packet class   LFC tag          Function
Control        CREDIT           Explicit credit update
               CHANNEL_CLEAR    Deadlock recovery
               FA_REPLY         Fetch-and-add reply
Data           UCAST            Unicast packet
               MCAST            Multicast packet
               FA_REQUEST       Fetch-and-add request

Table 3.3. LFC's packet tags.

We distinguish between control packets and data packets. Control packets require only a small amount of NI-level processing. Unlike data packets, control packets are never queued; when the NI receives a control packet, it processes it immediately and then releases the buffer in which the packet arrived. Two packet buffers suffice to receive and process all control packets. A second buffer is needed because the control program always restarts the receive DMA engine (with a free packet buffer as the destination address) before it processes the packet just received.

Data packets go through more processing stages and use more resources. Each UNICAST packet, for example, needs to be copied to host memory, but this can be done only when a free host buffer and the DMA engine are available. When either resource is unavailable, the control program queues the packet and starts working on another packet. Since some resources (e.g., host buffers) may be unavailable for an unpredictable amount of time, the control program may have to buffer many data packets. The total number of packets that needs to be buffered is bounded by LFC's flow control protocol, which stalls senders when receiving NIs run out of free packet buffers.

The source field in the packet header identifies the sender of a packet. At LFC initialization time, the client passes to each LFC process the number of participating processes and assigns a unique process identifier to each process. Since LFC allows only one process per processor, this process identifier also identifies the processor and the NI. Each time an NI transmits a packet, LFC stores the NI's identifier in the source field.


[Figure 3.2. LFC's data transfer path: (1) programmed I/O from user data in the sender's address space into an NI send packet, (2) network send and receive DMA transfers between NIs, (3) DMA transfer from the NI receive packet to a pinned host receive packet in the receiver's address space.]

The user data part of send packets has a maximum size of 1024 bytes, but clients need not fill the entire data part. The size field specifies how many bytes have been written into the user data part of a packet. These bytes are called the valid user bytes and they must be stored contiguously at the beginning of a packet. When LFC transfers a packet, it transfers the packet header and the valid user bytes, but not the unused bytes at the end of the packet. This yields low latency for small messages and high throughput for large messages.

The remaining header fields are described in detail in Chapter 4. The credits field is used to piggyback flow control information to the packet's destination. The tree field identifies multicast trees. The deadlock channel field is used during deadlock recovery.

LFC's outgoing packets are tagged (in hardware) with a CRC that is checked (in software) by LFC's control program at the receiving side. CRC errors are treated as catastrophic and cause LFC to abort. Lost packets are not detected.

3.5 Data Transfer

LFC transfers data in packets which are allocated from three packet buffer pools. Each NI maintains a send buffer pool and a receive buffer pool. Hosts maintain only a receive buffer pool; all buffers in this pool are stored in pinned memory. The data transfer path in Figure 3.2 for unicast packets shows how these three pools are used.

At the sending side, the client allocates send packets from the send buffer pool in NI memory (transfer 1 in Figure 3.2). Programmed I/O is used to copy client data directly, without operating system intervention, into a send packet. The client calls one of LFC's launch routines, which writes a send descriptor into the NI's send queue (see Figure 3.3). The send descriptor contains the packet's destination and a reference to the packet. The NI, which polls the send queue, inspects the send descriptor. When credits are available for the specified destination, the descriptor is moved to the transmit queue; the packet will be transmitted when the descriptor reaches the head of this queue (transfer 2 in Figure 3.2). After the packet has been transmitted, it is returned to the NI's send packet pool.

[Figure 3.3. LFC's send path and data structures: client data, send descriptors, the host's send queue, the send buffer pool, the per-destination blocked-sends queues, and the transmit queue on the network interface.]

When no send credits are available, the descriptor is added to a blocked-sends queue associated with the packet's destination. When credits flow back from some destination, descriptors are moved from that destination's blocked-sends queue to the transmit queue. The details of the credit algorithm are described in Chapter 4.

The destination NI receives the incoming packet into a free packet buffer allocated from its receive buffer pool. This pool is partitioned evenly among all NIs: each NI owns a fixed number of the buffers in the receive pool and can fill these buffers, and no more, by sending packets. Each buffer corresponds to one send credit and filling a buffer corresponds to consuming a send credit.

Each host is responsible for passing the addresses of free host receive buffers to its NI. These addresses are stored in a shared queue in NI memory (see Section 3.6). As soon as a free host receive buffer is available, the NI copies the packet to host memory (transfer 3 in Figure 3.2). After this copy, the NI receive buffer is returned to its pool. When no host receive buffers are available, the NI simply does not copy packets to the host and does not release any of its receive buffers. Eventually, senders will be stalled and remain stalled until the receiving host posts new buffers.

[Figure 3.4. Two ways to transfer the payload and status flag of a network packet from a packet buffer in NI memory to a packet buffer in host memory: a simple but expensive manner that uses separate data and status DMA transfers, and the optimized transfer that combines the payload, the displacement field, and the FULL status word (polled by the host) in a single DMA transfer.]

Each host receive buffer contains a status field that indicates whether the control program has copied data into the buffer. The host uses this field to detect incoming messages. Copying a packet from an NI receive buffer to a host receive buffer involves two actions: copying the packet's payload to host memory and setting the status field in the host buffer to FULL. For efficiency, LFC folds both actions into a single DMA transfer. This is illustrated in Figure 3.4, which also shows a naive implementation that uses two DMA transfers to copy the packet's payload and status field. To avoid transferring the unused part of the receive packet's data field, the status field must be stored right after the packet's payload. (If it were stored in front of the data, the host could believe it received a packet before all of that packet's data had arrived.) The position of the FULL word therefore depends on the size of the packet's payload. The host, however, needs to know in advance where to poll for the status field. The control program therefore transfers the packet's payload and the FULL word to the end of the host receive buffer. In addition, the control program transfers a word that holds the payload's displacement relative to the beginning of the host's receive buffer.
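The sketch below illustrates one possible host-side view of this layout and of packet detection; the type and field names, and the exact placement of the displacement word, are assumptions made for illustration only.

    typedef struct {
        char              data[LFC_PKT_SIZE];  /* payload lands at the end of this area */
        volatile unsigned displacement;        /* offset of the payload within data[]   */
        volatile unsigned status;              /* EMPTY or FULL; polled by the host     */
    } host_rcv_buf_t;

    int packet_arrived(host_rcv_buf_t *b)
    {
        if (b->status != FULL)
            return 0;
        /* payload occupies data[displacement .. LFC_PKT_SIZE-1] */
        deliver(b->data + b->displacement,               /* deliver() is a hypothetical helper */
                LFC_PKT_SIZE - b->displacement);
        return 1;
    }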

Once filled, the host buffer is passed to the client-supplied upcall routine. The client decides when it releases this buffer. Once released, the buffer is returned to the host receive buffer pool.


[Figure 3.5. LFC's receive data structures: the host receive queue in host memory, with its head, empty, and tail pointers, and the shared queue of buffer addresses in network interface memory, which the NI consumes when it DMAs packets to host memory.]

Multicast packets follow the same path as unicast packets, but may additionally be forwarded by receiving NIs to other destinations. An NI receive buffer that contains a multicast packet is not returned to its pool until the multicast packet has been forwarded to all forwarding destinations.

3.6 Host Receive Buffer Management

Each host has a pool of free host receive buffers. The client specifies the initial number of buffers in this pool. At initialization time, the library obtains the physical1 address of each buffer in the pool and stores this address in the buffer's descriptor.

Besides the pool, the library manages a receive queue of host receive buffers (see Figure 3.5). The library maintains three queue pointers. Head points to the first FULL, but unprocessed packet buffer. This is the next buffer that will be passed to lfc_upcall(). Empty points to the first EMPTY receive buffer. This is the next buffer that the library expects to be filled by the NI. Tail points to the last buffer in the queue. When the library allocates new receive buffers from the pool, it appends them to the receive queue and advances tail.
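The following structure sketches this bookkeeping; the names are illustrative (the real library keeps equivalent state), and host_rcv_buf_t refers to the buffer layout sketched in Section 3.5.

    typedef struct {
        host_rcv_buf_t *buf[MAX_RCV_BUFS];  /* buffers currently on the receive queue   */
        unsigned        head;               /* first FULL, but unprocessed, buffer      */
        unsigned        empty;              /* first EMPTY buffer, next filled by the NI */
        unsigned        tail;               /* last buffer appended from the pool        */
    } rcv_queue_t;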

Each time the library appends a buffer to its receive queue, it also appends the physical address of the buffer to a shared queue in NI memory. When the NI control program receives a network packet, it tries to fetch an address from the shared queue and copy the packet to this address. When the shared queue is empty, the control program defers the transfer until the queue is refilled by the host library.

1 To be precise, we obtain the bus address. This is a device's address for a particular host memory location. On some architectures, this address can be different from the physical address used by the host's CPU, but on Intel's x86 architecture the bus address and the physical address are identical.

It is not necessary to separate the receive queue and the pool, but doing so often reduces the memory footprint of the host receive buffers. In most cases, we keep only a small number of buffers on the receive queue. As long as the client is able to use and reuse these buffers, only a small portion of the processor's data cache is polluted by host receive buffers. In some cases, however, clients need more buffers, which are then allocated from the pool and added to the receive queue. In those cases, a larger portion of the cache will be polluted by host receive buffers.

When the library needs to drain (see Section 3.1.4), it performs two actions:

1. If the packet pointed to by empty has been filled, the library advances empty. If empty cannot be advanced (because all packets are full) and upcalls are not allowed, then the library appends a new buffer from the pool to both queues.

2. If head does not equal empty and upcalls are allowed, the library dequeues head and calls lfc_upcall(), passing head as a parameter. Head is then advanced. If there are no buffers left in the receive queue, then the library appends a new buffer from the pool to both queues before calling lfc_upcall().

Packets freed by clients are appended to both queues, unless the queue already contains a sufficient number of empty packets. In the latter case, the packet is returned to the buffer pool.
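A condensed sketch of these two drain actions follows; it reuses the illustrative rcv_queue_t type from above, and all helper names (is_full(), next(), all_full(), append_from_pool(), queue_is_empty(), deliver_via_lfc_upcall()) are hypothetical.

    void drain_once(rcv_queue_t *q, int upcalls_allowed)
    {
        /* Action 1: advance 'empty' past buffers the NI has filled. */
        if (is_full(q->buf[q->empty]))
            q->empty = next(q->empty);
        else if (all_full(q) && !upcalls_allowed)
            append_from_pool(q);                 /* keep the NI supplied with buffers */

        /* Action 2: deliver one filled buffer, if upcalls are allowed. */
        if (q->head != q->empty && upcalls_allowed) {
            host_rcv_buf_t *b = q->buf[q->head];
            q->head = next(q->head);
            if (queue_is_empty(q))
                append_from_pool(q);
            deliver_via_lfc_upcall(b);
        }
    }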

3.7 Message Detection and Handler Invocation

LFC delivers packets to a receiving process by invoking lfc_upcall(). This routine is either invoked synchronously, in the context of a client call to a draining LFC routine, or asynchronously, in the context of a signal handler. Here, we discuss the implementation of both mechanisms.

Each invocation of lfc_poll() checks if the control program has copied a new packet to host memory. If lfc_poll() finds a new packet, it invokes lfc_upcall(), passing the packet as a parameter. To allow fair scheduling of the client computation and the packet handler, lfc_poll() delivers at most one network packet.

The check for the next packet is implemented by reading the status field of the next undelivered packet buffer (empty in Figure 3.5). When this buffer's address was passed to the control program, its status field is set to EMPTY. The poll succeeds when the host detects that this field has been set to FULL. This polling strategy is efficient, because the status field resides in host memory and will be cached (see Section 2.5).
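In terms of the earlier sketches, lfc_poll() could look roughly as follows (illustrative only; it reuses the hypothetical rcv_queue_t and drain_once() from Section 3.6).

    void lfc_poll(void)
    {
        host_rcv_buf_t *b = rcv_queue.buf[rcv_queue.empty];

        if (b->status == FULL)            /* cheap: the status word is cached host memory */
            drain_once(&rcv_queue, 1);    /* advance 'empty' and deliver at most one packet */
    }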

Asynchronous packet delivery is implemented cooperatively by LFC's control program, the Myrinet device driver, and LFC's library. In principle, LFC's control program generates an interrupt after it has copied a packet to host memory. However, since the receiving process may poll, the control program delays its interrupts. Myrinet NIs contain an on-board timer with 0.5 µs granularity. When a packet arrives, the control program sets this timer to a user-specified timeout value (unless the timer is already running). An interrupt is generated only if the host fails to process a packet before the timer expires. The details of this polling watchdog mechanism are described in Chapter 4.

Interrupts generated by the control program are received by the operating system kernel, which passes them to the Myrinet device driver. The driver transforms the interrupt into a user-level software signal and sends this network signal to the client process. These signals are normally caught by the LFC library. The library's signal handler checks if the client is running with interrupts enabled. If this is the case, the signal handler invokes lfc_poll() to process pending packets; otherwise it returns immediately.

Recall that clients may dynamically disable and enable network interrupts. For efficiency, the interrupt status manipulation functions are implemented by toggling an interrupt status flag in host memory, without any communication or synchronization with the control program. Before sending an interrupt, the control program fetches (using a DMA transfer) the interrupt status flag from host memory. No interrupt is generated when the flag indicates that the client disabled interrupts.
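A sketch of the flag-based enable/disable functions and of the signal handler's check is shown below; the variable and handler names are illustrative, but the behavior follows the description above.

    static volatile int intr_enabled = 1;     /* interrupt status flag in host memory */

    void lfc_intr_disable(void) { intr_enabled = 0; }
    void lfc_intr_enable(void)  { intr_enabled = 1; }

    static void network_signal_handler(int sig)
    {
        (void)sig;
        if (intr_enabled)     /* ignore the signal if the client disabled interrupts */
            lfc_poll();       /* deliver pending packets */
    }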

Since the library does not synchronize with the control program when it manipulates the interrupt status flag, two races can occur:

1. The control program may raise an interrupt just after the client disabled interrupts. Unless precautions are taken, the signal handler will invoke the packet handler even though the client explicitly disabled interrupts. To solve this problem, LFC's signal handler always checks the interrupt status flag before invoking lfc_poll() and ignores the signal if the race occurred. In practice, this race does not occur very often, so the advantage of inexpensive interrupt manipulation outweighs the cost of spurious interrupts.

2. The control program may decide not to raise an interrupt just after the client (re-)enabled interrupts. This race is fatal for clients that rely on interrupts for correctness. To avoid lost interrupts, the control program continues to generate periodic interrupts for a packet until the host has processed that packet (see Chapter 4).


3.8 Fetch-and-Add

The F&A implementation is straightforward. Every NI stores one F&A variable, which is identified by the var argument in lfc_fetch_and_add() (see Table 3.1). Lfc_fetch_and_add() performs a remote procedure call to the NI that stores the F&A variable. The process that invoked lfc_fetch_and_add() allocates a request packet, tags it with tag FA_REQUEST, and appends it to the NI's host send queue.

When the destination NI receives the request, it increments its F&A variable and stores the previous value in a reply packet. The reply packet, a control packet tagged FA_REPLY, is then appended to the NI's transmit queue.

The implementation assumes that each participating process can issue at most one F&A request at a time. To allow multiple outstanding requests, we need a form of flow control that prevents the NI that processes those requests from running out of memory as it creates and queues FA_REPLY packets. FA_REQUEST packets consume credits, but those credits are returned as soon as the request has been received and possibly before a reply has been generated.

When the reply arrives at the requesting NI, the control program copies the F&A result from the reply to a location in NI memory that is polled by lfc_fetch_and_add(). When lfc_fetch_and_add() finds the result, it returns it to the client.

This implementation is simple and efficient. The FA_REQUEST packet is handled entirely by the receiving NI, without any involvement of the host to which the NI is attached. This is particularly important when processes issue many F&A requests. Many of these requests would either be delayed or generate interrupts if they had to be handled by the host.

The implementation uses a data packet for the request and a control packet for the reply. The request is a data packet only because the current implementation of LFC does not allow hosts to send control packets. Although this is easy to add, it will increase the send overhead for all packets, because the NI will have to check if the host sends a data or a control packet.

3.9 Limitations

The implementation of LFC on Myrinet makes a number of simplifying assumptions, some of which limit its applicability in a production environment. We distinguish between functional limitations and security violations.

3.9.1 Functional Limitations

LFC's functional limitations are all related to the assumption that LFC will be used in a homogeneous, dedicated computing environment.


Homogeneous Hosts

LFC assumes that all host processors use the same byte order. Adding byte swapping for packet headers is straightforward and will add little overhead. (It is the client's responsibility to convert the data contained in network packets; LFC does not know what type of data clients store in their packets.) Since our cluster's host processors are little-endian and the NI is big-endian, some byte swapping is already used for data that needs to be interpreted both by the host processor and the NI. This concerns mainly shared descriptors and packet headers; user data contained in packets is not interpreted by LFC and is never byte-swapped.

Device Sharing

LFC can service at most one user process per host, mainly because LFC's control program does not separate the packets from different users. In our current environment (DAS), the Myrinet device driver returns an error if a second user tries to obtain access to the Myrinet device.

Multiple users can be supported by giving different users access to different parts of the NI's memory. This can be achieved by setting up appropriate page mappings between the user's virtual address space and the NI's memory. If this is done, the NI's control program must poll multiple command queues located on different memory pages, which is likely to have some impact on performance.

Newer Myrinet network interfaces provide a doorbell mechanism. This mechanism detects host processor writes to certain regions and appends the target address and value written to a FIFO queue. With this doorbell mechanism, the NI need not poll all the pages that different processes can write to; it need only poll the FIFO queue.

Recall that LFC partitions the available receive buffer space in NI memory among all participating nodes. If the NI's memory is also partitioned to support multiprogramming, then the available buffer space per process will decrease noticeably. Each process will be given fewer send credits per destination process, which reduces performance. A simple, but expensive solution is to buy more memory. However, this may only be possible to a limited extent. A less expensive and more scalable solution is to employ swapping techniques such as described in Section 2.4.

Fixed Process Configuration

LFC assumes that all of an application's processes are known at application startup time; users cannot add processes to or remove them from the initial set of processes. The main obstacle to adding processes dynamically is that LFC evenly divides all NI receive buffers (or rather, the send credits that represent them) among the initial set of processes. When a process is added, send credits must be reallocated in a safe way.

Reliable Network

LFC assumes that the network hardware never drops or corrupts packets. While LFC tests for CRC errors in incoming packets, it does not recover from such errors. Lost packets are not detected at all.

A retransmission mechanism would allow an LFC application to tolerate transient network failures. In addition, retransmission allows communication protocols to drop packets when they run out of resources. With retransmission, LFC would not need to partition NI receive buffers among all senders. Chapter 8 considers alternative implementations which do retransmit dropped and corrupted packets.

3.9.2 Security Violations

LFC does not limit a user process's access to the NI and is therefore not secure. Any user process with NI access can crash any other process or even the kernel. With a little kernel support, though, LFC can be made secure; the techniques to achieve secure user-level communication are well known. Moreover, LFC's send and receive methods require relatively simple and inexpensive security mechanisms. Below, we describe the security problems in more detail. Where necessary, we also provide solutions. These solutions have not been implemented.

Network Interface Access

LFC (or rather, the Myrinet device driver used by LFC) allows an arbitrary process to map all NI memory into its address space and does not restrict access to this memory. As a result, users can freely modify the NI control program. Once modified, the control program can both read and write all host memory locations.

Access to the NI can be restricted by mapping only part of the NI's memory into a process's address space. The process is not given any access to the control program or to another process's pages. LFC can easily be modified to work with a device driver that implements this type of protection, and the expected performance impact is small. The page mappings can be set up at application initialization time so that the cost of setting up these mappings is not incurred on the critical communication path.


Address Translation

The LFC library passes the physical memory addresses of free host receive buffers to the control program. Since the control program does not check if these addresses are valid, a malicious program can pass an illegal address, which the control program will use as the destination of a DMA transfer.

A simple solution is to have all host receive buffers allocated by a trusted kernel module. This module assigns to each buffer an index (a small integer) and maps this index to the buffer's physical address. The mapping is passed to the NI, but not to the user process. The user process, i.e., the LFC library, is only given each buffer's virtual address and its index. Instead of passing a buffer's physical address to the NI, the library will pass the buffer's index. The NI control program can easily check if the index it receives is valid. This way, DMA transfers to host memory can only target the buffers allocated by the trusted kernel module.

Another solution is to use a secure address translation mechanism such as VMMC-2's user-managed TLB [49] (see Section 2.3). This scheme is more involved, but supports dynamic pinning and unpinning of any page in the user's address space, whereas the scheme above would pin all host receive buffers once and for all.

Foreign Packets

LFC's control program makes no serious attempt to reject network packets that originate from another application. LFC rejects only packets that do not carry LFC's Myrinet tag or that carry an out-of-range LFC tag. Consequently, malicious or faulty programs can send a packet to any other application as long as the packet carries tags known by LFC. The Hamlyn system [32] provides each parallel application's process with a shared random bit pattern that is appended to each message and verified by the receiver. The bit pattern is sufficiently long that an exhaustive search for its value is computationally infeasible.

3.10 Related Work

The design of LFC was motivated in part by the shortcomings of existing systems. Since many of these systems were discussed in Chapter 2, we will not treat them all again here.

LFC uses optimistic interrupt protection [134] to achieve atomicity with respect to network signals. This technique was used in the Mach operating system kernel to reduce the overhead of hardware interrupt-mask manipulation. With this technique, interrupt-masking routines do not truly disable interrupts, but manipulate a software flag that represents the current interrupt mask. It is the responsibility of the interrupt handler to check this flag. If the flag indicates that interrupts are not allowed, the interrupt handler creates an interrupt continuation, some state that allows the interrupt-handling code to be executed at a later time. LFC uses the same technique, but does not need to create interrupt continuations, because the NI will automatically regenerate the interrupt.

Some systems emulate asynchronous notification using a background thread that periodically polls the network [54]. The problem is to find a good polling frequency. Also, adding a thread to an otherwise single-threaded software system is often difficult.

A similar idea is used in the Message-Passing Library (MPL) for the IBM SP/2. This library uses periodic timer interrupts to allow the network to be drained even if senders and receivers are engaged in long-running computations [129]. The main difference is that MPL's timeout handler does not invoke user code, because MPL does not provide an asynchronous notification interface.

Another emulation approach is to use a compiler or binary rewriter to insert polls automatically into applications. Shasta [125], a fine-grain DSM, uses binary instrumentation. In a simulation study [114], Perkovic and Keleher compare automatic insertion of polls by a binary rewriter and the polling-watchdog approach. Their simulation results, performed in the context of the CVM DSM, indicate that both approaches work well and perform better than an approach that relies exclusively on polling during message sends.

3.11 Summary

In this chapter, we have presented the interface and implementation of LFC, a new user-level communication system designed to support the development of parallel programming systems. LFC provides reliable point-to-point and multicast communication, interrupt management, and synchronization through fetch-and-add. LFC's interface is low-level. In particular, there is no message abstraction: clients operate on send and receive packets.

This chapter has described the key components of LFC's implementation on Myrinet (library, device drivers, and NI control program). We have described LFC's data transfer path, buffer management, control transfer, and synchronization. The NI-level communication protocols, which form an important part of the implementation, are described in the next chapter.


Chapter 4

Core Protocols for Intelligent Network Interfaces

This chapter gives a detailed description of LFC's NI-level protocols: UCAST, MCAST, RECOV, and INTR. The UCAST protocol implements reliable point-to-point communication between NIs. UCAST assumes reliable network hardware and preserves this reliability using software flow control implemented at the data link level, between NIs. LFC is named after this property: Link-level Flow Control.

Link-level flow control also forms the basis of MCAST, an NI-level spanning tree multicast protocol. MCAST forwards multicast packets on the NI rather than on the host; this prevents unnecessary reinjection of packets into the network and removes host processing (in particular interrupt processing and copying) from the critical multicast path.

MCAST works for a limited, but widely used class of multicast trees (e.g., binary and binomial trees). For trees not in this class, MCAST may deadlock. The RECOV protocol extends MCAST with a deadlock recovery protocol that allows the use of arbitrary multicast trees.

UCAST, MCAST, and RECOV are concerned with data transfer between NIs. The last protocol described in this chapter, INTR, deals with NI-to-host control transfer. INTR reduces the overhead of network interrupts by delaying network interrupts.

This chapter is structured as follows. Sections 4.1 through 4.4 discuss, respectively, UCAST, MCAST, RECOV, and INTR. Section 4.5 discusses related work.


4.1 UCAST: Reliable Point-to-Point Communication

The UCAST protocol implements reliable and FIFO point-to-point transfers of packets between NIs. Since we assume that the hardware does not drop or corrupt packets (see Section 3.2), the only cause of packet loss is lack of buffer space on receiving nodes (both NIs and hosts). This chapter deals only with NI-level buffer overflow; host-level buffering was discussed in Section 3.6.

UCAST uses a variant of sliding window flow control between each pair of NIs to prevent NI-level buffer overflows. The protocol guarantees that no NI will send a packet to another NI before it knows the receiving NI can store the packet. To achieve this, each NI assigns an equal number of its receive buffers to each NI in the system. An NI S that needs to transmit a packet to another NI R may do so only when it has at least one send credit for R. Each send credit corresponds to one of R's free receive buffers for S. Each time S sends a packet to R, S consumes one of its send credits for R. Once S has consumed all its send credits for R, S must wait until R frees some of its receive buffers for S and returns new send credits to S. Send credits are returned by means of explicit acknowledgements or by piggybacking them on application-level return traffic.

4.1.1 Protocol Data Structures

Figure 4.1 shows the types and variables used by UCAST. All variables are stored in NI memory. UCAST uses two types of network packets: UNICAST packets carry user data; CREDIT packets are used only to return credits to a sender and do not carry any data. Each packet carries a header of type PacketHeader that identifies the packet's type (tag) and sender (src). In addition, the protocol uses a credits field in the header to return send credits to an NI.

Transmission requests are stored in send descriptors of type SendDesc that identify the packet to be transmitted (packet) and the destination NI (dest). Multiple send descriptors can refer to the same packet; this is important for the multicast protocol, which can forward a single packet to multiple destinations.

Each NI maintains protocol state for each other NI. This state is stored in a protocol control block of type ProtocolCtrlBlock. The protocol state for a given NI consists of the number of remaining send credits for that NI (send_credits), a queue of blocked transmission requests (blocked_sends), and the number of credits that can be returned to that NI (free_credits).

Finally, each NI maintains several global variables. Each NI stores its unique id in LOCAL_ID, the number of NIs in the system in NR_NODES, and a credit refresh threshold in CREDIT_REFRESH. The refresh threshold is used by receivers to determine when credits must be returned to the sender that consumed them. The protocol control blocks are stored in array pcbtab. Variable nr_pkts_received counts how many packets an NI has received; this variable is used by the INTR protocol (see Section 4.4).


typedef enum { UNICAST, CREDIT } PacketType;

typedef struct {
    PacketType tag;       /* packet type */
    unsigned   src;       /* sender of packet */
    unsigned   credits;   /* piggybacked credits */
} PacketHeader;

typedef struct {
    PacketHeader hdr;
    Byte         data[PACKET_SIZE];
} Packet;

typedef struct {
    unsigned dest;
    Packet  *packet;
} SendDesc;

typedef struct {
    unsigned      send_credits;
    SendDescQueue blocked_sends;
    unsigned      free_credits;
} ProtocolCtrlBlock;

unsigned LOCAL_ID, NR_NODES, CREDIT_REFRESH;   /* runtime constants */
ProtocolCtrlBlock pcbtab[MAX_NODES];           /* per-node protocol state */
unsigned nr_pkts_received;

Fig. 4.1. Protocol data structures for UCAST.


void event_unicast(unsigned dest, Packet *pkt)
{
    SendDesc *desc = alloc_send_desc();

    pkt->hdr.tag = UNICAST;
    desc->dest   = dest;
    desc->packet = pkt;
    credit_send(desc);
}

void credit_send(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];

    if (pcb->send_credits > 0) {
        pcb->send_credits--;                 /* consume a credit */
        send_packet(desc);
        return;
    }
    enqueue(&pcb->blocked_sends, desc);      /* wait for a credit */
}

void send_packet(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];
    Packet *pkt = desc->packet;

    pkt->hdr.src      = LOCAL_ID;
    pkt->hdr.credits  = pcb->free_credits;   /* piggyback credits */
    pcb->free_credits = 0;

    transmit(desc->dest, pkt);
}

Fig. 4.2. Sender side of the UCAST protocol.


4.1.2 Sender-Side Protocol

Figure 4.2 shows the reliability protocol executed by each sending NI. All protocol code is event-driven; this part of the protocol is activated in response to unicast events, which are generated when the NI's host launches a packet. How LFC generates this event was discussed in Chapter 3.

Each unicast event is dispatched to event_unicast(). This routine creates a send descriptor and passes this descriptor to credit_send(), which checks if a send credit is available for the destination specified in the send descriptor. If this is the case, the packet is transmitted; otherwise the send descriptor is queued on the blocked-sends queue for the destination NI. It will remain queued until the destination NI has returned a sufficient number of send credits.

Packets are transmitted by send_packet(). This routine is also used to return send credits to the destination NI. In the protocol control block of each NI we count how many credits can be returned to that NI; send_packet() copies the counter into the credits field of the packet header and then resets the counter. This way, any packet travelling to an NI will automatically return any available send credits to that NI.

Send packets can be returned to the NI send buffer pool as soon as transmission has completed. We do not show send packet deallocation, because it does not trigger any significant protocol actions.

4.1.3 Receiver-Side Protocol

The receiver side of the reliability protocol is shown in Figure 4.3. This part of the protocol handles two events: packet arrival and packet release. Incoming packets are dispatched to event_packet_received(). This routine accepts any new send credits that the sender may have returned and then processes the packet. If the packet is a UNICAST data packet, it is delivered to the host. In this protocol description, we are not concerned with the details of NI-to-host delivery. CREDIT packets serve only to send credits to an NI; they do not require any further processing and are discarded immediately.

Routine accept_new_credits() adds the credits returned in packet pkt to the send credits for NI src. Next, this routine walks the blocked-sends queue to transmit to src as many blocked packets as possible. Since the blocked-sends queue is checked immediately when credits are returned, packets are always transmitted in FIFO order.

Although UCAST is not concerned with the exact way in which packets are delivered to the host, it needs to know when a packet buffer can be reused for new incoming packets. The send credit consumed by the packet's sender cannot be returned until the packet is available for reuse. We therefore require that a packet release event be generated when a receive packet is released. This event is dispatched to event_packet_released(), which returns a send credit to the packet's sender and then deallocates the packet. Credits are returned by return_credit_to(), which increments a counter (free_credits) in the protocol control block of the packet's sender. If this counter reaches the credit refresh threshold, return_credit_to() sends an explicit CREDIT packet to the sender. This happens only if data packets flow mainly in one direction. If there is sufficient return traffic, free_credits will never reach the refresh threshold, because send_packet() will already have piggybacked all free credits on an outgoing UNICAST packet.


void event_packet_received(Packet *pkt)
{
    accept_new_credits(pkt);

    switch (pkt->hdr.tag) {
    case UNICAST:
        deliver_to_host(pkt);
        nr_pkts_received++;
        break;
    case CREDIT:
        break;
    }
}

void accept_new_credits(Packet *pkt)
{
    ProtocolCtrlBlock *pcb = &pcbtab[pkt->hdr.src];
    SendDesc *desc;

    /* Retrieve (piggybacked) credits and unblock blocked senders. */
    pcb->send_credits += pkt->hdr.credits;
    while (!queue_empty(&pcb->blocked_sends) && pcb->send_credits > 0) {
        desc = dequeue(&pcb->blocked_sends);
        credit_send(desc);
    }
}

void event_packet_released(Packet *pkt)
{
    return_credit_to(pkt->hdr.src);
}

void return_credit_to(unsigned sender)
{
    ProtocolCtrlBlock *pcb = &pcbtab[sender];

    pcb->free_credits++;
    if (pcb->free_credits >= CREDIT_REFRESH) {
        Packet credit_packet;
        SendDesc *desc = alloc_send_desc();

        credit_packet.hdr.tag = CREDIT;   /* send explicit credit packet */
        desc->dest   = sender;
        desc->packet = &credit_packet;
        send_packet(desc);
    }
}

Fig. 4.3. Receiver side of the UCAST protocol.


[Figure: two host/NI pairs; with host-level forwarding the packet is passed up to the host and reinjected, with interface-level forwarding the NI forwards it directly.]

Fig. 4.4. Host-level (left) and interface-level (right) multicast forwarding.

UCAST is simple and efficient. Since the protocol preserves the reliability of the underlying hardware by avoiding receiver buffer overruns, senders need not buffer packets for retransmission, nor need they manage any timers. The main disadvantage of UCAST is that it statically partitions the available receive buffer space among all senders, so it needs more NI memory as the number of processors is increased. Chapter 8 studies this issue in more detail.
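
To make the memory cost of this static partitioning concrete, consider a back-of-the-envelope example (the buffer count is illustrative, not LFC's actual setting): with C receive buffers reserved per sender and 1 Kbyte packets, a 64-node system with C = 8 requires roughly 64 x 8 x 1 Kbyte = 512 Kbyte of NI receive-buffer memory per node, and this amount grows linearly with the number of nodes.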

4.2 MCAST: Reliable Multicast Communication

Since we assume that multicast is not supported in hardware, we provide a software spanning-tree multicast protocol. This protocol, MCAST, implements a reliable and FIFO-ordered multicast. Before describing MCAST, we discuss different multicast forwarding strategies and the importance of multicast tree topology.

4.2.1 Multicast Forwarding

Most spanning-tree protocols use host-level forwarding. With host-level forwarding, the sender is the root of a multicast tree and sends a multicast packet to each of its children. Each child's NI passes the packet to its host, which reinjects the packet to forward it to the next level in the multicast tree (see Figure 4.4).

Host-level forwarding has three drawbacks. First, since each reinjected packet was already available in NI memory, host-level forwarding results in an unnecessary host-to-NI data transfer at each internal node of the multicast tree, which wastes bus bandwidth and processor time. Second, no forwarding takes place unless the host processes incoming packets. If one host does not poll the network in a timely manner, all its children will be affected. Instead of relying on polling, the NI can raise an interrupt, but interrupts are expensive. Third, the critical sender-receiver path includes the host-NI interactions of all the nodes between the sender and the receiver. For each multicast packet, these interactions consist of copying the packet to the host, host processing, and reinjecting the packet.


MCAST addresses these problems by means of NI-level forwarding: multicast packets are forwarded by the NI without intervention by host processors.

4.2.2 Multicast Tree Topology

Multicast tree topology matters in two ways. First, different trees have different performance characteristics. Many different tree topologies have been described in the literature [30, 77, 80]. Deep trees (i.e., trees with a low fanout), for example, give good throughput, because internal nodes need to forward packets fewer times, so they use less time per multicast. Latency increases because more forwarding actions are needed before the last destination is reached. Depending on the type of multicast traffic generated by an application, one topology gives better performance than another. Our goal is not to invent or analyze a particular topology that is optimal in some sense. No topology is optimal under all circumstances. It is important, however, that a multicast protocol does not prohibit topologies that are appropriate for the application at hand.
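
To make the fanout/depth trade-off concrete, the small C sketch below (illustrative only; it is not part of LFC) computes, for a complete k-ary spanning tree over P nodes, the per-node forwarding work (k forwards per internal NI, which bounds throughput) and the tree depth (which grows with log_k P and drives latency).

#include <stdio.h>

/* Depth of a complete k-ary tree with p nodes: the number of forwarding
 * hops from the root to the deepest leaf. */
static unsigned kary_depth(unsigned p, unsigned k)
{
    unsigned depth = 0, reachable = 1, level_width = 1;

    while (reachable < p) {
        level_width *= k;        /* nodes added at the next level */
        reachable   += level_width;
        depth++;
    }
    return depth;
}

int main(void)
{
    unsigned p = 64;             /* example system size */
    unsigned k;

    for (k = 1; k <= 4; k++)     /* k == 1 is a linear chain */
        printf("fanout %u: %u forwards per internal NI, depth %u\n",
               k, k, kary_depth(p, k));
    return 0;
}

With 64 nodes, a linear chain (k = 1) forwards each packet only once per internal NI but needs 63 hops to reach the last destination, whereas a binary tree needs at most 6 hops at the cost of two forwards per internal NI, matching the qualitative discussion above.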

Second,with MCAST’s packet forwarding scheme(describedlater), sometreescan inducebuffer deadlocks.A multicastpacket requiresbuffer spaceatall its destinations.This buffer spacecanbeobtainedbefore thepacket is sentorit canbe obtainedmoredynamically, asthe packet travels from onenodein thespanningtreeto another. Thefirst approachis conceptuallysimple,but requiresaglobalprotocolto reserve buffersat all destinationnodes.Thesecondapproach,takenby MCAST, introducesthedangerof buffer deadlocks.A sendermayknowthatits multicastpacketcanreachthenext nodein thespanningtree,but doesnotknow if the packet cantravel further onceit hasreachedthat node. Section4.3givesanexampleof suchadeadlock.

MCAST avoids buffer deadlocks by restricting the topology of multicast trees. An alternative approach is to detect deadlocks and recover from them. This approach is taken by RECOV, an extension of MCAST that allows the use of arbitrary multicast trees.

4.2.3 Protocol Data Structures

In MCAST, NIs multicast inside multicast groups that are created at initialization time; MCAST does not include a group join or leave protocol. Each member of a multicast group has its own spanning tree which it uses to forward multicast packets to the other group members.

Figure4.5 shows MCAST’s datastructures.MCAST builds on UCAST, sowe reuseandextendUCAST’s datastructuresandroutines.MCAST introducesa new packet type, MULTICAST, andextendthe packet headerwith a treefield.Eachprocessis givena uniqueindex for eachmulticastgroupthatit is a member


typedef enum { UNICAST, MULTICAST, CREDIT } PacketType;

typedef struct {
    ...                  /* as before */
    unsigned tree;       /* identifies multicast tree */
} PacketHeader;

typedef struct {
    unsigned parent;
    unsigned nr_children;
    unsigned children[MAX_FORWARD];
    /* MAX_FORWARD depends on tree size and shape */
} Tree;

Tree treetab[MAX_TREES];

Fig. 4.5. Protocol data structures for MCAST.

This index identifies the process's multicast tree for that particular multicast group. When the process transmits a packet to the other members of that multicast group, it stores the tree identifier in the packet's tree field (see Figure 3.1). Each NI stores a multicast forwarding table (treetab) that is indexed by this tree field. The table entry for a given tree specifies how many children the NI has in that tree and lists those children.

4.2.4 Sender-Side Protocol

An NI initiates a multicast in response to multicast events, which are handled by event_multicast(). This routine marks the packet to be multicast as a MULTICAST packet, stores the multicast tree identifier in the packet header, and tries to transmit the packet to all first-level children in the multicast tree by calling forward_packet().

Routine forward_packet() looks up the table entry for the multicast tree that the packet was sent on and creates a send descriptor for each forwarding destination listed in the table entry. From then on, exactly the same procedure is followed as for a unicast packet. The packet will be transmitted to a forwarding destination only if credits are available for that destination; otherwise, the send descriptor is moved to the destination's blocked-sends queue.


void event_multicast(unsigned tree, Packet *pkt)
{
    pkt->hdr.tag  = MULTICAST;
    pkt->hdr.tree = tree;
    forward_packet(pkt);
}

void forward_packet(Packet *pkt)
{
    Tree *tree = &treetab[pkt->hdr.tree];
    SendDesc *desc;
    unsigned i;

    for (i = 0; i < tree->nr_children; i++) {
        desc = alloc_send_desc();
        desc->dest   = tree->children[i];
        desc->packet = pkt;
        credit_send(desc);
    }
}

Fig. 4.6. Sender-side protocol for MCAST.

4.2.5 Receiver-Side Protocol

Figure 4.7 shows the receiving side of MCAST. When a multicast packet is received, it is delivered to the host, just like a unicast packet. In addition, the packet is forwarded to all of the receiving NI's children in the multicast tree. This is done using the same forwarding routine described above (forward_packet()).

An NI receive buffer that holds a multicast packet is released when it has been delivered to the host and when it has been forwarded to all children in the multicast tree. At that point, event_packet_released() is called. As before, this routine returns a send credit to the (immediate) sender of the packet. In the case of a multicast packet, the immediate sender is the receiving NI's parent in the multicast tree. We do not use the packet header's source field (src) to determine the packet's sender, because this field is overwritten during packet forwarding (by send_packet()). Instead, we use the tree field to find this NI's parent; this field is never modified during packet forwarding.

To make this modified packet release routine work for unicast packets, we also define unicast trees. These trees consist of two nodes, a sender and a receiver; the sender is the parent of the receiver. One unicast tree is defined for each sender-receiver pair. UCAST is modified so that it writes the unicast tree identifier in each outgoing unicast packet. Unicast trees are not defined only to make packet release work smoothly, but are also important in the RECOV protocol.


void event_packet_received(Packet *pkt)
{
    accept_new_credits(pkt);

    switch (pkt->hdr.tag) {
    case UNICAST:  ...; break;   /* as before */
    case CREDIT:   ...; break;   /* as before */
    case MULTICAST:
        deliver_to_host(pkt);
        nr_pkts_received++;
        forward_packet(pkt);
        break;
    }
}

void event_packet_released(Packet *pkt)
{
    return_credit_to(treetab[pkt->hdr.tree].parent);
}

Fig. 4.7. Receive procedure for the multicast protocol.

4.2.6 Deadlock

MCAST’s simplicity resultsfrom building on reliablecommunicationchannelsbetweenNIs. Unfortunately, MCAST is susceptibleto deadlockif an arbitrarytopologyis used.Figure4.8illustratesadeadlockscenario.ThefigureshowsfourNIs,eachwith its ownsendandreceivepool. RecallthattheNI receivebufferpoolis partitioned.In thisscenario,eachprocessoris therootof adegeneratemulticasttree,a linear chain. Initially, eachNI owns onesendcredit for eachdestinationNI. All processors(not shown in thefigure)have injecteda multicastpacket andeachmulticastpacket hasbeenforwardedto thenext NI in thesender’s multicastchain. That is, eachNI hasspentits sendcredit for thenext NI in the multicastchain. Now every packet needsto be forwardedagainto reachthenext NI in itschain,but no packet canmake progress,becausethe target receive buffer on thenext NI is occupiedby anotherblockedpacket. EachNI hasfreereceive buffers,but UCAST’s flow controlschemedoesnot allow onesendingNI to useanothersendingNI’s receivebuffers.

This deadlock is the result of the specific multicast trees used to forward packets (linear chains). By default, LFC uses binary trees to forward multicast packets. Figure 4.9 shows the binary multicast tree for processor 0 in a 16-node system. The multicast tree for processor p is obtained by renumbering each tree node n in the multicast tree of processor 0 to (n + p) mod P, where P is the number of processors. With these binary trees, it is impossible to construct a deadlock cycle (see Appendix A).
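
The following C sketch shows one way a treetab entry could be filled according to this renumbering rule. It assumes, for illustration only, that the default binary tree of Figure 4.9 uses the conventional heap numbering in which node n has children 2n+1 and 2n+2; the helper name build_binary_tree() and the fixed fanout of two are choices made here, not part of LFC's published interface.

/* Fill the local NI's forwarding-table entry for the binary multicast
 * tree rooted at processor p (illustrative sketch). For the root, the
 * parent field is set to the node itself, since it is never used there. */
void build_binary_tree(Tree *entry, unsigned p, unsigned nr_nodes)
{
    /* Position of the local NI in processor p's tree: the inverse of
     * the renumbering (n + p) mod P used to derive the tree. */
    unsigned n = (LOCAL_ID + nr_nodes - p) % nr_nodes;
    unsigned c, child = 2 * n + 1;

    entry->parent = (n == 0) ? LOCAL_ID : ((n - 1) / 2 + p) % nr_nodes;
    entry->nr_children = 0;
    for (c = 0; c < 2 && child + c < nr_nodes; c++)
        entry->children[entry->nr_children++] = (child + c + p) % nr_nodes;
}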


[Figure: four network interfaces (0 to 3), each with a send pool and a receive pool partitioned among senders; every NI holds a blocked multicast packet destined for the next NI in its chain.]

Fig. 4.8. MCAST deadlock scenario.

[Figure: a binary multicast tree over 16 nodes, rooted at node 0, with nodes 1 through 15 at the lower levels.]

Fig. 4.9. A binary tree.


To avoid buffer deadlocks, MCAST imposes two restrictions:

1. Multicast groups may not overlap.

2. Not all multicast tree topologies are valid.

The first restriction implies that we cannot use broadcast and multicast at the same time, because the broadcast group overlaps with every multicast group. The reasons for these restrictions and a precise characterization of valid multicast trees are given in Appendix A.

4.3 RECOV: Deadlock Detection and Recovery

MCAST’s restrictionsonmulticasttreetopologyarepotentiallycumbersome,be-causesometopologiesthat are not deadlock-freeare more efficient than othercombinations[80]. For large messages,for example,linear chainsgive the bestthroughput,but asillustratedin theprevioussection,chainsarenotdeadlock-free.Several sophisticatedmulticast protocolsbuild treesthat containshort chains.Thesetreesarenotdeadlock-freeeither, yetwewouldliketo usethemif deadlockis unlikely.

To solve this problem, we have extended MCAST with a deadlock recovery mechanism. The extended protocol, RECOV, switches to a slower, but deadlock-free protocol when a deadlock is suspected. In this section, we first give a high-level overview of RECOV's deadlock detection and recovery protocol. Next, we give a detailed description of the protocol.

4.3.1 Deadlock Detection

RECOV uses the feedback from UCAST's flow control scheme to detect potential deadlocks. In the case of a buffer deadlock, each NI in the deadlock cycle has consumed all send credits for the next NI in the cycle. In RECOV, an NI therefore signals a potential deadlock when it has run out of send credits for a destination that it needs to send a packet to. That is, nonempty blocked-sends queues signal deadlock. With this trigger mechanism an NI can signal a potential deadlock unnecessarily, because an NI may run out of send credits even if there is no deadlock. The advantage of this detection mechanism, however, is that NIs need only local information (the state of their blocked-sends queues) to detect potential deadlocks. No communication is needed to detect a deadlock, which keeps the detection algorithm simple and efficient.

Since a deadlock need not involve all NIs, each NI must monitor all its blocked-sends queues; progress on some, but not all, queues does not guarantee freedom of deadlock. Checking for a nonempty blocked-sends queue is efficient, but may signal deadlock sooner than a timeout-based mechanism.

Whenan NI hassignaleda potentialdeadlock,it initiatesa recovery action.No new recoveriesmaybestartedby thatNI while a previousrecovery is still inprogress.Whenarecovery terminates,theNI thatinitiatedit will searchfor othernonemptyblocked-sendsqueuesandstartanew recovery if it findsone.To avoidlivelock,theblocked-sendsqueuesareservedin a round-robinfashion.

Multiple NIs can detect and recover from a deadlock simultaneously. Simultaneous recoveries do not interfere, but can be inefficient. In a deadlock cycle, only one NI needs to use the slower escape protocol to guarantee forward progress. With simultaneous recoveries, all NIs that detect deadlock will start using the slower escape protocol [96].

4.3.2 Deadlock Recovery

To ensure forward progress when a deadlock is suspected, RECOV requires P - 1 extra buffers on each NI, where P is the number of NIs used by the application. Each NI N owns one buffer on every other NI. These buffers form a dedicated recovery channel for N. Communication on this recovery channel can result only from a recovery initiated by N.
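
The memory overhead of these escape buffers is modest. With the illustrative numbers used earlier (1 Kbyte packets and 64 nodes, which are not LFC's actual configuration), the recovery channels cost each NI (P - 1) x 1 Kbyte = 63 Kbyte, independent of the number of credits granted per sender.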

When an NI suspects deadlock, it will use its recovery channel to make progress on one multicast or unicast tree. (Recall that we have defined both unicast and multicast trees, so that every data packet travels along some tree.) Other NIs will use this channel only when they receive a packet that was transmitted on the channel. Specifically, when an NI receives, on recovery channel C, a packet p transmitted along tree T, then that NI may forward one packet to each of its children in T. The forwarded packets must also belong to T. This guarantees that we cannot construct a cycle within a recovery channel, because all communication links used in the recovery channel form a subtree of T.

Once an NI has initiated a recovery on its recovery channel, it may not initiate another recovery until the recovery channel is completely clear. The recovery channel is cleared bottom-up. When a leaf in the recovery tree releases its escape buffer (because it has delivered the packet to its host), it sends a special acknowledgement to its parent in the recovery tree. When the parent has released its escape buffer and when it has received acknowledgements from all its children in the recovery tree, then it will propagate an acknowledgement to its parent. When the root of the recovery tree receives an acknowledgement, the recovery channel is clear.

Each NI can initiate at most one recovery at a time, but different NIs can start a recovery simultaneously, even in a single multicast tree. Such simultaneous recoveries do not interfere with each other, because they use different recovery channels: each recovery uses the recovery channel owned by the NI that initiates the recovery.


[Figure: a spanning tree rooted at NI R, with child D and descendants E and F below it; dashed edges mark parent-child links for which the parent has no send credits left.]

Fig. 4.10. Subtree covered by the deadlock recovery algorithm.

The protocol’s operationis illustratedin Figure4.10. In this figure, verticesrepresentNIs and the edgesshow how theseNIs are organizedin a particularspanningtree.Dashededgesbetweenaparentandachild indicatethattheparenthasnosendcreditsleft for thechild.

NI R triggers a recovery because it has no send credits for its child D. R selects a blocked packet p at the head of one of its blocked-sends queues and determines the tree T that this packet is being forwarded on. R now becomes the root of a recovery tree and will use its recovery channel to forward p. R marks p and sends p to D. D then becomes part of the recovery tree. If D has any children for which it has no send credits, then D may further extend the recovery tree by using R's recovery channel to forward p. If p can be forwarded to an NI using the normal credit mechanism (e.g., to F), then D will do so, and the recovery tree will not include that NI. (If F cannot forward p to its child, then F will initiate a recovery on its own recovery channel.)

4.3.3 Protocol Data Structures

We now present the RECOV protocol as an extension of MCAST. Figure 4.11 shows the new and modified data structures and variables used by RECOV. A new type of packet, CHANNEL_CLEAR, is used to clear a recovery channel. CHANNEL_CLEAR packets travel from the leaves to the root of a recovery tree.

The packet header contains two new fields. (In reality LFC combines these two fields in a single deadlock channel field; see Section 3.4.) Field deadlocked is set to TRUE when a packet needs to use a recovery channel. The channel field identifies the recovery channel. This field contains a valid value for all UNICAST and MULTICAST packets that travel across a recovery channel (i.e., packets that have deadlocked set to TRUE) and for all CHANNEL_CLEAR packets. The same fields are added to send descriptors.


typedef enum { UNICAST, MULTICAST, CREDIT, CHANNEL_CLEAR } PacketType;

typedef struct {
    ...                  /* as before */
    int      deadlocked; /* TRUE iff transmitted over recovery channel */
    unsigned channel;    /* identifies owner of the recovery channel */
} PacketHeader;

typedef struct {
    ...                  /* as before */
    int      deadlocked; /* saves deadlocked field of incoming packets */
    unsigned channel;    /* saves channel field of incoming packets */
} SendDesc;

typedef struct {
    ...                  /* as before */
    unsigned nr_acks;    /* need to receive this many recovery acks */
} ProtocolCtrlBlock;

typedef struct {
    ...                  /* as before */
    UnsignedQueue active_channels; /* active recovery channels in this tree */
} Tree;

int recovery_active;     /* TRUE iff the local node initiated a recovery */

Fig. 4.11. Deadlock recovery data structures.


void forward_packet(Packet *pkt)
{
    Tree *tree = &treetab[pkt->hdr.tree];
    int      deadlocked = pkt->hdr.deadlocked;   /* save! */
    unsigned channel    = pkt->hdr.channel;      /* save! */
    SendDesc *desc;
    unsigned i;

    for (i = 0; i < tree->nr_children; i++) {
        desc = alloc_send_desc();
        desc->dest       = tree->children[i];
        desc->packet     = pkt;
        desc->deadlocked = deadlocked;
        desc->channel    = channel;
        credit_send(desc);
    }
}

Fig. 4.12. Multicast forwarding in the deadlock recovery protocol.


The protocol control block is extended with a counter (nr_acks) that counts how many CHANNEL_CLEAR packets remain to be received before a channel-clear acknowledgement can be propagated up the recovery tree.

The Tree data structure is extended with a queue of active recovery channels (active_channels). Each time an NI participates in a recovery on some recovery channel C for some tree T, it adds C to the queue of tree T.

4.3.4 Sender-Side Protocol

The sender side of RECOV consists of several modifications to the MCAST and UCAST routines. The modified multicast forwarding routine is shown in Figure 4.12. For each child, RECOV must know whether the packet to be forwarded arrived on a deadlock recovery channel. At the time the packet is forwarded to a specific child, we can no longer trust all information in the packet header, because this information may have been overwritten when the packet was forwarded to another child. Therefore, we have modified forward_packet() to save the deadlocked and channel fields of the incoming multicast packet in the send descriptors for the forwarding destinations.

The saved deadlocked field is used by credit_send() (see Figure 4.13). As before, this routine first tries to consume a send credit. If all send credits have been consumed, however, credit_send() no longer immediately enqueues the packet (or rather, the descriptor that references the packet) on a blocked-sends queue. Instead, we first check if the packet arrived on a recovery channel (using the saved deadlocked field). If this is the case, we forward a packet that belongs to the same tree on the same recovery channel. We cannot always forward the packet referenced by desc, because other packets may be waiting to travel to the same destination. If we bypass these packets and one of them arrived on the same tree as the bypassing packet, then we violate FIFOness. To prevent this, next_in_tree() (not shown) first appends desc to the blocked-sends queue for the packet's destination. Next, it removes from this queue the first packet that is travelling along the same tree as the packet referenced by desc and it returns the send descriptor for this packet. This packet is transmitted on the recovery channel.

If the packet that credit_send() tries to transmit did not arrive on a deadlock recovery channel, then credit_send() will start a new recovery on this NI's recovery channel. However, it can do so only if it is not already working on an earlier recovery. If this NI has already initiated a recovery and this recovery is still active, then the packet is simply enqueued on its destination's blocked-sends queue. Otherwise the NI starts a new recovery on its own recovery channel.

For completeness, we also show the modified send_packet() routine. This routine now also copies the deadlocked field and the channel field into the packet header.

4.3.5 Receiver-Side Protocol

The receiver-side code of RECOV is shown in Figure 4.14 and Figure 4.15. Only a few modifications to event_packet_received() are needed.

First, special action must be taken when a packet arrives on a recovery channel. In that case, the receiving NI creates some state for the recovery that this packet is part of. In the protocol control block of the recovery channel's owner, the NI sets the channel-clear counter to 1. This counter is used when the recovery channel is cleared and indicates how many parties need to agree that the channel is clear before a channel-clear acknowledgement can be sent up the recovery tree. As long as the receiving NI does not extend the recovery tree, only it needs to say that the channel is clear, hence the counter is set to 1. If, however, the recovery tree is extended by forwarding the packet on its recovery channel, then credit_send() will increase the counter to indicate that an additional channel-clear acknowledgement is needed from the child that the packet was forwarded to.

Routine event_packet_received() also stores the identity of the recovery channel's owner (the NI that initiated the recovery) on a queue associated with the tree that the incoming packet travelled on. This information is used when packets are released.


void credit_send(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];

    if (pcb->send_credits > 0) {
        desc->deadlocked = FALSE;
        pcb->send_credits--;                 /* consume a credit */
        send_packet(desc);
        return;
    }

    if (desc->deadlocked) {
        /* desc->packet arrived on a deadlock channel. We forward desc->packet,
         * or a predecessor in the same tree, on the same channel.
         */
        pcbtab[desc->channel].nr_acks++;
        desc = next_in_tree(pcb, desc);      /* preserve FIFOness in tree */
        send_packet(desc);                   /* no credit consumed by this send */
    } else if (recovery_active) {
        /* Cannot do more than one recovery at a time. */
        enqueue(&pcb->blocked_sends, desc);
    } else {
        /* Initiate recovery on my deadlock channel */
        recovery_active = TRUE;
        pcbtab[LOCAL_ID].nr_acks = 1;
        desc->deadlocked = TRUE;
        desc->channel    = LOCAL_ID;         /* my recovery channel */
        send_packet(desc);                   /* no credit consumed by this send */
    }
}

void send_packet(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];
    Packet *pkt = desc->packet;

    pkt->hdr.src        = LOCAL_ID;
    pkt->hdr.credits    = pcb->free_credits;   /* piggyback credits */
    pkt->hdr.deadlocked = desc->deadlocked;
    pkt->hdr.channel    = desc->channel;
    pcb->free_credits   = 0;                   /* we returned all credits to dest */
    transmit(desc->dest, pkt);
}

Fig. 4.13. Packet transmission in the deadlock recovery protocol.


void event_packet_received(Packet *pkt)
{
    accept_new_credits(pkt);

    if (pkt->hdr.deadlocked) {
        ProtocolCtrlBlock *channel_owner = &pcbtab[pkt->hdr.channel];
        Tree *tree = &treetab[pkt->hdr.tree];

        /* Register my state for this recovery in the pcb of the node that
         * initiated the recovery. Add the owner's id to a queue of ids
         * that is maintained for the tree that the packet belongs to.
         */
        channel_owner->nr_acks = 1;   /* one 'ack' for local delivery */
        enqueue(&tree->active_channels, pkt->hdr.channel);
    }

    switch (pkt->hdr.tag) {
    case UNICAST:   ...; break;   /* as before */
    case MULTICAST: ...; break;   /* as before */
    case CREDIT:    ...; break;   /* as before */
    case CHANNEL_CLEAR:
        release_recovery_channel(pkt->hdr.channel);
        break;
    }
}

Fig. 4.14. Receive procedure for the deadlock recovery protocol.


void release_recovery_channel(unsigned channel)
{
    ProtocolCtrlBlock *pcb = &pcbtab[channel];

    pcb->nr_acks--;
    if (pcb->nr_acks > 0) return;   /* channel not clear yet */

    /* Below and at this tree node, the recovery channel is clear. */
    if (channel == LOCAL_ID) {
        /* Recovery completed; the entire recovery channel is clear. */
        recovery_active = FALSE;
        try_new_recovery();   /* search for new nonempty blocked-sends queue */
    } else {
        Packet deadlock_ack;
        SendDesc *desc = alloc_send_desc();

        /* Propagate channel-clear ack to parent */
        deadlock_ack.hdr.tag = CHANNEL_CLEAR;
        desc->dest       = pcb->parent;
        desc->packet     = &deadlock_ack;
        desc->deadlocked = FALSE;
        desc->channel    = channel;
        send_packet(desc);
    }
}

void event_packet_released(Packet *pkt)
{
    Tree *tree = &treetab[pkt->hdr.tree];

    if (queue_empty(&tree->active_channels)) {
        return_credit_to(tree->parent);
    } else {
        /* Do not return a credit, but try to clear a deadlock channel */
        unsigned channel = dequeue(&tree->active_channels);
        release_recovery_channel(channel);
    }
}

Fig. 4.15. Packet release procedure for the deadlock recovery protocol.


Second, we must process incoming channel-clear acknowledgements. This is done by release_recovery_channel() (see Figure 4.15). This routine propagates channel-clear acknowledgements up the recovery tree and detects recovery completion. If the channel-clear counter does not drop to zero, then this routine does nothing. If the counter drops to zero, there are two possibilities. If the channel-clear acknowledgement has reached the root of the recovery tree, then the recovery is complete. The owner of the recovery channel knows that the entire recovery channel is free and will try to initiate a new recovery on its recovery channel. This is done by try_new_recovery() (not shown), which scans all blocked-sends queues for a blocked packet. If such a queue is found, a new recovery is started. If the channel-clear acknowledgements have not yet reached the recovery channel's owner (the root of the recovery tree), then we create a new channel-clear acknowledgement and send it up the recovery tree.
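
Since try_new_recovery() is not shown in the thesis, the sketch below gives one possible implementation consistent with its description and with the round-robin scan mentioned in Section 4.3.1. The static scan position next_victim is an assumption made here for illustration; it is not part of LFC's actual code.

/* Illustrative sketch: scan the blocked-sends queues round-robin and start
 * a new recovery on this NI's own channel for the first blocked packet. */
void try_new_recovery(void)
{
    static unsigned next_victim = 0;   /* round-robin scan position */
    unsigned i;

    for (i = 0; i < NR_NODES; i++) {
        ProtocolCtrlBlock *pcb = &pcbtab[(next_victim + i) % NR_NODES];

        if (!queue_empty(&pcb->blocked_sends)) {
            SendDesc *desc = dequeue(&pcb->blocked_sends);

            next_victim = (next_victim + i + 1) % NR_NODES;
            recovery_active = TRUE;
            pcbtab[LOCAL_ID].nr_acks = 1;
            desc->deadlocked = TRUE;
            desc->channel    = LOCAL_ID;   /* this NI's recovery channel */
            send_packet(desc);             /* no credit consumed */
            return;
        }
    }
}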

We have now described how a recovery is initiated, how a recovery tree is built, and how channel-clear acknowledgements travel back up the recovery tree, but we have not explained how the clearing of a recovery channel is initiated. This is done by the leaves of the recovery tree when they release a packet. The modified event_packet_released() first checks if this NI is participating in any recovery in the tree that the packet arrived on. If it is, then it will have stored the identities of the recovery channel owners in the active_channels queue associated with the tree. If this queue is not empty, event_packet_released() dequeues one channel owner and returns the packet to that owner's recovery channel. If, on the other hand, this NI is not participating in a recovery on the tree that the packet arrived on, then it will simply return a send credit to the packet's last sender (as before).

4.4 INTR: Control Transfer

Since interrupts are expensive, LFC tries to avoid generating interrupts unnecessarily by means of a polling watchdog [101]. This is a mechanism that delays interrupts in the hope that the target process will soon poll the network. The idea is to start a watchdog timer when a packet arrives at the NI. The NI generates an interrupt only if the host does not poll before the timer expires.

Where the original polling watchdog proposal by Maquelin et al. is a hardware design, LFC uses the programmable NI processor to implement a software polling watchdog. In addition, LFC refines the original design in the following way. When the watchdog timer expires, LFC does not immediately generate an interrupt if the host has not yet processed all packets delivered to it. Instead, the NI performs two additional checks to determine whether it should generate an interrupt: an interrupt status check and a client progress check.

The purpose of the interrupt status check is to avoid generating interrupts when the receiving process runs with network interrupts disabled. Recall that clients may dynamically disable and enable network interrupts. For efficiency, the interrupt status manipulation functions are implemented by toggling an interrupt status flag in host memory, without any communication or synchronization with the control program (see Section 3.7). Before sending an interrupt, the control program fetches (using a DMA transfer) the interrupt status flag from host memory. No interrupt is generated when the flag indicates that the client disabled interrupts.

The goal of the client progress check is to avoid generating interrupts when the receiving process recently processed some (but not necessarily all) of the packets delivered to it. The host maintains a count of the number of packets that it has processed. When the watchdog timer expires, the control program fetches this counter. No interrupt is generated if the host processed at least one packet since the previous watchdog timeout.

INTR starts the watchdog timer each time a packet arrives and the timer is not already running on behalf of an earlier packet. Figure 4.16 shows how INTR handles polling watchdog timeouts. The NI fetches the host's interrupt flag and packet counter using a single DMA transfer; both are stored into host. The NI then performs the first part of the client progress check: if the client has processed all (nr_pkts_received) packets delivered by the NI, then the NI cancels the polling watchdog timer. In this case, the host processed all packets that were delivered to it and there is no need to keep the timer running. If this check fails, the NI performs both the interrupt status check and the second part of the client progress check. If the host recently disabled interrupts or if the host consumed some of the packets delivered to it, then the NI does not generate an interrupt, but restarts the polling watchdog timer. If all checks fail, the NI generates an interrupt and restarts the timer. Finally, the NI saves the host's packet counter in last_processed, so that on the next timeout it can check if the client made progress.
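
Figure 4.16 shows only the NI side. For clarity, the fragment below sketches what the host-side counterpart could look like, based on the description above: enabling and disabling interrupts toggle a flag in host memory, and every packet handled by the host increments a counter that the NI later fetches. The layout mirrors HostStatus from Figure 4.16, but the function names used here (lfc_intr_disable(), lfc_intr_enable(), host_handle_packet()) are illustrative and not necessarily LFC's real interface.

/* Host-side status word, kept in host memory where the NI can fetch it
 * with a DMA transfer (layout mirrors HostStatus in Figure 4.16). */
static volatile HostStatus host_status;

void lfc_intr_disable(void) { host_status.intr_disabled = TRUE; }
void lfc_intr_enable(void)  { host_status.intr_disabled = FALSE; }

/* Called by the host each time it handles a packet delivered by the NI;
 * the NI compares this counter against nr_pkts_received and against its
 * saved last_processed value on a watchdog timeout. */
static void host_handle_packet(Packet *pkt)
{
    process_packet(pkt);                 /* placeholder for client handling */
    host_status.nr_pkts_processed++;
}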

4.5 Related Work

LFC’sNI-level protocolsimplementreliability, interruptmanagement,andmulti-castforwarding.Wediscussrelatedwork in all threeareas.

4.5.1 Reliability

LFC assumes that the network hardware is reliable and preserves this reliability by means of its link-level flow control protocol (UCAST). Several other systems (BIP [118], Hamlyn [32], Illinois Fast Messages [112, 113], FM/MC [146], VMMC [24], and PM [141]) make the same hardware reliability assumption (see Figure 2.3). These systems, however, implement flow control in a different way.


typedef struct {
    unsigned nr_pkts_processed;
    int      intr_disabled;
} HostStatus;

unsigned last_processed;

void event_watchdog_timeout(void)
{
    HostStatus host;

    fetch_from_host(&host);
    if (host.nr_pkts_processed == nr_pkts_received) {
        timer_cancel();      /* client processed all packets */
    } else if (host.intr_disabled || host.nr_pkts_processed > last_processed) {
        timer_restart();     /* client disabled interrupts or polled recently */
    } else {
        send_interrupt_to_host();
        timer_restart();
    }
    last_processed = host.nr_pkts_processed;
}

Fig. 4.16. Timeout handling in INTR.

With the exception of PM (see Section 2.6) none of these systems implements link-level flow control. Instead, they assume that the NI can copy data to the host sufficiently fast to prevent buffer overflows (see Section 1.6).

With the exception of FM/MC, none of the above systems supports a complete NI-level multicast. With NI-level forwarding, multicast packets occupy NI receive buffers for a longer time. As a result, the pressure on these buffers increases, and special measures are needed to prevent blocked-packet resets. FM/MC treats high NI receive buffer pressure as a rare event and swaps buffers to host memory when the problem occurs. For applications that multicast very heavily, however, swapping occurs frequently and degrades performance [15]. LFC is more conservative and uses NI-level flow control to solve the problem.

4.5.2 Interrupt Management

LFC supports both polling and interrupts as control-transfer mechanisms. Some communication systems provide only a polling primitive. The main problem with this approach is that many parallel-programming systems, notably DSM systems, cannot rely on polling by the application. Interrupts provide asynchronous notification, but are expensive on commercial architectures and operating systems: the time to field an interrupt often exceeds the latency of a small message. Research systems such as the Alewife [2] and the J-machine [132] support fast interrupts, but the mechanisms used in these machines have not found their way into commercial processor architectures. Even without hardware support, operating systems can, in principle, dispatch interrupts efficiently [94, 143]. The key idea is to save minimal processor state, and leave the rest (e.g., floating-point state) up to the application program. In practice, however, interrupt processing on commodity operating systems has remained expensive.

LFC does not attack the overhead of individual interrupts, but aims to reduce the interrupt frequency by optimistically delaying interrupts. Maquelin et al. proposed the polling watchdog, a hardware implementation of this idea [101]. We augmented the polling watchdog with the interrupt-status and client-progress checks described in Section 4.4. These two checks further reduce the number of spurious interrupts.

4.5.3 Multicast

Several researchers have proposed to use the NI instead of the host processor to forward multicast packets [56, 68, 80, 146]. Our NI-level protocol is original, however, in that it integrates unicast and multicast flow control and uses a deadlock-recovery scheme to avoid routing restrictions.

Verstoep et al. describe a system, FM/MC, that implements an NI-level multicast for Myrinet [146]. FM/MC runs on the same hardware as LFC, but implements buffer management and reliability in a very different way. Section 4.5.4 compares LFC and FM/MC in more detail.

Huang and McKinley propose to exploit ATM NIs to implement collective communication operations, including multicast [68]. In their symmetric broadcast protocol, NIs use an ATM multicast channel to forward messages to their children; multicast acknowledgements are also collected via the NIs. The sending host maintains a sliding window; it appears that a single window is used per broadcast group. LFC, in contrast, uses sliding windows between each pair of NIs. This allows LFC to integrate unicast and multicast flow control; Huang and McKinley do not discuss unicast flow control and present simulation results only.

Gerla et al. [56] also discuss using NIs for multicast packet forwarding. They propose a deadlock avoidance scheme that divides receive buffers in two classes. The first class is used when messages travel to higher-numbered NIs, the second when going to lower-numbered NIs. This scheme requires buffer resources per multicast tree, which is problematic in a system with many small multicast groups.

Kesavan and Panda studied optimal multicast tree shapes for systems with programmable NIs [80]. They describe two packet-forwarding schemes, first-child-first-served and first-packet-first-served (FPFS), and show that the latter performs best.


[Figure: NI and host receive buffers in LFC and FM/MC. In LFC, the NI buffers are partitioned per sender (P0, P1, P2) and the host buffers form a single shared pool; in FM/MC, the NI buffers are shared and the host has per-sender unicast buffers plus a shared multicast pool.]

Fig. 4.17. LFC's and FM/MC's buffering strategies.

The paper does not discuss flow control issues. MCAST integrates FPFS with UCAST's flow control scheme. MCAST usually forwards packets to all children as these packets arrive, just like in FPFS. When an NI has to forward a packet to a destination for which no send credits are available, however, the NI will queue that packet and start working on another packet.

All spanning-tree multicast protocols must deal with the possibility of deadlock. Deadlock is often avoided by means of a deadlock-free routing algorithm; the literature on such algorithms is abundant. Most research in this area, however, applies to routing at the hardware level. Also, as pointed out by Anjan and Pinkston [4], most protocols avoid deadlock by imposing routing restrictions [43]. This is the approach taken by MCAST. RECOV was inspired by the DISHA protocol for wormhole-routed networks [4]. DISHA, however, is a deadlock recovery scheme for unicast worms, whereas RECOV deals with buffer deadlocks in a store-and-forward multicast protocol.

4.5.4 FM/MC

FM/MC implements a reliable NI-level multicast for Myrinet, but does so in a very different way than LFC. The most important differences are related to buffer management and the reliability protocol.

Figure 4.17 shows how LFC and FM/MC allocate receive buffers. FM/MC uses a single queue of NI buffers for all inbound network packets. LFC, on the other hand, statically partitions its NI buffers among all senders: one sender cannot use another sender's receive buffers. At the host level, the situation has almost been reversed. FM/MC allocates a fixed number of unicast buffers per sender. These unicast buffers are managed by a standard sliding-window protocol that runs on the host. In addition, FM/MC has another, separate class of multicast buffers which are not partitioned statically among senders. LFC does not distinguish between unicast and multicast buffers and uses a single pool of host receive buffers.

FM/MC’s multicastbuffers areallocatedto sendersby a centralcredit man-agerthat runson oneof theNIs. A multicastcredit representsa buffer on every

Page 109: Communication Architectures for Parallel-ProgrammingSystemsPanda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl,¨ and me. Rutger Hofman implemented Panda 4.0

4.5RelatedWork 87

receiving host. Beforea processcanmulticasta message,it mustobtaincreditsfrom thecreditmanager. To avoid theoverheadof a credit request-replypair foreverymulticastpacket,hostscanrequestmultiplecreditsandcachethem.Creditsthathave beenconsumedarereturnedto thecreditmanagerby meansof a tokenthatrotatesamongall NIs.

At the NI level, no software flow control is present. Each NI contains a single receive queue for all inbound packets. Under conditions of high load, this queue fills up. FM/MC then moves part of the receive queue to host memory to make space for inbound packets. Eventually, senders will run out of credits and the memory pressure on receiving NIs will drop; packets can then be copied back to NI memory and processed further. The idea is that this type of swapping to host memory will occur only under exceptional conditions, and it is assumed that swapping will not take too much time. Eventually, however, FM/MC relies on hardware back-pressure to stall sending NIs.

By default, FM/MC uses the same binary trees as LFC to forward multicast packets. The protocol, however, allows the use of arbitrary multicast trees, because the credit and swapping mechanisms avoid buffer deadlocks at the NI level. When NI buffers fill up, they are copied to host memory. The centralized credit mechanism guarantees that buffer space is available on the host. Since packets are not forwarded from host memory, there is no risk of buffer deadlock at the host level.

Both LFC and FM/MC have their strengths and weaknesses. An important advantage of LFC over FM/MC is its simplicity. A single flow control scheme is used for unicast and multicast traffic. There is no need to communicate with a separate credit manager. No request messages are needed to obtain credits: receivers know when senders are low on credits and send credit update messages without receiving any requests. Finally, since LFC implements NI-level flow control, it does not need to swap buffers back and forth between host and NI memory. The price we pay for this simplicity is the restriction on the tree topologies that can be used by MCAST. FM/MC can use arbitrary multicast trees.

LFC and FM/MC also differ in the way credits are obtained for multicast packets. LFC is optimistic in that a sender waits only for credits for its children in the multicast tree. In FM/MC, a sender waits until every receiver has space before sending the next multicast packet(s). On the other hand, host buffers are not as scarce a resource as NI buffers, so waiting in FM/MC may occur less frequently.

Both protocols have potential scalability problems. LFC partitions NI receive buffers among all senders. FM/MC, in contrast, allows NI buffers to be shared, which is attractive when the amount of NI memory is small and the number of nodes in the system large. Given a reasonable amount of memory on the NI and a modest number of nodes, however, LFC's partitioning of NI buffers poses no problems, and obviates the need for FM/MC's buffer swapping mechanism.


FM/MC has other scalability problems. First, all credit requests are processed by a single NI, which introduces a potential bottleneck. Second, to recover multicast credits, FM/MC uses a rotating token. The credit manager may have to wait for a complete token rotation until it can satisfy a request for credits.

4.6 Summary

This chapter has given a detailed description of LFC's NI-level protocols for reliable point-to-point communication, reliable multicast communication, and for interrupt management. The UCAST protocol provides reliable point-to-point communication between NIs. MCAST extends UCAST with multicast support, but is not deadlock-free for all multicast trees. RECOV extends MCAST with deadlock recovery and the resulting protocol is deadlock-free for all multicast trees. Finally, INTR is a refined software polling watchdog protocol. INTR delays the generation of network interrupts as much as possible by monitoring the behavior of the host.


Chapter 5

The Performance of LFC

This chapter evaluates the performance of LFC's unicast, fetch-and-add, and broadcast primitives using microbenchmarks. The performance of client systems and applications, which is what matters in the end, will be studied in subsequent chapters.

5.1 Unicast Performance

First we discuss the latency and throughput of LFC's point-to-point message-passing primitive, lfc_ucast_launch(). Next, we describe LFC's point-to-point performance in terms of the LogGP [3] parameters.

5.1.1 Latency and Throughput

Figure5.1shows LFC’s one-way latency for differentreceive methods.All mes-sagesfit in asinglepacketandarereceivedthroughpolling. Weusedthefollowingreceivemethods:

1. No-touch. The receiving process is notified of a packet's arrival (in host memory), but never reads the packet's contents. This type of receive behavior is frequently used in latency benchmarks and gives a good indication of the overheads imposed by LFC.

2. Read-only. The receiving process uses a for loop to read the packet's contents into registers (one 32-bit word per loop iteration), but does not write the packet's contents to memory. This type of behavior can be expected for small control messages, which can immediately be interpreted by the receiving processor.


[Figure: one-way latency in microseconds versus message size (0 to 1000 bytes) for the read-only, copy, and no-touch receive methods.]

Fig. 5.1. LFC's unicast latency.

3. Copy. The receiving process copies the packet's contents to memory. (This is done using an efficient string-move instruction.) This is the typical behavior for larger messages that need to be copied to a data structure in the receiving process's address space.

The one-way latency of an empty packet is 11.6 µs (for all receive strategies). This does not include the cost of packet allocation at the sending side and packet deallocation at the receiving side, because both can be performed off the critical path. Send packet allocation costs 0.4 µs, receive packet deallocation costs 0.7 µs.

As expected, the read-only and copy latencies are higher than the no-touch latencies, because the receiving process incurs cache misses when it reads the packet's contents. Surprisingly, however, the copy variant is faster than the read-only variant. This has two causes. First, the copy variant always copies incoming packets to the same, write-back cached destination buffer, so the writes that result from the copy do not generate any memory traffic (as long as the buffer fits in the cache). Second, the read-only variant accesses the data by means of a for loop, while the copy variant uses a fast string-copy instruction.
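
To make the difference between the two receive methods concrete, the fragment below sketches the kind of code each variant boils down to. The function names are illustrative, and the copy uses memcpy(), which compilers typically lower to a string-move sequence; this is not LFC's exact instruction sequence.

#include <string.h>

/* Read-only receive: one 32-bit load per loop iteration; every cache
 * miss stalls the loop and nothing useful is kept afterwards. */
static unsigned receive_read_only(const Packet *pkt)
{
    const unsigned *words = (const unsigned *)pkt->data;
    unsigned i, sum = 0;

    for (i = 0; i < PACKET_SIZE / sizeof(unsigned); i++)
        sum += words[i];          /* force the loads into registers */
    return sum;
}

/* Copy receive: a single bulk copy into a reused, write-back cached
 * buffer; the stores stay in the cache as long as the buffer fits. */
static unsigned char copy_buffer[PACKET_SIZE];

static void receive_copy(const Packet *pkt)
{
    memcpy(copy_buffer, pkt->data, PACKET_SIZE);
}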

Figure 5.2 shows the one-way unicast throughput using the same three receive methods. (Note that the message size axis has a logarithmic scale.) The sawtooth shape of the curves is caused by fragmentation. Messages larger than 1 Kbyte are split into multiple packets. If the last of these packets is not full, the overall efficiency of the message transfer decreases. This effect is strongest when the last fragment is nearly empty.
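
As a worked example (assuming, as stated above, a 1 Kbyte maximum packet payload): a 2.5 Kbyte message is sent as two full packets plus one half-full packet, so the per-packet overheads are amortized over only about 83% of the maximum payload, whereas a 2 Kbyte or 3 Kbyte message uses full packets only. This is exactly the sawtooth pattern visible in Figure 5.2.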

With the no-touch receive strategy, the maximum throughput is 72.0 Mbyte/s. Reading the data reduces the throughput significantly. For message sizes up to 128 Kbyte, copying the data is faster than just reading it, for the reasons described above (write-back caching and a fast copy instruction).


[Figure: one-way throughput in Mbyte/s versus message size (64 bytes to 1 Mbyte, logarithmic scale) for the no-touch, copy, and read-only receive methods.]

Fig. 5.2. LFC's unicast throughput.

For large messages, however, the throughput of the copy variant drops noticeably (e.g., from 63.2 Mbyte/s for 1 Kbyte messages to 33.2 Mbyte/s for 1 Mbyte messages). For messages up to 256 Kbyte, the receive buffer used by the copy variant ought to fit in the processor's second-level data cache. Since the L2 cache is physically indexed, however, some of the buffer's memory pages may map to the same cache lines as other pages. The exact page mappings vary from run to run, but for larger buffers, conflicts are more likely to occur. Conflicting page mappings result in conflict misses during the copy operation and these misses generate memory traffic. This traffic competes for bus and memory bandwidth with incoming network packets and reduces the overall throughput. Messages larger than 256 Kbyte no longer fit in the L2 cache. Cache misses are guaranteed to occur when such a message is copied.

Although LFC achieves good throughput, it is not able to saturate the hardware bottleneck, the I/O bus, which has a bandwidth of 127 Mbyte/s. The main reason is that LFC uses programmed I/O (with write combining) at the sending side. As shown in Figure 2.2, this data transfer mechanism cannot saturate the I/O bus. This problem can be alleviated by switching to DMA, but DMA introduces its own problems (see Section 2.2).

5.1.2 LogGP Parameters

While the throughput and latency graphs are useful, additional insight can be gained by measuring LFC's LogGP parameters. LogGP [3, 102] is an extension of the LogP model [40, 41], which characterizes a communication system in terms of


four parameters: latency (L), overhead (o), gap (g), and the number of processors (P).

Latency is the time needed by a small message to propagate from one computer to another, measured from the moment the sending processor is no longer involved in the transmission to the moment the receiving processor can start processing the message. Latency includes the time consumed by the sending NI to transmit the message and the time consumed by the receiving NI to deliver the message to the receiving host. This definition of latency is different from the sender-to-receiver latency discussed above. The sender-to-receiver latency includes send and receive overhead (described below), which are not included in L.

Overhead is the time a processor spends on transmitting or receiving a message. We will make the usual distinction between send overhead (o_s) and receive overhead (o_r). Overhead does not include the time spent on the NI and on the wire. Unlike latency, overhead cannot be hidden.

The gap is the reciprocal of the system's small-message bandwidth. Successive packets sent between a given sender-receiver pair are always at least g microseconds apart from each other.

The LogP model assumes that messages carry at most a few words. Large messages, however, are common in many parallel programs and many communication systems can transfer a single large message more efficiently than many small messages. LogGP therefore extends LogP with a parameter, G, that captures the bandwidth for large messages. For LFC, we define G as the peak throughput under the no-touch receive strategy.

To measure the values of the LogGP parameters for LFC, we used benchmarks similar to those described by Iannello et al. [69]. The resulting values are given in Table 5.1. All values, except the value of G, were measured using messages with a payload of 16 bytes.

The sender-to-receiver latencies reported in Figure 5.1 are related to the LogGP parameters as follows:

sender-to-receiver latency = o_s + L + o_r.

The sender-to-receiver latency of a 16-byte message is thus 1.6 + 8.2 + 2.2 = 12.0 µs, slightly more than the latency of a zero-byte message (11.6 µs). The largest part (L) of the sender-to-receiver latency is spent on the sending and receiving NIs; this part can in principle be overlapped with useful work on the host processor.

5.1.3 Interrupt Overhead

All previous measurements were performed with network interrupts disabled. We now briefly consider interrupt-driven message delivery.


LogGP parameter          Value
Send overhead (o_s)      1.6 µs
Receive overhead (o_r)   2.2 µs
Latency (L)              8.2 µs
Gap (g)                  5.6 µs
Big gap (G)              72.0 Mbyte/s

Table 5.1. Values of the LogGP parameters for LFC.

As explained in Section 3.7, LFC delays interrupts in the hope that the destination process will poll before the interrupt needs to be raised. The default delay is 70 µs. This value, approximately twice the interrupt overhead (see below), was determined after experimenting with some of the applications described in Chapter 8.

To measure the overhead of interrupt-driven message delivery, we set the delay to 0 µs and measure the null unicast latency in two different ways: first, using a program that does not poll for incoming messages and that uses interrupts to receive messages; second, using a program that does poll and that does not use interrupts. The difference in latency measured by these two programs stems from interrupt overhead. Using this method, we measure an interrupt overhead of 31 µs, which is more than twice the unicast null latency. This time includes context-switching from user mode to kernel mode, interrupt processing inside the kernel and the Myrinet device driver, dispatching the user-level signal handler, and returning from the signal handler (using the sigreturn() system call). Notice that the last overhead component, the signal handler return, need not be on the critical communication path: a message can be processed and a reply can be sent before returning from the signal handler.

This large interrupt overhead is not uncommon for commercial operating systems. In user-level communication systems with low-latency communication, however, the high cost of using interrupts should clearly be avoided whenever possible. It is exactly this high cost that motivated the addition of a polling-watchdog mechanism to LFC.

5.2 Fetch-and-Add Performance

A contention-free fetch-and-add operation on a variable in local NI memory costs 20.7 µs. Since we have not optimized this local case, an FA REQUEST and an FA REPLY message will be sent across the network. Each message travels to the switch to which the sending NI is attached; this switch then sends the message back to the same NI.

A contention-free F&A operation on a remote F&A variable costs 19.8 µs, which is slightly less expensive than the local case.


Fig. 5.3. Fetch-and-add latencies under contention: latency (microseconds) versus number of processors, with an inset for up to 5 processors.

To detect the F&A reply, the host that issued the operation polls a location in its NI's memory. In the local case, this polling steals memory cycles from the local NI processor, which needs to process the F&A request and send the F&A reply (to itself). In the remote case, receiving the request, processing it, and sending the reply are all performed by a remote NI, without interference from a polling host.

Figure 5.3 shows how F&A latencies increase under contention. In this benchmark, each processor, except the processor where the F&A variable is stored, executes a tight loop in which an F&A operation is issued. With as few as three processors (see inset), the NI that services the F&A variables becomes a bottleneck and the latencies increase rapidly.
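To illustrate how a client might use this primitive, the sketch below uses an NI-resident F&A variable as a global ticket counter for distributing work. The function name lfc_fetch_and_add() and its signature are assumptions made for this example; they are not necessarily LFC's actual interface.

/* Assumed signature: atomically add 'inc' to F&A variable 'var' on the NI of
 * processor 'home' and return the variable's previous value. */
extern int lfc_fetch_and_add(int home, int var, int inc);

extern void do_job(int job);             /* application-specific work (assumed) */

#define HOME_NODE 0                      /* processor whose NI stores the counter */
#define COUNTER   0                      /* index of the F&A variable on that NI */

/* Every process runs this loop; each job index is handed out exactly once. */
void process_jobs(int njobs)
{
    for (;;) {
        int job = lfc_fetch_and_add(HOME_NODE, COUNTER, 1);
        if (job >= njobs)
            break;
        do_job(job);
    }
}

Note that, as Figure 5.3 shows, the NI that holds the counter becomes a bottleneck when many processors issue such operations concurrently.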

5.3 Multicast Performance

The multicast performance evaluation is organized as follows. First, we present latency, throughput, and forwarding overhead results for LFC's basic protocol. Next, we evaluate the deadlock recovery protocol under various conditions. In Chapter 8 we will also compare the performance of LFC's multicast protocol with protocols that use host-level forwarding.

5.3.1 Performance of the Basic Multicast Protocol

We first examine the latency, throughput, and forwarding overhead of LFC's basic multicast protocol. The basic protocol uses binary trees and therefore does not need the deadlock recovery mechanism.


Fig. 5.4. LFC's broadcast latency: latency (microseconds) versus message size (bytes) for 4, 8, 16, 32, and 64 processors.

In the following measurements, the maximum packet size is 1 Kbyte and each sender is allocated 512/P send credits,

where P is the number of processors. All packets are received through polling. We use the copy receive strategy described in Section 5.1.

Figure 5.4 shows the broadcast latencies of the basic protocol. We define broadcast latency as the latency between the sender and the last receiver of the broadcast packet. As expected, the latency increases linearly with increasing message size and logarithmically with increasing numbers of processors. With a 4-node configuration, we obtain a null latency of 22 µs; with 64 nodes, the null latency is 72 µs.

Figure 5.5 shows the throughput of the basic protocol. Throughput is measured using a straightforward blast test in which the sender transmits many messages of a given size and then waits for an (empty) acknowledgement message from all receivers. We report the throughput observed by the sender.

All curves decline for large messages. This is due to the same cache effect as described in Section 5.1. For messages that fit in the cache, the maximum multicast throughput (41 Mbyte/s for 4 processors) is less than the maximum unicast throughput (63 Mbyte/s) and decreases as the number of processors increases. These effects are due to the following throughput-limiting factors:

1. NI memory supports at most two memory accesses per NI clock cycle (33.33 MHz). Memory references are issued by the NI's three DMA engines and by the NI processor. None of these sources can issue more than one memory reference per clock cycle. As a result, the maximum packet receive and send rate is 127 Mbyte/s (33.33 MHz × 32 bits). This maximum can be obtained only with large packets; with LFC's 1 Kbyte packets, the maximum DMA speed is approximately 120 Mbyte/s.


Fig. 5.5. LFC's broadcast throughput: throughput (Mbyte/s) versus message size (bytes, logarithmic scale) for 4, 8, 16, 32, and 64 processors.

2. The fanout of the multicast tree determines the number of forwarding transfers that an NI's send DMA engine must make. Since these forwarding transfers must be performed serially, the attainable throughput is proportional to the reciprocal of the fanout. Binary trees have a maximum fanout of two. The maximum throughput that can be obtained is therefore 120/2 = 60 Mbyte/s.

3. High throughput can only be obtained if the NI processor manages to keep its DMA engines busy.

4. Multicast packets travelling across different logical links in the multicast tree may contend for the same physical links. The amount of contention depends on the mapping of processes to processors. Since different mappings can give very different performance results, we used a simulated-annealing algorithm to obtain reasonable mappings for all single-sender throughput benchmarks. The resulting mappings give much better performance than random mappings, but are not necessarily optimal, because simulated annealing is not guaranteed to find a true optimum and because we did not take acknowledgement traffic into account.

The first three factors explain why we do not attain the unicast throughput, but not why throughput decreases when the number of processors is increased. Since the amount of forwarding work performed by the bottleneck nodes of the multicast tree—i.e., internal nodes with two children—is independent of the size of the multicast tree, we would expect that throughput is also independent of the size of the tree.


Fig. 5.6. Tree used to determine the multicast gap: a single forwarding node whose fanout is P − 2.

Fig. 5.7. LFC's broadcast gap: multicast gap (microseconds) versus fanout, showing the measured gaps and a least-squares fit.

In reality, adding more processors increases the demand for physical network links. The resulting contention leads to decreased throughput.

To estimate the amount of work performed by the NI to forward a multicast packet, we can measure the multicast gap g(P). This is the reciprocal of the multicast throughput on P processors for small (16-byte) messages. We assume that the multicast gap is proportional to the fanout F(P) of the bottleneck node(s) in the multicast tree. That is, we assume g(P) = α·F(P) + β, for some α and β. By varying F(P) and measuring the resulting multicast gaps g(P), we can find α and β. LFC's binary trees, however, have a fixed fanout of two. To vary F(P), we therefore use multicast trees that have the shape shown in Figure 5.6. In such trees, there is exactly one forwarding node and the fanout of the tree is F(P) = P − 2.
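The fit itself is an ordinary linear least-squares problem. For completeness, the standard closed form for α and β, given measurement pairs (F_i, g_i), is shown below; this is the textbook formula, not a derivation specific to this thesis.

\[
\alpha = \frac{\sum_i (F_i - \bar{F})(g_i - \bar{g})}{\sum_i (F_i - \bar{F})^2},
\qquad
\beta = \bar{g} - \alpha \bar{F},
\qquad
\bar{F} = \frac{1}{n}\sum_i F_i,\quad \bar{g} = \frac{1}{n}\sum_i g_i .
\]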

We measured the multicast gaps for various numbers of processors and used a least-squares fit to determine α and β from these measurements. The measured gaps and the resulting fit are shown in Figure 5.7; we found α = 4.6 µs per forward and β = 5.2 µs.


Fig. 5.8. Impact of deadlock recovery on broadcast throughput: throughput (Mbyte/s) versus message size (bytes) for the basic and recovery protocols on 8, 16, 32, and 64 processors.

The forwarding time per destination (α) is fairly large; this time is used to look up the forwarding destination, to allocate a send descriptor, to enqueue this descriptor, and to initiate the send DMA.

5.3.2 Impact of Deadlock Recovery and Multicast Tree Shape

We assume that most programs will rarely multicast in such a way that deadlock recovery is necessary. It is therefore important to know the overhead that the protocol imposes without recovery.

Enabling deadlock recovery does not increase the broadcast latency. In the latency benchmark, the condition that triggers deadlock recovery (an NI has run out of send credits) is never satisfied, so the recovery code is never executed. Also, the test for the condition does not add any overhead, because it takes place both in the basic and in the enhanced protocol.

Throughput is affected by enabling deadlock recovery. Figure 5.8 shows single-sender throughput results for 8, 16, 32, and 64 processors, with and without deadlock recovery. For the recovery protocol measurements, we used almost the same configuration as for the basic protocol measurements shown in Figure 5.5. The only difference is that we enabled deadlock recovery and added P − 1 NI receive buffers on each NI (see Section 4.3). We use the same binary trees as before, so no true deadlocks can occur. Since our deadlock recovery scheme makes a local, conservative decision, however, it still signals deadlock.

Our measurements indicate that, with a few exceptions, all recoveries are initiated by the root of the multicast tree. The deadlocked subtrees that result from these recoveries are small. With 64 processors and 64 Kbyte messages, for example,


each deadlock recovery results (on average) in 1.1 packets being sent via a deadlock recovery channel. In almost all cases, the deadlocked subtree consists of the root and one of its children. The root performs less work than its children, because it does not have to receive and deliver incoming multicast packets. Consequently, the root can send out multicast packets faster than its children can process them and therefore runs out of credits, which triggers a deadlock recovery.

For 8 processors, the performance impact of these recoveries is small, but for larger numbers of processors, the number of recoveries increases and the throughput decreases significantly. With 64 Kbyte messages, for example, the number of recoveries per multicast increases from 0.02 for 8 processors to 0.37 for 64 processors (on average). This increase in the number of recoveries and the decrease in throughput appear to be caused by increased contention on the physical network links (due to the larger number of processors). This contention delays acknowledgements, which causes senders to run out of credits more frequently. The problem is aggravated slightly by a fast path in the multicast code that tries to forward incoming multicast packets immediately. If this succeeds, the send channel is blocked by an outgoing data packet and cannot be used by an acknowledgement. Removing this optimization improves performance when deadlock recovery is enabled, but reduces performance when deadlock recovery is disabled.

Figure 5.9 shows the impact of tree shape on broadcast throughput. We compare binary trees and linear chains on 64 processors. We use a large number of processors, because that is when we expect the performance characteristics of different tree shapes to be most visible. In both cases, deadlock recovery was enabled, because with chains, deadlocks can occur when multiple senders multicast concurrently. With a single sender, however, deadlocks cannot occur, and we expect low-fanout trees to perform best. Figure 5.9 confirms this expectation. The linear chain (fanout 1) performs better than the binary tree (fanout 2).

To test the behavior of the deadlock recovery scheme in the case that deadlocks can occur, we performed an all-to-all benchmark in which all senders broadcast simultaneously. In this case, true deadlocks may occur for the chain. Figure 5.10 shows the per-sender throughput on 64 processors for the chain and the binary tree. Due to contention for network resources (links, buffers, and NI processor cycles), the per-sender throughput is much lower than in the single-sender case.

As expected, the chain performs worse than the binary tree. Table 5.2 reports several statistics that illustrate the behavior of the recovery protocol for both tree types. These statistics are for an all-to-all exchange of 64 Kbyte messages on 64 processors. In this table, L_used is the number of logical links between pairs of NIs that are used in the all-to-all test; L_av is the total number of logical links available (64 × 63). L_used/L_av indicates how well a particular tree spreads its packets over different links. RC is the number of deadlock recoveries; M is the number of multicasts. RC/M indicates how often a recovery occurs.


Fig. 5.9. Impact of tree topology on broadcast throughput: throughput (Mbyte/s) versus message size (bytes) for a chain and a binary tree on 64 processors.

Fig. 5.10. Impact of deadlock recovery on all-to-all throughput: per-sender throughput (Mbyte/s) versus message size (bytes) for a binary tree and a chain on 64 processors.


Tree shape   L_used/L_av   RC/M   D/RC + 1
Binary       0.51          0.38   3.3
Chain        0.02          1.00   63.9

Table 5.2. Deadlock recovery statistics. L_used — number of NI-to-NI links used in this test; L_av — number of NI-to-NI links available; RC — number of deadlock recoveries; M — number of multicasts; D — number of packets that use a deadlock recovery channel.

D is the number of packets that are forced to travel through a slow deadlock recovery channel. D/RC + 1 is the average size of the deadlocked subtrees. These statistics show that, with chains, every multicast triggers a deadlock recovery action. What is worse, these recoveries involve all 64 NIs, so each multicast packet uses a slow deadlock channel. With chains, this benchmark can use only a small fraction (0.02) of the available NI receive buffer space. In combination with the high communication load, this causes senders to run out of credits and triggers deadlock recoveries. With binary trees, which make better use of the available NI buffer space, deadlock recoveries occur less frequently and the size of the deadlocked subtrees is much smaller.

Summarizing, we find that deadlock recovery is triggered frequently, but that its effect on performance is modest when the communication load is moderate and true deadlocks do not occur. To reduce the number of false alarms, a more refined, timeout-based mechanism could be used [96]. This mechanism does not signal deadlock immediately when a blocked-sends queue becomes nonempty, but starts a timer and waits for a timeout to occur. The timer is canceled when a send descriptor is removed from the blocked-sends queue.

5.4 Summary

In this chapter, we have analyzed the performance of LFC using microbenchmarks. LFC's point-to-point performance is comparable to that of existing user-level communication systems and does not suffer much from the NI-level flow control protocol. Specifically, on the same hardware LFC achieves similar point-to-point latencies as Illinois Fast Messages (version 2.0), which uses a host-level variant of the same flow control protocol. Running the protocol between NIs, however, allows for a simple and efficient multicast implementation.

Due to our use of programmed I/O at the sending side, LFC cannot attain the maximum available throughput, but we believe that LFC's throughput is still sufficiently high that this is not a problem in practice. At least one study suggests that parallel applications are more sensitive to send and receive overhead than to throughput [102].


LFC’s NI-level fetch-and-addimplementationis not muchfasterthana host-level implementationwould be,but hastheadvantagethat it neednever generateaninterrupt.

LFC’s NI-level multicastalso avoids unnecessaryinterrupts(during packetforwarding)andeliminatesan unnecessarydatatransfer. The binary treesusedby thebasicmulticastprotocolgivegoodlatency andthroughput.Client systemsthatwish to optimizefor eitherlatency or throughputcanuseothertreeshapesifthey useLFC’s extendedmulticastprotocolwhich includesa deadlockrecoverymechanism.We showed that the extendedprotocolallows clients to obtain theadvantagesof aspecifictreeshapesuchasthechain.


Chapter 6

Panda

This chapter describes the design and implementation of the Panda communication system on top of LFC. Panda is a multithreaded communication library that provides flow-controlled point-to-point message passing, remote procedure call, and totally-ordered broadcast. Since 1993, versions of Panda have been used by implementations of the Orca shared object system [9, 19, 124], the MPI message-passing system, a parallel Java system [99], an implementation of Linda tuple spaces on MPI [33], a subset of the PVM message-passing system [124], the SR programming language [124], and a parallel search system [121].

Portability and efficiency have been the main goals of all Panda implementations. Panda has been ported to a variety of communication architectures. The first Panda implementation [19] was constructed on a cluster of workstations, using the UDP datagram protocol. Since then, versions of Panda have been ported to the Amoeba distributed operating system [111], to a transputer-based system [65], to MPI for the IBM SP/2, to active messages for the Thinking Machines CM-5, to Illinois Fast Messages for Myrinet clusters [10], and to LFC, also for Myrinet clusters. This chapter, however, focuses on a single Panda implementation: Panda 4.0 on LFC. Panda 4.0 differs substantially from the original system, which had no separate message-passing layer and had a different message interface [19].

The second goal, efficiency, conflicts with the first, portability. Whereas portability favors a modular, layered structure, efficiency dictates an integrated structure. In this chapter we demonstrate that Panda can be implemented efficiently (on LFC) without sacrificing portability. This efficiency results from Panda's flexible internal structure, carefully designed interfaces (both in Panda and LFC), exploiting LFC's communication mechanisms, and integrating multithreading and communication.

In this chapter, we show how Panda uses LFC's packet-based interface, NI-level multicast and fetch-and-add, and polling watchdog to implement the following abstractions:


1. Asynchronous upcalls. Since many PPSs require asynchronous message delivery, Panda delivers all incoming messages by means of asynchronous upcalls. To avoid the use of interrupts for each incoming message, Panda transparently and dynamically switches between using polling and interrupts. To choose the most appropriate mechanism, Panda uses thread-scheduling information.

2. Stream messages. LFC's packet-based communication interface allows a flexible implementation of stream messages [112]. Stream messages allow end-to-end pipelining of data transfers. Our implementation avoids unnecessary copying and decouples message notification and message consumption.

3. Totally-ordered broadcast. Panda provides a simple and efficient implementation of a totally-ordered broadcast primitive. The implementation builds on LFC's multicast and fetch-and-add primitives.

The first section of this chapter gives an overview of Panda 4.0. It describes Panda's functionality and its internal structure. Section 6.2 describes how Panda integrates communication and multithreading. Section 6.3 studies the implementation of Panda's message abstraction. Section 6.4 describes Panda's totally-ordered broadcast protocol. Section 6.5 reports on the performance of Panda on LFC and Section 6.6 discusses related work.

6.1 Overview

This section gives an overview of Panda's functionality, describes the internal structure of Panda, and discusses the key performance issues in Panda's implementation.

6.1.1 Functionality

PandaextendsLFC’s low-level message-passingfunctionality in several ways.First, processesthat communicateusingPandaexchangemessagesof arbitrarysizeratherthanpacketswith amaximumsize.

Second, Panda implements demultiplexing. Each Panda module (described later) allows its clients to construct communication endpoints and to associate a handler function with each endpoint. Senders direct their messages to such endpoints. When a message arrives at an endpoint, Panda invokes the associated message handler and passes the message to this handler. LFC, in contrast, dispatches all network packets that arrive at a processor to a single packet handler.


Fig. 6.1. Different broadcast delivery orders: the four possible delivery orders of two concurrent broadcasts by processes A and B; only the first two orders are totally ordered.

Third, Panda supports multithreading. Panda clients can create threads and use them to structure a system or to overlap communication and computation.

Finally, Panda supports several high-level communication primitives, all of which operate on messages of arbitrary length. With one-way message passing, messages can be sent from one process to an endpoint in another process. Such messages can be received implicitly, by means of upcalls, or explicitly, by means of blocking downcalls. Remote procedure call, a well-known communication mechanism in distributed systems [23], allows a process to invoke a procedure (an upcall) in a remote process and wait for the result of the invocation. A totally-ordered broadcast is a broadcast with strong ordering guarantees. It has many applications in systems that need to maintain consistent replicas of shared data [75]. Specifically, a totally-ordered broadcast primitive guarantees that when n processes receive the same set of broadcast messages, then

1. All messages sent by the same process will be delivered to their destinations in the same order they were sent.

2. All messages (sent by any process) will be delivered in the same order to all n processes.

The first requirement (FIFOness) is also satisfied by LFC's broadcast primitive, but the second is not.

Figure 6.1 illustrates the difference between a FIFO broadcast and a totally-ordered broadcast. Two processes A and B concurrently broadcast a message. The figure shows all four possible delivery orders. With a FIFO broadcast, all delivery orders are valid. With a totally-ordered broadcast, however, only the first two scenarios are possible. In the third and fourth scenario, processes A and B receive the two messages in different orders.

Both message passing and RPC are subject to flow control. If a receiving process does not consume incoming messages, Panda will stall the sender(s) of


those messages. At present, no flow control is implemented for Panda's totally-ordered broadcast. Scalable multicast flow control is difficult due to the large number of receivers that a sender needs to get feedback from. LFC's current set of clients and applications, however, works fine without multicast flow control, for two reasons. First, due to application-level synchronization or due to the use of collective-communication operations, many applications have at most one broadcasting process at any time, which reduces the pressure on receiving processes. Second, a receiver that disables network interrupts and that does not poll will eventually stall all nodes that try to send to it. If there is only a single broadcasting process, this back-pressure is sufficient to stall that process when needed. This is an all-or-nothing mechanism, though: either the receiver accepts incoming packets from all senders or it does not receive any packets at all.

Since Panda adds considerable functionality to LFC's simple unicast and multicast primitives, the question arises why this functionality is not part of LFC itself. The answer is that not all client systems need this extra functionality. The CRL DSM system (discussed in Section 7.3), for example, needs little more than efficient point-to-point message passing and interrupt management. Adding extra functionality would introduce unnecessary overhead.

6.1.2 Structure

Panda consists of several modules, which can be configured at compile time to build a complete Panda system. Figure 6.2 illustrates the module structure. Each box represents a module and lists its main functions. The arrows represent dependencies between modules. A detailed description of the module interfaces can be found in the Panda documentation [62].

The System Module

The most important module is the system module. For portability, Panda allows only the system module to access platform-specific functionality such as the native message-passing primitives. All other modules must be implemented using only the functionality provided by the system module.

The system module implements threads, endpoints, messages, and Panda's low-level unicast and multicast primitives. The thread and message abstractions are used in all Panda modules and by Panda clients. The unicast and multicast functions, however, are mainly used by Panda's message-passing and broadcast modules to implement higher-level communication primitives. Panda clients use these higher-level primitives.

One way to achieve both portability and efficiency would be to define only a high-level interface that every Panda system must implement.


Fig. 6.2. Panda modules: functionality and dependencies (system module: unicast and multicast, demultiplexing via endpoints, protocol header stacks, threads; message-passing module: messages of arbitrary length, reliable unicast and multicast, demultiplexing via ports; RPC module: reliable remote procedure call; broadcast module: messages of arbitrary length, reliable broadcast, totally ordered broadcast).

The fixed interface ensures that Panda clients will run on any platform that the interface has been implemented on. Also, if only the top-level interface is defined, the implementation of Panda can exploit platform-specific properties. For example, if the message-passing primitive of the target platform is reliable, then Panda need not buffer messages for retransmission.

The main problem with this single-interface approach is that Panda must be implemented from scratch for every target platform. In practice, platforms vary in a relatively small number of ways, so implementing Panda from scratch for every platform would lead to code duplication. To allow code reuse, Panda hides platform-specific code in the system module and requires that all other modules be implemented in terms of the system module. These modules can thus be reused on other platforms. The system-module functions have a fixed signature (i.e., name, argument types, and return type), but the exact semantics of these functions may vary in a small number of predefined ways from one Panda implementation to another. The system module exports the particular semantics of its functions by means of compile-time constants that indicate which features or restrictions apply to the target platform. This way, the system module can convey the main properties of the underlying platform to higher-level modules. The properties exported by the system module do not have to match those of the underlying system. For some platforms, including LFC, the system module sometimes exports a stronger interface than provided by the underlying system.

Variation of semantics is possible along the following dimensions:


1. Maximum packet size. The system module can export a maximum packet size and leave fragmentation and reassembly to higher-level modules, or it can accept and deliver messages of arbitrary size.

2. Reliability. The system module can export an unreliable or a reliable communication interface.

3. Message ordering. The system module can export FIFO or unordered communication primitives. The system module can also indicate whether or not its broadcast primitive is totally-ordered.

Panda’s systemmodulefor LFC implementsfragmentationandreassembly, bothfor unicastand broadcastcommunication. (In this respect,the systemmodulethus exports a strongerinterfacethan LFC.) The systemmodulepreserves thereliability andFIFOnessof LFC’s unicastandmulticastprimitives,andexportsa reliable and FIFO interfaceto higher-level Pandamodules. In addition, thesystemmoduleimplementsatotally-orderedbroadcastusingLFC’sfetch-and-addandbroadcastprimitivesandexportsthis totally-orderedbroadcastto higher-levelmodules.With thisconfiguration,theimplementationof thehigher-level modulesis relatively simple. Fragmentation,reliability, andorderingareall implementedin lower layers.

High-Level Modules

We discuss three higher-level modules: the message-passing module, the RPC module, and the broadcast module. These modules can exploit the information conveyed by the system module's (compile-time) parameters. This is the only information about the underlying system available to the higher-level modules. These modules can therefore be reused in Panda implementations for another platform, provided that the system module for that platform implements the same (or stronger) semantics. It is not necessary to build implementations of higher-level modules for every possible combination of system-module semantics. First, many combinations are unlikely. For example, no system module provides reliable broadcast and unreliable unicast. Second, we only need an implementation for the system module that exports the weakest semantics (i.e., unreliable and unordered communication). Such an implementation is suboptimal for system modules that offer stronger primitives, but it will function correctly and can serve as a starting point for optimization.

The message-passing module implements reliable point-to-point communication and an endpoint (demultiplexing) abstraction. Since LFC is reliable and Panda's system module performs fragmentation and reassembly, the


message-passing module need only implement demultiplexing, which amounts to adding a small header to every message.

The RPC module implements remote procedure calls on top of the message-passing module. An RPC is implemented as follows. The client creates a request message and transmits it to the server process using the message-passing module's send primitive. The RPC module then blocks the client (using a condition variable). When the server receives the request, it invokes the request handler. This handler creates a reply message and transmits it to the client process. At the client side, the reply is dispatched to a reply handler (internal to the RPC module) which awakens the blocked client thread and which passes the reply message to the awakened client. Since the implementation is built upon the message-passing module's reliable primitives, it is small and simple.
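A condensed sketch of this client-side logic appears below. It assumes Panda's declarations are in scope; mp_send(), pan_mutex_unlock(), pan_cond_signal(), the record layout, and the exact signatures of the thread primitives are illustrative assumptions rather than Panda's actual interface, and the lookup of the call record from the reply header is omitted.

/* Per-call record shared between the calling thread and the reply handler. */
typedef struct {
    pan_mutex_p lock;
    pan_cond_p  done;            /* signalled when the reply has arrived */
    pan_msg_p   reply;           /* stream message holding the reply */
} rpc_call_t;

void rpc(int server, pan_iovec_p req, int veclen, void *hdr, int hdr_size,
         rpc_call_t *call)
{
    call->reply = NULL;
    mp_send(server, request_port, req, veclen, hdr, hdr_size);  /* message-passing module */

    pan_mutex_lock(call->lock);
    while (call->reply == NULL)          /* block the client on a condition variable */
        pan_cond_wait(call->done);
    pan_mutex_unlock(call->lock);
    /* call->reply can now be consumed with pan_msg_consume() */
}

/* Reply handler, run as an upcall on the client's reply port; in a real
 * implementation the rpc_call_t is located via an identifier in the header. */
void reply_handler(pan_msg_p msg, rpc_call_t *call)
{
    pan_mutex_lock(call->lock);
    call->reply = msg;
    pan_cond_signal(call->done);         /* wake the blocked client thread */
    pan_mutex_unlock(call->lock);
}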

The broadcast module implements a totally-ordered broadcast primitive. All broadcast messages are delivered in the same total order to all processes, including the sender. Since the system module already implements a totally-ordered broadcast—so that it can exploit LFC's multicast and fetch-and-add—the implementation of the broadcast module is also very simple.

6.1.3 Panda’sMain Abstractions

Below we describe the abstractions and mechanisms that are used throughout the Panda system and by Panda clients. On LFC, most of these abstractions are implemented in the system module.

Threads

Panda’s multithreadinginterfaceincludesthe datatypesand functionsthat arefound in mostthreadpackages:threadcreate,threadjoin, schedulingpriorities,locks,andconditionvariables.Sincemostthreadpackagesprovidesimilarmech-anisms,Panda’s threadinterfacecanusuallybe implementedusingwrapperrou-tinesthat invoke theroutinessuppliedby thethreadpackageavailableon thetar-get platform (e.g.,POSIX threads[110]). The LFC implementationof Panda’sthreadinterfaceusesa threadpackagecalledOpenThreads[63, 64]. Section6.2describeshow PandaorchestratestheinteractionsbetweenLFC andOpenThreadsin anefficientway.

Addressing

The system module implements an addressing abstraction called Service Access Points (SAPs). Messages transmitted by the system module are addressed to a SAP in a particular destination process. Since Panda does not implement a


name service, all SAPs must be created by all processes at initialization time. The creator of a SAP (a Panda client or one of Panda's modules) associates a receive upcall with each SAP. When a message arrives at a SAP, this upcall is invoked with the message as an argument. Panda executes at most one upcall per SAP at a time. Upcalls of different SAPs may run concurrently.

Panda’shigh-levelmodulesusesimilaraddressingandmessagedeliverymech-anismsasthesystemmodule.Themessage-passingmodule,for example,imple-mentsan endpointabstractioncalledports. On LFC, portsare implementedassystemmoduleendpoints(SAPs),but onotherplatformsthereneednotbeaone-to-onecorrespondencebetweenportsandSAPs.In thePanda4.0implementationon top of UDP, for instance,SAPsare implementedasUDP sockets,but portsarenot. The reasonfor this differenceis that communicationto UDP socketsisunreliable,while communicationto a port is reliable.

As to message delivery, almost all modules deliver messages by means of upcalls associated with communication endpoints (SAPs, ports, etc.). The RPC module is an exception, because RPC replies are delivered synchronously to the thread that sent the request. This thread is blocked until the reply arrives. RPC requests, on the other hand, are delivered by means of upcalls.

Message Abstractions

Panda provides two message abstractions: I/O vectors at the sending side and stream messages at the receiving side. Both abstractions are used by all Panda modules.

A sending process creates a list of pointers to buffers that it wishes to transmit and passes this list to a send routine. The buffer list is called an I/O vector. Panda's low-level send routines (described below) gather the data referenced by the I/O vector by copying the data into network packets. The main advantage of a gather interface over interfaces that accept only a single buffer argument is that it does not force the client to copy data into a contiguous buffer before invoking the send routine (which will have to copy the message at least once more, to the NI).

At the receiving side, Panda uses stream messages. Stream messages were introduced in Illinois Fast Messages (version 2.0) [112]. A stream message is a message of arbitrary length that can be accessed only sequentially (i.e., like a stream). A stream message behaves like a TCP connection that contains exactly one message. This property allows receivers of a message to begin consuming that message before it has been fully received. Incoming data can be copied to its final destination in a pipelined fashion, which is more efficient than first reassembling the complete message before passing it to the client.

Panda consists of multiple protocol layers and Panda clients may add their own protocol layers.


Fig. 6.3. Two Panda header stacks: the headers of the client protocols and of the Panda modules (RPC, message-passing, group, and system) occupy slices of a single contiguous buffer, with any remaining space unused.

Each layer typically adds a header to every message that it transmits. To support efficient header manipulation, we borrow an idea from the x-kernel [116]. All protocol headers that need to be prepended to an outgoing message are stored in a single, contiguous buffer. By using a single buffer to store protocol headers, Panda avoids copying its clients' I/O vectors just to prepend pointers to its own headers (which are small). Such buffers are called header stacks and are used both at the sending and the receiving side. Senders write (push) their headers to the header stack. Receivers read (pop) those headers in reverse order. Both pushing and popping headers are simple and efficient operations.

Figure 6.3 shows two Panda protocol stacks and the corresponding header stacks. Each protocol layer, whether internal or external to Panda, exports how much space in the header stack it and its descendants in the protocol graph need. Other protocols can retrieve this information and use it to determine where in the header stack buffer they must store their header.
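A minimal sketch of the header-stack idea is given below. The helper names (hdr_push, hdr_pop) and the bookkeeping are illustrative; Panda computes each layer's offset from the space requirements that the layers export, rather than from a running top-of-stack pointer.

/* One contiguous buffer holds the headers of all protocol layers. */
typedef struct {
    char *buf;      /* header-stack buffer, sized from the layers' declared needs */
    int   size;     /* total size of the buffer */
    int   top;      /* offset of the most recently pushed/next header */
} hdr_stack_t;

/* Sender side: push this layer's header and return where to write its fields.
 * top starts at size, so the layer that pushes last ends up at the front. */
static void *hdr_push(hdr_stack_t *s, int hdr_size)
{
    s->top -= hdr_size;
    return s->buf + s->top;
}

/* Receiver side: pop the next header; top starts at 0, so headers are read
 * in the reverse of the order in which they were pushed. */
static void *hdr_pop(hdr_stack_t *s, int hdr_size)
{
    void *hdr = s->buf + s->top;
    s->top += hdr_size;
    return hdr;
}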

Send and Receive Interfaces

Table 6.1 gives the signatures of the system module's communication routines. Pan_unicast() gathers and transmits a header stack (proto) and an I/O vector (iov) to a SAP (sap) in the destination process (dest). When the destination process receives the first packet of the sender's message, it invokes the handler routine associated with the service access point parameter (sap). The handler routine takes a header stack and a stream message as its arguments. The receiving process can read the stream message's data by passing the stream message to pan_msg_consume(). The header stack can be read directly.


void pan_unicast(unsigned dest, pan_sap_p sap, pan_iovec_p iov, int veclen, void *proto, int proto_size)

Constructs a message out of all veclen buffers in I/O vector iov. Buffer proto, of length proto_size, is prepended to this message and is used to store headers. The message is transmitted to process dest. When dest receives the message, it invokes the upcall associated with sap, passing the message, its headers, and its sender as arguments.

void pan_multicast(pan_pset_p dests, pan_sap_p sap, pan_iovec_p iov, int veclen, void *proto, int proto_size)

Behaves like pan_unicast(), but multicasts the I/O vector to all processes in dests, not just to a single process.

int pan_msg_consume(pan_msg_p msg, void *buf, unsigned size)

Tries to copy the next size bytes from message msg to buf and returns the number of bytes that have been copied. This number equals size unless the client tried to consume more bytes than the stream contains. When the last byte has been consumed, the message is destroyed.

Table 6.1. Send and receive interfaces of Panda's system module.

Pan_msg_consume() consumes the next n bytes from the stream message and copies them to a client-specified buffer. If less than n bytes have arrived at the time that pan_msg_consume() is called, pan_msg_consume() blocks and polls until at least n bytes have arrived.

In many cases, the message handler reads all of a stream message's data. If this is inconvenient, however, then the receiving process can store the stream message, return from the handler, and consume the data later. Storing the stream message consists of copying a pointer to the message data structure; the contents of the message need not be copied.

Other Panda modules have similar send and receive signatures. In all cases, the send routine accepts an I/O vector with buffers to transmit and the receive upcall delivers a stream message and a protocol header stack.
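To illustrate the interfaces of Table 6.1, the fragment below sends a message whose length is carried in the header stack and consumes it in the receiving SAP's upcall. The upcall signature, the pan_iovec_t field names, and the use of the first word of the header are assumptions for this example; pan_unicast() and pan_msg_consume() follow the signatures given in the table, and Panda's declarations are assumed to be in scope.

/* Sender: ship a body of body_len bytes; our one-word header carries the length. */
void send_example(unsigned dest, pan_sap_p sap, char *body, int body_len)
{
    char        proto[16];      /* header-stack buffer for this layer */
    pan_iovec_t iov[1];         /* I/O-vector element, assumed to have data/len fields */

    *(int *)proto = body_len;
    iov[0].data = body;
    iov[0].len  = body_len;

    pan_unicast(dest, sap, iov, 1, proto, sizeof proto);
}

/* Receiver: upcall associated with the SAP (exact signature assumed). */
void sap_upcall(unsigned sender, void *proto, pan_msg_p msg)
{
    int  remaining = *(int *)proto;      /* the header stack can be read directly */
    char chunk[4096];

    while (remaining > 0) {              /* consume the stream in pipelined chunks */
        int n = remaining < (int)sizeof chunk ? remaining : (int)sizeof chunk;
        pan_msg_consume(msg, chunk, (unsigned)n);  /* blocks and polls until n bytes arrived */
        /* ...process n bytes of payload here... */
        remaining -= n;
    }
}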

Upcalls

The system module delivers every incoming message by invoking the upcall routine associated with the message's destination SAP. The upcall routine is usually a routine inside the higher-level module that created the SAP. In most cases, this routine passes the message to a higher-level layer (either the Panda client or


another Panda module). This is done by invoking another upcall or by storing the message (e.g., an RPC reply) in a known location. In other cases, the message (e.g., an acknowledgement) serves only control purposes internal to the receiving module and will not be propagated to higher-level layers.

When the system module receives the first packet of a message, it schedules an upcall. Each SAP maintains a queue of pending upcalls, which are executed in order and one at a time. Clients and higher-level modules do not have precise control over the scheduling of upcalls. They must assume that every upcall is executed asynchronously and protect their global data structures accordingly.

An interesting question is whether an upcall should be viewed as an independent thread or as a subroutine of the running 'computation' thread. The former view is more natural, because the next incoming message may have nothing to do with the current activity of the running thread. Unfortunately, this view has some problems associated with it. These problems and their resolution in Panda are discussed in Section 6.2.

6.2 Integrating Multithreading and Communication

This section studies the interactions between LFC's communication mechanisms and multithreading in Panda. Section 6.2.1 explains how Panda and other multithreaded clients can be implemented safely on LFC. Next, in Section 6.2.2 we show how the knowledge that is available in a thread scheduler can be exploited to reduce interrupt overhead. Finally, in Section 6.2.3 we discuss design options and implementation techniques for upcall models. The design of an upcall model is related to multithreading and efficient implementations of some upcall models require cooperation of the multithreading system. After discussing different design options, we describe Panda's upcall model.

6.2.1 Multithread-Safe Access to LFC

LFC is not multithread-safe: concurrent invocations of LFC routines can corrupt LFC's internal data structures. LFC does not use locks to protect its global data, because several LFC clients (e.g., CRL, TreadMarks, and MPI) do not use multithreading and because we are reluctant to make LFC dependent on specific thread packages. Nevertheless, Panda and other multithreaded systems can safely use LFC without modifications to LFC's interface or implementation.

With single-threaded clients, LFC has to be prepared for two types of concurrent invocation of LFC routines: recursive invocations and invocations executed by LFC's signal handler. Recursive invocations occur (only) when an LFC routine drains the network (see Section 3.6).


Fig. 6.4. Recursive invocation of an LFC routine: timeline showing client computation, the LFC send routine (lfc_ucast_launch), and the client's lfc_upcall(), with a packet in flight.

The routine that sends a packet, for example, will poll when no free send descriptors are available. During a poll, LFC may invoke lfc_upcall() to handle incoming network packets. This upcall is defined by the client and may invoke LFC routines, including the routine that polled. This scenario is illustrated in Figure 6.4. To prevent corruption of global state due to recursion, LFC always leaves all global data in a consistent state before polling.

When we allow network interrupts, LFC routines can be interrupted by an upcall at any point, which can easily result in data races. To avoid such data races, LFC logically¹ disables network interrupts whenever an LFC routine is entered that accesses global state.

The two measures described above are insufficient to deal with multithreaded clients, because they do not prevent two client threads from concurrently entering LFC. To solve this problem, Panda's system module employs a single lock to control access to LFC routines. Panda acquires this lock before invoking any LFC routine and releases the lock immediately after the routine returns.

The lock introduces a new problem: recursive invocations of an LFC routine will deadlock, because Panda does not allow a thread to acquire a lock multiple times (i.e., Panda does not implement recursive locks). To prevent this, Panda releases the lock upon entering lfc_upcall() and acquires it again just before returning from lfc_upcall(). This works because all recursive invocations have lfc_upcall() in their call chain. When lfc_upcall() releases the lock, another thread can enter LFC. This is safe, because the polling thread that invoked lfc_upcall() always leaves LFC's global variables in a consistent state.

With these measures, we can handle concurrent invocations from multiple Panda threads and recursive invocations by a single thread. There remains one problem: network interrupts.

¹ As explained in Section 3.7, LFC does not truly disable network interrupts.


Network interrupts are processed by LFC's signal handler (see Section 3.7). When invoked, this signal handler checks that interrupts are enabled (otherwise it returns) and then invokes lfc_poll() to check for pending packets. Lfc_poll(), in turn, may invoke lfc_upcall() and will do so before Panda has had a chance to acquire its lock. When this occurs, lfc_upcall() will release the lock even though the lock had never been acquired.

The problem is that the signal handler is another entry point into LFC that needs to be protected with a lock/unlock pair. To do that, Panda overrides LFC's signal handler with its own signal handler. Panda's signal handler acquires the lock, invokes LFC's signal handler, and releases the lock. This closes the last atomicity hole.
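The locking discipline can be summarized in a few wrapper routines. The sketch below is illustrative rather than Panda's actual code; lfc_send() stands for an arbitrary LFC entry point, and the names of the unlock call, the lock type, and LFC's signal handler are assumptions.

typedef struct pan_mutex *pan_mutex_p;      /* Panda lock type (declaration only) */
extern void pan_mutex_lock(pan_mutex_p);
extern void pan_mutex_unlock(pan_mutex_p);
extern void lfc_send(void);                 /* stands for any LFC entry point */
extern void lfc_signal_handler(int sig);    /* LFC's network signal handler (name assumed) */

static pan_mutex_p lfc_lock;                /* single lock, created at initialization */

/* Every LFC call made by Panda is bracketed by the lock. */
void panda_lfc_call(void)
{
    pan_mutex_lock(lfc_lock);
    lfc_send();                             /* may poll and invoke lfc_upcall() recursively */
    pan_mutex_unlock(lfc_lock);
}

/* LFC's packet upcall, defined by Panda: release the lock so that other
 * threads (and recursive invocations) can enter LFC while the handler runs. */
void lfc_upcall(void)                       /* packet arguments omitted */
{
    pan_mutex_unlock(lfc_lock);
    /* ...dispatch the packet to the appropriate SAP / stream message... */
    pan_mutex_lock(lfc_lock);               /* re-acquire before returning into LFC */
}

/* Panda's signal handler wraps LFC's, closing the remaining atomicity hole. */
void panda_signal_handler(int sig)
{
    pan_mutex_lock(lfc_lock);
    lfc_signal_handler(sig);                /* invokes lfc_poll(), hence possibly lfc_upcall() */
    pan_mutex_unlock(lfc_lock);
}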

Summarizing, Panda achieves multithread-safe access to LFC through three measures:

1. LFC leaves its global variables in a consistent state before polling.

2. The client brackets calls to LFC routines with a lock/unlock pair to prevent multiple client threads from executing concurrently within LFC.

3. The client brackets invocations of LFC's network signal handler with a lock/unlock pair.

Two properties of LFC make these measures work. First, LFC does not provide a blocking receive downcall, but dispatches all incoming packets to a user-supplied upcall function. Systems like MPI, in contrast, provide a blocking receive downcall. If we bracket calls to such a blocking receive with a lock/unlock pair, we block entrance to the communication layer until a message is received. This causes deadlock if a thread blocked in a receive downcall waits for a message that can be received only if another thread is able to enter the communication system (e.g., to send a message). This type of blocking is different from the transient blocking that occurs when LFC has run out of some resource (e.g., send packets). The latter type of blocking is always resolved if all participating processes drain the communication network and release host receive packets.

Second, LFC dispatches all packets to a single upcall function, which allows lock management to be centralized. If the communication system directly invoked user handlers, then it would be the responsibility of the user to get the locking right, or, alternatively, LFC would have to know about the locks used by the client system.

6.2.2 Transparently Switching between Interrupts and Polling

In Section 2.5 we discussed the main advantages and disadvantages of polling and interrupts. This section presents an approach, implemented in Panda and its


underlying thread package OpenThreads, that combines the advantages of polling and interrupts. In this approach polling is used when the application is known to be idle. Interrupts are used to deliver messages to a process that is busy executing application code. Using the thread scheduler's knowledge of the state of all threads, Panda dynamically (and transparently) switches between these two strategies. As a result, Panda never polls unnecessarily and we use interrupts only when the application is truly busy.

The techniques described in this section were initially developed on another user-level communication system (FM/MC [10]). The same techniques, however, are used on LFC. In addition, LFC supports the mixing of polling and interrupts by means of its polling watchdog.

Polling versus Interrupts

Choosing between polling and interrupts can be difficult. There are difficulties in two areas: ease of programming and performance cost. We consider two issues related to ease of programming: matching the polling rate to the message-arrival rate and concurrency control.

With polling, it is necessary to roughly match the polling rate to the message-arrival rate. Unfortunately, many parallel programming systems cannot predict when messages arrive and when they need to be processed. These systems must either use interrupts or insert polls automatically. Since interrupts are expensive, a polling-based approach is potentially attractive. Automatic polling, however, has its own costs and problems.

Polls can be inserted statically, by a compiler, or dynamically, by a runtime system. A compiler must be conservative and may therefore insert far too many polling statements (e.g., at the beginning of every loop iteration). A runtime system can poll the network each time it is invoked, but this approach works only if the application frequently invokes the runtime system. Alternatively, the runtime system can create a background thread that regularly polls the network. This, however, requires that the thread scheduler implement a form of time-slicing, and introduces the overhead of switching to and from the background thread. Polling is also troublesome from a software-engineering perspective, because all code, including large standard libraries, may have to be processed to include polls.

The second issue is concurrency control. Unlike interrupts, using polls gives precise control over message-handler execution. Consequently, a single-threaded application that polls the network only when it is not in a critical section need not use locks or interrupt-status manipulation to protect its shared data. However, to exploit the synchronous nature of polling, one must know exactly where a poll may occur. Calling a library routine from within a critical section can be dangerous, unless it is guaranteed that this routine does not poll.


Quantifying the difference in cost between using interrupts and polling is difficult due to the large number of parameters involved: hardware (cache sizes, network adapters), operating system (interrupt handling), runtime support (thread packages, communication interfaces), and application (polling policy, message-arrival rate, communication patterns). The discussion below considers the base costs of polling and interrupts, and the relationship with the message-arrival rate.

First, executing a single poll is typically much cheaper than taking an interrupt, because a poll executes entirely in user space without any context switching (see Section 2.5). Dispatching an interrupt to user space on a commodity operating system, on the other hand, is expensive. The main reason is that software interrupts are typically used to signal exceptions like segmentation faults, events for which operating systems do not optimize [143]. In LFC, a successful poll costs 1.0 µs; dispatching an interrupt and a signal handler costs 31 µs.

Second, comparing the cost of a single poll to the cost of a single interrupt does not provide a sufficient basis for statements about application performance. A single interrupt can process many messages, so the cost of an interrupt can be amortized over multiple messages. Also, each time a poll fails, the user program wastes a few cycles. Since matching the polling rate to the message-arrival rate can be hard, an application may either poll too often (and thus waste cycles) or poll too infrequently (and delay the delivery of incoming messages).

For Panda, the following observations apply. Since Panda is multithreaded, many Panda clients will already use locks to protect accesses to global data. Such clients will have no trouble when upcalls are scheduled preemptively and do not get any benefit from executing upcalls at known points in time. Moreover, several of Panda's clients cannot predict when messages will arrive, yet need to respond to those messages in a timely manner. These two (ease-of-use) observations suggest an interrupt-driven approach. However, since we expect that the exclusive use of interrupts will lead to bad performance for clients that communicate frequently, polling should also be taken into consideration. Below, we describe how Panda tries to get the best of both worlds.

Exploiting the Thread Scheduler

Both polling and interrupts have their advantages and it is often beneficial to combine the two of them. LFC provides the mechanisms to switch dynamically between polling and interrupts: primitives to disable and enable interrupts, and a polling primitive. Using these primitives, a sophisticated programmer can get the advantages of both polling and interrupts. The CRL runtime system, for example, takes this approach [73]. Unfortunately, this approach is error-prone: the programmer may easily forget to poll or, worse, to disable interrupts.

To solve this problem, we have devised a simple strategy for switching between the two message-extraction mechanisms.


[Figure: time lines for processors A and B, distinguishing application code, Panda/LFC communication code, the OpenThreads idle/polling loop, LFC/Panda upcalls, procedure calls, fast thread switches, and the request and reply messages in flight.]

Fig. 6.5. RPC example.


The key idea is to integrate thread management and communication. This integration is motivated by two observations. First, polling is always beneficial when the application is idle (i.e., when it waits for an incoming message). Second, when the application is active, polling may still be effective from a performance point of view, but inserting polls into computational code and finding the right polling rate can be tedious. Thus, our strategy is simple: we poll whenever the application is idle and use interrupts whenever runnable threads exist. To be able to poll when the application is idle, however, we must be able to detect this situation. This suggests that the decision to switch between polling and interrupts be taken in the thread scheduler.

We implemented this strategy in the following way. By default, Panda uses interrupts to receive messages. However, when OpenThreads's thread scheduler detects that all threads are blocked, it disables network interrupts and starts polling the network in a tight loop. A successful poll results in the invocation of lfc_upcall(). If the execution of the handler wakes up a local thread, then the scheduler will re-enable network interrupts and schedule the awakened thread. If an application thread is running when a message arrives, LFC will generate a network signal. Since LFC delays the generation of network interrupts, it is unlikely that interrupts are generated unnecessarily (see Section 3.7).
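
The following sketch outlines this idle-loop behavior. It is not the actual OpenThreads code; all names except lfc_poll() and lfc_upcall() are illustrative, and in particular the interrupt-control and scheduling helpers are assumptions.

/* Sketch of the scheduler's idle/polling loop (illustrative names). */
extern int  lfc_poll(void);                /* assumed signature           */
extern void net_interrupts_disable(void);  /* illustrative helpers        */
extern void net_interrupts_enable(void);
extern int  runnable_threads_exist(void);
extern void schedule_next_thread(void);

static void scheduler_idle(void)
{
    net_interrupts_disable();          /* poll instead of taking interrupts */
    while (!runnable_threads_exist()) {
        lfc_poll();                    /* a successful poll invokes lfc_upcall(),
                                          which may wake up a blocked thread  */
    }
    net_interrupts_enable();           /* a thread is runnable again */
    schedule_next_thread();
}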

OpenThreads does not detect all types of blocking: a thread is considered blocked only if it blocks by invoking one of Panda's thread synchronization primitives (pan_mutex_lock(), pan_cond_wait(), or pan_thread_join()). If a thread spin-waits on a memory location, then this is not detected by OpenThreads. Also, OpenThreads is not aware of blocking system calls. If a system call blocks, it will block the entire process and not just the thread that invoked the system call. Some thread packages solve this problem by providing wrappers for blocking system calls. The wrapper invokes the asynchronous version of the system call and then invokes the scheduler, which can then block the calling thread. When a poll or a signal indicates that the system call has completed, the scheduler can reschedule the thread that made the system call.

Figure 6.5 illustrates a typical RPC scenario. A thread on processor A issues an RPC to remote processor B, which also runs a computation thread. After sending its request, the sending thread blocks on a condition variable and Panda invokes the thread scheduler. The scheduler finds no other runnable threads, disables interrupts, and starts polling the network.

When the request arrives at processor B, it interrupts the running application thread. (Since processor B has an active thread, interrupts are enabled.) The request is then dispatched to the destination SAP's message handler. This handler processes the request, sends a reply, and returns.

On processor A, the polling thread receives the reply, enters the message-passing library, and then signals the condition variable that the application thread blocked on.


When the handler thread dies, the scheduler re-enables interrupts, because the application thread has become runnable again. Next, the scheduler switches to the application thread and starts to run it.

Summarizing, the behavior of the integrated threads and communication system is that synchronous communication, like the receipt of an RPC reply, is usually handled through polling. Asynchronous communication, on the other hand, is mostly handled through interrupts. This mixing of polling and interrupts combines nicely with LFC's polling watchdog.

6.2.3 An Efficient Upcall Implementation

In this section we discuss the design options for upcall implementations. Next, we discuss three upcall models that take different positions in the design space. The properties of these models have influenced the design of Panda's upcall model, which is discussed toward the end of this section.

The two main design axes for upcall systems are upcall context and upcall concurrency. Upcalls can be invoked from different processing contexts. A simple approach is to invoke upcall routines from the thread that happens to be running when a message arrives. We call this the procedure call approach. The alternative approach, the thread approach, is to give each upcall its own execution context. This amounts to allocating a thread per incoming message. The exact context from which a system makes upcalls affects both the programming model and the efficiency of upcalls.

Upcall concurrency determines how many upcalls can be active concurrently in a single receiver process. The most restrictive policy allows at most one upcall to execute at any time; the most liberal strategy allows any number of upcalls to execute concurrently. Again, different choices lead to differences in the programming model and performance.

Table 6.2 summarizes the design axes and the different positions on these axes. The table also classifies three upcall models along these axes: active messages, single-threaded upcalls, and popup threads. The active-messages model [148] is restrictive in that it prohibits all blocking in message handlers. In particular, active-message handlers may not block on locks, which complicates programming significantly. On the other hand, highly efficient implementations of this model exist. Some of these implementations allow active-message handler invocations to nest, others do not. Popup threads [116] allow message handlers to block at any time. Popup threads are often slower than active messages, because conceptually a thread is created for each message. Finally, we consider single-threaded upcalls [14], which disallow blocking in some, but not all cases. Below, we compare the expressiveness of these models and the cost of implementing them.


                                     Context
Concurrency              Procedure call        Thread
One upcall at a time     Active messages       Single-threaded upcalls
Multiple upcalls         Active messages       Popup threads

Table 6.2. Classification of upcall models.

int x, y;

void handle_store(int *x_addr, int y_val, int z0, int z1)
{
    *x_addr = y_val;
}

void handle_read(int src, int *x_addr, int z0, int z1)
{
    AM_send_4(src, handle_store, x_addr, y, 0, 0);
}

...
/* send read request to processor 5 */
AM_send_4(5, handle_read, my_cpu, &x, 0, 0);
...

Fig. 6.6. Simple remote read with active messages.

Active Messages

As explained in Section 2.8, an active message consists of the address of a handler function and a small number of data words (typically four words). When an active message is received, the handler address is extracted from the message and the handler is invoked; the data words are passed as arguments to the handler. For large messages, the active-messages model provides separate bulk-transfer primitives.

A typical use of active messages is shown in Figure 6.6. In this example, processor my_cpu sends an active message to read the value of variable y on processor 5. To send the message, we use the hypothetical routine AM_send_4(), which accepts a destination, a handler function, and four word-sized arguments. At the receiving processor, the handler function is applied to the arguments. In this case, the message contains the address of the handler handle_read(), the identity of the sender (my_cpu), and the address at which the result of the read should be stored (&x).


Function handle_read() will be invoked on processor 5 with src set to my_cpu and x_addr set to &x. The handler replies with another active message that contains the value of y. When this reply arrives at processor my_cpu, handle_store() is invoked to store the value in variable x. We assume that reading an integer (variable y) is an atomic operation.

Active-message implementations deliver performance close to that of the raw hardware. An important reason for this high performance is that active-message handlers do not have their own execution context. When an application thread polls or is interrupted by a network signal, the communication system invokes the handler(s) of incoming messages in the context of the application thread. That is, the handler's stack frames are simply stacked on top of the application frames (see Figure 6.8(a)). No separate thread is created, which eliminates the cost of building a thread descriptor, allocating a stack, and a thread switch.

The lack of a dedicated thread stack makes active messages efficient, but makes them unattractive as a general-purpose communication mechanism for application programmers. Active-message handlers are run on the stacks of application threads. If an active-message handler blocks, then the application thread cannot be resumed, because part of its stack is occupied by the active-message handler (see Figure 6.8(a)). This type of blocking occurs when the application thread holds a resource (e.g., a lock) that is also needed by the active-message handler. Clearly, a deadlock is created if the active-message handler waits until the application thread releases the resource.

Similar problems arise when a handler waits for the arrival of another message. Consider for example the transmission of a large reply message by an upcall handler. If the message is sent by means of a flow-controlled protocol, then the send routine may block when its send window closes. The send window can be reopened only after an acknowledgement has been processed, which requires another upcall. If the active-messages system allows at most one upcall at a time, then we have a deadlock. If the system allows multiple upcalls, however, then the acknowledgement handler can be run on top of the blocked handler (i.e., as a nested upcall) and no deadlock occurs.

These problems are not insolvable. If it is necessary to suspend a handler, the programmer can explicitly save state in an auxiliary data structure, a continuation, and let the handler return. The state saved in the continuation can later be used by a local thread or another handler to resume the suspended computation. We assume that continuations are created manually by the programmer. Sometimes, though, it is possible to create continuations automatically. Automatic approaches either rely on a compiler to identify the state to be saved or save state conservatively. If a compiler recognizes potential suspension points in a program, then it can identify live variables and generate code to save live variables when the computation is suspended.


Blocking a thread is an example of conservative state saving: the saved state consists of the entire thread stack.

To select between these state-saving alternatives, the following tradeoffs must be considered. With manual state saving, no compiler is needed and application-specific knowledge can be exploited to minimize the amount of state saved. However, manual state saving can be tedious and error prone. Compilers can remove this burden, but most communication systems are constructed as libraries rather than languages. Finally, blocking an entire thread is simple, but requires that every upcall be run in its own thread context, otherwise we will also suspend the computation on which the upcall has been stacked. This alternative is discussed below, in the section on popup threads.
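
As an illustration of manual state saving, the sketch below shows a variant of Figure 6.6's handle_read() that, instead of blocking when it cannot answer immediately, packages the state it needs into a continuation and returns; the continuation is resumed later by another handler or thread. All names are illustrative except AM_send_4() and handle_store(), which follow Figure 6.6.

#include <stdlib.h>

/* Continuation for a read request that cannot be answered yet. */
struct read_cont {
    int  src;        /* requesting processor                 */
    int *ret_addr;   /* where the requester wants the result */
};

extern int  data_available(void);                     /* illustrative */
extern void enqueue_continuation(struct read_cont *); /* illustrative */
extern int  shared_value;                             /* illustrative */

void handle_read(int src, int *ret_addr, int z0, int z1)
{
    if (!data_available()) {
        struct read_cont *c = malloc(sizeof *c);
        c->src = src;
        c->ret_addr = ret_addr;
        enqueue_continuation(c);   /* resumed when the data is produced */
        return;                    /* the handler itself never blocks   */
    }
    AM_send_4(src, handle_store, ret_addr, shared_value, 0, 0);
}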

In small and self-contained systems, continuations can be used effectively to solve the problem of blocking upcalls (see Section 7.1). On the other hand, continuations are too low-level and error prone to be used by application programmers. In fact, the original active-message proposal [148] explicitly states that the active-message primitives are not designed to be high-level primitives. Active messages are used in implementations of several parallel programming systems. A good example is Split-C [39, 148], an extension of C that allows the programmer to use global pointers to remote words or arrays of data. If a global pointer to remote data is dereferenced, an active message is sent to retrieve the data.

Single-Threaded Upcalls

In the single-threaded upcall model [14], all messages sent to a particular process are processed by a single, dedicated thread in that process. We refer to this thread as the upcall thread. Also, in its basic form, the single-threaded upcall model allows at most one upcall to execute at a time.

Figure 6.7 shows how single-threaded upcalls can be used to access a distributed hash table. Such a table is often used in distributed game-tree searching to cache evaluation values of board positions that have already been analyzed. To look up an evaluation value in a remote part of the hash table, a process sends a handle_lookup message to the processor that holds the table entry. Since the table may be updated concurrently by local worker threads, and because an update involves modifying multiple words (a key and a value), each table entry is protected by a lock. In contrast with the active-messages model, the handler can safely block on this lock when it is held by some local thread. While the handler is blocked, though, no other messages can be processed. The single-threaded upcall model assumes that locks are used only to protect small critical sections so that pending messages will not be delayed excessively.

Allowing at most one upcall at a time restricts the set of actions that a programmer can safely perform in the context of an upcall.


void handle_lookup(int src, int key, int *ret_addr, int z0)
{
    int index;
    int val;

    index = hash(key);

    lock(table[index].lock);
    if (table[index].key == key) {
        val = table[index].value;
    } else {
        val = -1;
    }
    unlock(table[index].lock);

    AM_send_4(src, handle_reply, ret_addr, val, 0, 0);
}

Fig. 6.7. Message handler with locking.

Specifically, this policy does not allow an upcall to wait for the arrival of a second message, because this arrival can be detected only if the second message's upcall executes. For example, if a message handler issues a remote procedure call to another processor, deadlock would ensue because the handler for the RPC's reply message cannot be activated until the handler that sent the request returns. In practice, the single-threaded upcall model requires that message handlers create continuations or additional threads in cases where condition synchronization or synchronous communication is needed.

The difference between single-threaded upcalls and active messages is illustrated in Figure 6.8. With active messages, upcall handlers are stacked on the application's execution stack. Some implementations allow upcall handlers to nest, others do not. Single-threaded upcall handlers, in contrast, do not run on the stack of an arbitrary thread; they are always run, one at a time, on the stack of the upcall thread. Executing message handlers in the context of this thread allows the handlers to block without blocking other threads on the same processor. In particular, the single-threaded upcall model allows message handlers to use locks to synchronize their shared-data accesses with the accesses of other threads. This is an important difference with active messages, where all blocking is disallowed. If an active-message handler blocked, it would occupy part of the stack of another thread (see Figure 6.8(a)), which then cannot be resumed safely.


[Figure: stack layouts for the three upcall models. (a) Active messages: active-message handlers 1 and 2 are stacked on top of the application code on the application thread's stack. (b) Single-threaded upcalls: an application thread and a separate upcall thread. (c) Popup threads: an application thread and one popup thread per handler.]

Fig. 6.8. Three different upcall models.


void handle_job_request(int src, int *jobaddr, int z0, int z1)
{
    int job_id;

    lock(queue_lock);
    while (is_empty(job_queue)) {
        wait(job_queue_nonempty, queue_lock);
    }
    job_id = fetch_job(job_queue);
    unlock(queue_lock);

    AM_send_4(src, handle_store, jobaddr, job_id, 0, 0);
}

Fig. 6.9. Message handler with blocking.

The single-threaded upcall model has been implemented in several versions of Panda. The current version of Panda, Panda 4.0, uses a variant of the single-threaded upcall model that we will describe later.

Popup Threads

While the single-threaded upcall model is more expressive than the active-messages model, it is still a restrictive model because all messages are handled by a single thread. The popup-threads model [116], in contrast, allows multiple message handlers to be active concurrently. Each message is allocated its own thread context (see Figure 6.8(c)). As a result, each message handler can synchronize safely on locks and condition variables and issue (possibly synchronous) communication requests, just like any other thread.

Figure 6.9 illustrates the advantages of popup threads. In this example, the message handler handle_job_request() attempts to retrieve a job identifier (job_id) from a job queue. When no job is available, the handler blocks on condition variable job_queue_nonempty and waits until a new job is added to the queue. While this is a natural way to express condition synchronization, it is prohibited in both the active-messages and the single-threaded upcall model. In the active-messages model, the handler is not even allowed to block on the queue lock. In the single-threaded upcall model, the handler can lock the queue, but is not allowed to wait until a job is added, because the new job may arrive in a message not yet processed. To add this new job, it is necessary to run a new message handler while the current handler is blocked. With single-threaded upcalls, this is impossible.


Dispatching a popup thread need not be any more expensive than dispatching a single-threaded upcall [88]. Popup threads, however, have some hidden costs that do not immediately show up in microbenchmarks:

1. When many message handlers block, the number of threads in the system can become large, which wastes memory.

2. The number of runnable threads on a node may increase, which can lead to scheduling anomalies. In earlier experiments we observed a severe performance degradation for a search algorithm in which popup threads were used to service requests for work [88]. The scheduler repeatedly chose to run high-priority popup threads instead of the low-priority application thread that generated new work. As a result, many useless thread switches were executed. Druschel and Banga found similar scheduling problems in UNIX systems that process incoming network packets at top priority [47].

3. Because popup threads allow multiple message handlers to execute concurrently, the ordering properties of the underlying communication system may be lost. For example, if two messages are sent across a FIFO communication channel from one node to another, the receiving process will create two threads to process these messages. Since the thread scheduler can schedule these threads in any order, the FIFO property is lost. In general, only the programmer knows when the next message can safely be dispatched. Hence, if FIFO ordering is needed, it has to be implemented explicitly, for example by tagging messages with sequence numbers (a sketch follows this list).
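
A simple way to restore FIFO ordering is sketched below: the sender tags each message with a per-sender sequence number, and each popup thread waits until it holds the next expected number before running the application handler. All names are illustrative; this is not Panda or LFC code.

#define MAX_SENDERS 64

static int      expected_seqno[MAX_SENDERS];  /* next number to dispatch   */
static mutex_t  order_lock;                   /* illustrative thread types */
static cond_t   order_cond;

extern void process(void *msg);               /* application handler body  */

/* Body of a popup thread for a message tagged with (sender, seqno). */
void ordered_popup_handler(int sender, int seqno, void *msg)
{
    lock(&order_lock);
    while (seqno != expected_seqno[sender]) {
        wait(&order_cond, &order_lock);       /* an earlier message is pending */
    }
    unlock(&order_lock);

    process(msg);                             /* handlers now run in FIFO order */

    lock(&order_lock);
    expected_seqno[sender]++;
    broadcast(&order_cond);                   /* let the next handler proceed   */
    unlock(&order_lock);
}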

Several systems provide popup threads. Among these systems are Nexus [54], Horus [144], and the x-kernel [116]. These systems have all been used for a variety of parallel and distributed applications.

Panda’s Upcall Model

The first Panda system [19] implemented popup threads. To reduce the overhead of thread switching, however, later versions have used the single-threaded upcall model. Two kinds of thread switching overhead were eliminated:

1. In our early Panda implementations on Unix and Amoeba [139] all incoming messages were dispatched to a single thread, the network daemon, which passed each message to a popup thread. (These systems maintained a pool of popup threads to avoid thread creation costs on the critical path.) The use of single-threaded upcalls allowed the network daemon to process all messages without switching to a popup thread.


2. Multiple upcall threads could be blocked, waiting for an event that was to be generated by a local (computation) thread or a message. When the event occurred, all threads were awakened, even if the event allowed only one thread to continue; the other threads would put themselves back to sleep. It is frequently possible to avoid putting upcall threads to sleep. Instead, upcalls can create continuations which can be resumed by other threads without any thread switching. Once upcalls do not put themselves to sleep any more, they can be executed by a single-threaded upcall instead of popup threads.

In retrospect, the first type of overhead could have been eliminated without changing the programming model, by means of lazy thread-creation techniques such as described below. The second problem occurred in the implementation of Orca and is discussed in more detail in Section 7.1.4.

Panda 4.0 implements an upcall model that lies between active messages and single-threaded upcalls. The model is as follows. First, as in single-threaded upcalls, Panda clients can use locks in upcalls, but should not use condition-synchronization or synchronous communication primitives in upcalls.

Second, Panda clients must release their locks before invoking any Panda send or receive routine. This restriction is the main difference between Panda's upcall model and single-threaded upcalls. It sometimes allows the implementation to run upcalls on the current stack rather than on a separate stack.

Finally, Panda allows up to one upcall per endpoint to be active at any time. Panda clients must therefore be prepared to deal with concurrency between upcalls for different endpoints. In practice, this poses no problems, because programmers already have to deal with concurrency between upcalls and a computation thread or between multiple computation threads. Since Panda runs only upcalls for different endpoints concurrently, the order of messages sent to any single endpoint is preserved, so we avoid a disadvantage of popup threads.
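
The sketch below illustrates an upcall handler that follows these rules: it protects shared data with a lock but releases the lock before it calls a send routine, and it performs no condition synchronization. Only pan_mutex_lock() is named in the text; its signature, pan_mutex_unlock(), and the remaining names are assumptions.

extern void pan_mutex_lock(void *lock);    /* named in the text; signature assumed */
extern void pan_mutex_unlock(void *lock);  /* assumed counterpart                  */
extern void *counter_lock;                 /* illustrative shared state            */
extern int   counter;
extern void  send_count_reply(void *msg, int value);  /* illustrative Panda-level send */

void counter_upcall(void *msg)
{
    int snapshot;

    pan_mutex_lock(counter_lock);
    counter++;                        /* shared data updated under the lock */
    snapshot = counter;
    pan_mutex_unlock(counter_lock);   /* release before any send routine    */

    send_count_reply(msg, snapshot);  /* no locks held, no condition waits  */
}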

Implementation of Panda’s Upcall Model

On LFC, we use an optimized implementation of Panda's upcall model. In many cases, this implementation completely avoids thread switching during message processing. The implementation distinguishes between synchronous and asynchronous upcalls. Synchronous upcalls occur implicitly, as the result of polling by an LFC routine, or explicitly as the result of an invocation of lfc_poll() by Panda. Panda polls explicitly when all threads are idle or when it tries to consume an as yet unreceived part of a stream message (see Section 6.3). A successful poll results in the (synchronous) invocation of lfc_upcall() in the context of the current thread (or the last-active thread if all threads are idle).


This synchronous upcall queues the packet just received at the packet's destination SAP. If this is the first packet of a message, and no other messages are queued before this message, then Panda invokes the SAP's upcall routine (by means of a plain procedure call) without switching to another stack. This is safe, because Panda's upcall model requires that the polling thread does not hold any locks.

Asynchronous upcalls occur when the NI generates a network interrupt. Network interrupts are propagated to the receiving user process by means of a UNIX signal. The signal handler interrupts the running thread and executes on that thread's stack. Eventually, the signal handler invokes lfc_poll() to process the next network packet. In this case, there is no guarantee that the interrupted thread does not hold any locks, so we cannot simply invoke a SAP's upcall routine. A simple solution is to store the packet just received in a queue and signal a separate upcall thread. Unfortunately, this involves a full thread switch.

To avoid this full thread switch, OpenThreads invokes lfc_poll() by means of a special, slightly more expensive procedure-call mechanism. Instead of running lfc_poll() on the stack of the interrupted thread, OpenThreads switches to a special upcall stack and runs the poll routine on that stack. At first sight, this is just a thread switch, but there are two differences. First, since the upcall stack is used only to run upcalls, OpenThreads does not need to restore any registers when it switches to this stack. Put differently, we always start executing at the bottom of the upcall stack. Second, OpenThreads sets up the bottom stack frame of the upcall stack in such a way that the poll routine will automatically return to the signal handler's stack frame on the top of the stack of the interrupted thread (see Figure 6.10(a)).

If the SAP handler does not block, then we will leave behind an empty upcall stack, return to the stack of the interrupted thread, return from the signal handler, and resume the interrupted thread. This is the common case that OpenThreads optimizes. If, on the other hand, the SAP handler blocks on a lock, then OpenThreads will disconnect the upcall stack from the stack of the interrupted thread. This is illustrated in Figure 6.10(b). OpenThreads modifies the return address in the bottom stack frame so that it points to a special exit function. This prevents the upcall from returning to the stack of the interrupted thread and it allows the interrupted thread to continue execution independently.

Summarizing, OpenThreads's special calling mechanism optimistically exploits the observation that most handlers run to completion without blocking. In this case, upcalls execute without true thread switches. If the handler does block, then we promote it to an independent thread which OpenThreads schedules just like any other thread.


[Figure: stack layouts for the fast thread switch. (a) Interrupted thread and upcall thread before the upcall thread has blocked; a return link from the bottom of the upcall stack to the signal frame makes the upcall return to the interrupted thread's stack. (b) When the upcall thread blocks, OpenThreads modifies the return address so that it points to an exit frame.]

Fig. 6.10. Fast thread switch to an upcall thread.

6.3 Stream Messages

Panda's stream messages are implemented in the system module. The implementation minimizes copying and allows pipelining of data transfers from application to application process. The key idea is illustrated in Figure 6.11. The data-transfer pipeline consists of four stages, which operate in parallel. In the first stage, Panda copies client data into LFC send packets. In the second stage, LFC transmits send packets to the destination NI. In the third stage, the receiving NI copies receive packets to host memory. Finally, in the fourth stage, the receiving Panda client consumes data from receive packets in host memory. Messages larger than a single packet will benefit from the parallelism in this pipeline. (A stream of short messages also benefits from this pipelining, but the key feature of stream messages is that the same parallelism can be exploited within a single message.)

Stream messages are implemented as follows. At the sending side, the system-module functions pan_unicast() and pan_multicast() are responsible for message fragmentation. They repeatedly allocate an LFC send packet, store a message header into this packet, and fill the remainder of the packet with data from the user's I/O vector and the header stack.
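
The fragmentation loop can be pictured as follows. The sketch is illustrative (the LFC send routines and the helper functions are assumptions, not Panda's actual code) and ignores the header stack and flow control.

#include <stddef.h>
#include <sys/uio.h>

extern char  *lfc_send_packet_alloc(void);               /* assumed LFC interface */
extern void   lfc_send(int dest, char *pkt, size_t len);
extern size_t write_panda_header(char *pkt, int sap, int first, size_t total);
extern size_t iov_total(const struct iovec *iov, int iovcnt);   /* illustrative */
extern size_t iov_copy(const struct iovec *iov, int iovcnt,
                       size_t offset, char *dst, size_t space);

#define LFC_PACKET_SIZE 1024   /* illustrative packet size */

void send_stream(int dest, int sap, const struct iovec *iov, int iovcnt)
{
    size_t total = iov_total(iov, iovcnt);
    size_t done  = 0;

    while (done < total) {
        char  *pkt = lfc_send_packet_alloc();
        size_t hdr = write_panda_header(pkt, sap, done == 0, total);
        size_t n   = iov_copy(iov, iovcnt, done, pkt + hdr,
                              LFC_PACKET_SIZE - hdr);

        lfc_send(dest, pkt, hdr + n);   /* hand the packet to the next stage */
        done += n;
    }
}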

Unicast header formats are shown in Figure 6.12. The first packet of a message has a different header than the packets that follow. Both headers contain a count of piggybacked credits and a field that identifies the destination SAP.


[Figure: the four-stage data-transfer pipeline from Panda client to Panda client, through the I/O vector, send packets on the sending NI, the network, receive packets on the receiving NI, and receive packets in host memory.]

Fig. 6.11. Application-to-application streaming with Panda's stream messages. (1) Panda copies user data into a send packet. (2) LFC transmits a send packet. (3) LFC copies the packet to host memory. (4) The Panda client consumes data.

[Figure: the header for a message's first packet contains piggybacked credits, the SAP identifier, the header stack size, and the message size; the header for subsequent packets contains only piggybacked credits and the SAP identifier.]

Fig. 6.12. Panda's message-header formats.

The credits field is used by Panda's sliding-window flow-control scheme. The SAP identifier is used for reassembly. Panda does not interleave outgoing packets that belong to different messages. Consequently, senders need not add a message id to outgoing packets. The source identifier in the LFC header (see Figure 3.1) and the SAP identifier in the Panda header suffice to locate the message to which a packet belongs.

The header of a message's first packet contains two extra fields: the size of the header stack contained in the packet and the total message size. The header-stack size is used to find the start of the data that follows the headers. The total message size is used to determine when all of a message's packets have arrived.

Figure 6.13 illustrates the receiver-side data structures. The receiver maintains an array of SAPs. Each SAP contains a pointer to the SAP's handler and a queue of pending stream messages. A stream message is considered pending when its first packet has arrived and the receiving process has not yet entirely consumed the stream message. A stream message is represented by a data structure that is created when the stream message's first packet arrives.


[Figure: an array of SAPs (SAP 0 to SAP 3), each with a queue of pending stream messages; each stream message holds an array of pointers to its packets.]

Fig. 6.13. Receiver-side organization of Panda's stream messages.

This data structure contains an array which holds pointers to the stream message's constituent packets. Packets are entered into this array as they arrive. In Figure 6.13, one stream message is pending for SAP 0 and two stream messages are pending for SAP 3. The first stream message for SAP 3 consists of three packets. All packets but the last one are full.

When a packet arrives, lfc_upcall() searches the destination SAP's queue of pending stream messages. If it does not find the stream message to which this packet belongs, it creates a new stream message and appends it to the queue. (Since LFC delivers all unicast packets in order, there is no need to store an offset or sequence number in the message headers.) If an incoming packet is not the first packet of its stream message, then lfc_upcall() will find the stream message in the SAP's queue of pending messages. The packet is then appended to the stream message's packet array.
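
The receiver-side organization can be summarized by the data structures sketched below; the field names and fixed-size limits are illustrative, not Panda's actual declarations.

#include <stddef.h>

#define MAX_SAPS     16    /* illustrative limits */
#define MAX_PACKETS  64

struct stream_msg {
    size_t             total_size;            /* from the first packet's header */
    size_t             bytes_received;
    void              *packet[MAX_PACKETS];   /* constituent packets, in order  */
    int                npackets;
    struct stream_msg *next;                  /* next pending message           */
};

struct sap {
    void (*handler)(struct stream_msg *m);    /* the SAP's upcall routine       */
    struct stream_msg *pending_head;          /* queue of pending messages      */
    struct stream_msg *pending_tail;
};

static struct sap sap_table[MAX_SAPS];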

When a stream message reaches the head of its SAP's queue, Panda dequeues the stream message and invokes the SAP handler, passing the stream message as an argument. The stream messages queued at a specific SAP are processed one at a time. (That is, a SAP's handler is not invoked again until the previous invocation has returned.)

The function pan_msg_consume() copies data from the packets in a message's packet list to user buffers. Each time pan_msg_consume() has consumed a complete packet, it returns the packet to LFC. If a Panda client does not want to consume all of a message's data, it can either skip some bytes or discard the entire message without copying any data.


6.4 Totally-Ordered Broadcasting

This section describes an efficient implementation of Panda's totally-ordered broadcast primitive. Totally-ordered broadcast is a powerful communication primitive that can be used to manage shared, replicated data. Panda's totally-ordered broadcast, for example, is used to implement method invocations for replicated shared objects in the Orca system (see Section 7.1).

A broadcast is totally ordered if all broadcast messages are received in a single order by all receivers and if this order agrees with the order in which senders sent the messages. Most totally-ordered broadcast protocols use a centralized component to implement the ordering constraint, and the protocol presented in this section is no exception. The protocol uses a central node, a sequencer, to order messages. The protocol we describe uses LFC's fetch-and-add primitive to obtain sequence numbers to order messages.

The protocol is simple. Before sending a broadcast message, the sender just invokes lfc_fetch_and_add() to obtain a system-wide unique sequence number. This sequence number is attached to the message and then the message's packets are broadcast using LFC's broadcast primitive. All senders perform their fetch-and-add operations on a single fetch-and-add variable that acts as a shared sequence number. This variable is stored in a single NI's memory.

Receivers assemble the message in the same way they assemble unicast messages (see Section 6.3). The only difference is that a message is not delivered until all preceding messages have been delivered. This enforces the total ordering constraint.
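
A sketch of the sender and receiver sides of this scheme is given below. Only lfc_fetch_and_add() and the broadcast primitive correspond to LFC functionality described in the text; their exact signatures and the remaining helpers are assumptions.

#include <stddef.h>

#define SEQUENCER_NODE 0   /* illustrative: NI that holds the fetch-and-add variable */
#define SEQNO_VAR      0

struct bcast_msg;                                        /* assembled as in Section 6.3 */

extern int  lfc_fetch_and_add(int node, int var, int amount);          /* signature assumed */
extern void lfc_bcast_message(const void *data, size_t len, int seqno);/* illustrative      */
extern void insert_pending(struct bcast_msg *m);         /* illustrative helpers            */
extern int  pending_head_seqno(void);                    /* -1 when nothing is pending      */
extern struct bcast_msg *remove_pending_head(void);
extern void deliver(struct bcast_msg *m);

/* Sender: obtain a system-wide sequence number, then broadcast. */
void to_broadcast(const void *data, size_t len)
{
    int seqno = lfc_fetch_and_add(SEQUENCER_NODE, SEQNO_VAR, 1);
    lfc_bcast_message(data, len, seqno);
}

/* Receiver: deliver broadcasts strictly in sequence-number order. */
static int next_to_deliver;

void broadcast_arrived(struct bcast_msg *m)
{
    insert_pending(m);                            /* buffer out-of-order arrivals */
    while (pending_head_seqno() == next_to_deliver) {
        deliver(remove_pending_head());           /* invoke the client's handler  */
        next_to_deliver++;
    }
}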

This Get-Sequence-number-then-Broadcast (GSB) protocol was first developed for a transputer-based parallel machine [65]. The main difference with the original implementation is that with LFC, requests for a sequence number are handled entirely by the NI. This reduces the latency of such requests in two ways. First, the NI need not copy the request to host memory and the host need not copy a reply message to NI memory. This gain is measurable (an LFC fetch-and-add costs less than an LFC host-to-host roundtrip) but small (19.8 µs versus 23.3 µs). The main gain is the reduction in the number of interrupts. Namely, the sequencer does not expect a sequence-number request. Therefore, such requests are quite likely to be delivered by means of an interrupt rather than a poll. LFC's fetch-and-add primitive avoids this type of interrupt.

6.5 Performance

In this section we discuss Panda's performance. We measure the latency and throughput for the message-passing and broadcast modules.


[Figure: one-way latency (in microseconds, 0-50) versus message size (0-1000 bytes) for Panda and LFC.]

Fig. 6.14. Panda's message-passing latency.

We do not present separate results for the RPC module, because RPCs are trivially composed of pairs of message-passing module messages. We compare the Panda measurements with the LFC measurements of Chapter 5, to assess the cost of Panda's higher-level functionality.

In Chapter 5 we measured latency and throughput using three receive methods: no-touch, read-only, and copy. Here we use only the copy method; this is the most common scenario for Panda clients and includes all overhead that Panda's stream-message abstraction adds to LFC's packet interface.

6.5.1 Performance of the Message-Passing Module

Figure 6.14 shows the one-way latency for Panda's message-passing module. For comparison, we also show the one-way latency (with copying at the receiver side) of LFC's unicast primitive. As shown, Panda adds a constant amount (approximately 5 µs) of overhead to LFC's latency. There are several reasons for this increase:

1. Locking. To ensure multithread-safe execution, Panda brackets all calls to LFC routines with lock/unlock pairs.

2. Message abstraction. At the sending side, Panda has to process an I/O vector before sending an LFC packet. At the receiving side, incoming packets have to be appended to a stream message before the receiver upcall can process the data. Panda also maintains several pointers and counters to keep track of the current packet and the current position in that packet.


[Figure: one-way throughput (in Mbyte/second, 0-60) versus message size (64 bytes to 1 Mbyte) for LFC and Panda.]

Fig. 6.15. Panda's message-passing throughput.

3. Demultiplexing. Panda adds a header to each LFC packet. The header identifies the destination port and the stream message to which the packet belongs.

4. Flow control. Panda has to check and update its flow-control state for each outgoing and incoming packet.

For 1024-byte messages, the difference between Panda's latency and LFC's latency is larger than for smaller message sizes. For this message size, Panda needs two packets, whereas LFC needs only one; this is due to Panda's header.

Figure 6.15 shows the one-way throughput for Panda's message-passing module and for LFC. Due to the overheads described above, Panda's throughput for small messages is not as good as LFC's throughput. In particular, sending and receiving the first packet of a stream message involves more work than sending and receiving subsequent packets. At the sending side, for example, we need to store the header stack into the first packet; at the receiving side, we have to create a stream message when the first packet arrives. For larger messages, these overheads can be amortized over multiple packets. For this reason, Panda does not reach its peak throughput until a message consists of multiple LFC packets.

For larger messages, Panda sometimes attains higher throughputs than LFC. This is due to cache effects that occur during the copying of data into send packets and out of receive packets.


[Figure: broadcast latency (in microseconds, 0-200) versus message size (0-1000 bytes) on 16 and 64 processors, for Panda's ordered broadcast, Panda's unordered broadcast, and LFC's unordered broadcast.]

Fig. 6.16. Panda's broadcast latency.

6.5.2 Performance of the Broadcast Module

We measured broadcast performance using three different broadcast primitives:

1. LFC's broadcast using the copy receive strategy (see Section 5.3).

2. Panda's unordered broadcast. Panda provides an option to disable total ordering. When this option is enabled, Panda does not tag broadcast messages with a system-wide unique sequence number. That is, Panda skips the fetch-and-add operation that it normally performs to obtain such a sequence number.

3. Panda's totally-ordered broadcast. This is the broadcast primitive described earlier in this chapter.

Figure 6.16 shows the latency for all three primitives, for message sizes up to 1 Kbyte and for 16 and 64 processors. The difference in latency for a null message between Panda's unordered broadcast and LFC's broadcast is 9 µs. As expected, adding total ordering increases the latency by a constant amount: the cost of a fetch-and-add operation. On 64 processors, the null latency difference between Panda's unordered and ordered broadcasts is 23 µs.

Figure 6.17 shows broadcast throughput for 16 and 64 processors. Adding total ordering reduces the throughput for small and medium-size messages. For larger messages, however, Panda reaches LFC's throughput.

Figure 6.16 and Figure 6.17 show only single-sender broadcast performance.


[Figure: broadcast throughput (in Mbyte/second, 0-30) versus message size (64 bytes to 1 Mbyte) on 16 and 64 processors, for LFC's unordered broadcast, Panda's unordered broadcast, and Panda's ordered broadcast.]

Fig. 6.17. Panda's broadcast throughput.

With multiple senders, the performance of totally-ordered broadcasts may suffer from the centralized sequencer. In practice, however, this is rarely a problem. In many applications, there is at most one broadcasting process at any time. When processes do broadcast simultaneously, the overall broadcast rate is often low. In an Orca performance study [9], we measured the overhead of obtaining a sequence number from the sequencer. For 7 out of the 8 broadcasting applications considered, the time spent on fetching sequence numbers was less than 0.2% of the application's execution time. Only one application suffered from the centralized sequencer: for this application, a linear equation solver, the time to access the sequencer accounted for 4% of the application's execution time. All applications were run on FM/MC, which uses an NI-level fetch-and-add to obtain sequence numbers, just like LFC.

6.5.3 Other Performance Issues

Some optimizations discussed in this chapter reduce costs that are architecture and operating system dependent. The cost of a thread switch, for example, depends on the amount of CPU state that has to be saved and on the way in which this state is saved. The early Panda implementations ran on SPARC processor architectures on which each (user-level) thread switch requires a trap to the operating system. This trap is needed to flush the register windows, a SPARC-specific cache of the top of the runtime stack, to memory [72, 79]. On other architectures, thread switches are cheaper. On the Pentium Pros used for the experiments in this thesis, a switch from one Panda thread to another costs 1.3 µs. This is still a considerable overhead to add to LFC's one-way null latency, but it is not excessive.


Interrupts and signal processing are other sources of architecture- and operating-system-dependent overhead. Whereas thread switches are cheap on our experimental platform, interrupts and signals are expensive and should be avoided when possible (see Section 2.5).

6.6 Related Work

We discuss related work in five areas: portable message-passing libraries, polling and interrupts, upcall models and their implementation, stream messages, and totally-ordered broadcasting.

6.6.1 Portable Message-Passing Libraries

PVM [137] and MPI [53] are the most widely used portable message-passing systems. Unlike Panda, these systems target application programmers. The main differences between these systems and Panda are that PVM and MPI are hard to use in a multithreaded environment, use receive downcalls instead of upcalls, and do not support totally-ordered broadcast.

PVM and MPI do not provide a multithreading abstraction. Both systems use blocking receive downcalls and do not provide a mechanism to notify an external thread scheduler that a thread has blocked. This is not a problem if the operating system provides kernel-scheduled threads and the client uses these kernel threads. If, on the other hand, a client employs a thread package not supported by the operating system, or if the operating system is not aware of threads at all, then the use of blocking downcalls and the lack of a notification mechanism complicate the use of multithreaded systems on top of MPI and PVM.

Multithreaded MPI and PVM clients can avoid the blocking-receive problem by using nonblocking variants of the receive calls, but this means that applications will have to poll for messages, because PVM and MPI do not support asynchronous (i.e., interrupt-driven) message delivery. The same lack of asynchronous message delivery complicates the implementation of PPSs that require asynchronous delivery to operate correctly and efficiently. These PPSs are forced to poll (e.g., in a background thread), but finding the right polling rate is difficult.
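
For example, an MPI client that needs asynchronous delivery typically ends up with a polling loop along the lines sketched below, where handle_message() is an illustrative application routine that receives and dispatches one message; how often to run this loop is precisely the polling-rate problem discussed earlier.

#include <mpi.h>

extern void handle_message(const MPI_Status *status);  /* illustrative dispatch routine */

/* Drain pending messages with MPI_Iprobe(); called from a background
 * thread or from points sprinkled through the computation. */
void poll_for_messages(void)
{
    int flag = 1;
    MPI_Status status;

    while (flag) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag) {
            handle_message(&status);
        }
    }
}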

LFC does not provide receive downcalls: all packets are delivered through upcalls. Blocking receive downcalls can still be implemented on top of LFC, though. (Panda's message-passing module provides a blocking receive.) When a blocking receive is implemented outside the message-passing system, control can be passed to the thread scheduler when a thread blocks on a receive.

Neither PVM nor MPI provides a totally-ordered broadcast.


While a totally-ordered broadcast can be implemented using PVM and MPI primitives, such an implementation will be slower than an implementation that supports ordered broadcasting at the lowest layers of the system.

The Nexus communication library [54] has the same goals as Panda; Nexus is intended to be used as a compiler target or as part of a larger runtime system. Nexus provides an asynchronous, one-way message-passing primitive called a Remote Service Request (RSR). The destination of an RSR is named by means of a global pointer, which is a system-wide unique name for a communication endpoint. Endpoints can be created dynamically and references to them can be passed around in messages. Unlike Panda, Nexus does not provide a broadcast primitive.

Horus [144] and its precursor Isis [22] were designed to simplify the construction of distributed programs. Horus focuses on message-ordering and fault-tolerance issues. Panda supports parallel rather than distributed programming, provides only one type of ordered broadcast, and does not address fault tolerance. Like Panda, Horus can be configured in different ways. Horus, however, is much more flexible in that it allows protocol stacks to be specified at run time rather than at compile time.

6.6.2 Polling and Interrupts

In their remote-queueing model [28], Brewer et al. focus on the benefits of polling. They recognize, however, that interrupts are indispensable and combine polling with selective interrupts. Interrupts are generated only for specific messages (e.g., operating-system messages) or under specific circumstances (e.g., network overflow). In contrast, our integrated thread package chooses between polling and interrupts based on the application's state (idle or not).

CRL [74] is a DSM system that illustrates the need to combine polling and interrupts in a single program. Operations on shared data are bracketed by calls to the CRL library. During or between such operations, a CRL application may not enter the library for a long time, so unless the responsibility for polling is put on the user program, protocol requests can be handled in a timely manner only by using interrupts. CRL therefore uses interrupts to deliver protocol request messages; polling is used to receive protocol reply messages. This type of behavior occurs not only in CRL, but also in other DSMs such as CVM [114] and Orca [9]. It is exactly the kind of behavior that Panda deals with transparently.

Foster et al. describe the problems involved in implementing Nexus's message delivery mechanism (RSRs) on different operating systems and hardware [54]. Available mechanisms for polling and interrupts, and their costs, vary widely across different systems. Moreover, these mechanisms are rarely well integrated with the available multithreading primitives. For example, a blocking read on a socket may block the entire process rather than just the calling thread.


We believe that our system, in which we have full control over thread scheduling and communication, achieves the desired level of integration.

6.6.3 Upcall Models

We have described three message-handling models in which each message results in a handler invocation at the receiving side. Nexus uses two of these models to dispatch RSR handlers. Nexus either creates a new thread to run the handler or else runs the handler in a preallocated thread. The first case corresponds to what we call the popup threads model, the second to the active-messages model. In the second case, the thread is not allowed to block.

A model closely related to popup threads is the optimistic active-messages model [150]. This model removes some of the restrictions of the basic active-messages model. It extends the class of handlers that can be executed as an active-message handler (i.e., without creating a separate thread) with handlers that can safely be aborted and restarted when they block. A stub compiler preprocesses all handler code. When the stub compiler detects that a handler performs a potentially blocking action, it generates code that aborts the handler when it blocks. When aborted, the handler is re-run in the context of a first-class thread (which is created on the fly). Thread-management costs are thus incurred only if a handler blocks; otherwise the handler runs as efficiently as a normal active-message handler.

Optimistic active messages are less powerful than popup threads. First, optimistic active messages require that all potentially blocking actions be recognizable to the stub compiler. Popup threads, in contrast, can be implemented without compiler support and do not rely on programmer annotations to indicate what parts of a handler may block. Second, an optimistic active-messages handler can be re-run safely only if it does not modify any global state before it blocks; otherwise this state will be modified twice. The programmer is thus forced to avoid or undo changes to global state until it is known that the handler cannot block any more.

The stack-splitting and return-address modification techniques we use to invoke message handlers efficiently are similar to the techniques used by Lazy Threads [58]. Lazy Threads provide an efficient fork primitive that optimistically runs a child on its parent's stack. When the child suspends, its return address is modified to reflect the new state. Also, future children will be allocated on different stacks. In our case, a thread that finds a message (either because it polled or because it received a signal) needs to fork a message handler for this message. Unlike Lazy Threads, however, the parent thread (the thread that polled or that was interrupted) does not wait for its child (the message handler) to terminate. Also, we run all children, including the first child, on their own stack.


6.6.4 Stream Messages

Stream messages were introduced in Illinois Fast Messages (version 2.0). Stream messages in Fast Messages differ in several ways from stream messages in Panda. The main difference is that Fast Messages requires that each incoming message be consumed in the context of the message's handler. This implies that the user needs to copy the message if the message arrives at an inconvenient moment. LFC allows a handler to put aside a stream message without copying that message.

This difference results from the way receive-packet management is implemented in Fast Messages. Fast Messages appends incoming packets to a circular queue of packet buffers in host memory. For each packet in this queue, Fast Messages invokes a handler function. When the handler function returns, Fast Messages recycles the buffer. This scheme is simple, because the host never needs to communicate the identities of free buffers to the NI. Both the host and the NI maintain a local index into the queue: the host's index indicates which packet to consume next and the NI's counter indicates where to store the next packet.

LFC explicitly passes the identities of free buffers to the NI (see Section 3.6), which has the advantage that the host can replace buffers. This is exactly what happens when lfc_upcall() does not allow LFC to recycle a packet immediately. As a result, packets can be put aside without copying, which is convenient and efficient. Clients should be aware, however, that packet buffers reside in pinned memory and are therefore a relatively precious resource. If a client continues to buffer packets and does not return them, then LFC will continue to add new packet buffers to its receive ring and will eventually run out of pinnable memory. To avoid this, clients should keep track of their resource usage and take appropriate measures, either by implementing flow control or by copying packets when the number of buffered packets exceeds a threshold.

6.6.5 Totally-Ordered Broadcast

Totally-ordered broadcast is a well-known concept in distributed computing, where it is used to simplify the construction of distributed programs. However, totally-ordered broadcast is not used much in parallel programming. As we shall see in Section 7.1, though, the Orca shared-object system uses this primitive in an effective way to update replicated objects.

Many different totally-ordered broadcast protocols are described in the literature. Heinzle et al. describe the GSB protocol used by Panda on LFC [65]. The main difference between GSB and Panda's implementation is that Panda handles (through LFC) sequence-number requests on the programmable NI; this reduces the number of network interrupts.

Kaashoek describes two other protocols, Point-to-point/Broadcast (PB) and Broadcast/Broadcast (BB), which also rely on a central, dedicated sequencer node [75].


In PB, the sender sends its message to the sequencer. The sequencer tags the message with a sequence number and broadcasts the tagged message to all destinations. In BB, the sender broadcasts its message to all destinations and the sequencer. When the sequencer receives the message, it assigns a sequence number to the message and then broadcasts a small message that identifies the message and contains the sequence number. Receivers are not allowed to deliver a message until they have received its sequence number and until all preceding messages have been delivered.

The performance characteristics of GSB, PB, and BB depend on the type of network they run on. PB and BB were developed on an Ethernet, where a broadcast has the same cost as a point-to-point message. On switched networks like Myrinet, however, a (spanning-tree) broadcast is much more expensive than a point-to-point message. Consequently, the performance of large messages will be dominated by the cost of broadcasting the data, irrespective of the ordering mechanism.

For small messages, BB's separate sequence-number broadcast is unattractive, because the worst-case latency of a totally-ordered broadcast becomes equal to twice the latency of an unordered broadcast (one for the data and one for the sequence number). Incidentally, BB was also considered unattractive for small messages in an Ethernet setting, but for a different reason: BB generates more interrupts than PB.

PB is much more attractive for small messages, because it adds only a single point-to-point message to the broadcast of the data. GSB uses two messages to obtain a sequence number. For large messages, however, PB has the disadvantage that all data packets must travel to and through the sequencer, thus putting more load on the network and on the sequencer. PB also puts more load on the sequencer than GSB if many processors try to broadcast simultaneously. With PB, the occupancy of the sequencer will be high, because the sequencer must broadcast every data packet. With GSB, the occupancy will be lower, because the sequencer need only respond to fetch-and-add requests. Finally, if all broadcasts originate from the sequencer, there are fewer opportunities to piggyback acknowledgements on multicast traffic.

6.7 Summary

This chapter described the implementation of Panda on LFC. Panda extends LFC with multithreading, messages of arbitrary length, demultiplexing, remote procedure call, and totally-ordered broadcast. The efficient implementation of this functionality is enabled by LFC's performance and interface and by novel techniques employed by Panda.


LFC allows packets to be delivered through polling and interrupts. Both mechanisms are useful, but manually switching between them is tedious and error-prone. Panda hides the complexity of managing both mechanisms. Panda clients need not poll, because Panda's threading subsystem, OpenThreads, transparently switches to polling when all client threads are idle. All messages are delivered using asynchronous upcalls. Panda uses the single-threaded upcall model. This model imposes fewer restrictions than the active-messages model, but more than popup threads. Upcalls are dispatched efficiently using thread inlining and lazy thread creation.

Panda implements stream messages, an efficient message abstraction that enables end-to-end pipelining of data transfers. Panda's stream messages separate notification and the consumption of message data, which allows clients to defer message processing.

Finally, we discussed a simple but efficient implementation of totally-ordered broadcasting. This implementation is enabled by LFC's efficient multicast primitive and its fetch-and-add primitive.


Chapter 7

Parallel-Programming Systems

This chapter focuses on the implementation of parallel-programming systems (PPSs) on LFC and Panda. We will show that the communication mechanisms developed in the previous chapters can be applied effectively to a variety of PPSs. We consider four PPSs:

- Orca [9], an object-based DSM

- Manta, a parallel Java system [99]

- CRL [74], a region-based DSM

- MPICH [61], an implementation of the MPI message-passing standard [53]

These systems offer different parallel-programming models and their implementations stress different parts of the underlying communication systems.

This chapter is organized as follows. Section 7.1 to Section 7.4 describe the programming model, implementation, and performance of, respectively, Orca, Manta, CRL, and MPI. Section 7.5 discusses related work. Section 7.6 compares the different systems and summarizes the chapter.

7.1 Orca

Orca is a PPS based on the shared-object programming model. This section describes this programming model, gives an overview of the Orca implementation, and then zooms in on two important implementation issues: operation transfer and operation execution. Finally, it discusses the performance of Orca.


7.1.1 Programming Model

In shared-memory and page-based DSM systems [78, 82, 92, 155], processes communicate by reading and writing memory words. To synchronize processes, the programmer must use mutual-exclusion primitives designed for shared memory, such as locks and semaphores. Orca's programming model, on the other hand, is based on high-level operations on shared data structures and on implicit synchronization, which is integrated into the model.

Orca programs encapsulate shared data in objects, which are manipulated through operations of an abstract data type. An object may contain any number of internal variables and arbitrarily complex data structures (e.g., lists and graphs). A key idea in Orca's model is to make each operation on an object atomic, without requiring the programmer to use locks. All operations on an object are executed without interfering with each other. Each operation is applied to a single object, but within this object the operation can execute arbitrarily complex code using the object's data. Objects in Orca are passive: objects do not contain threads that wait for messages. Parallel execution is expressed through dynamically created processes.

The shared-object model resembles the use of monitors. Both shared objects and monitors are based on abstract data types and for both models mutual-exclusion synchronization is done by the system instead of the programmer. For condition synchronization, however, Orca uses a higher-level mechanism [12], based on Dijkstra's guarded commands [44]. A guard is a boolean expression that must be satisfied before the operation can begin. An operation can have multiple guards, each of which has an associated sequence of statements. If the guards of an operation are all false, the process that invoked the operation is suspended. As soon as one or more guards become true, one true guard is selected nondeterministically and its sequence of statements is executed. This mechanism avoids the use of explicit wait and signal calls that are used by monitors to suspend and resume processes, simplifying programming.

Figure 7.1 gives an example definition of an object type Int. The definition consists of an integer instance variable x and two operations, inc() and await(). Operation inc() increments instance variable x and operation await() blocks until x has reached at least the value of parameter v. Instances of type Int are initialized by the initialization block that sets instance variable x to zero.

Figure 7.2 shows how Orca processes are defined and created and how objects are shared between processes. At application-startup time, the Orca runtime system (RTS) creates one instance of process type OrcaMain() on processor 0. In this example, OrcaMain() creates 15 other processes, instances of process type Worker(), on processors 1 to 15. To each of these processes OrcaMain() passes a reference to object counter.


object implementation Int;
    x: integer;

    operation inc();
    begin
        x := x + 1;
    end;

    operation await(v: integer);
    begin
        guard x >= v do
        od;
    end;

begin
    x := 0;
end;
end;

Fig. 7.1. Definition of an Orca object type.

process Worker(counter: shared Int);
begin
    counter$inc();
    ...
end;

process OrcaMain();
begin
    counter: Int;

    for i in 1..15 do
        fork Worker(counter) on i;
    od;
    counter$await(15);
    ...
end;

Fig. 7.2. Orca process creation.


Fig. 7.3. The structure of the Orca shared object system (layers from top to bottom: compiled Orca program, Orca runtime system, Panda and OpenThreads, LFC, Myrinet and Linux).

When all processes have been forked, counter, which was originally local to OrcaMain(), is shared between OrcaMain() and all Worker() processes. OrcaMain() waits until all Worker() processes have indicated their presence by incrementing counter.

7.1.2 Implementation Overview

Figure 7.3 shows the software components that are used during the execution of an Orca program. The Orca compiler translates the modules that make up an Orca program. For portability, the compiler generates ANSI C. Since the Orca compiler performs many optimizations such as common-subexpression elimination, strength reduction, and code motion, the C code it generates often performs as well as equivalent, hand-coded C programs. The C code is compiled to machine code by a platform-dependent C compiler. The resulting object files are linked with the Orca RTS, Panda, LFC, and other support libraries. Below, we describe how the compiler and the RTS support the efficient execution of Orca programs.

The Compiler

Besides translating sequential Orca code, the compiler supports an efficient implementation of operations on shared objects in three ways. First, the compiler distinguishes between read and write operations. Read operations do not modify the state of the object they operate on. Operation await() in Figure 7.1, for example, is a read operation. If the compiler cannot tell if an operation is a read operation, then it marks the operation as a write operation.

Second, the compiler tries to determine the relative frequency of read and write operations [11]. The resulting estimates are passed to the RTS, which uses them to determine an appropriate replication strategy for the object. For example, if the compiler estimates that an object is read much more frequently than it is written, then the RTS will replicate the object, because read operations on a replicated object do not require communication. The compiler's estimates are used only as a first guess. The RTS maintains usage statistics and may later decide to revise a previous decision [87].

Third, the compiler generates operation-specific marshaling code. Marshaling is discussed in Section 7.1.3.

The Runtime System

The Orca RTS implements process creation and termination, shared-object management, and operations on shared objects. Process creation occurs infrequently; most programs create, at initialization time, as many processes as the number of processors. Each Orca process is implemented as a Panda thread. To create a process in response to an application-level FORK call, the RTS broadcasts a fork message. Upon receiving this message, the processor on which the process is to be created (the forkee) issues an RPC back to the forking processor. This RPC is used to fetch copies of any shared objects that need to be stored on the forkee's processor. When the forkee receives the RPC's reply, it creates a Panda thread that executes the code for the new Orca process. The forker uses a totally-ordered broadcast message rather than a point-to-point message to initiate the fork. Broadcasting forks allows all processors to keep track of the number and type of Orca processes; this information is used during object migration decisions.

The RTS implements two object-management strategies. The simplest strategy is to store a single copy of an object in a single processor's memory. To decide in which processor's memory the object must be stored, the RTS maintains operation statistics for each shared object. The RTS uses these statistics and the compiler-generated estimates to migrate each object to the processor that accesses the object most frequently [87].

The second strategy is to replicate an object in the memories of all processors that can access the object. In this case, the main problem is to keep the replicas in a consistent state. Orca guarantees a sequentially consistent [86] view of shared objects. Among others, sequential consistency requires that all processes agree on the order of writes to shared objects. To achieve this, all write operations are executed by means of a totally-ordered broadcast, which leads naturally to a single order for all writes. Fekete et al. give a detailed formal description of correct implementation strategies for Orca's memory model [52].

Most programs create only a small number of shared objects. Using the compiler-generated hints and its runtime statistics, the RTS usually decides quickly and correctly on which processor(s) it must store each object [87]. For these reasons, object-management actions have not been optimized: replicating or migrating an object involves a totally-ordered broadcast and an RPC.

Operations are executed in one of three ways: through a local procedure call, through a remote procedure call, or through a totally-ordered broadcast.


Fig. 7.4. Decision process for executing an Orca operation: a read operation on a replicated object uses a local procedure call; a write operation on a replicated object uses a totally-ordered broadcast; an operation on a nonreplicated local object uses a local procedure call; an operation on a nonreplicated remote object uses a remote procedure call.

Figure 7.4 shows how one of these methods is selected. A read operation on a replicated object is executed locally, without communication. This is, of course, the purpose of replication: to avoid communication. Write operations on replicated objects require a totally-ordered broadcast. For operations on nonreplicated objects the RTS does not distinguish between read and write operations, but considers only the object's location. If the object is stored in the memory of the processor that invokes the operation, then the operation is executed by means of a local procedure call; otherwise a remote procedure call is used.

The description so far is correct only for operations that have at most one guard. With multiple guards, the compiler does not know in advance whether the operation is a read or write operation: this depends on the guard alternative that is selected. The compiler could conservatively classify operations that have at least one write alternative as write operations. Instead, however, the compiler classifies individual alternatives as read or write alternatives. When the RTS executes an operation, it first tries to execute a read alternative, because for replicated objects reads are less expensive than writes. The write alternatives are tried only if all read alternatives fail. If the object is replicated, this involves a broadcast.

Operations that require communication are executed by means of function shipping. Instead of moving the object to the invoker's processor, the RTS moves the invocation and its parameters to all processors that store the object. This is done using an RPC or a totally-ordered broadcast, as explained above. Since all processors run the same binary program, the code to execute an operation need not be transported from one processor to another; a small operation identifier suffices. In addition to this identifier the RTS marshals the object identity and the operation's parameters. In the case of an operation on a nonreplicated object, this information is stored in the RPC's request message. When the request arrives at the processor that stores the object, this processor's RTS executes the operation and sends a reply that contains the operation's output parameters and return value. In the case of a write operation on a replicated object, the invoker's RTS broadcasts the operation identifier and the operation parameters to all processors that hold a replica.


type Person = record
    age: integer;
    weight: integer;
end;

type PersonList = array [integer] of Person;

Fig. 7.5. Orca definition of an array of records.

Since the RTS stores each replicated object on all processors that reference it, the invoker's processor always has a copy of the object. All processors that have a replica perform the operation when they receive the broadcast message; other processors discard the message. On the invoker's machine, the RTS additionally passes the operation's output parameters and return value to the (blocked) invoker.

7.1.3 Efficient Operation Transfer

This section describes how the Orca RTS, the Orca compiler, Panda, and LFC cooperate to transfer operation invocations as efficiently as possible.

Compiler and Runtime Support for Marshaling Operations

In earlier implementations of Orca, all marshaling of operations was performed by the RTS. To marshal an operation, the RTS needed the following information:

- the number of parameters

- the parameter modes (IN, OUT, IN OUT, or SHARED)

- the parameter types

This information was stored in a recursive data structure in which each component type of a complex type had its own type descriptor node. During operation execution, the RTS made two passes over this data structure: one to find the total size of the data to be marshaled and one to copy the data into a message buffer. For an array of records such as defined in Figure 7.5, for example, the RTS would inspect each record's type descriptor to determine the record's size, even if all records had the same, statically known size.

The current compiler determines object sizes at compile time whenever possible. (Orca supports dynamic arrays and a built-in graph data type, so object sizes cannot always be computed statically.) For each operation defined as part of an object type, the compiler generates operation-specific marshaling and unmarshaling routines, which are invoked by the Orca RTS. In some cases, these routines call back into the RTS to marshal dynamic data structures and object metastate.

Figure 7.6 shows the code generated to marshal and unmarshal the Orca type defined in Figure 7.5; this code has been edited and slightly simplified for readability. The figure shows only the routines that are used to construct and read an operation request message. Another triple of routines is generated for RPC reply messages that hold OUT parameters and return values.

The first routine, sizeof_iovec_array_record(), computes the number of pointers in the I/O vector and is used by the RTS to allocate a sufficiently large I/O vector. (The RTS caches I/O vectors, but has to check if there is a cached vector that can hold as many pointers as needed.) The second routine, fill_iovec_array_record(), generates a Panda I/O vector that contains pointers to the data items that are to be marshaled. If the array is empty, the routine adds only a pointer to the array descriptor; otherwise it adds a pointer to the array descriptor and a pointer to the array data.

To transmit the operation, the RTS stores its header(s) in a header stack and passes both the I/O vector and the header stack to one of Panda's send routines. Panda copies the data items referenced by the I/O vector into LFC send packets.
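For concreteness, the fragment below sketches how the RTS might drive the generated routines from Figure 7.6 when it builds and sends an operation-request message. Only pan_iovec_p and the two generated routines are taken from the text; the header type op_hdr_t, the cache lookup get_cached_iovec(), and the Panda send call pan_send_request() are invented placeholders, not actual Panda or RTS interfaces.

/* Sketch only: marshal an array-of-records argument and hand it to Panda. */
void send_array_operation(int dest, op_hdr_t *hdr, PersonListDesc *arg)
{
    int entries = sizeof_iovec_array_record(arg);   /* 1 or 2 I/O-vector entries */
    pan_iovec_p iov = get_cached_iovec(entries);    /* reuse a cached I/O vector */

    fill_iovec_array_record(iov, arg);              /* descriptor (+ data) pointers */
    pan_send_request(dest, hdr, iov, entries);      /* Panda copies the referenced
                                                       data into LFC send packets */
}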

At the receiving side, Panda creates a stream message and passes this message to the Orca RTS. The RTS looks up the target object and the descriptor for the operation that is to be invoked, and then invokes unmarshal_array_record(), the third routine in Figure 7.6. This operation-specific routine first unmarshals the array descriptor and then decides if it needs to unmarshal any data.

Data Transfers

Above, we discussed the Orca part of operation transfers. This section discusses the entire path, through all communication layers. All data transfers involved in an operation on a remote object are shown in Figure 7.7. At the sending side, Panda copies (using programmed I/O) data referenced by the I/O vector directly from user data structures into LFC send packets. The second stage in the data transfer pipeline consists of moving LFC send packets to the destination NI. In the third stage, LFC uses DMA to move network packets from NI memory to host memory. Panda organizes the packets in host memory into a stream message and passes this message to the Orca RTS. In the fourth stage, the RTS allocates space to hold the data stored in the message and unmarshals (i.e., copies) the contents of the message into this space. All four stages operate concurrently. As soon as Panda has filled a packet, for example, LFC transmits this packet, while Panda starts filling the next packet.

At the sending side, one unnecessary data transfer occurs.


int sizeof_iovec_array_record(PersonListDesc *a)
{
    if (a->a_sz > 0) {
        return 2;   /* pointer to descriptor and to array data */
    }
    return 1;       /* pointer to descriptor; no data (empty array) */
}

pan_iovec_p fill_iovec_array_record(pan_iovec_p iov, PersonListDesc *a)
{
    iov->data = a;              /* add pointer to the array descriptor */
    iov->len = sizeof(*a);
    iov++;

    if (a->a_sz > 0) {
        /* add pointer to array data */
        iov->data = (Person *) a->a_data;
        iov->len = a->a_sz * sizeof(Person);
        iov++;
    }
    return iov;
}

void unmarshal_array_record(pan_msg_p msg, PersonListDesc *a)
{
    Person *r;

    pan_msg_consume(msg, a, sizeof(*a));    /* unmarshal array descriptor */
    if (a->a_sz > 0) {
        r = malloc(a->a_sz * sizeof(Person));
        a->a_data = r;
        pan_msg_consume(msg, r, a->a_sz * sizeof(Person));
    } else {
        a->a_sz = 0;
        a->a_data = 0;
    }
}

Fig. 7.6. Specialized marshaling code generated by the Orca compiler.


Fig. 7.7. Data transfer paths in Orca: (1) marshaling of Orca data into send packets, (2) data transmission, (3) DMA transfer from NI to host memory, (4) unmarshaling of the Panda stream message.

Since LFC uses programmed I/O to move data into send packets, the data crosses the memory bus twice: once from host memory to the processor registers and once from those registers to NI memory. By using DMA instead of PIO, the data would cross the memory bus only once. The data in Figure 2.2 suggests that for messages of 1 Kbyte and larger DMA will be faster than PIO. The use of DMA also frees the processor to do other work while the transfer progresses. The same asynchrony, however, also implies that a mechanism is needed to test if a transfer has completed. This requires communication between the host and the NI. Either the host must poll a location in NI memory, which slows down the data transfer, or the NI must signal the end of the data transfer by means of a small DMA transfer to host memory, which increases the occupancy of both the NI processor and the DMA engine.

Another problem with DMA is that it is necessary to pin the pages containing the Orca data structures or to set up a dedicated DMA area. Unless the Orca data structures can be stored in the DMA area, the latter approach requires a copy to the DMA area. On-demand pinning of Orca data is feasible, but complex. To avoid pinning and unpinning on every transfer, a caching scheme such as VMMC-2's UTLB is needed (see Section 2.3). For small transfers, such a scheme is less efficient than programmed I/O.

At the receiving side, the optimal scenario would be for the NI to transfer data from the network packets in NI memory to the data buffers allocated by the RTS (i.e., to avoid the use of host receive packets). In practice, this is difficult. First, the NI has to know to which message each incoming data packet belongs. This implies that the NI has to maintain state for each message and look up this state for each incoming packet.


Fig. 7.8. Orca with popup threads: Panda's upcall thread stores incoming requests in an RPC queue or a broadcast queue, which are emptied by a pool of RTS server threads.

Second, the NI has to find a sufficiently large destination buffer in host memory into which the data can be copied. Third, after the data has been copied, the NI has to notify the receiving process. Unless the host and the NI perform a handshake to agree upon the destination buffer, this notification cannot be merged with the data transfer as in LFC. A handshake, however, increases latency.

Summarizing, eliminating all copies is difficult and it is doubtful whether doing so will significantly improve performance. We therefore decided to use a potentially slower, but simpler data path.

7.1.4 Efficient Implementation of Guarded Operations

This section describes an efficient implementation of Orca's guarded operations on top of Panda's upcall model. When Panda receives an operation request, it dispatches an upcall to the Orca RTS. It is not obvious how the RTS should execute the requested operation in the context of this upcall. The problem is that the operation may block on a guard. Panda's upcall model, however, forbids message handlers to block and wait for incoming messages. Such messages may have to be received and processed to make a guard evaluate to true.

To solve this problem, an earlier version of the Orca RTS implemented its own popup threads. The structure of this RTS, which we call RTS-threads, is shown in Figure 7.8. In RTS-threads, Panda's upcall thread does not execute incoming operation requests, but merely stores them in one of two queues. RPC requests for operations on nonreplicated objects are stored in the RPC queue and broadcast messages for write operations on replicated objects are stored in the broadcast queue. These queues are emptied by a pool of RTS-level server threads. When all server threads are blocked, the RTS extends the thread pool with new server threads. Since the server threads do not listen to the network, they can safely block on a guard when they process an incoming operation request from the RPC queue. (This blocking is implemented by means of condition variables.) Also, it is now safe to perform synchronous communication while processing a broadcast message. This type of nested communication occurs during the creation of Orca processes and during object migration.

RTS-threads uses only one thread to service the broadcast queue. With multiple threads, incoming (totally-ordered) broadcast messages could be processed in a different order than they were received. Using only a single thread, however, implies (once again) that, without extra measures, write operations on replicated objects cannot block.

Besides this ordering and blocking problem, RTS-threads suffers from problems related to the use of popup threads (see Section 6.2.3): increased thread switching and increased memory consumption. Processing an incoming operation request always requires at least one thread switch (from Panda's upcall to a server thread). In addition, when many incoming operation requests block, RTS-threads is forced to allocate many thread stacks, which wastes memory. Finally, when some operation modifies an object that many operations are blocked on, RTS-threads will signal all threads associated with these operations. These threads will then re-evaluate the guard they were blocked on, perhaps only to find that they need to block again. This leads to excessive thread switching.

To solve these problems, we restructured the RTS so that all operation requests can be processed directly by Panda's upcall, without the use of server threads. The new structure is shown in Figure 7.9. The new RTS employs only one RTS server thread (not shown), which is used only during actions that occur infrequently, such as process creation and object migration. In these cases, it is necessary to perform an RPC (i.e., synchronous communication) in response to an incoming broadcast message. This RPC is performed by the RTS server thread, just as in RTS-threads.

Blocking on guards is handled by exploiting the observation that Orca operations can block only at the very beginning. A blocked operation can therefore always be represented by an object identifier, an operation identifier, and the operation's parameters. This invocation information is available to the RTS when it invokes an operation. When an operation blocks, the compiler-generated code for that operation returns an error status to the RTS. Instead of blocking the calling thread on a condition variable, as in RTS-threads, the RTS creates a continuation. In this continuation, the RTS stores the invocation information and the name of an RTS function that, given the invocation information, can resume the blocked invocation. Different resume functions are used, depending on the way the original invocation took place. The resume function for an RPC invocation, for example, differs from the resume function for a broadcast invocation, because it needs to send a reply message to the invoking processor.


Fig. 7.9. Orca with single-threaded upcalls: Panda's upcall thread executes incoming operation requests directly.

cont_init(contqueue, lock)
cont_clear(contqueue)
state_ptr = cont_alloc(contqueue, size, contfunc)
cont_save(state_ptr)
cont_resume(contqueue)

Fig. 7.10. The continuations interface.

Each shared object has a continuation queue in which the RTS stores continuations for blocked operations. When the object is later modified, the modifying thread walks the object's continuation queue and calls each continuation's resume function. If an operation fails again, its continuation remains on the queue, otherwise it is destroyed.

Figure 7.10 shows the interface to the continuation mechanism. Queues of continuations resemble condition variables, which are essentially queues of thread descriptors. This similarity makes replacing condition variables with continuations quite easy. Cont_init() initializes a continuation queue; initially, the queue is empty. Like a condition variable, each continuation queue has an associated lock (lock) that ensures that accesses to the queue are atomic. Cont_clear() destroys a continuation queue. Cont_alloc() heap-allocates a continuation structure and associates it with a continuation queue. (Continuations cannot be allocated on the stack, because the stack is reused.) Cont_alloc() returns a pointer to a buffer of size bytes in which the client saves its state. After saving its state, the client calls cont_save(), which appends the continuation to the queue. Together, cont_alloc() and cont_save() correspond to a wait operation on a condition variable. In the case of condition variables, however, no separate allocation call is needed, because the system knows what state to save: the client's thread stack. Finally, cont_resume() resembles a broadcast on a condition variable; it traverses the queue and resumes all continuations.
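To make the use of this interface concrete, the sketch below shows how a blocked RPC invocation could be captured in a continuation and retried when the object is modified. The continuation calls follow Figure 7.10; everything else (the blocked_op_t record, object_t, execute_operation(), send_rpc_reply(), and the CONT_KEEP/CONT_DONE return codes) is invented for the example and does not reflect the actual RTS code.

/* Invocation record saved in the continuation buffer (illustrative). */
typedef struct {
    int obj_id;        /* shared-object identifier       */
    int op_id;         /* operation identifier           */
    void *args;        /* marshaled operation parameters */
    int requester;     /* processor that issued the RPC  */
} blocked_op_t;

/* Stand-ins for however the queue decides to keep or destroy a continuation. */
#define CONT_KEEP 0
#define CONT_DONE 1

/* Resume function registered with cont_alloc(): retry the operation. */
static int resume_rpc_op(void *state)
{
    blocked_op_t *op = state;

    if (execute_operation(op->obj_id, op->op_id, op->args) == OP_BLOCKED)
        return CONT_KEEP;                     /* guards still false                */
    send_rpc_reply(op->requester, op->args);  /* return OUT parameters and result  */
    return CONT_DONE;
}

/* Called from the Panda upcall when all guards of an RPC operation fail. */
static void block_rpc_op(object_t *obj, int op_id, void *args, int requester)
{
    blocked_op_t *op = cont_alloc(&obj->contqueue, sizeof(*op), resume_rpc_op);

    op->obj_id    = obj->id;
    op->op_id     = op_id;
    op->args      = args;
    op->requester = requester;
    cont_save(op);                 /* append the continuation to the object's queue */
}

/* Called after a write operation has modified the object. */
static void object_modified(object_t *obj)
{
    cont_resume(&obj->contqueue);  /* walk the queue and call each resume function */
}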

Representing blocked operations by continuations instead of blocked threads has three advantages.

1. Continuations consume very little memory. There is no associated stack, just the invocation information.

2. Resuming a blocked invocation does not involve any signaling and switching to other threads, but only plain procedure calls.

3. Continuations are portable across communication architectures. With appropriate system support (such as the fast handler dispatch described in Section 6.2.3), popup threads can be implemented more efficiently than in RTS-threads. Continuations do not require such support and they avoid some of the disadvantages of popup threads.

Manual state saving with continuations is relatively easy if message handlers can block only at the beginning (as with Orca operations), because the state that needs to be saved consists essentially of the message that has been received. Manual state saving is more difficult when handlers block after they have created new state. In this case, the handler must be split in two parts: the code executed before and after the synchronization point. For very large systems, this manual function splitting becomes tedious. The difficulty lies in identifying the state that needs to be communicated from the function's first half to its second half (by means of a continuation). This state may well include state stored in stack frames below the frame that is about to block (assuming that stacks grow upward). In the worst case, the entire stack must be saved.

A second complication arises if a handler performs synchronous communication. A synchronous communication call can only be split in two parts if there is a nonblocking alternative for the synchronous primitive. Earlier versions of Panda, for example, did not provide asynchronous message-passing primitives, but only a synchronous RPC primitive. In this case, true blocking cannot be avoided and it is necessary to hand off state to a separate thread that can safely perform the synchronous operation. In the continuation-based RTS, such cases are handled by a single RTS server thread.

7.1.5 Performance

This section discusses only operation latency and throughput. An elaborate application-performance study was performed using an Orca implementation that runs on Panda (version 3.0) and FM/MC [9].


Fig. 7.11. Roundtrip latencies (in microseconds) for Orca, Panda, and LFC as a function of message size (in bytes).

Application performance on Panda 4.0 and LFC is discussed in Chapter 8, which studies the application-level effects of different LFC implementations. All measurements in this chapter were performed on the experimental platform described in Section 1.6, unless noted otherwise.

Performance of Operations on Nonreplicated Objects

Figure 7.11 shows the execution time of an Orca operation on a remote, nonreplicated object. The operation takes an array of bytes as its only argument and returns the same array. The horizontal axis specifies the number of bytes in the array. The operation is performed using a Panda RPC to the remote machine. For comparison, Figure 7.11 also shows roundtrip latencies for Panda and LFC using messages that have the same size as the Orca array. In all three benchmarks, the request and reply messages are received through polling.

Orca-specific processing costs approximately 15 µs. This includes the time to lock the RTS, to read the RTS headers, to allocate space for the array parameter, to look up the object, to execute the operation, to update the runtime object access statistics, and to build the reply message.

We used the same test to measure operation throughput. The results of this test, and the corresponding roundtrip throughputs for Panda and LFC with the copy receive strategy, are shown in Figure 7.12. We define roundtrip throughput as (2m)/RTT, where m is the size in bytes of the array transmitted in the request and reply messages and RTT the measured roundtrip time.

Compared to the throughput obtained by Panda and LFC, Orca's throughput is good. This is due to the use of I/O vectors and stream messages, which avoid extra copying inside the Orca RTS.


Fig. 7.12. Roundtrip throughput (in Mbyte/second) for Orca, Panda, and LFC as a function of message size (in bytes).

For all systems, throughput decreases when messages no longer fit into the L2 cache (256 Kbyte).

Performance of Operations on Replicated Objects

Figure 7.13 shows the latency of a write operation on an object that has been replicated on 64 processors. The operation takes an array of bytes as its only argument and does not return any results. This operation is performed by means of a totally-ordered Panda broadcast. For comparison, the figure also shows the broadcast latencies for Panda (ordered and unordered) and LFC (unordered). In all cases, the latency shown is the latency between the sender and the last receiver of the broadcast. Orca adds approximately 18 µs (17%) to Panda's totally-ordered broadcast primitive.

Figure 7.14 shows the throughput obtained using the same write operation as above. The loss in throughput relative to Panda (with ordering) is at most 15%. This occurs when Orca sends two packets and Panda one (due to an extra Orca header). In all other cases the loss is at most 8%.

The Impact of Continuations

Since we have no implementation of RTS-threads on Panda 4.0, we cannot directly measure the gains of using continuations instead of popup threads. An earlier comparison showed that operation latencies went from 2.19 ms in RTS-threads to 1.90 ms in the continuation-based RTS [14], an improvement of 13%.


Fig. 7.13. Broadcast latency (in microseconds) on 64 processors for Orca, Panda (ordered and unordered), and LFC, as a function of message size (in bytes).

Fig. 7.14. Broadcast throughput (in Mbyte/second) on 64 processors for Orca, Panda (ordered and unordered), and LFC, as a function of message size (in bytes).


Issue                                  Orca                            Manta
Programming model
  Object placement                     Transparent                     Explicit
  Mutual exclusion                     Per operation, implicit         Synchronized methods/blocks
  Condition synchronization            Guards                          Condition variables
  Garbage collection                   No                              Yes
Implementation
  Compilation                          To C                            To assembly (x86 and SPARC)
  Object replication                   Yes                             No
  Object migration                     Yes                             No
  Invocation mechanisms                Totally-ordered bcast and RPC   RMI
  Upcall models                        Panda upcalls                   Popup threads

Table 7.1. A comparison of Orca and Manta.

These measurements were performed on 50 MHz SPARCClassic clones connected by 10 Mbps Ethernet. For communication we used Panda 2.0 on top of the datagram service of the Amoeba operating system. Thread switching on the SPARC was expensive, because it required a trap to the kernel. On the Pentium Pro, thread switches are performed entirely in user space, so the gains of using continuations instead of popup threads are smaller. Nevertheless, the costs of switching from a Panda upcall to a popup thread are significant. In Manta, for example, dispatching a popup thread costs 4 µs and accounts for 9% of the execution time of a null operation on a remote object (see Section 7.2).

7.2 Manta

Manta is a parallel-programming system based on Java [60]. Java is a portable, type-secure, and object-oriented programming language with automatic memory management (i.e., garbage collection); these properties have made Java a very popular programming language. Java is also an attractive (base) language for parallel programming. Multithreading, for example, is part of the language and an RPC-like communication mechanism is available in standard libraries.

Manta implements the JavaParty [117] programming model, which is based on shared objects and resembles Orca's shared-object model. The programming model and its implementation on Panda and LFC are described below. A detailed description of Manta is given in the MSc thesis of Maassen and van Nieuwpoort [98].


7.2.1 Programming Model

Manta extends Java with one keyword: remote. Java threads can share instances of any class that the programmer tags with this keyword. Such instances are called remote objects. In addition, Manta provides library routines to create threads on remote processors. Threads that run on the same processor can communicate through global variables; threads that run on different machines can communicate only by invoking methods on shared remote objects.

This programming model resembles Orca's shared-object model, but there are several important differences, which are summarized in Table 7.1 and discussed below. Section 7.2.2 discusses the implementation differences.

First, in Manta, object placement is not transparent to the programmer. When creating a remote object, the programmer must specify explicitly on which processor the object is to be stored. Each object is stored on a single processor (Manta does not replicate objects) and cannot be migrated to another processor.

Second, Orca and Java use different synchronization mechanisms. In Orca, all operations on an object are executed atomically. In Java, individual methods can be marked as 'synchronized.' Two synchronized methods cannot interfere with each other, but a nonsynchronized method can interfere with other (synchronized and nonsynchronized) methods. Also, Java uses condition variables for condition synchronization; Orca uses guards.

The last difference is that Java is a garbage-collected language, which has some impact on communication performance (see Section 7.2.3).

7.2.2 Implementation

Like Orca, Manta does not use LFC directly, but builds on Panda. Manta benefits from Panda's multithreading support, Panda's transparent mixing of polling and interrupts, and Panda's stream messages. Since Manta does not replicate objects, it need not maintain multiple consistent copies of an object, and therefore does not use Panda's totally-ordered broadcast primitive.

Manta achieves high performance through a fast remote-object invocation mechanism and the use of compilation instead of interpretation. Both are discussed below.

Remote-Object Invocation

Like Orca, Manta uses function shipping to access remote objects. Method invocations on a remote object are shipped by means of an object-oriented variant of RPC called Remote Method Invocation (RMI) [152]. Each Manta RMI is executed by means of a Panda RPC.


Processing an RMI request consists of executing a method on a remote object. All parameters except remote objects are passed by value. Method invocations can block at any time on a condition variable. This is different from Orca, where blocking occurs only at the beginning of an operation. Due to this difference, Manta cannot always use Panda upcalls and continuations to handle incoming RMIs. If a method is executed by a Panda upcall and blocks halfway through, then the Manta runtime system is not aware of the state created by that method and cannot create a continuation.

To deal with blocking methods, Manta uses popup threads to process incoming RMIs. When an RMI request is delivered by a Panda upcall, the Manta runtime system passes the request to a popup thread in a thread pool. This thread executes the method and sends the reply. When the method blocks, the popup thread is blocked on a Panda condition variable; no continuations are created. In short, Manta uses the same system structure as earlier Orca implementations (cf. Figure 7.8).

Compiler Support

Besides defining the Java language, the Java developers also defined the Java Virtual Machine (JVM) [95], a virtual instruction set architecture. Java programs are typically compiled to JVM bytecode; this bytecode is interpreted. This approach allows Java programs to be run on any machine with a Java bytecode interpreter. The disadvantage is that interpretation is slow. Current research projects are attacking this problem by means of just-in-time (JIT) compilation of Java bytecode for a specific architecture and by means of hardware support [104].

Manta includes a compiler that translates Java source code to machine code for SPARC and Intel x86 architectures [145]. This removes the overhead of interpretation for applications for which source code is available. For interoperability, Manta can also dynamically load and execute Java bytecode [99]. The experiments below, however, use only native binaries without bytecode.

The compiler also supports Manta's RMI. For each remote class, the compiler generates method-specific marshaling and unmarshaling routines. Manta marshals arrays in the same way as Orca (see Figure 7.6). At the sending side, the compiler-generated code adds a pointer to the array data to a Panda I/O vector and Panda copies the data into LFC send packets. At the receiving side, the compiler-generated code uses Panda's pan_msg_consume() to copy data from LFC host receive packets to a Java array.

For nonarray types, Manta makes an extra copy at the sending side. All nonarray data that is to be marshaled is copied to a single buffer and a pointer to this buffer is added to a Panda I/O vector. Since nonarray objects are often small, the extra copying costs may outweigh the cost of I/O vector manipulations.


Fig. 7.15. Roundtrip latencies (in microseconds) for Manta RMI, Orca, Panda, and LFC as a function of message size (in bytes).

7.2.3 Performance

Figures 7.15 and 7.16 show Manta's RMI (roundtrip) latency and throughput, respectively. These numbers were measured by repeatedly performing an RMI on a remote object. The method takes a single array parameter and returns this parameter; the horizontal axes of Figure 7.15 and Figure 7.16 specify the size of the array. For comparison, the figures also show the roundtrip latencies and throughputs for LFC, Panda, and Orca. All four systems make the same number of data copies.

The null latency for an empty array is 76 µs. This high latency is due to unoptimized array manipulations. With zero parameters instead of an empty array parameter, the null latency drops to 48 µs. The thread switch from Panda's upcall to a popup thread in the Manta runtime system costs 4 µs. For nonempty arrays, Manta's latency increases faster than the latency of Orca, Panda, and LFC. This is not due to extra copying, but to the cache effect described below.

Figure 7.16 shows the roundtrip throughput for LFC, Panda, Orca, and Manta. Manta's RMI peak throughput is much lower than the peak throughput for Orca, Panda, and LFC. The reason is that Manta does not free the data structures into which incoming messages are unmarshaled until the garbage collector is invoked. In this test, the garbage collector is never invoked, so each incoming message is unmarshaled into a fresh memory area, which leads to poor cache behavior. Specifically, each time a message is unmarshaled into a buffer, the contents of receive buffers used in previous iterations is flushed from the cache to memory. This problem does not occur in the other tests (Orca, Panda, and LFC), because they reuse the same receive buffer in each iteration.


Fig. 7.16. Roundtrip throughput (in Mbyte/second) for Manta RMI, Orca, Panda, and LFC as a function of message size (in bytes).

If Manta can see that the data objects contained in a request message are no longer reachable after completion of the remote method that is invoked, then it can immediately reuse the message buffer without waiting for the garbage collector to recycle it. At present, the Manta compiler is able to perform the required escape analysis [36] for simple methods, including the method used in our throughput benchmark. Since we wish to illustrate the garbage collection problem and since escape analysis works only for simple methods, we show the unoptimized throughput. With escape analysis enabled, Manta tracks Panda's throughput curve.

7.3 The C Region Library

The C Region Library (CRL) is a high-performance software DSM system which was developed at MIT [74]. Using CRL, processes can share regions of memory of a user-defined size. The CRL runtime system implements coherent caching for these regions. This section discusses CRL's programming model, its implementation on LFC, and the performance of this implementation.

7.3.1 Programming Model

CRL requires the programmer to store shared data in regions. A region is a contiguous chunk of memory of a user-defined size; a region can neither grow nor shrink. Processes can map regions into their address space and access them using memory-reference instructions.

To make changes visible to other processes, and to observe changes made by other processes, each process must enclose its region accesses by calls to the CRL runtime system. (CRL is library-based and has no compiler.) A series of read accesses to a region r should be bracketed by calls to rgn_start_read(r) and rgn_end_read(r), respectively. If a series of accesses includes an instruction that modifies a region r, then rgn_start_write(r) and rgn_end_write(r) should be used. Rgn_start_read() locks a region for reading and rgn_start_write() locks a region for writing. Both calls block if another process has already locked the region in a conflicting way. (A conflict occurs whenever at least one write access is involved.) Rgn_end_read() and rgn_end_write() release the read and write lock, respectively. If all region accesses in a program are properly bracketed, then CRL guarantees a sequentially consistent [86] view of all regions. However, the CRL library cannot verify that users indeed bracket their accesses properly.
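A minimal usage sketch of this bracketing discipline is shown below. Only the rgn_start_/rgn_end_ calls are taken from the text; the rid_t type and the rgn_map() call used to obtain a pointer to the region are assumptions made to keep the example self-contained.

/* Read a shared region: bracket the accesses with a read lock. */
double sum_region(rid_t rid, int n)
{
    double *v = rgn_map(rid);          /* map the region into this address space */
    double sum = 0.0;
    int i;

    rgn_start_read(v);                 /* may block if a writer holds the region */
    for (i = 0; i < n; i++)
        sum += v[i];
    rgn_end_read(v);                   /* release the read lock */
    return sum;
}

/* Modify a shared region: bracket the accesses with a write lock. */
void scale_region(rid_t rid, int n, double factor)
{
    double *v = rgn_map(rid);
    int i;

    rgn_start_write(v);                /* conflicts with readers and writers */
    for (i = 0; i < n; i++)
        v[i] *= factor;
    rgn_end_write(v);
}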

7.3.2 Implementation

Although CRL uses a sharing model that is similar to Orca's shared objects, the implementation is different. First, CRL runs directly on LFC and does not use Panda. Second, CRL uses a different consistency protocol for replicated shared objects. Orca updates shared objects by means of function shipping. CRL, in contrast, uses invalidation and data shipping. (A function-shipping implementation of CRL exists [67], but the LFC implementation of CRL is based on the original, data-shipping variant.) CRL uses a protocol similar to the cache coherence protocols used in scalable multiprocessors such as the Origin 2000 [151] and the MIT Alewife [1]. The protocol treats every shared region as a cache line and runs a directory-based cache coherence protocol to maintain cache consistency.

The core of CRL's runtime system is a state machine that implements the consistency protocol. The runtime system maintains (meta)state for each region on all nodes that cache the region. The state machine is implemented as a set of handler routines that are invoked when a specific event occurs. An event is either the invocation of a CRL library routine that brackets a region access (e.g., rgn_start_write()) or the arrival of a protocol message. A detailed description of the state machine is given in Johnson's PhD thesis [73].

When a process creates a region on a processor P, then P becomes the region's home node. The home node stores the region's directory information; it maintains a list of all processors that are caching the region. It is the location where nodes that are not caching the region can obtain a valid copy of the region. It is also a central point at which conflicting region accesses are serialized so that consistency is maintained.


Fig. 7.17. CRL read and write miss transactions: (a) clean read miss, (b) dirty read miss, (c) clean write miss, (d) dirty write miss. P is the requesting processor, H the home node, R a reader, and W the last writer.

CRL Transactions

Communication in CRL is caused mainly by read and write misses. Figure 7.17 shows the communication patterns that result from different types of misses. A miss is a clean miss if the home node H has an up-to-date copy of the region that process P is trying to access, otherwise the miss is a dirty miss. In the case of a clean miss, the data travels across the network only once (from H to P). For a clean write miss, the home node sends invalidations to all nodes R that recently had read access to the region. The region is not shipped to P until all readers R have acknowledged these invalidations. In the case of a dirty miss the data is transferred twice, once from the last writer W to home node H and once from H to the requesting process P. The CRL version we used does not implement the three-party optimization that avoids this double data transfer. (Later versions did implement this optimization.)

Page 191: Communication Architectures for Parallel-ProgrammingSystemsPanda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl,¨ and me. Rutger Hofman implemented Panda 4.0

7.3TheC RegionLibrary 169

Polling and Interrupts

CRL uses both polling and interrupts to deliver messages. The implementation is single-threaded. All messages, whether they are received through polling or interrupts, are handled in the context of the application thread.

The runtime system normally enables network interrupts so that nodes can be interrupted to process incoming requests (misses, invalidations, and region-map requests) even if they happen to be executing application code. For some applications, interrupts are necessary to guarantee forward progress. For other applications, they improve performance by reducing response time.

In many cases, though, interrupts can be avoided. All transactions in Figure 7.17 start with a request from node P to home node H and end with a reply from H to P. While waiting for the reply from H, P is idle. Similarly, if H sends out invalidations then H will be idle until the first acknowledgement arrives and in between subsequent acknowledgements. In these idle periods, CRL disables network interrupts and polls. This way the expected reply message is received through polling rather than interrupts, which reduces the receive overhead.
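The idle-wait behavior described above can be pictured as the small loop below. It is only an illustration: the function names (crl_disable_interrupts, lfc_poll, and so on) are placeholders and do not correspond to actual CRL or LFC routines.

/* Sketch: wait for the reply to our own request without taking interrupts. */
static void wait_for_reply(volatile int *reply_arrived)
{
    crl_disable_interrupts();          /* no network interrupts while we are idle  */
    while (!*reply_arrived)
        lfc_poll();                    /* poll; the reply (and any other incoming
                                          packets) are handled from this loop      */
    crl_enable_interrupts();           /* resume interrupt-driven request handling */
}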

This behavior is a good match to LFC's polling watchdog, which works well for clients that mix polling and interrupts. Interrupts are used to handle request messages unless the destination node happens to be waiting (and polling) for a reply to one of its own requests. By delaying interrupts, LFC increases the probability that this happens.

Message Handling

CRL sends two types of messages. Data messages are used to transfer regions containing user data (e.g., when a region is replicated). These messages consist of a small header followed by the region data. Control messages are small messages used to request copies of regions, to invalidate regions, to acknowledge requests, etc. Both data and control messages are point-to-point messages; CRL does not make significant use of LFC's multicast primitive.

Each CRL message carries a small header. Each header contains a demultiplexing key that distinguishes between control messages, data messages, and other (less frequently used) message types. For control messages, the header contains the address of a handler function and up to four integer-sized arguments for the handler. A data message consists of a region pointer, a region offset, and a few more fields. The region pointer identifies a mapped region in the requesting process's address space. Since regions can have any size, they may well be larger than an LFC packet. The offset is used to reassemble large data messages; it indicates where in the region to store the data that follows the header.
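The header layout described above can be summarized by the following illustrative declarations; the field names and types are assumptions, not CRL's actual definitions.

typedef enum { CRL_CONTROL_MSG, CRL_DATA_MSG /* , ... */ } crl_msg_kind_t;

/* Control message: names a handler and carries up to four integer arguments. */
typedef struct {
    crl_msg_kind_t kind;                    /* demultiplexing key         */
    void (*handler)(int, int, int, int);    /* protocol handler to invoke */
    int arg[4];                             /* handler arguments          */
} crl_control_hdr_t;

/* Data message: identifies the target region and where this packet's payload
 * belongs, so that multipacket region transfers can be reassembled. */
typedef struct {
    crl_msg_kind_t kind;                    /* demultiplexing key                  */
    void *region;                           /* mapped region in the requesting
                                               process's address space             */
    unsigned offset;                        /* offset of the payload in the region */
} crl_data_hdr_t;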

Incoming packets are always processed immediately. The data contained in data packets is always copied to its destination without delay.

Page 192: Communication Architectures for Parallel-ProgrammingSystemsPanda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl,¨ and me. Rutger Hofman implemented Panda 4.0

170 Parallel-ProgrammingSystems

Fig. 7.18. Write miss latency (in microseconds) for CRL with 0, 1, 4, and 8 readers, and for LFC, as a function of region size (in bytes).

Protocol-message handlers, in contrast, cannot always be invoked immediately; in that case the protocol message is copied and queued. (This copy is unnecessary, but protocol messages consist of only a few words.) Since no process can initiate more than one coherence transaction, the amount of buffering required to store blocked protocol handlers is bounded.

7.3.3 Performance

CRL programs send many control messages and are therefore sensitive to send and receive overhead. Moreover, several CRL applications use small regions, which yields small data messages. Since many CRL actions use a request-reply style of communication, communication latencies also have an impact on performance.

Figures 7.18 and 7.19 show the performance of clean write misses for various numbers of readers. The home node H sends an invalidation to each reader and awaits all readers' acknowledgements before sending the data to requesting process P (see Figure 7.17(c)). The LFC curve shows the performance of a raw LFC test program that sends request and reply messages that have the same size as the messages sent by CRL in the case of zero readers (i.e., when no invalidations are needed).

For small regions, CRL adds approximately 5 µs (22%) to LFC's roundtrip latency (see Figure 7.18). CRL's throughput curves are close to the throughput curve of the LFC benchmark.

Page 193: Communication Architectures for Parallel-ProgrammingSystemsPanda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl,¨ and me. Rutger Hofman implemented Panda 4.0

7.4MPI — TheMessagePassingInterface 171

Fig. 7.19. Write miss throughput (in Mbyte/second) for CRL with 0, 1, 4, and 8 readers, and for LFC, as a function of region size (in bytes).

7.4 MPI — The MessagePassingInterface

The Message Passing Interface (MPI) is a standard library interface that defines a large variety of message-passing primitives [53]. MPI is likely the most frequently used implementation of the message-passing programming model.

Several MPI implementations exist. Major vendors of parallel computers (e.g., IBM and Silicon Graphics) have built implementations that are optimized for their hardware and software. A popular alternative to these vendor-specific implementations is MPI Chameleon (MPICH) [61]. MPICH is free and widely used in workstation environments. We implemented MPI on LFC by porting MPICH to Panda. The following subsections describe this MPI/Panda/LFC implementation and its performance.

7.4.1 Implementation

MPICH consists of three layers. At the bottom, platform-specific device channels implement reliable, low-level send and receive functions. The middle layer, the application device interface (ADI), implements various send and receive flavors (e.g., rendezvous) using downcalls to the current device channel. The top layer implements the MPI interface. The ADI and the device channels have fixed interfaces. We ported MPICH version 1.1 and ADI version 2.0 to Panda by implementing a new device channel. This Panda device channel implements the basic send and receive functions using Panda's message-passing module.

Since MPICH is not a multithreaded system, the Panda device channel uses a stripped-down version of Panda that does not support multithreading and that receives all messages through polling. We refer to this simpler Panda version as Panda-ST (Panda Single-Threaded). The multithreaded version of Panda described in Chapter 6 is referred to as Panda-MT (Panda Multi-Threaded).

By assuming that the Panda client is single-threaded, Panda-ST eliminates all locking inside Panda. Panda-ST also disables LFC's network interrupts. MPICH does not need interrupts, because all communication is directly tied to send and receive actions in user processes. It suffices to poll when a receive statement is executed (either by the user program or as part of the implementation of a complex function). Message handlers do not execute, not even logically, in the context of a dedicated upcall thread. Instead, they execute in the context of the application's main thread, both physically (on the same stack) and logically.

MPICH provides its own spanning-tree broadcast implementation. Unlike LFC's multicast implementation, however, MPICH forwards messages rather than packets. This way, the multicast primitive can be implemented easily in terms of MPI's unicast primitives, which also operate on messages. An obvious disadvantage of this implementation choice is the loss of pipelining at processors that are internal nodes of a multicast tree and that have to forward messages. Given a large multicast message, such a processor will not start forwarding before it has received the entire message. If done on a per-packet basis, as in LFC, forwarding can begin as soon as the first packet has arrived at the forwarding processor's NI. We experimented both with MPICH's default broadcast implementation and with an implementation that uses Panda's unordered broadcast on top of LFC's broadcast. The performance of these two broadcast implementations, MPI-default and MPI-LFC, is discussed in the next section.
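A simple back-of-the-envelope model (an illustration only, not taken from the measurements; it ignores per-packet software overheads and assumes a fixed per-hop, per-packet transfer time T) shows why message-level forwarding hurts large broadcasts:

% Assumed model parameters: a broadcast message of k packets, a multicast
% tree of depth d, and a per-hop, per-packet transfer time T.
\begin{align*}
  T_{\mathrm{message\text{-}level}} &\approx d \, k \, T
      && \text{(store-and-forward: each node waits for the whole message)} \\
  T_{\mathrm{packet\text{-}level}}  &\approx (d + k - 1) \, T
      && \text{(pipelined, as in LFC: packets stream through the tree)}
\end{align*}
% For large k the ratio approaches d, so message-level forwarding loses
% roughly a factor equal to the tree depth.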

7.4.2 Performance

Below, we discuss the unicast and broadcast performance of MPICH on Panda and LFC.

Unicast Performance

Figure 7.20 and Figure 7.21 show MPICH's one-way latency and throughput, respectively. Both tests were performed using MPI's MPI_Send() and MPI_Recv() functions. For comparison, the figures also show the results of the corresponding tests at the Panda and LFC level.

Figure 7.20 shows MPI's unicast latency and, for comparison, the unicast latencies of Panda and LFC. MPICH is a heavyweight system and adds 6 µs (42%) to Panda's 0-byte one-way latency.

Except for small messages, MPICH attains the same throughput as Panda and LFC.



[Figure: one-way latency (microseconds) versus message size (0-1000 bytes) for MPI, Panda (no threads), and LFC.]

Fig. 7.20. MPI unicast latency.

The reason is that MPICH makes no extra passes over the data that is transferred. MPICH simply constructs an I/O vector and hands it to Panda, which then fragments the message. As a result, the MPICH layer does not incur any per-byte costs and throughput is good.

Broadcast Performance

Figure 7.22 shows multicast latency on 64 processors for LFC, Panda, and three MPI implementations. Both MPI-default results were obtained using MPICH's default broadcast implementation, which forwards broadcast messages as a whole rather than packet by packet. MPI-default normally uses binomial broadcast trees. Since LFC uses binary trees, we added an option to MPICH that allows MPI-default to use both binomial and binary trees. The MPI-LFC results were obtained using LFC's broadcast primitive (i.e., with binary trees).

As in the unicast case, we find that MPICH adds considerable overhead to Panda. Binary and binomial trees yield no large differences in MPI-default's broadcast performance, because 64-node binary and binomial trees have the same depth. There is a large difference between the MPI-default versions and MPI-LFC. With binary trees, MPI-default adds 67 µs (77%) to MPI-LFC's latency for a 64-byte message; with binomial trees, the overhead is 59 µs (68%). In this latency test it is not possible to pipeline multipacket messages. The cause of the performance difference is that MPI-default adds at least two extra packet copies to the critical path at each internal node of the multicast tree: one copy from the network interface to the host and one copy per child from the host to the network interface.



[Figure: one-way throughput (Mbyte/second) versus message size (64 bytes to 1 Mbyte) for LFC, Panda (no threads), and MPI.]

Fig. 7.21. MPI unicast throughput.

[Figure: broadcast latency (microseconds) on 64 processors versus message size (0-1000 bytes) for MPI-default (binary), MPI-default (binomial), MPI-LFC, Panda (unordered, no threads), and LFC.]

Fig. 7.22. MPI broadcast latency.



[Figure: broadcast throughput (Mbyte/second) on 64 processors versus message size for LFC, Panda (unordered, no threads), MPI-LFC, MPI-default (binary), and MPI-default (binomial).]

Fig. 7.23. MPI broadcast throughput.

Figure 7.23 compares the throughput of the same configurations. The difference between the MPI-default curves is caused by the difference in tree shape: binomial trees have a higher fanout than binary trees, which reduces throughput. Both MPI-default versions perform worse than MPI-LFC, not so much due to extra data copying, but because MPI-default does not forward multipacket multicast messages in a pipelined manner. In addition, host-level, message-based forwarding has a larger memory footprint than packet-based forwarding. When the message no longer fits into the L2 cache, MPI-default's throughput drops noticeably (from 20 to 9 Mbyte/s for binary trees and from 10 to 5 Mbyte/s for binomial trees). The problem is that the use of MPI-default results in one copying pass over the entire message for each child that a processor has to forward the message to. If the message does not fit in the cache, then these copying passes thrash the L2 cache.

7.5 Related Work

We consider related work in two areas: operation transfer and operation execution.

7.5.1 Operation Transfer

Orca's operation-specific marshaling routines eliminate interpretation overhead by exploiting compile-time knowledge. In addition, they do not copy to intermediate RTS buffers, but generate an I/O vector which is passed to Panda's send routines.



The same approach is taken in Manta [99] (see Section 7.2), which generates specialized marshaling code for Java methods.

In their implementation of Optimistic Orca [66] by means of Optimistic Active Messages (OAMs), Hsieh et al. use compiler support to optimize the transfer of simple operations of simple object types (see also Section 6.6.3). For such operations, Optimistic Orca avoids the generic marshaling path and copies arguments directly into an (optimistic) active message. At the receiving side, these optimistic active messages are dispatched in a special way (see Section 7.5.2). We optimize marshaling for all operations, using a generic compilation strategy. To attain the performance of Optimistic Orca, however, further optimization would be necessary. For example, we would have to eliminate Panda's I/O vectors (which are expensive for simple operations).

Muller et al. optimized marshaling in Sun's RPC protocol by means of a program specialization tool, Tempo, for the C programming language [109]. Given the stubs generated by Sun's nonspecializing stub compiler, this tool automatically generates specialized marshaling code by removing many tests and indirections. We use the Orca compiler instead of a general-purpose specialization tool to generate operation-specific code.

As described in Section 7.1.3, there are redundant data transfer operations on Orca's operation transfer path. These transfers can be removed by using DMA instead of PIO at the sending side, and by removing the receiver-side copy (from LFC packets to Orca data structures). Chang and von Eicken describe a zero-copy RPC architecture for Java, J-RPC, that removes these two data transfers [35]. J-RPC is a connection-oriented system in which receivers associate pinned pages with individual (per-sender) endpoints. Once a sender has filled these pages, the receiver unpins them and allocates new pages. The address of the new receive area is piggybacked on RPC replies.

J-RPC suffers from two problems. First, as noted by its designers, allocating pinned memory for each endpoint does not scale well to large numbers of processors. Second, J-RPC relies on piggybacking to return information about receive areas to a sender. This works well for RPC, but not for one-way multicasting.

7.5.2 Operation Execution

Whereas our approach has been to avoid expensive thread switches by hand, OAMs [66, 150], Lazy Task Creation [105], and Lazy Threads [58] all rely on compiler support. OAMs transform an AM handler that runs into a locked mutex into a true thread. The overhead of creating a thread is paid only when the lock operation fails. Hsieh et al. used OAMs to improve the performance of one of the first Panda-based Orca implementations. Like RTS-threads, this implementation used popup threads.



On the CM-5, the implementation based on OAMs reduced the latency of Orca object invocation by an order of magnitude by avoiding thread switching and by avoiding the CM-5's bulk transfer mechanism for small messages. In contrast with our approach, OAMs use compiler support and require that all locks be known in advance.

Draves and Bershad used continuations inside an operating system kernel to speed up thread switching [46]. Instead of blocking a kernel thread, a continuation is created and the same stack is used to run the next thread. We use the same technique in user space for the same reasons: to reduce thread switching overhead and memory consumption. Orca's continuation-based RTS uses continuations in the context of upcalls and accesses them through a condition-variable-like interface.

Ruhl and Bal use continuations in a collective-communication module implemented on top of Panda [123]. Several collective-communication operations combine results from different processors using a reverse spanning tree. Rather than assigning to one particular thread (e.g., the local computation thread) the task of waiting for, combining, and propagating data, the implementation stores intermediate results in continuations. Subresults are created either by the local computation thread or by an upcall thread. The first thread that creates a subresult creates a continuation and stores its subresult in this continuation. When another thread later delivers a new subresult, it merges its subresult with the existing subresult by invoking the continuation's resume function. This avoids thread switches to a dedicated thread.
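The sketch below illustrates the idea in C with hypothetical names (it is not the Panda code): whichever thread contributes first allocates the continuation; later contributions are merged in place by calling the resume function, so no dedicated combining thread and no thread switch are needed.

/* Minimal sketch (hypothetical names, not the Panda implementation):
 * continuation-based combining of subresults in a reverse spanning tree. */
#include <stdlib.h>

typedef struct cont {
    int    expected;                  /* contributions still missing          */
    double value;                     /* combined subresult so far            */
    void (*resume)(struct cont *c, double sub);   /* merge function           */
} cont_t;

static void merge_sum(cont_t *c, double sub)
{
    c->value += sub;                  /* combine (here: a simple sum)         */
    c->expected--;
    /* When expected reaches 0, the caller forwards c->value up the reverse
     * spanning tree; no thread switch is needed. */
}

/* Called by the local computation thread or by an upcall handler. */
cont_t *deliver_subresult(cont_t *c, double sub, int contributors)
{
    if (c == NULL) {                  /* first contributor creates the continuation */
        c = malloc(sizeof *c);
        c->expected = contributors - 1;
        c->value    = sub;
        c->resume   = merge_sum;
    } else {
        c->resume(c, sub);            /* later contributors merge in place    */
    }
    return c;
}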

7.6 Summary

This chapter has shown that the communication mechanisms developed in the previous chapters can be applied effectively to a variety of PPSs.

We described portable techniques to transfer and execute Orca operations efficiently. Operation transfer is optimized by letting the compiler generate operation-specific marshaling code. Also, the Orca RTS does not copy data to intermediate buffers: data is copied directly between LFC packets and Orca data structures. Operation execution is optimized by representing blocked invocations compactly by continuations instead of blocked threads. The use of continuations allows operations to be executed directly by Panda upcalls, saves memory, and avoids thread switching during the re-evaluation of the guards of blocked operations.

The Orca RTS exploits all communication mechanisms of Panda and LFC. Panda's transparent switching between polling and interrupts and LFC's polling watchdog work well for messages that carry operation requests or replies. An interrupt is generated when the receiver of a request is engaged in a long-running computation and does not respond to the request. RPC replies are usually received through polling.



Panda's totally-ordered broadcast is used to update replicated objects. This broadcast is implemented efficiently using LFC's NI-supported fetch-and-add and multicast primitives. LFC's NI-level fetch-and-add does not generate interrupts on the sequencer node. LFC's multicast may generate interrupts on the receiving nodes, but these interrupts are not on the multicast-packet forwarding path.

Panda's stream messages allow the RTS to marshal and unmarshal data without making unnecessary copies. Finally, Panda's upcall model works well for Orca, even though blocking upcalls occur frequently in Orca programs.

Manta has a programming model similar to Orca's (shared objects) and also uses some of Orca's implementation techniques, in particular function shipping and compiler-generated, method-specific marshaling routines. The communication requirements of Manta and Orca are different, though. First, Manta does not replicate objects and therefore does not need multicast support. Second, Manta uses popup threads to process incoming operation requests, because Java methods can create new state before they block, which makes it difficult to represent a blocked invocation by a continuation (as in Orca). In the current implementation, the use of popup threads adds a thread switch to the execution path of all operations on remote objects.

In terms of its communication requirements, CRL is the simplest of the PPSs discussed in this chapter. CRL runs directly on top of LFC and benefits from LFC's efficient packet interface and its polling watchdog.

The implementation of MPI uses Panda's stream messages and Panda's broadcast. As in Orca and Manta, Panda's stream messages allow data to be copied directly between LFC packets and application data structures. Panda's broadcast is much more efficient than MPICH's default broadcast. MPICH performs all forwarding on the host, rather than on the NI. What is worse, however, is that MPICH forwards message-by-message rather than packet-by-packet, and therefore incurs the full message latency at each hop in its multicast tree.

All PPSs have the same data transfer behavior. At the sending side, each client copies data from client-level objects into LFC send packets in NI memory. At the receiving side, each client copies data from LFC receive packets in host memory to client-level objects.

Table 7.2 summarizes the minimum latency and the maximum throughput of characteristic PPS-level operations on PPS-level data. The table also shows the cost of the communication pattern induced by these operations, at all software levels below the PPS level. The performance results for different levels are separated by slashes; differences in these results are caused by layer-specific overheads. Recall that Panda can be configured with (Panda-MT) or without (Panda-ST) threads.

For CRL, Table 7.2 shows the cost of a clean write miss. The communication pattern consists of a small, fixed-size control message to the home node, which replies with a data message containing a region.



                          Latency (µs)               Throughput (Mbyte/s)
PPS                       Unicast      Broadcast     Unicast     Broadcast
LFC/CRL                   23/28        —/—           60/56       —/—
LFC/Panda-ST/MPI          12/15/21     72/78/87      63/61/60    28/27/27
LFC/Panda-MT/Orca         24/31/46     72/103/121    58/56/52    28/28/27
LFC/Panda-MT/Manta        24/31/76     —/—/—         58/56/35    —/—/—

Table 7.2. PPS performance summary. All broadcast measurements were taken on 64 processors.

For MPI, the table shows the cost of a one-way message and the cost of a broadcast. For Orca, it shows the cost of an operation on a remote, nonreplicated object (which results in an RPC) and the cost of an operation on a replicated object (which results in an F&A operation followed by a broadcast). Finally, for Manta, the table shows the cost of a remote method invocation on a remote object.

The results in Table 7.2 show that our PPSs add significant overhead to an LFC-level implementation of the same communication pattern. These overheads stem from several sources.

1. Demultiplexing. All clients perform one or more demultiplexing operations. In Orca, for example, an operation request carries an object identifier and an operation identifier. In CRL, each message carries the address of a handler function and the name of a shared region. In MPI, each message carries a user-specified tag. MPI and Orca are both layered on top of Panda, which performs additional demultiplexing.

2. Fragmentation and reassembly. Both Panda and CRL implement fragmentation and reassembly. This requires extra header fields and extra protocol state.

3. Locking. Orca has a multithreaded runtime system that uses a single lock to achieve mutual exclusion. To avoid recursion deadlocks, this lock is released when the Orca runtime system invokes another software module (e.g., Panda) and acquired again when the invocation returns. Panda-MT also uses locks to protect global data structures.

4. Procedure calls. Although all PPSs and Panda use inlining inside software modules and export inlined versions of simple functions, many procedure calls remain. This is due to the layered structure of the PPSs. Layering is used to manage the complexity and to enhance the portability of these systems: most systems are fairly large and have been ported to multiple platforms.



The overheads added by the various software layers running on top of LFC are large compared to, for example, the roundtrip latency at the LFC level (e.g., an empty-array roundtrip for Manta adds 217% to an LFC roundtrip). Consequently, we expect that these overheads will dampen the application-level performance differences between different LFC implementations. Application performance is studied in the next chapter.


Chapter 8

Multilevel Performance Evaluation

The preceding chapters have shown that the LFC implementation described in Chapter 3 and Chapter 4 provides effective and efficient support for several PPSs. This implementation, however, is but one point in the design space outlined in Chapter 2. This chapter discusses alternative designs of two key components of LFC: reliable point-to-point communication by means of NI-level flow control and reliable multicast communication by means of NI-level forwarding. Both services are critically important for PPSs and their applications. Existing and proposed communication architectures take widely different approaches to these two issues (see Chapter 2), yet there have been few attempts to compare these architectures in a systematic way.

A key idea in this chapter is to use multiple implementations of LFC's point-to-point, multicast, and broadcast primitives. This allows us to evaluate the decisions made in the original design and implementation. We focus on assumptions 1 and 3 stated in Chapter 4: reliable network hardware and the presence of an intelligent NI. The alternative implementations all relax one or both of these assumptions. The use of a single programming interface is crucial; existing studies compare communication architectures with different functionality and different programming interfaces [5], which makes it difficult to isolate the effects of particular design decisions.

We compare the performance of five implementations at multiple levels. First, we perform direct comparisons between the systems by means of microbenchmarks that run directly on top of LFC's programming interface. Second, we compare the performance of parallel applications by running them on all five LFC implementations. Each of these applications uses one of four different PPSs: Orca, CRL, MPI, and Multigame. Orca, CRL, and MPI were described in the previous chapter; Multigame will be introduced later in this chapter. Our performance comparison thus involves three levels of software: the different LFC implementations, the PPSs, and the applications.




A key contribution of this chapter is that it ties together low-level design issues, the communication style induced by particular PPSs, and characteristics of parallel applications.

This chapter is organized as follows. Section 8.1 gives an overview of all LFC implementations. Section 8.2 and Section 8.3 present, respectively, the point-to-point and multicast parts of the implementations and compare the different implementations using microbenchmarks. Section 8.4 describes the communication style and performance of the PPSs used by our application suite. Section 8.5 discusses application performance. Section 8.6 classifies applications according to their performance on different LFC implementations. Section 8.7 studies related work.

8.1 Implementation Overview

LFC's programming interface and one implementation were described in Chapter 3 and Chapter 4. This section gives an overview of all implementations and describes those implementation components that are shared by all implementations.

8.1.1 The Implementations

We developed three point-to-point reliability schemes and two multicast schemes. Each LFC implementation consists of a combination of one reliability scheme and one multicast scheme. One of these LFC implementations corresponds to the system described in Chapters 3 and 4; we refer to this system as the default LFC implementation.

The point-to-point schemes are no retransmission (Nrx), host-level retransmission (Hrx), and NI-level retransmission (Irx). Some systems, including the NI-level protocols described in Chapter 4, assume that the network hardware and software behave and are controlled as in an MPP environment. These systems exploit the high reliability of many modern networks and use nonretransmitting communication protocols [26, 32, 106, 113, 141, 154]. Such protocols are called careful protocols [106], because they may never drop packets, which requires careful resource management. The no-retransmission (Nrx) scheme represents these protocols; it assumes that the network hardware never drops or corrupts network packets and will fail if this assumption is violated.

Systems intended to operate outside an MPP environment usually implement retransmission. Retransmission used to be implemented on the host, but several research systems (e.g., VMMC-2 [49]) implement retransmission on a programmable NI. Our retransmitting schemes, Hrx and Irx, represent these two types of systems.



                                          Forwarding
Retransmission                Host forwards (Hmc)    Interface forwards (Imc)
No retransmission (Nrx)       NrxHmc                 NrxImc (default)
Interface retransmits (Irx)   IrxHmc                 IrxImc
Host retransmits (Hrx)        HrxHmc                 Not implemented

Table 8.1. Five versions of LFC's reliability and multicast protocols.

The multicast schemes are host-level multicast (Hmc) and NI-level multicast (Imc). On scalable, switched networks, multicasting is usually implemented by means of spanning-tree protocols that forward multicast data. Most communication systems implement these forwarding protocols on top of point-to-point primitives. Other systems use the NI to forward multicast traffic, which results in fewer data transfers and interrupts on the critical path from sender to receivers [16, 56, 68, 146]. Hmc and Imc represent both types of systems.

The point-to-pointreliability andthe multicastschemescanbe combinedinsix differentways(seeTable8.1).We implementedfiveoutof thesesix combina-tions. Thecombinationof host-level retransmissionandinterface-level multicastforwardinghasnotbeenimplementedfor reasonsdescribedin Section8.3.1.Theimplementationdescribedin Chapter3 andChapter4 is essentiallyNrxImc. Thereare,however, somesmalldifferencesbetweenthe implementationdescribedear-lier and the one describedhere. (NrxImc recycles senddescriptorsin a slightlydifferentway thanthedefault LFC implementation.)

Table 8.2 summarizes the high-level differences between the five LFC implementations. We have already discussed reliability and multicast forwarding. All implementations use the same polling-watchdog implementation. The retransmitting implementations use retransmission and acknowledgement timers. IrxImc and IrxHmc use Myrinet's on-board timer to implement these timers. HrxHmc uses both Myrinet's on-board timer and the Pentium Pro's timestamp counter. All protocols except HrxHmc implement LFC's fetch-and-add operation on the NI. Since HrxHmc represents conservative protocols that make little use of the programmable NI, it stores fetch-and-add variables in host memory and handles fetch-and-add messages on the host.

8.1.2 Commonalities

The LFC implementations share much of their code. All implementations transmit data in variable-length packets. Hosts and NIs store packets in packet buffers, which all have the same maximum packet size. Each NI has a send buffer pool and a receive buffer pool; hosts only have a receive buffer pool.

We distinguish four packet types: unicast, multicast, acknowledgement, and synchronization.



Function             NrxImc      NrxHmc      IrxImc      IrxHmc      HrxHmc
Reliability          NI          NI          NI          NI          Host
Mcast forwarding     NI          Host        NI          Host        Host
Polling watchdog     NI + host   NI + host   NI + host   NI + host   NI + host
Fine-grain timer     —           —           NI          NI          NI + host
Fetch-and-add        NI          NI          NI          NI          Host

Table 8.2. Division of work in the LFC implementations.

Unicast and multicast packets contain client data. Acknowledgements are used to implement reliability. Synchronization packets carry fetch-and-add (F&A) requests and replies. An F&A operation is implemented as an RPC to the node holding the F&A variable. Depending on the implementation, this variable is stored either in host or NI memory.
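As an illustration only (hypothetical structures and function names, not the LFC source), an NI-resident fetch-and-add handler is little more than a read-modify-write followed by an RPC-style reply:

/* Minimal sketch (hypothetical names): NI-level handling of a
 * fetch-and-add request.  The NI owns the F&A variable, so no host
 * interrupt is needed on the node that stores it. */
#include <stdint.h>

typedef struct fa_request {
    uint32_t src_node;     /* node that issued the F&A              */
    uint32_t var_index;    /* which F&A variable to update          */
    int32_t  addend;       /* value to add                          */
} fa_request_t;

extern int32_t fa_vars[];                                  /* assumed: variables in NI memory */
extern void    send_fa_reply(uint32_t dest, int32_t old);  /* assumed reply routine           */

void handle_fa_request(const fa_request_t *req)
{
    int32_t old = fa_vars[req->var_index];        /* fetch            */
    fa_vars[req->var_index] = old + req->addend;  /* and add          */
    send_fa_reply(req->src_node, old);            /* reply to sender  */
}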

All implementations organize multicast destinations in a binary tree with the sender as its root and forward multicast packets along this tree, in parallel. One of the problems in implementing a multicast forwarding scheme is the potential for buffer deadlock. Several strategies can be used to deal with this problem: reservation in advance [56, 146], deadlock-free routing, and deadlock recovery [16].

To send a packet, LFC's send routines store one or more transmission requests in send descriptors in NI memory (using PIO). Each descriptor identifies a send packet, the packet's destination, the packet's size, and protocol-specific information such as a sequence number. Multiple descriptors can refer to the same packet, so that the same data can be transmitted to multiple destinations.
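A minimal sketch of such a descriptor and of posting it with programmed I/O follows (the layout and names are assumptions for illustration, not LFC's actual data structures):

/* Minimal sketch (hypothetical layout): a send descriptor written into
 * NI memory with programmed I/O.  Several descriptors may name the same
 * packet buffer, so one payload can go to several destinations without
 * extra copies. */
#include <stdint.h>

typedef struct send_desc {
    uint16_t packet_index;   /* which NI send buffer holds the data     */
    uint16_t dest;           /* destination node                        */
    uint16_t size;           /* payload size in bytes                   */
    uint16_t seqno;          /* protocol-specific info (sequence number) */
} send_desc_t;

/* Post a transmission request by writing the descriptor into NI memory;
 * nic_desc_queue would point into the (write-combined) NI memory region. */
void post_send(volatile send_desc_t *nic_desc_queue, unsigned slot,
               uint16_t packet_index, uint16_t dest,
               uint16_t size, uint16_t seqno)
{
    nic_desc_queue[slot].packet_index = packet_index;
    nic_desc_queue[slot].dest         = dest;
    nic_desc_queue[slot].size         = size;
    nic_desc_queue[slot].seqno        = seqno;    /* NI polls this queue */
}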

LFC clients use PIO to copy data that is to be transmitted into LFC send packets (which are stored in NI memory). To speed up these data copies, all implementations mark NI memory as a write-combined memory region (see Section 3.3).

Each host maintains a pool of free receive buffers. The addresses of free host buffers are passed to the NI control program through a shared queue in NI memory. When the NI has received a packet in one of its receive buffers, it copies the packet to a free host buffer (using DMA). The host can poll a flag in this buffer to test whether the packet has been filled.

The NI control program's event loop processes the following events (a schematic sketch of such a loop follows the list):

1. Host transmission request. The host enqueued a packet for transmission. The NI tries to move the packet to its packet transmission queue. In some protocols, however, the packet may have to wait until an NI-level send window opens up. If the window is closed, the packet is stored on a per-destination blocked-sends queue. Incoming acknowledgements will later open the window and move the packet from this queue to the packet transmission queue.



Parameter   Meaning                                                Unit
P           #processors that participate in the application        Processors
PKTSZ       Maximum payload of a packet                            Bytes
ISB         NI send pool size                                      Packet buffers
IRB         NI receive pool size                                   Packet buffers
HRB         Host receive pool size                                 Packet buffers
W           Maximum send window size                               Packets
INTD        Polling watchdog's network interrupt delay             µs
TGRAN       Timer granularity                                      µs

Table 8.3. Parameters used by the protocols.

2. Packet transmitted. The NI hardware completed the transmission of a packet. If the packet transmission queue is not empty, the NI starts a network DMA to transmit the next packet. For each outgoing packet, Myrinet computes (in hardware) a CRC checksum and appends it to the packet. The receive hardware recomputes and verifies the checksum and appends the result (checksum verification succeeded or failed) to the packet. This result is checked in software.

3. Packet received. The NI hardware received a packet from the network. The NI checks the checksum of the packet just received. If the checksum fails, the NI drops the packet or signals an error to the host. (The retransmitting protocols drop the packet, forcing the sender to retransmit. The nonretransmitting protocols signal an error.) Otherwise, if the packet is a unicast or multicast packet, then the NI enqueues it on its NI-to-host DMA queue. Some protocols also enqueue multicast packets for forwarding on the packet transmission queue or, if necessary, on a blocked-sends queue. Acknowledgement and synchronization packets are handled in a protocol-specific way.

4. NI-to-host DMA completion. If the NI-to-host DMA queue is not empty, then a DMA is started for the next packet.

5. Timeout. All implementations use timeouts to implement the NI-level polling watchdog described in Section 3.7. Some implementations also use timeouts to trigger retransmissions or acknowledgements.
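The following fragment is a schematic rendering of such an event loop (the event-test and action functions are assumed names; this is not the actual LFC control program), showing how the NI multiplexes the five event types:

/* Minimal sketch (hypothetical names): the structure of an NI control
 * program that multiplexes the five event types listed above. */
extern int  host_posted_send(void);      /* assumed event tests ...       */
extern int  send_dma_done(void);
extern int  packet_arrived(void);
extern int  host_dma_done(void);
extern int  timer_expired(void);
extern void enqueue_or_block_send(void); /* ... and assumed actions       */
extern void start_next_send_dma(void);
extern void handle_incoming_packet(void);
extern void start_next_host_dma(void);
extern void handle_timeout(void);

void ni_event_loop(void)
{
    for (;;) {
        if (host_posted_send())          /* 1. host transmission request  */
            enqueue_or_block_send();
        if (send_dma_done())             /* 2. packet transmitted         */
            start_next_send_dma();
        if (packet_arrived())            /* 3. packet received            */
            handle_incoming_packet();
        if (host_dma_done())             /* 4. NI-to-host DMA completion  */
            start_next_host_dma();
        if (timer_expired())             /* 5. timeout                    */
            handle_timeout();
    }
}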

Each implementation is configured using the parameters shown in Table 8.3.



8.2 Reliable Point-to-Point Communication

To compare different point-to-point protocols, we have implemented LFC's reliable point-to-point primitive in three different ways. The retransmitting protocols (Hrx and Irx) assume unreliable network hardware and can recover from lost, corrupted, and dropped packets by means of time-outs, retransmissions, and CRC checks. In Hrx, reliability is implemented by the host processor; in Irx, it is implemented by the NI. As explained above, the three implementations have much in common; below, we discuss how the three implementations differ from one another.

8.2.1 The No-Retransmission Protocol (Nrx)

Nrx assumes reliable network hardware and correct operation of all connected NIs. Nrx does not detect lost packets and treats corrupted packets as a fatal error. Nrx never retransmits any packet and can therefore avoid the overhead of buffering packets for retransmission and timer management.

To avoid buffer overflow, Nrx implements flow control between each pair of NIs (see the UCAST protocol described in Section 4.1). At initialization time, each NI reserves W = ⌊IRB/P⌋ receive buffers for each sender. The number of buffers (W) is the window size for the sliding window protocol that operates between sending and receiving NIs. Each NI transmits a packet only if it knows that the receiving NI has free buffers. If a packet cannot be transmitted due to a closed send window, then the NI queues the packet on a blocked-sends queue for that destination. Blocked packets are dequeued and transmitted during acknowledgement processing (see below).

Each NI receive buffer is released when the packet contained in it has been copied to host memory. The number of newly released buffers is piggybacked on each packet that flows back to the sender. If there is no return traffic, then the receiving NI sends an explicit half-window acknowledgement after releasing W/2 buffers. (That is, the credit refresh threshold discussed in Section 4.1 is set to W/2.)
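A minimal sketch of this credit accounting (hypothetical names; a real implementation would also suppress the explicit acknowledgement when return traffic is imminent) could look as follows:

/* Minimal sketch (hypothetical names): receiver-side credit accounting
 * for Nrx.  Released buffers are reported back to the sender, either
 * piggybacked on return traffic or, once half a window (W/2) has been
 * released, in an explicit acknowledgement. */
enum { MAX_NODES = 64, W = 8 };               /* example configuration     */

static int released[MAX_NODES];               /* credits owed per sender   */

extern void send_explicit_ack(int sender, int credits);   /* assumed       */

/* Called when a receive buffer holding a packet from 'sender' has been
 * copied to host memory and can be reused. */
void buffer_released(int sender)
{
    released[sender]++;
    if (released[sender] >= W / 2) {           /* half-window threshold    */
        send_explicit_ack(sender, released[sender]);
        released[sender] = 0;
    }
}

/* Called when a data packet travels back to 'dest'; the credits ride
 * along for free instead of costing an explicit acknowledgement. */
int piggyback_credits(int dest)
{
    int credits = released[dest];
    released[dest] = 0;
    return credits;                            /* stored in the packet header */
}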

8.2.2 The Retransmitting Protocols (Hrx and Irx)

The two retransmission schemes make different tradeoffs between send and receive overhead on the host processor and complexity of the NI control program. We first describe the protocols in general terms and then explain protocol-specific modifications. In the following, the terms sender and receiver refer to NIs in the case of Irx and to host processors in the case of Hrx.

Each sender maintains a send window to each receiver.



A sender that has filled its send window to a particular destination is not allowed to send to that destination until some of the packets in the window have been acknowledged by the receiver. Each packet carries a sequence number which is used to detect lost, duplicated, and out-of-order packets. After transmitting a packet, the sender starts a retransmission timer for the packet. When this timer expires, all unacknowledged packets are retransmitted. This go-back-N protocol [115, 138] is efficient if packets are rarely dropped or corrupted (as is the case with Myrinet).

Retransmission requires that packets be buffered until an acknowledgement is received. With asynchronous message passing, this normally requires a copy on the sending host. However, since the Myrinet hardware can transmit only packets that reside in NI memory, all protocols must copy data from host memory to NI memory. The retransmitting protocols use the packet in NI memory for retransmission and therefore do not make more memory copies than Nrx. This optimization is hardware-specific; it does not apply to NIs that use FIFOs into the network (e.g., several ATM interfaces).

Receivers accept a packet only if it carries the next-expected sequence number; otherwise they drop the packet. In addition, NIs drop incoming packets that have been corrupted (i.e., packets that yield a CRC error) or that cannot be stored due to a shortage of NI receive buffers.
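A compact go-back-N skeleton (with assumed helper names; the real protocols add credits, piggybacking, and NACKs) looks roughly like this:

/* Minimal go-back-N sketch (hypothetical names): per-destination sender
 * state plus the receiver's in-order acceptance test. */
#include <stdint.h>

enum { W = 8 };                        /* send window size (example)        */

typedef struct sender_state {
    uint32_t next_seq;                 /* next sequence number to use       */
    uint32_t unacked;                  /* oldest unacknowledged sequence    */
} sender_state_t;

extern void transmit(int dest, uint32_t seq);        /* assumed             */
extern void start_retransmit_timer(int dest);        /* assumed             */

int try_send(sender_state_t *s, int dest)
{
    if (s->next_seq - s->unacked >= W)
        return 0;                      /* window full: caller must wait     */
    transmit(dest, s->next_seq++);
    start_retransmit_timer(dest);
    return 1;
}

void handle_ack(sender_state_t *s, uint32_t acked_up_to)
{
    s->unacked = acked_up_to;          /* slide the window                  */
}

void handle_retransmit_timeout(sender_state_t *s, int dest)
{
    for (uint32_t seq = s->unacked; seq != s->next_seq; seq++)
        transmit(dest, seq);           /* go-back-N: resend everything      */
    start_retransmit_timer(dest);
}

/* Receiver side: accept only the next-expected sequence number. */
int accept_packet(uint32_t *expected, uint32_t seq)
{
    if (seq != *expected)
        return 0;                      /* drop; the sender will retransmit  */
    (*expected)++;
    return 1;
}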

Since send packets cannot be reused until they have been acknowledged and since NI memory is relatively small, senders can run out of send packets when acknowledgements do not arrive promptly. To prevent this, receivers acknowledge incoming data packets in three ways. As in Nrx, receivers use piggybacked and half-window acknowledgements. In addition, each receiver sets an acknowledgement timer whenever a packet arrives. When the timer expires and the receiver has not yet sent any acknowledgement, the receiver sends an explicit delayed acknowledgement.

In Irx, receiving NIs keep track of the amount of buffer space available. When a packet must be dropped due to a buffer space shortage, the NI registers this. When more buffer space becomes available, the NI sends a NACK to the sender of the dropped packet. When a NACK is received, the sender retransmits all unacknowledged packets. Without the NACKs, packets would not be retransmitted until the sender's timer expired.

8.2.3 Latency Measurements

Figure 8.1 shows the one-way latency for Nrx, Hrx, and Irx. All messages fit in a single network packet and are received through polling. Nrx outperforms the retransmitting protocols, which must manage timers and which perform more work on outstanding send packets when an acknowledgement is received. All curves have identical slopes, indicating that the protocols have identical per-byte costs.



[Figure: one-way latency (microseconds) versus message size (0-1000 bytes) for Hrx, Irx, and Nrx.]

Fig. 8.1. LFC unicast latency.

Protocol    os     or     L      g      os + or + L
Nrx         1.5    2.3    7.2    6.7    11.0
Irx         1.4    2.3    8.3    9.7    12.0
Hrx         2.4    5.3    6.5    8.3    14.2

Table 8.4. LogP parameter values (in microseconds) for a 16-byte message.

Table 8.4 shows the values of the LogP [40, 41] parameters and the end-to-end latency (os + or + L) of the three protocols. The LogP parameters were defined in Section 5.1.2. Nrx and Irx perform the same work on the host, so they have (almost) identical send and receive overheads (os and or, respectively). Irx, however, runs a retransmission protocol on the NI, which is reflected in the protocol's larger gap (g = 9.7 versus g = 6.7) and latency (L = 8.3 versus L = 7.2). Hrx runs a similar protocol on the host and therefore has larger send and receive overheads than Nrx and Irx.

Table 8.5 shows the F&A latencies for the different protocols.

Again, Nrx is the fastest protocol (18 µs). Hrx is the slowest protocol (31 µs); it is the only protocol in which F&A operations are handled by the host processor of the node that stores the F&A variable. In an application, handling F&A requests on the host processor can result in interrupts. In this benchmark, however, Hrx receives all F&A requests through polling.

8.2.4 Window Size and Receive Buffer Space

To obtain maximum throughput, acknowledgements must flow back to the sender before the send window closes.



Protocol    Latency (µs)
Nrx         18
Irx         25
Hrx         31

Table 8.5. Fetch-and-add latencies.

[Figure: peak throughput (Mbyte/s) versus send-window size (packets) for Nrx, Irx, and Hrx; left panel: 1 Kbyte packets with receiver copy, right panel: 2 Kbyte packets with receiver copy.]

Fig. 8.2. Peak throughput for different window sizes.

To avoid sender stalls, all protocols therefore need sufficiently large send windows and receive buffer pools. Figure 8.2 shows the measured peak throughput for various window sizes and for two packet sizes. To speed up sequence number computations for Irx, we use only power-of-two window sizes. In this benchmark, the receiving process copies data from incoming packets to a receive buffer. This is not strictly necessary, but all LFC clients perform such a copy while processing the data they receive. Throughput without this receiver-side copy is discussed later.

For 1 Kbyte packets, Nrx and Irx need only a small window (W ≥ 4) to attain high throughput. Hrx's throughput still increases noticeably when we grow the window size from W = 4 (45 Mbyte/s) to W = 8 (56 Mbyte/s). A small window is important for Nrx, because each NI allocates W receive buffers per sender. In the retransmitting protocols (Irx and Hrx) NI receive buffers are shared between senders and we need only W receive buffers to allow any single sender to achieve maximum throughput.

Surprisingly, Irx's throughput decreases for W > 8. The reason is that NI-level acknowledgement processing in Irx involves canceling all outstanding retransmission timers. With a larger window, more (virtual) timers must be canceled. (Irx multiplexes multiple virtual timers on a single physical timer; canceling a virtual timer consists of a dequeue operation.)



At some point, the slow NI can no longer hide this work behind the current send DMA transfer. The effect disappears when we switch to 2 Kbyte packets, that is, when the send DMA transfers take more time.
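The virtual-timer bookkeeping behind this effect can be pictured as follows (a sketch with assumed names and a deliberately simple list-based queue, not the Irx code): each outstanding packet gets an entry that is multiplexed onto the NI's single hardware timer, and acknowledgement processing dequeues one entry per acknowledged packet, so the work grows with the window size.

/* Minimal sketch (hypothetical names): multiplexing per-packet virtual
 * retransmission timers onto one physical NI timer.  Cancelling a
 * virtual timer is a dequeue operation; a larger window means more
 * entries to dequeue per acknowledgement. */
#include <stdint.h>

typedef struct vtimer {
    uint32_t       seqno;       /* packet this timer guards                */
    uint32_t       deadline;    /* expiry time in NI clock ticks           */
    struct vtimer *next;
} vtimer_t;

static vtimer_t *timer_queue;                    /* outstanding timers     */

extern void program_hw_timer(uint32_t deadline); /* assumed: arm NI timer  */

void vtimer_start(vtimer_t *t, uint32_t seqno, uint32_t deadline)
{
    t->seqno    = seqno;
    t->deadline = deadline;
    t->next     = timer_queue;                   /* simplified: unsorted push */
    timer_queue = t;
    program_hw_timer(deadline);                  /* (re)arm the single timer  */
}

/* Acknowledgement processing: cancel the timers of all packets up to and
 * including 'acked' -- this is the per-acknowledgement work that grows
 * with the window size W. */
void vtimer_cancel_upto(uint32_t acked)
{
    vtimer_t **p = &timer_queue;
    while (*p != NULL) {
        if ((*p)->seqno <= acked)
            *p = (*p)->next;                     /* dequeue cancelled timer */
        else
            p = &(*p)->next;
    }
}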

Figure 8.3 shows one-way throughput for 1-Kbyte and 2-Kbyte packets, with and without receiver-side copying. For all three protocols, the window size was set to W = 8.

With 1 Kbyte packets and without copying, we see clear differences in throughput between the three protocols. Performing retransmission administration on the host (Hrx) increases the per-packet costs and reduces throughput. When the retransmission work is moved from the host to the NI (Irx), throughput is reduced further. The NI performs essentially the same work as the host in Hrx, but does so more slowly. The LogP measurements in Table 8.4 confirm this: Irx has a larger gap (g) than the other protocols. Increasing the packet size to 2 Kbyte reduces the number of times per-packet costs are incurred and increases the throughput of all protocols, but most noticeably for Irx. For Hrx and Nrx, the improvement is smaller and comes mainly from faster copying at the sending side. Irx, however, is now able to hide more of its per-packet overhead in the latency of larger DMA transfers.

Throughput is reduced when the receiving process copies data from incoming packets to a destination buffer, especially when the message is large and does not fit into the L2 cache. This is important, because all LFC clients do this. Since the maximum memcpy() speed on the Pentium Pro (52 Mbyte/s) is less than the maximum throughput achieved by our protocols (without copying), the copying stage becomes a bottleneck. In addition, memory traffic due to copying interferes with the DMA transfers of incoming packets.

With copying, the throughput differences between the three protocols are fairly small. To a large extent, this is due to the use of the NI memory as 'retransmission memory,' which eliminates a copy at the sending side.

8.2.5 Send Buffer Space

While NI receive buffer space is an issue for Nrx, send buffer space is just as important for Irx and Hrx. First, however, we consider the send buffer space requirements for Nrx. Nrx can reuse send buffers as soon as their contents have been put on the wire, so only a few send buffers are needed to achieve maximum throughput between a single sender-receiver pair. With more receivers it is useful to have a larger send pool, in case some receiver does not consume incoming packets promptly. Transmissions to that receiver will be suspended and extra packets are then needed to allow transmissions to other receivers.

In the retransmitting protocols, send buffers cannot be reused until an acknowledgement confirms the receipt of their contents.



[Figure: one-way throughput (Mbyte/second) versus message size for Nrx, Hrx, and Irx; four panels: 1 Kbyte packets without receiver copy, 1 Kbyte packets with copy, 2 Kbyte packets without copy, and 2 Kbyte packets with copy.]

Fig. 8.3. LFC unicast throughput.



(This is a consequence of using NI memory as retransmission memory.) Since we do not acknowledge each packet individually, a sender that communicates with many receivers may run out of send buffers. One of our applications, a Fast Fourier Transform (FFT) program, illustrates this problem. In FFT, each processor sends a unique message to every other processor. With 64 processors, these messages are of medium size (2 Kbyte) and fill less than half a window. If the number of send buffers is small, a sender may run out of send buffers before any receiver's half-window acknowledgement has been triggered.

If the number of send buffers is large relative to the number of receivers, the sender will trigger half-window acknowledgements before running out of send buffers. In general, we cannot send more than P · (W/2 − 1) packets without triggering at least one half-window acknowledgement, unless packets are lost.
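A worked illustration, using assumed values that match the microbenchmark configuration used elsewhere in this chapter (P = 64, W = 8, 1-Kbyte packets, so a 2-Kbyte FFT message occupies 2 packets):

% Assumed parameters: P = 64, W = 8, 1-Kbyte packets.
\begin{align*}
  \text{packets per destination} &= 2 \;<\; \tfrac{W}{2} = 4
      && \text{(no half-window acknowledgement is ever triggered)} \\
  \text{packets sent in one exchange} &= (P-1)\cdot 2 = 126
      \;<\; P\!\left(\tfrac{W}{2}-1\right) = 192
\end{align*}
% A small send pool can thus be exhausted before any acknowledgement
% returns, which is exactly the FFT problem described above.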

To avoid unnecessary retransmissions when the number of send buffers is small, Irx and Hrx could acknowledge packets individually, but this would lead to increased network occupancy, NI occupancy, and (in the case of Hrx) host occupancy. Instead, Irx and Hrx start an acknowledgement timer for incoming packets. Each receiver maintains one timer per sender. The timer for sender S is started whenever a data packet from S arrives and the timer is not yet running. The timer is canceled when an acknowledgement (possibly piggybacked) travels back to S. If the timer expires, the receiver sends an explicit acknowledgement.

This scheme requires small timeout values for the acknowledgement timer, because a sender needs only a few hundred microseconds to allocate all send buffers. Using OS timer signals to implement the timer on Hrx does not work, because the granularity of OS timers is often too coarse (10 ms is a common value). The Myrinet NI, however, provides access to a clock with a granularity of 0.5 µs and is able to send signals to the user process. We therefore use the NI as an intelligent clock, as follows. We choose a clock period TGRAN and let the NI generate an interrupt every TGRAN µs.

To reduce the number of clock signals, we use two optimizations. First, before generating a clock interrupt, the NI reads a flag in host memory that indicates whether any timers are running. No interrupt is generated when no timers are running. Second, each time the host performs a timer operation it records the current time by reading the Pentium Pro's timestamp counter (see Section 1.6.2). This counter is incremented every clock cycle (5 ns). If necessary, the host invokes timeout handlers. Before generating a clock interrupt, the NI reads the time recorded by the host to decide whether the host recently read its clock. If so, no interrupt is generated.
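Schematically, the NI-side suppression test might look as follows (hypothetical names and fields; it also assumes that both time values have already been converted to a common unit, which the actual implementation must arrange):

/* Minimal sketch (hypothetical names): the NI-side check that suppresses
 * unnecessary clock interrupts.  'shared' lives in host memory and is
 * readable by the NI. */
#include <stdint.h>

struct host_timer_state {
    volatile int      timers_running;   /* any acknowledgement timers active?  */
    volatile uint64_t last_host_check;  /* time recorded at the host's last
                                           timer operation (common unit assumed) */
};

extern uint64_t ni_read_clock(void);           /* assumed: 0.5 µs granularity   */
extern void     raise_host_interrupt(void);    /* assumed                        */

void ni_clock_tick(struct host_timer_state *shared, uint64_t tgran)
{
    /* Optimization 1: no timers running, no interrupt. */
    if (!shared->timers_running)
        return;

    /* Optimization 2: if the host looked at its clock recently, it will
     * notice any timeout itself, so no interrupt is needed. */
    if (ni_read_clock() - shared->last_host_check < tgran)
        return;

    raise_host_interrupt();
}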



8.2.6 Protocol Comparison

An important difference between the retransmitting (Irx and Hrx) and the nonretransmitting (Nrx) protocols is that the retransmitting protocols do not reserve buffer space for individual senders. In Irx and Hrx, NI receive buffers can be shared by all sending NIs, because an NI that runs out of receive buffers can drop incoming packets, knowing that the senders will eventually retransmit. In Irx and Hrx, a single sender can fill all of an NI's receive buffers. Nrx, in contrast, may never drop packets and therefore allocates a fixed number of NI receive buffers to each sender. In most cases this is wasteful, but since the small bandwidth-delay product of Nrx allows small window sizes, this strategy does not pose problems in medium-size clusters. Nevertheless, this static buffer allocation scheme prevents Nrx from scaling to very large clusters, unless NI memory scales as well.

Acknowledgements in Nrx have a different meaning than in Irx and Hrx. Nrx implements flow control (at the NI level), while Irx and Hrx implement only reliability. In Nrx, an acknowledgement consists of a count of buffers released since the last acknowledgement. In Irx and Hrx, an acknowledgement consists of a sequence number. (A count instead of a sequence number fails when an acknowledgement is lost.) The sequence number indicates which packets have arrived at the receiver (host or NI), but does not indicate which packets have been released.

An obvious disadvantage of Nrx is that it cannot recover from lost or corrupted packets. On Myrinet, we have observed both types of errors, caused by both hardware and software bugs. On the positive side, careful protocols are easier to implement than retransmitting protocols. There is no need to buffer packets for retransmission, to maintain sequence numbers, or to deal with timers. For the same reasons, the nonretransmitting LFC implementations achieve better latencies than the retransmitting implementations. With 1 Kbyte packets there is also a throughput advantage, but this is partly obscured by the memory-copy bottleneck.

In spite of these differences, the microbenchmarks indicate that all three protocols can be made to perform well, including protocols that perform most of their work on the NI. The keys to good performance are to use NI buffers for retransmission and to overlap DMA transfers with useful work.

8.3 Reliable Multicast

We consider two multicast schemes, which differ in where they implement the forwarding of multicast packets. In the Hmc protocols, multicast packets are forwarded by the host processor. In the Imc protocols, the NI forwards multicast packets. In Section 8.3.1, we first compare host-level and interface-level forwarding. Section 8.3.2 compares individual implementations in more detail.



[Figure: two diagrams, each showing a host and its network interface, contrasting the two forwarding paths.]

Fig. 8.4. Host-level (left) and interface-level (right) multicast forwarding.

Section 8.3.3 discusses deadlock issues. Finally, Section 8.3.4 analyzes the performance of all multicast implementations.

8.3.1 Host-Level versus Interface-Level Packet Forwarding

Host-level forwarding using a spanning tree is the traditional approach for networks without hardware multicast. The sender is the root of a multicast tree and transmits a packet to each of its children. Each receiving NI passes the packet to its host, which reinjects the packet to forward it to the next level in the multicast tree (see Figure 8.4). All host-level forwarding protocols reinject each multicast packet at most once. Instead of making a separate host-to-NI copy for each forwarding destination, the host creates multiple transmission requests for the same packet.
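In code form, the single-reinjection idea amounts to the following sketch (hypothetical helper functions; not the actual Hmc code): the payload crosses the host-to-NI boundary once, after which one transmission request per child points at the same NI buffer.

/* Minimal sketch (hypothetical names): host-level forwarding of a
 * multicast packet.  The packet is reinjected once; each child gets its
 * own transmission request referring to the same NI packet buffer. */
extern unsigned copy_packet_to_ni(const void *data, unsigned len); /* assumed: returns NI buffer index */
extern void     post_send_request(unsigned ni_buf, int dest);      /* assumed */

void forward_multicast(const void *data, unsigned len,
                       const int *children, int nchildren)
{
    unsigned ni_buf = copy_packet_to_ni(data, len);  /* single host-to-NI copy */
    for (int i = 0; i < nchildren; i++)
        post_send_request(ni_buf, children[i]);      /* one request per child  */
}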

Host-level forwarding has three disadvantages. First, since each reinjected packet was already available in the NI's memory, host-level forwarding results in an unnecessary host-to-NI data transfer at each internal node of the multicast tree, which wastes bus bandwidth and processor time. Second, no forwarding takes place unless the host processes incoming packets. If one host does not poll the network in a timely manner, all its children will be affected. Instead of relying on polling, the NI can raise an interrupt, but interrupts are expensive. Third, the critical sender-receiver path includes the host-NI interactions of all the nodes between the sender and the receiver. For each multicast packet, these interactions consist of copying the packet to the host, host processing, and reinjecting the packet.

The NI-level multicast implementations address all three problems. In the Imc protocols, the host does not reinject multicast packets into the network (which saves a data transfer), forwarding takes place even if the host is not polling, and the host-NI interactions are not on the critical path.

Host-level and NI-level multicast forwarding can be combined with the three point-to-point reliability schemes described in the previous section. We implemented all combinations except the combination of interface-level forwarding with host-level retransmission.



In such a variant we expect the host to maintain the send and receive window administration, as in Hrx. However, if an NI needs to forward an incoming packet then it needs a sequence number for each of the forwarding destinations. To make this scheme work we would have to duplicate sequence number information (in a consistent way).

8.3.2 Acknowledgement Schemes

For reliability, multicast packets must be acknowledged by their receivers, just like unicast packets. To avoid an acknowledgement implosion, most multicast forwarding protocols use a reverse acknowledgement scheme in which acknowledgements flow back along the multicast tree.

Our protocols use two reverse acknowledgement schemes: ACK-forward and ACK-receive. ACK-forward is used by NrxImc (the default implementation), IrxImc, and HrxHmc. In these protocols, acknowledgements are sent at the level (NI or host) at which multicast forwarding takes place. The main characteristic of ACK-forward is that it does not allow a receiver to acknowledge a multicast packet to its parent in the multicast tree until that packet has been forwarded to the receiver's children.

ACK-forward cannot easily be used by NrxHmc or IrxHmc, because in these protocols acknowledgements are sent by the NI and forwarding is performed by the host. NrxHmc and IrxHmc therefore use a simpler scheme, ACK-receive, which acknowledges multicast packets in the same way as unicast packets (i.e., without waiting for forwarding). In the case of NrxHmc, multicast packets can be acknowledged as soon as they have been delivered to the local host. In the case of IrxHmc, multicast packets can be acknowledged as soon as they have been received by the NI.

A potential problem with ACK-receive is that it creates a flow-control loophole that cannot easily be closed by an LFC client. In NrxHmc and IrxHmc, a host receive buffer containing a multicast packet is not released until the packet has been delivered to the client and has been forwarded to all children. Consequently, after processing a packet, the client cannot be sure that the packet is available for reuse, because it may still have to be forwarded. This makes it difficult for the client to implement a fool-proof flow-control scheme (when needed).

8.3.3 Deadlock Issues

With a spanning-tree multicast protocol, it is easy to create a buffer deadlock cycle in which all NIs (or hosts) have filled their receive pools with packets that need to be forwarded to the next NI (or host) in the cycle, which also has a full receive pool. Such deadlocks can be prevented or recovered from in several ways.



[Figure: multicast latency (microseconds) on 64 processors versus message size (0-1000 bytes), 1 Kbyte packets, no receiver copy; curves for IrxHmc, HrxHmc, NrxHmc, IrxImc, and NrxImc.]

Fig. 8.5. LFC multicast latency.

The NI-level flow control protocol of NrxImc (the default implementation), for example, allows deadlock-free multicasting for a certain class of multicast trees (including binary trees).

Our other protocols do not implement true flow control at the multicast forwarding level and may therefore suffer from deadlocks. For certain types of multicast trees, including binary trees, the ACK-forward scheme prevents buffer deadlocks if sufficient buffer space is available at the forwarding level. Buffer space is sufficient if each receiver can store a full send window from each sender (see Appendix A). NrxImc's NI-level flow control scheme satisfies this requirement. Unfortunately, reserving buffer space for each sender destroys one of the advantages of retransmission: better utilization of receive buffers (see Section 8.2.4).

NrxHmc and IrxHmc do not provide any flow control at the forwarding (i.e., host) level and are therefore susceptible to deadlock. With our benchmarks and applications, however, the number of host receive buffers is sufficiently large that deadlocks do not occur.

8.3.4 Latency and Throughput Measurements

Multicast latency for 64 processors is shown in Figure 8.5. We define multicast latency as the time it takes to reach the last receiver. The top three curves in the graph correspond to the Hmc protocols, which perform multicast forwarding on the host processor. These protocols clearly perform worse than those that perform multicast forwarding on the NI. The reason is simple: with host-level forwarding two extra data copies occur on the critical path at each internal node of the multicast tree.



[Figure: multicast throughput (Mbyte/second) on 64 processors versus message size, 1 Kbyte packets, with receiver copy; curves for NrxImc, NrxHmc, HrxHmc, IrxImc, and IrxHmc.]

Fig. 8.6. LFC multicast throughput.

Figure 8.6 shows multicast throughput on 64 processors, using 1 Kbyte packets and copying at the receiver side. Throughput in this graph refers to the sender's outgoing data rate, not to the summed receiver-side data rate. In contrast with the latency graph (Figure 8.5), this graph shows that interface-level forwarding does not always yield the best results: for message sizes up to 128 Kbyte, for example, HrxHmc achieves higher throughput than IrxImc.

The dip in the HrxHmc curve is the result of HrxHmc's higher receive overhead and a cache effect. For larger messages, the receive buffer no longer fits in the L2 cache, and so the copying of incoming packets becomes more expensive, because these copies now generate extra memory traffic. In HrxHmc, these increased copying costs turn the receiving host into the bottleneck: due to its higher receive overhead, HrxHmc does not manage to copy a packet completely before the next packet arrives. In the other protocols, the receive overhead is lower, so there is enough time to copy an incoming packet before the next packet arrives.

8.4 Parallel-Programming Systems

All applications discussed in this chapter run on one of the following PPSs: Orca, CRL, MPI, or Multigame. Orca, CRL, and MPI were discussed in Chapter 7; Multigame is introduced below. This section describes the communication style of all PPSs and discusses PPS-level performance.



8.4.1 Communication Style

Each of our PPSs implements a small number of communication patterns, which are executed in response to actions in the application. For each PPS, we summarize its main communication patterns and discuss the interactions between these patterns and the communication system.

CRL

In CRL, most communication results from read and write misses. The resulting communication patterns were described in Section 7.3.2 (see Figure 7.17).

CRL's invalidation-based coherence protocol does not allow applications to multicast data. (CRL uses LFC's multicast primitive only during initialization and barrier synchronization.) Also, since CRL's single-threaded implementation does not allow a requesting process to continue while a miss is outstanding, applications are sensitive to roundtrip performance. Roundtrip latency is more important than roundtrip throughput, because CRL mainly sends small messages. Control messages are always small and data messages are small because most applications use fairly small regions.

The roundtrip nature of CRL's communication patterns and the small region sizes used in most CRL applications lead to low acknowledgement rates for the Nrx protocols. The roundtrips make piggybacking effective and, due to the small data messages, few half-window acknowledgements are needed. Finally, NrxImc and NrxHmc never send delayed acknowledgements. This is important: for one of our applications (Barnes, see Section 8.5), for example, HrxHmc sends 31 times as many acknowledgements as NrxImc, the default LFC implementation.

MPI

Since MPI is a message-passing system, programmers can express many different communication patterns. MPI's main restriction is that incoming messages cannot be processed asynchronously, which occasionally forces programmers to insert polling statements into their programs. The sensitivity of an MPI program to the parameters of the communication architecture depends largely on the application's communication rate and communication pattern.

All our MPI applications (QR, ASP, and SOR) use MPI's collective-communication operations. Internally, these operations (broadcast and reduce-to-all) all use broadcasting.

All MPI measurements in this chapter were performed using the same MPICH-based MPI implementation as described in Section 7.4. Recall that MPICH provides a default spanning-tree broadcast implementation built on top of unicast primitives.

Fig. 8.7. Application-level impact of efficient broadcast support. Normalized execution time of ASP/MPI and QR/MPI for NrxImc and MPICH's default broadcast.

This default implementation can be replaced with a more efficient 'native' implementation. All MPI measurements in Section 8.5 were performed with a native implementation that uses Panda's broadcast which, in turn, uses one of our implementations of LFC's broadcast. In Section 7.4, we used microbenchmarks to show that this native implementation is more efficient than MPICH's default implementation. Figure 8.7 shows that this difference is also visible at the application level. The figure shows the relative performance of two broadcasting MPI applications, ASP and QR, which we will discuss in more detail in Section 8.5. The MPICH-default measurements were performed using unicast primitives on top of the default LFC implementation (NrxImc). For both ASP and QR, this default broadcast is significantly slower than the implementation that uses the broadcast of NrxImc: ASP runs 25% slower and QR runs 50% slower.

MPICH's default broadcast implementation suffers from two problems. First, MPICH forwards entire messages rather than individual packets. For large messages, this eliminates pipelining and reduces throughput. ASP pipelines multiple broadcast messages, each of which consists of multiple packets. In QR, there is no such pipelining of messages; only the packets within a single message can be pipelined. Second, the default implementation cannot reuse NI packet buffers. At each internal node of the multicast tree, the message will be copied to the network interface once for each child. The LFC implementations reuse NI packet buffers to avoid this repeated copying.

Orca

Orca programs can perform two types of communication patterns: RPC and totally-ordered broadcast. In their communication behavior, Orca programs resemble message-passing programs, except that there is no asynchronous one-way message send primitive. Such a primitive can be simulated by means of multithreading, but none of our applications does this. (Our Orca applications are dominated by multicast traffic.) In Orca, messages carry the parameters and results of user-defined operations, so programmers have control over message size and message rate.

Multigame

Multigame (MG) is a parallel game-playing system developed by Romein [122]. Given the rules of a board game and a board evaluation function, Multigame automatically searches for good moves, using one of several search strategies (e.g., IDA*). During a search, processors push search jobs to each other. A job consists of (recursively) evaluating a board position. To avoid re-searching positions, positions are cached in a distributed hash table. When a process needs to read a table entry, it sends the job to the owner of the table entry [122]. The owner looks up the entry and continues to work on the job until it needs to access a remote table entry.

Multigame's job descriptors are small (32 bytes). To avoid the overhead of sending and receiving many small messages, Multigame aggregates job messages. The communication granularity depends on the maximum number of jobs per message.

Multigame runs on Panda-ST and receives all messages through polling. Almost all communication consists of small to medium-size one-way messages, which are sent to random destinations. The maximum message size depends on the message aggregation limit. The message rate also depends on this limit and, additionally, on the application-specific cost of evaluating a board position. Senders need not wait for replies, so as long as each process has a sufficient amount of work to do, latency (in the LogP sense) is unimportant. Communication overhead is dominated by send and receive overhead.

8.4.2 Performance Issues

Table 7.2 summarizes the minimum latency and the maximum throughput of characteristic operations for Orca, CRL, and MPI. Rules in Multigame specifications are not easily tied to communication patterns. The most important pattern, however, consists of sending a number of jobs in a single message to a remote processor. All communication consists of Panda-ST point-to-point messages.

Table 7.2 shows that our PPSs add significant overhead to an LFC-level implementation of the same communication pattern, so we expect that these overheads will reduce the application-level performance differences between different LFC implementations.

Client-level optimizations form another dampening factor. Two optimizations are worth mentioning: message combining and latency hiding.

As described above, Multigame performs message combining by aggregating search jobs in per-processor buffers. Instead of sending out a search job as soon as it has been generated, the Multigame runtime system stores jobs in an aggregation queue for the destination processor. The contents of this queue are not transmitted until a sufficient number of jobs has been placed into the queue. The difference in packet rate between Puzzle-4 and Puzzle-64 in Figure 8.8 shows that message combining significantly reduces the packet rate for all protocols. Of course, the resulting performance gain (see Figure 8.9) is larger for protocols with higher per-packet costs.
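
The aggregation idea can be pictured with the following sketch. It is a hypothetical illustration of message combining only; the types, limits, and function names (job_t, send_message(), MAX_JOBS_PER_MSG) are invented and do not correspond to Multigame's actual runtime system.

    /* Hypothetical sketch of per-destination job aggregation (message        */
    /* combining). All names are invented for illustration.                   */
    #define MAX_PROCS        64
    #define MAX_JOBS_PER_MSG 64     /* aggregation limit: 4 for Puzzle-4,     */
                                    /* 64 for Puzzle-64                       */
    typedef struct { char state[32]; } job_t;    /* 32-byte job descriptor    */

    typedef struct {
        job_t jobs[MAX_JOBS_PER_MSG];
        int   njobs;
    } agg_queue_t;

    static agg_queue_t queue[MAX_PROCS];         /* one queue per destination */

    extern void send_message(int dest, const void *data, int len);

    void push_job(int dest, const job_t *job)
    {
        agg_queue_t *q = &queue[dest];
        q->jobs[q->njobs++] = *job;
        if (q->njobs == MAX_JOBS_PER_MSG) {      /* limit reached: send once  */
            send_message(dest, q->jobs, q->njobs * (int)sizeof(job_t));
            q->njobs = 0;
        }
    }

With the aggregation limit set to 4 or 64, this corresponds to the Puzzle-4 and Puzzle-64 configurations discussed in Section 8.5.7.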

CRL and Orca implement latency hiding. Both systems perform RPC-style transactions: a processor sends a request to another machine and waits for the reply. Both systems then start polling the network and process incoming packets while waiting for the reply. Consequently, protocols can compensate for high latencies by processing other packets in their idle periods. (This holds only if the increased latency is not the result of increased host processor overhead.)
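
The latency-hiding loop can be sketched as follows; lfc_poll(), send_request(), and the reply bookkeeping are invented names used only to illustrate the idea, not the actual CRL or Orca code.

    /* Hypothetical sketch of RPC-style latency hiding: poll and handle other */
    /* packets while waiting for the reply to an outstanding request.         */
    extern void send_request(int dest, const void *req, int len);
    extern void lfc_poll(void);             /* delivers packets to handlers   */

    static volatile int reply_arrived;      /* set by the reply handler       */
    static void        *reply_buffer;       /* filled by the reply handler    */

    void *do_rpc(int dest, const void *request, int len)
    {
        reply_arrived = 0;
        send_request(dest, request, len);
        while (!reply_arrived)
            lfc_poll();       /* useful work (other packets) during the wait  */
        return reply_buffer;
    }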

8.5 Application Performance

This section analyzes the performance of several applications on all five LFC implementations. Section 8.5.1 summarizes the performance results. Sections 8.5.2 to 8.5.10 discuss and analyze the performance of individual applications.

8.5.1 Performance Results

Table 8.6 lists, for each application, the PPS it runs on, its input parameters, sequential execution time, speedup, and parallel efficiency. (Parallel efficiency is defined as speedup divided by the number of processors: E64 = S64/64; for ASP, for example, E64 = 74.73/64 ≈ 1.17.) The sequential execution time and parallel efficiency were measured using the default implementation (NrxImc). Sequential execution times range from a few seconds to several minutes; speedups on 64 processors range from poor (Radix) to superlinear (ASP). While we use small problems that easily fit into the memories of 64 processors, 7 out of 10 applications achieve an efficiency of at least 50%. Superlinear speedup occurs because we use fixed-size problems and because 64 processors have more cache memory at their disposal than a single processor.

With the exception of Awari and the Puzzle programs, all programs implement well-known parallel algorithms that are frequently used to benchmark PPSs. All CRL applications (Barnes, FFT, and Radix), for example, are adaptations of programs from the SPLASH-2 suite of parallel benchmark programs [153].

Application   PPS   Problem size                   T1      S64    E64
ASP           MPI   1024 nodes                     49.32   74.73  1.17
Awari         Orca  13 stones                      448.90  31.81  0.50
Barnes        CRL   16,384 bodies                  123.24  23.25  0.36
FFT           CRL   1,048,576 complex floats       4.46    49.56  0.77
LEQ           Orca  1000 equations                 610.90  30.12  0.47
Puzzle-4      MG    15-puzzle, 4 jobs/message      281.00  45.32  0.71
Puzzle-64     MG    15-puzzle, 64 jobs/message     281.00  53.63  0.84
QR            MPI   1024 × 1024 doubles            54.16   45.51  0.71
Radix         CRL   3,000,000 ints, radix 128      3.20    10.32  0.16
SOR           MPI   1536 × 1536 doubles            30.91   51.52  0.80

Table 8.6. Application characteristics and timings. T1 is the execution time (in seconds) on one processor; S64 is the speedup on 64 processors; E64 is the efficiency on 64 processors.

We performed the application measurements using the following parameter settings: 64 processors (P = 64), a packet size of 1 Kbyte (PKTSZ = 1), 4096 host receive buffers (HRB = 4096), a send window of 8 packet buffers (W = 8), an interrupt delay of 70 µs (INTD = 70), and a timer granularity of 5000 µs (TGRAN = 5000). For the retransmitting protocols (IrxImc, IrxHmc, and HrxHmc), we use 256 NI send buffers and 384 NI receive buffers (ISB + IRB = 256 + 384 = 640). For the careful protocols (NrxImc and NrxHmc), we use ISB + IRB = 128 + 512 = 640. With these settings, the control program occupies approximately 900 Kbyte of the NI's memory (1 Mbyte). The remaining space is used by the control program's runtime stack.
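
For reference, these settings can be summarized as compile-time constants, as in the sketch below; the macro names simply mirror the parameter names used in the text and are not LFC's actual configuration interface.

    /* Hypothetical summary of the measurement configuration (not LFC code).  */
    #define P      64      /* processors                                      */
    #define PKTSZ  1024    /* packet size: 1 Kbyte                            */
    #define HRB    4096    /* host receive buffers                            */
    #define W      8       /* send window, in packet buffers                  */
    #define INTD   70      /* interrupt delay (microseconds)                  */
    #define TGRAN  5000    /* timer granularity (microseconds)                */
    /* Retransmitting protocols: ISB = 256, IRB = 384 (640 NI buffers total). */
    /* Careful protocols:        ISB = 128, IRB = 512 (640 NI buffers total). */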

Communication statistics for all applications are summarized in Figure 8.8, which gives per-processor data and packet rates, broken down according to packet type. The figure distinguishes unicast data packets, multicast data packets, acknowledgements, and synchronization packets. Data and packet rates refer to incoming traffic, so a broadcast is counted as many times as the number of destinations (63).

The goal of the figure is to show that the applications exhibit diverse communication patterns. Compare, for example, Barnes, Radix, and SOR. Barnes has high packet rates and low data rates (i.e., packets are small). Radix, in contrast, has both high packet and data rates. SOR has a low packet rate and a high data rate (i.e., packets are large). These rates were all measured using HrxHmc; the rates on other implementations are different, of course, but show similar differences between applications.


Fig. 8.8. Data and packet rates of NrxImc on 64 nodes. Per-processor data rates (Mbyte/processor/s) and packet rates (packets/processor/s) for each application, broken down into unicast, multicast, acknowledgement, and synchronization packets.

Figure 8.9 shows application performance of the alternative implementations relative to the performance of the default implementation (NrxImc). None of the alternative implementations is faster than the default implementation. In several cases, however, the alternatives are slower. Below, these slowdowns are discussed in more detail.

8.5.2 All-pairs Shortest Paths (ASP)

ASP solves the All-pairs Shortest Paths (ASP) problem using an iterative algorithm (Floyd-Warshall). ASP finds the shortest path between all nodes in a graph. The graph is represented as a distance matrix which is row-wise distributed across all processors. In each iteration, one processor broadcasts one of its 4 Kbyte rows. The algorithm iterates over all of this processor's rows before switching to another processor's rows. Consequently, the current sender can pipeline the broadcasts of its rows.
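
To make the communication pattern concrete, the following MPI-style C sketch shows the core of a Floyd-Warshall iteration with one row broadcast per step. It is illustrative only: the variable names (dist, rows_per_proc), the block row distribution, and the use of a plain MPI_Bcast are assumptions, not the ASP source used in this thesis.

    /* Simplified sketch of ASP's row-broadcast pattern (not the thesis code). */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void asp_iterations(int n, int rank, int rows_per_proc, int **dist)
    {
        int *rowk = malloc(n * sizeof(int));     /* buffer for broadcast row k */
        for (int k = 0; k < n; k++) {
            int owner = k / rows_per_proc;       /* processor that owns row k  */
            if (rank == owner)
                memcpy(rowk, dist[k % rows_per_proc], n * sizeof(int));
            MPI_Bcast(rowk, n, MPI_INT, owner, MPI_COMM_WORLD);
            for (int i = 0; i < rows_per_proc; i++)     /* update local rows   */
                for (int j = 0; j < n; j++)
                    if (dist[i][k] + rowk[j] < dist[i][j])
                        dist[i][j] = dist[i][k] + rowk[j];
        }
        free(rowk);
    }

Because the same processor owns a block of consecutive rows, it issues rows_per_proc broadcasts in a row, which is what allows the sender-side pipelining discussed above.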

Figure 8.9 shows that both protocols that use interface-level multicast forwarding perform better than the protocols that use host-level forwarding. This is surprising, because Figure 8.6 showed that IrxImc achieves the worst multicast throughput. Indeed, ASP revealed several problems with host-level forwarding that do not show up in microbenchmarks. First, in ASP, it is essential that the sender can pipeline its broadcasts. Receivers, however, are often still working on one iteration when broadcast packets for the next iteration arrive. These packets are not processed until the receivers invoke a receive primitive. (This is a property of our MPI implementation, which uses only polling; see Section 7.4.) Consequently, the sender is stalled, because acknowledgements do not flow back in time. In the interface-level forwarding protocols, acknowledgements are sent by the NI, not by the host processor. As long as the host supplies a sufficiently large number of free host receive buffers, NIs can continue to deliver and forward broadcast packets.

We augmented the Hmc versions of ASP with application-level polling statements (MPI_Probe()), which improved the performance of the host-level protocols. Figure 8.9 shows the improved numbers. Nevertheless, the Hmc protocols do not attain the same performance as the Imc protocols. The remaining difference is due to the processor overhead caused by failed polls and host-level forwarding. The effect of forwarding-related processor overhead is visible in the multicast latency benchmark (see Figure 8.5).

8.5.3 Awari

Awari creates an endgame database for Awari, a two-player board game, by means of parallel retrograde analysis [8]. In contrast with top-down search techniques like α-β search, retrograde analysis proceeds bottom-up by making unmoves, starting with the end positions of Awari.

Fig. 8.9. Normalized application execution times. Slowdown relative to NrxImc of NrxHmc, IrxImc, IrxHmc, and HrxHmc for each application.

The Orca program creates an endgame database DBn, which can be used by a game-playing program. Each entry in the database represents a board position's hash and the board's game-theoretical value. The game-theoretical value represents the outcome of the game in the case that both players make best moves only. DBn contains game-theoretical values for all boards that have at most n stones left on the board. The game-theoretical value is a number between −n and n that indicates the number of pieces that can be won (or lost). We ran Awari with n = 13.

Parallel Awari operates as follows. Each processor stores part of the database. The parents of a board are the boards that can be reached by applying a legal unmove to the board. When a processor updates a board's game-theoretical value, it must also update the board's parents. Since the boards are randomly distributed across all processors, it is likely that a parent is located on another processor. A single update may thus result in several remote update operations that are executed by RPCs.

To avoid excessive communication overhead, remote updates are delayed and stored in a queue for their destination processor. As soon as a reasonable number of updates has accumulated, they are transferred to their destination with a single RPC. The performance of Awari is determined by these RPCs; broadcasts do not play a major role in this application.

The updates are transferred by a single communication thread, which can run in parallel with the computation thread. Application-level statistics, however, indicate that most updates are transmitted when there is no work for the computation thread. As a result, performance is dominated by the speed at which work can be distributed. Queued updates that must be sent to different destination processors cannot be sent out at a rate that is higher than the rate at which the communication thread can perform RPCs.

Although Awari's data rate appears low (see Figure 8.8), communication takes place in specific program phases. In these phases, communication is bursty and most messages are small despite message combining. Performance is therefore dominated by occupancy rather than roundtrip latency. The LogP parameters in Table 8.4 show that Irx has the highest gap (g = 9.7 µs), followed by Hrx (g = 8.3 µs) and then Nrx (g = 6.7 µs). This ranking is reflected in the application's performance.

8.5.4 Barnes-Hut

Barnes simulates a galaxy using the hierarchical Barnes-Hut N-body algorithm. The program organizes all simulated bodies in a shared oct-tree. Each node in this tree and each body (stored in a leaf) is represented as a small CRL region (88–108 bytes). Each processor owns part of the bodies. Most communication takes place during the phase that computes the gravitational forces that bodies exert on one another. In this phase, each processor traverses the tree to compute for each of its bodies the interaction with other bodies or subtrees. Barnes has a relatively high packet rate, but due to the small regions the data rate is low (see Figure 8.8).

Even though Barnes runs on a different PPS, its performance profile in Figure 8.9 is similar to that of Awari. Since CRL makes little use of LFC's multicast, there is no large difference between interface-level and host-level forwarding. Retransmission support, however, leads to decreased performance. This effect is strongest for Irx, which has the highest interpacket gap (g = 9.7 µs) and is therefore more likely to suffer from NI occupancy under high loads. These high loads occur because Barnes has a high packet rate (see Figure 8.8). Due to CRL's roundtrip communication style (see Section 7.3), Barnes is sensitive to this increase in occupancy.

8.5.5 Fast Fourier Transform (FFT)

FFT performs a Fast Fourier transform on an array of complex numbers. These numbers are stored in a matrix which is partitioned into P² square blocks (where P is the number of processors). Each processor owns a vertical stripe of P blocks and each block is stored in a 2 Kbyte CRL region. The main communication phases consist of matrix transposes. During a transpose a personalized all-to-all exchange of all blocks takes place: each processor exchanges each of its blocks (except the block on the diagonal) with a block owned by another processor. Each block transfer is the result of a write miss. The node that generates the write miss sends a small request message to the block's home node, which replies with a data message that contains the block.

As explained in Section 8.2.5, this communication pattern puts pressure on the number of send buffers for the retransmitting implementations. Since we have configured these implementations with a fairly large number of send buffers (ISB = 256), however, this poses no problems.

The performance results in Figure 8.9 show no large differences between different LFC implementations. The retransmitting protocols perform slightly worse than the careful protocols. Occasionally, these protocols send delayed acknowledgements. If we increase the delayed-acknowledgement timeout, no delayed acknowledgements are sent, and the differences become even smaller.

8.5.6 The Linear Equation Solver (LEQ)

LEQ is an iterative solver for linear systems of the form Ax = b. Each iteration refines a candidate solution vector xi into a better solution xi+1. This is repeated until the difference between xi+1 and xi becomes smaller than a specified bound.

In each iteration, each processor produces a part of xi+1, but needs all of xi as its input. Therefore, all processors exchange their 128-byte partial solution vectors at the end of each iteration (6906 in total). Each processor broadcasts its part of xi+1 to all other processors. After this exchange, all processors synchronize to determine whether convergence has occurred.

In LEQ, HrxHmc suffers from its host-level fetch-and-add implementation. Processing F&A requests on the host rather than on the NI increases the response time (see Table 8.5) and the occupancy of the processor that stores the F&A variable (processor 0).

Although LEQ is dominated by broadcast traffic, the cost of executing a retransmission protocol appears to be the main performance factor: increased host and NI occupancy lead to decreased performance. This is due to LEQ's communication behavior. All processes broadcast at the same time, congesting the network and the NIs. In ASP and QR, processes broadcast one at a time.

8.5.7 Puzzle

Puzzle performs a parallel IDA* search to solve the 15-puzzle, a well-known single-player sliding-tile puzzle. We experimented with two versions of Puzzle: in Puzzle-4, Multigame aggregates at most 4 jobs before pushing these jobs to another processor, while in Puzzle-64 up to 64 jobs are accumulated. Both programs solve the same problem, but Puzzle-4 sends many more messages than Puzzle-64.

In Puzzle, all communication is one-way and processes do not wait for incoming messages. As a result, send and receive overhead are more important than NI-level latency and occupancy. HrxHmc has the worst send and receive overheads of our LFC implementations (see Table 8.4) and therefore performs worse than the other implementations. In Puzzle-4, the difference between HrxHmc and the other protocols is larger than in Puzzle-64, because Puzzle-4 needs to send more messages to transfer the same amount of data. Consequently, the higher send and receive overheads are incurred more often during the program's execution.

8.5.8 QR Factorization

QR is a parallel implementation of QR factorization [59] with column distribution. In each iteration, one column, the Householder vector H, is broadcast to all processors, which update their columns using H. The current upper row and H are then deleted from the data set so that the size of H decreases by 1 in each of the 1024 iterations. The vector with maximum norm becomes the Householder vector for the next iteration. This is decided with a Reduce-To-All collective operation to which each processor contributes two integers and two doubles.
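
The reduce-to-all step can be pictured with the sketch below. It is a simplified stand-in: the real QR code contributes two integers and two doubles, whereas this illustrative variant reduces only a (norm, column) pair with MPI_MAXLOC.

    /* Simplified sketch of a reduce-to-all that selects the next pivot       */
    /* column; for illustration only, not the QR code used in this thesis.    */
    #include <mpi.h>

    int select_pivot(double my_best_norm, int my_best_col)
    {
        struct { double norm; int col; } in, out;
        in.norm = my_best_norm;          /* norm of my best local column      */
        in.col  = my_best_col;           /* global index of that column       */
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
        return out.col;                  /* every processor learns the pivot  */
    }

Because the reduction involves every processor, it also acts as a barrier, which is why the broadcasting processor cannot run ahead of the receivers (see below).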

The performance characteristics of QR are similar to those of ASP (see Figure 8.9): the implementations based on host-level forwarding perform worse than those based on interface-level forwarding. The performance of QR is dominated by the broadcasts of column H. As in ASP, host-level forwarding leads to increased processor overhead. In addition, the implementations based on host-level forwarding are sensitive to their higher broadcast latency. At the start of each iteration, each receiving processor must wait for an incoming broadcast. In contrast with ASP, the broadcasting processor cannot get ahead of the receivers by pipelining multiple broadcasts, because the Reduce-to-All synchronizes all processors in each iteration. (Also, due to pivoting, the identity of the broadcasting processor changes in almost every iteration.)

8.5.9 Radix Sort

Radix sorts a large array of (random) integers using a parallel version of the radix sort algorithm. Each processor owns a contiguous part of the array and each part is further subdivided into CRL regions, which act as software cache lines. Communication is dominated by the algorithm's permutation phase, in which each processor moves integers in its partition to some other partition of the array. If the keys are distributed uniformly, each processor accesses all other processors' partitions. We chose a region size (1 Kbyte) that minimizes write sharing in this phase. After the permutation phase each processor reads the new values in its partition and starts the next sorting iteration. Radix has the highest unicast data rate of all our applications (see Figure 8.8). In each of the three permutation phases, most of the array to be sorted is transferred across the network.

Although Radix sends larger messages than Awari, Barnes, and LEQ, it has a performance profile similar to that of these applications. Radix's messages are still relatively small (at most 1 Kbyte of region data) and each message requires at most two LFC packets, a full packet and a nearly empty packet. Consequently, Radix remains sensitive to the small-message bottleneck, so that NI-level retransmission (IrxImc and IrxHmc) performs worst, followed by host-level retransmission (HrxHmc).

8.5.10 Successive Overrelaxation (SOR)

SOR is used to solve discretized Laplace equations. The SOR program uses red-black iteration to update a 1536 × 1536 matrix. Each processor owns an equal number of contiguous rows of the matrix. In each of the 42 iterations, processors exchange neighboring border rows (12 Kbyte) and perform a single-element (one double) reduction to determine whether convergence has occurred.

Class       Applications              NrxImc  NrxHmc  IrxImc  IrxHmc  HrxHmc
Roundtrip   Barnes, Radix, Awari      +       +       −−      −−      −
Multicast   ASP, QR                   +       −       +       −       −−
One-way     Puzzle-4, Puzzle-64       +       +       +       +       −

Table 8.7. Classification of applications.

For SOR, HrxHmc performs worse than the other implementations. With the HrxHmc implementation, the host processor suffers from sliding-window stalls. Each 12 Kbyte message that is sent to a neighbor requires 13 LFC packets, while the window size is only 8 packets. If the destination process is actively polling when the sender's packets arrive, it will send a half-window acknowledgement before the sender is stalled. In SOR, however, all processes send out their messages at approximately the same time, first to their higher-numbered neighbor, then to their lower-numbered neighbor. The destination process will therefore not poll until one of its own sends blocks (due to a closed send window). At that point, the sender has already been stalled.

The implementations that implement NI-level reliability do not suffer from this problem, because they implement the sliding window on the NI. Even if the NI's send window to another NI closes, the host processor can continue to post packets until the supply of free send descriptors is exhausted. Also, if the NI's send window to one NI fills up, it can still send packets to other NIs if the host supplies packets for other destinations.

The performance of HrxHmc improves significantly if the boundary-row exchange code is modified so that not all processes send in the same direction at the same time. In practice, this improves the probability that acknowledgements can be piggybacked and reduces the number of window stalls.
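
One way to picture such a modification is the parity-staggered exchange sketched below; this is a hypothetical MPI-style illustration of the idea, not the SOR code itself. Neighbors are paired by rank parity so that each send immediately meets a matching receive.

    /* Hypothetical sketch of a staggered border-row exchange. Even ranks     */
    /* first exchange with their lower-numbered... no: even ranks exchange    */
    /* with their higher-numbered neighbor first while odd ranks exchange     */
    /* with their lower-numbered neighbor, so partners always match up.       */
    #include <mpi.h>

    void exchange_borders(int rank, int nprocs, double *top, double *bot,
                          double *from_above, double *from_below, int n)
    {
        int up = rank - 1, down = rank + 1;
        int phase = rank % 2;            /* even ranks start with 'down'      */
        for (int p = 0; p < 2; p++) {
            if ((p ^ phase) == 0 && down < nprocs)
                MPI_Sendrecv(bot, n, MPI_DOUBLE, down, 0,
                             from_below, n, MPI_DOUBLE, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            else if ((p ^ phase) == 1 && up >= 0)
                MPI_Sendrecv(top, n, MPI_DOUBLE, up, 0,
                             from_above, n, MPI_DOUBLE, up, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

Because each exchange is paired, the receiver is already in its receive (and polling) when the sender's packets arrive, so acknowledgements flow back before the 8-packet window closes.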

8.6 Classification

Based on the performance analysis of the individual applications, we have identified three classes of applications with distinct behavior. These classes and an indication of the relative performance of each LFC implementation for each class are shown in Table 8.7. In this table, a '+' indicates that an implementation performs well for a class of applications; a '−' indicates a modest decrease in performance; and '−−' indicates a significant decrease in performance.

The performance of roundtrip applications is dominated by roundtrip latency. Since multicast plays no important role in these applications, performance differences between the LFC implementations are determined by differences in the reliability schemes. This class of applications shows that the robustness of retransmission has a price.

Both the host-level retransmission scheme and the interface-level retransmission schemes perform worse than the no-retransmission scheme, but for different reasons. Host-level retransmission (HrxHmc) suffers from its higher send and receive overhead and is up to 11% slower than the default implementation (NrxImc). NI-level retransmission (IrxImc and IrxHmc) suffers from increased (NI) occupancy and latency. These implementations are up to 16% slower than the default implementation. For this class of applications, the LogP measurements presented earlier in this chapter give a good indication of the causes of differences in application-level performance.

The performance of the multicast applications is determined by the efficiency of the multicast implementation. The performance results show the advantages of NI-level multicast forwarding. HrxHmc, IrxHmc, and NrxHmc all suffer from host-level forwarding overhead, which takes away time from the application. Host-level forwarding is up to 38% slower than interface-level forwarding in the default implementation.

The 'class' of one-way applications contains both variants of the Puzzle application. In contrast with roundtrip applications, one-way applications do not wait for reply messages, so latency and NI occupancy can be tolerated fairly well. Send and receive overhead, on the other hand, cannot be hidden. As a result, HrxHmc suffers from its increased send and receive overhead; this leads to a performance loss of up to 16% relative to the default implementation.

The remaining applications (FFT, LEQ, and SOR) do not fit into any of these categories.

8.7 Related Work

This chapter has concentrated on NI protocol issues and their relationship with properties of PPSs and applications. Figure 8.10 classifies several existing Myrinet communication systems along two protocol design axes: reliability and multicast. Most systems do not implement retransmission. We assume that this is partly due to the prototype nature of research systems. The figure also shows that few systems provide multicast or broadcast support. Below, we first discuss related work in these two areas. Next, we discuss other NI-related application studies.

8.7.1 Reliability

Only a few studies have compared careful and retransmitting protocols. The earliest study that we are aware of is by Mosberger and Peterson [106], which compares a careful and a retransmitting protocol implementation for FiberChannel. They note the scalability problems of static buffer reservation (as in Nrx).

Fig. 8.10. Classification of communication systems for Myrinet based on the multicast and reliability design issues. The axes are reliability (no retransmission, host retransmits, NI retransmits) and multicast forwarding (host forwards, NI forwards); the systems shown include FM/MC, Hamlyn, FM, BIP (small and large messages), VMMC, VMMC-2, Trapeze, AM-II, U-Net, and PM.

This problem exists, but the low bandwidth-delay product of modern networks (and protocols) allows the use of static reservation in medium-size clusters. (Another, dynamic solution, used in PM's NI protocol, is described below.) Mosberger and Peterson used only one application, Jacobi. That application shows much larger benefits for careful protocols than we find with our application suite. This may be due to extra copying in their retransmitting protocol. As explained in Section 8.2.2, our retransmitting protocols do not make more copies than our careful protocols.

Some of the protocols in Figure 8.10 combine the strategies used by Nrx, Hrx, and Irx. Like Nrx, for example, PM assumes that the hardware never drops or corrupts packets. Like Irx, though, PM lets senders share NI receive buffers and drops incoming data packets when all buffers are full. When a data packet is dropped, a negative acknowledgement is sent to its sender. PM never drops acknowledgements (negative or positive). Since it is assumed that the hardware delivers all packets correctly, senders always know when one of their packets has been dropped and can retransmit that packet without having to set a retransmission timer.

Active Messages II combines an NI-level (alternating-bit) reliability protocol with a host-level sliding window protocol, which is used both for reliability and flow control [37].

Our NI-supported fine-grain timer implementation is similar to the soft timer scheme described by Aron and Druschel [6] (developed independently), who also use polling to implement a fine-grain timer. They optimize kernel-level timers and poll the host's timestamp counter on system calls, exceptions, and interrupts to amortize the state-saving overhead of these events over multiple actions (e.g., network interrupt processing and timer processing). Soft timers use the kernel's clock interrupt as a backup mechanism. This interrupt often has a granularity of at least several milliseconds. Hrx uses the NI's timer as a backup. This timer has a 0.5 µs granularity and is polled in each iteration of the NI's main loop.

Our backup mechanism is therefore more precise and can generate an interrupt at a time close to the desired timer expiration time in case the host's timestamp counter is not polled frequently enough. This is only a modest advantage, because it does not address the interrupt overhead problem that can result from infrequent polling of the timestamp counter.

8.7.2 Multicast

To the best of our knowledge, this is the first study that compares a high-performance NI-level and a high-performance host-level multicast and their respective impact on application performance. Several papers describe multicast protocols that use NI-level multicast forwarding [16, 56, 68, 80, 146]. This work was described in Section 4.5.3. Some of these papers (e.g., [146]) compare host-level forwarding and NI-level forwarding, but most of these comparisons use a host-level forwarding scheme that copies data to the NI multiple times. We are not aware of any previous study of the application-level impact of both forwarding strategies.

8.7.3 NI Protocol Studies

Araki et al. used microbenchmarks and the LogP performance model to compare the performance of Generic Active Messages, Illinois Fast Messages (version 2), BIP, PM, and VMMC-2 [5]. Their study compares communication systems with programming interfaces that are sometimes fundamentally different (e.g., memory-mapped communication in VMMC-2 versus message-based communication in PM, and rendezvous-style communication in BIP versus asynchronous message passing in Fast Messages). In our study, we compare five different implementations of the same interface. The most important difference with the work described in this chapter, however, is that Araki et al. do not consider the application-level impact of their results.

In another study [102], Martin et al. do focus on applications, but evaluate only a single communication architecture (Active Messages) and a single programming system (Split-C). Where Martin et al. vary the performance characteristics (the LogGP parameters) of a single communication system, we use five different systems which have different performance characteristics due to the way they divide protocol work between the host processor and the NI. An important contribution of our work is the evaluation of different network interface protocols using a wide variety of parallel applications (fine-grain to medium-grain, unicast and multicast) and PPSs (distributed search, message passing, update-based DSM, and invalidation-based DSM). Each system has its own, fairly large and complex runtime system, which imposes significant communication overheads (see Table 7.2). In addition, three out of four systems do not run immediately on top of one NI protocol layer (LFC), but on an intermediate message-passing layer (Panda).

Bilas et al. discuss NI support for shared virtual memory systems (SVMs) [21]. They study NI support for fetching data from remote memories, for depositing data into remote memories, and for distributed locks. They find that these mechanisms significantly improve the performance of SVM implementations and claim that they are more widely applicable. Using these mechanisms, and by restructuring their SVM, they eliminated all asynchronous host-level protocol processing in their SVM. As a result, interrupts are not needed and polling is used only when a message is expected. This result, however, depends on the ability to push asynchronous protocol processing to the NI. Bilas et al. use a PPS in which asynchronous processing consists of accessing data at a known address or accessing a lock. These actions are relatively simple and can be performed by the NI. PPSs such as Orca and Manta, however, must execute incoming user-defined operations, which cannot easily be handled by the NI.

In a simulation study, Bilas et al. identify bottlenecks in software shared-memory systems [20]. This study is structured around the same three layers as we used in this chapter: low-level communication software and hardware, PPS, and application. Both the communication layer and the PPSs are different from the ones studied in this thesis. Bilas et al. assume a communication layer based on virtual memory-mapped communication (see Chapter 2) and analyze the performance of page-based and fine-grained DSMs. Our work uses a communication layer based on low-level message passing and PPSs that implement message passing or object-based sharing. (CRL uses a similar cache-coherence protocol as fine-grained DSMs, but relies on the programmer to define 'cache lines' and to trigger coherence actions. Fine-grained DSMs such as Tempest [119] and Shasta [125] require little or no programmer intervention.)

8.8 Summary

In this chapter, we have studied the performance of five implementations of LFC's communication interface. Each implementation is a combination of one reliability scheme (no retransmission, host-level retransmission, or NI-level retransmission) and one multicast forwarding scheme (host-level forwarding or NI-level forwarding). These five implementations represent different assumptions about the capabilities of network interfaces (availability of a programmable NI processor and its speed) and the operating environment (reliability of the network hardware). We compared the performance of all five implementations at multiple levels. We used microbenchmarks for direct comparisons between the different implementations and for comparisons at the PPS level. We used four different PPSs, which represent different programming paradigms and which exhibit different communication patterns. Most importantly, we also performed an application-level comparison.

These performance comparisons reveal several interesting facts. First, although performance differences are visible, all five LFC implementations can be made to perform well on most LFC-level microbenchmarks. Multicast latency forms an exception: NI-level multicast forwarding yields better multicast latency than host-level forwarding.

Second, all PPSs add large overheads to LFC implementations. Measuring essentially the same communication pattern at multiple levels shows large increases in latency and large reductions in throughput. While this is a logical consequence of the layered structure of the PPSs, it is important to make this observation, because one would expect that these overheads, and not the relatively small performance differences between LFC implementations, will dominate application performance.

Third, in spite of the previous observation, running applications on the different LFC implementations yields significant differences in application execution time. Most of our applications exhibited one of three communication patterns: roundtrip, one-way, or multicast. For each of these patterns, the relative performance of the LFC implementations is similar:

- The roundtrip applications perform best on the nonretransmitting LFC implementations. Retransmission support adds sufficient overhead that it cannot be hidden by applications in this class. The overheads have a larger impact when retransmission is implemented on the NI.

- The multicast applications perform best on the implementations that perform NI-level forwarding. NI-level forwarding yields lower multicast latency and does not waste host-processor cycles on packet forwarding.

- The one-way applications (really two variants of the same application) perform well on all implementations, except on the implementation that implements host-level retransmission. This implementation suffers from its larger send and receive overheads.

Our original implementation of LFC, NrxImc, performs best for all patterns. This implementation, however, optimistically assumes the presence of reliable network hardware and an intelligent NI.


Chapter 9

Summary and Conclusions

The implementation of a high-level programming model consists of multiple layers: a runtime system or parallel-programming system that implements the programming model and one or more communication layers that have no knowledge of the parallel-programming model, but that provide efficient communication mechanisms. This thesis has concentrated on the lower communication layers, in particular on the network interface (NI) protocol layer. A key contribution of this thesis, however, is that it also addresses the interactions of the NI protocol layer with higher-level communication layers and applications. The following sections summarize our work as it relates to each layer and draw conclusions.

9.1 LFC and NI Protocols

We have implemented our ideas in and on top of LFC, a new user-level communication system. In several ways, LFC resembles a hardware device: communication is packet-based and all packets are delivered to a single upcall. However, LFC adds two services that are usually not provided by network hardware: reliable point-to-point and multicast communication. The efficient implementation of both services has been the key to LFC's success.

Efficient communication is of obvious importance to parallel-programming systems (PPSs). While many communication systems provide low-latency, high-throughput point-to-point primitives, very few systems provide efficient multicast implementations. In fact, many do not provide multicast at all. This is unfortunate, because PPSs such as Orca and MPI rely on multicasting to update replicated objects and to implement collective-communication operations, respectively. Moreover, multicast is not an add-on feature: multicasts layered on top of point-to-point primitives perform considerably worse than multicasts that are supported by the bottom-most communication layer (see Chapters 7 and 8).

The presence of reliable communication greatly reduces the effort needed to implement a PPS. Compared to fragmentation and demultiplexing (services not provided by LFC), reliability protocols are difficult to implement correctly and efficiently. Nevertheless, all PPSs must deliver data reliably.

For efficiency, the default implementation of LFC uses NI support. The NI performs four tasks: flow control for reliability, multicast forwarding, interrupt management, and fetch-and-add processing. The following paragraphs discuss these tasks in more detail.

The default implementation assumes reliable network hardware and implements reliable communication channels by means of a simple, NI-level flow control protocol. This protocol, UCAST, is also the basis of MCAST and RECOV, LFC's NI-supported multicast protocols. Since these protocols assume reliable hardware, however, they can operate only in a controlled environment such as a cluster of computers dedicated to running high-performance parallel jobs.

MCAST is simple and efficient, but does not work for all multicast tree topologies. RECOV works for all topologies, but is more complex and requires additional NI buffer space. Both MCAST and RECOV perform fewer data transfers than host-level store-and-forward multicast protocols, which reduces the number of host cycles spent on multicast processing. The performance measurements in Chapter 8 show that host-level multicasting reduces application performance by up to 38%. These measurements compared the performance of an NI-supported multicast to an aggressive host-level multicast. In practice, host-level multicast implementations are frequently implemented on top of unicast primitives, which further increases the number of redundant data transfers.

Network interrupts form an important source of overhead in communication systems. To reduce the number of unnecessary interrupts, LFC implements a software polling watchdog on the NI. This mechanism delays network interrupts for incoming packets for a certain amount of time. In addition, LFC's polling watchdog monitors host-level packet-processing progress to determine whether a network interrupt should be generated.
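
The control flow of such a watchdog can be pictured roughly as follows. This is a hypothetical sketch of the idea only; the helper functions are invented names and the sketch is not LFC's actual NI control program.

    /* Hypothetical sketch of a polling watchdog in the NI control program.   */
    #define INTD 70   /* interrupt delay in microseconds (see Section 8.5.1)  */

    extern void     handle_send_and_receive_dma(void);   /* normal NI work    */
    extern int      host_has_undelivered_packets(void);
    extern unsigned ni_clock(void);                       /* microsecond clock */
    extern unsigned oldest_undelivered_timestamp(void);
    extern int      host_is_making_progress(void);        /* host polled?      */
    extern void     raise_host_interrupt(void);

    void ni_main_loop(void)
    {
        for (;;) {
            handle_send_and_receive_dma();
            if (host_has_undelivered_packets()) {
                unsigned age = ni_clock() - oldest_undelivered_timestamp();
                /* Interrupt only if the host has not processed the packet    */
                /* within the interrupt delay and is not making progress.     */
                if (age >= INTD && !host_is_making_progress())
                    raise_host_interrupt();
            }
        }
    }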

We have implemented an NI-supported fetch-and-add operation. Panda combines LFC's fetch-and-add with LFC's broadcast to implement a totally-ordered broadcast, which, in turn, is used by Orca. Fetch-and-add also has other applications. Karamcheti and Chien, for example, have used fetch-and-add to implement pull-based messaging [76].
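
A minimal sketch of how a fetch-and-add can impose a total order on broadcasts is shown below; lfc_fetch_and_add() and lfc_broadcast() are invented names and the real Panda/LFC interfaces differ.

    /* Hypothetical sketch: ordering broadcasts with a global fetch-and-add.  */
    #include <string.h>

    #define MAX_PAYLOAD 1024

    extern int  lfc_fetch_and_add(int increment);   /* returns the old value  */
    extern void lfc_broadcast(const void *data, int len);

    void ordered_broadcast(const void *data, int len)
    {
        struct { int seq; char payload[MAX_PAYLOAD]; } msg;
        /* Atomically obtain the next global sequence number from the NI      */
        /* that stores the fetch-and-add variable.                            */
        msg.seq = lfc_fetch_and_add(1);
        memcpy(msg.payload, data, len);
        lfc_broadcast(&msg, (int)sizeof(int) + len);
        /* Receivers buffer incoming broadcasts and deliver them in           */
        /* increasing sequence-number order, which yields one total order.    */
    }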

Other types of synchronization primitives can also benefit from NI support. Bilas et al. use the NI to implement distributed locking in a shared virtual memory system [21]. These examples indicate that NI support for synchronization primitives is a good idea. Without such support, synchronization requests generate expensive interrupts. It is not yet clear, however, exactly which primitives must be supported. Different systems require different primitives and, in spite of the generality claimed by the implementors of some mechanisms [21], it is unlikely that one primitive will satisfy every system. Another problem is the virtualization of NI-supported primitives. LFC, for example, supports only one fetch-and-add variable per NI. This constraint was introduced to bound the space occupied by NI variables and to avoid introducing an extra demultiplexing step. Ideally, this type of constraint would be hidden from users. Active Messages II virtualizes NI-level communication endpoints by using host memory as backing storage for NI memory [37]. This strategy is general, but increases the complexity and decreases the efficiency of the implementation.

It is frequently claimed that programmable NIs are 'too slow'; such claims are often accompanied by references to the failure of I/O channels. LFC performs four different tasks on the NI: flow control for reliability, multicast forwarding, interrupt management, and fetch-and-add processing. Nevertheless, LFC is an efficient communication system. The main issue is not whether NIs should be programmable, but which mechanisms the lowest communication layer should supply and where they should be implemented. Programmable NIs can be considered one tool in a spectrum of tools that help answer this type of question; others include formal analysis and simulation. In this thesis, we studied reliability, multicast, synchronization, and interrupt management. Others have investigated NI support for address translation in zero-copy protocols [13, 49, 142], distributed locking [21], and remote memory access [51, 83]. Some of these mechanisms (e.g., remote memory access) have recently found their way into industry standards (the Virtual Interface Architecture) and commercial products.

9.2 Panda

Many parallel-programming systems are sufficiently complex that they can benefit from an intermediate communication layer that provides higher-level communication services than a system like LFC. Panda provides threads, messages, message passing, RPC, and group communication. These services are used in the implementation of three of the four PPSs described in Chapter 7.

Panda's thread package, OpenThreads, dynamically switches between polling and interrupts. By default, interrupts are enabled, but when all threads are idle, OpenThreads disables interrupts and polls the network. This strategy is simple but effective. When a message is expected, it will often be received through polling, which is more efficient than receiving it by means of an interrupt. Unexpected messages generate an interrupt, unless the receiving process polls before LFC's polling watchdog generates the interrupt.
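
The idle-time behaviour described here can be sketched roughly as follows; the function names are invented for illustration and do not correspond to the OpenThreads API.

    /* Hypothetical sketch of a scheduler that polls while all threads idle.  */
    extern void lfc_disable_interrupts(void);
    extern void lfc_enable_interrupts(void);
    extern void lfc_poll(void);             /* delivers packets via upcalls   */
    extern int  runnable_thread_exists(void);
    extern void run_next_thread(void);

    void scheduler_idle(void)
    {
        lfc_disable_interrupts();
        while (!runnable_thread_exists())
            lfc_poll();                     /* an upcall may wake a thread    */
        lfc_enable_interrupts();            /* back to interrupt-driven mode  */
        run_next_thread();
    }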

Panda's stream messages allow Panda clients to transmit and receive a message in a pipelined manner: the receiver can start processing an incoming message before it has been fully received. Stream messages are implemented on top of LFC's packet-based interface without introducing extra data copies.
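
One way to picture this pipelining is the consumer loop below; stream_msg_t, pan_msg_consume(), and process() are hypothetical names used only for illustration, not Panda's real interface.

    /* Hypothetical sketch of pipelined consumption of a stream message.      */
    typedef struct stream_msg stream_msg_t;

    extern int  pan_msg_consume(stream_msg_t *msg, void *buf, int maxlen);
    extern void process(const char *data, int len);

    void handle_message(stream_msg_t *msg)
    {
        char chunk[1024];                   /* roughly one LFC packet         */
        int  n;
        /* The consume call returns data as soon as the next packet of the    */
        /* message has arrived, so processing overlaps with reception.        */
        while ((n = pan_msg_consume(msg, chunk, sizeof chunk)) > 0)
            process(chunk, n);
    }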

Blocking in message handlers is a difficult problem. Creating a (popup) thread per message allows message handlers to block whenever convenient, but it also removes the ordering between messages, introduces scheduling anomalies, and wastes (stack) space. Disallowing all handlers to block (as in active messages) forces programmers to use asynchronous primitives and continuations whenever any form of blocking (e.g., acquiring a lock) is required. Panda uses single-threaded upcalls to process incoming messages. Message handlers are allowed to block on locks, but cannot wait for the arrival of other messages. If the receiver of a message can decide early that waiting for another message will not be necessary, then the message can be processed without creating a new thread. In other cases, blocking can be avoided by using nonblocking primitives instead of blocking primitives or by using continuations.

Panda and LFC work well together. Panda's thread scheduler uses LFC's interrupt support to switch dynamically between polling and interrupts. Stream messages pipeline the transmission and consumption of LFC packets. Finally, Panda's totally-ordered broadcast combines LFC's fetch-and-add and broadcast primitives.

9.3 Parallel-Programming Systems

Different PPSs have different communication styles and are therefore sensitive to different aspects of the underlying communication system. This is illustrated by the four PPSs studied in Chapter 7.

Orca uses two communication primitives to implement operations on shared objects: remote procedure call and asynchronous, totally-ordered broadcast. The performance of the Orca applications that we studied in Chapter 8 depends on one of the two. Orca operations may block on guards. Since guards occur only at the beginning of an operation, it is easy to suspend the operation without blocking. Without blocking and thread switching, the Orca runtime system creates a small continuation that represents the blocked operation.

Like Orca, the Java-based Manta system provides multithreading and shared objects. Since Manta does not replicate objects, it requires only RPC to implement operations on shared objects. In contrast with Orca, Manta operations can block halfway through. This makes it difficult for the RTS to use small continuations to represent the blocked operation. The RTS therefore creates a popup thread for each operation, unless the compiler can prove that the operation will never block. Another complication in Manta is the presence of a garbage collector. The space occupied by unmarshaled parameters usually cannot be reused until the parameters have been garbage collected. Consequently, unmarshaling leaves a large memory footprint and pollutes large parts of the data cache.

In CRL, most communication patterns are roundtrips. When invalidations are needed, nested RPCs occur. On average, CRL's RPCs are smaller than those of Manta and Orca, because CRL frequently sends small control messages (region requests, invalidations, and acknowledgements). Manta and Orca generate some control traffic, but far less than CRL.

MPI has the simplest programming model of the four PPSs studied in this thesis. In most cases, communication at the MPI level corresponds in a straightforward way with communication at lower levels. MPI's collective-communication operations form an important exception. These high-level primitives are usually implemented by a combination of point-to-point and multicast messages.

While these PPSs generate different communication patterns, they often rely on the same communication mechanisms. All PPSs require reliable communication. Several low-level communication systems (e.g., U-Net and BIP) do not provide reliable communication primitives.

Orca, Manta, and CRL all benefit from LFC's and Panda's mechanisms to reduce the number of network interrupts. Orca and Manta rely on Panda's thread package, OpenThreads, to poll when all threads are idle. In CRL, the same optimization has been hardcoded.

Orca and MPI benefit from an efficient multicast implementation. Orca’stotally-orderedbroadcastis layered(in Panda)on top of LFC’s broadcastandfetch-and-add.MPI’s collective-communicationoperationsuse(throughPanda)LFC’s broadcast.

Orca, Manta, and MPI all use Panda's stream messages. Stream messages provide a convenient and efficient message interface without introducing intermediate data copies. The Orca and Manta compilers generate operation-specific code to marshal to and from stream messages.

9.4 Performance Impact of Design Decisions

Chapter 8 studied the impact of different reliability and multicast protocols on application performance and showed that the LFC implementation described in Chapters 3 and 4 (NrxImc) performs better than alternative implementations. This implementation, however, is aggressive in two ways. First, it assumes reliable hardware, which means it can be used only in dedicated environments. Second, it delegates some tasks (flow control, multicast, fetch-and-add, and interrupt management) to the NI. These tasks are performed by the NI-level protocols described in Chapter 4. Additional robustness can be obtained by means of retransmission. We investigated both host-level and NI-level retransmission. The unicast-dominated applications of Chapter 8 show that both types of retransmission have a modest application-level cost. The largest overhead (up to 16%) is incurred by applications in the 'roundtrip' class. The performance cost of moving NI-level components to the host can be large. Performing multicast forwarding on the host rather than on the NI slows down applications by up to 38%.

Chapter 8 also revealed interesting interactions between the structure of applications and the implementation of the underlying communication system(s). In SOR, the exact location (host or NI) of the sliding window protocol had a noticeable impact on performance. In ASP, we added polling statements to reduce the delays that resulted from synchronous host-level multicast forwarding. In some cases, applications can be restructured to work around the weaknesses of a particular communication system. The Puzzle application, for example, can be configured to use a larger aggregation buffer on top of a host-level reliability protocol to reduce the impact of increased send and receive overhead. The performance of SOR was improved by changing the boundary exchange phase.

9.5 Conclusions

Based on our experiences with LFC and the systems layered on top of LFC, we draw the following conclusions:

1. Low-level communication systems should support both polling and interrupt-driven message delivery. We studied four PPSs. All of them use polling and three of them use interrupts (Orca, Manta, and CRL). Transparent emulation of interrupt-driven message delivery is nontrivial (several systems use binary rewriting) or relies on interrupts as a backup mechanism.

2. Low-level communication systems should support multicast. Multicast plays an important role in two of the four PPSs studied (Orca and MPI). Multicast implementations are often layered on top of point-to-point primitives. This always introduces unnecessary data transfers that consume cycles that could have been used by the application. Moreover, if multicast forwarding is implemented at the message rather than the packet level, then this strategy also reduces throughput by removing intramessage pipelining.

3. A variety of PPSs can be implemented efficiently on a simple communication system if all layers cooperate to avoid unnecessary interrupts, thread switches, and data copies. LFC provides a simple interface: reliable, packet-based point-to-point and multicast communication and interrupt management. These mechanisms are used to implement abstractions that preserve efficiency. In Panda, they are used to implement OpenThreads's automatic switching between polling and interrupts, message streams, and totally-ordered multicast. Orca, Manta, and MPI are all built on top of Panda. In addition, Orca and Manta use operation-specific marshaling in combination with stream messages to avoid making extra data copies.

4. An aggressive, NI-supported communication system can yield better performance than a traditional communication system without NI support. We have experimented with a reliability protocol (UCAST), two multicast protocols (MCAST and RECOV), a polling watchdog (INTR), and a fetch-and-add primitive. Additional robustness and simplicity can be obtained by systems that implement all functionality on the host, but these systems perform worse, both at the level of microbenchmarks and at the level of applications.
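The sketch below, referred to from conclusion 1, illustrates under simplified assumptions a receive interface that offers both polling and interrupt-driven delivery. All names are illustrative, and the "interrupt" path is simulated by a direct upcall so that the example runs as an ordinary program; it is not LFC's actual interface.

/* Self-contained sketch of a receive interface that supports both polling
 * and interrupt-driven delivery.  The interrupt is simulated by an
 * immediate upcall; a real NI would raise a hardware interrupt instead. */
#include <stdbool.h>
#include <stdio.h>

#define QLEN 8

typedef void (*upcall_t)(int pkt);

static int      queue[QLEN];
static int      q_head, q_tail;
static bool     intr_enabled;
static upcall_t handler;

static void net_init(upcall_t h)         { handler = h; }
static void net_enable_interrupts(void)  { intr_enabled = true; }
static void net_disable_interrupts(void) { intr_enabled = false; }

/* Polling: the host explicitly drains pending packets. */
static void net_poll(void)
{
    while (q_head != q_tail)
        handler(queue[q_head++ % QLEN]);
}

/* Packet arrival at the NI: with interrupts enabled the handler runs
 * "asynchronously" (here: immediately); otherwise the packet waits
 * until the host polls. */
static void net_deliver(int pkt)
{
    queue[q_tail++ % QLEN] = pkt;
    if (intr_enabled)
        net_poll();                      /* stands in for an interrupt */
}

static void print_pkt(int pkt) { printf("got packet %d\n", pkt); }

int main(void)
{
    net_init(print_pkt);
    net_deliver(1);                      /* queued; interrupts disabled */
    net_poll();                          /* delivered by polling        */
    net_enable_interrupts();
    net_deliver(2);                      /* delivered asynchronously    */
    net_disable_interrupts();
    return 0;
}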

9.6 Limitations and Open Issues

Although this thesis has a broad scope, several important issues have not been addressed and some issues that have been addressed require further research.

First, LFC lacks zero-copy support. Zero-copy mechanisms are based on DMA rather than PIO and, for sufficiently large transfers, give better throughput than PIO (see Chapter 2). Sender-side zero-copy support is relatively easy to implement and use. For optimal performance, however, data should also be moved directly to its final destination at the receiver side. Some systems (e.g., Hamlyn [32]) achieve this by having the sender specify a destination address in the receiver's address space. PPSs need to be restructured in nontrivial ways to determine this address without introducing extra messages [21, 35]. While zero-copy communication support for PPSs warrants further investigation, we are not aware of any study that shows significant application-level performance advantages of zero-copy communication mechanisms over message-based mechanisms. There are studies that show that parallel applications are relatively insensitive to throughput [102].
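As an illustration of the sender-specified-destination style mentioned above, the sketch below shows a hypothetical put() operation in which the sender names the destination address in the receiver's address space. The signature is an assumption for illustration only (it is not an interface defined in this thesis), and the body merely simulates the remote write with a local copy so the example is self-contained.

/* Hedged sketch of sender-side zero-copy with a sender-specified destination
 * address, in the style of Hamlyn [32].  In a real system the NI would DMA
 * the data into the remote buffer; here the remote write is simulated. */
#include <stdio.h>
#include <string.h>

/* Write 'len' bytes from local 'src' directly to address 'dst' in the
 * receiver's address space; no intermediate host-side copy is made. */
static void put(int dest_node, void *dst, const void *src, size_t len)
{
    (void)dest_node;          /* would select the remote NI */
    memcpy(dst, src, len);    /* stands in for a remote DMA */
}

int main(void)
{
    char remote_buf[64];                 /* stands in for receiver memory */
    const char *msg = "boundary row";
    put(1, remote_buf, msg, strlen(msg) + 1);
    printf("remote buffer now holds: %s\n", remote_buf);
    return 0;
}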

Second, more consideration needs to be given to different buffer and timer configurations for the various LFC implementations studied in Chapter 8. We selected buffer and timer settings that yield good performance for each implementation. It is not completely clear how sensitive the different implementations are to different settings.

Finally, the evaluation in Chapter 8 focused on the impact of alternative reliability and multicast schemes. Additional work is needed to determine the impact of NI-supported synchronization and interrupt delivery (i.e., the polling watchdog). Some of our results (not presented in this thesis), however, show a good qualitative match with those presented by Maquelin et al. [101].


Bibliography

[1] A. Agarwal, R. Bianchini, D. Chaiken, K.L. Johnson,D. Kranz, J. Kubi-atowicz, B.H. Lim, K. MacKenzie,andD. Yeung. TheMIT Alewife Ma-chine: ArchitectureandPerformance.In Proc. of the 22ndInt. Symp.onComputerArchitecture, pages2–13,SantaMargheritaLigure, Italy, June1995.

[2] A. Agarwal, J. Kubiatowicz, D. Kranz,B.H. Lim, D. Yeung,G. D’Souza,andM. Parkin.Sparcle:An EvolutionaryProcessorDesignfor Large-ScaleMultiprocessors.IEEE Micro, 13(3):48–61,June1993.

[3] A. Alexandrov, M.F. Ionescu,K.E. Schauser, andC. Scheiman.LogGP:IncorporatingLong Messagesinto theLogPModel – OneStepCloserTo-wardsa RealisticModel for Parallel Computation. In Proc. of the 1995Symp.on Parallel AlgorithmsandArchitectures, pages95–105,SantaBar-bara,CA, July1995.

[4] K.V. Anjan and T.M. Pinkston. An Efficient, Fully Adaptive DeadlockRecovery Scheme:DISHA. In Proc. of the22ndInt. Symp.on ComputerArchitecture, pages201–210,SantaMargheritaLigure, Italy, June1995.

[5] S.Araki, A. Bilas,C. Dubnicki, J.Edler, K. Konishi,andJ.Philbin. User-SpaceCommunication:A Quantitative Study. In Supercomputing’98, Or-lando,FL, November1998.

[6] M. Aron andP. Druschel. Soft Timers: Efficient MicrosecondSoftwareTimer Supportfor Network Processing.In Proc.of the17thSymp.on Op-eratingSystemsPrinciples, pages232–246,KiawahIslandResort,SC,De-cember1999.

[7] H.E. Bal. ProgrammingDistributedSystems. PrenticeHall InternationalLtd., HemelHempstead,UK, 1991.

[8] H.E. Bal and L.V. Allis. Parallel RetrogradeAnalysis on a DistributedSystem.In Supercomputing’95, SanDiego,CA, December1995.


[9] H.E.Bal, R.A.F. Bhoedjang,R.F.H. Hofman,C. Jacobs,K.G. Langendoen,T. Ruhl, andM.F. Kaashoek.PerformanceEvaluationof theOrcaSharedObjectSystem.ACM Trans.on ComputerSystems, 16(1):1–40,February1998.

[10] H.E. Bal, R.A.F. Bhoedjang,R.F.H. Hofman, C. Jacobs,K.G. Langen-doen,andK. Verstoep. Performanceof a High-Level Parallel LanguageonaHigh-SpeedNetwork. Journalof Parallel andDistributedComputing,40(1):49–64,February1997.

[11] H.E. Bal andM.F. Kaashoek.ObjectDistribution in OrcausingCompile-timeandRun-TimeTechniques.In Conf. onObject-OrientedProgrammingSystems,Languagesand Applications, pages162–177,Washington,DC,September1993.

[12] H.E. Bal, M.F. Kaashoek,andA.S. Tanenbaum.Orca: A LanguageforParallel Programmingof DistributedSystems. IEEE Trans.on SoftwareEngineering, 18(3):190–205,March1992.

[13] A. Basu,M. Welsh,andT. von Eicken. IncorporatingMemory Manage-mentinto User-Level Network Interfaces. In Hot Interconnects’97, Stan-ford, CA, April 1997.

[14] R.A.F. BhoedjangandK.G. Langendoen.FriendlyandEfficient MessageHandling. In Proc. of the 29th AnnualHawaii Int. Conf. on SystemSci-ences, pages121–130,Maui, HI, January1996.

[15] R.A.F. Bhoedjang,J.W. Romein,and H.E. Bal. Optimizing DistributedDataStructuresUsingApplication-SpecificNetwork InterfaceSoftware.InProc. of the1998Int. Conf. on Parallel Processing, pages485–492,Min-neapolis,MN, August1998.

[16] R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. Efficient Multicast on Myrinet Using Link-Level Flow Control. In Proc. of the 1998 Int. Conf. on Parallel Processing, pages 381–390, Minneapolis, MN, August 1998.

[17] R.A.F. Bhoedjang,T. Ruhl, andH.E. Bal. LFC: A CommunicationSub-stratefor Myrinet. In 4th AnnualConf. of theAdvancedSchool for Com-putingandImaging, pages31–37,Lommel,Belgium,June1998.

[18] R.A.F. Bhoedjang,T. Ruhl, andH.E. Bal. User-Level Network InterfaceProtocols.IEEE Computer, 31(11):53–60,November1998.


[19] R.A.F. Bhoedjang,T. Ruhl, R.F.H. Hofman,K.G. Langendoen,H.E. Bal,andM.F. Kaashoek.Panda:A PortablePlatformto SupportParallelPro-grammingLanguages.In Proc.of theUSENIXSymp.on ExperienceswithDistributedandMultiprocessorSystems(SEDMSIV), pages213–226,SanDiego,CA, September1993.

[20] A. Bilas, D. Jiang,Y. Zhou, and J.P. Singh. Limits to the Performanceof Software SharedMemory: A LayeredApproach. In Proc. of the 5thInt. Symp.on High-PerformanceComputerArchitecture, pages193–202,Orlando,FL, January1999.

[21] A. Bilas, C. Liao, and J.P. Singh. Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems. In Proc. of the 26th Int. Symp. on Computer Architecture, pages 282–293, Atlanta, GA, May 1999.

[22] K.P. Birman. TheProcessGroupApproachto ReliableDistributedCom-puting. Communicationsof theACM, 36(12):37–53,December1993.

[23] A.D. Birrell andB.J.Nelson.ImplementingRemoteProcedureCalls.ACMTrans.onComputerSystems, 2(1):39–59,February1984.

[24] M.A. Blumrich,R.D.Alpert, Y. Chen,D.W. Clark,S.N.Damianakis,E.W.Felten,L. Iftode, K. Li, M. Martonosi,andR.A. Shillner. DesignChoicesin the SHRIMP System: An Empirical Study. In Proc. of the 25th Int.Symp.on ComputerArchitecture, pages330–341,Barcelona,Spain,June1998.

[25] M.A. Blumrich,C. Dubnicki,E.W. Felten,andK. Li. Protected,User-levelDMA for theSHRIMPNetwork Interface.In Proc.of the2ndInt. Symp.onHigh-PerformanceComputerArchitecture, pages154–165,SanJose,CA,February1996.

[26] M.A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E.W. Felten,andJ. Sand-berg. Virtual MemoryMappedNetwork Interfacefor theSHRIMPmulti-computer. In Proc.of the21stInt. Symp.on ComputerArchitecture, pages142–153,Chicago,IL, April 1994.

[27] N.J. Boden,D. Cohen,R.E. Felderman,A.E. Kulawik, C.L. Seitz, J.N.Seizovic, andW. Su. Myrinet: A Gigabit-per-secondLocalAreaNetwork.IEEEMicro, 15(1):29–36,February1995.


[28] E.A. Brewer, F.T. Chong,L.T. Liu, S.D. Sharma,andJ.D. Kubiatowicz.RemoteQueues:ExposingMessageQueuesfor OptimizationandAtomic-ity. In Proc. of the1995Symp.on Parallel AlgorithmsandArchitectures,pages42–53,SantaBarbara,CA, July1995.

[29] A. Brown andM. Seltzer. OperatingSystemBenchmarkingin theWakeofLmbench:A CaseStudyof the Performanceof NetBSDon the Intel x86Architecture. In Proc. of the 1997Conf. on Measurementand Modelingof ComputerSystems(SIGMETRICS), pages214–224,Seattle,WA, June1997.

[30] J. Bruck, L. De Coster, N. Dewulf, C.-T. Ho, andR. Lauwereins.On theDesignandImplementationof BroadcastandGlobalCombineOperationsUsingthePostalModel. IEEE Trans.on Parallel andDistributedSystems,7(3):256–265,March1996.

[31] M. Buchanan.Privatecommunication,1997.

[32] G. Buzzard, D. Jacobson, M. MacKey, S. Marovich, and J. Wilkes. An Implementation of the Hamlyn Sender-managed Interface Architecture. In 2nd USENIX Symp. on Operating Systems Design and Implementation, pages 245–259, Seattle, WA, October 1996.

[33] J.Carreira,J.GabrielSilva,K.G. Langendoen,andH.E.Bal. ImplementingTuple Spacewith Threads. In 1st Int. Conf. on Parallel and DistributedSystems, pages259–264,Barcelona,Spain,June1997.

[34] C.-C.Chang,G. Czajkowski, C. Hawblitzel, andT. von Eicken. Low La-tency Communicationon the IBM RS/6000SP. In Supercomputing’96,Pitssburgh,PA, November1996.

[35] C.-C. Chang and T. von Eicken. A Software Architecture for Zero-Copy RPC in Java. Technical Report 98-1708, Dept. of Computer Science, Cornell University, Ithaca, NY, September 1998.

[36] J.-D.Choi, M. Gupta,M. Serrano,V.C. Sreedhar, andS. Midkif f. EscapeAnalysisfor Java.In Conf. onObject-OrientedProgrammingSystems,Lan-guagesandApplications, pages1–19,Denver, CO,November1999.

[37] B. Chun,A. Mainwaring,andD.E. Culler. Virtual Network TransportPro-tocolsfor Myrinet. In Hot Interconnects’97, Stanford,CA, April 1997.

[38] D.D. Clark. TheStructuringof SystemsUsingUpcalls.In Proc.of the10thSymp.onOperatingSystemsPrinciples, pages171–180,OrcasIsland,WA,December1985.


[39] D.E. Culler, A. Dusseau,S.C. Goldstein,A. Krishnamurty, S. Lumetta,T. von Eicken,andK. Yelick. ParallelProgrammingin Split-C. In Super-computing’93, pages262–273,Portland,OR,November1993.

[40] D.E. Culler, R. Karp, D. Patterson,A. Sahay, K.E. Schauser, E. Santos,R. Subramonian,andT. von Eicken. LogP:Towardsa RealisticModel ofParallelComputation.In 4th Symp.on PrinciplesandPracticeof ParallelProgramming, pages1–12,SanDiego,CA, May 1993.

[41] D.E. Culler, L.T. Liu, R.P. Martin, andC.O. Yoshikawa. AssessingFastNetwork Interfaces.IEEEMicro, 16(1):35–43,February1996.

[42] L. DagumandR. Menon.OpenMP:An Industry-StandardAPI for Shared-Memory Programming. IEEE ComputationalScienceand Engineering,5(1):46–55,January/March1998.

[43] W.J. Dally and C.L. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Trans. on Computers, 36(5):547–553, May 1987.

[44] E.W. Dijkstra. GuardedCommands. Communicationsof the ACM,18(8):453–457,August1975.

[45] J.J.Dongarra,S.W. Otto, M. Snir, and D.W. Walker. A MessagePass-ing Standardfor MPP and Workstations. Communicationsof the ACM,39(7):84–90,July1996.

[46] R.P. Draves,B.N. Bershad,R.F. Rashid,andR.W. Dean.UsingContinua-tionsto ImplementThreadManagementandCommunicationin OperatingSystems.In Proc.of the13thSymp.onOperatingSystemsPrinciples, pages122–136,Asilomar, PacificGrove,CA, October1991.

[47] P. DruschelandG. Banga. Lazy Receiver Processing(LRP): A NetworkSubsystemArchitecturefor Server Systems. In 2nd USENIXSymp.onOperating SystemsDesignand Implementation, pages261–275,Seattle,WA, October1996.

[48] P. Druschel,L.L. Peterson,andB.S.Davie. Experienceswith aHigh-SpeedNetwork Adaptor: A SoftwarePerspective. In Proc. of the1994Conf. onCommunicationsArchitectures,Protocols,andApplications(SIGCOMM),pages2–13,London,UK, September1994.

[49] C. Dubnicki, A. Bilas, Y. Chen,S. Damianakis,and K. Li. VMMC-2:Efficient Supportfor Reliable,Connection-OrientedCommunication. InHot Interconnects’97, Stanford,CA, April 1997.


[50] C. Dubnicki,A. Bilas,K. Li, andJ.Philbin. DesignandImplementationofVirtual Memory-MappedCommunicationonMyrinet. In 11thInt. ParallelProcessingSymp., pages388–396,Geneva,Switzerland,April 1997.

[51] D. Dunning,G. Regnier, G. McAlpine, D. Cameron,B. Shubert,F. Berry,A.M. Merritt, E. Gronke,andC. Dodd. TheVirtual InterfaceArchitecture.IEEE Micro, 18(2):66–76,March–April 1998.

[52] A. Fekete, M. F. Kaashoek,and N. Lynch. ImplementingSequentiallyConsistentSharedObjectsUsing Broadcastand Point-to-PointCommu-nication.In Proc.of the15thInt. Conf. onDistributedComputingSystems,pages439–449,Vancouver, Canada,May 1995.

[53] MPI Forum. MPI: A MessagePassingInterfaceStandard.Int. Journal ofSupercomputingApplications, 8(3/4),1994.

[54] I. Foster, C. Kesselman,andS. Tuecke. The NEXUS Approachto Inte-gratingMultithreadingandCommunication.Journal of Parallel andDis-tributedComputing, 37(2):70–82,February1996.

[55] M. Frigo, C.E. Leiserson,andK.H. Randall. The Implementationof theCilk-5 MultithreadedLanguage. In Proc. of the Conf. on ProgrammingLanguage DesignandImplementation, pages212–223,Montreal,Canada,June1998.

[56] M. Gerla,P. Palnati,andS.Walton.MulticastingProtocolsfor High-Speed,Wormhole-RoutingLocal Area Networks. In Proc. of the 1996Conf. onCommunicationsArchitectures,Protocols,andApplications(SIGCOMM),pages184–193,StanfordUniversity, CA, August1996.

[57] R.B.Gillett. MemoryChannelfor PCI. IEEEMicro, 15(1):12–18,February1996.

[58] S.C.Goldstein,K.E. Schauser, andD.E.Culler. LazyThreads:Implement-ing a FastParallel Call. Journal of Parallel and DistributedComputing,37(1):5–20,August1996.

[59] G.H. GolubandC. F. vanLoan. Matrix Computations. TheJohnHopkinsUniversityPress,3rd edition,1996.

[60] J. Gosling, B. Joy, and G. Steele. The Java Language Specification.Addison-Wesley, Reading,MA, 1996.


[61] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance,PortableImplementationof the MPI MessagePassingInterfaceStandard.Parallel Computing, 22(6):789–828,September1996.

[62] The Pandagroup. ThePanda4.0 InterfaceDocument. Dept. of Mathe-maticsandComputerScience,Vrije Universiteit,April 1998. On-line athttp://www.cs.vu.nl/panda/panda4/panda.html.

[63] M.D. Haines. An Open Implementation Analysis and Design ofLightweightThreads.In Conf. on Object-OrientedProgrammingSystems,LanguagesandApplications, pages229–242,Atlanta,GA, October1997.

[64] M.D. HainesandK.G. Langendoen.Platform-IndependentRuntimeOp-timizationsUsing OpenThreads.In 11th Int. Parallel ProcessingSymp.,pages460–466,Geneva,Switzerland,April 1997.

[65] H.P. Heinzle, H.E. Bal, and K.G. Langendoen. ImplementingObject-BasedDistributedSharedMemory on Transputers.In TransputerAppli-cationsand Systems’94, pages390–405,Villa Erba, Cernobbio,Como,Italy, September1994.

[66] W.C.Hsieh,K.L. Johnson,M.F. Kaashoek,D.A. Wallach,andW.E.Weihl.Efficient Implementationof High-Level Languageson User-Level Com-municationArchitectures.TechnicalReportMIT/LCS/TR-616,MIT Lab-oratoryfor ComputerScience,May 1994.

[67] W.C. Hsieh,M.F. Kaashoek,andW.E. Weihl. DynamicComputationMi-grationin DSM Systems.In Supercomputing’96, Pitssburgh,PA, Novem-ber1996.

[68] Y. HuangandP.M. McKinley. Efficient Collective Operationswith ATMNetwork InterfaceSupport. In Proc. of the 1996 Int. Conf. on ParallelProcessing, volumeI, pages34–43,Bloomingdale,IL, August1996.

[69] G. Iannello,M. Lauria,andS.Mercolino. Cross-PlatformAnalysisof FastMessagesfor Myrinet. In Proc. of the Workshopon Communication,Ar-chitecture, andApplicationsfor Network-basedParallel Computing(LNCS1362), pages217–231,LasVegas,NV, January1998.

[70] Intel Corporation.PentiumPro FamilyDeveloper’sManual, 1995.

[71] Intel Corporation. PentiumPro Family Developer’s Manual, Volume3:OperatingSystemWriter’sGuide, 1995.


[72] SPARC International.TheSPARCArchitectureManual: Version8, 1992.

[73] K.L. Johnson.High-PerformanceAll-Software DistributedShared Mem-ory. PhDthesis,MIT Laboratoryfor ComputerScience,Cambridge,USA,December1995.TechnicalReportMIT/LCS/TR-674.

[74] K.L. Johnson,M.F. Kaashoek,andD.A. Wallach.CRL: High-PerformanceAll-SoftwareDistributedSharedMemory. In Proc. of the 15th Symp.onOperatingSystemsPrinciples, pages213–226,CopperMountain,CO,De-cember1995.

[75] M.F. Kaashoek.GroupCommunicationin DistributedComputerSystems.PhDthesis,Vrije UniversiteitAmsterdam,1992.

[76] V. KaramchetiandA.A. Chien.A Comparisonof ArchitecturalSupportforMessagingin theTMC CM-5 andtheCrayT3D. In Proc.of the22ndInt.Symp.onComputerArchitecture, pages298–307,SantaMargheritaLigure,Italy, June1995.

[77] R.M. Karp, A. Sahay, E.E.Santos,andK.E. Schauser. OptimalBroadcastandSummationin theLogPModel. In Proc.of the1993Symp.onParallelAlgorithmsandArchitectures, pages142–153,Velen,Germany, June1993.

[78] P. Keleher, A.L. Cox, S. Dwarkadas,andW. Zwaenepoel.TreadMarks:DistributedSharedMemoryon StandardWorkstationsandOperatingSys-tems.In Proc.of theWinter 1994UsenixConf., pages115–131,SanFran-cisco,CA, January1994.

[79] D. Keppel. Register Windows and User SpaceThreadson the SPARC.TechnicalReportUWCSE91-08-01,Dept.of ComputerScienceandEn-gineering,Universityof Washington,Seattle,WA, August1991.

[80] R. Kesavan and D.K. Panda. Optimal Multicast with PacketizationandNetwork InterfaceSupport. In Proc. of the 1997 Int. Conf. on ParallelProcessing, pages370–377,Bloomingdale,IL, August1997.

[81] C.H.Koelbel,D.B. Loveman,R. Schreiber, JrG.L. Steele,andM.E. Zosel.TheHigh PerformanceFortran Handbook. MIT Press,Cambridge,MA,1994.

[82] L.I. KontothanassisandM.L. Scott. UsingMemory-MappedNetwork In-terfaceto ImprovethePerformanceof DistributedSharedMemory. In Proc.of the2nd Int. Symp.on High-PerformanceComputerArchitecture, pages166–177,SanJose,CA, February1996.


[83] A. Krishnamurthy, K.E. Schauser, C.J.Scheiman,R.Y. Wang,D.E. Culler,andK. Yelick. Evaluationof ArchitecturalSupportfor Global Address-BasedCommunicationin Large-ScaleParallel Machines. In Proc. of the7th Int. Conf. on Architectural Supportfor ProgrammingLanguagesandOperatingSystems, pages37–48,Cambridge,MA, October1996.

[84] V. Kumar, A. Grama,A. Gupta,andG. Karypsis. Introductionto ParallelComputing:DesignandAnalysisof Algorithms. TheBenjamin/CummingsPublishingCompany, Inc.,RedwoodCity, CA, 1994.

[85] J.Kuskin,D. Ofelt, M. Heinrich,J.Heinlein,R. Simoni,K. Gharachorloo,J.Chapin,D. Nakahira,J.Baxter, M. Horowitz, A. Gupta,M. Rosenblum,andJ.Hennessy. TheStanfordFLASH Multiprocessor.In Proc.of the21stInt. Symp.on ComputerArchitecture, pages302–313,Chicago,IL, April1994.

[86] L. Lamport. How to Make a MultiprocessorComputerThatCorrectlyEx-ecutesMultiprocessPrograms.IEEE Trans.on Computers, C-28(9):690–691,September1979.

[87] K. Langendoen,R.A.F. Bhoedjang,andH.E. Bal. AutomaticDistributionof SharedDataObjects.In Third Workshopon Languages,Compilers andRun-TimeSystemsfor ScalableComputers, Troy, NY, May 1995.

[88] K.G. Langendoen,R.A.F. Bhoedjang,and H.E. Bal. Models for Asyn-chronousMessageHandling. IEEE Concurrency, 5(2):28–37,April–June1997.

[89] K.G. Langendoen,J. Romein,R.A.F. Bhoedjang,andH.E. Bal. Integrat-ing Polling, Interrupts,and ThreadManagement. In The 6th Symp.ontheFrontiersof MassivelyParallel Computation, pages13–22,Annapolis,MD, October1996.

[90] M. LauriaandA.A. Chien. MPI-FM: High PerformanceMPI on Worksta-tion Clusters.Journal of Parallel andDistributedComputing, 40(1):4–18,January1997.

[91] C.E.Leiserson,Z.S.Abuhamdeh,D.C.Douglas,C.R.Feynman,M.N. Gan-mukhi, J.V. Hill, W.D. Hillis, B.C. Kuszmaul,M.A. St.Pierre,D.S.Wells,M.C. Wong-Chan,S.-W. Yang,andR. Zak. TheNetwork Architectureofthe ConnectionMachineCM-5. In Proc. of the 1992Symp.on ParallelAlgorithmsandArchitectures, pages272–285,SanDiego,CA, June1992.


[92] K. Li andP. Hudak. Memory Coherencein SharedVirtual Memory Sys-tems.ACM Trans.onComputerSystems, 7(4):321–359,November1989.

[93] C. Liao, D. Jiang,L. Iftode, M. Martonosi,andD.W. Clark. MonitoringSharedVirtual Memory Performanceon a Myrinet-basedPC Cluster. InSupercomputing’98, Orlando,FL, November1998.

[94] J. Liedtke. On Micro-Kernel Construction. In Proc. of the 15th Symp.on Operating SystemsPrinciples, pages237–250,CopperMountain,CO,December1995.

[95] T. Lindholm and F. Yellin. The Java Virtual Machine Specification.Addison-Wesley, Reading,MA, 1997.

[96] P. Lopez,J.M. Martınez,andJ.Duato. A Very Efficient DistributedDead-lock DetectionMechanismfor WormholeNetworks. In Proc. of the 4thInt. Symp.onHigh-PerformanceComputerArchitecture, pages57–66,LasVegas,NV, February1998.

[97] H. Lu, S.Dwarkadas,A.L. Cox,andW. Zwaenepoel.QuantifyingthePer-formanceDifferencesbetweenPVM andTreadMarks.Journal of ParallelandDistributedComputing, 43(2):65–78,June1997.

[98] J. MaassenandR. van Nieuwpoort. FastParallel Java. Master’s thesis,Dept.of MathematicsandComputerScience,Vrije Universiteit,Amster-dam,TheNetherlands,August1998.

[99] J.Maassen,R.vanNieuwpoort,R.Veldema,H.E.Bal, andA. Plaat.An Ef-ficient Implementationof Java’s RemoteMethodInvocation. In 7th Symp.on PrinciplesandPracticeof Parallel Programming, pages173–182,At-lanta,GA, May 1999.

[100] A.M. Mainwaring,B.N. Chun,S. Schleimer, andD.S.Wilkerson.SystemAreaNetwork Mapping.In Proc.of the1997Symp.onParallel AlgorithmsandArchitectures, pages116–126,Newport,RI, June1997.

[101] O. Maquelin, G.R. Gao, H.H.J. Hum, K.B. Theobald, and X. Tian. Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling. In Proc. of the 23rd Int. Symp. on Computer Architecture, pages 179–188, Philadelphia, PA, May 1996.

[102] R.P. Martin, A.M. Vahdat, D.E. Culler, and T.E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proc. of the 24th Int. Symp. on Computer Architecture, pages 85–97, Denver, CO, June 1997.


[103] M. Martonosi,D. Ofelt, andM. Heinrich. IntegratingPerformanceMon-itoring andCommunicationin Parallel Computers. In Proc. of the 1996Conf. on Measurementand Modeling of ComputerSystems(SIGMET-RICS), pages138–147,Philadelphia,PA, May 1996.

[104] H. McGhanandM. O’Connor. PicoJava: A Direct ExecutionEngineforJavaBytecode.IEEE Computer, 31(10):22–30,October1998.

[105] E. Mohr, D.A. Kranz, andR.H. Halstead. Lazy TaskCreation: A Tech-niquefor IncreasingtheGranularityof ParallelPrograms.IEEE Trans.onParallel andDistributedSystems, 2(3):264–280,July1991.

[106] D. MosbergerandL.L. Peterson.CarefulProtocolsor How to UseHighlyReliableNetworks. In Proc.of theFourthWorkshoponWorkstationOper-atingSystems, pages80–84,Napa,CA, October1993.

[107] S.S.Mukherjee,B. Falsafi,M.D. Hill, andD.A. Wood. CoherentNetworkInterfacesfor Fine-GrainCommunication.In Proc. of the23rd Int. Symp.onComputerArchitecture, pages247–258,Philadelphia,PA, May 1996.

[108] S.S.MukherjeeandM.D. Hill. TheImpactof DataTransferandBufferingAlternativeson Network InterfaceDesign. In Proc. of the 4th Int. Symp.on High-PerformanceComputerArchitecture, pages207–218,LasVegas,NV, February1998.

[109] G.Muller, R. Marlet,E.-N.Volanschi,C. Consel,C. Pu,andA. Goel.Fast,OptimizedSunRPCUsing AutomaticProgramSpecialization. In Proc.of the18th Int. Conf. on DistributedComputingSystems, pages240–249,Amsterdam,TheNetherlands,May 1998.

[110] B. Nichols, B. Buttlar, and J. Proulx Farrell. PthreadsProgramming.O’Reilly & Associates,Inc.,Newton,MA, 1996.

[111] M. Oey, K. Langendoen,andH.E.Bal. ComparingKernel-SpaceandUser-SpaceCommunicationProtocolsonAmoeba.In Proc.of the15thInt. Conf.on DistributedComputingSystems, pages238–245,Vancouver, Canada,May 1995.

[112] S. Pakin, V. Karamcheti,andA.A. Chien. FastMessages(FM): Efficient,PortableCommunicationfor WorkstationClustersandMassively ParallelProcessors.IEEE Concurrency, 5(2):60–73,April–June1997.

[113] S. Pakin, M. Lauria, andA.A. Chien. High PerformanceMessagingonWorkstations:Illinois FastMessages(FM) for Myrinet. In Supercomputing’95, SanDiego,CA, December1995.


[114] D. Perkovic andP.J.Keleher. Responsivenesswithout Interrupts.In Proc.of theInt. Conf. onSupercomputing, pages101–108,Rhodes,Greece,June1999.

[115] L.L. PetersonandB.S.Davie. ComputerNetworks:A SystemsApproach.MorganKaufmannPublishers,Inc.,SanFrancisco,CA, 1996.

[116] L.L. Peterson,N. Hutchinson,S. O’Malley, andH. Rao. Thex-kernel: APlatformfor AccessingInternetResources.IEEE Computer, 23(5):23–33,May 1990.

[117] M. PhilippsenandM. Zenger. JavaParty— TransparentRemoteObjectsinJava.Concurrency:PracticeandExperience, pages1225–1242,November1997.

[118] L. Prylli andB. Tourancheau.ProtocolDesignfor High PerformanceNet-working: A Myrinet Experience.TechnicalReport97-22,LIP-ENSLyon,July1997.

[119] S.K. Reinhardt,J.R.Larus,andD.A. Wood. TempestandTyphoon:User-Level SharedMemory. In Proc.of the21stInt. Symp.on ComputerArchi-tecture, pages325–337,Chicago,IL, April 1994.

[120] S.H.Rodrigues,T.E.Anderson,andD.E.Culler. High-PerformanceLocal-Area CommunicationWith Fast Sockets. In USENIX Technical Conf.,pages257–274,Anaheim,CA, January1997.

[121] J.W. Romein,H.E Bal, andD. Grune. An Application DomainSpecificLanguagefor DescribingBoardGames.In Parallel and DistributedPro-cessingTechniquesandApplications, volumeI, pages305–314,LasVegas,NV, July1997.

[122] J.W. Romein,A. Plaat,H.E. Bal, andJ. Schaeffer. TranspositionDrivenWork Schedulingin Distributed Search. In AAAI National Conference,pages725–731,Orlando,FL, July1999.

[123] T. Ruhl andH.E. Bal. A PortableCollective CommunicationLibrary us-ing CommunicationSchedules.In Proc. of the 5th Euromicro Workshopon Parallel and DistributedProcessing, pages297–306,London,UnitedKingdom,January1997.

[124] T. Ruhl, H.E. Bal, R. Bhoedjang,K.G. Langendoen,andG. Benson.Ex-periencewith a PortabilityLayer for ImplementingParallelProgrammingSystems. In The1996Int. Conf. on Parallel and DistributedProcessing


Techniquesand Applications, pages1477–1488,Sunnyvale, CA, August1996.

[125] D.J. Scales,K. Gharachorloo,andC.A. Thekkath. Shasta:A Low Over-head,Software-OnlyApproachfor SupportingFine-GrainSharedMemory.In Proc. of the 7th Int. Conf. on Architectural Supportfor ProgrammingLanguagesandOperating Systems, pages174–185,Cambridge,MA, Oc-tober1996.

[126] I. SchoinasandM.D. Hill. AddressTranslationMechanismsin NetworkInterfaces.In Proc. of the4th Int. Symp.on High-PerformanceComputerArchitecture, pages219–230,LasVegas,NV, February1998.

[127] J.T. Schwartz. Ultracomputers.ACM Trans.on ProgrammingLanguagesandSystems, 2(4):484–521,October1980.

[128] R. Sivaram,R. Kesavan,D.K. Panda,andC.B. Stunkel. Whereto ProvideSupportfor Efficient Multicasting in Irregular Networks: Network Inter-faceor Switch? In Proc. of the 1998Int. Conf. on Parallel Processing,pages452–459,Minneapolis,MN, August1998.

[129] M. Snir, P. Hochschild,D.D. Frye,andK.J. Gildea. The CommunicationSoftwareandParallelEnvironmentof theIBM SP2.IBM SystemsJournal,34(2):205–221,1995.

[130] E. Speight,H. Abdel-Shafi,andJ.K. Bennett. Realizingthe PerformancePotentialof theVirtual InterfaceArchitecture.In Proc.of theInt. Conf. onSupercomputing, pages184–192,Rhodes,Greece,June1999.

[131] E.SpeightandJ.K.Bennett.UsingMulticastandMultithreadingto ReduceCommunicationin SoftwareDSM Systems.In Proc.of the4th Int. Symp.on High-PerformanceComputerArchitecture, pages312–323,LasVegas,NV, February1998.

[132] E. Spertus,S.C.Goldstein,K.E. Schauser, T. von Eicken,D.E. Culler, andW.J.Dally. Evaluationof Mechanismsfor Fine-GrainedParallelProgramsin theJ-MachineandtheCM-5. In Proc.of the20thInt. Symp.onComputerArchitecture, pages302–313,SanDiego,CA, May 1993.

[133] P.A. Steenkiste.A SystematicApproachto HostInterfaceDesignfor High-SpeedNetworks. IEEE Computer, 27(3):47–57,March1994.

[134] D. Stodolsky, J.B.Chen,andB. Bershad.FastInterruptPriority Manage-mentin OperatingSystems.In Proc.of theUSENIXSymp.onMicrokernels


and Other Kernel Architectures, pages105–110,SanDiego, September1993.

[135] C.B.Stunkel, J.Herring,B. Abali, andR.Sivaram.A New SwitchChipforIBM RS/6000SPSystems.In Supercomputing’99, Portland,OR,Novem-ber1999.

[136] C.B.Stunkel, R. Sivaram,andD.K. Panda.ImplementingMultidestinationWorms in Switch-basedParallel Systems:ArchitecturalAlternativesandtheir Impact. In Proc. of the 24th Int. Symp.on ComputerArchitecture,pages50–61,Denver, CO,June1997.

[137] V.S. Sunderam.PVM: A Framework for ParallelDistributedComputing.Concurrency:PracticeandExperience, 2(4):315–339,December1990.

[138] A.S. Tanenbaum.ComputerNetworks. PrenticeHall PTR,UpperSaddleRiver, NJ,3rd edition,1996.

[139] A.S. Tanenbaum,R. van Renesse,H. van Staveren,G.J.Sharp,S.J.Mul-lender, A.J. Jansen,andG. van Rossum. Experienceswith the AmoebaDistributedOperatingSystem.Communicationsof theACM, 33(2):46–63,December1990.

[140] H. Tang,K. Shen,andT. Yang. Compile/Run-Time Supportfor ThreadedMPI Executionon MultiprogrammedSharedMemory Machines. In 7thSymp.on Principles and Practice of Parallel Programming, pages107–118,Atlanta,GA, May 1999.

[141] H. Tezuka,A. Hori, Y. Ishikawa, andM. Sato. PM: An OperatingSys-tem CoordinatedHigh-PerformanceCommunicationLibrary. In High-PerformanceComputingand Networking(LNCS1225), pages708–717,Vienna,Austria,April 1997.

[142] H. Tezuka,F. O’Carroll, A. Hori, andY. Ishikawa. Pin-down Cache:AVirtual MemoryManagementTechniquefor Zero-copy Communication.In12th Int. Parallel ProcessingSymp., pages308–314,Orlando,FL, March1998.

[143] C.A. ThekkathandH.M. Levy. HardwareandSoftwareSupportfor Effi-cient ExceptionHandling. In Proc. of the 6th Int. Conf. on ArchitecturalSupportfor ProgrammingLanguagesandOperating Systems, pages110–119,SanJose,CA, October1994.


[144] R. van Renesse,K.P. Birman, and S. Maffeis. Horus, a Flexible GroupCommunicationSystem.Communicationsof theACM, 39(4):76–83,April1996.

[145] R. Veldema.Jcc,aNativeJavaCompiler. Master’s thesis,Dept.of Mathe-maticsandComputerScience,Vrije Universiteit,Amsterdam,TheNether-lands,August1998.

[146] K. Verstoep,K.G. Langendoen,andH.E. Bal. Efficient ReliableMulticastonMyrinet. In Proc.of the1996Int. Conf. onParallel Processing, volumeIII, pages156–165,Bloomingdale,IL, August1996.

[147] T. vonEicken,A. Basu,V. Buch,andW. Vogels.U-Net: A User-Level Net-work Interfacefor ParallelandDistributedComputing.In Proc.of the15thSymp.onOperatingSystemsPrinciples, pages303–316,CopperMountain,CO,December1995.

[148] T. von Eicken, D.E. Culler, S.C. Goldstein,and K.E. Schauser. ActiveMessages:a Mechanismfor IntegratedCommunicationandComputation.In Proc.of the19thInt. Symp.on ComputerArchitecture, pages256–266,GoldCoast,Australia,May 1992.

[149] T. von EickenandW. Vogels. Evolution of theVirtual InterfaceArchitec-ture. IEEE Computer, 31(11):61–68,November1998.

[150] D.A. Wallach,W.C.Hsieh,K.L. Johnson,M.F. Kaashoek,andW.E.Weihl.Optimistic Active Messages:A Mechanismfor SchedulingCommunica-tion with Computation.In 5thSymp.onPrinciplesandPracticeof ParallelProgramming, pages217–226,SantaBarbara,CA, July1995.

[151] H. Wasserman,O.M. Lubeck,Y. Luo,andF. Bassetti.PerformanceEvalua-tion of theSGIOrigin2000:A Memory-CentricCharacterizationof LANLASCI Applications.In Supercomputing’97, SanJose,CA, November1997.

[152] A. Wollrath, J.Waldo,andR. Riggs. Java-CentricDistributedComputing.IEEEMicro, 17(3):44–53,May/June1997.

[153] S.C.Woo, M. Ohara,E. Torrie, J.P. Singh,andA. Gupta. TheSPLASH-2Programs:CharacterizationandMethodologicalConsiderations.In Proc.of the 22nd Int. Symp.on ComputerArchitecture, pages24–36, SantaMargheritaLigure, Italy, June1995.

[154] K. Yocum,J.Chase,A. Gallatin,andA. Lebeck.Cut-ThroughDelivery inTrapeze:An Exercisein Low-Latency Messaging.In The6th Int. Symp.on


High PerformanceDistributedComputing, pages243–252,Portland,OR,August1997.

[155] Y. Zhou,L. Iftode,andK. Li. PerformanceEvaluationof Two Home-BasedLazy Release-Consistency Protocolsfor SharedVirtual MemorySystems.In 2ndUSENIXSymp.on Operating SystemsDesignandImplementation,pages75–88,Seattle,WA, October1996.


Appendix A

Deadlock issues

As discussed in Section 4.2, LFC's basic multicast protocol cannot use arbitrary multicast trees. With chains, for example, we can create a buffer deadlock (see Figure 4.8). Below, in Section A.1, we derive, informally, a sufficient condition for deadlock-free multicasting in LFC. The binary trees used by LFC satisfy this condition. The condition applies only to multicasting within a single multicast group. The example in Section A.2 shows that the condition does not guarantee deadlock-free multicasting in overlapping multicast groups.

A.1 Deadlock-Free Multicasting in LFC

Before deriving a condition for deadlock-free multicasting, we first consider the nature of deadlocks in LFC. Recall that LFC partitions each NI's receive buffer space among all possible senders (see Section 4.1). That is, each NI has a separate receive buffer pool for each sender.

A deadlock in LFC consists of a cycle C of NIs such that for each NI S with successor R (both in C):

1. S has a nonempty blocked-sends queue BSQ(S→R) for R, and

2. R has a full NI receive buffer pool RBP(S→R) for S.

We assume that hosts are not involved in the deadlock cycle. This is true only if all hosts regularly drain the network by supplying free host receive buffers to their NI. This requirement, however, is part of the contract between LFC and its clients (see Section 3.6). If all hosts drain the network, the full receive pools in the deadlock cycle do not contain packets that are waiting (only) for a free host receive buffer. Consequently, each packet p in a full receive pool RBP(S→R) is waiting to be forwarded. That is, p is a multicast packet that has travelled from S to R and is enqueued on at least one blocked-sends queue involved in a deadlock cycle (not necessarily C). Finally, observe that if RBP(S→R) is part of deadlock cycle C, then at least one of the packets in RBP(S→R) has to be enqueued on the blocked-sends queue to the next NI in C (i.e., to the successor of R).

We now formulate a sufficient condition for deadlock-free multicasting in LFC. We assume the system consists of P NIs, numbered 0, ..., P - 1. The 'distance' from NI m to NI n is defined as (n + P - m) mod P: the number of hops needed to reach n from m if all NIs are organized in a unidirectional ring. In a multicast group G the set of multicast trees T_G (one per group member) cannot yield deadlock if the following condition holds:

For each member g in G, it holds that, for each tree t in T_G, the distance from g's parent to g in t is smaller than each of the distances from g to its children in t.

To see why this is true, assume that the condition is satisfied and that we can construct a deadlock cycle C of length k. According to the nature of deadlocks in LFC, each NI n_i in C holds at least one multicast packet p_i from its predecessor in C that needs to be forwarded to its successor in C. According to the condition, each NI in the cycle must therefore have a larger distance to its successor than to its predecessor. That is, if we denote the deadlock cycle

    n_0 \xrightarrow{d_0} n_1 \xrightarrow{d_1} \cdots \xrightarrow{d_{k-1}} n_0

where d_i is the distance from n_i to n_{i+1} (indices modulo k), then we must have

    d_0 < d_1 < \cdots < d_{k-1} < d_0

which is impossible. Using this condition, we can see why LFC's binary trees are deadlock-free.

In the binary tree for node 0, it is clear that each node's distances to its children are larger than the distance to the node's parent (see for example Figure 4.9). The multicast tree for a node p != 0 is obtained by renumbering each tree node n in the multicast tree of node 0 to (n + p) mod P, where P is the number of processors. This renumbering preserves the distances between nodes, so the condition applies for all trees. Consequently, LFC's binary trees are deadlock-free.
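The condition and the distance function are easy to check mechanically. With P = 16 NIs, for example, the distance from NI 5 to NI 2 is (2 + 16 - 5) mod 16 = 13. The self-contained program below verifies the condition for all P renumbered trees of an assumed heap-shaped binary tree (children of node i are 2i + 1 and 2i + 2, rooted at the sender); this layout is an assumption for illustration and is not necessarily the exact tree shape used by LFC. For this assumed layout the check reports no violations, which matches the argument above.

/* Check the deadlock-freedom condition: in every multicast tree, the
 * distance from a node's parent to the node must be smaller than the
 * distance from the node to each of its children. */
#include <stdbool.h>
#include <stdio.h>

#define P 16                                        /* number of NIs */

static int dist(int m, int n) { return (n + P - m) % P; }  /* ring distance */

/* parent[] encodes one multicast tree; parent[root] == root. */
static bool tree_ok(const int parent[P], int root)
{
    for (int child = 0; child < P; child++) {
        int node = parent[child];                   /* child's parent       */
        if (node == child)                          /* skip the root itself */
            continue;
        if (parent[node] != node &&                 /* node has a parent    */
            dist(parent[node], node) >= dist(node, child))
            return false;
    }
    return true;
}

int main(void)
{
    int parent[P];

    /* Tree for sender s: renumber the heap-shaped tree of node 0 by adding
     * s modulo P, which preserves all distances. */
    for (int s = 0; s < P; s++) {
        for (int i = 0; i < P; i++) {
            int rel = (i + P - s) % P;              /* position in node-0 tree */
            int par = (rel == 0) ? 0 : (rel - 1) / 2;
            parent[i] = (par + s) % P;
        }
        if (!tree_ok(parent, s))
            printf("condition violated for sender %d\n", s);
    }
    printf("checked %d trees\n", P);
    return 0;
}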

Previous versions of LFC used binomial instead of binary trees [16].1 Figure A.1 shows a 16-node binomial tree. In binomial trees the distance (as defined above) between any pair of nodes is always a power of two. Moreover, and in contrast with binary trees, binomial trees have the property that the distance between a node and its parent is always greater than the distances between a node and its children. The condition for deadlock-free multicasting in LFC can easily be adapted to apply to trees that have the latter property. Using this adapted version, we can show that binomial multicast trees are also deadlock-free.

Fig. A.1. A binomial spanning tree.

1 This paper ([16]) contains two errors. First, it incorrectly states that forwarding in binomial trees is equivalent to e-cube routing, which is deadlock-free [43]. This is true for some senders (e.g., node 0), but not for all. Second, the paper assumes that binary trees are not deadlock-free.

A.2 Overlapping Multicast Groups

Without deadlock recovery, overlapping multicast groups are unsafe. Consider a 9-node system with three multicast groups G1 = {0, 3, 4, 5, 6}, G2 = {0, 3, 6, 7, 8}, and G3 = {0, 1, 2, 3, 6}. Within each group, we use binary multicast trees. These trees are constructed as if the nodes within each group were numbered 0, ..., 4. Figure A.2 shows the trees for node 0 in G1, for node 3 in G2, and for node 6 in G3. The bold arrows in each tree indicate a forwarding path that overlaps with a forwarding path in another tree. By concatenating these paths we can create a deadlock cycle.


Fig. A.2. Multicast trees for overlapping multicast groups.


Appendix B

Abbreviations

ADC    Application Device Channel
ADI    Application Device Interface
AM     Active Messages
API    Application Programming Interface
ASCI   Advanced School for Computing and Imaging
ATM    Asynchronous Transfer Mode
BB     Broadcast/Broadcast
BIP    Basic Interface for Parallelism
CRL    C Region Library
DAS    Distributed ASCI Supercomputer
DMA    Direct Memory Access
DSM    Distributed Shared Memory
F&A    Fetch and Add
FM     Fast Messages
GSB    Get Sequence number, then Broadcast
JIT    Just In Time
JVM    Java Virtual Machine
LFC    Link-level Flow Control
MG     Multigame
MPI    Message Passing Interface
MPICH  MPI Chameleon
MPL    Message-Passing Library
MPP    Massively Parallel Processor
NI     Network Interface
NIC    Network Interface Card
OT     OpenThreads


PB     Point-to-point/Broadcast
PIO    Programmed I/O
PPS    Parallel Programming System
RMI    Remote Method Invocation
RPC    Remote Procedure Call
RSR    Remote Service Request
RTS    Run Time System
SAP    Service Access Point
TCP    Transmission Control Protocol
TLB    Translation Lookaside Buffer
UDP    User Datagram Protocol
UTLB   User-managed TLB
VI     Virtual Interface
VMMC   Virtual Memory-Mapped Communication
WWW    World Wide Web


Samenvatting

Parallel programs let a number of computers work simultaneously on a compute-intensive problem, with the goal of solving that problem faster than is possible with a single computer. Since it is difficult to make computers cooperate efficiently and correctly, systems have been developed that simplify writing and running parallel programs. We call such systems parallel-programming systems.

Different parallel-programming systems offer different programming paradigms and are used in different hardware environments. This thesis concentrates on programming systems for computer clusters. Such a cluster consists of a collection of computers connected by a network. Most clusters consist of tens of computers, but clusters of hundreds and even thousands of computers exist as well. Characteristic of a computer cluster is that both the computers and the network are relatively cheap due to mass production, and that the computers communicate with each other by exchanging messages over the network rather than through a shared memory.

The exchange of messages is supported by communication software. To a large extent, this software determines the performance of the parallel programs developed with a parallel-programming system. This thesis, entitled "Communication Architectures for Parallel-Programming Systems", focuses on this communication software and addresses the following questions:

1. Which mechanisms should a communication system provide?

2. How should parallel-programming systems use the mechanisms provided?

3. How should these mechanisms be implemented in the communication system?

With respect to these questions, this thesis makes the following contributions:


1. It shows that a small set of simple communication mechanisms can be used effectively to implement several parallel-programming systems efficiently. These communication mechanisms have been implemented in a new communication system, LFC, which is described in more detail later in this summary. Some of these mechanisms assume support from the network interface. The network interface is the computer component that connects the computer to the network. Modern network interfaces are often programmable and can therefore perform a number of communication tasks that are traditionally carried out by the computer itself.

2. It describes an efficient and reliable multicast algorithm. Multicast is one of the communication mechanisms mentioned above and plays an important role in several parallel-programming systems. It is a form of communication in which one computer sends a message to several other computers.

On networks that do not support multicast in hardware, multicast is implemented by forwarding. The sender first sends a multicast message sequentially to a (usually) small number of computers; each of those computers receives the message, processes it, and forwards it to other computers, and so on. This way of forwarding is reasonably efficient, because the message is processed and forwarded by several computers at the same time. As described, however, the method is not optimal, because every forwarding computer copies the message back to the network interface. This thesis describes an algorithm that uses the network interface to forward multicast messages more efficiently: as soon as the network interface receives a multicast message, it autonomously forwards that message to other network interfaces and also copies it to the computer. The receiving computer is no longer involved in forwarding the message.

3. It systematically investigates the influence of different divisions of protocol tasks between the computer and the network interface on the performance of parallel programs. The evaluation concentrates on whether or not to use the network interface when implementing reliable communication and when forwarding multicast messages. An interesting result is that certain divisions of work improve the performance of some parallel programs but degrade the performance of other programs. The influence of a particular division of work turns out to depend partly on the communication patterns that a parallel-programming system uses.


Chapter 1 of this thesis describes the research questions and contributions mentioned above in detail.

Chapter 2 gives an overview of existing, efficient communication systems. Almost without exception, these are systems that can be accessed directly by user processes, that is, without intervention of the operating system. Although eliminating the operating system leads to good performance in simple tests, none of these communication systems fully satisfies the requirements posed by parallel-programming systems. In particular, many of these communication systems do not support asynchronous message handling or do not support multicast; both are important in parallel-programming systems. Another important observation is that the existing communication systems differ strongly in the way they divide work between the computer and the network interface. Some systems keep the work on the network interface to a minimum, because the network interface's processor is generally slow. Other systems, in contrast, run complete communication protocols on the network interface. Chapter 8 of this thesis systematically investigates the influence of different divisions of work.

Chapter 3 describes the design and implementation of LFC, a new communication system for parallel-programming systems. LFC supports five parallel-programming systems. Although these programming systems pose strongly different communication requirements, all of them have been implemented efficiently on top of LFC.

LFC offers the user a simple, low-level communication interface. This interface consists of functions to send network packets reliably to one or more other computers. These packets can be received both synchronously and asynchronously. The interface also contains a simple but useful synchronization function (fetch-and-add).

The implementation of the interface makes aggressive use of a programmable network interface. Chapter 4 describes in detail the protocols and tasks that LFC executes on the network interface:

- preventing buffer overflow on behalf of reliable communication (flow control);

- forwarding multicast messages;

- delayed generation of interrupts;

- handling synchronization messages.

LFC assumes that the network hardware neither corrupts nor loses packets, but that is not sufficient to guarantee reliable communication between processes. It is also necessary that receivers of messages have enough buffer space to store incoming messages. To guarantee that, LFC implements a simple flow-control protocol on the network interface.

LFC lets the network interface forward multicast messages without intervention of the computer. This turns out to be implementable as a simple extension of the flow-control protocol mentioned above.

LFC allows users to receive messages synchronously, by letting a computer actively test whether a message has arrived (polling), or asynchronously, by letting the network interface generate an interrupt. A computer can test quickly whether a message has come in, but handling an interrupt is usually slow. If a message is expected, polling is therefore more efficient than using interrupts. If it is not known when the next message will arrive, polling is generally expensive, because most polls will not find any messages. Since interrupts are expensive, LFC lets the network interface generate an interrupt only after a short delay. In this way the receiving computer gets the chance to read a message by polling and to avoid the interrupt.

LFC offers a simple synchronization function: fetch-and-add. This function atomically reads and increments the value of a shared integer variable. LFC stores fetch-and-add variables in the memory of the network interface and lets the network interface handle incoming fetch-and-add requests autonomously.

Chapter 5 evaluates the performance of LFC. The tasks that LFC executes on the network interface are relatively simple, but because the processor of a network interface is generally slow, it is not a priori clear whether it is wise to execute all these tasks on the network interface. The performance measurements in Chapter 5 show, however, that LFC performs excellently, despite the aggressive use of the relatively slow network interface.

Chapter 6 describes Panda, a communication system that offers a higher-level interface than LFC. Panda is implemented on top of LFC and offers the user threads, message passing, Remote Procedure Call (RPC), and group communication. These abstractions often simplify the implementation of a parallel-programming system.

Panda uses knowledge present in the thread system to decide automatically whether incoming messages should be received by means of polling or interrupts. Normally Panda uses interrupts, but as soon as all threads are blocked, Panda switches to polling. Panda treats each incoming message as a byte stream and allows a receiving process to start reading this byte stream before the message has been received in its entirety. Finally, Panda implements totally-ordered group communication using a simple but effective combination of LFC's synchronization and broadcast functions.

Chapter 7 describes four parallel-programming systems that have been implemented using LFC and Panda. Orca, CRL, and Manta are object-based systems that allow processes on different computers to share objects and to communicate through those shared objects. MPI, in contrast, is based on explicitly sending messages between processes on different computers (message passing). Orca and Manta are language-based systems (based on Orca and Java, respectively). CRL and MPI are not supported by a compiler.

Although Orca, CRL, and Manta offer comparable programming abstractions, they implement those abstractions in different ways. Orca replicates some objects and transports the parameters of operations on shared objects via RPC and group communication. Manta does not replicate objects and uses only RPC to transport operation parameters. CRL replicates objects, but uses invalidation to keep the values of replicated objects consistent. Moreover, CRL does not send operation parameters, but object data. Orca, Manta, and MPI have all been implemented using Panda; CRL is implemented directly on top of LFC.

The chapter shows that all these systems can be implemented efficiently on LFC and Panda. In the case of Orca, a relatively complex system, two optimizations are introduced to this end. First, the Orca compiler generates specialized code for marshaling and unmarshaling messages that contain operation parameters. This avoids unnecessary copying of data and unnecessary interpretation of type descriptions. Second, Orca uses continuations instead of threads to represent blocked operations on objects compactly. Previous Orca implementations used threads instead of continuations. Threads, however, take up more space and can be activated in unpredictable orders.

Chapter 8 describes and evaluates other implementations of LFC's communication interface. In total, five implementations are studied, including the LFC implementation described in Chapters 3 through 5. These implementations differ in the way they divide protocol tasks between the computer and the network interface, in particular tasks related to reliability and multicast. The chapter concentrates on the influence of these different divisions of work on the performance of parallel programs. Because all LFC implementations implement the same communication interface, this influence can be studied without modifying the parallel-programming systems.

Chapter 8 studies the behavior of nine parallel programs. These programs use Orca, CRL, MPI, and Multigame. Orca, CRL, and MPI have already been discussed above. Multigame is a declarative parallel-programming system that searches game trees automatically and in parallel. Performance measurements show that the LFC implementation described in Chapters 3 through 5 always yields the best parallel-program performance. This LFC implementation assumes that the network hardware is reliable and uses the network interface to forward multicast messages. The other implementations assume that the network hardware is unreliable and therefore implement a more expensive reliability protocol. Some implementations execute this protocol on the network interface, others on the computer.

An interesting outcome of the measurements is that executing more work on the network interface improves the performance of some parallel programs, but degrades the performance of other programs. In particular, using the network interface to implement reliability can work out both positively and negatively. On the other hand, it always appears beneficial to let multicast messages be forwarded by the network interface (and not by the computer). The influence of a division of work turns out to depend partly on the communication patterns that a parallel-programming system uses. For communication patterns that contain many relatively small RPC transactions, increasing the workload of the network interface is unfavorable. For asynchronous communication patterns, on the other hand, it is beneficial to offload the computer and shift work to the network interface.

Chapter 9 concludes this dissertation. The work presented shows that a communication system with a simple interface can efficiently support a variety of parallel-programming systems. Such a communication system must offer interrupts, polling, and multicast. The network adapter plays an important role in the implementation of the communication system; multicast, in particular, can be implemented efficiently with the adapter's help. The performance of parallel applications, however, is determined not only by the communication system, but also by communication libraries such as Panda and by the parallel-programming systems themselves. This dissertation describes several techniques that make these layers perform well too, and it relates properties of communication systems and of parallel-programming systems to the performance of parallel programs.


Curriculum Vitae

Personal data

Name: Raoul A.F. Bhoedjang

Born: 14 December 1970, The Hague, The Netherlands

Nationality: Dutch

Address: Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853–7501, USA

Email: [email protected]

WWW: http://www.cs.cornell.edu/raoul/

Education

November 1992: Master’s degree in Computer Science (cum laude), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

August 1988: Athenaeum degree, Han Fortmann College, Heerhugowaard, The Netherlands


Professional Experience

Aug. 1999 – present: Researcher, Dept. of Computer Science, Cornell University, Ithaca, NY, USA

July 1998 – July 1999: Researcher, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

July 1994 – July 1998: Graduate student, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Feb. 1994 – June 1994: Programmer, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Dec. 1993 – Jan. 1994: Teaching assistant (Compiler Construction), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Nov. 1993: Visiting student, Dept. of Computer Science, Cornell University, Ithaca, NY, USA

Sept. 1993 – Oct. 1993: Visiting student, Laboratory for Computer Science, MIT, Cambridge, MA, USA

Feb. 1993 – Sept. 1993: Teaching assistant (Operating Systems), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

July 1992 – Sept. 1992: Summer Scholarship student, Edinburgh Parallel Computing Centre, Scotland, UK


Jan. 1992 – June 1992: Exchange student, Computing Dept., Lancaster University, Lancaster, UK

Sept. 1991 – Dec. 1991: Teaching assistant (Introduction to Programming), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands


Publications

Journal publications

1. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

2. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, T. Ruhl, and M.F. Kaashoek. Performance Evaluation of the Orca Shared Object System. ACM Trans. on Computer Systems, 16(1):1–40, February 1998.

3. H.E. Bal, K.G. Langendoen, R.A.F. Bhoedjang, and F. Breg. Experience with Parallel Symbolic Applications in Orca. J. of Programming Languages, 1998.

4. K.G. Langendoen, R.A.F. Bhoedjang, and H.E. Bal. Models for Asynchronous Message Handling. IEEE Concurrency, 5(2):28–37, April–June 1997.

5. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, and K. Verstoep. Performance of a High-Level Parallel Language on a High-Speed Network. J. of Parallel and Distributed Computing, 40(1):49–64, February 1997.

Conference publications

1. T. Kielmann, R.F.H. Hofman, H.E. Bal, A. Plaat, and R.A.F. Bhoedjang. MAGPIE: MPI’s Collective Communication Operations for Clustered Wide Area Systems. In Conf. on Principles and Practice of Parallel Programming (PPoPP’99), pages 131–140, Atlanta, GA, May 1999.

2. T. Kielmann, R.F.H. Hofman, H.E. Bal, A. Plaat, and R.A.F. Bhoedjang. MPI’s Reduction Operations in Clustered Wide Area Systems. In Message Passing Interface Developer’s and User’s Conference (MPIDC’99), pages 43–52, Atlanta, GA, March 1999.


3. R.A.F. Bhoedjang, J.W. Romein, and H.E. Bal. Optimizing Distributed Data Structures Using Application-Specific Network Interface Software. In Int. Conf. on Parallel Processing, pages 485–492, Minneapolis, MN, August 1998.

4. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. Efficient Multicast on Myrinet Using Link-Level Flow Control. In Int. Conf. on Parallel Processing, pages 381–390, Minneapolis, MN, August 1998. (Best paper award).

5. K.G. Langendoen, J. Romein, R.A.F. Bhoedjang, and H.E. Bal. Integrating Polling, Interrupts, and Thread Management. In The 6th Symp. on the Frontiers of Massively Parallel Computation (Frontiers ’96), pages 13–22, Annapolis, MD, October 1996.

6. T. Ruhl, H.E. Bal, R. Bhoedjang, K.G. Langendoen, and G. Benson. Experience with a Portability Layer for Implementing Parallel Programming Systems. In The 1996 Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pages 1477–1488, Sunnyvale, CA, August 1996.

7. R.A.F. Bhoedjang and K.G. Langendoen. Friendly and Efficient Message Handling. In Proc. of the 29th Annual Hawaii Int. Conf. on System Sciences, pages 121–130, Maui, HI, January 1996.

8. K.G. Langendoen, R.A.F. Bhoedjang, and H.E. Bal. Automatic Distribution of Shared Data Objects. In B. Szymanski and B. Sinharoy, editors, Languages, Compilers and Run-Time Systems for Scalable Computers, pages 287–290, Troy, NY, May 1995. Kluwer Academic Publishers.

9. H.E. Bal, K.G. Langendoen, and R.A.F. Bhoedjang. Experience with Parallel Symbolic Applications in Orca. In T. Ito, R.H. Halstead, Jr., and C. Queinnec, editors, Parallel Symbolic Languages and Systems (PSLS’95), number 1068 in Lecture Notes in Computer Science, pages 266–285, Beaune, France, October 1995.

10. R.A.F. Bhoedjang, T. Ruhl, R. Hofman, K.G. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A Portable Platform to Support Parallel Programming Languages. In Proc. of the USENIX Symp. on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), pages 213–226, San Diego, CA, September 1993.


Technical reports

1. R. Veldema, R.A.F. Bhoedjang, and H.E. Bal. Distributed Shared Memory Management for Java. In 6th Annual Conf. of the Advanced School for Computing and Imaging, Lommel, Belgium, June 2000.

2. R. Veldema, R.A.F. Bhoedjang, and H.E. Bal. Jackal: A Compiler-Based Implementation of Java for Clusters of Workstations. In 5th Annual Conf. of the Advanced School for Computing and Imaging, pages 348–355, Heijen, The Netherlands, June 1999.

3. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. LFC: A Communication Substrate for Myrinet. In 4th Annual Conf. of the Advanced School for Computing and Imaging, pages 31–37, Lommel, Belgium, June 1998.

4. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, T. Ruhl, and M.F. Kaashoek. Orca: a Portable User-Level Shared Object System. Technical Report IR-408, Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands, July 1996.

5. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, and K. Verstoep. Performance of a High-Level Parallel Language on a High-Speed Network. Technical Report IR-400, Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands, 1996.

6. K.G. Langendoen, J. Romein, R.A.F. Bhoedjang, and H.E. Bal. Integrating Polling, Interrupts, and Thread Management. In 2nd Annual Conf. of the Advanced School for Computing and Imaging, pages 168–173, Lommel, Belgium, June 1996.

7. R.A.F. Bhoedjang and K.G. Langendoen. User-Space Solutions to Thread Switching Overhead. In 1st Annual Conf. of the Advanced School for Computing and Imaging, pages 397–406, Heijen, The Netherlands, May 1995.
