

Communication Architectures for Parallel-Programming Systems

Raoul A.F. Bhoedjang


Advanced School for Computing and Imaging

This work was carried out in graduate school ASCI. ASCI dissertation series number 52.

Copyright © 2000 Raoul A.F. Bhoedjang.


VRIJE UNIVERSITEIT

Communication Architectures for Parallel-Programming Systems

ACADEMISCH PROEFSCHRIFT

to obtain the degree of doctor at the Vrije Universiteit te Amsterdam, by authority of the rector magnificus prof.dr. T. Sminia, to be defended in public before the doctoral committee of the Faculty of Exact Sciences / Mathematics and Computer Science on Tuesday, 6 June 2000, at 15.45, in the main building of the university, De Boelelaan 1105

by

Raoul Angelo Felicien Bhoedjang

born in 's-Gravenhage


Promotoren: prof.dr.ir. H.E. Bal, prof.dr. A.S. Tanenbaum


Preface

Personal Acknowledgements

First and foremost, I am grateful to my thesis supervisors, Henri Bal and Andy Tanenbaum, for creating an outstanding research environment. I thank them both for their confidence in me and my work.

I am very grateful to all members, past and present, of Henri's Orca team: Saniya Ben Hassen, Rutger Hofman, Ceriel Jacobs, Thilo Kielmann, Koen Langendoen, Jason Maassen, Rob van Nieuwpoort, Aske Plaat, John Romein, Tim Ruhl, Ronald Veldema, Kees Verstoep, and Greg Wilson.

Henri does a great job running this team and working with him has been a great pleasure. His ability to get a job done, his punctuality, and his infinite reading capacity are greatly appreciated — by me, but by many others as well. Above all, Henri is a very nice person.

Tim Ruhl has been a wonderful friend and colleague throughout my years at the VU and I owe him greatly. We have spent countless hours hacking away at various projects, meanwhile discussing research and personal matters. Only a few of our pet projects made it into papers, but the papers that we worked on together are the best that I have (co-)authored.

I would like to thank John Romein for his work on Multigame (see Chapter 8), for teaching me about parallel game-tree searching, for his expert advice on computer hardware, for keeping the DAS cluster in good shape, and for generously donating peanuts when I needed them.

Koen Langendoen is the living proof of the rather disturbing observation that there is no significant correlation between a programmer's coding style and the correctness of his code. Koen is a great hacker and taught me a thing or two about persistence. I want to thank him for his work on Orca, Panda, a few other projects, and for his work as a member of my thesis committee.

The Orca group is blessed with the support of three highly competent research programmers: Rutger Hofman, Ceriel Jacobs, and Kees Verstoep. All three have contributed directly to the work described in this thesis.

I wish to thank Rutger for his work on Panda 4.0, for reviewing papers, for many useful discussions about communication protocols, and for lots of good espresso. I also apologize for all the boring microbenchmarks he performed so willingly.

There are computer programmers and there are artists. Ceriel, no doubt, belongs to the artists. I want to thank him for his fine work on Orca and Panda and for reviewing several of my papers.

Kees's work has been instrumental in completing the comparison of LFC implementations described in Chapter 8. During my typing diet, he performed a tremendous amount of work on these implementations. Not only did he write three of the five LFC implementations described in this thesis, but he also put many hours into measuring and analyzing the performance of various applications. Meanwhile, he debugged Myrinet hardware and kept the DAS cluster alive. Finally, Kees reviewed some of my papers and provided many useful comments on this thesis. It has been a great pleasure to work with him.

With a 128-node cluster of computers in your backyard, it is easy to forget about your desktop machine... until you cannot reach your home directory. Fortunately, such moments have been scarce. Many thanks to the department's system administrators for keeping everything up and running. I am especially grateful to Ruud Wiggers, for quota of various kinds and great service.

I wish to thank Matty Huntjens for starting my graduate career by recommending me to Henri Bal. I greatly appreciate the way he works with the department's undergraduate students; the department is lucky to have such people.

I ate many dinners at the University's 'restaurant.' I will refrain from commenting on the food, but I did enjoy the company of Rutger Hofman, Philip Homburg, Ceriel Jacobs, and others.

My thesis committee — prof. dr. Zwaenepoel, prof. dr. Kaashoek, dr. Langendoen, dr. Epema, and dr. van Steen — went through some hard times; my move to Cornell rather reduced my pages-per-day rate. I would like to thank them all for reading my thesis and for their suggestions for improvement. I am particularly grateful to Frans Kaashoek, whose comments and suggestions considerably improved the quality of this thesis.

I would also like to thank Frans for his hospitality during some of my visits to the US. I especially enjoyed my two-month visit to his research group at MIT. On that same occasion, Robbert van Renesse invited me to Cornell, my current employer. He, too, has been most hospitable on many occasions, and I would like to thank him for it.

The Netherlands Organization for Scientific Research (N.W.O.) supported part of my work financially through Henri Bal's Pionier grant.

Many of the important things I know, my parents taught me. I am sure my father would have loved to attend my thesis defense. My mother has been a great example of strength in difficult times.

Finally, I wish to thank Daphne for sharing her life with me. Most statements concerning two theses in one household are true. I am glad we survived.

Per-Chapter Contributions

The following paragraphs summarize the contributions made by others to the papers and software that each chapter of this thesis is based on. I would like to thank all for the time and effort they have invested.

Chapter 2's survey is based on an article co-authored by Tim Ruhl and Henri Bal [18].

Tim Ruhl and I designed and implemented the LFC communication system described in Chapters 3, 4, and 5. John Romein ported LFC to Linux. LFC is described in two papers co-authored by Tim Ruhl and Henri Bal; it is also one of the systems classified in the survey mentioned above.

Panda 4.0, described in Chapter 6, was designed by Rutger Hofman, Tim Ruhl, and me. Rutger Hofman implemented Panda 4.0 on LFC from scratch, except for the threads package, OpenThreads, which was developed by Koen Langendoen. Earlier versions of Panda are described in two conference publications [19, 124]. The comparison of upcall models is based on an article by Koen Langendoen, me, and Henri Bal [88]. Panda's dynamic switching between polling and interrupts is described in a conference paper by Koen Langendoen, John Romein, me, and Henri Bal [89].

Chapter 7 discusses the implementation of four parallel-programming systems: Orca, Manta, CRL, and MPICH. Orca and Manta were both developed in the computer systems group at the Vrije Universiteit; CRL and MPICH were developed elsewhere.

Koen Langendoen and I implemented the continuation-based version of the Orca runtime system on top of Panda 2.0. The use of continuations in the runtime system is described in a paper co-authored by Koen Langendoen [14]. Ceriel Jacobs wrote the Orca compiler and ported the runtime system to Panda 3.0. Rutger Hofman ported the continuation-based runtime system to Panda 4.0. An extensive description and performance evaluation of the Orca system on Panda 3.0 are given in a journal paper by Henri Bal, me, Rutger Hofman, Ceriel Jacobs, Koen Langendoen, Tim Ruhl, and Frans Kaashoek [9].

Manta was developed by Jason Maassen, Rob van Nieuwpoort, and Ronald Veldema. Their work is described in a conference paper [99]. Rob helped me with the Manta performance measurements presented in Section 7.2.

CRL was developed at MIT by Kirk Johnson [73, 74]; I ported CRL to LFC. It could not have been much easier, because CRL is a really nice piece of software.

MPICH [61] is being developed jointly by Argonne National Laboratory and Mississippi State University. Rutger Hofman ported MPICH to Panda 4.0.

The comparison of different LFC implementations described in Chapter 8 was done in cooperation with Kees Verstoep, Tim Ruhl, and Rutger Hofman. Tim and I implemented the NrxImc and HrxHmc versions of LFC. Kees implemented IrxImc, IrxHmc, and NrxHmc and performed many of the measurements presented in Chapter 8. Rutger helped out with various Panda-related issues.

The parallel applications analyzed in Chapter 8 were developed by many persons. The Puzzle application and the parallel-programming system that it runs on, Multigame [122], were written by John Romein. Awari was developed by Victor Allis, Henri Bal, Hans Staalman, and Elwin de Waal. The linear equation solver (LEQ) was developed by Koen Langendoen. The MPI applications were written by Rutger Hofman (QR and ASP) and Thilo Kielmann (SOR). The CRL applications Barnes-Hut (Barnes), Fast Fourier Transform (FFT), and radix sort (radix) are all based on shared-memory programs from the SPLASH-2 suite [153]. Barnes-Hut was adapted for CRL by Kirk Johnson, FFT by Aske Plaat, and radix by me.


Contents

1 Introduction
   1.1 Parallel-Programming Systems
   1.2 Communication architectures for parallel-programming systems
   1.3 Problems
      1.3.1 Data-Transfer Mismatches
      1.3.2 Control-Transfer Mismatches
      1.3.3 Design Alternatives for User-Level Communication Architectures
   1.4 Contributions
   1.5 Implementation
      1.5.1 LFC Implementations
      1.5.2 Panda
      1.5.3 Parallel-Programming Systems
   1.6 Experimental Environment
      1.6.1 Cluster Position
      1.6.2 Hardware Details
   1.7 Thesis Outline

2 Network Interface Protocols
   2.1 A Basic Network Interface Protocol
   2.2 Data Transfers
      2.2.1 From Host to Network Interface
      2.2.2 From Network Interface to Network Interface
      2.2.3 From Network Interface to Host
   2.3 Address Translation
   2.4 Protection
   2.5 Control Transfers
   2.6 Reliability
   2.7 Multicast
   2.8 Classification of User-Level Communication Systems
   2.9 Summary

3 The LFC User-Level Communication System
   3.1 Programming Interface
      3.1.1 Addressing
      3.1.2 Packets
      3.1.3 Sending Packets
      3.1.4 Receiving Packets
      3.1.5 Synchronization
      3.1.6 Statistics
   3.2 Key Implementation Assumptions
   3.3 Implementation Overview
      3.3.1 Myrinet
      3.3.2 Operating System Extensions
      3.3.3 Library
      3.3.4 NI Control Program
   3.4 Packet Anatomy
   3.5 Data Transfer
   3.6 Host Receive Buffer Management
   3.7 Message Detection and Handler Invocation
   3.8 Fetch-and-Add
   3.9 Limitations
      3.9.1 Functional Limitations
      3.9.2 Security Violations
   3.10 Related Work
   3.11 Summary

4 Core Protocols for Intelligent Network Interfaces
   4.1 UCAST: Reliable Point-to-Point Communication
      4.1.1 Protocol Data Structures
      4.1.2 Sender-Side Protocol
      4.1.3 Receiver-Side Protocol
   4.2 MCAST: Reliable Multicast Communication
      4.2.1 Multicast Forwarding
      4.2.2 Multicast Tree Topology
      4.2.3 Protocol Data Structures
      4.2.4 Sender-Side Protocol
      4.2.5 Receiver-Side Protocol
      4.2.6 Deadlock
   4.3 RECOV: Deadlock Detection and Recovery
      4.3.1 Deadlock Detection
      4.3.2 Deadlock Recovery
      4.3.3 Protocol Data Structures
      4.3.4 Sender-Side Protocol
      4.3.5 Receiver-Side Protocol
   4.4 INTR: Control Transfer
   4.5 Related Work
      4.5.1 Reliability
      4.5.2 Interrupt Management
      4.5.3 Multicast
      4.5.4 FM/MC
   4.6 Summary

5 The Performance of LFC
   5.1 Unicast Performance
      5.1.1 Latency and Throughput
      5.1.2 LogGP Parameters
      5.1.3 Interrupt Overhead
   5.2 Fetch-and-Add Performance
   5.3 Multicast Performance
      5.3.1 Performance of the Basic Multicast Protocol
      5.3.2 Impact of Deadlock Recovery and Multicast Tree Shape
   5.4 Summary

6 Panda
   6.1 Overview
      6.1.1 Functionality
      6.1.2 Structure
      6.1.3 Panda's Main Abstractions
   6.2 Integrating Multithreading and Communication
      6.2.1 Multithread-Safe Access to LFC
      6.2.2 Transparently Switching between Interrupts and Polling
      6.2.3 An Efficient Upcall Implementation
   6.3 Stream Messages
   6.4 Totally-Ordered Broadcasting
   6.5 Performance
      6.5.1 Performance of the Message-Passing Module
      6.5.2 Performance of the Broadcast Module
      6.5.3 Other Performance Issues
   6.6 Related Work
      6.6.1 Portable Message-Passing Libraries
      6.6.2 Polling and Interrupts
      6.6.3 Upcall Models
      6.6.4 Stream Messages
      6.6.5 Totally-Ordered Broadcast
   6.7 Summary

7 Parallel-Programming Systems
   7.1 Orca
      7.1.1 Programming Model
      7.1.2 Implementation Overview
      7.1.3 Efficient Operation Transfer
      7.1.4 Efficient Implementation of Guarded Operations
      7.1.5 Performance
   7.2 Manta
      7.2.1 Programming Model
      7.2.2 Implementation
      7.2.3 Performance
   7.3 The C Region Library
      7.3.1 Programming Model
      7.3.2 Implementation
      7.3.3 Performance
   7.4 MPI — The Message Passing Interface
      7.4.1 Implementation
      7.4.2 Performance
   7.5 Related Work
      7.5.1 Operation Transfer
      7.5.2 Operation Execution
   7.6 Summary

8 Multilevel Performance Evaluation
   8.1 Implementation Overview
      8.1.1 The Implementations
      8.1.2 Commonalities
   8.2 Reliable Point-to-Point Communication
      8.2.1 The No-Retransmission Protocol (Nrx)
      8.2.2 The Retransmitting Protocols (Hrx and Irx)
      8.2.3 Latency Measurements
      8.2.4 Window Size and Receive Buffer Space
      8.2.5 Send Buffer Space
      8.2.6 Protocol Comparison
   8.3 Reliable Multicast
      8.3.1 Host-Level versus Interface-Level Packet Forwarding
      8.3.2 Acknowledgement Schemes
      8.3.3 Deadlock Issues
      8.3.4 Latency and Throughput Measurements
   8.4 Parallel-Programming Systems
      8.4.1 Communication Style
      8.4.2 Performance Issues
   8.5 Application Performance
      8.5.1 Performance Results
      8.5.2 All-pairs Shortest Paths (ASP)
      8.5.3 Awari
      8.5.4 Barnes-Hut
      8.5.5 Fast Fourier Transform (FFT)
      8.5.6 The Linear Equation Solver (LEQ)
      8.5.7 Puzzle
      8.5.8 QR Factorization
      8.5.9 Radix Sort
      8.5.10 Successive Overrelaxation (SOR)
   8.6 Classification
   8.7 Related Work
      8.7.1 Reliability
      8.7.2 Multicast
      8.7.3 NI Protocol Studies
   8.8 Summary

9 Summary and Conclusions
   9.1 LFC and NI Protocols
   9.2 Panda
   9.3 Parallel-Programming Systems
   9.4 Performance Impact of Design Decisions
   9.5 Conclusions
   9.6 Limitations and Open Issues

A Deadlock Issues
   A.1 Deadlock-Free Multicasting in LFC
   A.2 Overlapping Multicast Groups

B Abbreviations

Samenvatting

Curriculum Vitae


List of Figures

1.1 Communication layers studied in this thesis.
1.2 Structure of this thesis.
1.3 Myrinet switch topology.
1.4 Switches and cluster nodes.
1.5 Cluster node architecture.

2.1 Operation of the basic protocol.
2.2 Host-to-NI throughput using different data transfer mechanisms.
2.3 Design decisions for reliability.
2.4 Repeated send versus a tree-based multicast.

3.1 LFC's packet format.
3.2 LFC's data transfer path.
3.3 LFC's send path and data structures.
3.4 Two ways to transfer the payload and status flag of a network packet.
3.5 LFC's receive data structures.

4.1 Protocol data structures for UCAST.
4.2 Sender side of the UCAST protocol.
4.3 Receiver side of the UCAST protocol.
4.4 Host-level and interface-level multicast forwarding.
4.5 Protocol data structures for MCAST.
4.6 Sender-side protocol for MCAST.
4.7 Receive procedure for the multicast protocol.
4.8 MCAST deadlock scenario.
4.9 A binary tree.
4.10 Subtree covered by the deadlock recovery algorithm.
4.11 Deadlock recovery data structures.
4.12 Multicast forwarding in the deadlock recovery protocol.
4.13 Packet transmission in the deadlock recovery protocol.
4.14 Receive procedure for the deadlock recovery protocol.
4.15 Packet release procedure for the deadlock recovery protocol.
4.16 Timeout handling in INTR.
4.17 LFC's and FM/MC's buffering strategies.

5.1 LFC's unicast latency.
5.2 LFC's unicast throughput.
5.3 Fetch-and-add latencies under contention.
5.4 LFC's broadcast latency.
5.5 LFC's broadcast throughput.
5.6 Tree used to determine multicast gap.
5.7 LFC's broadcast gap.
5.8 Impact of deadlock recovery on broadcast throughput.
5.9 Impact of tree topology on broadcast throughput.
5.10 Impact of deadlock recovery on all-to-all throughput.

6.1 Different broadcast delivery orders.
6.2 Panda modules: functionality and dependencies.
6.3 Two Panda header stacks.
6.4 Recursive invocation of an LFC routine.
6.5 RPC example.
6.6 Simple remote read with active messages.
6.7 Message handler with locking.
6.8 Three different upcall models.
6.9 Message handler with blocking.
6.10 Fast thread switch to an upcall thread.
6.11 Application-to-application streaming with Panda's stream messages.
6.12 Panda's message-header formats.
6.13 Receiver-side organization of Panda's stream messages.
6.14 Panda's message-passing latency.
6.15 Panda's message-passing throughput.
6.16 Panda's broadcast latency.
6.17 Panda's broadcast throughput.

7.1 Definition of an Orca object type.
7.2 Orca process creation.
7.3 The structure of the Orca shared object system.
7.4 Decision process for executing an Orca operation.
7.5 Orca definition of an array of records.
7.6 Specialized marshaling code generated by the Orca compiler.
7.7 Data transfer paths in Orca.
7.8 Orca with popup threads.
7.9 Orca with single-threaded upcalls.
7.10 The continuations interface.
7.11 Roundtrip latencies for Orca, Panda, and LFC.
7.12 Roundtrip throughput for Orca, Panda, and LFC.
7.13 Broadcast latency for Orca, Panda, and LFC.
7.14 Broadcast throughput for Orca, Panda, and LFC.
7.15 Roundtrip latencies for Manta, Orca, Panda, and LFC.
7.16 Roundtrip throughput for Manta, Orca, Panda, and LFC.
7.17 CRL read and write miss transactions.
7.18 Write miss latency.
7.19 Write miss throughput.
7.20 MPI unicast latency.
7.21 MPI unicast throughput.
7.22 MPI broadcast latency.
7.23 MPI broadcast throughput.

8.1 LFC unicast latency.
8.2 Peak throughput for different window sizes.
8.3 LFC unicast throughput.
8.4 Host-level and interface-level multicast forwarding.
8.5 LFC multicast latency.
8.6 LFC multicast throughput.
8.7 Application-level impact of efficient broadcast support.
8.8 Data and packet rates of NrxImc on 64 nodes.
8.9 Normalized application execution times.
8.10 Classification of communication systems for Myrinet.

A.1 A binomial spanning tree.
A.2 Multicast trees for overlapping multicast groups.


List of Tables

1.1 Classification of parallel-programming systems.
1.2 Communication characteristics of parallel-programming systems.

2.1 Throughputs on a Pentium Pro/Myrinet cluster.
2.2 Summary of design choices in existing communication systems.

3.1 LFC's user interface.
3.2 Resources used by the NI control program.
3.3 LFC's packet tags.

5.1 Values of the LogGP parameters for LFC.
5.2 Deadlock recovery statistics.

6.1 Send and receive interfaces of Panda's system module.
6.2 Classification of upcall models.

7.1 A comparison of Orca and Manta.
7.2 PPS performance summary.

8.1 Five versions of LFC's reliability and multicast protocols.
8.2 Division of work in the LFC implementations.
8.3 Parameters used by the protocols.
8.4 LogP parameter values (in microseconds) for a 16-byte message.
8.5 Fetch-and-add latencies.
8.6 Application characteristics and timings.
8.7 Classification of applications.


Chapter 1

Introduction

This thesis is about communication support for parallel-programming systems. A parallel-programming system (PPS) is a system that aids programmers who wish to exploit multiple processors to solve a single, computationally intensive problem. PPSs come in many flavors, depending on the programming paradigm they provide and the type of hardware they run on. This thesis concentrates on PPSs for clusters of computers. By a cluster we mean a collection of off-the-shelf computers interconnected by an off-the-shelf network. Clusters are easy to build, easy to extend, scale to large numbers of processors, and are relatively inexpensive.

The performance of a PPS depends critically on efficient communication mechanisms. At present, user-level communication systems offer the best communication performance. Such systems bypass the operating system on all critical communication paths so that user programs can benefit from the high performance of modern network technology. This thesis concentrates on the interactions between PPSs and user-level communication systems. We focus on three questions:

1. Which mechanisms should a user-level communication system provide?

2. How should parallel-programming systems use the mechanisms provided?

3. How should these mechanisms be implemented?

The relevance of these questions follows from the following observations. First, the functionality provided by user-level communication systems varies considerably (see Chapter 2). Systems differ in the types of data transfer primitives they offer (point-to-point message, multicast message, or remote-memory access), the way incoming data is detected (polling or interrupts), and the way incoming data is handled (explicit receive, upcall, or popup thread).

Second, most user-level communication systems provide low-level interfaces that often do not match the needs of the developers of a PPS. This is not a problem as long as higher-level abstractions can be layered efficiently on top of those interfaces. Unfortunately, many systems ignore this issue (see Section 1.3).

Third, where systems do provide similar functionality, they implement it in different ways (see Chapter 2). Many systems, for example, use a programmable network interface (NI) to execute part of the communication protocol. However, systems differ greatly in the type and amount of work that they off-load to the NI. Some systems minimize the amount of work performed on the NI — on the grounds that the NI processor is much slower than the host processor — while others run a complete reliability protocol on the NI.

This chapter proceeds as follows. Section 1.1 discusses PPSs in more detail and explains which types of PPSs we address exactly. Section 1.2 discusses user-level architectures and introduces the different layers of communication software we study. Section 1.3 introduces the specific problems that this thesis addresses. Section 1.4 states our contributions towards solving these problems. Section 1.5 discusses the software components in which we have implemented our ideas. Section 1.6 describes the environment in which we developed and evaluated our solutions. Finally, Section 1.7 describes the structure of the thesis.

1.1 Parallel-Programming Systems

The use of parallelism in and between computers is by now the rule rather than the exception. Within a modern processor, we find a pipelined CPU with multiple functional units, nonblocking caches, and write buffers. Surprisingly, this intraprocessor parallelism is largely hidden from the programmer by instruction set architectures that provide a more or less sequential programming model. Where parallelism does become visible (e.g., in VLIW architectures), compilers hide it once more from most application programmers. This way, application programmers have enjoyed advances in computer architecture without having to change the way they write their programs.

Programmers who want to exploit parallelism between processors have not been so fortunate. In specific application domains, or for specific types of programs, compilers can automatically detect and exploit parallelism of a sufficiently large grain to employ multiple computers efficiently. In most cases, however, programmers must guide compilers by means of annotations or write an explicitly parallel program. Unfortunately, writing efficient parallel programs is notoriously difficult. At the algorithmic level, programmers must find a compromise between locality and load balance. To achieve both, nontrivial techniques such as data replication, data migration, work stealing, load balancing, etc., may have to be employed. At a lower level, programmers must deal with race conditions, message interrupts, message ordering, and memory consistency.

To simplify the programmer's task, many parallel-programming systems have been developed, each of which implements its own programming model. The most important programming models are:

- Shared memory

- Message passing

- Data-parallel programming

- Shared objects

In the shared-memory model, multiple threads share an address space and communicate by writing and reading shared-memory locations. Threads synchronize by means of atomic memory operations and higher-level primitives built upon these atomic operations (e.g., locks and condition variables). Multiprocessors directly support this programming model in hardware.
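As an illustration of this model (a minimal sketch using POSIX threads, the multiprocessor implementation listed in Table 1.1; the example itself is not taken from this thesis), two threads communicate through a shared counter protected by a lock:

```c
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                       /* shared-memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    /* Communicate by updating shared memory under a lock. */
    pthread_mutex_lock(&lock);
    shared_counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);        /* prints 2 */
    return 0;
}
```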

In the message-passing model, processes communicate by exchanging messages. Implementations of this model (e.g., PVM [137] and MPI [53]) usually provide several flavors of send and receive methods which differ, for example, in when the sender may continue and how the receiver selects the next message to receive.
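For example, the following minimal MPI fragment (ours, for illustration only) lets process 0 send an integer to process 1 with a blocking send and a blocking receive:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: may return as soon as the buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: selects the next message by source and tag. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```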

In the shared-memory and message-passing models, the programmer has to deal with multiple threads or processes. A data-parallel program has only one thread of control and in many ways resembles a sequential program. For efficiency, however, the programmer provides annotations that indicate which loops can be executed in parallel and how data must be distributed.
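A small illustration of such annotations, written with OpenMP (one of the systems listed in Table 1.1); the example is ours and does not appear in the thesis. The pragma asserts that the loop iterations are independent and may be divided among threads:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    /* The annotation tells the compiler that these iterations may be
       executed in parallel; without OpenMP it is simply ignored. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```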

In the shared-object model [7], processes can share data, provided that this data is encapsulated in an object. The data is accessed by invoking methods of the object's class. While this model is not as widely used in practice as the previous models, it has received a fair amount of attention in the parallel-programming research community.


Programming model   Multiprocessor implementations   Cluster implementations
Shared memory       Pthreads [110]                   Shasta [125], Tempest [119], TreadMarks [78]
Message passing     MPI [53, 140]                    MPI [53, 61], PVM [137]
Data-parallel       OpenMP [42]                      HPF [81]
Shared objects      Java [60]                        Orca [7, 9], CRL [74], Java/RMI [99]

Table 1.1. Classification of parallel-programming systems.

Parallel-programming models used to be identified with a specific type of parallel hardware. Today, most vendors of parallel hardware support multiple programming models. The shared-memory model, for example, is a more natural match to a multiprocessor than the message-passing model, yet multiprocessor vendors also provide implementations of message-passing standards such as MPI.

At present, two types of hardware dominate both the commercial marketplace and parallel-programming research in academia: multiprocessors and workstation clusters. In a multiprocessor, the hardware implements a single, shared memory that can be accessed by all processors through memory access instructions. In a workstation cluster, each processor has its own memory. Access to another processor's memory is not transparent and can only be achieved by some form of explicit message passing.

All four programming models mentioned above have been implemented on both types of hardware (see Table 1.1). Implementations on cluster hardware tend to be more complex than multiprocessor implementations because efficient state sharing is more difficult on a cluster than on a multiprocessor. This thesis addresses only communication support for clusters. On clusters, models such as the shared-object model can be implemented only by software that hides the hardware's distributed nature.

All PPSs discussed in this thesis (except Multigame [122], which we discuss only briefly) require the programmer to write a parallel program. With some systems, however, this is easier than with others. The Orca shared-object system [7, 9], for example, is language-based. Programmers therefore need not write any code to marshal objects or parameters of operations because this code is either generated by the compiler or hidden in the language runtime system. In a library-based system such as MPI, in contrast, programmers must either write their own marshaling code or create type descriptors.

At a higher level, consider locality and load balancing. Distributed shared-memory systems like CRL [74] and Orca replicate data transparently to optimize read accesses, thus improving locality. Systems such as Cilk [55] and Multigame use a work-stealing runtime system that automatically balances the load among all processors. An MPI programmer who wishes to replicate a data item must do so explicitly, without any help from the system to keep the replicas consistent. Similarly, MPI programmers have to implement their own load balancing schemes.

The conveniences of high-level programming systems do not come for free. In a recent study [97], for example, Lu et al. show that parallel applications written for the TreadMarks [78] distributed shared-memory system (DSM) send more data and more messages than equivalent message-passing programs. The reasons are that TreadMarks cannot combine data transfer and synchronization in a single message transfer, cannot transfer data from multiple memory pages in a single message, and suffers from false sharing and diff accumulation. (Diff accumulation is specific to the consistency protocol used by TreadMarks. The result of diff accumulation is that processors communicate to obtain old versions of shared data that have already been overwritten by newer versions.)

Orca provides a second example. Although Orca's shared-object programming model is closer to message passing than TreadMarks's shared-memory model, Orca programs also send more messages than is strictly needed. The current implementation of Orca [9], for example, broadcasts operations on a replicated object to all processors, even if the object is replicated on only a few processors. Also, operations on nonreplicated objects are always executed by means of remote procedure calls. Each such operation results in a request and a reply message, even if no reply is needed.

These examples illustrate that, with high-level PPSs, application programmers no longer control the implementation of high-level functions and must rely on generic implementations. This results in increased message rates and increased message sizes. In addition, high-level PPSs add message-processing overheads in the form of marshaling and message-handler dispatch. Consequently, efficient communication support is of prime importance to these systems.


Fig. 1.1. Communication layers studied in this thesis. From top to bottom: parallel applications, PPS-specific software (compiler and/or runtime system), communication libraries, network interface protocols, and network hardware.

1.2 Communication architectures for parallel-programming systems

In many ways,software is themainobstacleon theroadto efficient commu-nication. Softwareoverheadcomesin many forms: systemcalls,copying, inter-rupts,threadswitches,etc. Part of thesoftwareproblemhasbeensolvedby theintroductionof user-levelcommunicationarchitectureswhichgiveuserprocessesdirect accessto the network. In mostuser-level communicationsystems,usersmustinitialize communicationsegmentsandchannelsby meansof systemcalls.After this initialization phase,however, kernelmediationis no longerneeded,orneededonly in exceptionalcases.

Communicationsupportfor aPPScanbedividedinto threelayersof software(seeFigure1.1). At the bottom,above the network hardware,we find networkinterfaceprotocols. Theseprotocolsperformtwo functions:they controlthenet-work deviceandimplementalow-level communicationabstractionthatis usedbyall higherlayers.

Since NI protocols are often low-level, most (but not all) PPSs use a communication library that implements message abstractions and higher-level communication primitives (e.g., remote procedure call).

The top layer implements a PPS's programming model. In general, this layer consists of a compiler and a runtime system. Not all PPSs are based on a programming language, however, so not all PPSs use a compiler. In principle, PPSs based on a programming language do not need a runtime system: a compiler can generate all code that needs to be executed. In practice, though, PPSs always use some runtime support.

Since communication events frequently cut through all these layers, application-level performance is determined by the way these layers cooperate. In particular, high performance at one layer is of no use if this layer offers an inconvenient interface to higher layers. Our goal in this thesis is to find mechanisms and interfaces that work well at all layers.

1.3 Problems

This thesis addresses three problems:

1. The data transfer methods provided by user-level communication systems often do not match the needs of PPSs.

2. The control transfer methods provided by user-level communication systems do not match the needs of PPSs.

3. Design alternatives for reliable point-to-point and multicast implementations for modern networks are poorly understood.

1.3.1 Data-Transfer Mismatches

Achieving efficient data transfer at all layers in a PPS is hard. A data transfer problem that has received much attention is the high cost of memory copies that take place as data travels from one layer to another. Since redundant memory copies decrease application-to-application throughput (see Chapter 2), much user-level communication work has focused on avoiding unnecessary copies [13, 24, 32, 49]. Most solutions involve the use of DMA transfers between the user's address space and the NI. Unfortunately, these solutions cannot always be used effectively by PPSs.

At the sending side, some systems (e.g., U-Net [147] and Hamlyn [32]) can transfer data efficiently (using DMA) when it is stored in a special send segment in the sender's host memory. For a specific application it may be feasible to store the data that needs to be communicated (in a certain period of time) in a send segment. A PPS, however, would have to do this for all applications. If this is not possible, senders must first copy their data into the send segment. With this extra copy, the advantage of a low-level high-speed data transfer mechanism is lost.

To improve performance at the receiving side, several systems (e.g., Hamlyn and SHRIMP [24]) offer an interface that allows a sender to specify a destination address in a receiver's address space. This allows the communication system to move incoming data directly from the NI to its destination address. A message-passing system, on the other hand, would first copy the data to some network buffer and would later, when the receiver specifies a destination address, copy the data to its final destination. This may appear inefficient, but in many cases the sender simply does not know the destination address of the data that needs to be sent. In such cases, higher layers (e.g., PPSs) are forced to negotiate a destination address before the actual data transfer takes place. Such a negotiation may introduce up to two extra messages per data transfer and is therefore expensive for small data transfers. (Alternatively, the sender can transfer the data to a default buffer with a known address and have the receiver copy the data from that buffer. Such a scheme can be extended so that the receiver can post a destination buffer that replaces the default buffer [49]. If the destination buffer is posted in time, no extra copy is needed.)

The need to avoid copying should be balanced against the cost of managing asynchronous data transfer mechanisms and negotiating destination addresses. Also, not all communication overhead results from a lack of bandwidth. In their study of Split-C applications [102], Martin et al. found that applications are sensitive to send and receive overhead, but tolerate lower bandwidths fairly well.

A data transfer problem that has received less attention, but that is important in many PPSs, is the efficient transfer of data from a single sender to multiple receivers. Most existing user-level communication systems focus on high-speed data transfers from a single sender to a single receiver and do not support multicast or support it poorly. This is unfortunate, because multicast plays an important role in many PPSs.

MPICH [61], a widely used implementation of MPI, for example, uses multicast in its implementation of collective communication operations (e.g., reduce, gather, scatter). Collective communication operations are used in many parallel algorithms. Efficient multicast primitives have also proved their value in the implementation of DSMs such as Orca [9, 75] and Brazos [131], which use multicast to update replicated data.

The lack of efficient multicast primitives in user-level communication systems forces PPSs (or underlying communication libraries) to implement their own multicast on top of point-to-point primitives. This is frequently inefficient. Naive implementations let the sender send a single message to each receiver, which turns the sender into a serial bottleneck. Smarter implementations are based on spanning trees. In these implementations, the sender transmits a message to a few nodes (its children), which then forward the message to their children, and so on until all nodes have been reached. This strategy allows the message to travel between different sender-receiver pairs in parallel.
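A minimal sketch of such tree-based forwarding (the helper names are ours; the multicast trees actually used in this thesis are discussed in Chapter 4): each node derives its children in a binary spanning tree rooted at the sender and forwards the packet to them with point-to-point sends.

```c
#include <stdio.h>

/* Stand-in for a point-to-point send; a real system would inject the
   packet into the network here. */
static void send_to(int dest, const void *buf, int len)
{
    (void)buf; (void)len;
    printf("forward to node %d\n", dest);
}

/* Forward a multicast packet along a binary spanning tree with 'nprocs'
   nodes, rooted at 'root'.  Ranks are relabeled so that the root becomes 0;
   relabeled node r then has children 2r+1 and 2r+2. */
static void mcast_forward(int me, int root, int nprocs,
                          const void *buf, int len)
{
    int rel = (me - root + nprocs) % nprocs;   /* rank relative to the root */
    for (int c = 2 * rel + 1; c <= 2 * rel + 2; c++)
        if (c < nprocs)
            send_to((c + root) % nprocs, buf, len);
}

int main(void)
{
    /* Node 0 starts an 8-node multicast: it forwards to nodes 1 and 2,
       which in turn forward to 3, 4 and 5, 6, and so on. */
    mcast_forward(0, 0, 8, "data", 4);
    return 0;
}
```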

Even a tree-based multicast can be inefficient if it is layered on point-to-point primitives. At the forwarding nodes, data travels up from the NI to the host. If the host does not poll, forwarding will either be delayed or the data will be delivered by means of an expensive interrupt. To forward the data, the host has to reinject the data into the network. This data transfer is unnecessary, because the data was already present on the NI.

1.3.2 Control-Transfer Mismatches

Control transfer has two components: detecting incoming messages and executing handlers for incoming messages. Most user-level communication systems allow their clients to poll for incoming messages, because receiving a message through polling is much more efficient than receiving it through an interrupt. The problem for a PPS is to decide when to poll, especially if the processing of incoming messages is not always tied to an explicit receive call by the application. Again, for a specific application this need not be a hard problem, but getting it right for all applications is difficult.

In DSMs, for example, processes do not interact with each other directly, but only through shared data items. When a process accesses a shared data item that is stored remotely, it typically sends an access request to the remote processor. The remote processor must respond to the request, even if none of its processes are accessing the data item. If the remote processor is not polling the network when the request arrives, the reply will be delayed. This can be solved by using interrupts, but these are expensive.

Once a message has been detected, it needs to be handled. An interesting problem is in which execution context messages should be handled. Allocating a thread to each message is conceptually clean, but potentially expensive, both in time and space. Moreover, dispatching messages to multiple threads can cause messages to be processed in a different order than they were received. Sometimes this is necessary; at other times, it should be avoided. Executing application code and message handlers in the same thread gives good performance, but can lead to deadlock when message handlers block.

1.3.3 Design Alternatives for User-Level Communication Architectures

The low-level communication protocols of user-level architectures differ substantially from traditional protocols such as TCP/IP. For example, many protocols now use the NI to implement reliability [37, 49], multicast [16, 56, 146], network mapping [100], performance monitoring [93, 103], and address translations and (remote) memory access [13, 24, 32, 51, 83, 126]. Current user-level communication systems make different design decisions in these areas. Specifically, different systems divide similar protocol tasks between the host and the NI in different ways. In some systems, for example, the NI is responsible for implementing reliable communication. In other systems, this task is left to the host processor.

The impact of different design choices on the performance of PPSs and applications has hardly been investigated. Such an investigation is difficult, because existing architectures implement different functionality and provide different programming interfaces. Moreover, many studies use only low-level microbenchmarks that ignore performance at higher layers [5].

1.4 Contributions

The main contributions of this thesis towards solving the problems presented in the previous section are as follows:

1. We show that a small set of simple, low-level communication mechanisms can be employed effectively to obtain efficient PPS implementations that do not suffer from the data and control-transfer mismatches described in the previous section. Some of these communication mechanisms rely on NI support (see below).

2. We have designed and implemented a new, efficient, and reliable multicast algorithm that uses the NI instead of the host to forward multicast traffic.

3. To gain insight into the tradeoffs involved in choosing between different NI protocol implementations, we have implemented a single user-level communication interface in multiple ways. These implementations differ in whether the host or the NI is responsible for reliability and multicast forwarding. Using these implementations, we have systematically investigated the impact of different design choices on the performance of higher-level systems (one communication library and multiple PPSs) and applications.

The communication mechanisms mentioned have been implemented in a new, user-level communication system called LFC. LFC and its mechanisms are summarized in Section 1.5 and described in detail in Chapters 3 to 5. On top of LFC we have implemented a communication library, Panda, that

- provides an efficient (stream) message abstraction

- transparently switches between using polling and interrupts to detect incoming messages

- implements an efficient totally-ordered broadcast primitive

Using LFC and Panda we have implemented and ported various PPSs (see Section 1.5). We describe how we modified some of these PPSs to benefit from LFC's and Panda's communication support.

1.5 Implementation

We have implemented our solutions in various communication systems (see Figure 1.2). These systems cover all communication layers shown in Figure 1.1: NI protocol, communication library, and PPS. The following section introduces the individual systems, layer by layer, bottom-up.

1.5.1 LFC Implementations

We have developed a new NI protocol called LFC [17, 18]. LFC provides reliable point-to-point and multicast communication and runs on Myrinet [27], a modern, switched, high-speed network.

LFC has a low-level, packet-based interface. By exposing network packets, LFC allows its clients to avoid most redundant memory copies. Packets can be received through polling or interrupts and clients can dynamically switch between the two. LFC delivers incoming packets by means of upcalls, as in Active Messages [148].
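The following sketch illustrates the upcall style of delivery in general terms; the names are hypothetical and LFC's real interface is listed in Table 3.1. The client registers a handler, and the communication layer invokes it once for every packet that arrives while the client polls.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical upcall-style receive interface, in the spirit of Active
   Messages: the client registers a handler that is called once per
   incoming network packet. */
typedef void (*packet_handler_t)(int src, void *pkt, size_t len);

static packet_handler_t handler;

static void register_handler(packet_handler_t h) { handler = h; }

/* In a real system this would drain the NI's receive queue; here we fake
   the arrival of a single packet so that the example runs. */
static void poll_network(void)
{
    static char fake_pkt[] = "hello";
    if (handler)
        handler(3, fake_pkt, sizeof(fake_pkt));
}

static void my_upcall(int src, void *pkt, size_t len)
{
    /* The packet buffer typically belongs to the communication system, so a
       client that needs the data later must copy it or keep the packet;
       the exact rules are system-specific. */
    printf("packet of %zu bytes from node %d: %s\n", len, src, (char *)pkt);
}

int main(void)
{
    register_handler(my_upcall);
    poll_network();   /* clients may also enable interrupt-driven delivery */
    return 0;
}
```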

Fig. 1.2. Structure of this thesis: parallel applications (Chapter 8); the parallel-programming systems Orca, MPI, CRL, and Parallel Java (Chapter 7) and Multigame (Chapter 8); the Panda communication library (Chapter 6); the LFC communication system (Chapters 3-5); and the Myrinet network (Chapter 1).

We have implemented LFC's point-to-point and multicast primitives in five different ways. The implementations differ in their reliability assumptions and in how they divide protocol work between the NI and the host. Chapters 3 to 5 describe the most aggressive of these implementations in detail. This implementation exploits Myrinet's programmable NI to implement the following:

1. Reliable point-to-point communication.

2. An efficient, reliable multicast primitive.

3. A mechanism to reduce interrupt overhead (a polling watchdog).

4. A fetch-and-add primitive for global synchronization (a small usage sketch follows this list).
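The sketch below shows one typical use of a fetch-and-add primitive, namely dynamic distribution of work over all processors. The interface and the purely local stand-in implementation are invented for illustration; LFC implements its primitive on the programmable NI (Section 3.8).

```c
#include <stdio.h>

/* Local stand-in for a cluster-wide fetch-and-add counter: atomically
   return the old value and add 'delta'.  In a real system the counter
   would live on (or be managed by) a network interface. */
static unsigned global_counter = 0;

static unsigned fetch_and_add(unsigned delta)
{
    unsigned old = global_counter;
    global_counter += delta;
    return old;
}

/* Dynamic work distribution: each processor repeatedly claims the next
   unclaimed job index from the single shared counter. */
static void process_all_jobs(unsigned njobs, void (*do_job)(unsigned))
{
    for (;;) {
        unsigned job = fetch_and_add(1);
        if (job >= njobs)
            break;
        do_job(job);
    }
}

static void print_job(unsigned job) { printf("job %u\n", job); }

int main(void)
{
    process_all_jobs(5, print_job);
    return 0;
}
```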

This implementation achieves reliable point-to-point communication by means of an NI-level flow control protocol that assumes that the network hardware is reliable. A simple extension of this flow control protocol allows us to implement an efficient, reliable, NI-level multicast. In this multicast implementation, packets need not travel to the host and back before they are forwarded. Instead, the NI recognizes multicast packets and forwards them to children in the multicast tree without host intervention.

The implementation described above performs much protocol work on the programmable NI and assumes that the network hardware does not drop or corrupt network packets. Our other implementations, described in Chapter 8, differ mainly in where and how they implement reliability and multicast forwarding. Our most conservative implementation, for example, assumes unreliable hardware and implements both reliability (including retransmission) and multicast forwarding on the host.


LFC has been used to implement or port several systems, including CRL [74], Fast Sockets [120], Parallel Java [99], MPICH [61], Multigame [122], Orca [9], Panda [19, 124], and TreadMarks [78]. Several of these systems are described in this thesis; they are introduced in the sections below.

1.5.2 Panda

Panda is a portable communication library that provides threads, messages, reliable message passing, Remote Procedure Call (RPC), and totally-ordered group communication. Using Panda, we have implemented various PPSs (see below).

To implement its abstractions efficiently, Panda exploits LFC's packet-based interface and its efficient multicast, fetch-and-add, and interrupt-management primitives. Panda uses LFC's packet interface to implement an efficient message abstraction which allows end-to-end pipelining of message data and which allows applications to delay message processing without copying message data. All incoming messages are handled by a single, dedicated thread, which solves part of the blocking-upcall problem. To automatically switch between polling and interrupts, Panda integrates thread management and communication such that the network is automatically polled when all threads are idle. Finally, Panda uses LFC's NI-level multicast and synchronization primitives to implement an efficient totally-ordered multicast primitive.

1.5.3 Parallel-Programming Systems

In this thesis, we use five PPSs to test our ideas: Orca, CRL, MPI, Manta, and Multigame.

Orca is a DSM system based on the notion of user-defined shared objects. Jointly, the Orca compiler and runtime system automatically replicate and migrate shared objects to improve locality. To perform operations on shared objects, Orca uses Panda's threads, RPC, and totally-ordered group communication.

Like Orca, CRL is a DSM system. CRL processes share chunks of memory which they can map into their address space. Applications must bracket their accesses to these regions by library calls so that the CRL runtime system can keep the shared regions in a consistent state. Unlike Orca, which updates replicated shared objects, CRL uses invalidation to maintain region-level coherence.

MPI is a message-passing standard. Parallel applications are often developed directly on top of MPI, but MPI is also used as a compiler target. Our implementation of MPI is based on MPICH [61] and Panda. MPICH is a portable and widely used public-domain implementation of MPI that is being developed jointly by Argonne National Laboratory and Mississippi State University.

PPS         Communication patterns   Message detection      # contexts
Orca        Roundtrip + broadcast    Polling + interrupts   1
CRL         Roundtrip                Polling + interrupts   0
MPI         One-way + broadcast      Polling                0
Manta       Roundtrip                Polling + interrupts   ≥ 1
Multigame   One-way                  Polling                0

Table 1.2. Communication characteristics of parallel-programming systems.

Manta is a PPS that allows Java [60] threads to communicate by invoking methods on shared objects, as in Orca. Superficially, Manta has simpler communication requirements than Orca, because Manta does not replicate shared objects and therefore requires only RPC-style communication. In reality, however, several features of Java that are not present in Orca — specifically, garbage collection and condition synchronization at arbitrary program points — lead to more complex interactions with the communication system.

Multigame is a declarative parallel game-playing system. Given the rules of a board game and a board evaluation function, Multigame automatically searches for good moves, using one of several search strategies (e.g., IDA* or alpha-beta). During a search, processors push search jobs to each other (using one-way messages). A job consists of (recursively) evaluating a board position. To avoid re-searching positions, positions are cached in a distributed hash table.

Not only do these PPSs cover a range of programming models, they also have different communication requirements. Table 1.2 summarizes the main communication characteristics of all five PPSs. (A more detailed discussion of these characteristics appears in Chapters 7 and 8.) The second column lists the most common type of communication pattern used in each PPS. The third column shows how each PPS detects incoming messages. All DSMs (Orca, CRL, and Manta) use a combination of polling and interrupts. The fourth column shows how many independent message-handler contexts can be active in each PPS. In CRL, MPI, and Multigame, all incoming messages are processed by handlers that run in the same context as the main computation (i.e., as procedure calls), so there are no independent handler contexts. In Orca, all handlers are run by a dedicated thread. Finally, Manta creates a new thread for each incoming message.


1.6 Experimental Environment

For most of the experiments described in this thesis, we use a cluster that consists of 128 computers, which are interconnected by a Myrinet [27] network. Before describing the hardware in detail, we first position our cluster architecture.

1.6.1 Cluster Position

The cluster used in this thesis is a compromise between a traditional supercomputer and LAN technology. Unlike a traditional supercomputer, the NI connects to the host's I/O bus and is therefore far away from the host processor. This is a typical organization for commodity hardware, but aggressive supercomputer architectures integrate the NI more tightly with the host system by placing it on the memory bus or integrating it with the cache controller. These architectures allow for very low network access latencies and simpler protection schemes [24, 85, 108]. In this thesis, however, we focus on architectures that use off-the-shelf hardware and rely on advanced software techniques to achieve efficient user-level network access. This type of architecture (i.e., with the NI residing on the host's I/O bus) is now also found in several commercial parallel machines (e.g., the IBM SP/2).

The cluster's network, Myrinet, is not as widely used as the ubiquitous Ethernet LAN. As a parallel-processing platform, however, Myrinet is used in many places, both in academia and industry.

Myrinet's NI contains a programmable, custom RISC processor and fast SRAM memory. The network consists of high-bandwidth links and switches; network packets are cut-through-routed through these links and switches. These features make Myrinet (and similar products) much more expensive than Ethernet, the traditional bus-based LAN technology. While Ethernet can be switched to obtain high bandwidth, most Ethernet NIs are not programmable.

Our main reason for using Myrinet is that its programmable NI enables experimentation with different protocols, mechanisms, and interfaces. This type of experimentation is not specific to Myrinet and has also been performed with other programmable NIs [34, 147]. The main problem is often the vendors' reluctance to release information that allows other parties to program their NIs. Myricom, the company that develops and sells Myrinet, does provide the documentation and tools that enable customers to program the NI.


Fig. 1.3. Myrinet switch topology.

Fig. 1.4. Switches and cluster nodes. Each node has its own network interface which connects to a crossbar switch. Switches connect to network interfaces and other switches.

1.6.2 Hardware Details

Each cluster node contains a single Siemens-Nixdorf D983 motherboard, an Intel 440FX PCI chipset, and an Intel Pentium Pro [70] processor. The nodes are connected via 32 8-port Myrinet switches, which are organized in a three-dimensional grid topology. Figure 1.3 shows the switch topology. Figure 1.4 shows one vertical plane of the switch topology. Each switch connects to 4 cluster nodes and to neighboring switches. (Figure 1.4 shows only the connections to switches in the same vertical plane.) The cluster nodes can also communicate through a Fast Ethernet network (not shown in Figures 1.3 and 1.4). Fast Ethernet is used for process startup, file transfers, and terminal traffic.

The architecture of a single cluster node is illustrated in Figure 1.5. Each node contains a single Pentium Pro processor and 128 Mbyte of DRAM. The Pentium Pro is a 200 MHz, three-way superscalar processor. It has two on-chip first-level caches: an 8 Kbyte, 2-way set-associative data cache and an 8 Kbyte, 4-way set-associative instruction cache. In addition, a unified, 256 Kbyte, 4-way set-associative second-level cache is packaged along with the processor core. The cache line size is 32 bytes. The processor-memory bus is 64 bits wide and clocked at 66 MHz. The I/O bus is a 33 MHz, 32-bit PCI bus. The Myrinet NI is attached to the I/O bus.

Fig. 1.5. Cluster node architecture. The host CPU, its cache, and host memory reside on the host bus, which a bridge connects to the PCI bus; the network interface, with its own CPU, memory, and DMA engines (host, send, and receive), is attached to the PCI bus.

To obtain accurate timings in our performance measurements, we use the Pentium Pro's 64-bit timestamp counter [71]. The timestamp counter is a clock with clock cycle (5 ns) resolution. A simple pseudo device driver makes this clock available to unprivileged users. The counter can be read (from user space) using a single instruction, so the use of this fine-grain clock imposes little overhead (approximately 0.17 µs).
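For illustration (GCC-style inline assembly; this code is not taken from the thesis), the timestamp counter can be read from user space as follows:

```c
#include <stdio.h>

/* Read the Pentium Pro's 64-bit timestamp counter with the RDTSC
   instruction; at 200 MHz, one tick corresponds to 5 ns. */
static inline unsigned long long read_tsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned long long t0 = read_tsc();
    unsigned long long t1 = read_tsc();
    printf("measurement overhead: %llu cycles\n", t1 - t0);
    return 0;
}
```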

Myrinet is a high-speed, switched LAN technology [27]. Switched technologies scale well to a large number of hosts, because bandwidth increases as more hosts and switches are added to the network. Unlike Ethernet, Myrinet provides no hardware support for multicast. General-purpose multicasting for wormhole-routed networks is a complex problem and an area of active research [136, 135].

Unlike traditional NIs, Myrinet's NI is programmable; it contains a custom processor which can be programmed in C or assembly. The processor, a 33 MHz LANai 4.1, is controlled by the LANai control program. The processor is effectively an order of magnitude slower than the 200 MHz, superscalar host processor.

The NI is equipped with 1 Mbyte of SRAM memory which holds both the code and the data for the control program. The currently available Myrinet NIs require that all network packets be staged through this memory, both at the sending and the receiving side. SRAM is fast, but expensive, so the amount of memory is relatively small. Other networks allow data to be transferred directly between host memory and the network (e.g., using memory-mapped FIFOs or DMA).

The NI contains three DMA engines. The host DMA engine transfers data between host memory and NI memory. Host memory buffers that are accessed by this DMA engine should be pinned (i.e., marked unswappable) to prevent the operating system from paging these pages to disk during a DMA transfer. The send DMA engine transfers packets from NI memory to the outgoing network link; the receive DMA engine transfers incoming packets from the network link to NI memory. The DMA engines can run in parallel, but the NI's memory allows at most two memory accesses per cycle, one to or from the PCI bus and one to or from the NI processor, the send DMA engine, or the receive DMA engine.

Myrinet NIs connect to each other via 8-port crossbar switches and high-speed links (1.28 Gbit/s in each direction). The Myrinet hardware uses routing and switching techniques similar to those used in massively parallel processors (MPPs) such as the Thinking Machines CM-5, the Cray T3D and T3E, and the Intel Paragon. Network packets are cut-through-routed from a source to a destination network interface. Each packet starts with a sequence of routing bytes, one byte per switch on the path from source to destination. Each switch consumes one routing byte and uses the byte's value to decide on which output port the packet must be forwarded. Myrinet implements a hardware flow control protocol between pairs of communicating NIs which makes packet loss very unlikely. If all senders on the network agree on a deadlock-free routing scheme and the NIs' control programs remove incoming packets in a timely manner, then the network can be considered reliable (see also Section 3.9 and Chapter 8).
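As an illustration of this source-routing scheme (the packet layout below is hypothetical and not Myrinet's actual wire format), a sender prepends one routing byte per switch on the path; conceptually, each switch strips the first byte and uses it to select an output port:

```c
#include <stdio.h>
#include <string.h>

#define MAX_ROUTE 16

/* Hypothetical source-routed packet: the route precedes the payload and
   shrinks by one byte at every switch. */
struct packet {
    unsigned char route[MAX_ROUTE];  /* output port per switch, in path order */
    int           route_len;
    char          payload[64];
};

/* What a switch conceptually does: consume the first routing byte and
   forward the remainder of the packet on that output port. */
static int switch_forward(struct packet *p)
{
    int port = p->route[0];
    memmove(p->route, p->route + 1, --p->route_len);
    return port;
}

int main(void)
{
    struct packet p = { .route = {2, 5, 1}, .route_len = 3, .payload = "data" };
    while (p.route_len > 0)
        printf("switch forwards on port %d\n", switch_forward(&p));
    return 0;
}
```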

While in transit, a packet may become blocked, either because part of the path it follows is occupied by another packet, or because the destination NI fails to drain the network. In the first case, Myrinet will kill the packet if it remains blocked for more than 50 milliseconds. In the second case, the Myrinet hardware sends a reset signal to the destination NI after a timeout interval has elapsed. The length of this interval can be set in software. Both measures are needed to break deadlocks in the network. If the network did not do this, a malicious or faulty program could block another program's packets.

1.7 Thesis Outline

This chapter introduced our area of research, communication support for PPSs. We sketched the problems in this area and our approach to solving them. Chapter 2 surveys the main design issues for user-level communication architectures and shows that existing systems resolve these issues in widely different ways, which illustrates that the tradeoffs are still unclear. Chapter 3 describes the design and implementation of our most optimistic LFC implementation. Chapter 4 gives a detailed description of the NI-level protocols employed by this LFC implementation. Chapter 5 evaluates the implementation's performance. Chapter 6 describes Panda and its interactions with LFC. Chapter 7 describes the implementation of four PPSs: Orca, CRL, MPI, and Manta. We show how LFC and Panda enable efficient implementations of these systems and describe additional optimizations used within these systems. Chapter 8 compares the performance of five LFC implementations at multiple levels: the NI protocol level, the PPS level, and the application level. Finally, in Chapter 9, we draw conclusions.


Chapter 2

Network Interface Protocols

High-speed networks such as Myrinet offer great potential for communication-intensive applications. Unfortunately, traditional communication protocols such as TCP/IP are unable to realize this potential. In the common implementation of these protocols, all network access is through the operating system, which adds significant overheads to both the transmission path (typically a system call and a data copy) and the receive path (typically an interrupt and a data copy). In response to this performance problem, several user-level communication architectures have been developed that remove the operating system from the critical communication path [48, 147]. This chapter provides insight into the design issues for communication protocols for these architectures. We concentrate on issues that determine the performance and semantics of a communication system: data transfer, address translation, protection, control transfer, reliability, and multicast.

In this chapter, we use the following systems to illustrate the design issues: Active Messages II (AM-II) [37], BIP [118], Illinois Fast Messages (FM) [112, 113], FM/MC [10, 146], Hamlyn [32], PM [141, 142], U-Net [13, 147], VMMC [24], VMMC-2 [49, 50], and Trapeze [154]. All systems aim for high performance and all except Trapeze offer a user-level communication service. Interestingly, however, they differ significantly in how they resolve the different design issues. It is this variety that motivates our study.

This chapter is structured as follows. Section 2.1 explains the basic principles of NI protocols by describing a simple, unreliable, user-level protocol. Next, in Sections 2.2 to 2.7, we discuss the six protocol design issues that determine system performance and semantics. We illustrate these issues by giving performance data obtained on our Myrinet cluster.



[Figure 2.1: sending host and NI, network, receiving NI and host; the diagram shows the send ring, receive ring, DMA area, user address space, and a data packet.]

Fig. 2.1. Operation of the basic protocol. The dashed arrows are buffer pointers. The numbered arrows represent the following steps. (1) Host copies user data into DMA area. (2) Host writes packet descriptor to send ring. (3) NI processor reads packet descriptor. (4) NI DMAs packet to NI memory. (5) Network transfer. (6) NI reads receive ring to find empty buffer in DMA area. (7) NI DMAs packet to DMA area. (8) Optional memory copy to user buffer by the host processor.

2.1 A Basic Network Interface Protocol

The goal of this section is to explain the basics of NI protocols and to introduce the most important design issues. To structure our discussion, we first describe the design of a simple, user-level NI protocol for Myrinet. The protocol ignores several important problems, which we address in subsequent sections.

To avoid the cost of kernel calls for each network access, the basic protocol maps all NI memory into user space. User processes write their send requests directly to NI memory, without operating system (OS) involvement. The basic protocol provides no protection, so the network device cannot be shared among multiple processes.

User processes invoke a simple send primitive to send a data packet. The basic protocol sends packets with a maximum payload of 256 bytes and requires that users fragment their data so that each fragment fits in a packet. The signature of the send primitive is as follows:

void send(int destination, void *data, unsigned size);

Send() performs two actions (see Figure 2.1). First, it copies the user data to a packet buffer in a special staging area in host memory (step 1). The NI will later fetch the packet from this DMA area by means of a DMA transfer. Unlike normal user pages, pages in the DMA area are never swapped to disk by the OS. By only DMAing to and from such pinned pages, the protocol avoids corruption of user memory and user messages due to paging activity of the OS.

Second, the host writes a send request into a descriptor in NI memory (step 2). These descriptors are stored in a circular buffer called the send ring. Send() stores the identifier of the destination machine, the size of the packet's payload, and the packet's offset in the DMA area into the next available descriptor in this send ring. To inform the NI of this event, send() also sets a DescriptorReady flag in the descriptor. This flag prevents races between the host and the NI: the NI will not read any other descriptor fields until this flag has been set. The descriptor is written using programmed I/O; since the descriptor is small, DMA would have a high overhead.
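The host side of this procedure can be summarized in a few lines of C. The sketch below is ours, not the protocol's actual source: the SendDesc layout and the send_ring and dma_area pointers are simplifying assumptions, and the busy-wait on a DescriptorFree flag that the real protocol performs when the ring is full is omitted.

#include <string.h>

#define MAX_PAYLOAD 256
#define RING_SIZE    64

/* Hypothetical layout of a send descriptor in NI memory (written with PIO). */
typedef struct {
    unsigned          dest;    /* destination machine */
    unsigned          size;    /* payload size in bytes */
    unsigned          offset;  /* packet offset within the DMA area */
    volatile unsigned ready;   /* DescriptorReady flag, polled by the NI */
} SendDesc;

extern volatile SendDesc *send_ring;  /* mapped NI memory */
extern char              *dma_area;   /* pinned host staging area */
static unsigned           next_slot;

void send(int destination, void *data, unsigned size)
{
    volatile SendDesc *d = &send_ring[next_slot];
    unsigned offset = next_slot * MAX_PAYLOAD;

    /* Step 1: copy the user data into a packet buffer in the DMA area. */
    memcpy(dma_area + offset, data, size);

    /* Step 2: fill in the descriptor; the ready flag is written last so
     * the NI never reads a partially written descriptor. */
    d->dest   = (unsigned)destination;
    d->size   = size;
    d->offset = offset;
    d->ready  = 1;

    next_slot = (next_slot + 1) % RING_SIZE;
}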

The NI repeatedly polls the DescriptorReady flag of the first descriptor in the send ring. As soon as this flag is set by the host, the NI reads the offset in the descriptor and adds it to the physical address of the start of the DMA area, resulting in the physical address of the packet (step 3). Next, the NI initiates a DMA transfer over the I/O bus to copy the packet's payload from host memory to NI memory (step 4). Subsequently, it reads the destination machine in the descriptor and looks up the route for the packet in a routing table. The route and a packet header that contains the packet's size are prepended to the packet. Finally, the NI starts a second DMA to transmit the packet (step 5).

When the sending NI detects that the network DMA for a given packet has completed, it sets a DescriptorFree flag to release the descriptor, and polls the next free descriptor in the ring. If the host wants to send a packet while no descriptor is available, it busy-waits by polling the DescriptorFree flag of the descriptor at the tail of the ring.

NIs use receive DMAs to store incoming packets in their memory. Each NI contains a receive ring with descriptors that point to free buffers in the host's DMA area. The NI uses the receive ring's descriptors to determine where (in host memory) to store incoming packets. After receiving a packet, the NI tries to acquire a descriptor from the receive ring (step 6). Each descriptor contains a flag bit that is used in a similar way as for the send ring. If no free host buffer is available, the packet is simply dropped. Otherwise, the NI starts a DMA to transfer the packet to host memory (step 7). Each host buffer also contains a flag that is set by the NI as the last part of its NI-to-host DMA transfer. The host can check if there is a packet available by polling the flag of the next unprocessed host receive buffer. Once the flag has been set, the receiving process can safely read the buffer and optionally copy its contents to a user buffer (step 8).

The data transfers in steps 1, 4, 5, 7, and 8 can all be performed concurrently. For example, if the host sends a long, multipacket message, one packet's network DMA can be overlapped with the next packet's host-to-NI DMA. Exploiting this concurrency is essential for achieving high throughput.

Since the delivery of network interrupts to user-level processes is expensive on current OSs, the basic protocol does not use interrupts, but requires users to poll for incoming messages. A successful call to poll() results in the invocation of a user function, handle_packet(), that handles an inbound packet:

void poll(void);
void handle_packet(void *data, unsigned size);
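A corresponding host-side sketch of poll() is shown below. The RecvBuf layout and the way the packet size reaches the host are assumptions of ours, not the protocol's actual data structures; the point is that a failed poll only touches a cached flag in host memory.

#define MAX_PAYLOAD 256
#define RING_SIZE    64

/* Hypothetical layout of a host receive buffer in the pinned DMA area.
 * The NI sets 'full' as the last word of its NI-to-host DMA transfer. */
typedef struct {
    char              payload[MAX_PAYLOAD];
    unsigned          size;
    volatile unsigned full;
} RecvBuf;

extern RecvBuf *recv_bufs;    /* RING_SIZE buffers in the host DMA area */
static unsigned next_recv;

void handle_packet(void *data, unsigned size);

void poll(void)
{
    RecvBuf *b = &recv_bufs[next_recv];

    if (!b->full)             /* failed poll: the flag usually sits in the cache */
        return;

    handle_packet(b->payload, b->size);

    b->full = 0;              /* recycle the buffer; a full protocol would
                                 also repost it to the NI's receive ring */
    next_recv = (next_recv + 1) % RING_SIZE;
}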

Our (unoptimized) implementation of the basic protocol achieves a one-way host-to-host latency of 11 µs and a throughput of 33 Mbyte/s (using 256-byte packets). For comparison, on the same hardware the highly optimized BIP system [118] achieves a minimum latency of 4 µs and can saturate the I/O bus (127 Mbyte/s). For large data transfers, BIP uses DMA, both at the sending and the receiving side. In contrast with the basic protocol, however, BIP does not stage data through DMA areas. Section 2.3 discusses different techniques to eliminate the use of DMA areas.

The basic protocol avoids all OS overhead, keeps the NI code simple, and uses little NI memory. It is clear, however, that the protocol has several shortcomings:

• All inbound and outbound network transfers are staged through a DMA area. For applications that need to send and receive from arbitrary locations, this introduces extra memory copies.

• The protocol provides no protection. If the basic protocol allowed multiple users to access the NI, these users could read and modify each other's data in NI memory. Users can even modify the NI's control program and use it to access any host memory location.

• The receiver-side control transfer mechanism, polling, is simple, but not always effective. For many applications it is difficult to determine a good polling rate. If the host polls too frequently, it will have a high overhead; if it polls too late, it will not reply quickly enough to incoming packets.

• The protocol is unreliable, even though the Myrinet hardware is highly reliable. If the senders send packets faster than the receiver can handle them, the receiving host will run out of buffer space and the NI will drop incoming packets.

• The protocol supports only point-to-point messages. Although multicast can be implemented on top of point-to-point messages, doing so may be inefficient. Multicast is an important service by itself and a fundamental component of collective communication operations such as those supported by the message-passing standard MPI.

Below, we will discuss these problems in more detail and look at better design alternatives.

2.2 Data Transfers

On Myrinet, at least three steps are needed to communicate a packet from one user process to another: the packet must be moved from the sender's memory to its NI (host-NI transfer), from this NI to the receiver's NI (NI-NI transfer), and then to the receiving process's address space (NI-host transfer). Network technologies that do not require data to be staged through NI memory use only two steps: host-to-network and network-to-host. Below, we discuss the Myrinet case, but most issues (programmed I/O versus DMA, pinning, alignment, and maximum packet size) also apply to the two-step case.

The data transfers have a significant impact on the latency and throughput obtained by a protocol, so optimizing them is essential for obtaining high performance. As shown in Figure 2.1, the basic protocol uses five data transfers to communicate a packet, because it stages all packets through DMA areas. Below, we discuss alternative designs for implementing the host-NI, NI-NI, and NI-host transfers.

2.2.1 From Host to Network Interface

On Myrinet, the host-to-NI transfer can use either DMA or Programmed I/O (PIO). Steenkiste gives a detailed description of both mechanisms [133]. With PIO, the host processor reads the data from host memory and writes it into NI memory, typically one or two words at a time, which results in many bus transactions. DMA uses special hardware (a DMA engine) to transfer the entire packet in large bursts and asynchronously, so that the data transfer can proceed in parallel with host computations. One thus might expect DMA to always outperform PIO. The optimal choice, however, depends on the type of host CPU and on the packet size. The Pentium Pro, for example, supports write-combining buffers, which allow memory writes to the same cache line to be merged into a single 32-byte bus transaction. We can thus boost the performance of host-to-NI PIO transfers by applying write combining to NI memory. (This use of the Pentium Pro's write-combining facility was suggested to us by the Fast Messages group of the University of Illinois at Urbana-Champaign [31].)

Fig. 2.2. Host-to-NI throughput using different data transfer mechanisms. (The figure plots throughput in Mbyte/s against buffer sizes from 4 bytes to 16 Kbyte for five mechanisms: DMA, DMA + on-demand pinning, PIO + write combining, DMA + memcpy, and PIO.)

Figure 2.2 shows the throughput obtained by PIO (with and without write combining) and cache-coherent DMA, for copying data from a Pentium Pro to a Myrinet NI card. PIO with write combining quickly outperforms PIO without write combining. For buffer sizes up to 1024 bytes, PIO with write combining even outperforms DMA (which suffers from a startup cost).

For small messages, PIO with write combining is slower than PIO without write combining. This is due to an expensive extra instruction (a serializing instruction) that our benchmark executes after each host-to-NI memory copy with write combining. We use this instruction to model the common situation in which a data copy to NI memory is followed by another write that signals the presence of the data. Clearly, the NI should not observe the latter write before the data copy has completed. With write combining, however, writes may be reordered and it is necessary to separate the data copy and the write that follows it by a serializing instruction.
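The sketch below illustrates this ordering requirement. It assumes that ni_pkt and ni_flag point into a write-combining mapping of NI memory, and it uses CPUID as the serializing instruction; that is one possible choice on a Pentium Pro, not necessarily the exact instruction used by our benchmark.

#include <string.h>

/* Order a PIO copy through a write-combining mapping before the flag
 * write that announces the data; without the serializing instruction,
 * the NI could observe the flag before the data has arrived. */
static inline void serialize(void)
{
    unsigned eax = 0, ebx, ecx, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         :
                         : "memory");
}

void pio_send(volatile char *ni_pkt, volatile unsigned *ni_flag,
              const void *data, unsigned size)
{
    memcpy((void *)ni_pkt, data, size);  /* writes merged into 32-byte bursts */
    serialize();                         /* drain the write-combining buffers */
    *ni_flag = 1;                        /* now safe to signal the NI */
}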

In user-level communication systems, DMA transfers can be started either by a user process or by the network interface without any operating system involvement. Since DMA transfers are performed asynchronously, the operating system may decide to swap out the page that happens to be the source or destination of a running DMA transfer. If this happens, part of the destination of the transfer will be corrupted. To avoid this, operating systems allow applications to pin a limited number of pages in their address space. Pinned pages are never swapped out by the operating system. Unfortunately, pinning a page requires a system call and the amount of memory that can be pinned is limited by the available physical memory and by OS policies. In the best case, all pages that an application transfers data to or from need to be pinned only once. In this case the cost of pinning the pages can be amortized over many data transfers. In the worst case, an application transfers data to or from more pages than can be pinned simultaneously. In this case, two system calls are needed for every data transfer: one to unpin a previously pinned page and one to pin the page involved in the next data transfer. SHRIMP provides special hardware that allows user processes to start DMA transfers without pinning [25]. This user-level DMA mechanism, however, works only for host-initiated transfers, not for NI-initiated transfers, so pinning is still required at the receiving side.
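On POSIX systems, mlock() is one way for an application to request such pinning; the sketch below is a generic illustration only and does not show how the resulting physical addresses would be communicated to the NI (real systems use a kernel module or device driver for that).

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate a page-aligned buffer and pin it so the OS will not page it
 * out during a DMA transfer. The amount that may be locked is limited
 * by OS policy, so the call can fail. */
void *alloc_pinned(size_t size)
{
    void *buf;

    if (posix_memalign(&buf, 4096, size) != 0)
        return NULL;
    if (mlock(buf, size) != 0) {   /* system call: keep it off the critical path */
        perror("mlock");
        free(buf);
        return NULL;
    }
    return buf;
}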

NI protocols that use DMA often choose to copy the data into a reserved (and pinned) DMA area, which costs an extra memory copy and thus may decrease the throughput. Figure 2.2 shows that a DMA transfer preceded by a memory copy is consistently slower than PIO with write combining. On processors that do not support cache-coherent DMA, the DMA area needs to be allocated in uncached memory, which also decreases performance. (Most modern CPUs, including the Pentium Pro, support cache-coherent DMA, however.) With PIO, pinning is not necessary. Even if the OS swapped out the page during the transfer, the host's next memory reference would generate a page fault, causing the OS to swap the page back in. In practice, many protocols use DMA; other protocols (AM-II, Hamlyn, BIP) use PIO for small messages and DMA for large messages. FM uses PIO for all messages.

With both DMA and PIO, data transfers between unaligned data buffers can be much slower than between aligned buffers. This problem is aggravated when the NI's DMA engines require that source and destination buffers be properly aligned. In that case, extra copying is needed to align unaligned buffers.

Another important design choice is the maximum packet size. Large packets yield better throughput, because per-packet overheads are incurred fewer times than with small packets. The throughput of our basic protocol, for example, increases from 33 Mbyte/s to 48 Mbyte/s by using 1024-byte instead of 256-byte packets. The choice of the maximum packet size is influenced by the system's page size, memory space considerations, and hardware restrictions.


2.2.2 From Network Interface to Network Interface

The NI-to-NI transfer traverses the Myrinet links and switches. All Myrinet protocols use the NI's DMA engine to send and receive network data. In theory, PIO could be used, but DMA transfers are always faster. To send or receive a data word by means of PIO, the processor must always move that data item through a processor register, which costs at least two cycles. A DMA engine can transfer one word per cycle. Moreover, using DMA frees the processor to do other work during data transfers.

To prevent network congestion, the receiving NI should extract incoming data from the network fast enough. On Myrinet, the hardware uses backpressure to stall the sending NI if the receiver does not extract data fast enough. To prevent deadlock, however, there is a time limit on the backpressure mechanism. If the receiver does not drain the network within a certain time period, the network hardware will reset the NI or truncate a blocked packet. Many Myrinet control programs deal with this real-time constraint by copying data fast enough to prevent resets. Other protocols avoid the problem by using a software flow control scheme, as we will discuss later.

2.2.3 From Network Interface to Host

The transfer from NI to host at the receiving side can again use either DMA or PIO. On Myrinet, however, only the host (not the NI) can use PIO, making DMA the method of choice for most protocols. Some systems (e.g., AM-II) use PIO on the host to receive small messages. For large messages, all protocols use DMA, because reads over the I/O bus are typically much slower than DMA transfers. Whether we use DMA or PIO, in both cases the bottleneck for the NI-to-host transfer is the I/O bus.

Using microbenchmarks, we measured the attainable throughput on the important data paths of our Pentium Pro/Myrinet cluster, using DMA and PIO. Table 2.1 summarizes the results and also gives the hardware bandwidth of the bottleneck component on each data path. For transfers between the host and the NI, the bottleneck is the 33.33 MHz PCI bus; both main memory and the memory bus allow higher throughputs. With DMA transfers, the microbenchmarks can almost saturate the I/O bus. Our throughput is slightly less than the I/O bus bandwidth because in our benchmarks the NI acknowledges every DMA transfer with a one-word NI-to-host DMA transfer. For network transfers from one NI's memory to another NI's memory, the bottleneck is the NIs' memory. The send and receive DMA engines can access at most one word per I/O bus clock cycle (i.e., 127 Mbyte/s). The bandwidth of the network links is higher, 153 Mbyte/s. Finally, for local host memory copies, the bottleneck component is main memory.

Source        Destination   Method     Hardware bandwidth   Measured throughput
Host memory   Host memory   PIO        170                  52
Host memory   NI memory     PIO        127                  25
                            PIO + WC   127                  84
                            DMA        127                  117
NI memory     Host memory   PIO        127                  7
                            DMA        127                  117
NI memory     NI memory     DMA        127                  127

Table 2.1. Bandwidths and measured throughputs (in Mbyte/s) on a Pentium Pro/Myrinet cluster. WC means write combining.

An interesting observation is that a local memory copy on a single Pentium Pro obtains a lower throughput than a remote memory copy over Myrinet (52 Mbyte/s versus 127 Mbyte/s). This problem is largely due to the poor memory write performance of the Pentium Pro [29]; on a 450 MHz Pentium III platform, we measured a memory-to-memory copy throughput of 157 Mbyte/s. Nevertheless, memory copies have an important impact on performance. For comparison, recall that the basic protocol achieves a throughput of only 33 Mbyte/s. The reason is that the basic protocol uses fairly small packets (256 bytes) and performs memory copies to and from DMA areas, which interferes with DMA transfers. Several systems (e.g., BIP and VMMC-2) can saturate the I/O bus: the key issue is to avoid the copying to and from DMA areas. The next section describes techniques to achieve this.

2.3 Address Translation

The use of DMA transfers between host and NI memory introduces two problems. First, most systems require that every host memory page involved in a DMA transfer be pinned to prevent the operating system from replacing that page. Pinning, however, requires an expensive system call which should be kept off the critical path. The second problem is that, on most architectures, the NI's DMA engine needs to know the physical address of each page that it transfers data to or from. Operating systems, however, do not export virtual-to-physical mappings to user-level programs, so users normally cannot pass physical addresses to the NI. Even if they could, the NI would have to check those physical addresses, to ensure that users pass only addresses of pages that they have access to.

We consider three approaches to solve these problems. The first approach is to avoid all DMA transfers by using programmed I/O. Due to the high cost of I/O bus reads, however, this is only a realistic solution at the sending side.

The second approach, used by the basic protocol, requires that users copy their data into and out of special DMA areas (see Figure 2.1). This way, only the DMA areas need to be pinned. This is done once, when the application opens the device, and not during send and receive operations. The address translation problem is then solved as follows. The operating system allocates for each DMA area a contiguous chunk of physical memory and passes the area's physical base address to the NI. Users specify send and receive buffers by means of an offset into their DMA area. The NI only needs to add this offset to the area's base address to obtain the buffer's physical address. Several systems (e.g., AM-II, Hamlyn) use this approach, accepting the extra copying costs. As shown in Figure 2.2, however, the extra copy reduces throughput significantly.

In the third approach, the copying to and from DMA areas is eliminated by dynamically pinning and unpinning user pages so that DMA transfers can be performed directly to those pages. Systems that use this approach (e.g., VMMC-2, PM, and BIP) can track the 'DMA' curve in Figure 2.2. The main implementation problem is that the NI needs to know the current virtual-to-physical mappings of individual pages. Since NIs are usually equipped with only a small amount of memory and since their processors are slow, they do not store information for every single virtual page. Some systems (e.g., BIP) provide a simple kernel module that translates virtual addresses to physical addresses. Users are responsible for pinning their pages and obtaining the physical addresses of these pages from the kernel module. The disadvantage of this approach is that the NI cannot check if the physical addresses it receives are valid and if they refer to pinned pages.

An alternative approach is to let the kernel and NI cooperate such that the NI can keep track of valid address translations (either in hardware or in software). Systems like VMMC-2 and U-Net/MM [13] (an extension of U-Net) let the NI cache a limited number of valid address translations which refer to pinned pages (this invariant must be maintained cooperatively by the NI and the operating system). This caching works well for applications that exhibit locality in the pages they use for sending and receiving data. When the translation of a user-specified address is found in the cache, the NI can access that address using a DMA transfer. In the case of a miss, special action must be taken. In U-Net/MM, for example, the NI generates an interrupt when it cannot translate an address. The kernel receives the interrupt, looks up the address in its page table, pins the page, and passes the translation to the NI.

In VMMC-2, address translations for user buffers are managed by a library. This user-level library maps users' virtual addresses to references to address translations which users can pass to the NI. The library creates these references by invoking a kernel module. This module translates virtual addresses, pins the corresponding pages, and stores the translations in a User-managed TLB (UTLB) in kernel memory.

To avoid invoking the operating system every time a reference is needed, the library maintains a user-level lookup data structure that keeps track of the addresses for which a valid UTLB entry exists. The library invokes the UTLB kernel module only when it cannot find the address in its lookup data structure. When the NI receives a reference, it can find the translation using a DMA transfer to the kernel module's data structure. To avoid such DMA transfers on the critical path, the NI maintains its own cache of references. The 'on-demand pinning' curve in Figure 2.2 shows the throughput obtained by a benchmark that imitates the miss behavior of a UTLB. To simulate a miss in its lookup data structure, the host invokes, for each page that is transferred, a system call to pin that page. To simulate a miss in the NI cache, the NI fetches a single word from host memory before it transfers the data.

2.4 Protection

Since user-level architectures give users direct access to the NI, the OS can no longer check every network transfer. Multiplexing and demultiplexing must now be performed by the NI and special measures are needed to preserve the integrity of the OS and to isolate the data streams of different processes from one another.

In a protected system, no user process should be given unrestricted access to NI memory. With unrestricted access to NI memory, a user process can modify the NI control program and use it to read or write any location in host memory. NI memory access must also be restricted to implement protected multiplexing and demultiplexing. In the basic protocol, for example, user processes directly write to NI memory to initialize send descriptors. If multiple processes shared the NI, one process could corrupt another process's send descriptors. Similarly, no process should be able to read incoming network packets that are destined for another process. The basic protocol prevents these problems by providing user-level network access to at most one user at a time, but this limitation is clearly undesirable in a multi-user and multiprogramming environment.

A straightforward solution to these problems is to use the virtual-memory system to give each user access to a different part of NI memory [48]. When the user opens the network device, the operating system maps such a part into the user's address space. Once the mapping has been established, all user accesses outside the mapped area will be trapped by the virtual-memory hardware. Users write their commands and network data to their own pages in NI memory. It is the responsibility of the NI to check each user's page for new requests and to process only legal requests.

Since NI memory is typically small, only a limited number of processes can be given direct access to the NI this way. To solve this problem, AM-II virtualizes network endpoints in the following way. Part of NI memory acts as a cache for active communication endpoints; inactive endpoints are stored in host memory. When an NI receives a message for an inactive endpoint or when a process tries to send a message via an inactive endpoint, the NI and the operating system cooperate to activate the endpoint. Activation consists of moving the endpoint's state (send and receive buffers, protocol status) to NI memory, possibly replacing another endpoint which is then swapped out to host memory.

The sharing problem also exists on the host, for the DMA areas. To maintain protection, each user process needs its own DMA area. Since the use of a DMA area introduces an extra copy, some systems eliminate it and store address translations on the NI. VMMC-2 and U-Net/MM do this in a protected way, either by letting the kernel write the translations to the NI or by letting the NI fetch the translations from kernel memory. BIP, on the other hand, eliminates the DMA area, but does not maintain protection.

2.5 Control Transfers

The control transfer mechanism determines how a receiving host is notified of message arrivals. The options are to use interrupts, polling, or a combination of these.

Interrupts are notoriously expensive. With most operating systems, the time to deliver an interrupt (as a signal) to a user process even exceeds the network latency. On a 200 MHz Pentium Pro running Linux, dispatching an interrupt to a kernel interrupt handler costs approximately 8 µs. Dispatching to a user-level signal handler costs even more, approximately 17 µs. This exceeds the latency of our basic protocol (11 µs).

Given the high costs of interrupts, all user-level architectures support some form of polling. The goal of polling is to give the host a fast mechanism to check if a message has arrived. This check must be inexpensive, because it may be executed often. A simple approach is to let the NI set a flag in its memory and to let the host check this flag. This approach, however, is inefficient, since every poll now results in an I/O bus transfer. In addition, this polling traffic will slow down other I/O traffic, including network packet transfers between NI and host memory.

A very efficient solution is to use a special device status register that is shared between the NI and the host [107]. Current hardware, however, does not provide these shared registers.

On architectures with cache-coherent DMA, a practical solution is to let the NI write a flag in cached host memory (using DMA) when a message is available. This approach is used by our basic protocol. The host polls by reading its local memory; since polls are executed frequently, the flag will usually reside in the data cache, so failed polls are cheap and do not generate memory or I/O traffic. When the NI writes the flag, the host will incur a cache miss and read the flag's new value from memory. On a 200 MHz Pentium Pro, the scheme just described costs 5 nanoseconds for a failed poll (i.e., a cache hit) and 74 nanoseconds for a successful poll (i.e., a cache miss). For comparison, each poll in the simple scheme (i.e., an I/O bus transfer) costs 467 nanoseconds.

Even if the polling mechanism is efficient, polling is a mixed blessing. Inserting polls manually is tedious and error-prone. Several systems therefore use a compiler or a binary rewriting tool to insert polls in loops and functions [114, 125]. The problem of finding the right polling frequency remains, though [89]. In multiprocessor architectures this problem can be solved by dedicating one of the processors to polling and message handling.

Several systems (AM-II, FM/MC, Hamlyn, Trapeze, U-Net, VMMC, VMMC-2) support both interrupts and polling. Interrupts usually can be enabled or disabled by the receiver; sometimes the sender can also set a flag in each packet that determines whether an interrupt is to be generated when the packet arrives.


[Figure 2.3: a decision diagram whose labels include 'Assume Myrinet is reliable?', 'Reliability strategy', 'Reliability protocol?', 'Recovery', 'Prevent buffer overflow', 'No (unreliable interface)', and 'Locus of flow control' (application, host, NI), covering AM-II, VMMC-2, Hamlyn, FM, FM/MC, VMMC, PM, BIP, U-Net, and Trapeze.]

Fig. 2.3. Design decisions for reliability.

2.6 Reliability

Existing Myrinet protocols differ widely in the way they address reliability. Figure 2.3 shows the choices made by various systems. The most important choice is whether or not to assume that the network is reliable. Myrinet has a very low bit-error rate (less than 10^-15 on shielded cables up to 25 m long [27]). Consequently, the risk of a packet getting lost or corrupted is small enough to consider it a 'fatal event' (much like a memory parity error or an OS crash). Such events can be handled by higher-level software (e.g., using checkpointing), or else cause the application to crash.

Many Myrinet protocols indeed assume that the hardware is reliable, so let us look at these protocols first. The advantage of this approach is efficiency, because no retransmission protocol or time-out mechanism is needed. Even if the network is fully reliable, however, the software protocol may still drop packets due to lack of buffer space. In fact, this is the most common cause of packet loss. Each protocol needs communication buffers on both the host and the NI, and both are a scarce resource. The basic protocol described in Section 2.1, for example, will drop packets when it runs out of receive buffers. This problem can be solved in one of two ways: either recover from buffer overflow or prevent overflow from happening.

The first idea (recovery) is used in PM. The receiver simply discards incoming packets if it has no room for them. It returns an acknowledgement (ACK or NACK) to the sender to indicate whether or not it accepted the packet. A NACK indicates a packet was discarded; the sender will later retransmit that packet. This process continues until an ACK is received, in which case the sender can release the buffer space for the message. This protocol is fairly simple; the main disadvantages are the extra acknowledgement messages and the increased network load when retransmissions occur.

A key property of the protocol is that it never drops acknowledgements. By assumption, Myrinet is reliable, so the network hardware always delivers acknowledgements to their destination. When an acknowledgement arrives, the receiver processes it completely before it accepts a new packet from the network, so at most one acknowledgement needs to be buffered. (Myrinet's hardware flow control ensures that pending network packets are not dropped.)

Unlike an acknowledgement, a data packet cannot always be processed completely once it has been received. A data packet needs to be transferred from the NI to the host, which requires several resources: at least a free host buffer and a DMA engine. If the wait time for one of these resources (e.g., a free host buffer) is unbounded, then the NI cannot safely wait for that resource without receiving pending packets, because the network hardware may then kill blocked packets.

The second approach is to prevent buffer overflow by using a flow control scheme that blocks the sender if the receiver is running out of buffer space. For large messages, BIP requires the application to deal with flow control by means of a rendezvous mechanism: the receiver must post a receive request and provide a buffer before a message may be sent. That is, a large-message send never completes before a receive has been posted. FM and FM/MC implement flow control using a host-level credit scheme. Before a host can send a packet, it needs to have a credit for the receiver; the credit represents a packet buffer in the receiver's memory. Credits can be handed out in advance by pre-allocating buffers for specific senders, but if a sender runs out of credits it must block until the receiver returns new credits.
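A host-level credit scheme of this kind can be sketched as follows; the counters, constants, and the credit_return() path below are hypothetical and much simpler than FM's actual implementation.

#define NHOSTS       64
#define INIT_CREDITS 16   /* receive buffers pre-allocated per sender */

extern void poll(void);   /* drains the network; may deliver credit returns */

static int credits[NHOSTS];   /* credits we currently hold for each destination */

void credit_init(void)
{
    for (int i = 0; i < NHOSTS; i++)
        credits[i] = INIT_CREDITS;
}

/* Called before transmitting a packet to 'dest'; blocks (while polling)
 * until the receiver has granted at least one credit. */
void acquire_credit(int dest)
{
    while (credits[dest] == 0)
        poll();
    credits[dest]--;
}

/* Called when a credit-return message from 'src' reports that 'n'
 * receive buffers have been freed. */
void credit_return(int src, int n)
{
    credits[src] += n;
}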

A host-level credit scheme prevents overflow of host buffers, but not of NI buffers, which are usually even scarcer (because NI memory is smaller than host memory). With some protocols, the NI temporarily stops receiving messages if the NI buffers overflow. Such protocols rely on Myrinet's hardware, link-level flow control mechanism (backpressure) to stall the sender in such a case.

The protocols described so far implement a reliable interface by depending on the reliability of the hardware. Several other protocols do not assume the network to be reliable, and either present an unreliable programming interface or implement a retransmission protocol. U-Net and Trapeze present an unreliable interface and expect higher software layers (e.g., MPI, TCP) to retransmit lost messages. Other systems do provide a reliable interface, by implementing a timeout-retransmission mechanism, either on the host or the NI. The cost of setting timers and processing acknowledgements is modest, typically no more than a few microseconds.

2.7 Multicast

Multicasting occurs in many communication patterns, ranging from a straightforward broadcast to the more complicated all-to-all exchange [84]. Message-passing systems like MPI [45] directly support such patterns by means of collective communication services and thus rely on an efficient multicast implementation. In addition, various higher-level systems use multicasting to update replicated shared data [9].

Today's wormhole-routed networks, including Myrinet, do not support reliable multicasting at the hardware level; multicast in wormhole-switched networks is a hard problem and the subject of ongoing research [56, 128, 136, 135]. The simplest way to implement a multicast in software is to let the sender send a point-to-point message to each multicast destination. This solution is inefficient, because the point-to-point startup cost is incurred for every multicast destination; this cost includes the data copy to NI memory (possibly preceded by a copy to a DMA area). With some NI support, the repeated copying can be avoided by passing all multicast destinations to the NI, which then repeatedly transmits the same packet to each destination. Such a 'multisend' primitive is provided by PM.

Although more efficient than a repeated send on the host, a multisend still leaves the network interface as a serial bottleneck. A more efficient approach is to organize the sender and receivers into a multicast tree. The sender is the root of the tree and transmits each multicast packet to all its children (a subset of the receivers). These children, in turn, forward each packet to their children, and so on. Figure 2.4 contrasts the repeated-send and the tree-based multicast strategies. Tree-based protocols allow packets to travel in parallel along the branches of the tree and therefore usually have logarithmic rather than linear complexity.
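The forwarding step of such a protocol can be sketched on the host in terms of the basic protocol's send() primitive. The binary tree rooted at the sender used below is an assumption for illustration; the trees used by real systems such as FM/MC differ.

extern void send(int destination, void *data, unsigned size);

/* Forward a multicast packet along a binary tree rooted at the sender.
 * Every process renumbers itself relative to the root and sends the
 * packet on to its (at most two) children. */
void mcast_forward(int me, int root, int nprocs, void *pkt, unsigned size)
{
    int rel = (me - root + nprocs) % nprocs;   /* my rank within the tree */

    for (int c = 2 * rel + 1; c <= 2 * rel + 2; c++) {
        if (c < nprocs) {
            int child = (c + root) % nprocs;   /* back to a real process id */
            send(child, pkt, size);
        }
    }
}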

Tree-based protocols can be implemented efficiently by performing the forwarding of multicast packets on the NI instead of on the host [56, 68, 80, 146]. Verstoep et al. implemented such an NI-level spanning tree multicast as an extension of the Illinois Fast Messages [113] substrate; the resulting system is called FM/MC (Fast Messages/MultiCast) [146].


Fig. 2.4. Repeated send versus a tree-based multicast.

An important issue in the design of a multicast protocol is flow control. Multicast flow control is more complex than point-to-point flow control, because the sender of a multicast packet needs to obtain buffer space on all receiving NIs and hosts. FM/MC uses a central buffer manager (implemented on one of the NIs) to keep track of the available buffer space at all receivers. This manager handles all requests for multicast buffers and allows senders to prefetch buffer space for future multicasts. An advantage of this scheme is that it avoids deadlock by acquiring buffer space in advance; a disadvantage is that it employs a central buffer manager. An alternative scheme is to acquire buffers on the fly: the sender of a multicast packet is responsible only for acquiring buffer space at its children in the multicast tree. A more detailed discussion of FM/MC is given in Section 4.5.4.

2.8 Classification of User-Level Communication Systems

Table 2.2 classifies ten user-level communication systems and shows how each system deals with the design issues discussed in this chapter. These systems all aim for high performance and all provide a lean, low-level, and more or less generic communication facility. All systems except VMMC and U-Net were developed first on a Myrinet platform.

Most systems implement a straightforward message-passing model. The systems differ mainly in their reliability and protection guarantees and their support for multicast. Several systems use active messages [148]. With active messages, the sender of a message specifies not only the message's destination, but also a handler function, which is invoked at the destination when the message arrives.

VMMC, VMMC-2, and Hamlyn provide virtual memory-mapped and sender-based communication instead of message passing. In both models the sender specifies where in the receiver's address space the communicated data must be deposited. This model allows data to be moved directly to its destination without any unnecessary copying.

In addition to the research projects listed in Table 2.2, industry has recently created a draft standard for user-level communication in cluster environments [51, 149]. Implementations of this Virtual Interface (VI) architecture have been constructed by UC Berkeley, GigaNet, Intel, and Tandem. GigaNet sells a hardware VI implementation. The others implement VI in device driver software, in NI firmware, or both. Speight et al. discuss the performance of one hardware and one software implementation [130].

2.9 Summary

This chapter has discussed several design issues and tradeoffs for NI protocols, in particular:

• how to transfer data between hosts and NIs (DMA or programmed I/O);

• how to avoid copying to and from DMA areas by means of address translation techniques;

• how to achieve protected, user-level network access in a multi-user environment;

• how to transfer control (interrupts, polling, or both);

• how and where to implement reliability (application, hosts, NIs);

• whether to implement additional functionality on the NIs, such as multicast.

An interesting observation is that many novel techniques exploit the programmable NI processor (e.g., for address translation). Programmable NIs are flexible, which compensates for the lack of hardware support present in the more advanced interfaces used by MPPs. Eventually, hardware implementations may be more efficient, but the availability of a programmable NI has enabled fast and easy experimentation with different protocols for commodity network interfaces. As a result, the performance of these protocols has substantially increased. In combination with the economic advantages of commodity networks, this makes such networks a key technology for parallel cluster computing.


(Columns: data transfer (host-NI); address translation; protection; control transfer; reliability; multicast support.)

AM-II:   PIO + DMA; DMA areas; protection: yes; polling + interrupts; reliable (NI: alternating bit, host: sliding window); multicast: no.
FM:      PIO; DMA area (receive); protection: no; polling; reliable (host-level credits); multicast: no.
FM/MC:   PIO; DMA area (receive); protection: no; polling + interrupts; reliable (unicast: host-level credits, multicast: NI-level credits); multicast: yes (on NI).
PM:      DMA; software TLB on NI; protection: yes (gang scheduling); polling; reliable (ACK/NACK protocol on NI); multicast: multiple sends.
VMMC:    DMA; software TLB on NI; protection: yes; polling + interrupts; reliable (exploits hardware backpressure); multicast: no.
VMMC-2:  DMA; UTLB in kernel, cached on NI; protection: yes; polling + interrupts; reliable; multicast: no.
Hamlyn:  PIO + DMA; DMA areas; protection: yes; polling + interrupts; reliable (exploits hardware backpressure); multicast: no.
Trapeze: DMA; DMA to page frames; protection: no; polling + interrupts; unreliable; multicast: no.
BIP:     PIO + DMA; user translates; protection: no; polling; reliable (rendezvous and backpressure); multicast: no.
U-Net:   DMA; TLB on NI (U-Net/MM); protection: yes; polling + interrupts; unreliable; multicast: no.

Table 2.2. Summary of design choices in existing communication systems.


An efficient network interface protocol, however, does not suffice; what counts is application performance. As discussed in Section 1.1, most application programmers use a parallel-programming system to develop their applications. There is a large gap between the high-level abstractions provided by these programming systems and the low-level interfaces of the communication architectures discussed in this chapter. It is not a priori clear that all types of programming systems can be implemented efficiently on all of these low-level systems.


Chapter 3

The LFC User-Level Communication System

This chapter describes LFC, a new user-level communication system. LFC distinguishes itself from other systems by the way it divides protocol tasks between the host processor and the network interface (NI). In contrast with communication systems that minimize the amount of protocol code executed on the NI, LFC exploits the NI to implement flow control, to forward multicast traffic, to reduce the overhead of network interrupts, and to perform synchronization operations. This chapter describes LFC's user interface and gives an overview of LFC's implementation on the computer cluster described in Section 1.6. LFC's NI-level protocols are described separately in Chapter 4.

This chapter is structured as follows. Section 3.1 describes LFC's programming interface. Section 3.2 states the key assumptions that we made in LFC's implementation. Section 3.3 gives an overview of the main components of the implementation. Subsequent sections describe LFC's packet format (Section 3.4), data transfer mechanisms (Section 3.5), host buffer management (Section 3.6), the implementation of message detection and dispatch (Section 3.7), and the implementation of a fetch-and-add primitive (Section 3.8). Section 3.9 discusses limitations of the implementation. Finally, Section 3.10 discusses related work.

3.1 Programming Interface

LFC aims to support the development of parallel runtime systems rather than the development of parallel applications. Therefore, LFC provides an efficient, low-level interface rather than a high-level, user-friendly interface. In particular, LFC does not fragment large messages and does not provide a demultiplexing mechanism. As a result, the user of LFC has to write code for fragmenting, reassembling, and demultiplexing messages. This interface requires extra programming effort, but at the same time offers high performance to all clients. The most important functions of the interface are listed in Table 3.1.

3.1.1 Addressing

LFC's addressing scheme is simple. Each participating process is assigned a unique process identifier, a number in the range 0, ..., P-1, where P is the number of participating processes. This number is determined at application startup time and remains fixed during the application's lifetime. LFC maps process identifiers to host addresses and network routes.

3.1.2 Packets

LFC clients use packets to send and receive data. Packets have a maximum size (1 Kbyte by default); messages larger than this size must be fragmented by the client. LFC deliberately does not provide a higher-level abstraction that allows clients to construct messages of arbitrary sizes. Such abstractions impose overhead and often introduce a copy at the receiving side when data arrives in packet buffers that are not contiguous in memory. Clients that need only sequential access to message fragments can avoid this copy by using stream messages [90]. An efficient implementation of stream messages on top of LFC is described in Section 6.3. Even with stream messages, however, some overhead remains in the form of procedure calls, auxiliary data structures, and header fields that some clients simply do not need. Our implementation of CRL, for example, is constructed directly on LFC's packet interface, without intermediate message abstractions.

LFC distinguishes between send packets and receive packets. Send packets reside in NI memory. Clients can allocate a send packet and store data into it using normal memory writes (i.e., using programmed I/O). At the receiving side, LFC uses the DMA area approach described in Section 2.1, so receive packets reside in pinned host memory.

Both send and receive packets form a scarce resource, for which only a limited amount of memory is available. Send packets are stored in NI memory, which is typically small: the memory sizes of current NIs range from a few hundred kilobytes to a few megabytes. Receive buffers are stored in pinned host memory.


void *lfc_send_alloc(int upcalls_allowed)
    Allocates a send packet buffer in NI memory and returns a pointer to the packet's data part. Parameter upcalls_allowed has the same meaning in all functions: it indicates whether upcalls can be made when network draining occurs during execution of the function.

int lfc_group_create(unsigned *members, unsigned nmember)
    Creates a multicast group and returns a group identifier. The process identifiers of all nmember members are stored in members.

void lfc_ucast_launch(unsigned dest, void *pkt, unsigned size, int upcalls_allowed)
    Transmits the first size bytes of send packet pkt to process dest and transfers ownership of the packet to LFC.

void lfc_bcast_launch(void *pkt, unsigned size, int upcalls_allowed)
    Broadcasts the first size bytes of send packet pkt to all processes except the sender and transfers ownership of the packet to LFC.

void lfc_mcast_launch(int gid, void *pkt, unsigned size, int upcalls_allowed)
    Multicasts the first size bytes of send packet pkt to all members of group gid and transfers ownership of the packet to LFC.

void lfc_poll(void)
    Drains the network and invokes the client-supplied packet handler (lfc_upcall()) for each packet removed from the network.

void lfc_intr_disable(void)
void lfc_intr_enable(void)
    Disable and enable network interrupts. These functions must be called in pairs, with lfc_intr_disable() going first.

int lfc_upcall(unsigned src, void *pkt, unsigned size, lfc_upcall_flags_t flags)
    Lfc_upcall() is the client-supplied function that LFC invokes once per incoming packet. Pkt points to the data just received (size bytes) and flags indicates whether or not pkt is a multicast packet. If lfc_upcall() returns nonzero, the client becomes the owner of pkt and must later release pkt using lfc_packet_free(). Otherwise, LFC recycles pkt immediately.

void lfc_packet_free(void *pkt)
    Returns ownership of receive packet pkt to LFC.

unsigned lfc_fetch_and_add(unsigned var, int upcalls_allowed)
    Atomically increments the F&A variable identified by var and returns the value of var before the increment.

Table 3.1. LFC's user interface. Only the essential functions are shown.


In a multiprogramming environment, where user processes compete for CPU cycles and memory, most operating systems do not allow a single user to pin large amounts of memory.

3.1.3 Sending Packets

LFC provides three send routines: lfc_ucast_launch() sends a point-to-point packet, lfc_bcast_launch() broadcasts a packet to all processes, and lfc_mcast_launch() sends a packet to all processes in a multicast group. Each multicast group contains a fixed subset of all processes. Multicast groups are created once and for all at initialization time, so processes cannot join or leave a multicast group dynamically. For reasons described in Appendix A, multicast groups may not overlap.

To send data, a client must perform three steps (a usage sketch follows the list):

1. Allocate a send packet with lfc_send_alloc(). Lfc_send_alloc() returns a pointer to a free send packet in NI memory and transfers ownership of the packet from LFC to the client.

2. Fill the send packet, usually with a client-specific header and data.

3. Launch the packet with lfc_ucast_launch(), lfc_mcast_launch(), or lfc_bcast_launch(). These functions transmit the packet to one, multiple, or all destinations, respectively. LFC transmits as many bytes as indicated by the client. Ownership of the packet is transferred back to LFC. This implies that the allocation step (step 1) must be performed for each packet that is to be transmitted. No packet can be transmitted multiple times.
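A minimal usage sketch of these three steps is shown below. It assumes the identifiers of Table 3.1 are spelled with underscores (e.g., lfc_send_alloc()) and simply sends a single integer; send_counter_update() is a hypothetical client function.

#include <string.h>

void *lfc_send_alloc(int upcalls_allowed);
void  lfc_ucast_launch(unsigned dest, void *pkt, unsigned size,
                       int upcalls_allowed);

/* Send one integer to process 'dest' using LFC's packet interface. */
void send_counter_update(unsigned dest, int counter_value)
{
    /* Step 1: allocate a send packet in NI memory; upcalls are allowed
     * while LFC drains the network waiting for a free packet. */
    char *pkt = lfc_send_alloc(1);

    /* Step 2: fill the packet with programmed I/O. */
    memcpy(pkt, &counter_value, sizeof counter_value);

    /* Step 3: launch the packet; ownership returns to LFC. */
    lfc_ucast_launch(dest, pkt, sizeof counter_value, 1);
}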

Steps 1 and 3 may block due to the finite number of send packet buffers and the finite length of the NI's command queue. To avoid deadlock, LFC continues to receive packets while it waits for resources (see Section 3.1.4).

The send procedure avoids intermediate copying: clients can gather data from various places (registers, different memory regions) and use programmed I/O to store that data directly into send packets. An alternative is to use DMA transfers, which consume fewer processor cycles and require fewer bus transfers. For message sizes up to 1 Kbyte, however, programmed I/O moves data from host memory to NI memory faster than the NI's DMA engine (see Figure 2.2). Furthermore, programmed I/O can easily move data from any virtual address in the client's address space to the NI, while DMA can be used only for pinned pages.


The send procedure separates packet allocation and packet transmission. This allows clients to hide the overhead of packet allocation. The CRL implementation on LFC, for example, allocates a new send packet immediately after it has sent a packet. Since the CRL runtime frequently waits for a reply message after sending a request message, it can often hide the cost of allocating a send packet.

3.1.4 Receiving Packets

LFC copies incoming network packets from the NI to receive packets in host memory. LFC delivers receive packets to the receiving process by means of an upcall [38]. An upcall is a function that is defined in one software layer and that is invoked from lower-level software layers. Upcalls are used to perform application-specific processing at times that cannot be predicted by the application. Since LFC does not know what a client aims to do with its incoming packets, the client must supply an upcall function. This function, named lfc_upcall(), is invoked by LFC each time a packet arrives; it transfers ownership of a receive packet from LFC to the client. Usually, this function either copies the packet or performs a computation using the packet's contents. (The computation can be as simple as incrementing a counter.) The upcall's return value determines whether or not ownership of the packet is returned to LFC. If the client keeps the packet, then the packet must later be released explicitly by calling lfc_packet_free(), otherwise LFC recycles the packet immediately. The main advantage of this interface is that it does not force the client to copy packets that cannot be processed immediately. Panda exploits this property in its implementation of stream messages (see Section 6.3).
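A minimal client-side sketch of such an upcall is shown below; the header name lfc.h and the helpers process_small_message() and enqueue_for_later() are hypothetical, and the size threshold is arbitrary.

#include <string.h>
#include "lfc.h"   /* hypothetical header declaring lfc_upcall_flags_t etc. */

extern void process_small_message(unsigned src, int value);          /* hypothetical */
extern void enqueue_for_later(unsigned src, void *pkt, unsigned size,
                              lfc_upcall_flags_t flags);             /* hypothetical */

/* Client-supplied packet handler. Small packets are consumed on the
 * spot (return 0, so LFC recycles the packet); larger packets are kept
 * for later processing (return 1; the client must eventually call
 * lfc_packet_free()). */
int lfc_upcall(unsigned src, void *pkt, unsigned size, lfc_upcall_flags_t flags)
{
    if (size <= sizeof(int)) {
        int value;
        memcpy(&value, pkt, sizeof value);
        process_small_message(src, value);
        return 0;
    }
    enqueue_for_later(src, pkt, size, flags);
    return 1;
}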

Many systems based on active messages [148] invoke message handlers while draining the network. (Draining the network consists of moving packets from the network to host memory and is necessary to avoid congestion and deadlock.) The clients of such systems must be prepared to handle packets whenever draining can occur. Unfortunately, draining sometimes occurs at inconvenient times. Consider, for example, the case in which a message handler sends a message. If the system drains the network while sending this message, it may recursively activate the same message handler (for another incoming message). Since the system does not guarantee atomic handler execution, the programmer must protect global data against interleaved accesses by multiple handlers. Using a lock to turn the entire handler into a critical region leads to deadlock. The recursive invocation will block when it tries to enter the critical section occupied by the first invocation. The first invocation, however, cannot continue, because it is waiting for the recursive invocation to finish. Chapter 6 studies this upcall problem in more detail and compares several solutions.

LFC separates network draining and handler invocation [28]. Draining occurs whenever an LFC routine waits for resources (e.g., a free entry in the send queue), when the user polls, or when the NI generates an interrupt. During draining, however, LFC will invoke lfc_upcall() only if one of the following conditions is satisfied:

1. The client has not disabled network interrupts.

2. The client invokes lfc_poll().

3. The client invokes an LFC primitive with its upcalls_allowed parameter set to true (nonzero).

All LFC calls that potentially need to drain the network take an extra parameter named upcalls_allowed. If the client invokes a routine with upcalls_allowed set to false (zero), then LFC will drain the network if it needs to, but will not make any upcalls. If the parameter is true, then LFC will drain the network and make upcalls.

Separating draining and handler invocation is no panacea. If a client enables draining but disables upcalls, LFC must buffer incoming network packets. To prevent LFC from running out of receive packets the client must implement flow control. This is crucial, because there is nothing LFC, or any communication system, can do against clients that send an unbounded amount of data and do not consume that data. The system will either run out of memory or deadlock because it stops draining the network. Since LFC's current clients do not implement flow control (at least, not for all their communication primitives), they cannot disable upcalls and must therefore always be prepared to handle incoming packets.

With the current interface,clients cannotspecify that the network must bedrainedasynchronously, usinginterrupts,without LFC makingupcalls. If a pro-cessentersa phasein which it cannotprocessupcallsandin which it doesnotinvokeLFC primitives,thenthenetwork will notbedrained.TheMPI implemen-tation describedin Section7.4, for example,doesnot useinterrupts. Drainingoccursonly during the invocationof LFC primitives. Sincetheseprimitivesareinvokedonly whentheapplicationinvokesanMPI primitive, thereis no guaran-tee that a processwill frequentlydrain the network. If a processsendsa largemessageto aprocessengagedin a long-runningcomputation—i.e.,aprocessthatdoesnot drain— thenLFC’s internalflow-control mechanism(seeSection4.1)

Page 69: Communication Architectures for Parallel-ProgrammingSystems

3.2Key ImplementationAssumptions 47

will eventuallystall the sender, even if the receiver hasspaceto storeincomingpackets.

3.1.5 Synchronization

LFC provides an efficient fetch-and-add (F&A) primitive [127]. A fetch-and-add operation fetches the current value of a logically shared integer variable and then increments this variable. Since these actions are performed atomically, processes can use the F&A primitive to synchronize their actions (e.g., to obtain the next free slot in a shared queue).

LFC stores F&A variables in NI memory. The implementation provides one F&A variable per NI and initializes all F&A variables to zero. There is no function to set or reset F&A variables to a specific value, but such a function could easily be added.

Panda uses a single F&A variable to implement totally-ordered broadcasting (see Chapter 6). Totally-ordered broadcasting, in turn, is used by the Orca runtime system to update replicated objects in a consistent manner (see Section 7.1). Karamcheti et al. describe another interesting use of fetch-and-add in their paper on pull-based messaging [76]. In pull-based messaging, senders do not push message data to receivers. Instead, a sender transmits only a message pointer to the receiver. The receiver pulls in the message data when it is ready to receive. Pull-based messaging reduces contention.
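As an illustration of the first use, a client could obtain global sequence numbers from a single F&A variable, in the style of Panda's totally-ordered broadcast. The prototype of lfc_fetch_and_add() and the choice of NI 0 as the sequencer are assumptions; the text only states that the var argument identifies a per-NI variable and that the previous value is returned.

#define SEQUENCER 0   /* all processes use the F&A variable on NI 0 (assumption) */

unsigned next_broadcast_seqno(void)
{
    /* Atomic fetch-and-increment: concurrent callers obtain distinct,
     * consecutive sequence numbers, which imposes a total order. */
    return lfc_fetch_and_add(SEQUENCER);
}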

3.1.6 Statistics

Besides the functions listed in Table 3.1, LFC provides various statistics routines. LFC can be compiled such that it counts events such as packet transmissions, DMA transfers, etc. The statistics routines are used to reset, collect, and print statistics.

3.2 Key Implementation Assumptions

We have implemented LFC on the computer cluster described in Section 1.6. The implementation makes the following key assumptions:

1. The network hardware is reliable.

2. Multicast and broadcast are not supported in hardware.

3. Each host is equipped with an intelligent NI.

4. Interrupt processing is expensive.

Assumptions 1-3 pertain to network architecture. Among these three, the first assumption, reliable network hardware, is the most controversial. We assume that the network hardware neither drops nor corrupts network packets. With the exception of custom supercomputer networks (e.g., the CM-5 network [91]), however, network hardware is usually considered unreliable. This may be due to the network material (e.g., insufficient shielding from electrical interference) or to the network architecture (e.g., lack of flow control in the network switches).

The second assumption, no hardware multicast, is satisfied by most switched networks (e.g., ATM), but not by token rings and bus-based networks (e.g., Ethernet). We assume that multicast and broadcast services must be implemented in software. Efficient software multicast schemes use spanning tree protocols in which network nodes forward the messages they receive to their children in the spanning tree (see Section 2.7).

The third assumption, the presence of an intelligent NI, enables the execution of nontrivial protocol code on the NI. Intelligent NIs are used both in supercomputers (e.g., FLASH [85] and the IBM SP/2 [129]) and in cluster architectures (e.g., Myrinet and ATM clusters). NI-level protocols form an important part of LFC's implementation (see Chapter 4).

The fourth assumption, expensive interrupts, is related to host-processor architecture and operating system implementations. Modern processors have deep pipelines, which must be flushed when an interrupt is raised. In addition, commercial operating systems do not dispatch interrupts efficiently to user processes [143]. We therefore assume that interrupt processing is slow and that interrupts should be used only as a last resort.

3.3 Implementation Overview

LFC consists of a library that implements LFC's interface and a control program that runs on the NI and takes care of packet transmission, receipt, forwarding, and flow control. The control program implements the NI-level protocols described in Chapter 4. In addition, several device drivers are used to obtain efficient access to the network device. The library, the NI control program, and the device drivers are all written in C.

3.3.1 Myrinet

Myrinet implements hardware flow control on the network links between communicating NIs and has a very low bit-error rate [27]. If all NIs use a single, deadlock-free routing algorithm and are always ready to process incoming packets, then Myrinet will not drop any packets. In a single administrative domain of modest size, such as our Myrinet cluster, these conditions can be satisfied. Consequently, LFC assumes that Myrinet neither drops nor corrupts packets. The limitations of this approach are discussed in more detail in Section 3.9.

3.3.2 Operating System Extensions

All cluster nodes run the Linux operating system (Red Hat distribution 5.2, kernel version 2.0.36). To support user-level communication, we added several device drivers to the operating system.

Myricom's Myrinet device driver was modified so that at most one user at a time can open the Myrinet network device. When a user process opens the device, the driver maps the NI's memory contiguously into the user's address space, so the user process can read and write NI memory using normal load and store instructions (i.e., programmed I/O). The driver also processes all interrupts generated by the NI. Each time the driver receives an interrupt, it sends a SIGIO software signal to the process that opened the device.

At initialization time, LFC's library allocates a fixed number of receive packets from the client's heap. The pages on which these packets are stored are pinned by means of the mlock() system call. The client can specify the number of receive packets that the library is to allocate. To transfer data from its memory into the receive packets in host memory, the NI's control program needs the physical addresses of the receive packets. To obtain these addresses, we implemented a small pseudo device driver that translates a page's virtual address to the corresponding physical address. (A pseudo device driver is a kernel module with a device driver interface but without associated hardware.) LFC's library invokes this driver at initialization time to compute each packet's physical address.

Writes to device memory (e.g., NI memory) are normally not cached by the writing processor, because the data written to device memory is unlikely to be read back again by the processor. Moreover, in the case of write-back caching, the device will not observe the writes until they happen to be flushed from the cache. Disabling caching for device memory, however, reduces performance, because a bus transaction must be set up for each word that is written to the device. Write-back caches, in contrast, write complete cache lines to memory, which is beneficial when a large contiguous chunk of data is written.

To speed up writes to NI memory, LFC uses the Pentium Pro's ability to set the caching properties of specific virtual memory ranges [71]. A small pseudo device driver enables write combining for the virtual memory range to which the NI's memory is mapped. The Pentium Pro combines all writes to a write-combining memory region in on-processor write buffers. Writes are delayed until the write buffer is full or until a serializing instruction is issued. By aggregating writes to NI memory, the processor can transfer these writes in larger bursts to the I/O bus, which reduces the number of I/O bus arbitrations.

As discussed in Section 2.2, the use of write combining considerably speeds up programmed I/O data transfers from host memory to NI memory. The main disadvantage of applying write combining to a memory region is that writes to that region may be reordered. Two successive writes are executed in program order only if they are separated by a serializing instruction. When necessary, we use the Pentium Pro's atomic increment instruction to order writes.

3.3.3 Library

LFC's library implements all routines listed in Table 3.1. The library manages send and receive buffers, communicates with the NI control program, and delivers packets to clients. Host-NI communication takes place through shared variables in NI memory, through DMA transfers between host and NI memory, and through signals generated by the kernel in response to NI interrupts.

3.3.4 NI Control Program

The main task of the NI control program is to send and receive packets. Outgoing and incoming packets use various hardware and software resources as they travel through the NI. The control program acquires and activates these resources in the right order for each packet. If the control program cannot acquire the next resource that a packet needs, then it queues the packet on a resource-specific (software) resource queue. All resources listed in Table 3.2, except the host receive buffers, have an associated resource queue.

Three hardware resources are available for packet transfers: the send DMA engine, the receive DMA engine, and the host/NI DMA engine. These engines operate asynchronously: typically, the NI processor starts a DMA transfer and continues with other work, periodically checking if the DMA engine has finished its transfer. (Alternatively, a DMA engine can generate an interrupt when it has completed a transfer. LFC's control program, however, does not use interrupts.)

Type      Resource             Description
Hardware  Send DMA engine      Transfers packets from NI memory to the network.
          Receive DMA engine   Transfers packets from the network to NI memory.
          Host/NI DMA engine   Mainly used to transfer packets from NI memory to
                               host memory. Also used to retrieve and update
                               status information.
Software  Send credits         A send credit represents a data packet buffer in
                               some NI's memory. For each destination NI, there
                               is a separate resource queue.
          Host receive buffer  A receive buffer in host memory.

Table 3.2. Resources used by the NI control program.

Software resources are used to implement flow control, both between communicating NIs and between an NI and its host. Send credits represent buffer space in some NI's memory and are used to implement NI-to-NI flow control (see Section 4.1). Before an NI can send a data packet to another NI, it must acquire a send credit for that NI. Similarly, before an NI can copy a data packet to host memory, it must obtain a host buffer. Host buffer management is described in Section 3.6.

To utilize all resources efficiently, LFC's control program is structured around resources rather than packets. The program's main loop polls several work queues and checks if resources are idle. When the control program finds a task on one of its work queues, it executes the task up to the point where the task needs a resource that is not available. The task is then appended to the appropriate resource queue. When the control program finds that a resource is idle, it checks the resource's queue for new work. If the queue is nonempty, a task is dequeued and the resource is put back to use again.
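The structure just described can be summarized by the following sketch. The type and function names (Task, Resource, poll_work_queues(), run_until_blocked(), and the queue helpers) are assumptions chosen for illustration; they do not correspond to the actual control-program source.

typedef struct Task Task;
typedef struct Resource {
    int   idle;          /* nonzero if the resource (e.g., a DMA engine) is free */
    Task *queue;         /* tasks blocked on this resource */
} Resource;

extern Resource resources[];               /* send DMA, receive DMA, host DMA, credits, ... */
extern unsigned nr_resources;

Task     *poll_work_queues(void);          /* e.g., new send requests from the host */
Resource *run_until_blocked(Task *task);   /* returns the resource the task blocks on, or NULL */
void      enqueue_task(Task **q, Task *task);
Task     *dequeue_task(Task **q);

void control_program_main_loop(void)
{
    for (;;) {
        Task *task = poll_work_queues();
        if (task != NULL) {
            /* Run the task until it needs an unavailable resource,
             * then park it on that resource's queue. */
            Resource *blocked_on = run_until_blocked(task);
            if (blocked_on != NULL)
                enqueue_task(&blocked_on->queue, task);
        }

        /* Put idle resources back to work with queued tasks. */
        for (unsigned i = 0; i < nr_resources; i++) {
            Resource *r = &resources[i];
            if (r->idle && r->queue != NULL) {
                Task *t = dequeue_task(&r->queue);
                Resource *next = run_until_blocked(t);
                if (next != NULL)
                    enqueue_task(&next->queue, t);
            }
        }
    }
}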

The resource-centric structure optimizes resource utilization. In particular, all DMA engines can be active simultaneously. By allowing the receive DMA of one packet to proceed concurrently with the host DMA of another packet, we pipeline packet transfers and obtain better throughput. A packet-centric program structure would guide a single packet completely through the NI before starting to work on another packet. With this approach, the control program uses at most one DMA engine at a time, leaving the other DMA engines idle.

The resource-centric approach increases latency, because packets are enqueued and dequeued each time they use some resource. To attack this problem, LFC's control program contains a fast path at the sending side, which skips all queueing operations if all resources that a packet needs are available. We also experimented with a receiver-side fast path, but found that the latency gains with this extra fast path were small. To avoid code duplication, we therefore removed this fast path.

The control program acts in response to three event types:

1. send requests

2. DMA transfer completions

3. timer expiration

The control program's main loop polls for these events and processes them.

Send requests are created by the host library in response to a client invocation of one of the packet launch routines. Each send request is stored (using programmed I/O) in a queue in NI memory. The control program's main loop polls this queue for new requests.

DMA transfers are used to send and receive packets and to move received packets to host memory. DMA transfer completion is signaled by means of bits in the NI's interrupt status register. When a packet transfer completes, the control program moves the packet to its next processing stage and checks if it can put the DMA engine to work again. The control program always restarts the receive DMA engine so that it is always prepared for incoming packets.

Myrinet's NI contains a timer with a granularity of 0.5 µs. The control program uses this timer to delay interrupts (see Section 4.4). Timer expiration is signaled by means of certain bits in the NI's interrupt status register.

3.4 Packet Anatomy

LFC uses several packet types to carry user and control data across the network. As shown in Figure 3.1, each packet consists of a 12-byte header and a 1024-byte user data buffer. The buffer stores the data that a client wishes to transfer. LFC does not interpret the contents of this buffer. When LFC allocates a send packet (in lfc_send_alloc()) or passes a receive packet to lfc_upcall(), the client receives a pointer to the data buffer. Clients must neither read nor write the packet header, but this is not enforced.

Fig. 3.1. LFC's packet format: a 12-byte header containing the Myrinet tag (2 bytes), the LFC tag, and the source, size, credits, tree, and deadlock channel fields, followed by a 1024-byte user data buffer.

Packet class  LFC tag         Function
Control       CREDIT          Explicit credit update
              CHANNEL_CLEAR   Deadlock recovery
              FA_REPLY        Fetch-and-add reply
Data          UCAST           Unicast packet
              MCAST           Multicast packet
              FA_REQUEST      Fetch-and-add request

Table 3.3. LFC's packet tags.

The header's tag field consists of a 2-byte Myrinet tag. Every Myrinet communication system can obtain its own tag range from Myricom. This tag range is different from the tag ranges assigned to other registered Myrinet communication systems. NIs can use the Myrinet tag to recognize and discard misrouted packets from different communication systems. All LFC packets carry Myrinet tag 0x0450.

LFC uses an additional tag field to identify different types of LFC packets. Unicast and multicast packets, for example, use different tags, because the control program treats unicast and multicast packets differently; multicast packets are forwarded, whereas unicast packets are not. Table 3.3 lists the most important packet tags.

We distinguish between control packets and data packets. Control packets require only a small amount of NI-level processing. Unlike data packets, control packets are never queued; when the NI receives a control packet, it processes it immediately and then releases the buffer in which the packet arrived. Two packet buffers suffice to receive and process all control packets. A second buffer is needed because the control program always restarts the receive DMA engine (with a free packet buffer as the destination address) before it processes the packet just received.

Data packets go through more processing stages and use more resources. Each UNICAST packet, for example, needs to be copied to host memory, but this can be done only when a free host buffer and the DMA engine are available. When either resource is unavailable, the control program queues the packet and starts working on another packet. Since some resources (e.g., host buffers) may be unavailable for an unpredictable amount of time, the control program may have to buffer many data packets. The total number of packets that needs to be buffered is bounded by LFC's flow control protocol, which stalls senders when receiving NIs run out of free packet buffers.

The source field in the packet header identifies the sender of a packet. At LFC initialization time, the client passes to each LFC process the number of participating processes and assigns a unique process identifier to each process. Since LFC allows only one process per processor, this process identifier also identifies the processor and the NI. Each time an NI transmits a packet, LFC stores the NI's identifier in the source field.

The user data part of send packets has a maximum size of 1024 bytes, but clients need not fill the entire data part. The size field specifies how many bytes have been written into the user data part of a packet. These bytes are called the valid user bytes and they must be stored contiguously at the beginning of a packet. When LFC transfers a packet, it transfers the packet header and the valid user bytes, but not the unused bytes at the end of the packet. This yields low latency for small messages and high throughput for large messages.

The remaining header fields are described in detail in Chapter 4. The credits field is used to piggyback flow control information to the packet's destination. The tree field identifies multicast trees. The deadlock channel field is used during deadlock recovery.

LFC's outgoing packets are tagged (in hardware) with a CRC that is checked (in software) by LFC's control program at the receiving side. CRC errors are treated as catastrophic and cause LFC to abort. Lost packets are not detected.

3.5 Data Transfer

LFC transfers data in packets which are allocated from three packet buffer pools. Each NI maintains a send buffer pool and a receive buffer pool. Hosts maintain only a receive buffer pool; all buffers in this pool are stored in pinned memory. The data transfer path in Figure 3.2 for unicast packets shows how these three pools are used.

At the sending side, the client allocates send packets from the send buffer pool in NI memory (transfer 1 in Figure 3.2). Programmed I/O is used to copy client data directly, without operating system intervention, into a send packet. The client calls one of LFC's launch routines, which writes a send descriptor into the NI's send queue (see Figure 3.3). The send descriptor contains the packet's destination and a reference to the packet. The NI, which polls the send queue, inspects the send descriptor. When credits are available for the specified destination, the descriptor is moved to the transmit queue; the packet will be transmitted when the descriptor reaches the head of this queue (transfer 2 in Figure 3.2). After the packet has been transmitted, it is returned to the NI's send packet pool.

Fig. 3.2. LFC's data transfer path: (1) programmed I/O from the sender's address space to an NI send packet, (2) network send and receive DMA transfers between the NIs, and (3) a DMA transfer from the receiving NI to a pinned host receive packet in the receiver's address space.

When no send credits are available, the descriptor is added to a blocked-sends queue associated with the packet's destination. When credits flow back from some destination, descriptors are moved from that destination's blocked-sends queue to the transmit queue. The details of the credit algorithm are described in Chapter 4.

The destination NI receives the incoming packet into a free packet buffer allocated from its receive buffer pool. This pool is partitioned evenly among all NIs: each NI owns a fixed number of the buffers in the receive pool and can fill these buffers (and no more) by sending packets. Each buffer corresponds to one send credit and filling a buffer corresponds to consuming a send credit.

Each host is responsible for passing the addresses of free host receive buffers to its NI. These addresses are stored in a shared queue in NI memory (see Section 3.6). As soon as a free host receive buffer is available, the NI copies the packet to host memory (transfer 3 in Figure 3.2). After this copy, the NI receive buffer is returned to its pool. When no host receive buffers are available, the NI simply does not copy packets to the host and does not release any of its receive buffers. Eventually, senders will be stalled and remain stalled until the receiving host posts new buffers.

Fig. 3.3. LFC's send path and data structures: client data is copied into a packet from the send buffer pool, a send descriptor is placed in the send queue, and the NI moves descriptors to the transmit queue or to a per-destination blocked-sends queue.

Each host receive buffer contains a status field that indicates whether the control program has copied data into the buffer. The host uses this field to detect incoming messages. Copying a packet from an NI receive buffer to a host receive buffer involves two actions: copying the packet's payload to host memory and setting the status field in the host buffer to FULL. For efficiency, LFC folds both actions into a single DMA transfer. This is illustrated in Figure 3.4, which also shows a naive implementation that uses two DMA transfers to copy the packet's payload and status field. To avoid transferring the unused part of the receive packet's data field, the status field must be stored right after the packet's payload. (If it were stored in front of the data, the host could believe it received a packet before all of that packet's data had arrived.) The position of the FULL word therefore depends on the size of the packet's payload. The host, however, needs to know in advance where to poll for the status field. The control program therefore transfers the packet's payload and the FULL word to the end of the host receive buffer. In addition, the control program transfers a word that holds the payload's displacement relative to the beginning of the host's receive buffer.
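A sketch of the resulting host buffer layout and the host-side check is shown below. The structure layout, field names, and sizes are assumptions for illustration; the real library derives the same information from the displacement word written by the NI.

#include <stdint.h>

#define PACKET_SIZE 1024          /* maximum payload size */
#define EMPTY 0
#define FULL  1

typedef struct {
    uint8_t  data[PACKET_SIZE];   /* payload is placed at the END of this area */
    uint32_t status;              /* written by the same DMA transfer as the payload, */
                                  /* so it becomes FULL only after the payload arrived */
    uint32_t displacement;        /* offset of the payload within data[] */
} HostRecvBuffer;

/* Host-side poll: has the NI filled this buffer yet? */
static inline int buffer_filled(const volatile HostRecvBuffer *buf)
{
    return buf->status == FULL;   /* the status word lives in cached host memory */
}

/* Locate the payload once the buffer is full. */
static inline const uint8_t *payload(const HostRecvBuffer *buf, unsigned *size)
{
    *size = PACKET_SIZE - buf->displacement;
    return &buf->data[buf->displacement];
}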

Once filled, the host buffer is passed to the client-supplied upcall routine. The client decides when it releases this buffer. Once released, the buffer is returned to the host receive buffer pool.

Multicast packets follow the same path as unicast packets, but may additionally be forwarded by receiving NIs to other destinations. An NI receive buffer that contains a multicast packet is not returned to its pool until the multicast packet has been forwarded to all forwarding destinations.

Fig. 3.4. Two ways to transfer the payload and status flag of a network packet: a simple but expensive manner that uses separate data and status DMA transfers, and the optimized transfer that combines the payload, the FULL status word, and the displacement field in a single DMA to the end of the host packet buffer.

3.6 Host Receive Buffer Management

Each host has a pool of free host receive buffers. The client specifies the initial number of buffers in this pool. At initialization time, the library obtains the physical address of each buffer in the pool and stores this address in the buffer's descriptor. (To be precise, we obtain the bus address. This is a device's address for a particular host memory location. On some architectures, this address can be different from the physical address used by the host's CPU, but on Intel's x86 architecture the bus address and the physical address are identical.)

Besides the pool, the library manages a receive queue of host receive buffers (see Figure 3.5). The library maintains three queue pointers. Head points to the first FULL, but unprocessed packet buffer. This is the next buffer that will be passed to lfc_upcall(). Empty points to the first EMPTY receive buffer. This is the next buffer that the library expects to be filled by the NI. Tail points to the last buffer in the queue. When the library allocates new receive buffers from the pool, it appends them to the receive queue and advances tail.

Fig. 3.5. LFC's receive data structures: the host receive queue with its head, empty, and tail pointers in host memory, and the shared queue in network interface memory.

Each time the library appends a buffer to its receive queue, it also appends the physical address of the buffer to a shared queue in NI memory. When the NI control program receives a network packet, it tries to fetch an address from the shared queue and copy the packet to this address. When the shared queue is empty, the control program defers the transfer until the queue is refilled by the host library.

It is not necessary to separate the receive queue and the pool, but doing so often reduces the memory footprint of the host receive buffers. In most cases, we keep only a small number of buffers on the receive queue. As long as the client is able to use and reuse these buffers, only a small portion of the processor's data cache is polluted by host receive buffers. In some cases, however, clients need more buffers, which are then allocated from the pool and added to the receive queue. In those cases, a larger portion of the cache will be polluted by host receive buffers.

When the library needs to drain (see Section 3.1.4), it performs two actions:

1. If the packet pointed to by empty has been filled, the library advances empty. If empty cannot be advanced (because all packets are full) and upcalls are not allowed, then the library appends a new buffer from the pool to both queues.

2. If head does not equal empty and upcalls are allowed, the library dequeues head and calls lfc_upcall(), passing head as a parameter. Head is then advanced. If there are no buffers left in the receive queue, then the library appends a new buffer from the pool to both queues before calling lfc_upcall().

Packets freed by clients are appended to both queues, unless the queue already contains a sufficient number of empty packets. In the latter case, the packet is returned to the buffer pool.
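A simplified sketch of the two drain actions is given below. The linked-list representation and helper names are assumptions; the real library additionally refills the shared address queue in NI memory whenever it posts a buffer, and handles the end-of-queue cases more carefully.

typedef struct HostBuf {
    struct HostBuf *next;
    volatile int    status;              /* EMPTY or FULL, written by the NI */
    /* ... payload, displacement ... */
} HostBuf;

extern HostBuf *head, *empty;            /* receive-queue pointers */
extern HostBuf *pool_get(void);          /* take a buffer from the pool */
extern void     append_to_queues(HostBuf *b);   /* receive queue + shared NI queue */
extern void     deliver(HostBuf *b);     /* wraps the call to lfc_upcall() */

void drain(int upcalls_allowed)
{
    /* Action 1: advance 'empty' past a buffer the NI has filled; if no
     * empty buffer remains and upcalls are not allowed, post a fresh one. */
    if (empty != NULL && empty->status == FULL)
        empty = empty->next;
    if (empty == NULL && !upcalls_allowed)
        append_to_queues(pool_get());

    /* Action 2: if upcalls are allowed, deliver at most one full buffer. */
    if (head != empty && upcalls_allowed) {
        HostBuf *b = head;
        head = head->next;
        if (head == NULL)                /* receive queue exhausted */
            append_to_queues(pool_get());
        deliver(b);
    }
}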

3.7 Message Detection and Handler Invocation

LFC delivers packets to a receiving process by invoking lfc_upcall(). This routine is either invoked synchronously, in the context of a client call to a draining LFC routine, or asynchronously, in the context of a signal handler. Here, we discuss the implementation of both mechanisms.

Each invocation of lfc_poll() checks if the control program has copied a new packet to host memory. If lfc_poll() finds a new packet, it invokes lfc_upcall(), passing the packet as a parameter. To allow fair scheduling of the client computation and the packet handler, lfc_poll() delivers at most one network packet.

The check for the next packet is implemented by reading the status field of the next undelivered packet buffer (empty in Figure 3.5). When this buffer's address was passed to the control program, its status field was set to EMPTY. The poll succeeds when the host detects that this field has been set to FULL. This polling strategy is efficient, because the status field resides in host memory and will be cached (see Section 2.5).

Asynchronous packet delivery is implemented cooperatively by LFC's control program, the Myrinet device driver, and LFC's library. In principle, LFC's control program generates an interrupt after it has copied a packet to host memory. However, since the receiving process may poll, the control program delays its interrupts. Myrinet NIs contain an on-board timer with 0.5 µs granularity. When a packet arrives, the control program sets this timer to a user-specified timeout value (unless the timer is already running). An interrupt is generated only if the host fails to process a packet before the timer expires. The details of this polling watchdog mechanism are described in Chapter 4.

Interrupts generated by the control program are received by the operating system kernel, which passes them to the Myrinet device driver. The driver transforms the interrupt into a user-level software signal and sends this network signal to the client process. These signals are normally caught by the LFC library. The library's signal handler checks if the client is running with interrupts enabled. If this is the case, the signal handler invokes lfc_poll() to process pending packets; otherwise it returns immediately.

Recall that clients may dynamically disable and enable network interrupts. For efficiency, the interrupt status manipulation functions are implemented by toggling an interrupt status flag in host memory, without any communication or synchronization with the control program. Before sending an interrupt, the control program fetches (using a DMA transfer) the interrupt status flag from host memory. No interrupt is generated when the flag indicates that the client disabled interrupts.

Since the library does not synchronize with the control program when it manipulates the interrupt status flag, two races can occur:

1. The control program may raise an interrupt just after the client disabled interrupts. Unless precautions are taken, the signal handler will invoke the packet handler even though the client explicitly disabled interrupts. To solve this problem, LFC's signal handler always checks the interrupt status flag before invoking lfc_poll() and ignores the signal if the race occurred. In practice, this race does not occur very often, so the advantage of inexpensive interrupt manipulation outweighs the cost of spurious interrupts.

2. The control program may decide not to raise an interrupt just after the client (re-)enabled interrupts. This race is fatal for clients that rely on interrupts for correctness. To avoid lost interrupts, the control program continues to generate periodic interrupts for a packet until the host has processed that packet (see Chapter 4).
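The host side of this scheme can be sketched as follows. The flag and function names are assumptions for illustration; on the NI side, the control program reads the flag from host memory with a DMA transfer before raising an interrupt.

#include <signal.h>

extern void lfc_poll(void);   /* assumed signature */

static volatile sig_atomic_t interrupts_enabled = 1;

void lfc_intr_disable(void) { interrupts_enabled = 0; }  /* no NI synchronization */
void lfc_intr_enable(void)  { interrupts_enabled = 1; }

static void network_signal_handler(int sig)
{
    (void)sig;
    /* Race 1: the NI may have raised this interrupt just after the client
     * disabled interrupts; in that case the signal is simply ignored. */
    if (!interrupts_enabled)
        return;
    lfc_poll();   /* deliver pending packets via lfc_upcall() */
}

/* Race 2 (an interrupt suppressed just after re-enabling) is handled on the
 * NI: the control program keeps generating periodic interrupts for a packet
 * until the host has processed it, so no notification is lost. */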

3.8 Fetch-and-Add

The F&A implementation is straightforward. Every NI stores one F&A variable, which is identified by the var argument in lfc_fetch_and_add() (see Table 3.1). Lfc_fetch_and_add() performs a remote procedure call to the NI that stores the F&A variable. The process that invoked lfc_fetch_and_add() allocates a request packet, tags it with tag FA_REQUEST, and appends it to the NI's host send queue.

When the destination NI receives the request, it increments its F&A variable and stores the previous value in a reply packet. The reply packet, a control packet tagged FA_REPLY, is then appended to the NI's transmit queue.

The implementation assumes that each participating process can issue at most one F&A request at a time. To allow multiple outstanding requests, we need a form of flow control that prevents the NI that processes those requests from running out of memory as it creates and queues FA_REPLY packets. FA_REQUEST packets consume credits, but those credits are returned as soon as the request has been received and possibly before a reply has been generated.

When the reply arrives at the requesting NI, the control program copies the F&A result from the reply to a location in NI memory that is polled by lfc_fetch_and_add(). When lfc_fetch_and_add() finds the result, it returns it to the client.
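The NI-side handling of a request can be sketched as follows, in the style of the protocol code in Chapter 4. The helpers alloc_control_packet() and reply_value() are assumptions; only the tags and the fetch-then-increment behavior are taken from the description above.

static unsigned fa_variable;            /* the single F&A variable on this NI */

void handle_fa_request(Packet *req)
{
    Packet *reply = alloc_control_packet();

    reply->hdr.tag = FA_REPLY;
    reply_value(reply, fa_variable);    /* the old value travels back to the requester */
    fa_variable++;                      /* the control program handles requests one at a
                                         * time, so fetch and increment are atomic */

    transmit(req->hdr.src, reply);      /* appended to the NI's transmit queue */
}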

This implementation is simple and efficient. The FA_REQUEST packet is handled entirely by the receiving NI, without any involvement of the host to which the NI is attached. This is particularly important when processes issue many F&A requests. Many of these requests would either be delayed or generate interrupts if they had to be handled by the host.

The implementation uses a data packet for the request and a control packet for the reply. The request is a data packet only because the current implementation of LFC does not allow hosts to send control packets. Although this is easy to add, it will increase the send overhead for all packets, because the NI will have to check if the host sends a data or a control packet.

3.9 Limitations

The implementation of LFC on Myrinet makes a number of simplifying assumptions, some of which limit its applicability in a production environment. We distinguish between functional limitations and security violations.

3.9.1 Functional Limitations

LFC's functional limitations are all related to the assumption that LFC will be used in a homogeneous, dedicated computing environment.

Homogeneous Hosts

LFC assumes that all host processors use the same byte order. Adding byte swapping for packet headers is straightforward and will add little overhead. (It is the client's responsibility to convert the data contained in network packets; LFC does not know what type of data clients store in their packets.) Since our cluster's host processors are little-endian and the NI is big-endian, some byte swapping is already used for data that needs to be interpreted both by the host processor and the NI. This concerns mainly shared descriptors and packet headers; user data contained in packets is not interpreted by LFC and is never byte-swapped.

Device Sharing

LFC can service at most one user process per host, mainly because LFC's control program does not separate the packets from different users. In our current environment (DAS), the Myrinet device driver returns an error if a second user tries to obtain access to the Myrinet device.

Multiple users can be supported by giving different users access to different parts of the NI's memory. This can be achieved by setting up appropriate page mappings between the user's virtual address space and the NI's memory. If this is done, the NI's control program must poll multiple command queues located on different memory pages, which is likely to have some impact on performance.

Newer Myrinet network interfaces provide a doorbell mechanism. This mechanism detects host processor writes to certain regions and appends the target address and value written to a FIFO queue. With this doorbell mechanism, the NI need not poll all the pages that different processes can write to; it need only poll the FIFO queue.

Recall that LFC partitions the available receive buffer space in NI memory among all participating nodes. If the NI's memory is also partitioned to support multiprogramming, then the available buffer space per process will decrease noticeably. Each process will be given fewer send credits per destination process, which reduces performance. A simple, but expensive solution is to buy more memory. However, this may only be possible to a limited extent. A less expensive and more scalable solution is to employ swapping techniques such as those described in Section 2.4.

Fixed Process Configuration

LFC assumes that all of an application's processes are known at application startup time; users cannot add processes to or remove processes from the initial set of processes. The main obstacle to adding processes dynamically is that LFC evenly divides all NI receive buffers (or rather, the send credits that represent them) among the initial set of processes. When a process is added, send credits must be reallocated in a safe way.

Reliable Network

LFC assumes that the network hardware never drops or corrupts packets. While LFC tests for CRC errors in incoming packets, it does not recover from such errors. Lost packets are not detected at all.

A retransmission mechanism would allow an LFC application to tolerate transient network failures. In addition, retransmission allows communication protocols to drop packets when they run out of resources. With retransmission, LFC would not need to partition NI receive buffers among all senders. Chapter 8 considers alternative implementations which do retransmit dropped and corrupted packets.

3.9.2 Security Violations

LFC does not limit a user process's access to the NI and is therefore not secure. Any user process with NI access can crash any other process or even the kernel. With a little kernel support, though, LFC can be made secure; the techniques to achieve secure user-level communication are well-known. Moreover, LFC's send and receive methods require relatively simple and inexpensive security mechanisms. Below, we describe the security problems in more detail. Where necessary, we also provide solutions. These solutions have not been implemented.

Network Interface Access

LFC (or rather, the Myrinet device driver used by LFC) allows an arbitrary process to map all NI memory into its address space and does not restrict access to this memory. As a result, users can freely modify the NI control program. Once modified, the control program can both read and write all host memory locations.

Access to the NI can be restricted by mapping only part of the NI's memory into a process's address space. The process is not given any access to the control program or to another process's pages. LFC can easily be modified to work with a device driver that implements this type of protection and the expected performance impact is small. The page mappings can be set up at application initialization time so that the cost of setting up these mappings is not incurred on the critical communication path.

Address Translation

The LFC library passes the physical memory addresses of free host receive buffers to the control program. Since the control program does not check if these addresses are valid, a malicious program can pass an illegal address, which the control program will use as the destination of a DMA transfer.

A simple solution is to have all host receive buffers allocated by a trusted kernel module. This module assigns to each buffer an index (a small integer) and maps this index to the buffer's physical address. The mapping is passed to the NI, but not to the user process. The user process, i.e., the LFC library, is only given each buffer's virtual address and its index. Instead of passing a buffer's physical address to the NI, the library will pass the buffer's index. The NI control program can easily check if the index it receives is valid. This way, DMA transfers to host memory can only target the buffers allocated by the trusted kernel module.
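A sketch of the NI-side check implied by this (unimplemented) scheme is shown below. The table and helper names are illustrative assumptions.

#define NR_HOST_BUFFERS 256

extern void enqueue_dma_destination(unsigned long phys_addr);

/* Filled in by the trusted kernel module with the physical addresses of
 * the buffers it allocated; the untrusted library only ever passes indices. */
static unsigned long buffer_addr[NR_HOST_BUFFERS];

/* Called by the NI control program when the host posts a receive buffer. */
int post_host_buffer(unsigned index)
{
    if (index >= NR_HOST_BUFFERS)
        return -1;                                    /* reject invalid indices */
    enqueue_dma_destination(buffer_addr[index]);      /* only trusted addresses reach the DMA engine */
    return 0;
}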

Another solution is to use a secure address translation mechanism such as VMMC-2's user-managed TLB [49] (see Section 2.3). This scheme is more involved, but supports dynamic pinning and unpinning of any page in the user's address space, whereas the scheme above would pin all host receive buffers once and for all.

Foreign Packets

LFC's control program makes no serious attempt to reject network packets that originate from another application. LFC rejects only packets that do not carry LFC's Myrinet tag or that carry an out-of-range LFC tag. Consequently, malicious or faulty programs can send a packet to any other application as long as the packet carries tags known by LFC. The Hamlyn system [32] provides each parallel application's process with a shared random bit pattern that is appended to each message and verified by the receiver. The bit pattern is sufficiently long that an exhaustive search for its value is computationally infeasible.

3.10 Related Work

The design of LFC was motivated in part by the shortcomings of existing systems. Since many of these systems were discussed in Chapter 2, we will not treat them all again here.

LFC uses optimistic interrupt protection [134] to achieve atomicity with respect to network signals. This technique was used in the Mach operating system kernel to reduce the overhead of hardware interrupt-mask manipulation. With this technique, interrupt-masking routines do not truly disable interrupts, but manipulate a software flag that represents the current interrupt mask. It is the responsibility of the interrupt handler to check this flag. If the flag indicates that interrupts are not allowed, the interrupt handler creates an interrupt continuation, some state that allows the interrupt-handling code to be executed at a later time. LFC uses the same technique, but does not need to create interrupt continuations, because the NI will automatically regenerate the interrupt.

Some systems emulate asynchronous notification using a background thread that periodically polls the network [54]. The problem is to find a good polling frequency. Also, adding a thread to an otherwise single-threaded software system is often difficult.

A similar idea is used in the Message-Passing Library (MPL) for the IBM SP/2. This library uses periodic timer interrupts to allow the network to be drained even if senders and receivers are engaged in long-running computations [129]. The main difference is that MPL's timeout handler does not invoke user code, because MPL does not provide an asynchronous notification interface.

Another emulation approach is to use a compiler or binary rewriter to insert polls automatically into applications. Shasta [125], a fine-grain DSM, uses binary instrumentation. In a simulation study [114], Perkovic and Keleher compare automatic insertion of polls by a binary rewriter and the polling-watchdog approach. Their simulation results, obtained in the context of the CVM DSM, indicate that both approaches work well and perform better than an approach that relies exclusively on polling during message sends.

3.11 Summary

In this chapter, we have presented the interface and implementation of LFC, a new user-level communication system designed to support the development of parallel-programming systems. LFC provides reliable point-to-point and multicast communication, interrupt management, and synchronization through fetch-and-add. LFC's interface is low-level. In particular, there is no message abstraction: clients operate on send and receive packets.

This chapter has described the key components of LFC's implementation on Myrinet (library, device drivers, and NI control program). We have described LFC's data transfer path, buffer management, control transfer, and synchronization. The NI-level communication protocols, which form an important part of the implementation, are described in the next chapter.

Chapter 4

Core Protocols for Intelligent Network Interfaces

This chapter gives a detailed description of LFC's NI-level protocols: UCAST, MCAST, RECOV, and INTR. The UCAST protocol implements reliable point-to-point communication between NIs. UCAST assumes reliable network hardware and preserves this reliability using software flow control implemented at the data link level, between NIs. LFC is named after this property: Link-level Flow Control.

Link-level flow control also forms the basis of MCAST, an NI-level spanning tree multicast protocol. MCAST forwards multicast packets on the NI rather than on the host; this prevents unnecessary reinjection of packets into the network and removes host processing (in particular interrupt processing and copying) from the critical multicast path.

MCAST works for a limited, but widely used class of multicast trees (e.g., binary and binomial trees). For trees not in this class, MCAST may deadlock. The RECOV protocol extends MCAST with a deadlock recovery protocol that allows the use of arbitrary multicast trees.

UCAST, MCAST, and RECOV are concerned with data transfer between NIs. The last protocol described in this chapter, INTR, deals with NI-to-host control transfer. INTR reduces the overhead of network interrupts by delaying network interrupts.

This chapter is structured as follows. Sections 4.1 through 4.4 discuss, respectively, UCAST, MCAST, RECOV, and INTR. Section 4.5 discusses related work.


4.1 UCAST: Reliable Point-to-Point Communication

The UCAST protocol implements reliable and FIFO point-to-point transfers of packets between NIs. Since we assume that the hardware does not drop or corrupt packets (see Section 3.2), the only cause of packet loss is lack of buffer space on receiving nodes (both NIs and hosts). This chapter deals only with NI-level buffer overflow; host-level buffering was discussed in Section 3.6.

UCAST uses a variant of sliding window flow control between each pair of NIs to prevent NI-level buffer overflows. The protocol guarantees that no NI will send a packet to another NI before it knows the receiving NI can store the packet. To achieve this, each NI assigns an equal number of its receive buffers to each NI in the system. An NI S that needs to transmit a packet to another NI R may do so only when it has at least one send credit for R. Each send credit corresponds to one of R's free receive buffers for S. Each time S sends a packet to R, S consumes one of its send credits for R. Once S has consumed all its send credits for R, S must wait until R frees some of its receive buffers for S and returns new send credits to S. Send credits are returned by means of explicit acknowledgements or by piggybacking them on application-level return traffic.

4.1.1 Protocol Data Structures

Figure 4.1 shows the types and variables used by UCAST. All variables are stored in NI memory. UCAST uses two types of network packets: UNICAST packets carry user data; CREDIT packets are used only to return credits to a sender and do not carry any data. Each packet carries a header of type PacketHeader that identifies the packet's type (tag) and sender (src). In addition, the protocol uses a credits field in the header to return send credits to an NI.

Transmission requests are stored in send descriptors of type SendDesc that identify the packet to be transmitted (packet) and the destination NI (dest). Multiple send descriptors can refer to the same packet; this is important for the multicast protocol, which can forward a single packet to multiple destinations.

Each NI maintains protocol state for each other NI. This state is stored in a protocol control block of type ProtocolCtrlBlock. The protocol state for a given NI consists of the number of remaining send credits for that NI (send_credits), a queue of blocked transmission requests (blocked_sends), and the number of credits that can be returned to that NI (free_credits).

Finally, each NI maintains several global variables. Each NI stores its unique id in LOCAL_ID, the number of NIs in the system in NR_NODES, and a credit refresh threshold in CREDIT_REFRESH. The refresh threshold is used by receivers to determine when credits must be returned to the sender that consumed them. The protocol control blocks are stored in array pcbtab. Variable nr_pkts_received counts how many packets an NI has received; this variable is used by the INTR protocol (see Section 4.4).

typedef enum { UNICAST, CREDIT } PacketType;

typedef struct {
    PacketType tag;        /* packet type */
    unsigned   src;        /* sender of packet */
    unsigned   credits;    /* piggybacked credits */
} PacketHeader;

typedef struct {
    PacketHeader hdr;
    Byte         data[PACKET_SIZE];
} Packet;

typedef struct {
    unsigned dest;
    Packet  *packet;
} SendDesc;

typedef struct {
    unsigned      send_credits;
    SendDescQueue blocked_sends;
    unsigned      free_credits;
} ProtocolCtrlBlock;

unsigned LOCAL_ID, NR_NODES, CREDIT_REFRESH;   /* runtime constants */
ProtocolCtrlBlock pcbtab[MAX_NODES];           /* per-node protocol state */
unsigned nr_pkts_received;

Fig. 4.1. Protocol data structures for UCAST.

4.1.2 Sender-Side Protocol

Figure 4.2 shows the reliability protocol executed by each sending NI. All protocol code is event-driven; this part of the protocol is activated in response to unicast events which are generated when the NI's host launches a packet. How LFC generates this event was discussed in Chapter 3.

Each unicast event is dispatched to event_unicast(). This routine creates a send descriptor and passes this descriptor to credit_send(), which checks if a send credit is available for the destination specified in the send descriptor. If this is the case, the packet is transmitted; otherwise the send descriptor is queued on the blocked-sends queue for the destination NI. It will remain queued until the destination NI has returned a sufficient number of send credits.

Packets are transmitted by send_packet(). This routine is also used to return send credits to the destination NI. In the protocol control block of each NI we count how many credits can be returned to that NI; send_packet() copies the counter into the credits field of the packet header and then resets the counter. This way, any packet travelling to an NI will automatically return any available send credits to that NI.

Send packets can be returned to the NI send buffer pool as soon as transmission has completed. We do not show send packet deallocation, because it does not trigger any significant protocol actions.

4.1.3 Receiver-Side Protocol

The receiver side of the reliability protocol is shown in Figure 4.3. This part of the protocol handles two events: packet arrival and packet release. Incoming packets are dispatched to event_packet_received(). This routine accepts any new send credits that the sender may have returned and then processes the packet. If the packet is a UNICAST data packet, it is delivered to the host. In this protocol description, we are not concerned with the details of NI-to-host delivery. CREDIT packets serve only to send credits to an NI; they do not require any further processing and are discarded immediately.

void event_unicast(unsigned dest, Packet *pkt)
{
    SendDesc *desc = alloc_send_desc();

    pkt->hdr.tag = UNICAST;
    desc->dest   = dest;
    desc->packet = pkt;
    credit_send(desc);
}

void credit_send(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];

    if (pcb->send_credits > 0) {
        pcb->send_credits--;                 /* consume a credit */
        send_packet(desc);
        return;
    }
    enqueue(&pcb->blocked_sends, desc);      /* wait for a credit */
}

void send_packet(SendDesc *desc)
{
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];
    Packet *pkt = desc->packet;

    pkt->hdr.src     = LOCAL_ID;
    pkt->hdr.credits = pcb->free_credits;    /* piggyback credits */
    pcb->free_credits = 0;

    transmit(desc->dest, pkt);
}

Fig. 4.2. Sender side of the UCAST protocol.

void event_packet_received(Packet *pkt)
{
    accept_new_credits(pkt);

    switch (pkt->hdr.tag) {
    case UNICAST:
        deliver_to_host(pkt);
        nr_pkts_received++;
        break;
    case CREDIT:
        break;
    }
}

void accept_new_credits(Packet *pkt)
{
    ProtocolCtrlBlock *pcb = &pcbtab[pkt->hdr.src];
    SendDesc *desc;

    /* Retrieve (piggybacked) credits and unblock blocked senders. */
    pcb->send_credits += pkt->hdr.credits;
    while (!queue_empty(&pcb->blocked_sends) && pcb->send_credits > 0) {
        desc = dequeue(&pcb->blocked_sends);
        credit_send(desc);
    }
}

void event_packet_released(Packet *pkt)
{
    return_credit_to(pkt->hdr.src);
}

void return_credit_to(unsigned sender)
{
    ProtocolCtrlBlock *pcb = &pcbtab[sender];

    pcb->free_credits++;
    if (pcb->free_credits >= CREDIT_REFRESH) {
        Packet credit_packet;
        SendDesc *desc = alloc_send_desc();

        credit_packet.hdr.tag = CREDIT;      /* send explicit credit packet */
        desc->dest   = sender;
        desc->packet = &credit_packet;
        send_packet(desc);
    }
}

Fig. 4.3. Receiver side of the UCAST protocol.

Routine accept_new_credits() adds the credits returned in packet pkt to the send credits for NI src. Next, this routine walks the blocked-sends queue to transmit to src as many blocked packets as possible. Since the blocked-sends queue is checked immediately when credits are returned, packets are always transmitted in FIFO order.

Although UCAST is not concerned with the exact way in which packets are delivered to the host, it needs to know when a packet buffer can be reused for new incoming packets. The send credit consumed by the packet's sender cannot be returned until the packet is available for reuse. We therefore require that a packet release event be generated when a receive packet is released. This event is dispatched to event_packet_released(), which returns a send credit to the packet's sender and then deallocates the packet. Credits are returned by return_credit_to(), which increments a counter (free_credits) in the protocol control block of the packet's sender. If this counter reaches the credit refresh threshold, return_credit_to() sends an explicit CREDIT packet to the sender. This happens only if data packets flow mainly in one direction. If there is sufficient return traffic, free_credits will never reach the refresh threshold, because send_packet() will already have piggybacked all free credits on an outgoing UNICAST packet.

UCAST is simple and efficient. Since the protocol preserves the reliability of the underlying hardware by avoiding receiver buffer overruns, senders need not buffer packets for retransmission, nor need they manage any timers. The main disadvantage of UCAST is that it statically partitions the available receive buffer space among all senders, so it needs more NI memory as the number of processors is increased. Chapter 8 studies this issue in more detail.

4.2 MCAST: Reliable Multicast Communication

Since we assume that multicast is not supported in hardware, we provide a software spanning-tree multicast protocol. This protocol, MCAST, implements a reliable and FIFO-ordered multicast. Before describing MCAST, we discuss different multicast forwarding strategies and the importance of multicast tree topology.

4.2.1 Multicast Forwarding

Most spanning-tree protocols use host-level forwarding. With host-level forwarding, the sender is the root of a multicast tree and sends a multicast packet to each of its children. Each child's NI passes the packet to its host, which reinjects the packet to forward it to the next level in the multicast tree (see Figure 4.4).

Fig. 4.4. Host-level (left) and interface-level (right) multicast forwarding.

Host-level forwarding has three drawbacks. First, since each reinjected packet was already available in NI memory, host-level forwarding results in an unnecessary host-to-NI data transfer at each internal tree node of the multicast tree, which wastes bus bandwidth and processor time. Second, no forwarding takes place unless the host processes incoming packets. If one host does not poll the network in a timely manner, all its children will be affected. Instead of relying on polling, the NI can raise an interrupt, but interrupts are expensive. Third, the critical sender-receiver path includes the host-NI interactions of all the nodes between the sender and the receiver. For each multicast packet, these interactions consist of copying the packet to the host, host processing, and reinjecting the packet.

MCAST addresses these problems by means of NI-level forwarding: multicast packets are forwarded by the NI without intervention by host processors.

4.2.2 Multicast Tree Topology

Multicast tree topology matters in two ways. First, different trees have different performance characteristics. Many different tree topologies have been described in the literature [30, 77, 80]. Deep trees (i.e., trees with a low fanout), for example, give good throughput, because internal nodes need to forward packets fewer times, so they use less time per multicast. Latency increases because more forwarding actions are needed before the last destination is reached. Depending on the type of multicast traffic generated by an application, one topology gives better performance than another. Our goal is not to invent or analyze a particular topology that is optimal in some sense. No topology is optimal under all circumstances. It is important, however, that a multicast protocol does not prohibit topologies that are appropriate for the application at hand.

Second, with MCAST's packet forwarding scheme (described later), some trees can induce buffer deadlocks. A multicast packet requires buffer space at all its destinations. This buffer space can be obtained before the packet is sent or it can be obtained more dynamically, as the packet travels from one node in the spanning tree to another. The first approach is conceptually simple, but requires a global protocol to reserve buffers at all destination nodes. The second approach, taken by MCAST, introduces the danger of buffer deadlocks. A sender may know that its multicast packet can reach the next node in the spanning tree, but does not know if the packet can travel further once it has reached that node. Section 4.3 gives an example of such a deadlock.

typedef enum { UNICAST, MULTICAST, CREDIT } PacketType;

typedef struct {
    ...                                /* as before */
    unsigned tree;                     /* identifies multicast tree */
} PacketHeader;

typedef struct {
    unsigned parent;
    unsigned nr_children;
    unsigned children[MAX_FORWARD];    /* MAX_FORWARD depends on tree size and shape */
} Tree;

Tree treetab[MAX_TREES];

Fig. 4.5. Protocol data structures for MCAST.

MCAST avoids buffer deadlocks by restricting the topology of multicast trees. An alternative approach is to detect deadlocks and recover from them. This approach is taken by RECOV, an extension of MCAST that allows the use of arbitrary multicast trees.

4.2.3 Protocol Data Structures

In MCAST, NIs multicast inside multicast groups that are created at initialization time; MCAST does not include a group join or leave protocol. Each member of a multicast group has its own spanning tree which it uses to forward multicast packets to the other group members.

Figure 4.5 shows MCAST's data structures. MCAST builds on UCAST, so we reuse and extend UCAST's data structures and routines. MCAST introduces a new packet type, MULTICAST, and extends the packet header with a tree field. Each process is given a unique index for each multicast group that it is a member of. This index identifies the process's multicast tree for that particular multicast group. When the process transmits a packet to the other members of that multicast group, it stores the tree identifier in the packet's tree field (see Figure 3.1). Each NI stores a multicast forwarding table (treetab) that is indexed by this tree field. The table entry for a given tree specifies how many children the NI has in that tree and lists those children.

4.2.4 Sender-Side Protocol

An NI initiates a multicast in response to multicast events, which are handled by event_multicast(). This routine marks the packet to be multicast as a MULTICAST packet, stores the multicast tree identifier in the packet header, and tries to transmit the packet to all first-level children in the multicast tree by calling forward_packet().

Routine forward_packet() looks up the table entry for the multicast tree that the packet was sent on and creates a send descriptor for each forwarding destination listed in the table entry. From then on, exactly the same procedure is followed as for a unicast packet. The packet will be transmitted to a forwarding destination only if credits are available for that destination; otherwise, the send descriptor is moved to the destination's blocked-sends queue.

4.2.5 Receiver-Side Protocol

Figure 4.7 shows the receiving side of MCAST. When a multicast packet is received, it is delivered to the host, just like a unicast packet. In addition, the packet is forwarded to all of the receiving NI's children in the multicast tree. This is done using the same forwarding routine described above (forward_packet()).

An NI receive buffer that holds a multicast packet is released when it has been delivered to the host and when it has been forwarded to all children in the multicast tree. At that point, event_packet_released() is called. As before, this routine returns a send credit to the (immediate) sender of the packet. In the case of a multicast packet, the immediate sender is the receiving NI's parent in the multicast tree. We do not use the packet header's source field (src) to determine the packet's sender, because this field is overwritten during packet forwarding (by send_packet()).


void event_multicast(unsigned tree, Packet *pkt) {
    pkt->hdr.tag = MULTICAST;
    pkt->hdr.tree = tree;
    forward_packet(pkt);
}

void forward_packet(Packet *pkt) {
    Tree *tree = &treetab[pkt->hdr.tree];
    SendDesc *desc;
    unsigned i;

    for (i = 0; i < tree->nr_children; i++) {
        desc = alloc_send_desc();
        desc->dest = tree->children[i];
        desc->packet = pkt;
        credit_send(desc);
    }
}

Fig. 4.6. Sender-side protocol for MCAST.

void event_packet_received(Packet *pkt) {
    accept_new_credits(pkt);

    switch (pkt->hdr.tag) {
    case UNICAST:   ...; break;   /* as before */
    case CREDIT:    ...; break;   /* as before */
    case MULTICAST:
        deliver_to_host(pkt);
        nr_pkts_received++;
        forward_packet(pkt);
        break;
    }
}

void event_packet_released(Packet *pkt) {
    return_credit_to(treetab[pkt->hdr.tree].parent);
}

Fig. 4.7. Receive procedure for the multicast protocol.


Instead, we use the tree field to find this NI's parent; this field is never modified during packet forwarding.

To make this modified packet release routine work for unicast packets, we also define unicast trees. These trees consist of two nodes, a sender and a receiver; the sender is the parent of the receiver. One unicast tree is defined for each sender-receiver pair. UCAST is modified so that it writes the unicast tree identifier in each outgoing unicast packet. Unicast trees are not defined only to make packet release work smoothly, but are also important in the RECOV protocol.
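Unicast trees can be set up once, at initialization time. The fragment below is only a sketch of that bookkeeping: UTREE_BASE (the first identifier used for unicast trees), nr_nis (the number of NIs), and the routine names are assumptions, since the text does not show how LFC actually numbers its unicast trees.

/* Hypothetical setup of the two-node unicast trees on one NI. */
unsigned unicast_tree_id(unsigned sender, unsigned receiver)
{
    return UTREE_BASE + sender * nr_nis + receiver;
}

void init_unicast_trees(unsigned local_id)
{
    Tree *t;
    unsigned peer;

    for (peer = 0; peer < nr_nis; peer++) {
        if (peer == local_id)
            continue;

        /* Tree in which the local NI is the sender (the parent). */
        t = &treetab[unicast_tree_id(local_id, peer)];
        t->parent = local_id;
        t->nr_children = 1;
        t->children[0] = peer;

        /* Tree in which the local NI is the receiver (a leaf). */
        t = &treetab[unicast_tree_id(peer, local_id)];
        t->parent = peer;
        t->nr_children = 0;
    }
}

With entries like these in treetab, event_packet_released() (Figure 4.7) finds the correct parent for unicast packets without any special-casing.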

4.2.6 Deadlock

MCAST’s simplicity resultsfrom building on reliablecommunicationchannelsbetweenNIs. Unfortunately, MCAST is susceptibleto deadlockif an arbitrarytopologyis used.Figure4.8illustratesadeadlockscenario.ThefigureshowsfourNIs,eachwith itsownsendandreceivepool. RecallthattheNI receivebufferpoolis partitioned.In thisscenario,eachprocessoris therootof adegeneratemulticasttree,a linear chain. Initially, eachNI ownsonesendcredit for eachdestinationNI. All processors(not shown in thefigure)have injecteda multicastpacket andeachmulticastpacket hasbeenforwardedto thenext NI in thesender’s multicastchain. That is, eachNI hasspentits sendcredit for the next NI in themulticastchain. Now every packet needsto be forwardedagainto reachthenext NI in itschain,but no packet canmake progress,becausethe target receive buffer on thenext NI is occupiedby anotherblockedpacket. EachNI hasfreereceive buffers,but UCAST’s flow controlschemedoesnot allow onesendingNI to useanothersendingNI’s receivebuffers.

This deadlock is the result of the specific multicast trees used to forward packets (linear chains). By default, LFC uses binary trees to forward multicast packets. Figure 4.9 shows the binary multicast tree for processor 0 in a 16-node system. The multicast tree for processor p is obtained by renumbering each tree node n in the multicast tree of processor 0 to (n + p) mod P, where P is the number of processors. With these binary trees, it is impossible to construct a deadlock cycle (see Appendix A).
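For concreteness, the renumbering can be expressed as a small routine that computes the children of an NI in sender p's tree. This is only a sketch, under the assumption that processor 0's tree uses the heap-style numbering suggested by Figure 4.9 (node n has children 2n+1 and 2n+2); the routine is not part of LFC's listed code.

/* Children of NI `me' in sender p's binary multicast tree.
 * Returns the number of children (0, 1, or 2).
 */
unsigned binary_tree_children(unsigned me, unsigned p, unsigned P,
                              unsigned children[2])
{
    unsigned n = (me + P - p) % P;   /* position of `me' in processor 0's tree */
    unsigned nr = 0;

    if (2 * n + 1 < P)
        children[nr++] = (2 * n + 1 + p) % P;
    if (2 * n + 2 < P)
        children[nr++] = (2 * n + 2 + p) % P;
    return nr;
}

For p = 0 and P = 16 this reproduces the tree of Figure 4.9; for any other sender it yields the same shape with all node numbers shifted by p modulo P.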

To avoid buffer deadlocks, MCAST imposes two restrictions:

1. Multicast groups may not overlap.

2. Not all multicast tree topologies are valid.

The first restriction implies that we cannot use broadcast and multicast at the same time, because the broadcast group overlaps with every multicast group.


[Figure: network interfaces 0-3, each with a send pool and a partitioned receive pool, blocked in a cycle.]

Fig. 4.8. MCAST deadlock scenario.

[Figure: a binary tree over nodes 0-15, rooted at node 0.]

Fig. 4.9. A binary tree.


The reasons for these restrictions and a precise characterization of valid multicast trees are given in Appendix A.

4.3 RECOV: Deadlock Detection and Recovery

MCAST’srestrictionsonmulticasttreetopologyarepotentiallycumbersome,be-causesometopologiesthat are not deadlock-freeare more efficient than othercombinations[80]. For large messages,for example,linear chainsgive the bestthroughput,but asillustratedin theprevioussection,chainsarenotdeadlock-free.Several sophisticatedmulticast protocolsbuild treesthat containshort chains.Thesetreesarenotdeadlock-freeeither, yetwewould liketo usethemif deadlockis unlikely.

To solve this problem, we have extended MCAST with a deadlock recovery mechanism. The extended protocol, RECOV, switches to a slower, but deadlock-free protocol when a deadlock is suspected. In this section, we first give a high-level overview of RECOV's deadlock detection and recovery protocol. Next, we give a detailed description of the protocol.

4.3.1 Deadlock Detection

RECOV uses the feedback from UCAST's flow control scheme to detect potential deadlocks. In the case of a buffer deadlock, each NI in the deadlock cycle has consumed all send credits for the next NI in the cycle. In RECOV, an NI therefore signals a potential deadlock when it has run out of send credits for a destination that it needs to send a packet to. That is, nonempty blocked-sends queues signal deadlock. With this trigger mechanism an NI can signal a potential deadlock unnecessarily, because an NI may run out of send credits even if there is no deadlock. The advantage of this detection mechanism, however, is that NIs need only local information (the state of their blocked-sends queues) to detect potential deadlocks. No communication is needed to detect a deadlock, which keeps the detection algorithm simple and efficient.

Since a deadlock need not involve all NIs, each NI must monitor all its blocked-sends queues; progress on some, but not all, queues does not guarantee freedom of deadlock. Checking for a nonempty blocked-sends queue is efficient, but may signal deadlock sooner than a timeout-based mechanism.
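In LFC the trigger is folded into credit_send() (Figure 4.13), which starts a recovery as soon as a send has to be queued; the standalone predicate below merely illustrates the local condition that is being monitored. queue_empty() is used as in Figure 4.15, but the loop bound nr_nis is an assumed variable holding the number of NIs.

/* Sketch of the local deadlock trigger: a potential deadlock is signalled
 * as soon as some blocked-sends queue is nonempty. No communication with
 * other NIs is needed.
 */
int potential_deadlock(void)
{
    unsigned dest;

    for (dest = 0; dest < nr_nis; dest++) {
        if (!queue_empty(&pcbtab[dest].blocked_sends))
            return TRUE;
    }
    return FALSE;
}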

When an NI has signaled a potential deadlock, it initiates a recovery action.


No new recoveries may be started by that NI while a previous recovery is still in progress. When a recovery terminates, the NI that initiated it will search for other nonempty blocked-sends queues and start a new recovery if it finds one. To avoid livelock, the blocked-sends queues are served in a round-robin fashion.

Multiple NIs can detect and recover from a deadlock simultaneously. Simultaneous recoveries do not interfere, but can be inefficient. In a deadlock cycle, only one NI needs to use the slower escape protocol to guarantee forward progress. With simultaneous recoveries, all NIs that detect deadlock will start using the slower escape protocol [96].

4.3.2 Deadlock Recovery

To ensure forward progress when a deadlock is suspected, RECOV requires P − 1 extra buffers on each NI, where P is the number of NIs used by the application. Each NI N owns one buffer on every other NI. These buffers form a dedicated recovery channel for N. Communication on this recovery channel can result only from a recovery initiated by N.

When an NI suspects deadlock, it will use its recovery channel to make progress on one multicast or unicast tree. (Recall that we have defined both unicast and multicast trees, so that every data packet travels along some tree.) Other NIs will use this channel only when they receive a packet that was transmitted on the channel. Specifically, when an NI receives, on recovery channel C, a packet p transmitted along tree T, then that NI may forward one packet to each of its children in T. The forwarded packets must also belong to T. This guarantees that we cannot construct a cycle within a recovery channel, because all communication links used in the recovery channel form a subtree of T.

Once an NI has initiated a recovery on its recovery channel, it may not initiate another recovery until the recovery channel is completely clear. The recovery channel is cleared bottom-up. When a leaf in the recovery tree releases its escape buffer (because it has delivered the packet to its host), it sends a special acknowledgement to its parent in the recovery tree. When the parent has released its escape buffer and when it has received acknowledgements from all its children in the recovery tree, then it will propagate an acknowledgement to its parent. When the root of the recovery tree receives an acknowledgement, the recovery channel is clear.

Each NI can initiate at most one recovery at a time, but different NIs can start a recovery simultaneously, even in a single multicast tree. Such simultaneous recoveries do not interfere with each other, because they use different recovery channels: each recovery uses the recovery channel owned by the NI that initiates the recovery.

[Figure: part of a spanning tree containing NIs R, D, E, and F; R is the root of the recovery tree.]

Fig. 4.10. Subtree covered by the deadlock recovery algorithm.

The protocol’s operationis illustratedin Figure4.10. In this figure,verticesrepresentNIs and the edgesshow how theseNIs are organizedin a particularspanningtree.Dashededgesbetweenaparentandachild indicatethattheparenthasno sendcreditsleft for thechild.

NI R triggers a recovery because it has no send credits for its child D. R selects a blocked packet p at the head of one of its blocked-sends queues and determines the tree T that this packet is being forwarded on. R now becomes the root of a recovery tree and will use its recovery channel to forward p. R marks p and sends p to D. D then becomes part of the recovery tree. If D has any children for which it has no send credits, then D may further extend the recovery tree by using R's recovery channel to forward p. If p can be forwarded to an NI using the normal credit mechanism (e.g., to F), then D will do so, and the recovery tree will not include that NI. (If F cannot forward p to its child, then F will initiate a recovery on its own recovery channel.)

4.3.3 Protocol Data Structures

We now present the RECOV protocol as an extension of MCAST. Figure 4.11 shows the new and modified data structures and variables used by RECOV. A new type of packet, CHANNEL_CLEAR, is used to clear a recovery channel. CHANNEL_CLEAR packets travel from the leaves to the root of a recovery tree.

The packet header contains two new fields. (In reality, LFC combines these two fields in a single deadlock channel field; see Section 3.4.)


typedef enum { UNICAST, MULTICAST, CREDIT, CHANNEL_CLEAR } PacketType;

typedef struct {
    ...                  /* as before */
    int deadlocked;      /* TRUE iff transmitted over recovery channel */
    unsigned channel;    /* identifies owner of the recovery channel */
} PacketHeader;

typedef struct {
    ...                  /* as before */
    int deadlocked;      /* saves deadlocked field of incoming packets */
    unsigned channel;    /* saves channel field of incoming packets */
} SendDesc;

typedef struct {
    ...                  /* as before */
    unsigned nr_acks;    /* need to receive this many recovery acks */
} ProtocolCtrlBlock;

typedef struct {
    ...                  /* as before */
    UnsignedQueue active_channels;   /* active recovery channels in this tree */
} Tree;

int recovery_active;     /* TRUE iff the local node initiated a recovery */

Fig. 4.11. Deadlock recovery data structures.


Field deadlocked is set to TRUE when a packet needs to use a recovery channel. The channel field identifies the recovery channel. This field contains a valid value for all UNICAST and MULTICAST packets that travel across a recovery channel (i.e., packets that have deadlocked set to TRUE) and for all CHANNEL_CLEAR packets. The same fields are added to send descriptors.

The protocol control block is extended with a counter (nr_acks) that counts how many CHANNEL_CLEAR packets remain to be received before a channel-clear acknowledgement can be propagated up the recovery tree.

The Tree data structure is extended with a queue of active recovery channels (active_channels). Each time an NI participates in a recovery on some recovery channel C for some tree T, it adds C to the queue of tree T.

4.3.4 Sender-Side Protocol

The sender side of RECOV consists of several modifications to the MCAST and UCAST routines. The modified multicast forwarding routine is shown in Figure 4.12. For each child, RECOV must know whether the packet to be forwarded arrived on a deadlock recovery channel. At the time the packet is forwarded to a specific child, we can no longer trust all information in the packet header, because this information may have been overwritten when the packet was forwarded to another child. Therefore, we have modified forward_packet() to save the deadlocked and channel fields of the incoming multicast packet in the send descriptors for the forwarding destinations.

The saved deadlocked field is used by credit_send() (see Figure 4.13). As before, this routine first tries to consume a send credit. If all send credits have been consumed, however, credit_send() no longer immediately enqueues the packet (or rather, the descriptor that references the packet) on a blocked-sends queue. Instead, we first check if the packet arrived on a recovery channel (using the saved deadlocked field). If this is the case, we forward a packet that belongs to the same tree on the same recovery channel. We cannot always forward the packet referenced by desc, because other packets may be waiting to travel to the same destination. If we bypass these packets and one of them arrived on the same tree as the bypassing packet, then we violate FIFO ordering. To prevent this, next_in_tree() (not shown) first appends desc to the blocked-sends queue for the packet's destination. Next, it removes from this queue the first packet that is travelling along the same tree as the packet referenced by desc and it returns the send descriptor for this packet. This packet is transmitted on the recovery channel.
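Routine next_in_tree() is not listed; the sketch below follows the description above. The traversal helpers queue_first(), queue_next(), and queue_remove() are assumptions about the queue interface, which is not shown in the figures.

/* Sketch of next_in_tree(): append desc to the blocked-sends queue of its
 * destination, then hand back the oldest blocked descriptor that travels
 * along the same tree, so that FIFO order within the tree is preserved.
 */
SendDesc *next_in_tree(ProtocolCtrlBlock *pcb, SendDesc *desc)
{
    SendDesc *d;

    enqueue(&pcb->blocked_sends, desc);

    for (d = queue_first(&pcb->blocked_sends); d != NULL; d = queue_next(d)) {
        if (d->packet->hdr.tree == desc->packet->hdr.tree) {
            queue_remove(&pcb->blocked_sends, d);
            return d;
        }
    }
    return desc;   /* not reached: desc itself matches its own tree */
}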


void forward_packet(Packet *pkt) {
    Tree *tree = &treetab[pkt->hdr.tree];
    int deadlocked = pkt->hdr.deadlocked;    /* save! */
    unsigned channel = pkt->hdr.channel;     /* save! */
    SendDesc *desc;
    unsigned i;

    for (i = 0; i < tree->nr_children; i++) {
        desc = alloc_send_desc();
        desc->dest = tree->children[i];
        desc->packet = pkt;
        desc->deadlocked = deadlocked;
        desc->channel = channel;
        credit_send(desc);
    }
}

Fig. 4.12. Multicast forwarding in the deadlock recovery protocol.

If the packet that credit_send() tries to transmit did not arrive on a deadlock recovery channel, then credit_send() will start a new recovery on this NI's recovery channel. However, it can do so only if it is not already working on an earlier recovery. If this NI has already initiated a recovery and this recovery is still active, then the packet is simply enqueued on its destination's blocked-sends queue. Otherwise the NI starts a new recovery on its own recovery channel.

For completeness, we also show the modified send_packet() routine. This routine now also copies the deadlocked field and the channel field into the packet header.

4.3.5 Receiver-Side Protocol

The receiver-side code of RECOV is shown in Figure 4.14 and Figure 4.15. Only a few modifications to event_packet_received() are needed.

First, special action must be taken when a packet arrives on a recovery channel. In that case, the receiving NI creates some state for the recovery that this packet is part of. In the protocol control block of the recovery channel's owner, the NI sets the channel-clear counter to 1. This counter is used when the recovery channel is cleared and indicates how many parties need to agree that the channel is clear before a channel-clear acknowledgement can be sent up the recovery tree. As long as the receiving NI does not extend the recovery tree, only it needs to say that the channel is clear, hence the counter is set to 1.


void credit_send(SendDesc *desc) {
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];

    if (pcb->send_credits > 0) {
        desc->deadlocked = FALSE;
        pcb->send_credits--;                /* consume a credit */
        send_packet(desc);
        return;
    }

    if (desc->deadlocked) {
        /* desc->packet arrived on a deadlock channel. We forward desc->packet,
         * or a predecessor in the same tree, on the same channel.
         */
        pcbtab[desc->channel].nr_acks++;
        desc = next_in_tree(pcb, desc);     /* preserve FIFOness in tree */
        send_packet(desc);                  /* no credit consumed by this send */
    }
    else if (recovery_active) {
        /* Cannot do more than one recovery at a time. */
        enqueue(&pcb->blocked_sends, desc);
    }
    else {
        /* Initiate recovery on my deadlock channel */
        recovery_active = TRUE;
        pcbtab[LOCAL_ID].nr_acks = 1;
        desc->deadlocked = TRUE;
        desc->channel = LOCAL_ID;           /* my recovery channel */
        send_packet(desc);                  /* no credit consumed by this send */
    }
}

void send_packet(SendDesc *desc) {
    ProtocolCtrlBlock *pcb = &pcbtab[desc->dest];
    Packet *pkt = desc->packet;

    pkt->hdr.src = LOCAL_ID;
    pkt->hdr.credits = pcb->free_credits;   /* piggyback credits */
    pkt->hdr.deadlocked = desc->deadlocked;
    pkt->hdr.channel = desc->channel;
    pcb->free_credits = 0;                  /* we returned all credits to dest */
    transmit(desc->dest, pkt);
}

Fig. 4.13. Packet transmission in the deadlock recovery protocol.


void event_packet_received(Packet *pkt) {
    accept_new_credits(pkt);

    if (pkt->hdr.deadlocked) {
        ProtocolCtrlBlock *channel_owner = &pcbtab[pkt->hdr.channel];
        Tree *tree = &treetab[pkt->hdr.tree];

        /* Register my state for this recovery in the pcb of the node that
         * initiated the recovery. Add the owner's id to a queue of ids
         * that is maintained for the tree that the packet belongs to.
         */
        channel_owner->nr_acks = 1;    /* one 'ack' for local delivery */
        enqueue(&tree->active_channels, pkt->hdr.channel);
    }

    switch (pkt->hdr.tag) {
    case UNICAST:   ...; break;    /* as before */
    case MULTICAST: ...; break;    /* as before */
    case CREDIT:    ...; break;    /* as before */
    case CHANNEL_CLEAR:
        release_recovery_channel(pkt->hdr.channel);
        break;
    }
}

Fig. 4.14. Receive procedure for the deadlock recovery protocol.


If, however, the recovery tree is extended by forwarding the packet on its recovery channel, then credit_send() will increase the counter to indicate that an additional channel-clear acknowledgement is needed from the child that the packet was forwarded to.

Routine event_packet_received() also stores the identity of the recovery channel's owner (the NI that initiated the recovery) on a queue associated with the tree that the incoming packet travelled on. This information is used when packets are released.

Second, we must process incoming channel-clear acknowledgements. This is done by release_recovery_channel() (see Figure 4.15). This routine propagates channel-clear acknowledgements up the recovery tree and detects recovery completion. If the channel-clear counter does not drop to zero, then this routine does nothing. If the counter drops to zero, there are two possibilities. If the channel-clear acknowledgement has reached the root of the recovery tree, then the recovery is complete. The owner of the recovery channel knows that the entire recovery channel is free and will try to initiate a new recovery on its recovery channel. This is done by try_new_recovery() (not shown), which scans all blocked-sends queues for a blocked packet. If such a queue is found, a new recovery is started. If the channel-clear acknowledgements have not yet reached the recovery channel's owner (the root of the recovery tree), then we create a new channel-clear acknowledgement and send it up the recovery tree.
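Routine try_new_recovery() is not listed either; the sketch below follows the description above and the round-robin rule of Section 4.3.1. The helpers queue_head() and queue_remove() and the variable nr_nis are assumptions. Re-submitting the descriptor through credit_send() either transmits it (if credits have arrived in the meantime) or initiates a new recovery, since recovery_active has just been reset.

/* Sketch of try_new_recovery(): scan the blocked-sends queues round-robin
 * and restart work on the first nonempty queue found.
 */
void try_new_recovery(void)
{
    static unsigned next_dest = 0;   /* round-robin scan position */
    unsigned i, dest;
    SendDesc *desc;

    for (i = 0; i < nr_nis; i++) {
        dest = (next_dest + i) % nr_nis;
        desc = queue_head(&pcbtab[dest].blocked_sends);
        if (desc != NULL) {
            next_dest = (dest + 1) % nr_nis;
            queue_remove(&pcbtab[dest].blocked_sends, desc);
            credit_send(desc);       /* sends, or initiates a new recovery */
            return;
        }
    }
}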

We have now described how a recovery is initiated, how a recovery tree is built, and how channel-clear acknowledgements travel back up the recovery tree, but we have not explained how the clearing of a recovery channel is initiated. This is done by the leaves of the recovery tree when they release a packet. The modified event_packet_released() first checks if this NI is participating in any recovery in the tree that the packet arrived on. If it is, then it will have stored the identities of the recovery channel owners in the active_channels queue associated with the tree. If this queue is not empty, event_packet_released() dequeues one channel owner and returns the packet to that owner's recovery channel. If, on the other hand, this NI is not participating in a recovery on the tree that the packet arrived on, then it will simply return a send credit to the packet's last sender (as before).


void release_recovery_channel(unsigned channel) {
    ProtocolCtrlBlock *channel_owner = &pcbtab[channel];

    channel_owner->nr_acks--;
    if (channel_owner->nr_acks > 0) return;   /* channel not clear yet */

    /* Below and at this tree node, the recovery channel is clear. */
    if (channel == LOCAL_ID) {
        /* Recovery completed; the entire recovery channel is clear. */
        recovery_active = FALSE;
        try_new_recovery();   /* search for new nonempty blocked-sends queue */
    }
    else {
        Packet deadlock_ack;
        SendDesc *desc = alloc_send_desc();

        /* Propagate channel-clear ack to parent */
        deadlock_ack.hdr.tag = CHANNEL_CLEAR;
        desc->dest = channel_owner->parent;
        desc->packet = &deadlock_ack;
        desc->deadlocked = FALSE;
        desc->channel = channel;
        send_packet(desc);
    }
}

void event_packet_released(Packet *pkt) {
    Tree *tree = &treetab[pkt->hdr.tree];

    if (queue_empty(&tree->active_channels)) {
        return_credit_to(tree->parent);
    }
    else {
        /* Do not return a credit, but try to clear a deadlock channel */
        unsigned channel = dequeue(&tree->active_channels);
        release_recovery_channel(channel);
    }
}

Fig. 4.15. Packet release procedure for the deadlock recovery protocol.


4.4 INTR: Control Transfer

Since interrupts are expensive, LFC tries to avoid generating interrupts unnecessarily by means of a polling watchdog [101]. This is a mechanism that delays interrupts in the hope that the target process will soon poll the network. The idea is to start a watchdog timer when a packet arrives at the NI. The NI generates an interrupt only if the host does not poll before the timer expires.

Where the original polling watchdog proposal by Maquelin et al. is a hardware design, LFC uses the programmable NI processor to implement a software polling watchdog. In addition, LFC refines the original design in the following way. When the watchdog timer expires, LFC does not immediately generate an interrupt if the host has not yet processed all packets delivered to it. Instead, the NI performs two additional checks to determine whether it should generate an interrupt: an interrupt status check and a client progress check.

The purpose of the interrupt status check is to avoid generating interrupts when the receiving process runs with network interrupts disabled. Recall that clients may dynamically disable and enable network interrupts. For efficiency, the interrupt status manipulation functions are implemented by toggling an interrupt status flag in host memory, without any communication or synchronization with the control program (see Section 3.7). Before sending an interrupt, the control program fetches (using a DMA transfer) the interrupt status flag from host memory. No interrupt is generated when the flag indicates that the client disabled interrupts.
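On the host, these manipulation functions reduce to updates of the status record that the NI later fetches with a DMA transfer (the HostStatus record of Figure 4.16). The sketch below is an illustration only; the function names lfc_intr_disable() and lfc_intr_enable() and the exact layout are assumptions, not LFC's published interface.

/* Host-side sketch: enabling or disabling network interrupts only toggles
 * a flag in host memory; the NI's control program reads it via DMA before
 * it decides to raise an interrupt.
 */
static volatile HostStatus host_status;   /* fetched by the NI with DMA */

void lfc_intr_disable(void)
{
    host_status.intr_disabled = 1;        /* no system call, no NI interaction */
}

void lfc_intr_enable(void)
{
    host_status.intr_disabled = 0;
}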

The goal of the client progress check is to avoid generating interrupts when the receiving process recently processed some (but not necessarily all) of the packets delivered to it. The host maintains a count of the number of packets that it has processed. When the watchdog timer expires, the control program fetches this counter. No interrupt is generated if the host processed at least one packet since the previous watchdog timeout.

INTR starts the watchdog timer each time a packet arrives and the timer is not already running on behalf of an earlier packet. Figure 4.16 shows how INTR handles polling watchdog timeouts. The NI fetches the host's interrupt flag and packet counter using a single DMA transfer; both are stored into host. The NI then performs the first part of the client progress check: if the client has processed all (nr_pkts_received) packets delivered by the NI, then the NI cancels the polling watchdog timer. In this case, the host processed all packets that were delivered to it and there is no need to keep the timer running. If this check fails, the NI performs both the interrupt status check and the second part of the client progress check. If the host recently disabled interrupts or if the host consumed some of the packets delivered to it, then the NI does not generate an interrupt, but restarts the polling watchdog timer.


typedef struct {
    unsigned nr_pkts_processed;
    int intr_disabled;
} HostStatus;

unsigned last_processed;

void event_watchdog_timeout(void) {
    HostStatus host;

    fetch_from_host(&host);
    if (host.nr_pkts_processed == nr_pkts_received) {
        timer_cancel();     /* client processed all packets */
    }
    else if (host.intr_disabled || host.nr_pkts_processed > last_processed) {
        timer_restart();    /* client disabled interrupts or polled recently */
    }
    else {
        send_interrupt_to_host();
        timer_restart();
    }
    last_processed = host.nr_pkts_processed;
}

Fig. 4.16. Timeout handling in INTR.

If all checks fail, the NI generates an interrupt and restarts the timer. Finally, the NI saves the host's packet counter in last_processed, so that on the next timeout it can check if the client made progress.

4.5 Related Work

LFC’sNI-level protocolsimplementreliability, interruptmanagement,andmulti-castforwarding.We discussrelatedwork in all threeareas.

4.5.1 Reliability

LFC assumes that the network hardware is reliable and preserves this reliability by means of its link-level flow control protocol (UCAST). Several other systems (BIP [118], Hamlyn [32], Illinois Fast Messages [112, 113], FM/MC [146], VMMC [24], and PM [141]) make the same hardware reliability assumption (see Figure 2.3).


These systems, however, implement flow control in a different way. With the exception of PM (see Section 2.6), none of these systems implements link-level flow control. Instead, they assume that the NI can copy data to the host sufficiently fast to prevent buffer overflows (see Section 1.6).

With the exception of FM/MC, none of the above systems supports a complete NI-level multicast. With NI-level forwarding, multicast packets occupy NI receive buffers for a longer time. As a result, the pressure on these buffers increases, and special measures are needed to prevent blocked-packet resets. FM/MC treats high NI receive buffer pressure as a rare event and swaps buffers to host memory when the problem occurs. For applications that multicast very heavily, however, swapping occurs frequently and degrades performance [15]. LFC is more conservative and uses NI-level flow control to solve the problem.

4.5.2 Interrupt Management

LFC supports both polling and interrupts as control-transfer mechanisms. Some communication systems provide only a polling primitive. The main problem with this approach is that many parallel-programming systems, notably DSM systems, cannot rely on polling by the application. Interrupts provide asynchronous notification, but are expensive on commercial architectures and operating systems: the time to field an interrupt often exceeds the latency of a small message. Research systems such as the Alewife [2] and the J-machine [132] support fast interrupts, but the mechanisms used in these machines have not found their way into commercial processor architectures. Even without hardware support, operating systems can, in principle, dispatch interrupts efficiently [94, 143]. The key idea is to save minimal processor state, and leave the rest (e.g., floating-point state) up to the application program. In practice, however, interrupt processing on commodity operating systems has remained expensive.

LFC does not attack the overhead of individual interrupts, but aims to reduce the interrupt frequency by optimistically delaying interrupts. Maquelin et al. proposed the polling watchdog, a hardware implementation of this idea [101]. We augmented the polling watchdog with the interrupt-status and client-progress checks described in Section 4.4. These two checks further reduce the number of spurious interrupts.


4.5.3 Multicast

Several researchers have proposed to use the NI instead of the host processor to forward multicast packets [56, 68, 80, 146]. Our NI-level protocol is original, however, in that it integrates unicast and multicast flow control and uses a deadlock-recovery scheme to avoid routing restrictions.

Verstoep et al. describe a system, FM/MC, that implements an NI-level multicast for Myrinet [146]. FM/MC runs on the same hardware as LFC, but implements buffer management and reliability in a very different way. Section 4.5.4 compares LFC and FM/MC in more detail.

Huang and McKinley propose to exploit ATM NIs to implement collective communication operations, including multicast [68]. In their symmetric broadcast protocol, NIs use an ATM multicast channel to forward messages to their children; multicast acknowledgements are also collected via the NIs. The sending host maintains a sliding window; it appears that a single window is used per broadcast group. LFC, in contrast, uses sliding windows between each pair of NIs. This allows LFC to integrate unicast and multicast flow control; Huang and McKinley do not discuss unicast flow control and present simulation results only.

Gerla et al. [56] also discuss using NIs for multicast packet forwarding. They propose a deadlock avoidance scheme that divides receive buffers in two classes. The first class is used when messages travel to higher-numbered NIs, the second when going to lower-numbered NIs. This scheme requires buffer resources per multicast tree, which is problematic in a system with many small multicast groups.

Kesavan and Panda studied optimal multicast tree shapes for systems with programmable NIs [80]. They describe two packet-forwarding schemes, first-child-first-served and first-packet-first-served (FPFS), and show that the latter performs best. The paper does not discuss flow control issues. MCAST integrates FPFS with UCAST's flow control scheme. MCAST usually forwards packets to all children as these packets arrive, just like in FPFS. When an NI has to forward a packet to a destination for which no send credits are available, however, the NI will queue that packet and start working on another packet.

All spanning-tree multicast protocols must deal with the possibility of deadlock. Deadlock is often avoided by means of a deadlock-free routing algorithm; the literature on such algorithms is abundant. Most research in this area, however, applies to routing at the hardware level. Also, as pointed out by Anjan and Pinkston [4], most protocols avoid deadlock by imposing routing restrictions [43]. This is the approach taken by MCAST. RECOV was inspired by the DISHA protocol for wormhole-routed networks [4]. DISHA, however, is a deadlock recovery scheme for unicast worms, whereas RECOV deals with buffer deadlocks in a store-and-forward multicast protocol.

[Figure: host-level and NI-level receive buffer pools in LFC and FM/MC; per-sender pools (P0, P1, P2) versus shared pools.]

Fig. 4.17. LFC's and FM/MC's buffering strategies.

4.5.4 FM/MC

FM/MC implements a reliable NI-level multicast for Myrinet, but does so in a very different way than LFC. The most important differences are related to buffer management and the reliability protocol.

Figure 4.17 shows how LFC and FM/MC allocate receive buffers. FM/MC uses a single queue of NI buffers for all inbound network packets. LFC, on the other hand, statically partitions its NI buffers among all senders: one sender cannot use another sender's receive buffers. At the host level, the situation has almost been reversed. FM/MC allocates a fixed number of unicast buffers per sender. These unicast buffers are managed by a standard sliding-window protocol that runs on the host. In addition, FM/MC has another, separate class of multicast buffers which are not partitioned statically among senders. LFC does not distinguish between unicast and multicast buffers and uses a single pool of host receive buffers.

FM/MC’s multicastbuffers areallocatedto sendersby a centralcredit man-agerthat runson oneof theNIs. A multicastcredit representsa buffer on everyreceiving host. Beforea processcanmulticasta message,it mustobtaincreditsfrom thecreditmanager. To avoid theoverheadof a credit request-replypair foreverymulticastpacket,hostscanrequestmultiplecreditsandcachethem.Creditsthathave beenconsumedarereturnedto thecreditmanagerby meansof a tokenthatrotatesamongall NIs.

At the NI level, no software flow control is present. Each NI contains a single receive queue for all inbound packets. Under conditions of high load, this queue fills up. FM/MC then moves part of the receive queue to host memory to make space for inbound packets. Eventually, senders will run out of credits and the memory pressure on receiving NIs will drop; packets can then be copied back to NI memory and processed further.


The idea is that this type of swapping to host memory will occur only under exceptional conditions, and it is assumed that swapping will not take too much time. Eventually, however, FM/MC relies on hardware back-pressure to stall sending NIs.

By default, FM/MC uses the same binary trees as LFC to forward multicast packets. The protocol, however, allows the use of arbitrary multicast trees, because the credit and swapping mechanisms avoid buffer deadlocks at the NI level. When NI buffers fill up, they are copied to host memory. The centralized credit mechanism guarantees that buffer space is available on the host. Since packets are not forwarded from host memory, there is no risk of buffer deadlock at the host level.

Both LFC and FM/MC have their strengths and weaknesses. An important advantage of LFC over FM/MC is its simplicity. A single flow control scheme is used for unicast and multicast traffic. There is no need to communicate with a separate credit manager. No request messages are needed to obtain credits: receivers know when senders are low on credits and send credit update messages without receiving any requests. Finally, since LFC implements NI-level flow control, it does not need to swap buffers back and forth between host and NI memory. The price we pay for this simplicity is the restriction on the tree topologies that can be used by MCAST. FM/MC can use arbitrary multicast trees.

LFC and FM/MC also differ in the way credits are obtained for multicast packets. LFC is optimistic in that a sender waits only for credits for its children in the multicast tree. In FM/MC, a sender waits until every receiver has space before sending the next multicast packet(s). On the other hand, host buffers are not as scarce a resource as NI buffers, so waiting in FM/MC may occur less frequently.

Both protocols have potential scalability problems. LFC partitions NI receive buffers among all senders. FM/MC, in contrast, allows NI buffers to be shared, which is attractive when the amount of NI memory is small and the number of nodes in the system large. Given a reasonable amount of memory on the NI and a modest number of nodes, however, LFC's partitioning of NI buffers poses no problems, and obviates the need for FM/MC's buffer swapping mechanism.

FM/MC has other scalability problems. First, all credit requests are processed by a single NI, which introduces a potential bottleneck. Second, to recover multicast credits, FM/MC uses a rotating token. The credit manager may have to wait for a complete token rotation until it can satisfy a request for credits.


4.6 Summary

This chapter has given a detailed description of LFC's NI-level protocols for reliable point-to-point communication, reliable multicast communication, and for interrupt management. The UCAST protocol provides reliable point-to-point communication between NIs. MCAST extends UCAST with multicast support, but is not deadlock-free for all multicast trees. RECOV extends MCAST with deadlock recovery and the resulting protocol is deadlock-free for all multicast trees. Finally, INTR is a refined software polling watchdog protocol. INTR delays the generation of network interrupts as much as possible by monitoring the behavior of the host.


Chapter 5

The Performance of LFC

This chapter evaluates the performance of LFC's unicast, fetch-and-add, and broadcast primitives using microbenchmarks. The performance of client systems and applications, which is what matters in the end, will be studied in subsequent chapters.

5.1 Unicast Performance

First we discuss the latency and throughput of LFC's point-to-point message-passing primitive, lfc_ucast_launch(). Next, we describe LFC's point-to-point performance in terms of the LogGP [3] parameters.

5.1.1 Latency and Throughput

Figure 5.1 shows LFC's one-way latency for different receive methods. All messages fit in a single packet and are received through polling. We used the following receive methods:

1. No-touch. The receiving process is notified of a packet's arrival (in host memory), but never reads the packet's contents. This type of receive behavior is frequently used in latency benchmarks and gives a good indication of the overheads imposed by LFC.

2. Read-only. The receiving process uses a for loop to read the packet's contents into registers (one 32-bit word per loop iteration), but does not write the packet's contents to memory. This type of behavior can be expected for small control messages, which can immediately be interpreted by the receiving processor.


[Figure: one-way latency (microseconds) versus message size (bytes) for the read-only, copy, and no-touch receive methods.]

Fig. 5.1. LFC's unicast latency.


3. Copy. The receiving process copies the packet's contents to memory. (This is done using an efficient string-move instruction.) This is the typical behavior for larger messages that need to be copied to a data structure in the receiving process's address space.

The one-way latency of an empty packet is 11.6 µs (for all receive strategies). This does not include the cost of packet allocation at the sending side and packet deallocation at the receiving side, because both can be performed off the critical path. Send packet allocation costs 0.4 µs, receive packet deallocation costs 0.7 µs.

As expected, the read-only and copy latencies are higher than the no-touch latencies, because the receiving process incurs cache misses when it reads the packet's contents. Surprisingly, however, the copy variant is faster than the read-only variant. This has two causes. First, the copy variant always copies incoming packets to the same, write-back cached destination buffer, so the writes that result from the copy do not generate any memory traffic (as long as the buffer fits in the cache). Second, the read-only variant accesses the data by means of a for loop, while the copy variant uses a fast string-copy instruction.

Figure 5.2 shows the one-way unicast throughput using the same three receive methods. (Note that the message size axis has a logarithmic scale.) The sawtooth shape of the curves is caused by fragmentation. Messages larger than 1 Kbyte are split into multiple packets.


If the last of these packets is not full, the overall efficiency of the message transfer decreases. This effect is strongest when the last fragment is nearly empty.

With the no-touch receive strategy, the maximum throughput is 72.0 Mbyte/s. Reading the data reduces the throughput significantly. For message sizes up to 128 Kbyte, copying the data is faster than just reading it, for the reasons described above (write-back caching and a fast copy instruction). For large messages, however, the throughput of the copy variant drops noticeably (e.g., from 63.2 Mbyte/s for 1 Kbyte messages to 33.2 Mbyte/s for 1 Mbyte messages). For messages up to 256 Kbyte, the receive buffer used by the copy variant ought to fit in the processor's second-level data cache. Since the L2 cache is physically indexed, however, some of the buffer's memory pages may map to the same cache lines as other pages. The exact page mappings vary from run to run, but for larger buffers, conflicts are more likely to occur. Conflicting page mappings result in conflict misses during the copy operation and these misses generate memory traffic. This traffic competes for bus and memory bandwidth with incoming network packets and reduces the overall throughput. Messages larger than 256 Kbyte no longer fit in the L2 cache. Cache misses are guaranteed to occur when such a message is copied.

Although LFC achieves good throughput, it is not able to saturate the hardware bottleneck, the I/O bus, which has a bandwidth of 127 Mbyte/s. The main reason is that LFC uses programmed I/O (with write combining) at the sending side. As shown in Figure 2.2, this data transfer mechanism cannot saturate the I/O bus. This problem can be alleviated by switching to DMA, but DMA introduces its own problems (see Section 2.2).

5.1.2 LogGP Parameters

While the throughput and latency graphs are useful, additional insight can be gained by measuring LFC's LogGP parameters. LogGP [3, 102] is an extension of the LogP model [40, 41], which characterizes a communication system in terms of four parameters: latency (L), overhead (o), gap (g), and the number of processors (P).

Latency is the time needed by a small message to propagate from one computer to another, measured from the moment the sending processor is no longer involved in the transmission to the moment the receiving processor can start processing the message. Latency includes the time consumed by the sending NI to transmit the message and the time consumed by the receiving NI to deliver the message to the receiving host. This definition of latency is different from the sender-to-receiver latency discussed above.


[Figure: one-way throughput (Mbyte/second) versus message size (bytes, logarithmic scale) for the no-touch, copy, and read-only receive methods.]

Fig. 5.2. LFC's unicast throughput.

The sender-to-receiver latency includes send and receive overhead (described below), which are not included in L.

Overhead is the time a processor spends on transmitting or receiving a message. We will make the usual distinction between send overhead (os) and receive overhead (or). Overhead does not include the time spent on the NI and on the wire. Unlike latency, overhead cannot be hidden.

The gap is the reciprocal of the system's small-message bandwidth. Successive packets sent between a given sender-receiver pair are always at least g microseconds apart from each other.

The LogP model assumes that messages carry at most a few words. Large messages, however, are common in many parallel programs and many communication systems can transfer a single large message more efficiently than many small messages. LogGP therefore extends LogP with a parameter, G, that captures the bandwidth for large messages. For LFC, we define G as the peak throughput under the no-touch receive strategy.

To measure the values of the LogGP parameters for LFC, we used benchmarks similar to those described by Iannello et al. [69]. The resulting values are given in Table 5.1. All values, except the value of G, were measured using messages with a payload of 16 bytes.

The sender-to-receiver latencies reported in Figure 5.1 are related to the LogGP parameters as follows:


LogGP parameter         Value
Send overhead (os)      1.6 µs
Receive overhead (or)   2.2 µs
Latency (L)             8.2 µs
Gap (g)                 5.6 µs
Big gap (G)             72.0 Mbyte/s

Table 5.1. Values of the LogGP parameters for LFC.

sender-to-receiver latency = os + L + or.

The sender-to-receiver latency of a 16-byte message is thus 1.6 + 8.2 + 2.2 = 12.0 µs, slightly more than the latency of a zero-byte message (11.6 µs). The largest part (L) of the sender-to-receiver latency is spent on the sending and receiving NIs; this part can in principle be overlapped with useful work on the host processor.

5.1.3 Interrupt Overhead

All previous measurements were performed with network interrupts disabled. We now briefly consider interrupt-driven message delivery. As explained in Section 3.7, LFC delays interrupts in the hope that the destination process will poll before the interrupt needs to be raised. The default delay is 70 µs. This value, approximately twice the interrupt overhead (see below), was determined after experimenting with some of the applications described in Chapter 8.

To measure the overhead of interrupt-driven message delivery, we set the delay to 0 µs and measure the null unicast latency in two different ways: first, using a program that does not poll for incoming messages and that uses interrupts to receive messages; second, using a program that does poll and that does not use interrupts. The difference in latency measured by these two programs stems from interrupt overhead. Using this method, we measure an interrupt overhead of 31 µs, which is more than twice the unicast null latency. This time includes context-switching from user mode to kernel mode, interrupt processing inside the kernel and the Myrinet device driver, dispatching the user-level signal handler, and returning from the signal handler (using the sigreturn() system call). Notice that the last overhead component, the signal handler return, need not be on the critical communication path: a message can be processed and a reply can be sent before returning from the signal handler.


[Figure: fetch-and-add latency (microseconds) versus number of processors; the inset shows latencies for up to five processors.]

Fig. 5.3. Fetch-and-add latencies under contention.

This large interrupt overhead is not uncommon for commercial operating systems. In user-level communication systems with low-latency communication, however, the high cost of using interrupts should clearly be avoided whenever possible. It is exactly this high cost that motivated the addition of a polling watchdog mechanism to LFC.

5.2 Fetch-and-Add Performance

A contention-free fetch-and-add operation on a variable in local NI memory costs 20.7 µs. Since we have not optimized this local case, an FA_REQUEST and an FA_REPLY message will be sent across the network. Each message travels to the switch to which the sending NI is attached; this switch then sends the message back to the same NI.

A contention-free F&A operation on a remote F&A variable costs 19.8 µs, which is slightly less expensive than the local case. To detect the F&A reply, the host that issued the operation polls a location in its NI's memory. In the local case, this polling steals memory cycles from the local NI processor which needs to process the F&A request and send the F&A reply (to itself). In the remote case, receiving the request, processing it, and sending the reply are all performed by a remote NI, without interference from a polling host.

Figure 5.3 shows how F&A latencies increase under contention. In this benchmark, each processor, except the processor where the F&A variable is stored, executes a tight loop in which an F&A operation is issued.


[Figure: broadcast latency (microseconds) versus message size (bytes) for 4, 8, 16, 32, and 64 processors.]

Fig. 5.4. LFC's broadcast latency.

With as few as three processors (see inset), the NI that services the F&A variable becomes a bottleneck and the latencies increase rapidly.

5.3 Multicast Performance

The multicast performance evaluation is organized as follows. First, we present latency, throughput, and forwarding overhead results for LFC's basic protocol. Next, we evaluate the deadlock recovery protocol under various conditions. In Chapter 8 we will also compare the performance of LFC's multicast protocol with protocols that use host-level forwarding.

5.3.1 Performance of the Basic Multicast Protocol

We first examine the latency, throughput, and forwarding overhead of LFC's basic multicast protocol. The basic protocol uses binary trees and therefore does not need the deadlock recovery mechanism. In the following measurements, the maximum packet size is 1 Kbyte and each sender is allocated 512/P send credits (rounded down), where P is the number of processors. All packets are received through polling. We use the copy receive strategy described in Section 5.1.

Figure 5.4 shows the broadcast latencies of the basic protocol. We define broadcast latency as the latency between the sender and the last receiver of the broadcast packet.


[Figure: broadcast throughput (Mbyte/second) versus message size (bytes, logarithmic scale) for 4, 8, 16, 32, and 64 processors.]

Fig. 5.5. LFC's broadcast throughput.

As expected, the latency increases linearly with increasing message size and logarithmically with increasing numbers of processors. With a 4-node configuration, we obtain a null latency of 22 µs; with 64 nodes, the null latency is 72 µs.

Figure 5.5 shows the throughput of the basic protocol. Throughput is measured using a straightforward blast test in which the sender transmits many messages of a given size and then waits for an (empty) acknowledgement message from all receivers. We report the throughput observed by the sender.

All curves decline for large messages. This is due to the same cache effect as described in Section 5.1. For messages that fit in the cache, the maximum multicast throughput (41 Mbyte/s for 4 processors) is less than the maximum unicast throughput (63 Mbyte/s) and decreases as the number of processors increases. These effects are due to the following throughput-limiting factors:

1. NI memory supports at most two memory accesses per NI clock cycle (33.33 MHz). Memory references are issued by the NI's three DMA engines and by the NI processor. None of these sources can issue more than one memory reference per clock cycle. As a result, the maximum packet receive and send rate is 127 Mbyte/s (33.33 MHz × 32 bits). This maximum can be obtained only with large packets; with LFC's 1 Kbyte packets, the maximum DMA speed is approximately 120 Mbyte/s.


[Figure: a multicast tree in which the root 0 has a single child 1, which forwards to nodes 2 through P−1.]

Fig. 5.6. Tree used to determine multicast gap.

2. The fanout of the multicast tree determines the number of forwarding transfers that an NI's send DMA engine must make. Since these forwarding transfers must be performed serially, the attainable throughput is proportional to the reciprocal of the fanout. Binary trees have a maximum fanout of two. The maximum throughput that can be obtained is therefore 120 / 2 = 60 Mbyte/s.

3. High throughput can only be obtained if the NI processor manages to keep its DMA engines busy.

4. Multicast packets travelling across different logical links in the multicast tree may contend for the same physical links. The amount of contention depends on the mapping of processes to processors. Since different mappings can give very different performance results, we used a simulated-annealing algorithm to obtain reasonable mappings for all single-sender throughput benchmarks. The resulting mappings give much better performance than random mappings, but are not necessarily optimal, because simulated annealing is not guaranteed to find a true optimum and because we did not take acknowledgement traffic into account.

The first three factors explain why we do not attain the unicast throughput, but not why throughput decreases when the number of processors is increased. Since the amount of forwarding work performed by the bottleneck nodes of the multicast tree (i.e., internal nodes with two children) is independent of the size of the multicast tree, we would expect that throughput is also independent of the size of the tree. In reality, adding more processors increases the demand for physical network links. The resulting contention leads to decreased throughput.


[Figure: multicast gap (microseconds) versus fanout, showing the measured multicast gap and a least-squares fit.]

Fig. 5.7. LFC's broadcast gap.

To estimate the amount of work performed by the NI to forward a multicast packet, we can measure the multicast gap g(P). This is the reciprocal of the multicast throughput on P processors for small (16-byte) messages. We assume that the multicast gap is proportional to the fanout F(P) of the bottleneck node(s) in the multicast tree. That is, we assume g(P) = α × F(P) + β, for some α and β. By varying F(P) and measuring the resulting multicast gaps g(P), we can find α and β. LFC's binary trees, however, have a fixed fanout of two. To vary F(P), we therefore use multicast trees that have the shape shown in Figure 5.6. In such trees, there is exactly one forwarding node and the fanout of the tree is F(P) = P − 2.

We measured the multicast gaps for various numbers of processors and used a least-squares fit to determine α and β from these measurements. The measured gaps and the resulting fit are shown in Figure 5.7; we found α = 4.6 µs per forward and β = 5.2 µs. The forwarding time per destination (α) is fairly large; this time is used to look up the forwarding destination, to allocate a send descriptor, to enqueue this descriptor, and to initiate the send DMA.
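The fit itself is ordinary simple linear regression. The routine below illustrates how α and β can be recovered from measured (fanout, gap) pairs; it is not the benchmark code used for Figure 5.7, and the name fit_gap_model() is made up.

#include <stddef.h>

/* Least-squares fit of gap = alpha * fanout + beta over n measurements,
 * using the closed-form solution for simple linear regression.
 */
void fit_gap_model(const double *fanout, const double *gap, size_t n,
                   double *alpha, double *beta)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sx  += fanout[i];
        sy  += gap[i];
        sxx += fanout[i] * fanout[i];
        sxy += fanout[i] * gap[i];
    }
    *alpha = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *beta  = (sy - *alpha * sx) / n;
}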

5.3.2 Impact of Deadlock Recovery and Multicast Tree Shape

We assume that most programs will rarely multicast in such a way that deadlock recovery is necessary. It is therefore important to know the overhead that the protocol imposes without recovery.

Enabling deadlock recovery does not increase the broadcast latency.


[Figure: two panels of broadcast throughput (Mbyte/second) versus message size (bytes), comparing the basic and recovery protocols on 8 and 16 processors (left) and on 32 and 64 processors (right).]

Fig. 5.8. Impact of deadlock recovery on broadcast throughput.

In the latency benchmark, the condition that triggers deadlock recovery (an NI has run out of send credits) is never satisfied, so the recovery code is never executed. Also, the test for the condition does not add any overhead, because it takes place both in the basic and in the enhanced protocol.

Throughput is affected by enabling deadlock recovery. Figure 5.8 shows single-sender throughput results for 8, 16, 32, and 64 processors, with and without deadlock recovery. For the recovery protocol measurements, we used almost the same configuration as for the basic protocol measurements shown in Figure 5.5. The only difference is that we enabled deadlock recovery and added P − 1 NI receive buffers on each NI (see Section 4.3). We use the same binary trees as before, so no true deadlocks can occur. Since our deadlock recovery scheme makes a local, conservative decision, however, it still signals deadlock.

Our measurements indicate that, with a few exceptions, all recoveries are initiated by the root of the multicast tree. The deadlocked subtrees that result from these recoveries are small. With 64 processors and 64 Kbyte messages, for example, each deadlock recovery results (on average) in 1.1 packets being sent via a deadlock recovery channel. In almost all cases, the deadlocked subtree consists of the root and one of its children. The root performs less work than its children, because it does not have to receive and deliver incoming multicast packets. Consequently, the root can send out multicast packets faster than its children can process them and therefore runs out of credits, which triggers a deadlock recovery.


[Figure: broadcast throughput (Mbyte/second) versus message size for chain and binary multicast trees on 64 processors.]

Fig. 5.9. Impact of tree topology on broadcast throughput.

For 8 processors, the performance impact of these recoveries is small, but for larger numbers of processors, the number of recoveries increases and the throughput decreases significantly. With 64 Kbyte messages, for example, the number of recoveries per multicast increases from 0.02 for 8 processors to 0.37 for 64 processors (on average). This increase in the number of recoveries and the decrease in throughput appear to be caused by increased contention on the physical network links (due to the larger number of processors). This contention delays acknowledgements, which causes senders to run out of credits more frequently. The problem is aggravated slightly by a fast path in the multicast code that tries to forward incoming multicast packets immediately. If this succeeds, the send channel is blocked by an outgoing data packet and cannot be used by an acknowledgement. Removing this optimization improves performance when deadlock recovery is enabled, but reduces performance when deadlock recovery is disabled.

Figure 5.9 shows the impact of tree shape on broadcast throughput. We compare binary trees and linear chains on 64 processors. We use a large number of processors, because that is when we expect the performance characteristics of different tree shapes to be most visible. In both cases, deadlock recovery was enabled, because with chains, deadlocks can occur when multiple senders multicast concurrently. With a single sender, however, deadlocks cannot occur, and we expect low-fanout trees to perform best. Figure 5.9 confirms this expectation. The linear chain (fanout 1) performs better than the binary tree (fanout 2).


[Figure: per-sender all-to-all throughput (Mbyte/second) versus message size for binary and chain trees on 64 processors.]

Fig. 5.10. Impact of deadlock recovery on all-to-all throughput.

To test the behavior of the deadlock recovery scheme in the case that deadlocks can occur, we performed an all-to-all benchmark in which all senders broadcast simultaneously. In this case, true deadlocks may occur for the chain. Figure 5.10 shows the per-sender throughput on 64 processors for the chain and the binary tree. Due to contention for network resources (links, buffers, and NI processor cycles), the per-sender throughput is much lower than in the single-sender case.

As expected, the chain performs worse than the binary tree. Table 5.2 reports several statistics that illustrate the behavior of the recovery protocol for both tree types. These statistics are for an all-to-all exchange of 64 Kbyte messages on 64 processors. In this table, Lused is the number of logical links between pairs of NIs that are used in the all-to-all test; Lav is the total number of logical links available (64 × 63). Lused/Lav indicates how well a particular tree spreads its packets over different links. RC is the number of deadlock recoveries; M is the number of multicasts. RC/M indicates how often a recovery occurs. D is the number of packets that are forced to travel through a slow deadlock recovery channel. D/RC + 1 is the average size of deadlocked subtrees. These statistics show that, with chains, every multicast triggers a deadlock recovery action. What is worse, these recoveries involve all 64 NIs, so each multicast packet uses a slow deadlock channel. With chains, this benchmark can use only a small fraction (0.02) of the available NI receive buffer space. In combination with the high communication load, this causes senders to run out of credits and triggers deadlock recoveries. With binary trees, which make better use of the available NI buffer space, deadlock recoveries


Tree shape    Lused/Lav    RC/M    D/RC + 1
Binary        0.51         0.38    3.3
Chain         0.02         1.00    63.9

Table 5.2. Deadlock recovery statistics. Lused — #NI-to-NI links used in this test; Lav — #NI-to-NI links available; RC — #deadlock recoveries; M — #multicasts; D — #packets that use a deadlock recovery channel.

occur less frequently and the size of the deadlocked subtrees is much smaller.

Summarizing, we find that deadlock recovery is triggered frequently, but that

its effect on performance is modest when the communication load is moderate and true deadlocks do not occur. To reduce the number of false alarms, a more refined, timeout-based mechanism could be used [96]. This mechanism does not signal deadlock immediately when a blocked-sends queue becomes nonempty, but starts a timer and waits for a timeout to occur. The timer is canceled when a send descriptor is removed from the blocked-sends queue.
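A sketch of such a timeout-based variant is given below. All names (the queue counters, the clock source, the timeout value, and the recovery entry point) are placeholders, not LFC’s actual interface; only the policy — arm a timer when the blocked-sends queue becomes nonempty, cancel it when a descriptor is dequeued, and signal deadlock only on expiration — follows the description above.

/* Hypothetical sketch of timeout-based deadlock signaling. */

#define DEADLOCK_TIMEOUT_US  1000        /* assumed timeout value */

static int blocked_sends = 0;            /* #descriptors in the blocked-sends queue */
static int timer_armed   = 0;
static unsigned long timer_deadline;

extern unsigned long current_time_us(void);   /* assumed clock source       */
extern void start_deadlock_recovery(void);    /* assumed recovery entry point */

static void
blocked_sends_enqueue(void)
{
    if (blocked_sends++ == 0) {          /* queue became nonempty: arm timer */
        timer_armed = 1;
        timer_deadline = current_time_us() + DEADLOCK_TIMEOUT_US;
    }
}

static void
blocked_sends_dequeue(void)
{
    if (--blocked_sends == 0)            /* queue drained: cancel timer */
        timer_armed = 0;
}

/* Called periodically, e.g., from the NI control program's main loop. */
static void
check_deadlock_timeout(void)
{
    if (timer_armed && current_time_us() >= timer_deadline) {
        timer_armed = 0;
        start_deadlock_recovery();       /* signal deadlock only after timeout */
    }
}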

5.4 Summary

In this chapter, we have analyzed the performance of LFC using microbenchmarks. LFC’s point-to-point performance is comparable to that of existing user-level communication systems and does not suffer much from the NI-level flow control protocol. Specifically, on the same hardware LFC achieves similar point-to-point latencies as Illinois Fast Messages (version 2.0), which uses a host-level variant of the same flow control protocol. Running the protocol between NIs, however, allows for a simple and efficient multicast implementation.

Due to our use of programmed I/O at the sending side, LFC cannot attain the maximum available throughput, but we believe that LFC’s throughput is still sufficiently high that this is not a problem in practice. At least one study suggests that parallel applications are more sensitive to send and receive overhead than to throughput [102].

LFC’s NI-level fetch-and-add implementation is not much faster than a host-level implementation would be, but has the advantage that it need never generate an interrupt.

LFC’s NI-level multicast also avoids unnecessary interrupts (during packet forwarding) and eliminates an unnecessary data transfer. The binary trees used by the basic multicast protocol give good latency and throughput. Client systems that wish to optimize for either latency or throughput can use other tree shapes if they use LFC’s extended multicast protocol, which includes a deadlock recovery mechanism. We showed that the extended protocol allows clients to obtain the advantages of a specific tree shape such as the chain.


Chapter 6

Panda

This chapter describes the design and implementation of the Panda communication system on top of LFC. Panda is a multithreaded communication library that provides flow-controlled point-to-point message passing, remote procedure call, and totally-ordered broadcast. Since 1993, versions of Panda have been used by implementations of the Orca shared object system [9, 19, 124], the MPI message-passing system, a parallel Java system [99], an implementation of Linda tuple spaces on MPI [33], a subset of the PVM message-passing system [124], the SR programming language [124], and a parallel search system [121].

Portability and efficiency have been the main goals of all Panda implementations. Panda has been ported to a variety of communication architectures. The first Panda implementation [19] was constructed on a cluster of workstations, using the UDP datagram protocol. Since then, versions of Panda have been ported to the Amoeba distributed operating system [111], to a transputer-based system [65], to MPI for the IBM SP/2, to active messages for the Thinking Machines CM-5, to Illinois Fast Messages for Myrinet clusters [10], and to LFC, also for Myrinet clusters. This chapter, however, focuses on a single Panda implementation: Panda 4.0 on LFC. Panda 4.0 differs substantially from the original system, which had no separate message-passing layer and had a different message interface [19].

The second goal, efficiency, conflicts with the first, portability. Whereas portability favors a modular, layered structure, efficiency dictates an integrated structure. In this chapter we demonstrate that Panda can be implemented efficiently (on LFC) without sacrificing portability. This efficiency results from Panda’s flexible internal structure, carefully designed interfaces (both in Panda and LFC), exploiting LFC’s communication mechanisms, and integrating multithreading and communication.


In this chapter, we show how Panda uses LFC’s packet-based interface, NI-level multicast and fetch-and-add, and polling watchdog to implement the following abstractions:

1. Asynchronous upcalls. Since many PPSs require asynchronous message delivery, Panda delivers all incoming messages by means of asynchronous upcalls. To avoid the use of interrupts for each incoming message, Panda transparently and dynamically switches between using polling and interrupts. To choose the most appropriate mechanism, Panda uses thread-scheduling information.

2. Stream messages. LFC’s packet-based communication interface allows a flexible implementation of stream messages [112]. Stream messages allow end-to-end pipelining of data transfers. Our implementation avoids unnecessary copying and decouples message notification and message consumption.

3. Totally-ordered broadcast. Panda provides a simple and efficient implementation of a totally-ordered broadcast primitive. The implementation builds on LFC’s multicast and fetch-and-add primitives.

The first section of this chapter gives an overview of Panda 4.0. It describes Panda’s functionality and its internal structure. Section 6.2 describes how Panda integrates communication and multithreading. Section 6.3 studies the implementation of Panda’s message abstraction. Section 6.4 describes Panda’s totally-ordered broadcast protocol. Section 6.5 reports on the performance of Panda on LFC and Section 6.6 discusses related work.

6.1 Overview

This section gives an overview of Panda’s functionality, describes the internal structure of Panda, and discusses the key performance issues in Panda’s implementation.

6.1.1 Functionality

Panda extends LFC’s low-level message-passing functionality in several ways. First, processes that communicate using Panda exchange messages of arbitrary size rather than packets with a maximum size.


Second, Panda implements demultiplexing. Each Panda module (described later) allows its clients to construct communication endpoints and to associate a handler function with each endpoint. Senders direct their messages to such endpoints. When a message arrives at an endpoint, Panda invokes the associated message handler and passes the message to this handler. LFC, in contrast, dispatches all network packets that arrive at a processor to a single packet handler.

Third, Panda supports multithreading. Panda clients can create threads and use them to structure a system or to overlap communication and computation.

Finally, Panda supports several high-level communication primitives, all of which operate on messages of arbitrary length. With one-way message passing, messages can be sent from one process to an endpoint in another process. Such messages can be received implicitly, by means of upcalls, or explicitly, by means of blocking downcalls. Remote procedure call, a well-known communication mechanism in distributed systems [23], allows a process to invoke a procedure (an upcall) in a remote process and wait for the result of the invocation. A totally-ordered broadcast is a broadcast with strong ordering guarantees. It has many applications in systems that need to maintain consistent replicas of shared data [75]. Specifically, a totally-ordered broadcast primitive guarantees that when n processes receive the same set of broadcast messages, then

1. All messages sent by the same process will be delivered to their destinations in the same order they were sent.

2. All messages (sent by any process) will be delivered in the same order to all n processes.

The first requirement (FIFOness) is also satisfied by LFC’s broadcast primitive, but the second is not.

Figure 6.1 illustrates the difference between a FIFO broadcast and a totally-ordered broadcast. Two processes A and B concurrently broadcast a message. The figure shows all four possible delivery orders. With a FIFO broadcast, all delivery orders are valid. With a totally-ordered broadcast, however, only the first two scenarios are possible. In the third and fourth scenario, processes A and B receive the two messages in different orders.

Both message passing and RPC are subject to flow control. If a receiving process does not consume incoming messages, Panda will stall the sender(s) of those messages. At present, no flow control is implemented for Panda’s totally-ordered broadcast. Scalable multicast flow control is difficult due to the large number of receivers that a sender needs to get feedback from.


[Figure: four possible delivery orders of two messages broadcast concurrently by processes A and B; only the first two orders are totally ordered.]

Fig. 6.1. Different broadcast delivery orders.

LFC’s current set of clients and applications, however, works fine without multicast flow control, for two reasons. First, due to application-level synchronization or due to the use of collective-communication operations, many applications have at most one broadcasting process at any time, which reduces the pressure on receiving processes. Second, a receiver that disables network interrupts and that does not poll, will eventually stall all nodes that try to send to it. If there is only a single broadcasting process, this back-pressure is sufficient to stall that process when needed. This is an all-or-nothing mechanism, though: either the receiver accepts incoming packets from all senders or it does not receive any packets at all.

Since Panda adds considerable functionality to LFC’s simple unicast and multicast primitives, the question arises why this functionality is not part of LFC itself. The answer is that not all client systems need this extra functionality. The CRL DSM system (discussed in Section 7.3), for example, needs little more than efficient point-to-point message passing and interrupt management. Adding extra functionality would introduce unnecessary overhead.

6.1.2 Structure

Panda consists of several modules, which can be configured at compile time to build a complete Panda system. Figure 6.2 illustrates the module structure. Each box represents a module and lists its main functions. The arrows represent dependencies between modules. A detailed description of the module interfaces can be found in the Panda documentation [62].


[Figure: Panda’s modules (system module, message-passing module, RPC module, broadcast module) with their main functions — threads, unicast and multicast, demultiplexing (endpoints and ports), protocol header stacks, messages of arbitrary length, reliable unicast and multicast, reliable remote procedure call, and reliable, totally ordered broadcast — and the dependencies between them.]

Fig. 6.2. Panda modules: functionality and dependencies.

The System Module

The most important module is the system module. For portability, Panda allows only the system module to access platform-specific functionality such as the native message-passing primitives. All other modules must be implemented using only the functionality provided by the system module.

The system module implements threads, endpoints, messages, and Panda’s low-level unicast and multicast primitives. The thread and message abstractions are used in all Panda modules and by Panda clients. The unicast and multicast functions, however, are mainly used by Panda’s message-passing and broadcast modules to implement higher-level communication primitives. Panda clients use these higher-level primitives.

One way to achieve both portability and efficiency would be to define only a high-level interface that every Panda system must implement. The fixed interface ensures that Panda clients will run on any platform that the interface has been implemented on. Also, if only the top-level interface is defined, the implementation of Panda can exploit platform-specific properties. For example, if the message-passing primitive of the target platform is reliable, then Panda need not buffer messages for retransmission.


The main problem with this single-interface approach is that Panda must be implemented from scratch for every target platform. In practice platforms vary in a relatively small number of ways, so implementing Panda from scratch for every platform would lead to code duplication. To allow code reuse, Panda hides platform-specific code in the system module and requires that all other modules be implemented in terms of the system module. These modules can thus be reused on other platforms. The system-module functions have a fixed signature (i.e., name, argument types, and return type), but the exact semantics of these functions may vary in a small number of predefined ways from one Panda implementation to another. The system module exports the particular semantics of its functions by means of compile-time constants that indicate which features or restrictions apply to the target platform. This way, the system module can convey the main properties of the underlying platform to higher-level modules. The properties exported by the system module do not have to match those of the underlying system. For some platforms, including LFC, the system module sometimes exports a stronger interface than provided by the underlying system.

Variation of the semantics is possible along the following dimensions:

1. Maximum packet size. The system module can export a maximum packet size and leave fragmentation and reassembly to higher-level modules, or it can accept and deliver messages of arbitrary size.

2. Reliability. The system module can export an unreliable or a reliable communication interface.

3. Message ordering. The system module can export FIFO or unordered communication primitives. The system module can also indicate whether or not its broadcast primitive is totally-ordered.

Panda’s system module for LFC implements fragmentation and reassembly, both for unicast and broadcast communication. (In this respect, the system module thus exports a stronger interface than LFC.) The system module preserves the reliability and FIFOness of LFC’s unicast and multicast primitives, and exports a reliable and FIFO interface to higher-level Panda modules. In addition, the system module implements a totally-ordered broadcast using LFC’s fetch-and-add and broadcast primitives and exports this totally-ordered broadcast to higher-level modules. With this configuration, the implementation of the higher-level modules is relatively simple. Fragmentation, reliability, and ordering are all implemented in lower layers.
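As an illustration, the exported semantics could take the form of a handful of compile-time constants such as the ones below; the constant names, values, and #if usage are hypothetical and do not reproduce Panda’s actual headers.

/* Hypothetical compile-time parameters exported by a Panda system module.
 * Higher-level modules test these constants to select code paths. The
 * values shown are those a system module on LFC could plausibly export. */

#define PAN_SYS_MAX_PACKET_SIZE  0   /* 0: arbitrary-size messages accepted */
#define PAN_SYS_RELIABLE         1   /* 1: reliable unicast and multicast   */
#define PAN_SYS_FIFO             1   /* 1: FIFO point-to-point ordering     */
#define PAN_SYS_TOTALLY_ORDERED  1   /* 1: broadcast is totally ordered     */

/* Example use in a higher-level module: */
#if PAN_SYS_RELIABLE
/* no retransmission buffers needed */
#else
/* buffer outgoing messages until acknowledged */
#endif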


High-Level Modules

We discuss three higher-level modules: the message-passing module, the RPC module, and the broadcast module. These modules can exploit the information conveyed by the system module’s (compile-time) parameters. This is the only information about the underlying system available to the higher-level modules. These modules can therefore be reused in Panda implementations for another platform, provided that the system module for that platform implements the same (or stronger) semantics. It is not necessary to build implementations of higher-level modules for every possible combination of system-module semantics. First, many combinations are unlikely. For example, no system module provides reliable broadcast and unreliable unicast. Second, we only need an implementation for the system module that exports the weakest semantics (i.e., unreliable and unordered communication). Such an implementation is suboptimal for system modules that offer stronger primitives, but it will function correctly and can serve as a starting point for optimization.

The message-passing module implements reliable point-to-point communication and an endpoint (demultiplexing) abstraction. Since LFC is reliable and Panda’s system module performs fragmentation and reassembly, the message-passing module need only implement demultiplexing, which amounts to adding a small header to every message.

The RPC module implements remote procedure calls on top of the message-passing module. An RPC is implemented as follows. The client creates a request message and transmits it to the server process using the message-passing module’s send primitive. The RPC module then blocks the client (using a condition variable). When the server receives the request, it invokes the request handler. This handler creates a reply message and transmits it to the client process. At the client side, the reply is dispatched to a reply handler (internal to the RPC module) which awakens the blocked client thread and which passes the reply message to the awakened client. Since the implementation is built upon the message-passing module’s reliable primitives, it is small and simple.
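The essence of this scheme can be sketched as follows. Only pan_mutex_lock() and pan_cond_wait() are names that appear elsewhere in this chapter; the remaining types, the mp_send() routine, and pan_cond_signal() are simplified assumptions rather than Panda’s actual RPC-module code.

/* Simplified RPC-over-message-passing sketch; not Panda's actual code. */

typedef struct pan_mutex *pan_mutex_p;
typedef struct pan_cond  *pan_cond_p;
typedef struct pan_msg   *pan_msg_p;

extern void pan_mutex_lock(pan_mutex_p m);     /* assumed signatures */
extern void pan_mutex_unlock(pan_mutex_p m);
extern void pan_cond_wait(pan_cond_p c, pan_mutex_p m);
extern void pan_cond_signal(pan_cond_p c);
extern void mp_send(int dest, int port, void *data, int len); /* message-passing module */

typedef struct rpc_call {
    pan_mutex_p lock;
    pan_cond_p  done;
    pan_msg_p   reply;      /* reply message, set by the reply handler */
} rpc_call_t;

/* Client side: send the request, then block until the reply handler runs. */
pan_msg_p
rpc(rpc_call_t *call, int server, int request_port, void *req, int len)
{
    pan_mutex_lock(call->lock);
    call->reply = NULL;
    mp_send(server, request_port, req, len);
    while (call->reply == NULL)
        pan_cond_wait(call->done, call->lock);   /* block the client thread */
    pan_mutex_unlock(call->lock);
    return call->reply;
}

/* Client side: upcall invoked by the message-passing module for replies;
 * it hands the reply to the waiting thread and wakes it up. */
void
reply_handler(rpc_call_t *call, pan_msg_p msg)
{
    pan_mutex_lock(call->lock);
    call->reply = msg;
    pan_cond_signal(call->done);
    pan_mutex_unlock(call->lock);
}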

The broadcast module implements a totally-ordered broadcast primitive. All broadcast messages are delivered in the same total order to all processes, including the sender. Since the system module already implements a totally-ordered broadcast—so that it can exploit LFC’s multicast and fetch-and-add—the implementation of the broadcast module is also very simple.


6.1.3 Panda’s Main Abstractions

Below we describe the abstractions and mechanisms that are used throughout the Panda system and by Panda clients. On LFC, most of these abstractions are implemented in the system module.

Threads

Panda’s multithreading interface includes the data types and functions that are found in most thread packages: thread create, thread join, scheduling priorities, locks, and condition variables. Since most thread packages provide similar mechanisms, Panda’s thread interface can usually be implemented using wrapper routines that invoke the routines supplied by the thread package available on the target platform (e.g., POSIX threads [110]). The LFC implementation of Panda’s thread interface uses a thread package called OpenThreads [63, 64]. Section 6.2 describes how Panda orchestrates the interactions between LFC and OpenThreads in an efficient way.

Addressing

The system module implements an addressing abstraction called Service Access Points (SAPs). Messages transmitted by the system module are addressed to a SAP in a particular destination process. Since Panda does not implement a name service, all SAPs must be created by all processes at initialization time. The creator of a SAP (a Panda client or one of Panda’s modules) associates a receive upcall with each SAP. When a message arrives at a SAP, this upcall is invoked with the message as an argument. Panda executes at most one upcall per SAP at a time. Upcalls of different SAPs may run concurrently.

Panda’s high-level modules use similar addressing and message delivery mechanisms as the system module. The message-passing module, for example, implements an endpoint abstraction called ports. On LFC, ports are implemented as system module endpoints (SAPs), but on other platforms there need not be a one-to-one correspondence between ports and SAPs. In the Panda 4.0 implementation on top of UDP, for instance, SAPs are implemented as UDP sockets, but ports are not. The reason for this difference is that communication to UDP sockets is unreliable, while communication to a port is reliable.

As to message delivery, almost all modules deliver messages by means of upcalls associated with communication endpoints (SAPs, ports, etc.). The RPC module is an exception, because RPC replies are delivered synchronously to the thread that sent the request. This thread is blocked until the reply arrives. RPC requests, on the other hand, are delivered by means of upcalls.

Message Abstractions

Panda provides two message abstractions: I/O vectors at the sending side and stream messages at the receiving side. Both abstractions are used by all Panda modules.

A sending process creates a list of pointers to buffers that it wishes to transmit and passes this list to a send routine. The buffer list is called an I/O vector. Panda’s low-level send routines (described below) gather the data referenced by the I/O vector by copying the data into network packets. The main advantage of a gather interface over interfaces that accept only a single buffer argument is that it does not force the client to copy data into a contiguous buffer before invoking the send routine (which will have to copy the message at least once more, to the NI).

At the receiving side, Panda uses stream messages. Stream messages were introduced in Illinois Fast Messages (version 2.0) [112]. A stream message is a message of arbitrary length that can be accessed only sequentially (i.e., like a stream). A stream message behaves like a TCP connection that contains exactly one message. This property allows receivers of a message to begin consuming that message before it has been fully received. Incoming data can be copied to its final destination in a pipelined fashion, which is more efficient than first reassembling the complete message before passing it to the client.

Panda consists of multiple protocol layers and Panda clients may add their own protocol layers. Each layer typically adds a header to every message that it transmits. To support efficient header manipulation, we borrow an idea from the x-kernel [116]. All protocol headers that need to be prepended to an outgoing message are stored in a single, contiguous buffer. By using a single buffer to store protocol headers, Panda avoids copying its clients’ I/O vectors just to prepend pointers to its own headers (which are small). Such buffers are called header stacks and are used both at the sending and the receiving side. Senders write (push) their headers to the header stack. Receivers read (pop) those headers in reverse order. Both pushing and popping headers are simple and efficient operations.

Figure 6.3 shows two Panda protocol stacks and the corresponding header stacks. Each protocol layer, whether internal or external to Panda, exports how much space in the header stack it and its descendants in the protocol graph need. Other protocols can retrieve this information and use it to determine where in the header stack buffer they must store their header.


[Figure: two Panda protocol stacks (client protocols on top of the RPC, message-passing, group, and system modules) and the corresponding header stacks, each a single contiguous buffer holding the layers’ headers and any unused space.]

Fig. 6.3. Two Panda header stacks.
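The push/pop discipline can be sketched as follows under the simplifying assumption of a strictly last-in, first-out layout; in Panda itself each layer computes its offset from the space requirements exported by its descendants, and the names below are not Panda’s.

/* Simplified header-stack helpers (illustrative only). The stack is one
 * contiguous buffer into which each layer pushes its header on the sending
 * side and from which headers are popped in reverse order on the receiving
 * side. */

typedef struct hdr_stack {
    char *buf;    /* contiguous header buffer                          */
    int   size;   /* total size, from the layers' space requirements   */
    int   top;    /* send side: next free byte; receive side: number   */
                  /*   of bytes still unread (initialized to used size) */
} hdr_stack_t;

/* Sender: reserve room for this layer's header and return a pointer to it. */
static void *
hdr_push(hdr_stack_t *s, int hdr_size)
{
    void *hdr = s->buf + s->top;
    s->top += hdr_size;           /* a real implementation would check s->size */
    return hdr;
}

/* Receiver: pop headers in the reverse order in which they were pushed. */
static void *
hdr_pop(hdr_stack_t *s, int hdr_size)
{
    s->top -= hdr_size;
    return s->buf + s->top;
}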

Send and Receive Interfaces

Table 6.1 gives the signatures of the system module’s communication routines. Pan_unicast() gathers and transmits a header stack (proto) and an I/O vector (iov) to a SAP (sap) in the destination process (dest). When the destination process receives the first packet of the sender’s message, it invokes the handler routine associated with the service access point parameter (sap). The handler routine takes a header stack and a stream message as its arguments. The receiving process can read the stream message’s data by passing the stream message to pan_msg_consume(). The header stack can be read directly.

Pan_msg_consume() consumes the next n bytes from the stream message and copies them to a client-specified buffer. If less than n bytes have arrived at the time that pan_msg_consume() is called, pan_msg_consume() blocks and polls until at least n bytes have arrived.

In many cases, the message handler reads all of a stream message’s data. If this is inconvenient, however, then the receiving process can store the stream message, return from the handler, and consume the data later. Storing the stream message consists of copying a pointer to the message data structure; the contents of the message need not be copied.

Other Panda modules have similar send and receive signatures.


void pan_unicast(unsigned dest, pan_sap_p sap, pan_iovec_p iov, int veclen, void *proto, int proto_size)

    Constructs a message out of all veclen buffers in I/O vector iov. Buffer proto, of length proto_size, is prepended to this message and is used to store headers. The message is transmitted to process dest. When dest receives the message, it invokes the upcall associated with sap, passing the message, its headers, and its sender as arguments.

void pan_multicast(pan_pset_p dests, pan_sap_p sap, pan_iovec_p iov, int veclen, void *proto, int proto_size)

    Behaves like pan_unicast(), but multicasts the I/O vector to all processes in dests, not just to a single process.

int pan_msg_consume(pan_msg_p msg, void *buf, unsigned size)

    Tries to copy the next size bytes from message msg to buf and returns the number of bytes that have been copied. This number equals size unless the client tried to consume more bytes than the stream contains. When the last byte has been consumed, the message is destroyed.

Table 6.1. Send and receive interfaces of Panda’s system module.


In all cases, the send routine accepts an I/O vector with buffers to transmit and the receive upcall delivers a stream message and a protocol header stack.
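The following sketch illustrates how these interfaces fit together. The pan_unicast() and pan_msg_consume() signatures are those of Table 6.1; the pan_iovec layout, the opaque type definitions, my_sap, the header format, and the upcall signature are assumptions made only for this example.

/* Usage sketch of the Table 6.1 interfaces (illustrative assumptions marked). */

typedef struct pan_iovec { void *data; int len; } pan_iovec_t, *pan_iovec_p; /* assumed layout */
typedef struct pan_sap *pan_sap_p;     /* assumed opaque types */
typedef struct pan_msg *pan_msg_p;

extern void pan_unicast(unsigned dest, pan_sap_p sap,
                        pan_iovec_p iov, int veclen, void *proto, int proto_size);
extern int  pan_msg_consume(pan_msg_p msg, void *buf, unsigned size);

extern pan_sap_p my_sap;               /* assumed: SAP created at initialization */

struct my_hdr { int tag; int length; };   /* assumed protocol header for this layer */

static void
send_example(unsigned dest, void *part1, int len1, void *part2, int len2)
{
    pan_iovec_t iov[2];
    struct my_hdr hdr;

    iov[0].data = part1; iov[0].len = len1;   /* gather: no copy into one buffer */
    iov[1].data = part2; iov[1].len = len2;
    hdr.tag = 42;
    hdr.length = len1 + len2;

    pan_unicast(dest, my_sap, iov, 2, &hdr, sizeof(hdr));
}

/* Upcall associated with my_sap; invoked when the first packet arrives.
 * Consumption may overlap with the arrival of later packets (pipelining). */
static void
receive_example(void *proto, pan_msg_p msg)
{
    struct my_hdr *hdr = (struct my_hdr *) proto;
    char chunk[1024];
    int left = hdr->length;

    while (left > 0) {
        int n = left < (int) sizeof(chunk) ? left : (int) sizeof(chunk);
        n = pan_msg_consume(msg, chunk, n);
        /* process chunk[0..n-1], e.g., copy it to its final destination */
        left -= n;
    }
}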

Upcalls

The system module delivers every incoming message by invoking the upcall routine associated with the message’s destination SAP. The upcall routine is usually a routine inside the higher-level module that created the SAP. In most cases, this routine passes the message to a higher-level layer (either the Panda client or another Panda module). This is done by invoking another upcall or by storing the message (e.g., an RPC reply) in a known location. In other cases, the message (e.g., an acknowledgement) serves only control purposes internal to the receiving module and will not be propagated to higher-level layers.

When the system module receives the first packet of a message, it schedules an upcall. Each SAP maintains a queue of pending upcalls, which are executed in order and one at a time. Clients and higher-level modules do not have precise control over the scheduling of upcalls. They must assume that every upcall is executed asynchronously and protect their global data structures accordingly.

An interesting question is whether an upcall should be viewed as an independent thread or as a subroutine of the running ‘computation’ thread. The former view is more natural, because the next incoming message may have nothing to do with the current activity of the running thread. Unfortunately, this view has some problems associated with it. These problems and their resolution in Panda are discussed in Section 6.2.

6.2 Integrating Multithreading and Communication

This section studies the interactions between LFC’s communication mechanisms and multithreading in Panda. Section 6.2.1 explains how Panda and other multithreaded clients can be implemented safely on LFC. Next, in Section 6.2.2 we show how the knowledge that is available in a thread scheduler can be exploited to reduce interrupt overhead. Finally, in Section 6.2.3 we discuss design options and implementation techniques for upcall models. The design of an upcall model is related to multithreading and efficient implementations of some upcall models require cooperation of the multithreading system. After discussing different design options, we describe Panda’s upcall model.


[Figure: timeline of a recursive LFC invocation, showing client computation, an LFC send (lfc_ucast_launch) that polls, and the client’s lfc_upcall() handling a message during the poll.]

Fig. 6.4. Recursive invocation of an LFC routine.

6.2.1 Multithread-Safe Access to LFC

LFC is not multithread-safe: concurrent invocations of LFC routines can corrupt LFC’s internal data structures. LFC does not use locks to protect its global data, because several LFC clients (e.g., CRL, TreadMarks, and MPI) do not use multithreading and because we are reluctant to make LFC dependent on specific thread packages. Nevertheless, Panda and other multithreaded systems can safely use LFC without modifications to LFC’s interface or implementation.

With single-threaded clients LFC has to be prepared for two types of concurrent invocation of LFC routines: recursive invocations and invocations executed by LFC’s signal handler. Recursive invocations occur (only) when an LFC routine drains the network (see Section 3.6). The routine that sends a packet, for example, will poll when no free send descriptors are available. During a poll, LFC may invoke lfc_upcall() to handle incoming network packets. This upcall is defined by the client and may invoke LFC routines, including the routine that polled. This scenario is illustrated in Figure 6.4. To prevent corruption of global state due to recursion, LFC always leaves all global data in a consistent state before polling.

When we allow network interrupts, LFC routines can be interrupted by an upcall at any point, which can easily result in data races. To avoid such data races, LFC logically¹ disables network interrupts whenever an LFC routine is entered that accesses global state.

The two measures described above are insufficient to deal with multithreaded

¹ As explained in Section 3.7, LFC does not truly disable network interrupts.


clients, because they do not prevent two client threads from concurrently entering LFC. To solve this problem, Panda’s system module employs a single lock to control access to LFC routines. Panda acquires this lock before invoking any LFC routine and releases the lock immediately after the routine returns.

The lock introduces a new problem: recursive invocations of an LFC routine will deadlock, because Panda does not allow a thread to acquire a lock multiple times (i.e., Panda does not implement recursive locks). To prevent this, Panda releases the lock upon entering lfc_upcall() and acquires it again just before returning from lfc_upcall(). This works, because all recursive invocations have lfc_upcall() in their call chain. When lfc_upcall() releases the lock, another thread can enter LFC. This is safe, because the polling thread that invoked lfc_upcall() always leaves LFC’s global variables in a consistent state.

With these measures, we can handle concurrent invocations from multiple Panda threads and recursive invocations by a single thread. There remains one problem: network interrupts. Network interrupts are processed by LFC’s signal handler (see Section 3.7). When invoked, this signal handler checks that interrupts are enabled (otherwise it returns) and then invokes lfc_poll() to check for pending packets. Lfc_poll(), in turn, may invoke lfc_upcall() and will do so before Panda has had a chance to acquire its lock. When this occurs, lfc_upcall() will release the lock even though the lock had never been acquired.

The problem is that the signal handler is another entry point into LFC that needs to be protected with a lock/unlock pair. To do that, Panda overrides LFC’s signal handler with its own signal handler. Panda’s signal handler acquires the lock, invokes LFC’s signal handler, and releases the lock. This closes the last atomicity hole.
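The bracketing just described can be summarized in the following sketch. Only lfc_upcall() and the existence of LFC’s signal handler come from the text; the lock type and the other routine names and signatures are simplified assumptions.

/* Sketch of Panda's lock bracketing around LFC (illustrative only). */

typedef struct pan_mutex *pan_mutex_p;          /* assumed lock type       */
extern void pan_mutex_lock(pan_mutex_p m);      /* assumed lock primitives */
extern void pan_mutex_unlock(pan_mutex_p m);

extern void lfc_send_packet(int dest, void *pkt, int len);  /* assumed LFC entry */
extern void lfc_signal_handler(int sig);                    /* assumed signature */
extern void panda_deliver(void *pkt, int len);              /* Panda demultiplexing */

static pan_mutex_p lfc_lock;   /* single lock guarding all LFC entries */

void
pan_send_packet(int dest, void *pkt, int len)
{
    pan_mutex_lock(lfc_lock);
    lfc_send_packet(dest, pkt, len);   /* may poll and invoke lfc_upcall() */
    pan_mutex_unlock(lfc_lock);
}

/* LFC dispatches every packet here; the lock is released so that recursive
 * invocations and other threads can enter LFC while the handler runs. */
void
lfc_upcall(void *pkt, int len)
{
    pan_mutex_unlock(lfc_lock);
    panda_deliver(pkt, len);
    pan_mutex_lock(lfc_lock);
}

/* Panda overrides LFC's signal handler with this locked wrapper. */
void
pan_signal_handler(int sig)
{
    pan_mutex_lock(lfc_lock);
    lfc_signal_handler(sig);           /* checks interrupt status, then polls */
    pan_mutex_unlock(lfc_lock);
}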

Summarizing, Panda achieves multithread-safe access to LFC through three measures:

1. LFC leaves its global variables in a consistent state before polling.

2. The client brackets calls to LFC routines with a lock/unlock pair to prevent multiple client threads from executing concurrently within LFC.

3. The client brackets invocations of LFC’s network signal handler with a lock/unlock pair.

Two properties of LFC make these measures work. First, LFC does not provide a blocking receive downcall, but dispatches all incoming packets to a user-supplied


upcall function. Systems like MPI, in contrast, provide a blocking receive downcall. If we bracket calls to such a blocking receive with a lock/unlock pair, we block entrance to the communication layer until a message is received. This causes deadlock if a thread blocked in a receive downcall waits for a message that can be received only if another thread is able to enter the communication system (e.g., to send a message). This type of blocking is different from the transient blocking that occurs when LFC has run out of some resource (e.g., send packets). The latter type of blocking is always resolved if all participating processes drain the communication network and release host receive packets.

Second, LFC dispatches all packets to a single upcall function, which allows lock management to be centralized. If the communication system would directly invoke user handlers, then it would be the responsibility of the user to get the locking right, or, alternatively, LFC would have to know about the locks used by the client system.

6.2.2 Transparently Switching between Interrupts and Polling

In Section 2.5 we discussed the main advantages and disadvantages of polling and interrupts. This section presents an approach, implemented in Panda and its underlying thread package OpenThreads, that combines the advantages of polling and interrupts. In this approach polling is used when the application is known to be idle. Interrupts are used to deliver messages to a process that is busy executing application code. Using the thread scheduler’s knowledge of the state of all threads, Panda dynamically (and transparently) switches between these two strategies. As a result, Panda never polls unnecessarily and we use interrupts only when the application is truly busy.

The techniques described in this section were initially developed on another user-level communication system (FM/MC [10]). The same techniques, however, are used on LFC. In addition, LFC supports the mixing of polling and interrupts by means of its polling watchdog.

Polling versus Interrupts

Choosing between polling and interrupts can be difficult. There are difficulties in two areas: ease of programming and performance cost. We consider two issues related to ease of programming: matching the polling rate to the message-arrival rate and concurrency control.

With polling, it is necessary to roughly match the polling rate to the message-arrival rate.


Unfortunately, many parallel programming systems cannot predict when messages arrive and when they need to be processed. These systems must either use interrupts or insert polls automatically. Since interrupts are expensive, a polling-based approach is potentially attractive. Automatic polling, however, has its own costs and problems.

Polls can be inserted statically, by a compiler, or dynamically, by a runtime system. A compiler must be conservative and may therefore insert far too many polling statements (e.g., at the beginning of every loop iteration). A runtime system can poll the network each time it is invoked, but this approach works only if the application frequently invokes the runtime system. Alternatively, the runtime system can create a background thread that regularly polls the network. This, however, requires that the thread scheduler implement a form of time-slicing, and introduces the overhead of switching to and from the background thread. Polling is also troublesome from a software-engineering perspective, because all code, including large standard libraries, may have to be processed to include polls.

The second issue is concurrency control. Unlike interrupts, using polls gives precise control over message-handler execution. Consequently, a single-threaded application that polls the network only when it is not in a critical section need not use locks or interrupt-status manipulation to protect its shared data. However, to exploit the synchronous nature of polling, one must know exactly where a poll may occur. Calling a library routine from within a critical section can be dangerous, unless it is guaranteed that this routine does not poll.

Quantifying the difference in cost between using interrupts and polling is difficult due to the large number of parameters involved: hardware (cache sizes, network adapters), operating system (interrupt handling), runtime support (thread packages, communication interfaces), and application (polling policy, message-arrival rate, communication patterns). The discussion below considers the base costs of polling and interrupts, and the relationship with the message-arrival rate.

First, executing a single poll is typically much cheaper than taking an interrupt, because a poll executes entirely in user space without any context switching (see Section 2.5). Dispatching an interrupt to user space on a commodity operating system, on the other hand, is expensive. The main reason is that software interrupts are typically used to signal exceptions like segmentation faults, events for which operating systems do not optimize [143]. In LFC, a successful poll costs 1.0 µs; dispatching an interrupt and a signal handler costs 31 µs.

Second, comparing the cost of a single poll to the cost of a single interrupt does not provide a sufficient basis for statements about application performance. A single interrupt can process many messages, so the cost of an interrupt can be


amortized over multiple messages. Also, each time a poll fails, the user program wastes a few cycles. Since matching the polling rate to the message-arrival rate can be hard, an application may either poll too often (and thus waste cycles) or poll too infrequently (and delay the delivery of incoming messages).

For Panda, the following observations apply. Since Panda is multithreaded, many Panda clients will already use locks to protect accesses to global data. Such clients will have no trouble when upcalls are scheduled preemptively and do not get any benefit from executing upcalls at known points in time. Moreover, several of Panda’s clients cannot predict when messages will arrive, yet need to respond to those messages in a timely manner. These two (ease-of-use) observations suggest an interrupt-driven approach. However, since we expect that the exclusive use of interrupts will lead to bad performance for clients that communicate frequently, polling should also be taken into consideration. Below, we describe how Panda tries to get the best of both worlds.

Exploiting the Thread Scheduler

Both polling and interrupts have their advantages and it is often beneficial to combine the two of them. LFC provides the mechanisms to switch dynamically between polling and interrupts: primitives to disable and enable interrupts, and a polling primitive. Using these primitives, a sophisticated programmer can get the advantages of both polling and interrupts. The CRL runtime system, for example, takes this approach [73]. Unfortunately, this approach is error-prone: the programmer may easily forget to poll or, worse, to disable interrupts.

To solve this problem, we have devised a simple strategy for switching between the two message extraction mechanisms. The key idea is to integrate thread management and communication. This integration is motivated by two observations. First, polling is always beneficial when the application is idle (i.e., when it waits for an incoming message). Second, when the application is active, polling may still be effective from a performance point of view, but inserting polls into computational code and finding the right polling rate can be tedious. Thus, our strategy is simple: we poll whenever the application is idle and use interrupts whenever runnable threads exist. To be able to poll when the application is idle, however, we must be able to detect this situation. This suggests that the decision to switch between polling and interrupts be taken in the thread scheduler.

We implemented this strategy in the following way. By default, Panda uses interrupts to receive messages.


[Figure: timeline of an RPC between processors A and B, showing application code, Panda/LFC communication code and upcalls, the OpenThreads idle/polling loop, fast thread switches, and the request and reply messages in flight.]

Fig. 6.5. RPC example.


When OpenThreads’s thread scheduler detects that all threads are blocked, however, it disables network interrupts and starts polling the network in a tight loop. A successful poll results in the invocation of lfc_upcall(). If the execution of the handler wakes up a local thread, then the scheduler will re-enable network interrupts and schedule the awakened thread. If an application thread is running when a message arrives, LFC will generate a network signal. Since LFC delays the generation of network interrupts, it is unlikely that interrupts are generated unnecessarily (see Section 3.7).
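The switching policy amounts to an idle loop of the following form; lfc_poll() and lfc_upcall() are LFC routines, while the interrupt-control and scheduler hooks shown are assumed names, not the actual OpenThreads interface.

/* Sketch of the idle-time polling strategy (illustrative names only). */

extern int  lfc_poll(void);               /* returns nonzero on a successful poll */
extern void lfc_intr_disable(void);       /* assumed names for LFC's interrupt   */
extern void lfc_intr_enable(void);        /*   enable/disable primitives          */
extern int  runnable_threads(void);       /* assumed OpenThreads query            */
extern void schedule_next_thread(void);   /* assumed OpenThreads dispatch         */

/* Invoked by the thread scheduler when no thread is runnable. */
static void
scheduler_idle_loop(void)
{
    lfc_intr_disable();                   /* idle: receive by polling             */
    while (!runnable_threads()) {
        lfc_poll();                       /* a successful poll runs lfc_upcall(), */
    }                                     /*   which may wake up a thread         */
    lfc_intr_enable();                    /* busy again: receive via interrupts   */
    schedule_next_thread();
}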

OpenThreads does not detect all types of blocking: a thread is considered blocked only if it blocks by invoking one of Panda’s thread synchronization primitives (pan_mutex_lock(), pan_cond_wait(), or pan_thread_join()). If a thread spin-waits on a memory location then this is not detected by OpenThreads. Also, OpenThreads is not aware of blocking system calls. If a system call blocks, it will block the entire process and not just the thread that invoked the system call. Some thread packages solve this problem by providing wrappers for blocking system calls. The wrapper invokes the asynchronous version of the system call and then invokes the scheduler, which can then block the calling thread. When a poll or a signal indicates that the system call has completed, the scheduler can reschedule the thread that made the system call.
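Such a wrapper could look roughly as follows. This is a generic illustration only: thread_block_on_fd() is a hypothetical scheduler hook, and, as noted above, OpenThreads itself does not provide these wrappers.

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

extern void thread_block_on_fd(int fd);   /* hypothetical: blocks only this thread;
                                             the scheduler resumes it when a poll or
                                             signal shows that fd has become ready */

/* Thread-friendly replacement for a blocking read(): use the non-blocking
 * variant and let the scheduler run other threads while waiting. */
ssize_t
thread_read(int fd, void *buf, size_t count)
{
    ssize_t n;

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    for (;;) {
        n = read(fd, buf, count);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;
        thread_block_on_fd(fd);
    }
}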

Figure 6.5 illustrates a typical RPC scenario. A thread on processor A issues an RPC to remote processor B, which also runs a computation thread. After sending its request, the sending thread blocks on a condition variable and Panda invokes the thread scheduler. The scheduler finds no other runnable threads, disables interrupts, and starts polling the network.

When the request arrives at processor B, it interrupts the running application thread. (Since processor B has an active thread, interrupts are enabled.) The request is then dispatched to the destination SAP’s message handler. This handler processes the request, sends a reply, and returns.

On processor A, the polling thread receives the reply, enters the message-passing library, and then signals the condition variable that the application thread blocked on. When the handler thread dies, the scheduler re-enables interrupts, because the application thread has become runnable again. Next, the scheduler switches to the application thread and starts to run it.

Summarizing, the behavior of the integrated threads and communication system is that synchronous communication, like the receipt of an RPC reply, is usually handled through polling. Asynchronous communication, on the other hand, is mostly handled through interrupts. This mixing of polling and interrupts combines nicely with LFC’s polling watchdog.


Concurrency              Context: procedure call    Context: thread
One upcall at a time     Active messages            Single-threaded upcalls
Multiple upcalls         Active messages            Popup threads

Table 6.2. Classification of upcall models.

6.2.3 An Efficient Upcall Implementation

In this section we discuss the design options for upcall implementations. Next, we discuss three upcall models that take different positions in the design space. The properties of these models have influenced the design of Panda’s upcall model, which is discussed toward the end of this section.

The two main design axes for upcall systems are upcall context and upcall concurrency. Upcalls can be invoked from different processing contexts. A simple approach is to invoke upcall routines from the thread that happens to be running when a message arrives. We call this the procedure call approach. The alternative approach, the thread approach, is to give each upcall its own execution context. This amounts to allocating a thread per incoming message. The exact context from which a system makes upcalls affects both the programming model and the efficiency of upcalls.

Upcall concurrency determines how many upcalls can be active concurrently in a single receiver process. The most restrictive policy allows at most one upcall to execute at any time; the most liberal strategy allows any number of upcalls to execute concurrently. Again, different choices lead to differences in the programming model and performance.

Table 6.2 summarizes the design axes and the different positions on these axes. The table also classifies three upcall models along these axes: active messages, single-threaded upcalls, and popup threads. The active-messages model [148] is restrictive in that it prohibits all blocking in message handlers. In particular, active-message handlers may not block on locks, which complicates programming significantly. On the other hand, highly efficient implementations of this model exist. Some of these implementations allow active-message handler invocations to nest, others do not. Popup threads [116] allow message handlers to block at any time. Popup threads are often slower than active messages, because conceptually a thread is created for each message. Finally, we consider single-threaded upcalls [14], which disallow blocking in some, but not all cases. Below, we compare the expressiveness of these models and the cost of implementing them.


int x, y;

void
handle_store(int *x_addr, int y_val, int z0, int z1)
{
    *x_addr = y_val;
}

void
handle_read(int src, int *x_addr, int z0, int z1)
{
    AM_send_4(src, handle_store, x_addr, y, 0, 0);
}

...
/* send read request to processor 5 */
AM_send_4(5, handle_read, my_cpu, &x, 0, 0);
...

Fig. 6.6. Simple remote read with active messages.

Active Messages

As explained in Section 2.8, an active message consists of the address of a handler function and a small number of data words (typically four words). When an active message is received, the handler address is extracted from the message and the handler is invoked; the data words are passed as arguments to the handler. For large messages, the active-messages model provides separate bulk-transfer primitives.

A typical use of active messages is shown in Figure 6.6. In this example, processor my_cpu sends an active message to read the value of variable y on processor 5. To send the message, we use the hypothetical routine AM_send_4(), which accepts a destination, a handler function, and four word-sized arguments. At the receiving processor, the handler function is applied to the arguments. In this case, the message contains the address of the handler handle_read(), the identity of the sender (my_cpu), and the address at which the result of the read should be stored (&x). Function handle_read() will be invoked on processor 5 with src set to my_cpu and x_addr set to &x.


The handler replies with another active message that contains the value of y. When this reply arrives at processor my_cpu, handle_store() is invoked to store the value in variable x. We assume that reading an integer (variable y) is an atomic operation.

Active-message implementations deliver performance close to that of the raw hardware. An important reason for this high performance is that active-message handlers do not have their own execution context. When an application thread polls or is interrupted by a network signal, the communication system invokes the handler(s) of incoming messages in the context of the application thread. That is, the handler’s stack frames are simply stacked on top of the application frames (see Figure 6.8(a)). No separate thread is created, which eliminates the cost of building a thread descriptor, allocating a stack, and a thread switch.

The lack of a dedicated thread stack makes active messages efficient, but makes them unattractive as a general-purpose communication mechanism for application programmers. Active-message handlers are run on the stacks of application threads. If an active-message handler blocks, then the application thread cannot be resumed, because part of its stack is occupied by the active-message handler (see Figure 6.8(a)). This type of blocking occurs when the application thread holds a resource (e.g., a lock) that is also needed by the active-message handler. Clearly, a deadlock is created if the active-message handler waits until the application thread releases the resource.

Similar problems arise when a handler waits for the arrival of another message. Consider for example the transmission of a large reply message by an upcall handler. If the message is sent by means of a flow-controlled protocol, then the send routine may block when its send window closes. The send window can be reopened only after an acknowledgement has been processed, which requires another upcall. If the active-messages system allows at most one upcall at a time, then we have a deadlock. If the system allows multiple upcalls, however, then the acknowledgement handler can be run on top of the blocked handler (i.e., as a nested upcall) and no deadlock occurs.

These problems are not insolvable. If it is necessary to suspend a handler, the programmer can explicitly save state in an auxiliary data structure, a continuation, and let the handler return. The state saved in the continuation can later be used by a local thread or another handler to resume the suspended computation. We assume that continuations are created manually by the programmer. Sometimes, though, it is possible to create continuations automatically. Automatic approaches either rely on a compiler to identify the state to be saved or save state conservatively. If a compiler recognizes potential suspension points in a program, then it can identify live variables and generate code to save live variables when the computation is


suspended. Blocking a thread is an example of conservative state saving: the saved state consists of the entire thread stack.
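As an illustration of manual state saving, a handler that cannot complete its work immediately might package its arguments into a continuation and return; in the sketch below, only AM_send_4() and handle_store() refer to the example of Figure 6.6, and all other names are hypothetical.

#include <stdlib.h>

/* Illustrative continuation for an active-message handler that hands out
 * work items; an AM handler may not block, so it saves its state instead. */

typedef struct continuation {
    int  src;        /* saved handler state: requesting processor          */
    int *ret_addr;   /* saved handler state: where the reply must be stored */
} continuation_t;

extern void enqueue_continuation(continuation_t *c);  /* assumed: run later  */
extern int  try_fetch_job(int *job_id);               /* nonzero on success  */

void
handle_work_request_am(int src, int *ret_addr, int z0, int z1)
{
    int job_id;

    if (try_fetch_job(&job_id)) {
        AM_send_4(src, handle_store, ret_addr, job_id, 0, 0);
    } else {
        continuation_t *c = malloc(sizeof(*c));
        c->src = src;
        c->ret_addr = ret_addr;
        enqueue_continuation(c);   /* resumed by whoever adds the next job */
    }
}

/* Run later by a local thread or another message handler. */
void
resume_work_request(continuation_t *c)
{
    int job_id;

    if (try_fetch_job(&job_id)) {
        AM_send_4(c->src, handle_store, c->ret_addr, job_id, 0, 0);
        free(c);
    } else {
        enqueue_continuation(c);   /* still nothing to hand out; retry later */
    }
}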

To select between these state-saving alternatives, the following tradeoffs must be considered. With manual state saving, no compiler is needed and application-specific knowledge can be exploited to minimize the amount of state saved. However, manual state saving can be tedious and error-prone. Compilers can remove this burden, but most communication systems are constructed as libraries rather than languages. Finally, blocking an entire thread is simple, but requires that every upcall be run in its own thread context, otherwise we will also suspend the computation on which the upcall has been stacked. This alternative is discussed below, in the section on popup threads.

In small and self-contained systems, continuations can be used effectively to solve the problem of blocking upcalls (see Section 7.1). On the other hand, continuations are too low-level and error-prone to be used by application programmers. In fact, the original active-message proposal [148] explicitly states that the active-message primitives are not designed to be high-level primitives. Active messages are used in implementations of several parallel programming systems. A good example is Split-C [39, 148], an extension of C that allows the programmer to use global pointers to remote words or arrays of data. If a global pointer to remote data is dereferenced, an active message is sent to retrieve the data.

Single-Threaded Upcalls

In the single-threaded upcall model [14], all messages sent to a particular process are processed by a single, dedicated thread in that process. We refer to this thread as the upcall thread. Also, in its basic form, the single-threaded upcall model allows at most one upcall to execute at a time.

Figure 6.7 shows how single-threaded upcalls can be used to access a distributed hash table. Such a table is often used in distributed game-tree searching to cache evaluation values of board positions that have already been analyzed. To look up an evaluation value in a remote part of the hash table, a process sends a handle_lookup message to the processor that holds the table entry. Since the table may be updated concurrently by local worker threads, and because an update involves modifying multiple words (a key and a value), each table entry is protected by a lock. In contrast with the active-messages model, the handler can safely block on this lock when it is held by some local thread. While the handler is blocked, though, no other messages can be processed. The single-threaded upcall model assumes that locks are used only to protect small critical sections so that


void
handle_lookup(int src, int key, int *ret_addr, int z0)
{
    int index;
    int val;

    index = hash(key);

    lock(table[index].lock);
    if (table[index].key == key) {
        val = table[index].value;
    } else {
        val = -1;
    }
    unlock(table[index].lock);

    AM_send_4(src, handle_reply, ret_addr, val, 0, 0);
}

Fig. 6.7. Message handler with locking.


pending messages will not be delayed excessively.

Allowing at most one upcall at a time restricts the set of actions that a

programmer can safely perform in the context of an upcall. Specifically, this policy does not allow an upcall to wait for the arrival of a second message, because this arrival can be detected only if the second message’s upcall executes. For example, if a message handler issues a remote procedure call to another processor, deadlock would ensue because the handler for the RPC’s reply message cannot be activated until the handler that sent the request returns. In practice, the single-threaded upcall model requires that message handlers create continuations or additional threads in cases where condition synchronization or synchronous communication is needed.

The difference between single-threaded upcalls and active messages is illustrated in Figure 6.8. With active messages, upcall handlers are stacked on the application’s execution stack. Some implementations allow upcall handlers to nest, others do not. Single-threaded upcall handlers, in contrast, do not run on the stack of an arbitrary thread; they are always run, one at a time, on the stack of the upcall thread. Executing message handlers in the context of this thread allows the handlers to block without blocking other threads on the same processor. In particular, the single-threaded upcall model allows message handlers to use locks to synchronize their shared-data accesses with the accesses of other threads. This is an important difference with active messages, where all blocking is disallowed. If an active-message handler blocked, it would occupy part of the stack of another thread (see Figure 6.8(a)), which then cannot be resumed safely.

The single-threaded upcall model has been implemented in several versions of Panda. The current version of Panda, Panda 4.0, uses a variant of the single-threaded upcall model that we will describe later.

Popup Threads

While the single-threaded upcall model is more expressive than the active-messages model, it is still a restrictive model because all messages are handled by a single thread. The popup-threads model [116], in contrast, allows multiple message handlers to be active concurrently. Each message is allocated its own thread context (see Figure 6.8(c)). As a result, each message handler can synchronize safely on locks and condition variables and issue (possibly synchronous) communication requests, just like any other thread.

Figure 6.9 illustrates the advantages of popup threads. In this example, the message handler handle_job_request() attempts to retrieve a job identifier (job_id) from a job queue.


Fig. 6.8. Three different upcall models: (a) active messages, (b) single-threaded upcalls, (c) popup threads.


void handle_job_request(int src, int *job_addr, int z0, int z1)
{
    int job_id;

    lock(queue_lock);
    while (is_empty(job_queue)) {
        wait(job_queue_nonempty, queue_lock);
    }
    job_id = fetch_job(job_queue);
    unlock(queue_lock);

    AM_send_4(src, handle_store, job_addr, job_id, 0, 0);
}

Fig. 6.9. Message handler with blocking.

When no job is available, the handler blocks on condition variable job_queue_nonempty and waits until a new job is added to the queue. While this is a natural way to express condition synchronization, it is prohibited in both the active-messages and the single-threaded upcall model. In the active-messages model, the handler is not even allowed to block on the queue lock. In the single-threaded upcall model, the handler can lock the queue, but is not allowed to wait until a job is added, because the new job may arrive in a message not yet processed. To add this new job, it is necessary to run a new message handler while the current handler is blocked. With single-threaded upcalls, this is impossible.
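To make this restriction concrete, the sketch below shows one way to restructure handle_job_request() for the single-threaded upcall model using a continuation. The continuation structure and the helper job_added() are invented for this illustration (they are not part of Panda or of Figure 6.9); lock(), unlock(), is_empty(), fetch_job(), and AM_send_4() are used as in Figure 6.9. Instead of blocking, the handler records the request and returns; whichever local thread later adds a job completes the pending requests.

    #include <stdlib.h>

    /* Hypothetical continuation record for a request that found the queue empty. */
    struct job_continuation {
        int src;                             /* requesting processor */
        int *job_addr;                       /* where the requester wants the job id */
        struct job_continuation *next;
    };

    static struct job_continuation *pending;     /* protected by queue_lock */

    void handle_job_request(int src, int *job_addr, int z0, int z1)
    {
        int job_id = 0, have_job = 0;

        lock(queue_lock);
        if (is_empty(job_queue)) {
            /* Cannot block in a single-threaded upcall: save a continuation. */
            struct job_continuation *c = malloc(sizeof(*c));
            c->src = src;
            c->job_addr = job_addr;
            c->next = pending;
            pending = c;
        } else {
            job_id = fetch_job(job_queue);
            have_job = 1;
        }
        unlock(queue_lock);

        if (have_job)
            AM_send_4(src, handle_store, job_addr, job_id, 0, 0);
    }

    /* Called by whichever local thread adds jobs to the queue. */
    void job_added(void)
    {
        for (;;) {
            struct job_continuation *c = NULL;
            int job_id = 0;

            lock(queue_lock);
            if (pending != NULL && !is_empty(job_queue)) {
                c = pending;
                pending = c->next;
                job_id = fetch_job(job_queue);
            }
            unlock(queue_lock);

            if (c == NULL)
                return;
            AM_send_4(c->src, handle_store, c->job_addr, job_id, 0, 0);
            free(c);
        }
    }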

Dispatching a popup thread need not be any more expensive than dispatching a single-threaded upcall [88]. Popup threads, however, have some hidden costs that do not immediately show up in microbenchmarks:

1. When many message handlers block, the number of threads in the system can become large, which wastes memory.

2. The number of runnable threads on a node may increase, which can lead to scheduling anomalies. In earlier experiments we observed a severe performance degradation for a search algorithm in which popup threads were used to service requests for work [88]. The scheduler repeatedly chose to run high-priority popup threads instead of the low-priority application thread that generated new work. As a result, many useless thread switches were executed. Druschel and Banga found similar scheduling problems in UNIX systems that process incoming network packets at top priority [47].

3. Because popup threads allow multiple message handlers to execute concurrently, the ordering properties of the underlying communication system may be lost. For example, if two messages are sent across a FIFO communication channel from one node to another, the receiving process will create two threads to process these messages. Since the thread scheduler can schedule these threads in any order, the FIFO property is lost. In general, only the programmer knows when the next message can safely be dispatched. Hence, if FIFO ordering is needed, it has to be implemented explicitly, for example by tagging messages with sequence numbers.

Several systems provide popup threads. Among these systems are Nexus [54], Horus [144], and the x-kernel [116]. These systems have all been used for a variety of parallel and distributed applications.

Panda’s Upcall Model

The first Panda system [19] implemented popup threads. To reduce the overhead of thread switching, however, later versions have used the single-threaded upcall model. Two kinds of thread switching overhead were eliminated:

1. In our early Panda implementations on Unix and Amoeba [139] all incoming messages were dispatched to a single thread, the network daemon, which passed each message to a popup thread. (These systems maintained a pool of popup threads to avoid thread creation costs on the critical path.) The use of single-threaded upcalls allowed the network daemon to process all messages without switching to a popup thread.

2. Multiple upcall threads could be blocked, waiting for an event that was to be generated by a local (computation) thread or a message. When the event occurred, all threads were awakened, even if the event allowed only one thread to continue; the other threads would put themselves back to sleep. It is frequently possible to avoid putting upcall threads to sleep. Instead, upcalls can create continuations which can be resumed by other threads without any thread switching. Once upcalls do not put themselves to sleep any more, they can be executed by a single-threaded upcall instead of popup threads.


In retrospect, the first type of overhead could have been eliminated without changing the programming model, by means of lazy thread-creation techniques such as described below. The second problem occurred in the implementation of Orca and is discussed in more detail in Section 7.1.4.

Panda 4.0 implements an upcall model that lies between active messages and single-threaded upcalls. The model is as follows. First, as in single-threaded upcalls, Panda clients can use locks in upcalls, but should not use condition-synchronization or synchronous communication primitives in upcalls.

Second, Panda clients must release their locks before invoking any Panda send or receive routine. This restriction is the main difference between Panda's upcall model and single-threaded upcalls. It sometimes allows the implementation to run upcalls on the current stack rather than on a separate stack.

Finally, Panda allows up to one upcall per endpoint to be active at any time. Panda clients must therefore be prepared to deal with concurrency between upcalls for different endpoints. In practice, this poses no problems, because programmers already have to deal with concurrency between upcalls and a computation thread or between multiple computation threads. Since Panda runs only upcalls for different endpoints concurrently, the order of messages sent to any single endpoint is preserved, so we avoid a disadvantage of popup threads.

Implementation of Panda’s Upcall Model

On LFC, we use an optimized implementation of Panda's upcall model. In many cases, this implementation completely avoids thread switching during message processing. The implementation distinguishes between synchronous and asynchronous upcalls. Synchronous upcalls occur implicitly, as the result of polling by an LFC routine, or explicitly as the result of an invocation of lfc_poll() by Panda. Panda polls explicitly when all threads are idle or when it tries to consume an as yet unreceived part of a stream message (see Section 6.3). A successful poll results in the (synchronous) invocation of lfc_upcall() in the context of the current thread (or the last-active thread if all threads are idle). This synchronous upcall queues the packet just received at the packet's destination SAP. If this is the first packet of a message, and no other messages are queued before this message, then Panda invokes the SAP's upcall routine (by means of a plain procedure call) without switching to another stack. This is safe, because Panda's upcall model requires that the polling thread does not hold any locks.
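The idle-time polling policy can be pictured with a small sketch. Only lfc_poll() is an actual LFC routine named in the text; the scheduler queries (runnable_threads(), next_runnable_thread(), switch_to()) are hypothetical stand-ins for OpenThreads internals.

    extern void lfc_poll(void);            /* assumed prototype; a successful poll runs lfc_upcall() */
    extern int runnable_threads(void);     /* hypothetical scheduler queries */
    extern void *next_runnable_thread(void);
    extern void switch_to(void *thread);

    static void scheduler_idle_loop(void)
    {
        for (;;) {
            if (runnable_threads() > 0) {
                /* Normal case: let a client thread run; it may poll or be interrupted. */
                switch_to(next_runnable_thread());
            } else {
                /* All client threads are idle: poll instead of waiting for interrupts.
                   The upcall runs synchronously on the current (last-active) stack. */
                lfc_poll();
            }
        }
    }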

Asynchronous upcalls occur when the NI generates a network interrupt. Network interrupts are propagated to the receiving user process by means of a UNIX signal. The signal handler interrupts the running thread and executes on that thread's stack. Eventually, the signal handler invokes lfc_poll() to process the next network packet. In this case, there is no guarantee that the interrupted thread does not hold any locks, so we cannot simply invoke a SAP's upcall routine. A simple solution is to store the packet just received in a queue and signal a separate upcall thread. Unfortunately, this involves a full thread switch.

To avoid this full thread switch, OpenThreads invokes lfc_poll() by means of a special, slightly more expensive procedure-call mechanism. Instead of running lfc_poll() on the stack of the interrupted thread, OpenThreads switches to a special upcall stack and runs the poll routine on that stack. At first sight, this is just a thread switch, but there are two differences. First, since the upcall stack is used only to run upcalls, OpenThreads does not need to restore any registers when it switches to this stack. Put differently, we always start executing at the bottom of the upcall stack. Second, OpenThreads sets up the bottom stack frame of the upcall stack in such a way that the poll routine will automatically return to the signal handler's stack frame on the top of the stack of the interrupted thread (see Figure 6.10(a)).

If the SAP handler does not block, then we will leave behind an empty upcall stack, return to the stack of the interrupted thread, return from the signal handler, and resume the interrupted thread. This is the common case that OpenThreads optimizes. If, on the other hand, the SAP handler blocks on a lock, then OpenThreads will disconnect the upcall stack from the stack of the interrupted thread. This is illustrated in Figure 6.10(b). OpenThreads modifies the return address in the bottom stack frame so that it points to a special exit function. This prevents the upcall from returning to the stack of the interrupted thread and it allows the interrupted thread to continue execution independently.

Summarizing, OpenThreads's special calling mechanism optimistically exploits the observation that most handlers run to completion without blocking. In this case, upcalls execute without true thread switches. If the handler does block, then we promote it to an independent thread which OpenThreads schedules just like any other thread.

6.3 Stream Messages

Panda's stream messages are implemented in the system module. The implementation minimizes copying and allows pipelining of data transfers from application to application process. The key idea is illustrated in Figure 6.11.


Fig. 6.10. Fast thread switch to an upcall thread. (a) Interrupted thread and upcall thread before the upcall thread has blocked; the upcall will return to the interrupted thread's stack. (b) When the upcall thread blocks, OpenThreads modifies the return address.

The data-transfer pipeline consists of four stages, which operate in parallel. In the first stage, Panda copies client data into LFC send packets. In the second stage, LFC transmits send packets to the destination NI. In the third stage, the receiving NI copies receive packets to host memory. Finally, in the fourth stage, the receiving Panda client consumes data from receive packets in host memory. Messages larger than a single packet will benefit from the parallelism in this pipeline. (A stream of short messages also benefits from this pipelining, but the key feature of stream messages is that the same parallelism can be exploited within a single message.)

Stream messages are implemented as follows. At the sending side, the system-module functions pan_unicast() and pan_multicast() are responsible for message fragmentation. They repeatedly allocate an LFC send packet, store a message header into this packet, and fill the remainder of the packet with data from the user's I/O vector and the header stack.
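The following sketch shows roughly how a Panda client hands data to the fragmentation code. The iovec fields (data, len) follow Figure 7.6, but the exact pan_unicast() signature and the separate header-stack argument shown here are assumptions for illustration only.

    struct pan_iovec { void *data; int len; };      /* fields as in Figure 7.6 */

    /* assumed prototype; the real Panda interface may differ */
    extern void pan_unicast(int dest, int sap, void *hdr_stack, int hdr_len,
                            struct pan_iovec *iov, int entries);

    void send_two_part_message(int dest, int sap,
                               void *hdr_stack, int hdr_len,
                               void *descr, int descr_len,
                               void *bulk, int bulk_len)
    {
        struct pan_iovec iov[2];

        iov[0].data = descr;   iov[0].len = descr_len;   /* e.g., an array descriptor */
        iov[1].data = bulk;    iov[1].len = bulk_len;    /* e.g., the array data */

        /* Panda repeatedly allocates an LFC send packet, stores a Panda header,
           and fills the rest with the header stack and the I/O vector's data. */
        pan_unicast(dest, sap, hdr_stack, hdr_len, iov, 2);
    }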

Unicast header formats are shown in Figure 6.12. The first packet of a message has a different header than the packets that follow. Both headers contain a count of piggybacked credits and a field that identifies the destination SAP. The credits field is used by Panda's sliding-window flow-control scheme.


Fig. 6.11. Application to application streaming with Panda's stream messages. (1) Panda copies user data into a send packet. (2) LFC transmits a send packet. (3) LFC copies the packet to host memory. (4) The Panda client consumes data.

Fig. 6.12. Panda's message-header formats. The header of the first packet contains piggybacked credits, the SAP identifier, the header stack size, and the message size; the header of subsequent packets contains only piggybacked credits and the SAP identifier.

The SAP identifier is used for reassembly. Panda does not interleave outgoing packets that belong to different messages. Consequently, senders need not add a message id to outgoing packets. The source identifier in the LFC header (see Figure 3.1) and the SAP identifier in the Panda header suffice to locate the message to which a packet belongs.

The header of a message's first packet contains two extra fields: the size of the header stack contained in the packet and the total message size. The header-stack size is used to find the start of the data that follows the headers. The total message size is used to determine when all of a message's packets have arrived.
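In C, the two header formats of Figure 6.12 could look roughly as follows; the field widths are illustrative (the thesis does not give them), only the fields themselves are taken from the text.

    struct panda_first_hdr {               /* header of a message's first packet */
        unsigned short credits;            /* piggybacked flow-control credits */
        unsigned short sap;                /* destination SAP identifier */
        unsigned int   hdr_stack_size;     /* locates the data behind the headers */
        unsigned int   msg_size;           /* total size: detects the last packet */
    };

    struct panda_next_hdr {                /* header of all subsequent packets */
        unsigned short credits;
        unsigned short sap;
    };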

Figure 6.13 illustrates the receiver-side data structures. The receiver maintains an array of SAPs. Each SAP contains a pointer to the SAP's handler and a queue of pending stream messages. A stream message is considered pending when its first packet has arrived and the receiving process has not yet entirely consumed the stream message.


Fig. 6.13. Receiver-side organization of Panda's stream messages.

A stream message is represented by a data structure that is created when the stream message's first packet arrives. This data structure contains an array which holds pointers to the stream message's constituent packets. Packets are entered into this array as they arrive. In Figure 6.13, one stream message is pending for SAP 0 and two stream messages are pending for SAP 3. The first stream message for SAP 3 consists of three packets. All packets but the last one are full.

When a packet arrives, lfc_upcall() searches the destination SAP's queue of pending stream messages. If it does not find the stream message to which this packet belongs, it creates a new stream message and appends it to the queue. (Since LFC delivers all unicast packets in-order, there is no need to store an offset or sequence number in the message headers.) If an incoming packet is not the first packet of its stream message, then lfc_upcall() will find the stream message in the SAP's queue of pending messages. The packet is then appended to the stream message's packet array.

When a stream message reaches the head of its SAP's queue, Panda dequeues the stream message and invokes the SAP handler, passing the stream message as an argument. The stream messages queued at a specific SAP are processed one at a time. (That is, a SAP's handler is not invoked again until the previous invocation has returned.)
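The queueing and dispatch steps described above can be summarized in a sketch. The data structures and helper functions below (find_stream_msg(), new_stream_msg(), append_to_queue(), dispatch_handler(), lfc_packet_source()) are invented for illustration; only the behavior follows the text.

    #define MAX_PKTS 64                      /* illustrative bound */

    struct packet;                           /* an LFC receive packet in host memory */
    struct sap;
    struct stream_msg {
        struct stream_msg *next;             /* next pending message for this SAP */
        struct packet *pkts[MAX_PKTS];       /* constituent packets, in arrival order */
        int npkts;
        int total_size;                      /* from the first packet's header */
    };
    struct sap {
        void (*handler)(struct stream_msg *m);
        struct stream_msg *head, *tail;      /* queue of pending stream messages */
        int handler_active;                  /* at most one invocation at a time */
    };

    /* hypothetical helpers, not Panda's actual interface: */
    struct stream_msg *find_stream_msg(struct sap *s, int src);
    struct stream_msg *new_stream_msg(struct packet *p);
    void append_to_queue(struct sap *s, struct stream_msg *m);
    void dispatch_handler(struct sap *s);    /* dequeues s->head and calls s->handler */
    int lfc_packet_source(struct packet *p);

    void queue_packet(struct sap *s, struct packet *p)
    {
        /* Match on the LFC source identifier and the SAP (see the text above). */
        struct stream_msg *m = find_stream_msg(s, lfc_packet_source(p));

        if (m == 0) {                        /* first packet of a new message */
            m = new_stream_msg(p);           /* records the total message size */
            append_to_queue(s, m);
        }
        m->pkts[m->npkts++] = p;

        /* Invoke the SAP handler for the message at the head of the queue,
           but never while a previous invocation is still running. */
        if (!s->handler_active && s->head != 0)
            dispatch_handler(s);
    }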


Pan_msg_consume() copies data from the packets in a message's packet list to user buffers. Each time pan_msg_consume() has consumed a complete packet, it returns the packet to LFC. If a Panda client does not want to consume all of a message's data, it can either skip some bytes or discard the entire message without copying any data.
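A SAP handler typically drains a stream message incrementally, as in the sketch below. Only pan_msg_consume() and the pan_msg_p type appear in the thesis (Figure 7.6); the application header layout and process_block() are made up for the example.

    struct app_hdr { int op; int data_len; };      /* hypothetical client header */

    void my_sap_handler(pan_msg_p msg)
    {
        struct app_hdr hdr;
        char buf[1024];
        int left, chunk;

        pan_msg_consume(msg, &hdr, sizeof(hdr));   /* may span packet boundaries */

        for (left = hdr.data_len; left > 0; left -= chunk) {
            chunk = left < (int) sizeof(buf) ? left : (int) sizeof(buf);
            /* Consuming data that has not yet arrived makes Panda poll (Section 6.3);
               fully consumed packets are returned to LFC. */
            pan_msg_consume(msg, buf, chunk);
            process_block(buf, chunk);             /* hypothetical application code */
        }
    }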

6.4 Totally-Ordered Broadcasting

This section describes an efficient implementation of Panda's totally-ordered broadcast primitive. Totally-ordered broadcast is a powerful communication primitive that can be used to manage shared, replicated data. Panda's totally-ordered broadcast, for example, is used to implement method invocations for replicated shared objects in the Orca system (see Section 7.1).

A broadcast is totally-ordered if all broadcast messages are received in a single order by all receivers and if this order agrees with the order in which senders sent the messages. Most totally-ordered broadcast protocols use a centralized component to implement the ordering constraint and the protocol presented in this section is no exception. The protocol uses a central node, a sequencer, to order messages. The protocol we describe uses LFC's fetch-and-add primitive to obtain sequence numbers to order messages.

The protocol is simple. Before sending a broadcast message, the sender just invokes lfc_fetch_and_add() to obtain a system-wide unique sequence number. This sequence number is attached to the message and then the message's packets are broadcast using LFC's broadcast primitive. All senders perform their fetch-and-add operations on a single fetch-and-add variable that acts as a shared sequence number. This variable is stored in a single NI's memory.

Receivers assemble the message in the same way they assemble unicast messages (see Section 6.3). The only difference is that a message is not delivered until all preceding messages have been delivered. This enforces the total ordering constraint.
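A minimal sketch of both sides of this protocol is shown below. lfc_fetch_and_add() and the broadcast are named in the text, but their exact signatures, the constants, and the helpers used here are assumptions.

    /* assumed prototype: atomically adds delta to the shared counter in the
       sequencer NI's memory and returns the old value */
    extern int lfc_fetch_and_add(int seq_ni, int seq_var, int delta);

    #define SEQUENCER_NI 0        /* illustrative: which NI holds the counter */
    #define SEQNO_VAR    0        /* illustrative: which fetch-and-add variable */

    struct bcast_msg;
    /* hypothetical helpers: */
    void broadcast_with_seqno(void *msg, int len, int seqno);
    void insert_by_seqno(struct bcast_msg *m);
    unsigned lowest_pending_seqno(void);
    struct bcast_msg *remove_lowest(void);
    void deliver_to_client(struct bcast_msg *m);

    /* Sender side: tag the message, then broadcast its packets (unordered). */
    void ordered_broadcast(void *msg, int len)
    {
        int seqno = lfc_fetch_and_add(SEQUENCER_NI, SEQNO_VAR, 1);
        broadcast_with_seqno(msg, len, seqno);
    }

    /* Receiver side: deliver reassembled messages in sequence-number order. */
    static unsigned next_to_deliver;

    void broadcast_arrived(struct bcast_msg *m)   /* called once per complete message */
    {
        insert_by_seqno(m);                       /* keep out-of-order arrivals sorted */
        while (lowest_pending_seqno() == next_to_deliver) {
            deliver_to_client(remove_lowest());
            next_to_deliver++;
        }
    }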

This Get-Sequence-number-then-Broadcast (GSB) protocol was first developed for a transputer-based parallel machine [65]. The main difference with the original implementation is that with LFC, requests for a sequence number are handled entirely by the NI. This reduces the latency of such requests in two ways. First, the NI need not copy the request to host memory and the host need not copy a reply message to NI memory. This gain is measurable (an LFC fetch-and-add costs less than an LFC host-to-host roundtrip) but small (19.8 µs versus 23.3 µs).


Fig. 6.14. Panda's message-passing latency (latency in microseconds versus message size in bytes, for Panda and LFC).

The main gain is the reduction in the number of interrupts. Namely, the sequencer does not expect a sequence number request. Therefore, such requests are quite likely to be delivered by means of an interrupt rather than a poll. LFC's fetch-and-add primitive avoids this type of interrupt.

6.5 Performance

In this section we discuss Panda's performance. We measure the latency and throughput for the message-passing and broadcast modules. We do not present separate results for the RPC module, because RPCs are trivially composed of pairs of message-passing module messages. We compare the Panda measurements with the LFC measurements of Chapter 5, to assess the cost of Panda's higher-level functionality.

In Chapter 5 we measured latency and throughput using three receive methods: no-touch, read-only, and copy. Here we use only the copy method; this is the most common scenario for Panda clients and includes all overhead that Panda's stream-message abstraction adds to LFC's packet interface.

6.5.1 Performance of the Message-Passing Module

Figure 6.14 shows the one-way latency for Panda's message-passing module. For comparison, we also show the one-way latency (with copying at the receiver side) of LFC's unicast primitive. As shown, Panda adds a constant amount (approximately 5 µs) of overhead to LFC's latency. There are several reasons for this increase:

1. Locking. To ensure multithread-safe execution, Panda brackets all calls to LFC routines with lock/unlock pairs.

2. Message abstraction. At the sending side, Panda has to process an I/O vector before sending an LFC packet. At the receiving side, incoming packets have to be appended to a stream message before the receiver upcall can process the data. Panda also maintains several pointers and counters to keep track of the current packet and the current position in that packet.

3. Demultiplexing. Panda adds a header to each LFC packet. The header identifies the destination port and the stream message to which the packet belongs.

4. Flow control. Panda has to check and update its flow-control state for each outgoing and incoming packet.

For 1024-byte messages, the difference between Panda's latency and LFC's latency is larger than for smaller message sizes. For this message size, Panda needs two packets, whereas LFC needs only one; this is due to Panda's header.

Figure 6.15 shows the one-way throughput for Panda's message-passing module and for LFC. Due to the overheads described above, Panda's throughput for small messages is not as good as LFC's throughput. In particular, sending and receiving the first packet of a stream message involves more work than sending and receiving subsequent packets. At the sending side, for example, we need to store the header stack into the first packet; at the receiving side, we have to create a stream message when the first packet arrives. For larger messages, these overheads can be amortized over multiple packets. For this reason, Panda does not reach its peak throughput until a message consists of multiple LFC packets.

For larger messages, Panda sometimes attains higher throughputs than LFC. This is due to cache effects that occur during the copying of data into send packets and out of receive packets.

6.5.2 Performance of the Broadcast Module

We measured broadcast performance using three different broadcast primitives:


Fig. 6.15. Panda's message-passing throughput (throughput in Mbyte/second versus message size in bytes, for LFC and Panda).

1. LFC's broadcast using the copy receive strategy (see Section 5.3).

2. Panda's unordered broadcast. Panda provides an option to disable total ordering. When this option is enabled, Panda does not tag broadcast messages with a system-wide unique sequence number. That is, Panda skips the fetch-and-add operation that it normally performs to obtain such a sequence number.

3. Panda's totally-ordered broadcast. This is the broadcast primitive described earlier in this chapter.

Figure 6.16 shows the latency for all three primitives, for message sizes up to 1 Kbyte and for 16 and 64 processors. The difference in latency for a null message between Panda's unordered broadcast and LFC's broadcast is 9 µs. As expected, adding total ordering increases the latency by a constant amount: the cost of a fetch-and-add operation. On 64 processors, the null latency difference between Panda's unordered and ordered broadcasts is 23 µs.

Figure 6.17 shows broadcast throughput for 16 and 64 processors. Adding total ordering reduces the throughput for small and medium-size messages. For larger messages, however, Panda reaches LFC's throughput.

Figure 6.16 and Figure 6.17 show only single-sender broadcast performance. With multiple senders, the performance of totally-ordered broadcasts may suffer from the centralized sequencer.


Fig. 6.16. Panda's broadcast latency (latency in microseconds versus message size in bytes, on 16 and 64 processors, for Panda ordered, Panda unordered, and LFC unordered broadcasts).

Fig. 6.17. Panda's broadcast throughput (throughput in Mbyte/second versus message size in bytes, on 16 and 64 processors, for LFC unordered, Panda unordered, and Panda ordered broadcasts).


In practice, however, this is rarely a problem. In many applications, there is at most one broadcasting process at any time. When processes do broadcast simultaneously, the overall broadcast rate is often low. In an Orca performance study [9], we measured the overhead of obtaining a sequence number from the sequencer. For 7 out of the 8 broadcasting applications considered, the time spent on fetching sequence numbers was less than 0.2% of the application's execution time. Only one application suffered from the centralized sequencer: for this application, a linear equation solver, the time to access the sequencer accounted for 4% of the application's execution time. All applications were run on FM/MC, which uses an NI-level fetch-and-add to obtain sequence numbers, just like LFC.

6.5.3 Other Performance Issues

Some optimizations discussed in this chapter reduce costs that are architecture and operating system dependent. The cost of a thread switch, for example, depends on the amount of CPU state that has to be saved and on the way in which this state is saved. The early Panda implementations ran on SPARC processor architectures on which each (user-level) thread switch requires a trap to the operating system. This trap is needed to flush the register windows, a SPARC-specific cache of the top of the runtime stack, to memory [72, 79]. On other architectures, thread switches are cheaper. On the Pentium Pros used for the experiments in this thesis, a switch from one Panda thread to another costs 1.3 µs. This is still a considerable overhead to add to LFC's one-way null latency, but it is not excessive.

Interrupts and signal processing are other sources of architecture- and operating-system-dependent overhead. Whereas thread switches are cheap on our experimental platform, interrupts and signals are expensive and should be avoided when possible (see Section 2.5).

6.6 Related Work

We discuss related work in five areas: portable message-passing libraries, polling and interrupts, upcall models and their implementation, stream messages, and totally-ordered broadcasting.


6.6.1 Portable Message-Passing Libraries

PVM [137] and MPI [53] are the most widely used portable message-passing systems. Unlike Panda, these systems target application programmers. The main differences between these systems and Panda are that PVM and MPI are hard to use in a multithreaded environment, use receive downcalls instead of upcalls, and do not support totally-ordered broadcast.

PVM and MPI do not provide a multithreading abstraction. Both systems use blocking receive downcalls and do not provide a mechanism to notify an external thread scheduler that a thread has blocked. This is not a problem if the operating system provides kernel-scheduled threads and the client uses these kernel threads. If, on the other hand, a client employs a thread package not supported by the operating system, or if the operating system is not aware of threads at all, then the use of blocking downcalls and the lack of a notification mechanism complicate the use of multithreaded systems on top of MPI and PVM.

Multithreaded MPI and PVM clients can avoid the blocking-receive problem by using nonblocking variants of the receive calls, but this means that applications will have to poll for messages, because PVM and MPI do not support asynchronous (i.e., interrupt-driven) message delivery. The same lack of asynchronous message delivery complicates the implementation of PPSs that require asynchronous delivery to operate correctly and efficiently. These PPSs are forced to poll (e.g., in a background thread), but finding the right polling rate is difficult.
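As an illustration of this polling workaround (not taken from the thesis), an MPI client can emulate asynchronous delivery by probing from a dedicated thread or between computation steps; choosing how often to probe is exactly the polling-rate problem mentioned above. The tags and the fixed-size buffer are arbitrary choices for this example.

    #include <mpi.h>

    #define REQUEST_TAG  1          /* example tags, not part of MPI */
    #define SHUTDOWN_TAG 2

    void poll_for_requests(void (*handle)(char *buf, int len, int src))
    {
        char buf[4096];             /* assumes requests fit in 4 KB */
        int flag, count, done = 0;
        MPI_Status status;

        while (!done) {
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                MPI_Get_count(&status, MPI_BYTE, &count);
                MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                         MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == SHUTDOWN_TAG)
                    done = 1;
                else
                    handle(buf, count, status.MPI_SOURCE);
            }
            /* ... do a slice of useful work, or yield, before probing again ... */
        }
    }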

LFC does not provide receive downcalls: all packets are delivered through upcalls. Blocking receive downcalls can still be implemented on top of LFC, though. (Panda's message-passing module provides a blocking receive.) When a blocking receive is implemented outside the message-passing system, control can be passed to the thread scheduler when a thread blocks on a receive.

Neither PVM nor MPI provides a totally-ordered broadcast. While a totally-ordered broadcast can be implemented using PVM and MPI primitives, such an implementation will be slower than an implementation that supports ordered broadcasting at the lowest layers of the system.

The Nexus communication library [54] has the same goals as Panda; Nexus is intended to be used as a compiler target or as part of a larger runtime system. Nexus provides an asynchronous, one-way message-passing primitive called a Remote Service Request (RSR). The destination of an RSR is named by means of a global pointer, which is a system-wide unique name for a communication endpoint. Endpoints can be created dynamically and references to them can be passed around in messages. Unlike Panda, Nexus does not provide a broadcast primitive.


Horus [144] and its precursor Isis [22] were designed to simplify the construction of distributed programs. Horus focuses on message-ordering and fault-tolerance issues. Panda supports parallel rather than distributed programming, provides only one type of ordered broadcast, and does not address fault tolerance. Like Panda, Horus can be configured in different ways. Horus, however, is much more flexible in that it allows protocol stacks to be specified at run time rather than at compile time.

6.6.2 Polling and Interrupts

In their remote-queueing model [28], Brewer et al. focus on the benefits of polling. They recognize, however, that interrupts are indispensable and combine polling with selective interrupts. Interrupts are generated only for specific messages (e.g., operating-system messages) or under specific circumstances (e.g., network overflow). In contrast, our integrated thread package chooses between polling and interrupts based on the application's state (idle or not).

CRL [74] is a DSM system that illustrates the need to combine polling and interrupts in a single program. Operations on shared data are bracketed by calls to the CRL library. During or between such operations, a CRL application may not enter the library for a long time, so unless the responsibility for polling is put on the user program, protocol requests can be handled in a timely manner only by using interrupts. CRL therefore uses interrupts to deliver protocol request messages; polling is used to receive protocol reply messages. This type of behavior occurs not only in CRL, but also in other DSMs such as CVM [114] and Orca [9]. It is exactly the kind of behavior that Panda deals with transparently.

Foster et al. describe the problems involved in implementing Nexus's message delivery mechanism (RSRs) on different operating systems and hardware [54]. Available mechanisms for polling and interrupts, and their costs, vary widely across different systems. Moreover, these mechanisms are rarely well integrated with the available multithreading primitives. For example, a blocking read on a socket may block the entire process rather than just the calling thread. We believe that our system, in which we have full control over thread scheduling and communication, achieves the desired level of integration.


6.6.3 Upcall Models

We have described three message-handling models in which each message results in a handler invocation at the receiving side. Nexus uses two of these models to dispatch RSR handlers. Nexus either creates a new thread to run the handler or else runs the handler in a preallocated thread. The first case corresponds to what we call the popup threads model, the second to the active-messages model. In the second case, the thread is not allowed to block.

A model closely related to popup threads is the optimistic active-messages model [150]. This model removes some of the restrictions of the basic active-messages model. It extends the class of handlers that can be executed as an active-message handler (i.e., without creating a separate thread) with handlers that can safely be aborted and restarted when they block. A stub compiler preprocesses all handler code. When the stub compiler detects that a handler performs a potentially blocking action, it generates code that aborts the handler when it blocks. When aborted, the handler is re-run in the context of a first-class thread (which is created on the fly). Thread-management costs are thus incurred only if a handler blocks; otherwise the handler runs as efficiently as a normal active-message handler.

Optimistic active messages are less powerful than popup threads. First, optimistic active messages require that all potentially blocking actions be recognizable to the stub compiler. Popup threads, in contrast, can be implemented without compiler support and do not rely on programmer annotations to indicate what parts of a handler may block. Second, an optimistic active-messages handler can be re-run safely only if it does not modify any global state before it blocks; otherwise this state will be modified twice. The programmer is thus forced to avoid or undo changes to global state until it is known that the handler cannot block any more.

The stack-splitting and return-address modification techniques we use to invoke message handlers efficiently are similar to the techniques used by Lazy Threads [58]. Lazy Threads provide an efficient fork primitive that optimistically runs a child on its parent's stack. When the child suspends, its return address is modified to reflect the new state. Also, future children will be allocated on different stacks. In our case, a thread that finds a message (either because it polled or because it received a signal) needs to fork a message handler for this message. Unlike Lazy Threads, however, the parent thread (the thread that polled or that was interrupted) does not wait for its child (the message handler) to terminate. Also, we run all children, including the first child, on their own stack.


6.6.4 Stream Messages

Stream messages were introduced in Illinois Fast Messages (version 2.0). Stream messages in Fast Messages differ in several ways from stream messages in Panda. The main difference is that Fast Messages requires that each incoming message be consumed in the context of the message's handler. This implies that the user needs to copy the message if the message arrives at an inconvenient moment. LFC allows a handler to put aside a stream message without copying that message.

This difference results from the way receive-packet management is implemented in Fast Messages. Fast Messages appends incoming packets to a circular queue of packet buffers in host memory. For each packet in this queue, Fast Messages invokes a handler function. When the handler function returns, Fast Messages recycles the buffer. This scheme is simple, because the host never needs to communicate the identities of free buffers to the NI. Both the host and the NI maintain a local index into the queue: the host's index indicates which packet to consume next and the NI's counter indicates where to store the next packet.

LFC explicitly passes the identities of free buffers to the NI (see Section 3.6), which has the advantage that the host can replace buffers. This is exactly what happens when lfc_upcall() does not allow LFC to recycle a packet immediately. As a result, packets can be put aside without copying, which is convenient and efficient. Clients should be aware, however, that packet buffers reside in pinned memory and are therefore a relatively precious resource. If a client continues to buffer packets and does not return them, then LFC will continue to add new packet buffers to its receive ring and will eventually run out of pinnable memory. To avoid this, clients should keep track of their resource usage and take appropriate measures, either by implementing flow control or by copying packets when the number of buffered packets exceeds a threshold.
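The copy-above-threshold measure can be as simple as the sketch below; the threshold, the counter, and the caller's obligation to decrement it when a kept buffer is finally returned to LFC are all choices made up for this example.

    #include <stdlib.h>
    #include <string.h>

    #define MAX_BUFFERED_PACKETS 256      /* illustrative threshold */

    static int buffered_packets;          /* LFC receive buffers currently held */

    /* Returns the packet to keep: either the original pinned LFC buffer
       (*copied == 0) or a heap copy of it (*copied == 1), in which case the
       LFC buffer can be returned immediately. */
    void *retain_packet(void *pkt, int len, int *copied)
    {
        void *copy;

        if (buffered_packets < MAX_BUFFERED_PACKETS) {
            buffered_packets++;           /* caller decrements when it frees the buffer */
            *copied = 0;
            return pkt;
        }
        copy = malloc(len);
        memcpy(copy, pkt, len);
        *copied = 1;
        return copy;
    }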

6.6.5 Totally-Ordered Broadcast

Totally-ordered broadcast is a well known concept in distributed computing, where it is used to simplify the construction of distributed programs. However, totally-ordered broadcast is not used much in parallel programming. As we shall see in Section 7.1, though, the Orca shared object system uses this primitive in an effective way to update replicated objects.

Many different totally-ordered broadcast protocols are described in the literature. Heinzle et al. describe the GSB protocol used by Panda on LFC [65]. The main difference between GSB and Panda's implementation is that Panda handles (through LFC) sequence number requests on the programmable NI; this reduces the number of network interrupts.

Kaashoek describes two other protocols, Point-to-point/Broadcast (PB) and Broadcast/Broadcast (BB), which also rely on a central, dedicated sequencer node [75]. In PB, the sender sends its message to the sequencer. The sequencer tags the message with a sequence number and broadcasts the tagged message to all destinations. In BB, the sender broadcasts its message to all destinations and the sequencer. When the sequencer receives the message, it assigns a sequence number to the message and then broadcasts a small message that identifies the message and contains the sequence number. Receivers are not allowed to deliver a message until they have received its sequence number and until all preceding messages have been delivered.

The performance characteristics of GSB, PB, and BB depend on the type of network they run on. PB and BB were developed on an Ethernet, where a broadcast has the same cost as a point-to-point message. On switched networks like Myrinet, however, a (spanning-tree) broadcast is much more expensive than a point-to-point message. Consequently, the performance of large messages will be dominated by the cost of broadcasting the data, irrespective of the ordering mechanism.

For small messages, BB's separate sequence-number broadcast is unattractive, because the worst-case latency of a totally-ordered broadcast becomes equal to twice the latency of an unordered broadcast (one for the data and one for the sequence number). Incidentally, BB was also considered unattractive for small messages in an Ethernet setting, but for a different reason: BB generates more interrupts than PB.

PB is much more attractive for small messages, because it adds only a single point-to-point message to the broadcast of the data. GSB uses two messages to obtain a sequence number. For large messages, however, PB has the disadvantage that all data packets must travel to and through the sequencer, thus putting more load on the network and on the sequencer. PB also puts more load on the sequencer than GSB if many processors try to broadcast simultaneously. With PB, the occupancy of the sequencer will be high, because the sequencer must broadcast every data packet. With GSB, the occupancy will be lower, because the sequencer need only respond to fetch-and-add requests. Finally, if all broadcasts originate from the sequencer, there are fewer opportunities to piggyback acknowledgements on multicast traffic.


6.7 Summary

This chapter described the implementation of Panda on LFC. Panda extends LFC with multithreading, messages of arbitrary length, demultiplexing, remote procedure call, and totally-ordered broadcast. The efficient implementation of this functionality is enabled by LFC's performance and interface and by novel techniques employed by Panda. LFC allows packets to be delivered through polling and interrupts. Both mechanisms are useful, but manually switching between them is tedious and error-prone. Panda hides the complexity of managing both mechanisms. Panda clients need not poll, because Panda's threading subsystem, OpenThreads, transparently switches to polling when all client threads are idle. All messages are delivered using asynchronous upcalls. Panda uses the single-threaded upcall model. This model imposes fewer restrictions than the active-messages model, but more than popup threads. Upcalls are dispatched efficiently using thread inlining and lazy thread-creation.

Panda implements stream messages, an efficient message abstraction that enables end-to-end pipelining of data transfers. Panda's stream messages separate notification and the consumption of message data, which allows clients to defer message processing.

Finally, we discussed a simple but efficient implementation of totally-ordered broadcasting. This implementation is enabled by LFC's efficient multicast primitive and its fetch-and-add primitive.


Chapter 7

Parallel-Programming Systems

This chapter focuses on the implementation of parallel-programming systems (PPSs) on LFC and Panda. We will show that the communication mechanisms developed in the previous chapters can be applied effectively to a variety of PPSs. We consider four PPSs:

- Orca [9], an object-based DSM
- Manta, a parallel Java system [99]
- CRL [74], a region-based DSM
- MPICH [61], an implementation of the MPI message-passing standard [53]

These systems offer different parallel-programming models and their implementations stress different parts of the underlying communication systems.

This chapter is organized as follows. Section 7.1 to Section 7.4 describe the programming model, implementation, and performance of, respectively, Orca, Manta, CRL, and MPI. Section 7.5 discusses related work. Section 7.6 compares the different systems and summarizes the chapter.

7.1 Orca

Orca is a PPS based on the shared-object programming model. This section describes this programming model, gives an overview of the Orca implementation, and then zooms in on two important implementation issues: operation transfer and operation execution. Finally, it discusses the performance of Orca.


7.1.1 Programming Model

In shared-memory and page-based DSM systems [78, 82, 92, 155], processes communicate by reading and writing memory words. To synchronize processes, the programmer must use mutual-exclusion primitives designed for shared memory, such as locks and semaphores. Orca's programming model, on the other hand, is based on high-level operations on shared data structures and on implicit synchronization, which is integrated into the model.

Orca programs encapsulate shared data in objects, which are manipulated through operations of an abstract data type. An object may contain any number of internal variables and arbitrarily complex data structures (e.g., lists and graphs). A key idea in Orca's model is to make each operation on an object atomic, without requiring the programmer to use locks. All operations on an object are executed without interfering with each other. Each operation is applied to a single object, but within this object the operation can execute arbitrarily complex code using the object's data. Objects in Orca are passive: objects do not contain threads that wait for messages. Parallel execution is expressed through dynamically created processes.

The shared-object model resembles the use of monitors. Both shared objects and monitors are based on abstract data types and for both models mutual-exclusion synchronization is done by the system instead of the programmer. For condition synchronization, however, Orca uses a higher-level mechanism [12], based on Dijkstra's guarded commands [44]. A guard is a boolean expression that must be satisfied before the operation can begin. An operation can have multiple guards, each of which has an associated sequence of statements. If the guards of an operation are all false, the process that invoked the operation is suspended. As soon as one or more guards become true, one true guard is selected nondeterministically and its sequence of statements is executed. This mechanism avoids the use of explicit wait and signal calls that are used by monitors to suspend and resume processes, simplifying programming.

Figure 7.1 gives an example definition of an object type Int. The definition consists of an integer instance variable x and two operations, inc() and await(). Operation inc() increments instance variable x and operation await() blocks until x has reached at least the value of parameter v. Instances of type Int are initialized by the initialization block that sets instance variable x to zero.

Figure 7.2 shows how Orca processes are defined and created and how objects are shared between processes. At application-startup time, the Orca runtime system (RTS) creates one instance of process type OrcaMain() on processor 0.


object implementation Int;
    x: integer;

    operation inc();
    begin
        x := x + 1;
    end;

    operation await(v: integer);
    begin
        guard x >= v do
        od;
    end;

begin
    x := 0;
end;
end;

Fig. 7.1. Definition of an Orca object type.

process Worker(counter: shared Int);
begin
    counter$inc();
    ...
end;

process OrcaMain();
begin
    counter: Int;

    for i in 1..15 do
        fork Worker(counter) on i;
    od;
    counter$await(15);
    ...
end;

Fig. 7.2. Orca process creation.


Fig. 7.3. The structure of the Orca shared object system (layers, from top to bottom: compiled Orca program, Orca runtime system, Panda and OpenThreads, LFC, Myrinet and Linux).

In this example, OrcaMain() creates 15 other processes, instances of process type Worker(), on processors 1 to 15. To each of these processes OrcaMain() passes a reference to object counter. When all processes have been forked, counter, which was originally local to OrcaMain(), is shared between OrcaMain() and all Worker() processes. OrcaMain() waits until all Worker() processes have indicated their presence by incrementing counter.

7.1.2 Implementation Overview

Figure 7.3 shows the software components that are used during the execution of an Orca program. The Orca compiler translates the modules that make up an Orca program. For portability, the compiler generates ANSI C. Since the Orca compiler performs many optimizations such as common-subexpression elimination, strength reduction, and code motion, the C code it generates often performs as well as equivalent, hand-coded C programs. The C code is compiled to machine code by a platform-dependent C compiler. The resulting object files are linked with the Orca RTS, Panda, LFC, and other support libraries. Below, we describe how the compiler and the RTS support the efficient execution of Orca programs.

The Compiler

Besides translating sequential Orca code, the compiler supports an efficient implementation of operations on shared objects in three ways. First, the compiler distinguishes between read and write operations. Read operations do not modify the state of the object they operate on. Operation await() in Figure 7.1, for example, is a read operation. If the compiler cannot tell if an operation is a read operation, then it marks the operation as a write operation.


Second, the compiler tries to determine the relative frequency of read and write operations [11]. The resulting estimates are passed to the RTS which uses them to determine an appropriate replication strategy for the object. For example, if the compiler estimates that an object is read much more frequently than it is written, then the RTS will replicate the object, because read operations on a replicated object do not require communication. The compiler's estimates are used only as a first guess. The RTS maintains usage statistics and may later decide to revise a previous decision [87].

Third, the compiler generates operation-specific marshaling code. Marshaling is discussed in Section 7.1.3.

The Runtime System

The Orca RTS implements process creation and termination, shared-object management, and operations on shared objects. Process creation occurs infrequently; most programs create, at initialization time, as many processes as the number of processors. Each Orca process is implemented as a Panda thread. To create a process in response to an application-level FORK call, the RTS broadcasts a fork message. Upon receiving this message, the processor on which the process is to be created (the forkee) issues an RPC back to the forking processor. This RPC is used to fetch copies of any shared objects that need to be stored on the forkee's processor. When the forkee receives the RPC's reply, it creates a Panda thread that executes the code for the new Orca process. The forker uses a totally-ordered broadcast message rather than a point-to-point message to initiate the fork. Broadcasting forks allows all processors to keep track of the number and type of Orca processes; this information is used during object migration decisions.
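The FORK flow described above is summarized in the sketch below; every name in it is invented for illustration and does not correspond to the actual RTS, Panda, or LFC interfaces.

    typedef int object_id_t;                         /* hypothetical */

    /* Forker: announce the fork with a totally-ordered broadcast so that every
       processor can update its process and object statistics. */
    void rts_fork(int target_cpu, int proc_type, object_id_t *shared, int nshared)
    {
        ordered_broadcast_fork(my_cpu(), target_cpu, proc_type, shared, nshared);
    }

    /* Executed on every processor when the fork broadcast is delivered. */
    void handle_fork(int forker_cpu, int target_cpu, int proc_type,
                     object_id_t *shared, int nshared)
    {
        record_process(target_cpu, proc_type);       /* bookkeeping on all processors */
        if (my_cpu() != target_cpu)
            return;

        /* Forkee: fetch copies of the shared objects with an RPC to the forker, ... */
        rpc_fetch_objects(forker_cpu, shared, nshared);

        /* ... then run the new Orca process as a thread. */
        create_thread(run_orca_process, proc_type);
    }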

The RTS implements two object-management strategies. The simplest strategy is to store a single copy of an object in a single processor's memory. To decide in which processor's memory the object must be stored, the RTS maintains operation statistics for each shared object. The RTS uses these statistics and the compiler-generated estimates to migrate each object to the processor that accesses the object most frequently [87].

The second strategy is to replicate an object in the memories of all processors that can access the object. In this case, the main problem is to keep the replicas in a consistent state. Orca guarantees a sequentially consistent [86] view of shared objects. Among others, sequential consistency requires that all processes agree on the order of writes to shared objects. To achieve this, all write operations are executed by means of a totally-ordered broadcast, which leads naturally to a single order for all writes.


Fig. 7.4. Decision process for executing an Orca operation (read or write operation? replicated object? local or remote object?).

Fekete et al. give a detailed formal description of correct implementation strategies for Orca's memory model [52].

Most programs create only a small number of shared objects. Using the compiler-generated hints and its runtime statistics, the RTS usually decides quickly and correctly on which processor(s) it must store each object [87]. For these reasons, object-management actions have not been optimized: replicating or migrating an object involves a totally-ordered broadcast and an RPC.

Operations are executed in one of three ways: through a local procedure call, through a remote procedure call, or through a totally-ordered broadcast. Figure 7.4 shows how one of these methods is selected. A read operation on a replicated object is executed locally, without communication. This is, of course, the purpose of replication: to avoid communication. Write operations on replicated objects require a totally-ordered broadcast. For operations on nonreplicated objects the RTS does not distinguish between read and write operations, but considers only the object's location. If the object is stored in the memory of the processor that invokes the operation, then the operation is executed by means of a local procedure call; otherwise a remote procedure call is used.
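Written out in C, the decision of Figure 7.4 amounts to the following; the enum and the struct fields are illustrative, not the RTS's actual types.

    enum exec_method { LOCAL_CALL, REMOTE_CALL, ORDERED_BROADCAST };

    struct object_desc {
        int replicated;    /* nonzero if all referencing processors hold a copy */
        int home_cpu;      /* location of a nonreplicated object */
    };

    enum exec_method choose_method(const struct object_desc *obj,
                                   int is_write, int my_cpu)
    {
        if (obj->replicated)
            return is_write ? ORDERED_BROADCAST   /* keep all replicas consistent */
                            : LOCAL_CALL;         /* reads never communicate */

        /* Nonreplicated object: only its location matters, not read vs. write. */
        return (obj->home_cpu == my_cpu) ? LOCAL_CALL : REMOTE_CALL;
    }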

The description so far is correct only for operations that have at most one guard. With multiple guards, the compiler does not know in advance whether the operation is a read or write operation: this depends on the guard alternative that is selected. The compiler could conservatively classify operations that have at least one write alternative as write operations. Instead, however, the compiler classifies individual alternatives as read or write alternatives. When the RTS executes an operation, it first tries to execute a read alternative, because for replicated objects reads are less expensive than writes. The write alternatives are tried only if all read alternatives fail. If the object is replicated, this involves a broadcast.

Operations that require communication are executed by means of function shipping. Instead of moving the object to the invoker's processor, the RTS moves the invocation and its parameters to all processors that store the object. This is done using an RPC or a totally-ordered broadcast, as explained above. Since all processors run the same binary program, the code to execute an operation need not be transported from one processor to another; a small operation identifier suffices. In addition to this identifier the RTS marshals the object identity and the operation's parameters. In the case of an operation on a nonreplicated object, this information is stored in the RPC's request message. When the request arrives at the processor that stores the object, this processor's RTS executes the operation and sends a reply that contains the operation's output parameters and return value. In the case of a write operation on a replicated object, the invoker's RTS broadcasts the operation identifier and the operation parameters to all processors that hold a replica. Since the RTS stores each replicated object on all processors that reference it, the invoker's processor always has a copy of the object. All processors that have a replica perform the operation when they receive the broadcast message; other processors discard the message. On the invoker's machine, the RTS additionally passes the operation's output parameters and return value to the (blocked) invoker.

7.1.3 Efficient Operation Transfer

This section describes how the Orca RTS, the Orca compiler, Panda, and LFC cooperate to transfer operation invocations as efficiently as possible.

Compiler and Runtime Support for Marshaling Operations

In earlier implementations of Orca, all marshaling of operations was performed by the RTS. To marshal an operation, the RTS needed the following information:

- the number of parameters
- the parameter modes (IN, OUT, IN OUT, or SHARED)
- the parameter types

This information was stored in a recursive data structure in which each component type of a complex type had its own type descriptor node. During operation execution, the RTS made two passes over this data structure: one to find the total size of the data to be marshaled and one to copy the data into a message buffer.


type Person = record
    age: integer;
    weight: integer;
end;

type PersonList = array [integer] of Person;

Fig. 7.5. Orca definition of an array of records.

For an array of records such as defined in Figure 7.5, for example, the RTS would inspect each record's type descriptor to determine the record's size, even if all records had the same, statically known size.

The current compiler determines object sizes at compile time whenever possible. (Orca supports dynamic arrays and a built-in graph data type, so object sizes cannot always be computed statically.) For each operation defined as part of an object type, the compiler generates operation-specific marshaling and unmarshaling routines, which are invoked by the Orca RTS. In some cases, these routines call back into the RTS to marshal dynamic data structures and object metastate.

Figure 7.6 shows the code generated to marshal and unmarshal the Orca type defined in Figure 7.5; this code has been edited and slightly simplified for readability. The figure shows only the routines that are used to construct and read an operation request message. Another triple of routines is generated for RPC reply messages that hold OUT parameters and return values.

The first routine, sizeof_iovec_array_record(), computes the number of pointers in the I/O vector and is used by the RTS to allocate a sufficiently large I/O vector. (The RTS caches I/O vectors, but has to check if there is a cached vector that can hold as many pointers as needed.) The second routine, fill_iovec_array_record(), generates a Panda I/O vector that contains pointers to the data items that are to be marshaled. If the array is empty, the routine adds only a pointer to the array descriptor; otherwise it adds a pointer to the array descriptor and a pointer to the array data.

To transmit the operation, the RTS stores its header(s) in a header stack and passes both the I/O vector and the header stack to one of Panda's send routines. Panda copies the data items referenced by the I/O vector into LFC send packets.

At the receiving side, Panda creates a stream message and passes this message to the Orca RTS. The RTS looks up the target object and the descriptor for the operation that is to be invoked, and then invokes unmarshal_array_record(), the third routine in Figure 7.6.


int sizeof_iovec_array_record(PersonListDesc *a)
{
    if (a->a_sz > 0) {
        return 2;   /* pointer to descriptor and to array data */
    }
    return 1;       /* pointer to descriptor; no data (empty array) */
}

pan_iovec_p fill_iovec_array_record(pan_iovec_p iov, PersonListDesc *a)
{
    iov->data = a;              /* add pointer to the array descriptor */
    iov->len = sizeof(*a);
    iov++;

    if (a->a_sz > 0) {
        /* add pointer to array data */
        iov->data = (Person *) a->a_data;
        iov->len = a->a_sz * sizeof(Person);
        iov++;
    }
    return iov;
}

void unmarshal_array_record(pan_msg_p msg, PersonListDesc *a)
{
    Person *r;

    pan_msg_consume(msg, a, sizeof(*a));   /* unmarshal array descriptor */
    if (a->a_sz > 0) {
        r = malloc(a->a_sz * sizeof(Person));
        a->a_data = r;
        pan_msg_consume(msg, r, a->a_sz * sizeof(Person));
    } else {
        a->a_sz = 0;
        a->a_data = 0;
    }
}

Fig. 7.6. Specialized marshaling code generated by the Orca compiler.


Fig. 7.7. Data transfer paths in Orca: (1) marshaling of Orca data into send packets, (2) data transmission, (3) DMA transfer from NI to host memory, (4) unmarshaling of the Panda stream message.

This operation-specific routine first unmarshals the array descriptor and then decides if it needs to unmarshal any data.

Data Transfers

Above, we discussed the Orca part of operation transfers. This section discusses the entire path, through all communication layers. All data transfers involved in an operation on a remote object are shown in Figure 7.7. At the sending side, Panda copies (using programmed I/O) data referenced by the I/O vector directly from user data structures into LFC send packets. The second stage in the data transfer pipeline consists of moving LFC send packets to the destination NI. In the third stage, LFC uses DMA to move network packets from NI memory to host memory. Panda organizes the packets in host memory into a stream message and passes this message to the Orca RTS. In the fourth stage, the RTS allocates space to hold the data stored in the message and unmarshals (i.e., copies) the contents of the message into this space. All four stages operate concurrently. As soon as Panda has filled a packet, for example, LFC transmits this packet, while Panda starts filling the next packet.

At the sending side, one unnecessary data transfer occurs. Since LFC uses programmed I/O to move data into send packets, the data crosses the memory bus twice: once from host memory to the processor registers and from those registers to NI memory. By using DMA instead of PIO, the data would cross the memory bus only once. The data in Figure 2.2 suggests that for messages of 1 Kbyte and larger DMA will be faster than PIO. The use of DMA also frees the processor to do other work while the transfer progresses. The same asynchrony, however, also implies that a mechanism is needed to test if a transfer has completed. This requires communication between the host and the NI. Either the host must poll a location in NI memory, which slows down the data transfer, or the NI must signal the end of the data transfer by means of a small DMA transfer to host memory, which increases the occupancy of both the NI processor and the DMA engine.

Another problem with DMA is that it is necessary to pin the pages containing the Orca data structures or to set up a dedicated DMA area. Unless the Orca data structures can be stored in the DMA area, the latter approach requires a copy to the DMA area. On-demand pinning of Orca data is feasible, but complex. To avoid pinning and unpinning on every transfer, a caching scheme such as VMMC-2's UTLB is needed (see Section 2.3). For small transfers, such a scheme is less efficient than programmed I/O.

At the receiving side, the optimal scenario would be for the NI to transfer data from the network packets in NI memory to the data buffers allocated by the RTS (i.e., to avoid the use of host receive packets). In practice, this is difficult. First, the NI has to know to which message each incoming data packet belongs. This implies that the NI has to maintain state for each message and look up this state for each incoming packet. Second, the NI has to find a sufficiently large destination buffer in host memory into which the data can be copied. Third, after the data has been copied, the NI has to notify the receiving process. Unless the host and the NI perform a handshake to agree upon the destination buffer, this notification cannot be merged with the data transfer as in LFC. A handshake, however, increases latency.

Summarizing, eliminating all copies is difficult and it is doubtful whether doing so will significantly improve performance. We therefore decided to use a potentially slower, but simpler data path.

7.1.4 Efficient Implementation of Guarded Operations

This section describes an efficient implementation of Orca's guarded operations on top of Panda's upcall model. When Panda receives an operation request, it dispatches an upcall to the Orca RTS. It is not obvious how the RTS should execute the requested operation in the context of this upcall. The problem is that the operation may block on a guard. Panda's upcall model, however, forbids message handlers to block and wait for incoming messages. Such messages may have to be received and processed to make a guard evaluate to true.

Fig. 7.8. Orca with popup threads. [Figure: Panda's upcall thread stores incoming operation requests in an RPC queue or a broadcast queue inside the Orca RTS; a pool of RTS server threads removes the requests and executes them on the Orca processes.]

To solve this problem, an earlier version of the Orca RTS implemented its own popup threads. The structure of this RTS, which we call RTS-threads, is shown in Figure 7.8. In RTS-threads, Panda's upcall thread does not execute incoming operation requests, but merely stores them in one of two queues. RPC requests for operations on nonreplicated objects are stored in the RPC queue and broadcast messages for write operations on replicated objects are stored in the broadcast queue. These queues are emptied by a pool of RTS-level server threads. When all server threads are blocked, the RTS extends the thread pool with new server threads. Since the server threads do not listen to the network, they can safely block on a guard when they process an incoming operation request from the RPC queue. (This blocking is implemented by means of condition variables.) Also, it is now safe to perform synchronous communication while processing a broadcast message. This type of nested communication occurs during the creation of Orca processes and during object migration.

RTS-threads uses only one thread to service the broadcast queue. With multiple threads, incoming (totally-ordered) broadcast messages could be processed in a different order than they were received. Using only a single thread, however, implies (once again) that, without extra measures, write operations on replicated objects cannot block.

Besides this ordering and blocking problem, RTS-threads suffers from problems related to the use of popup threads (see Section 6.2.3): increased thread switching and increased memory consumption. Processing an incoming operation request always requires at least one thread switch (from Panda's upcall to a server thread). In addition, when many incoming operation requests block, RTS-threads is forced to allocate many thread stacks, which wastes memory. Finally, when some operation modifies an object that many operations are blocked on, RTS-threads will signal all threads associated with these operations. These threads will then re-evaluate the guard they were blocked on, perhaps only to find that they need to block again. This leads to excessive thread switching.

Fig. 7.9. Orca with single-threaded upcalls. [Figure: Panda's upcall thread executes incoming operation requests directly on the Orca processes, without RTS server threads.]

To solve these problems, we restructured the RTS so that all operation requests can be processed directly by Panda's upcall, without the use of server threads. The new structure is shown in Figure 7.9. The new RTS employs only one RTS server thread (not shown), which is used only during actions that occur infrequently, such as process creation and object migration. In these cases, it is necessary to perform an RPC (i.e., synchronous communication) in response to an incoming broadcast message. This RPC is performed by the RTS server thread, just as in RTS-threads.

Blocking on guards is handled by exploiting the observation that Orca operations can block only at the very beginning. A blocked operation can therefore always be represented by an object identifier, an operation identifier, and the operation's parameters. This invocation information is available to the RTS when it invokes an operation. When an operation blocks, the compiler-generated code for that operation returns an error status to the RTS. Instead of blocking the calling thread on a condition variable, as in RTS-threads, the RTS creates a continuation. In this continuation, the RTS stores the invocation information and the name of an RTS function that, given the invocation information, can resume the blocked invocation. Different resume functions are used, depending on the way the original invocation took place. The resume function for an RPC invocation, for example, differs from the resume function for a broadcast invocation, because it needs to send a reply message to the invoking processor.

cont_init(cont_queue, lock)
cont_clear(cont_queue)
state_ptr = cont_alloc(cont_queue, size, cont_func)
cont_save(state_ptr)
cont_resume(cont_queue)

Fig. 7.10. The continuations interface.

Each shared object has a continuation queue in which the RTS stores continuations for blocked operations. When the object is later modified, the modifying thread walks the object's continuation queue and calls each continuation's resume function. If an operation fails again, its continuation remains on the queue, otherwise it is destroyed.

Figure 7.10 shows the interface to the continuation mechanism. Queues of continuations resemble condition variables, which are essentially queues of thread descriptors. This similarity makes replacing condition variables with continuations quite easy. Cont_init() initializes a continuation queue; initially, the queue is empty. Like a condition variable, each continuation queue has an associated lock (lock) that ensures that accesses to the queue are atomic. Cont_clear() destroys a continuation queue. Cont_alloc() heap-allocates a continuation structure and associates it with a continuation queue. (Continuations cannot be allocated on the stack, because the stack is reused.) Cont_alloc() returns a pointer to a buffer of size bytes in which the client saves its state. After saving its state, the client calls cont_save() which appends the continuation to the queue. Together, cont_alloc() and cont_save() correspond to a wait operation on a condition variable. In the case of condition variables, however, no separate allocation call is needed, because the system knows what state to save: the client's thread stack. Finally, cont_resume() resembles a broadcast on a condition variable; it traverses the queue and resumes all continuations.
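As a concrete illustration of this interface, the sketch below shows how the RTS might park a blocked invocation and later resume it. The state layout, the helper names (rts_try_operation(), rts_send_rpc_reply(), object_t, obj->cont_queue), the resume function's type, and the convention that a still-blocked continuation simply stays on the queue are assumptions made for illustration; this is not the actual RTS code.

/* Illustrative: representing a blocked RPC invocation as a continuation. */
typedef struct {
    int   obj_id;     /* object identifier          */
    int   op_id;      /* operation identifier       */
    void *args;       /* the operation's parameters */
} blocked_inv;

/* Resume function stored in the continuation; called via cont_resume()
 * after some write operation has modified the object. */
static void resume_rpc_invocation(void *state)
{
    blocked_inv *inv = state;

    if (rts_try_operation(inv->obj_id, inv->op_id, inv->args))
        rts_send_rpc_reply(inv->obj_id, inv->op_id);   /* guard now true */
    /* otherwise the guard is still false; by assumption the continuation
       then remains on the queue until the next cont_resume()            */
}

/* Called when the compiler-generated operation code reports a false guard. */
static void park_invocation(object_t *obj, int op_id, void *args)
{
    blocked_inv *inv = cont_alloc(obj->cont_queue, sizeof(*inv),
                                  resume_rpc_invocation);

    inv->obj_id = obj->obj_id;
    inv->op_id  = op_id;
    inv->args   = args;
    cont_save(inv);     /* append to the object's continuation queue */
}

A write operation that modifies the object would then call cont_resume(obj->cont_queue), which walks the queue and invokes the resume function of each parked invocation.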

Representing blocked operations by continuations instead of blocked threads has three advantages.

1. Continuations consume very little memory. There is no associated stack, just the invocation information.

2. Resuming a blocked invocation does not involve any signaling and switching to other threads, but only plain procedure calls.

3. Continuations are portable across communication architectures. With appropriate system support (such as the fast handler dispatch described in Section 6.2.3), popup threads can be implemented more efficiently than in RTS-threads. Continuations do not require such support and they avoid some of the disadvantages of popup threads.

Manual state saving with continuations is relatively easy if message handlers can block only at the beginning (as with Orca operations), because the state that needs to be saved consists essentially of the message that has been received. Manual state saving is more difficult when handlers block after they have created new state. In this case, the handler must be split in two parts: the code executed before and after the synchronization point. For very large systems, this manual function splitting becomes tedious. The difficulty lies in identifying the state that needs to be communicated from the function's first half to its second half (by means of a continuation). This state may well include state stored in stack frames below the frame that is about to block (assuming that stacks grow upward). In the worst case, the entire stack must be saved.

A second complication arises if a handler performs synchronous communication. A synchronous communication call can only be split in two parts if there is a nonblocking alternative for the synchronous primitive. Earlier versions of Panda, for example, did not provide asynchronous message passing primitives, but only a synchronous RPC primitive. In this case, true blocking cannot be avoided and it is necessary to hand off state to a separate thread that can safely perform the synchronous operation. In the continuation-based RTS, such cases are handled by a single RTS server thread.

7.1.5 Performance

This section discusses only operation latency and throughput. An elaborate application-performance study was performed using an Orca implementation that runs on Panda (version 3.0) and FM/MC [9]. Application performance on Panda 4.0 and LFC is discussed in Chapter 8, which studies the application-level effects of different LFC implementations. All measurements in this chapter were performed on the experimental platform described in Section 1.6, unless noted otherwise.


Fig. 7.11. Roundtrip latencies for Orca, Panda, and LFC. [Figure: roundtrip latency (microseconds) versus message size (0-1000 bytes).]

Performance of Operations on Nonreplicated Objects

Figure 7.11 shows the execution time of an Orca operation on a remote, nonreplicated object. The operation takes an array of bytes as its only argument and returns the same array. The horizontal axis specifies the number of bytes in the array. The operation is performed using a Panda RPC to the remote machine. For comparison, Figure 7.11 also shows roundtrip latencies for Panda and LFC using messages that have the same size as the Orca array. In all three benchmarks, the request and reply messages are received through polling.

Orca-specific processing costs approximately 15 µs. This includes the time to lock the RTS, to read the RTS headers, to allocate space for the array parameter, to look up the object, to execute the operation, to update the runtime object access statistics, and to build the reply message.

We used the same test to measure operation throughput. The results of this test, and the corresponding roundtrip throughputs for Panda and LFC with the copy receive strategy, are shown in Figure 7.12. We define roundtrip throughput as 2m/RTT, where m is the size in bytes of the array transmitted in the request and reply message and RTT the measured roundtrip time.
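As a purely illustrative application of this definition (the numbers are made up, not measurements): echoing an m = 1-Kbyte array with a measured roundtrip time of 50 µs corresponds to a roundtrip throughput of (2 × 1024 bytes) / 50 µs, or roughly 40 Mbyte/s.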

Compared to the throughput obtained by Panda and LFC, Orca's throughput is good. This is due to the use of I/O vectors and stream messages, which avoid extra copying inside the Orca RTS. For all systems, throughput decreases when messages no longer fit into the L2 cache (256 Kbyte).

Fig. 7.12. Roundtrip throughput for Orca, Panda, and LFC. [Figure: roundtrip throughput (Mbyte/s) versus message size (64 bytes to 1 Mbyte).]

Performance of Operations on Replicated Objects

Figure 7.13 shows the latency of a write operation on an object that has been replicated on 64 processors. The operation takes an array of bytes as its only argument and does not return any results. This operation is performed by means of a totally-ordered Panda broadcast. For comparison, the figure also shows the broadcast latencies for Panda (ordered and unordered) and LFC (unordered). In all cases, the latency shown is the latency between the sender and the last receiver of the broadcast. Orca adds approximately 18 µs (17%) to Panda's totally-ordered broadcast primitive.

Figure 7.14 shows the throughput obtained using the same write operation as above. The loss in throughput relative to Panda (with ordering) is at most 15%. This occurs when Orca sends two packets and Panda one (due to an extra Orca header). In all other cases the loss is at most 8%.


Fig. 7.13. Broadcast latency for Orca, Panda, and LFC. [Figure: latency (microseconds) versus message size (0-1000 bytes) on 64 processors; Orca, Panda (ordered), Panda (unordered), and LFC.]

Fig. 7.14. Broadcast throughput for Orca, Panda, and LFC. [Figure: throughput (Mbyte/s) versus message size (64 bytes to 1 Mbyte) on 64 processors; LFC (unordered), Panda (unordered), Panda (ordered), and Orca.]


The Impact of Continuations

Since we have no implementation of RTS-threads on Panda 4.0, we cannot directly measure the gains of using continuations instead of popup threads. An earlier comparison showed that operation latencies went from 2.19 ms in RTS-threads to 1.90 ms in the continuation-based RTS [14], an improvement of 13%. These measurements were performed on 50 MHz SPARCClassic clones connected by 10 Mbps Ethernet. For communication we used Panda 2.0 on top of the datagram service of the Amoeba operating system. Thread switching on the SPARC was expensive, because it required a trap to the kernel. On the Pentium Pro, thread switches are performed entirely in user space, so the gains of using continuations instead of popup threads are smaller. Nevertheless, the costs of switching from a Panda upcall to a popup thread are significant. In Manta, for example, dispatching a popup thread costs 4 µs and accounts for 9% of the execution time of a null operation on a remote object (see Section 7.2).

7.2 Manta

Manta is a parallel-programming system based on Java [60]. Java is a portable, type-secure, and object-oriented programming language with automatic memory management (i.e., garbage collection); these properties have made Java a very popular programming language. Java is also an attractive (base) language for parallel programming. Multithreading, for example, is part of the language and an RPC-like communication mechanism is available in standard libraries.

Manta implements the JavaParty [117] programming model, which is based on shared objects and resembles Orca's shared-object model. The programming model and its implementation on Panda and LFC are described below. A detailed description of Manta is given in the MSc thesis of Maassen and van Nieuwpoort [98].

7.2.1 Programming Model

Manta extends Java with one keyword: remote. Java threads can share instances of any class that the programmer tags with this keyword. Such instances are called remote objects. In addition, Manta provides library routines to create threads on remote processors. Threads that run on the same processor can communicate through global variables; threads that run on different machines can communicate only by invoking methods on shared remote objects.


Issue                                       Orca              Manta
Programming     Object placement            Transparent       Explicit
model           Mutual exclusion            Per operation,    Synchronized
                                            implicit          methods/blocks
                Condition synchronization   Guards            Condition variables
                Garbage collection          No                Yes
Implementation  Compilation                 To C              To assembly
                                                              (x86 and SPARC)
                Object replication          Yes               No
                Object migration            Yes               No
                Invocation mechanisms       Totally-ordered   RMI
                                            bcast and RPC
                Upcall models               Panda upcalls     Popup threads

Table 7.1. A comparison of Orca and Manta.

This programming model resembles Orca's shared-object model, but there are several important differences, which are summarized in Table 7.1 and discussed below. Section 7.2.2 discusses the implementation differences.

First, in Manta, object placement is not transparent to the programmer. When creating a remote object, the programmer must specify explicitly on which processor the object is to be stored. Each object is stored on a single processor (Manta does not replicate objects) and cannot be migrated to another processor.

Second, Orca and Java use different synchronization mechanisms. In Orca, all operations on an object are executed atomically. In Java, individual methods can be marked as 'synchronized.' Two synchronized methods cannot interfere with each other, but a nonsynchronized method can interfere with other (synchronized and nonsynchronized) methods. Also, Java uses condition variables for condition synchronization; Orca uses guards.

The last difference is that Java is a garbage-collected language, which has some impact on communication performance (see Section 7.2.3).

7.2.2 Implementation

Like Orca, Manta does not use LFC directly, but builds on Panda. Manta benefits from Panda's multithreading support, Panda's transparent mixing of polling and interrupts, and Panda's stream messages. Since Manta does not replicate objects, it need not maintain multiple consistent copies of an object, and therefore does not use Panda's totally-ordered broadcast primitive.

Manta achieves high performance through a fast remote-object invocation mechanism and the use of compilation instead of interpretation. Both are discussed below.

Remote-Object Invocation

Like Orca, Manta uses function shipping to access remote objects. Method invocations on a remote object are shipped by means of an object-oriented variant of RPC called Remote Method Invocation (RMI) [152]. Each Manta RMI is executed by means of a Panda RPC.

Processing an RMI request consists of executing a method on a remote object. All parameters except remote objects are passed by value. Method invocations can block at any time on a condition variable. This is different from Orca where blocking occurs only at the beginning of an operation. Due to this difference, Manta cannot always use Panda upcalls and continuations to handle incoming RMIs. If a method is executed by a Panda upcall and blocks halfway through, then the Manta runtime system is not aware of the state created by that method and cannot create a continuation.

To deal with blocking methods, Manta uses popup threads to process incoming RMIs. When an RMI request is delivered by a Panda upcall, the Manta runtime system passes the request to a popup thread in a thread pool. This thread executes the method and sends the reply. When the method blocks, the popup thread is blocked on a Panda condition variable; no continuations are created. In short, Manta uses the same system structure as earlier Orca implementations (cf. Figure 7.8).

Compiler Support

Besides defining the Java language, the Java developers also defined the Java Virtual Machine (JVM) [95], a virtual instruction set architecture. Java programs are typically compiled to JVM bytecode; this bytecode is interpreted. This approach allows Java programs to be run on any machine with a Java bytecode interpreter. The disadvantage is that interpretation is slow. Current research projects are attacking this problem by means of just-in-time (JIT) compilation of Java bytecode for a specific architecture and by means of hardware support [104].

Manta includes a compiler that translates Java source code to machine code for SPARC and Intel x86 architectures [145]. This removes the overhead of interpretation for applications for which source code is available. For interoperability, Manta can also dynamically load and execute Java bytecode [99]. The experiments below, however, use only native binaries without bytecode.

Fig. 7.15. Roundtrip latencies for Manta, Orca, Panda, and LFC. [Figure: roundtrip latency (microseconds) versus message size (0-1000 bytes); Manta RMI, Orca, Panda, and LFC.]

The compiler also supports Manta's RMI. For each remote class, the compiler generates method-specific marshaling and unmarshaling routines. Manta marshals arrays in the same way as Orca (see Figure 7.6). At the sending side, the compiler-generated code adds a pointer to the array data to a Panda I/O vector and Panda copies the data into LFC send packets. At the receiving side, the compiler-generated code uses Panda's pan_msg_consume() to copy data from LFC host receive packets to a Java array.

For nonarray types, Manta makes an extra copy at the sending side. All nonarray data that is to be marshaled is copied to a single buffer and a pointer to this buffer is added to a Panda I/O vector. Since nonarray objects are often small, the extra copying costs may outweigh the cost of I/O vector manipulations.

7.2.3 Performance

Figures 7.15 and 7.16 show Manta's RMI (roundtrip) latency and throughput, respectively. These numbers were measured by repeatedly performing an RMI on a remote object. The method takes a single array parameter and returns this parameter; the horizontal axes of Figure 7.15 and Figure 7.16 specify the size of the array. For comparison, the figures also show the roundtrip latencies and throughputs for LFC, Panda, and Orca. All four systems make the same number of data copies.

Fig. 7.16. Roundtrip throughput for Manta, Orca, Panda, and LFC. [Figure: roundtrip throughput (Mbyte/s) versus message size (64 bytes to 1 Mbyte); LFC, Panda, Orca, and Manta RMI.]

The null latency for an empty array is 76 µs. This high latency is due to unoptimized array manipulations. With zero parameters instead of an empty array parameter, the null latency drops to 48 µs. The thread switch from Panda's upcall to a popup thread in the Manta runtime system costs 4 µs. For nonempty arrays, Manta's latency increases faster than the latency of Orca, Panda and LFC. This is not due to extra copying, but to the cache effect described below.

Figure 7.16 shows the roundtrip throughput for LFC, Panda, Orca, and Manta. Manta's RMI peak throughput is much lower than the peak throughput for Orca, Panda, and LFC. The reason is that Manta does not free the data structures into which incoming messages are unmarshaled until the garbage collector is invoked. In this test, the garbage collector is never invoked, so each incoming message is unmarshaled into a fresh memory area, which leads to poor cache behavior. Specifically, each time a message is unmarshaled into a buffer, the contents of receive buffers used in previous iterations are flushed from the cache to memory. This problem does not occur in the other tests (Orca, Panda, and LFC), because they reuse the same receive buffer in each iteration.

If Manta can see that the data objects contained in a request message are no longer reachable after completion of the remote method that is invoked, then it can immediately reuse the message buffer without waiting for the garbage collector to recycle it. At present, the Manta compiler is able to perform the required escape analysis [36] for simple methods, including the method used in our throughput benchmark. Since we wish to illustrate the garbage collection problem and since escape analysis works only for simple methods, we show the unoptimized throughput. With escape analysis enabled, Manta tracks Panda's throughput curve.

7.3 The C Region Library

The C Region Library (CRL) is a high-performance software DSM system which was developed at MIT [74]. Using CRL, processes can share regions of memory of a user-defined size. The CRL runtime system implements coherent caching for these regions. This section discusses CRL's programming model, its implementation on LFC, and the performance of this implementation.

7.3.1 Programming Model

CRL requires the programmer to store shared data in regions. A region is a contiguous chunk of memory of a user-defined size; a region can neither grow nor shrink. Processes can map regions into their address space and access them using memory-reference instructions.

To make changes visible to other processes, and to observe changes made by other processes, each process must enclose its region accesses by calls to the CRL runtime system. (CRL is library-based and has no compiler.) A series of read accesses to a region r should be bracketed by calls to rgn_start_read(r) and rgn_end_read(r), respectively. If a series of accesses includes an instruction that modifies a region r, then rgn_start_write(r) and rgn_end_write(r) should be used. Rgn_start_read() locks a region for reading and rgn_start_write() locks a region for writing. Both calls block if another process has already locked the region in a conflicting way. (A conflict occurs whenever at least one write access is involved.) Rgn_end_read() and rgn_end_write() release the read and write lock, respectively. If all region accesses in a program are properly bracketed, then CRL guarantees a sequentially consistent [86] view of all regions. However, the CRL library cannot verify that users indeed bracket their accesses properly.
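The fragment below sketches what properly bracketed accesses look like. It assumes, for illustration only, that the bracketing calls take the mapped region (here an array of doubles) as their argument; region creation and mapping are left out.

/* Illustrative sketch of bracketed CRL region accesses. */
void increment_first(double *r)
{
    rgn_start_write(r);    /* write lock; blocks on conflicting accesses   */
    r[0] += 1.0;           /* plain memory references on the mapped region */
    rgn_end_write(r);
}

double read_first(double *r)
{
    double v;

    rgn_start_read(r);     /* read lock */
    v = r[0];
    rgn_end_read(r);
    return v;
}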


7.3.2 Implementation

Although CRL uses a sharing model that is similar to Orca's shared objects, the implementation is different. First, CRL runs directly on LFC and does not use Panda. Second, CRL uses a different consistency protocol for replicated shared objects. Orca updates shared objects by means of function shipping. CRL, in contrast, uses invalidation and data shipping. (A function-shipping implementation of CRL exists [67], but the LFC implementation of CRL is based on the original, data-shipping variant.) CRL uses a protocol similar to the cache coherence protocols used in scalable multiprocessors such as the Origin 2000 [151] and the MIT Alewife [1]. The protocol treats every shared region as a cache line and runs a directory-based cache coherence protocol to maintain cache consistency.

The core of CRL’s runtime systemis a statemachinethat implementstheconsistency protocol. The runtimesystemmaintains(meta)statefor eachregionon all nodesthat cachethe region. The statemachineis implementedasa setof handlerroutinesthat are invoked whena specificevent occurs. An event iseitherthe invocationof a CRL library routinethatbracketsa region access(e.g.,rgn start write()) or thearrival of a protocolmessage.A detaileddescriptionofthestatemachineis givenin Johnson’sPhDthesis[73].

When a process creates a region on a processor P, P becomes the region's home node. The home node stores the region's directory information; it maintains a list of all processors that are caching the region. It is the location where nodes that are not caching the region can obtain a valid copy of the region. It is also a central point at which conflicting region accesses are serialized so that consistency is maintained.
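A home node's bookkeeping can be pictured roughly as the structure below. The field names and the fixed-size directory are illustrative assumptions, not CRL's actual declarations; the real state machine is described in Johnson's PhD thesis [73].

#define MAX_NODES 64              /* assumed cluster size, for illustration */

/* Illustrative per-region state kept by the home node. */
typedef struct {
    unsigned size;                /* user-defined region size in bytes  */
    void    *data;                /* the region's data at the home node */
    int      state;               /* protocol state of the home copy    */
    int      num_cachers;         /* number of nodes caching the region */
    int      cachers[MAX_NODES];  /* directory: which nodes hold a copy */
} region_home_info;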

CRL Transactions

Communication in CRL is caused mainly by read and write misses. Figure 7.17 shows the communication patterns that result from different types of misses. A miss is a clean miss if the home node H has an up-to-date copy of the region that process P is trying to access, otherwise the miss is a dirty miss. In the case of a clean miss, the data travels across the network only once (from H to P). For a clean write miss, the home node sends invalidations to all nodes R that recently had read access to the region. The region is not shipped to P until all readers R have acknowledged these invalidations. In the case of a dirty miss the data is transferred twice, once from the last writer W to home node H and once from H to the requesting process P. The CRL version we used does not implement the three-party optimization that avoids this double data transfer. (Later versions did implement this optimization).

Fig. 7.17. CRL read and write miss transactions. [Figure: message exchanges between requesting process P, home node H, readers R, and last writer W for (a) a clean read miss, (b) a dirty read miss, (c) a clean write miss, and (d) a dirty write miss; requests, invalidations, acknowledgements, and data transfers are shown as arrows.]

Polling and Interrupts

CRL uses both polling and interrupts to deliver messages. The implementation is single-threaded. All messages, whether they are received through polling or interrupts, are handled in the context of the application thread.

The runtime system normally enables network interrupts so that nodes can be interrupted to process incoming requests (misses, invalidations, and region-map requests) even if they happen to be executing application code. For some applications, interrupts are necessary to guarantee forward progress. For other applications, they improve performance by reducing response time.


In many cases, though, interrupts can be avoided. All transactions in Figure 7.17 start with a request from node P to home node H and end with a reply from H to P. While waiting for the reply from H, P is idle. Similarly, if H sends out invalidations then H will be idle until the first acknowledgement arrives and in between subsequent acknowledgements. In these idle periods, CRL disables network interrupts and polls. This way the expected reply message is received through polling rather than interrupts, which reduces the receive overhead.

This behavior is a good match to LFC's polling watchdog, which works well for clients that mix polling and interrupts. Interrupts are used to handle request messages unless the destination node happens to be waiting (and polling) for a reply to one of its own requests. By delaying interrupts, LFC increases the probability that this happens.

Message Handling

CRL sends two types of messages. Data messages are used to transfer regions containing user data (e.g., when a region is replicated). These messages consist of a small header followed by the region data. Control messages are small messages used to request copies of regions, to invalidate regions, to acknowledge requests, etc. Both data and control messages are point-to-point messages; CRL does not make significant use of LFC's multicast primitive.

Each CRL message carries a small header. Each header contains a demultiplexing key that distinguishes between control messages, data messages, and other (less frequently used) message types. For control messages, the header contains the address of a handler function and up to four integer-sized arguments for the handler. A data message consists of a region pointer, a region offset, and a few more fields. The region pointer identifies a mapped region in the requesting process's address space. Since regions can have any size, they may well be larger than an LFC packet. The offset is used to reassemble large data messages; it indicates where in the region to store the data that follows the header.
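The two header layouts described above might look roughly as follows; the field names and types are assumptions for illustration, not CRL's actual declarations.

/* Illustrative control-message header: a demultiplexing key, the handler
 * to invoke, and up to four integer-sized arguments. */
typedef struct {
    int  demux_key;
    void (*handler)(int a0, int a1, int a2, int a3);
    int  args[4];
} crl_ctrl_hdr;

/* Illustrative data-message header: identifies the mapped region and the
 * offset at which the data following the header must be stored, so that
 * regions larger than one LFC packet can be reassembled. */
typedef struct {
    int       demux_key;
    void     *rgn_ptr;      /* mapped region in the requester's address space */
    unsigned  offset;       /* where in the region this packet's data goes    */
    /* ... a few more fields ... */
} crl_data_hdr;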

Incoming packets are always processed immediately. The data contained in data packets is always copied to its destination without delay. Protocol-message handlers, in contrast, cannot always be invoked immediately; in that case the protocol message is copied and queued. (This copy is unnecessary, but protocol messages consist of only a few words.) Since no process can initiate more than one coherence transaction, the amount of buffering required to store blocked protocol handlers is bounded.


Fig. 7.18. Write miss latency. [Figure: miss latency (microseconds) versus region size (0-1000 bytes); CRL with 8, 4, 1, and 0 readers, and LFC.]

7.3.3 Performance

CRL programs send many control messages and are therefore sensitive to send and receive overhead. Moreover, several CRL applications use small regions, which yields small data messages. Since many CRL actions use a request-reply style of communication, communication latencies also have an impact on performance.

Figures 7.18 and 7.19 show the performance of clean write misses for various numbers of readers. The home node H sends an invalidation to each reader and awaits all readers' acknowledgements before sending the data to requesting process P (see Figure 7.17(c)). The LFC curve shows the performance of a raw LFC test program that sends request and reply messages that have the same size as the messages sent by CRL in the case of zero readers (i.e., when no invalidations are needed).

For small regions, CRL adds approximately 5 µs (22%) to LFC's roundtrip latency (see Figure 7.18). CRL's throughput curves are close to the throughput curve of the LFC benchmark.

7.4 MPI — The Message Passing Interface

The Message Passing Interface (MPI) is a standard library interface that defines a large variety of message-passing primitives [53]. MPI is likely the most frequently used implementation of the message-passing programming model.


Fig. 7.19. Write miss throughput. [Figure: miss throughput (Mbyte/s) versus region size (64 bytes to 1 Mbyte); LFC and CRL with 0, 1, 4, and 8 readers.]

Several MPI implementations exist. Major vendors of parallel computers (e.g., IBM and Silicon Graphics) have built implementations that are optimized for their hardware and software. A popular alternative to these vendor-specific implementations is MPI Chameleon (MPICH) [61]. MPICH is free and widely used in workstation environments. We implemented MPI on LFC by porting MPICH to Panda. The following subsections describe this MPI/Panda/LFC implementation and its performance.

7.4.1 Implementation

MPICH consists of three layers. At the bottom, platform-specific device channels implement reliable, low-level send and receive functions. The middle layer, the application device interface (ADI), implements various send and receive flavors (e.g., rendezvous) using downcalls to the current device channel. The top layer implements the MPI interface. The ADI and the device channels have fixed interfaces. We ported MPICH version 1.1 and ADI version 2.0 to Panda by implementing a new device channel. This Panda device channel implements the basic send and receive functions using Panda's message-passing module.

Since MPICH is not a multithreaded system, the Panda device channel uses a stripped-down version of Panda that does not support multithreading and that receives all messages through polling. We refer to this simpler Panda version as Panda-ST (Panda-SingleThreaded). The multithreaded version of Panda described in Chapter 6 is referred to as Panda-MT (Panda-MultiThreaded).

By assuming that the Panda client is single-threaded, Panda-ST eliminates all locking inside Panda. Panda-ST also disables LFC's network interrupts. MPICH does not need interrupts, because all communication is directly tied to send and receive actions in user processes. It suffices to poll when a receive statement is executed (either by the user program or as part of the implementation of a complex function). Message handlers do not execute, not even logically, in the context of a dedicated upcall thread. Instead, they execute in the context of the application's main thread, both physically (on the same stack) and logically.

MPICH provides its own spanning-tree broadcast implementation. Unlike LFC's multicast implementation, however, MPICH forwards messages rather than packets. This way, the multicast primitive can be implemented easily in terms of MPI's unicast primitives, which also operate on messages. An obvious disadvantage of this implementation choice is the loss of pipelining at processors that are internal nodes of a multicast tree and that have to forward messages. Given a large multicast message, such a processor will not start forwarding before it has received the entire message. If done on a per-packet basis, as in LFC, forwarding can begin as soon as the first packet has arrived at the forwarding processor's NI. We experimented both with MPICH's default broadcast implementation and with an implementation that uses Panda's unordered broadcast on top of LFC's broadcast. The performance of these two broadcast implementations, MPI-default and MPI-LFC, is discussed in the next section.

7.4.2 Performance

Below, we discuss the unicast and broadcast performance of MPICH on Panda and LFC.

Unicast Performance

Figure 7.20 and Figure 7.21 show MPICH's one-way latency and throughput, respectively. Both tests were performed using MPI's MPI_Send() and MPI_Recv() functions. For comparison, the figures also show the results of the corresponding tests at the Panda and LFC level.

Figure7.20shows MPI’s unicastlatency and,for comparison,theunicastla-tenciesof PandaandLFC. MPICH is a heavyweightsystemandadds6 µs (42%)

Page 213: Communication Architectures for Parallel-ProgrammingSystems

7.4MPI — TheMessagePassingInterface 191

0Ô 200Ô 400Ô 600Ô 800Ô 1000Message size (bytes)Õ

0

10

20

30

40

50

Late

ncy

(mic

rose

cond

s)

Ö MPIPanda (no threads)LFC

Fig. 7.20.MPI unicastlatency.

to Panda’s0-byteone-way latency.Exceptfor smallmessages,MPICH attainsthesamethroughputasPandaand

LFC. Thereasonis thatMPICH makesno extrapassesover thedatathatis trans-ferred.MPICH simplyconstructsanI/O vectorandhandsit to Panda,which thenfragmentsthemessage.As a result,theMPICH layerdoesnot incur any per-bytecostsandthroughputis good.

Broadcast Performance

Figure 7.22 shows multicast latency on 64 processors for LFC, Panda, and three MPI implementations. Both MPI-default results were obtained using MPICH's default broadcast implementation, which forwards broadcast messages as a whole rather than packet by packet. MPI-default normally uses binomial broadcast trees. Since LFC uses binary trees, we added an option to MPICH that allows MPI-default to use both binomial and binary trees. The MPI-LFC results were obtained using LFC's broadcast primitive (i.e., with binary trees).

As in the unicast case, we find that MPICH adds considerable overhead to Panda. Binary and binomial trees yield no large differences in MPI-default's broadcast performance, because 64-node binary and binomial trees have the same depth. There is a large difference between the MPI-default versions and MPI-LFC. With binary trees, MPI-default adds 67 µs (77%) to MPI-LFC's latency for a 64-byte message; with binomial trees, the overhead is 59 µs (68%). In this latency test it is not possible to pipeline multipacket messages. The cause of the performance difference is that MPI-default adds at least two extra packet copies to the critical path at each internal node of the multicast tree: one copy from the network interface to the host and one copy per child from the host to the network interface.

Figure 7.23 compares the throughput of the same configurations. The difference between the MPI-default curves is caused by the difference in tree shape: binomial trees have a higher fanout than binary trees, which reduces throughput. Both MPI-default versions perform worse than MPI-LFC, not so much due to extra data copying, but because MPI-default does not forward multipacket multicast messages in a pipelined manner. In addition, host-level, message-based forwarding has a larger memory footprint than packet-based forwarding. When the message no longer fits into the L2 cache, MPI-default's throughput drops noticeably (from 20 to 9 Mbyte/s for binary trees and from 10 to 5 Mbyte/s for binomial trees). The problem is that the use of MPI-default results in one copying pass over the entire message for each child that a processor has to forward the message to. If the message does not fit in the cache, then these copying passes trash the L2 cache.


Fig. 7.22. MPI broadcast latency. [Figure: latency (microseconds) versus message size (0-1000 bytes) on 64 processors; MPI-default (binary), MPI-default (binomial), MPI-LFC, Panda (unordered, no threads), and LFC.]

Fig. 7.23. MPI broadcast throughput. [Figure: throughput (Mbyte/s) versus message size (64 bytes to 1 Mbyte) on 64 processors; LFC, Panda (unordered, no threads), MPI-LFC, MPI-default (binary), and MPI-default (binomial).]


7.5 Related Work

We consider related work in two areas: operation transfer and operation execution.

7.5.1 Operation Transfer

Orca’s operation-specificmarshalingroutineseliminateinterpretationoverheadby exploiting compile-timeknowledge. In addition, they do not copy to inter-mediateRTS buffers,but generateanI/O vectorwhich is passedto Panda’s sendroutines.Thesameapproachis takenin Manta[99] (seeSection7.2),whichgen-eratesspecializedmarshalingcodefor Javamethods.

In their implementation of Optimistic Orca [66] by means of Optimistic Active Messages (OAMs), Hsieh et al. use compiler support to optimize the transfer of simple operations of simple object types (see also Section 6.6.3). For such operations, Optimistic Orca avoids the generic marshaling path and copies arguments directly into an (optimistic) active message. At the receiving side, these optimistic active messages are dispatched in a special way (see Section 7.5.2). We optimize marshaling for all operations, using a generic compilation strategy. To attain the performance of Optimistic Orca, however, further optimization would be necessary. For example, we would have to eliminate Panda's I/O vectors (which are expensive for simple operations).

Muller et al. optimized marshaling in Sun's RPC protocol by means of a program specialization tool, Tempo, for the C programming language [109]. Given the stubs generated by Sun's nonspecializing stub compiler, this tool automatically generates specialized marshaling code by removing many tests and indirections. We use the Orca compiler instead of a general-purpose specialization tool to generate operation-specific code.

As described in Section 7.1.3, there are redundant data transfer operations on Orca's operation transfer path. These transfers can be removed by using DMA instead of PIO at the sending side, and by removing the receiver-side copy (from LFC packets to Orca data structures). Chang and von Eicken describe a zero-copy RPC architecture for Java, J-RPC, that removes these two data transfers [35]. J-RPC is a connection-oriented system in which receivers associate pinned pages with individual (per-sender) endpoints. Once a sender has filled these pages, the receiver unpins them and allocates new pages. The address of the new receive area is piggybacked on RPC replies.

J-RPC suffers from two problems. First, as noted by its designers, allocating pinned memory for each endpoint does not scale well to large numbers of processors. Second, J-RPC relies on piggybacking to return information about receive areas to a sender. This works well for RPC, but not for one-way multicasting.

7.5.2 Operation Execution

Whereas our approach has been to avoid expensive thread switches by hand, OAMs [66, 150], Lazy Task Creation [105], and Lazy Threads [58] all rely on compiler support. OAMs transform an AM handler that runs into a locked mutex into a true thread. The overhead of creating a thread is paid only when the lock operation fails. Hsieh et al. used OAMs to improve the performance of one of the first Panda-based Orca implementations. Like RTS-threads, this implementation used popup threads. On the CM-5, the implementation based on OAMs reduced the latency of Orca object invocation by an order of magnitude by avoiding thread switching and by avoiding the CM-5's bulk transfer mechanism for small messages. In contrast with our approach, OAMs use compiler support and require that all locks be known in advance.

Draves and Bershad used continuations inside an operating system kernel to speed up thread switching [46]. Instead of blocking a kernel thread, a continuation is created and the same stack is used to run the next thread. We use the same technique in user space for the same reasons: to reduce thread switching overhead and memory consumption. Orca's continuation-based RTS uses continuations in the context of upcalls and accesses them through a condition-variable-like interface.

Ruhl and Bal use continuations in a collective-communication module implemented on top of Panda [123]. Several collective-communication operations combine results from different processors using a reverse spanning tree. Rather than assigning to one particular thread (e.g., the local computation thread) the task of waiting for, combining, and propagating data, the implementation stores intermediate results in continuations. Subresults are created either by the local computation thread or by an upcall thread. The first thread that creates a subresult creates a continuation and stores its subresult in this continuation. When another thread later delivers a new subresult, it merges its subresult with the existing subresult by invoking the continuation's resume function. This avoids thread switches to a dedicated thread.


7.6 Summary

This chapter has shown that the communication mechanisms developed in the previous chapters can be applied effectively to a variety of PPSs.

We described portable techniques to transfer and execute Orca operations efficiently. Operation transfer is optimized by letting the compiler generate operation-specific marshaling code. Also, the Orca RTS does not copy data to intermediate buffers: data is copied directly between LFC packets and Orca data structures. Operation execution is optimized by representing blocked invocations compactly by continuations instead of blocked threads. The use of continuations allows operations to be executed directly by Panda upcalls, saves memory, and avoids thread switching during the re-evaluation of the guards of blocked operations.

The Orca RTS exploits all communication mechanisms of Panda and LFC. Panda's transparent switching between polling and interrupts and LFC's polling watchdog work well for messages that carry operation requests or replies. An interrupt is generated when the receiver of a request is engaged in a long-running computation and does not respond to the request. RPC replies are usually received through polling.

Panda’s totally-orderedbroadcastis usedto updatereplicatedobjects. Thisbroadcastis implementedefficiently usingLFC’sNI-supportedfetch-and-addandmulticastprimitives. LFC’s NI-level fetch-and-adddoesnot generateinterruptson thesequencernode.LFC’s multicastmaygenerateinterruptson thereceivingnodes,but theseinterruptsarenoton themulticast-packet forwardingpath.

Panda’s streammessagesallow theRTS to marshalandunmarshaldatawith-out making unnecessarycopies. Finally, Panda’s upcall model works well forOrca,eventhoughblockingupcallsoccurfrequentlyin Orcaprograms.

Manta has a programming model similar to Orca's (shared objects) and also uses some of Orca's implementation techniques, in particular function shipping and compiler-generated, method-specific marshaling routines. The communication requirements of Manta and Orca are different, though. First, Manta does not replicate objects and therefore does not need multicast support. Second, Manta uses popup threads to process incoming operation requests, because Java methods can create new state before they block, which makes it difficult to represent a blocked invocation by a continuation (as in Orca). In the current implementation, the use of popup threads adds a thread switch to the execution path of all operations on remote objects.

In terms of its communication requirements, CRL is the simplest of the PPSs discussed in this chapter. CRL runs directly on top of LFC and benefits from LFC's efficient packet interface and its polling watchdog.

The implementation of MPI uses Panda's stream messages and Panda's broadcast. As in Orca and Manta, Panda's stream messages allow data to be copied directly between LFC packets and application data structures. Panda's broadcast is much more efficient than MPICH's default broadcast. MPICH performs all forwarding on the host, rather than on the NI. What is worse, however, is that MPICH forwards message-by-message rather than packet-by-packet, and therefore incurs the full message latency at each hop in its multicast tree.

PPS                     Latency (µs)            Throughput (Mbyte/s)
                        Unicast     Broadcast   Unicast     Broadcast
LFC/CRL                 23/28       —/—         60/56       —/—
LFC/Panda-ST/MPI        12/15/21    72/78/87    63/61/60    28/27/27
LFC/Panda-MT/Orca       24/31/46    72/103/121  58/56/52    28/28/27
LFC/Panda-MT/Manta      24/31/76    —/—/—       58/56/35    —/—/—

Table 7.2. PPS performance summary. All broadcast measurements were taken on 64 processors.

All PPSs have the same data transfer behavior. At the sending side, each client copies data from client-level objects into LFC send packets in NI memory. At the receiving side, each client copies data from LFC receive packets in host memory to client-level objects.

Table 7.2 summarizes the minimum latency and the maximum throughput of characteristic PPS-level operations on PPS-level data. The table also shows the cost of the communication pattern induced by these operations, at all software levels below the PPS level. The performance results for different levels are separated by slashes; differences in these results are caused by layer-specific overheads. Recall that Panda can be configured with (Panda-MT) or without (Panda-ST) threads.

For CRL, Table 7.2 shows the cost of a clean write miss. The communication pattern consists of a small, fixed-size control message to the home node, which replies with a data message containing a region. For MPI, the table shows the cost of a one-way message and the cost of a broadcast. For Orca, it shows the cost of an operation on a remote, nonreplicated object (which results in an RPC) and the cost of an operation on a replicated object (which results in an F&A operation followed by a broadcast). Finally, for Manta, the table shows the cost of a remote method invocation on a remote object.

The results in Table 7.2 show that our PPSs add significant overhead to an LFC-level implementation of the same communication pattern. These overheads stem from several sources.

1. Demultiplexing. All clients perform one or more demultiplexing operations. In Orca, for example, an operation request carries an object identifier and an operation identifier. In CRL, each message carries the address of a handler function and the name of a shared region. In MPI, each message carries a user-specified tag. MPI and Orca are both layered on top of Panda, which performs additional demultiplexing.

2. Fragmentation and reassembly. Both Panda and CRL implement fragmentation and reassembly. This requires extra header fields and extra protocol state.

3. Locking. Orca has a multithreaded runtime system that uses a single lock to achieve mutual exclusion. To avoid recursion deadlocks, this lock is released when the Orca runtime system invokes another software module (e.g., Panda) and acquired again when the invocation returns. Panda-MT also uses locks to protect global data structures.

4. Procedure calls. Although all PPSs and Panda use inlining inside software modules and export inlined versions of simple functions, many procedure calls remain. This is due to the layered structure of the PPSs. Layering is used to manage the complexity and to enhance the portability of these systems: most systems are fairly large and have been ported to multiple platforms.

The overheads added by the various software layers running on top of LFC are large compared to, for example, the roundtrip latency at the LFC level (e.g., an empty-array roundtrip for Manta adds 217% to an LFC roundtrip). Consequently, we expect that these overheads will dampen the application-level performance differences between different LFC implementations. Application performance is studied in the next chapter.


Chapter 8

Multilevel Performance Evaluation

The preceding chapters have shown that the LFC implementation described in Chapter 3 and Chapter 4 provides effective and efficient support for several PPSs. This implementation, however, is but one point in the design space outlined in Chapter 2. This chapter discusses alternative designs of two key components of LFC: reliable point-to-point communication by means of NI-level flow control and reliable multicast communication by means of NI-level forwarding. Both services are critically important for PPSs and their applications. Existing and proposed communication architectures take widely different approaches to these two issues (see Chapter 2), yet there have been few attempts to compare these architectures in a systematic way.

A key idea in this chapter is to use multiple implementations of LFC's point-to-point, multicast, and broadcast primitives. This allows us to evaluate the decisions made in the original design and implementation. We focus on assumptions 1 and 3 stated in Chapter 4: reliable network hardware and the presence of an intelligent NI. The alternative implementations all relax one or both of these assumptions. The use of a single programming interface is crucial; existing studies compare communication architectures with different functionality and different programming interfaces [5], which makes it difficult to isolate the effects of particular design decisions.

We compare the performance of five implementations at multiple levels. First, we perform direct comparisons between the systems by means of microbenchmarks that run directly on top of LFC's programming interface. Second, we compare the performance of parallel applications by running them on all five LFC implementations. Each of these applications uses one of four different PPSs: Orca, CRL, MPI, and Multigame. Orca, CRL, and MPI were described in the previous chapter; Multigame will be introduced later in this chapter. Our performance comparison thus involves three levels of software: the different LFC implementations, the PPSs, and the applications. A key contribution of this chapter is that it ties together low-level design issues, the communication style induced by particular PPSs, and characteristics of parallel applications.

This chapter is organized as follows. Section 8.1 gives an overview of all LFC implementations. Section 8.2 and Section 8.3 present, respectively, the point-to-point and multicast parts of the implementations and compare the different implementations using microbenchmarks. Section 8.4 describes the communication style and performance of the PPSs used by our application suite. Section 8.5 discusses application performance. Section 8.6 classifies applications according to their performance on different LFC implementations. Section 8.7 studies related work.

8.1 Implementation Overview

LFC’s programminginterfaceandoneimplementationweredescribedin Chap-ter 3 andChapter4. This sectiongivesan overview of all implementationsanddescribesthoseimplementationcomponentsthat aresharedby all implementa-tions.

8.1.1 The Implementations

We developed three point-to-point reliability schemes and two multicast schemes. Each LFC implementation consists of a combination of one reliability scheme and one multicast scheme. One of these LFC implementations corresponds to the system described in Chapters 3 and 4; we refer to this system as the default LFC implementation.

The point-to-point schemes are no retransmission (Nrx), host-level retransmission (Hrx), and NI-level retransmission (Irx). Some systems, including the NI-level protocols described in Chapter 4, assume that the network hardware and software behave and are controlled as in an MPP environment. These systems exploit the high reliability of many modern networks and use nonretransmitting communication protocols [26, 32, 106, 113, 141, 154]. Such protocols are called careful protocols [106], because they may never drop packets, which requires careful resource management. The no-retransmission (Nrx) scheme represents these protocols; it assumes that the network hardware never drops or corrupts network packets and will fail if this assumption is violated.

                               Forwarding
Retransmission                 Host forwards (Hmc)    Interface forwards (Imc)
No retransmission (Nrx)        NrxHmc                 NrxImc (default)
Interface retransmits (Irx)    IrxHmc                 IrxImc
Host retransmits (Hrx)         HrxHmc                 Not implemented

Table 8.1. Five versions of LFC's reliability and multicast protocols.

Systems intended to operate outside an MPP environment usually implement retransmission. Retransmission used to be implemented on the host, but several research systems (e.g., VMMC-2 [49]) implement retransmission on a programmable NI. Our retransmitting schemes, Hrx and Irx, represent these two types of systems.

The multicast schemes are host-level multicast (Hmc) and NI-level multicast (Imc). On scalable, switched networks, multicasting is usually implemented by means of spanning-tree protocols that forward multicast data. Most communication systems implement these forwarding protocols on top of point-to-point primitives. Other systems use the NI to forward multicast traffic, which results in fewer data transfers and interrupts on the critical path from sender to receivers [16, 56, 68, 146]. Hmc and Imc represent both types of systems.

The point-to-pointreliability andthe multicastschemescanbe combinedinsix differentways(seeTable8.1).We implementedfiveoutof thesesix combina-tions. Thecombinationof host-level retransmissionandinterface-level multicastforwardinghasnotbeenimplementedfor reasonsdescribedin Section8.3.1.Theimplementationdescribedin Chapter3 andChapter4 is essentiallyNrxImc. Thereare,however, somesmalldifferencesbetweenthe implementationdescribedear-lier and the onedescribedhere. (NrxImc recycles senddescriptorsin a slightlydifferentway thanthedefault LFC implementation.)

Table 8.2 summarizes the high-level differences between the five LFC implementations. We have already discussed reliability and multicast forwarding. All implementations use the same polling-watchdog implementation. The retransmitting implementations use retransmission and acknowledgement timers. IrxImc and IrxHmc use Myrinet's on-board timer to implement these timers. HrxHmc uses both Myrinet's on-board timer and the Pentium Pro's timestamp counter. All protocols except HrxHmc implement LFC's fetch-and-add operation on the NI. Since HrxHmc represents conservative protocols that make little use of the programmable NI, it stores fetch-and-add variables in host memory and handles fetch-and-add messages on the host.

Function            NrxImc     NrxHmc     IrxImc     IrxHmc     HrxHmc
Reliability         NI         NI         NI         NI         Host
Mcast forwarding    NI         Host       NI         Host       Host
Polling watchdog    NI + host  NI + host  NI + host  NI + host  NI + host
Fine-grain timer    —          —          NI         NI         NI + host
Fetch-and-add       NI         NI         NI         NI         Host

Table 8.2. Division of work in the LFC implementations.

8.1.2 Commonalities

The LFC implementations share much of their code. All implementations transmit data in variable-length packets. Hosts and NIs store packets in packet buffers which all have the same maximum packet size. Each NI has a send buffer pool and a receive buffer pool; hosts only have a receive buffer pool.

We distinguish four packet types: unicast, multicast, acknowledgement, and synchronization. Unicast and multicast packets contain client data. Acknowledgements are used to implement reliability. Synchronization packets carry fetch-and-add (F&A) requests and replies. An F&A operation is implemented as an RPC to the node holding the F&A variable. Depending on the implementation, this variable is stored either in host or NI memory.
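To make the F&A-as-RPC idea concrete, the sketch below shows how a synchronization request might be handled on the node that owns the variable. The payload layout, field names, and table size are assumptions for illustration; they do not reproduce LFC's actual data structures.

```c
#include <stdint.h>

/* Hypothetical packet types and PKT_SYNC payloads; LFC's real layout differs. */
enum pkt_type { PKT_UNICAST, PKT_MCAST, PKT_ACK, PKT_SYNC };

struct faa_request { uint32_t var; int32_t add; };   /* carried in the request  */
struct faa_reply   { int32_t old_value; };           /* carried in the reply    */

static int32_t faa_table[64];                        /* variables owned locally */

/* Executed on the node (host or NI, depending on the implementation)
 * that stores the requested F&A variable.                                */
static struct faa_reply handle_faa(const struct faa_request *req)
{
    struct faa_reply rep;
    rep.old_value = faa_table[req->var];   /* return the old value...     */
    faa_table[req->var] += req->add;       /* ...and apply the increment  */
    return rep;                            /* sent back to the requester  */
}
```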

All implementations organize multicast destinations in a binary tree with the sender as its root and forward multicast packets along this tree, in parallel. One of the problems in implementing a multicast forwarding scheme is the potential for buffer deadlock. Several strategies can be used to deal with this problem: reservation in advance [56, 146], deadlock-free routing, and deadlock recovery [16].

To send a packet, LFC's send routines store one or more transmission requests in send descriptors in NI memory (using PIO). Each descriptor identifies a send packet, the packet's destination, the packet's size, and protocol-specific information such as a sequence number. Multiple descriptors can refer to the same packet, so that the same data can be transmitted to multiple destinations.
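The following sketch illustrates the descriptor mechanism. The field names and the idea of a validity flag written last are assumptions; LFC's real descriptor layout and hand-off protocol differ in detail.

```c
#include <stdint.h>

/* Hypothetical send descriptor as it might sit in NI memory. */
struct send_desc {
    uint16_t pkt_index;     /* which NI send packet holds the data          */
    uint16_t dest;          /* destination node                             */
    uint16_t size;          /* payload size in bytes                        */
    uint16_t seqno;         /* protocol-specific (retransmitting protocols) */
    volatile uint8_t valid; /* written last, so the NI sees a complete descriptor */
};

/* Post one transmission request with PIO writes into NI memory.  Several
 * descriptors may name the same pkt_index, so one NI packet can be sent
 * to several destinations without copying the data again.                */
static void post_send(volatile struct send_desc *d, uint16_t pkt_index,
                      uint16_t dest, uint16_t size, uint16_t seqno)
{
    d->pkt_index = pkt_index;
    d->dest      = dest;
    d->size      = size;
    d->seqno     = seqno;
    d->valid     = 1;       /* the NI control program polls this flag       */
}
```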

LFC clients use PIO to copy data that is to be transmitted into LFC send packets (which are stored in NI memory). To speed up these data copies, all implementations mark NI memory as a write-combined memory region (see Section 3.3).

Each host maintains a pool of free receive buffers. The addresses of free host buffers are passed to the NI control program through a shared queue in NI memory. When the NI has received a packet in one of its receive buffers, it copies the packet to a free host buffer (using DMA). The host can poll a flag in this buffer to test whether the packet has been filled.
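A minimal sketch of this host-side receive path, assuming a circular arrangement of host buffers and a "filled" flag written last by the NI's DMA; the actual LFC buffer layout differs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical host receive buffer. */
struct host_rx_buf {
    volatile uint32_t filled;    /* written last when the NI's DMA completes */
    uint32_t          size;      /* payload size in bytes                    */
    uint8_t           data[1024];
};

/* Poll the next expected host buffer; returns the buffer when the NI has
 * DMAed a packet into it, or NULL if nothing has arrived yet.             */
static struct host_rx_buf *poll_next(struct host_rx_buf *ring,
                                     unsigned *next, unsigned ring_size)
{
    struct host_rx_buf *b = &ring[*next];
    if (!b->filled)
        return NULL;             /* failed poll: no packet yet              */
    *next = (*next + 1) % ring_size;
    return b;                    /* caller processes and later recycles it  */
}
```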

The NI control program's event loop processes the following events (a sketch of the loop is given after the list):

1. Host transmission request. The host enqueued a packet for transmission. The NI tries to move the packet to its packet transmission queue. In some protocols, however, the packet may have to wait until an NI-level send window opens up. If the window is closed, the packet is stored on a per-destination blocked-sends queue. Incoming acknowledgements will later open the window and move the packet from this queue to the packet transmission queue.

2. Packet transmitted. The NI hardware completed the transmission of a packet. If the packet transmission queue is not empty, the NI starts a network DMA to transmit the next packet. For each outgoing packet, Myrinet computes (in hardware) a CRC checksum and appends it to the packet. The receive hardware recomputes and verifies the checksum and appends the result (checksum verification succeeded or failed) to the packet. This result is checked in software.

3. Packet received. The NI hardware received a packet from the network. The NI checks the checksum of the packet just received. If the checksum fails, the NI drops the packet or signals an error to the host. (The retransmitting protocols drop the packet, forcing the sender to retransmit. The nonretransmitting protocols signal an error.) Otherwise, if the packet is a unicast or multicast packet, then the NI enqueues it on its NI-to-host DMA queue. Some protocols also enqueue multicast packets for forwarding on the packet transmission queue or, if necessary, on a blocked-sends queue. Acknowledgement and synchronization packets are handled in a protocol-specific way.

4. NI-to-host DMA completion. If the NI-to-host DMA queue is not empty, then a DMA is started for the next packet.

5. Timeout. All implementations use timeouts to implement the NI-level polling watchdog described in Section 3.7. Some implementations also use timeouts to trigger retransmissions or acknowledgements.
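A minimal sketch of such an event loop is shown below; the event test and the handler bodies are placeholders that only indicate which of the five events they correspond to, not LFC's actual control-program code.

```c
/* Sketch of the NI control program's dispatch loop (placeholders only). */
static int next_event(void) { return 0; /* would poll hardware and queue status */ }

static void handle_host_send(void)    { /* event 1: move packet to tx queue      */ }
static void handle_tx_done(void)      { /* event 2: start next network DMA       */ }
static void handle_pkt_received(void) { /* event 3: check CRC, queue host DMA    */ }
static void handle_dma_done(void)     { /* event 4: start next NI-to-host DMA    */ }
static void handle_timeout(void)      { /* event 5: watchdog, retransmit, ack    */ }

void ni_event_loop(void)
{
    for (;;) {
        switch (next_event()) {
        case 1: handle_host_send();    break;
        case 2: handle_tx_done();      break;
        case 3: handle_pkt_received(); break;
        case 4: handle_dma_done();     break;
        case 5: handle_timeout();      break;
        default: break;                /* nothing pending */
        }
    }
}
```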

Each implementation is configured using the parameters shown in Table 8.3.

Parameter  Meaning                                                  Unit
P          Number of processors that participate in the application  Processors
PKTSZ      Maximum payload of a packet                              Bytes
ISB        NI send pool size                                        Packet buffers
IRB        NI receive pool size                                     Packet buffers
HRB        Host receive pool size                                   Packet buffers
W          Maximum send window size                                 Packets
INTD       Polling watchdog's network interrupt delay               µs
TGRAN      Timer granularity                                        µs

Table 8.3. Parameters used by the protocols.

8.2 Reliable Point-to-Point Communication

To compare different point-to-point protocols, we have implemented LFC's reliable point-to-point primitive in three different ways. The retransmitting protocols (Hrx and Irx) assume unreliable network hardware and can recover from lost, corrupted, and dropped packets by means of timeouts, retransmissions, and CRC checks. In Hrx, reliability is implemented by the host processor; in Irx, it is implemented by the NI. As explained above, the three implementations have much in common; below, we discuss how the three implementations differ from one another.

8.2.1 The No-Retransmission Protocol (Nrx)

Nrx assumes reliable network hardware and correct operation of all connected NIs. Nrx does not detect lost packets and treats corrupted packets as a fatal error. Nrx never retransmits any packet and can therefore avoid the overhead of buffering packets for retransmission and timer management.

To avoid buffer overflow, Nrx implements flow control between each pair of NIs (see the UCAST protocol described in Section 4.1). At initialization time, each NI reserves W = ⌊IRB/P⌋ receive buffers for each sender. The number of buffers (W) is the window size for the sliding window protocol that operates between sending and receiving NIs. Each NI transmits a packet only if it knows that the receiving NI has free buffers. If a packet cannot be transmitted due to a closed send window, then the NI queues the packet on a blocked-sends queue for that destination. Blocked packets are dequeued and transmitted during acknowledgement processing (see below).
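A sketch of this NI-level credit scheme, with assumed variable names; the real protocol also manages the blocked-sends queue and acknowledgement generation.

```c
/* Per-destination credit bookkeeping on the sending NI (a sketch). */
#define MAX_NODES 64

struct dest_state {
    unsigned credits;   /* free buffers left at the receiving NI (at most W) */
};

static struct dest_state dest[MAX_NODES];

/* Nonzero if a packet to 'd' may be transmitted now; otherwise the
 * packet would be placed on the blocked-sends queue for 'd'.        */
static int may_transmit(unsigned d)
{
    if (dest[d].credits == 0)
        return 0;           /* window closed: wait for an acknowledgement */
    dest[d].credits--;      /* consume one credit for this packet         */
    return 1;
}

/* Called when an acknowledgement arrives carrying the number of buffers
 * the receiver released since its previous acknowledgement.             */
static void ack_received(unsigned d, unsigned released)
{
    dest[d].credits += released;   /* reopen the window */
}
```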


Each NI receive buffer is released when the packet contained in it has been copied to host memory. The number of newly released buffers is piggybacked on each packet that flows back to the sender. If there is no return traffic, then the receiving NI sends an explicit half-window acknowledgement after releasing W/2 buffers. (That is, the credit refresh threshold discussed in Section 4.1 is set to W/2.)

8.2.2 The Retransmitting Protocols (Hrx and Irx)

The two retransmission schemes make different tradeoffs between send and receive overhead on the host processor and complexity of the NI control program. We first describe the protocols in general terms and then explain protocol-specific modifications. In the following, the terms sender and receiver refer to NIs in the case of Irx and to host processors in the case of Hrx.

Each sender maintains a send window to each receiver. A sender that has filled its send window to a particular destination is not allowed to send to that destination until some of the packets in the window have been acknowledged by the receiver. Each packet carries a sequence number which is used to detect lost, duplicated, and out-of-order packets. After transmitting a packet, the sender starts a retransmission timer for the packet. When this timer expires, all unacknowledged packets are retransmitted. This go-back-N protocol [115, 138] is efficient if packets are rarely dropped or corrupted (as is the case with Myrinet).
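The sketch below illustrates the go-back-N behavior, assuming a fixed-size per-destination window and cumulative acknowledgements; timer multiplexing and the actual transmit path are omitted.

```c
/* Go-back-N retransmission on timeout (a sketch, not LFC's code). */
#define WINDOW 8

struct tx_window {
    unsigned base;          /* oldest unacknowledged sequence number   */
    unsigned next;          /* next sequence number to be used         */
    void    *pkt[WINDOW];   /* NI packets, indexed by seqno % WINDOW   */
};

static void transmit(void *pkt) { (void)pkt; /* would start a network DMA */ }

/* When the retransmission timer for a destination expires, resend every
 * packet that has not been acknowledged yet, oldest first.              */
static void retransmit_all(struct tx_window *w)
{
    for (unsigned s = w->base; s != w->next; s++)
        transmit(w->pkt[s % WINDOW]);
}

/* A cumulative acknowledgement up to (and excluding) 'ackno' frees the
 * packets below it and slides the window forward.                       */
static void process_ack(struct tx_window *w, unsigned ackno)
{
    while (w->base != ackno) {
        /* packet w->pkt[w->base % WINDOW] may now be reused */
        w->base++;
    }
}
```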

Retransmission requires that packets be buffered until an acknowledgement is received. With asynchronous message passing, this normally requires a copy on the sending host. However, since the Myrinet hardware can transmit only packets that reside in NI memory, all protocols must copy data from host memory to NI memory. The retransmitting protocols use the packet in NI memory for retransmission and therefore do not make more memory copies than Nrx. This optimization is hardware-specific; it does not apply to NIs that use FIFOs into the network (e.g., several ATM interfaces).

Receivers accept a packet only if it carries the next-expected sequence number; otherwise they drop the packet. In addition, NIs drop incoming packets that have been corrupted (i.e., packets that yield a CRC error) or that cannot be stored due to a shortage of NI receive buffers.

Since send packets cannot be reused until they have been acknowledged and since NI memory is relatively small, senders can run out of send packets when acknowledgements do not arrive promptly. To prevent this, receivers acknowledge incoming data packets in three ways. As in Nrx, receivers use piggybacked and half-window acknowledgements. In addition, each receiver sets an acknowledgement timer whenever a packet arrives. When the timer expires and the receiver has not yet sent any acknowledgement, the receiver sends an explicit delayed acknowledgement.

[Figure: one-way latency in microseconds versus message size in bytes, for Hrx, Irx, and Nrx.]

Fig. 8.1. LFC unicast latency.

In Irx, receiving NIs keep track of the amount of buffer space available. When a packet must be dropped due to a buffer space shortage, the NI registers this. When more buffer space becomes available, the NI sends a NACK to the sender of the dropped packet. When a NACK is received, the sender retransmits all unacknowledged packets. Without the NACKs, packets would not be retransmitted until the sender's timer expired.

8.2.3 Latency Measurements

Figure 8.1 shows the one-way latency for Nrx, Hrx, and Irx. All messages fit in a single network packet and are received through polling. Nrx outperforms the retransmitting protocols, which must manage timers and which perform more work on outstanding send packets when an acknowledgement is received. All curves have identical slopes, indicating that the protocols have identical per-byte costs.

Table 8.4 shows the values of the LogP [40, 41] parameters and the end-to-end latency (os + or + L) of the three protocols. The LogP parameters were defined in Section 5.1.2. Nrx and Irx perform the same work on the host, so they have (almost) identical send and receive overheads (os and or, respectively). Irx, however, runs a retransmission protocol on the NI, which is reflected in the protocol's larger gap (g = 9.7 versus g = 6.7) and latency (L = 8.3 versus L = 7.2). Hrx runs a similar protocol on the host and therefore has larger send and receive overheads than Nrx and Irx.

Protocol  os   or   L    g    os + or + L
Nrx       1.5  2.3  7.2  6.7  11.0
Irx       1.4  2.3  8.3  9.7  12.0
Hrx       2.4  5.3  6.5  8.3  14.2

Table 8.4. LogP parameter values (in microseconds) for a 16-byte message.

Table 8.5 shows the F&A latencies for the different protocols. Again, Nrx is the fastest protocol (18 µs). Hrx is the slowest protocol (31 µs); it is the only protocol in which F&A operations are handled by the host processor of the node that stores the F&A variable. In an application, handling F&A requests on the host processor can result in interrupts. In this benchmark, however, Hrx receives all F&A requests through polling.

Protocol  Latency (µs)
Nrx       18
Irx       25
Hrx       31

Table 8.5. Fetch-and-add latencies.

8.2.4 Window Size and Receive Buffer Space

To obtain maximum throughput, acknowledgements must flow back to the sender before the send window closes. To avoid sender stalls, all protocols therefore need sufficiently large send windows and receive buffer pools. Figure 8.2 shows the measured peak throughput for various window sizes and for two packet sizes. To speed up sequence number computations for Irx, we use only power-of-two window sizes. In this benchmark, the receiving process copies data from incoming packets to a receive buffer. This is not strictly necessary, but all LFC clients perform such a copy while processing the data they receive. Throughput without this receiver-side copy is discussed later.

[Figure: peak throughput (Mbyte/s) versus send-window size (packets) for Nrx, Irx, and Hrx; one panel for 1 Kbyte packets and one for 2 Kbyte packets, both with a receiver copy.]

Fig. 8.2. Peak throughput for different window sizes.

For 1 Kbyte packets, Nrx and Irx need only a small window (W ≥ 4) to attain high throughput. Hrx's throughput still increases noticeably when we grow the window size from W = 4 (45 Mbyte/s) to W = 8 (56 Mbyte/s). A small window is important for Nrx, because each NI allocates W receive buffers per sender. In the retransmitting protocols (Irx and Hrx), NI receive buffers are shared between senders and we need only W receive buffers to allow any single sender to achieve maximum throughput.

Surprisingly, Irx's throughput decreases for W > 8. The reason is that NI-level acknowledgement processing in Irx involves canceling all outstanding retransmission timers. With a larger window, more (virtual) timers must be canceled. (Irx multiplexes multiple virtual timers on a single physical timer; canceling a virtual timer consists of a dequeue operation.) At some point, the slow NI can no longer hide this work behind the current send DMA transfer. The effect disappears when we switch to 2 Kbyte packets, that is, when the send DMA transfers take more time.

Figure 8.3 shows one-way throughput for 1-Kbyte and 2-Kbyte packets, with and without receiver-side copying. For all three protocols, the window size was set to W = 8.

[Figure: throughput (Mbyte/s) versus message size (64 bytes to 1 Mbyte) for Nrx, Hrx, and Irx; four panels: 1 Kbyte packets without and with a receiver copy, and 2 Kbyte packets without and with a receiver copy.]

Fig. 8.3. LFC unicast throughput.

With 1 Kbyte packets and without copying, we see clear differences in throughput between the three protocols. Performing retransmission administration on the host (Hrx) increases the per-packet costs and reduces throughput. When the retransmission work is moved from the host to the NI (Irx), throughput is reduced further. The NI performs essentially the same work as the host in Hrx, but does so more slowly. The LogP measurements in Table 8.4 confirm this: Irx has a larger gap (g) than the other protocols. Increasing the packet size to 2 Kbyte reduces the number of times per-packet costs are incurred and increases the throughput of all protocols, but most noticeably for Irx. For Hrx and Nrx, the improvement is smaller and comes mainly from faster copying at the sending side. Irx, however, is now able to hide more of its per-packet overhead in the latency of larger DMA transfers.

Throughput is reduced when the receiving process copies data from incoming packets to a destination buffer, especially when the message is large and does not fit into the L2 cache. This is important, because all LFC clients do this. Since the maximum memcpy() speed on the Pentium Pro (52 Mbyte/s) is less than the maximum throughput achieved by our protocols (without copying), the copying stage becomes a bottleneck. In addition, memory traffic due to copying interferes with the DMA transfers of incoming packets.

With copying, the throughput differences between the three protocols are fairly small. To a large extent, this is due to the use of the NI memory as 'retransmission memory,' which eliminates a copy at the sending side.

8.2.5 Send Buffer Space

While NI receive buffer space is an issue for Nrx, send buffer space is just as important for Irx and Hrx. First, however, we consider the send buffer space requirements for Nrx. Nrx can reuse send buffers as soon as their contents have been put on the wire, so only a few send buffers are needed to achieve maximum throughput between a single sender-receiver pair. With more receivers it is useful to have a larger send pool, in case some receiver does not consume incoming packets promptly. Transmissions to that receiver will be suspended and extra packets are then needed to allow transmissions to other receivers.

In the retransmitting protocols, send buffers cannot be reused until an acknowledgement confirms the receipt of their contents. (This is a consequence of using NI memory as retransmission memory.) Since we do not acknowledge each packet individually, a sender that communicates with many receivers may run out of send buffers. One of our applications, a Fast Fourier Transform (FFT) program, illustrates this problem. In FFT, each processor sends a unique message to every other processor. With 64 processors, these messages are of medium size (2 Kbyte) and fill less than half a window. If the number of send buffers is small, a sender may run out of send buffers before any receiver's half-window acknowledgement has been triggered.

If the number of send buffers is large relative to the number of receivers, the sender will trigger half-window acknowledgements before running out of send buffers. In general, we cannot send more than P × (W/2 − 1) packets without triggering at least one half-window acknowledgement, unless packets are lost.
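As an illustration of this bound, with the parameter settings used later in this chapter (P = 64, W = 8) it works out to 64 × (8/2 − 1) = 192 packets, so a send pool much smaller than that can be drained before any half-window acknowledgement arrives.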

To avoid unnecessary retransmissions when the number of send buffers is small, Irx and Hrx could acknowledge packets individually, but this will lead to increased network occupancy, NI occupancy, and (in the case of Hrx) host occupancy. Instead, Irx and Hrx start an acknowledgement timer for incoming packets. Each receiver maintains one timer per sender. The timer for sender S is started whenever a data packet from S arrives and the timer is not yet running. The timer is canceled when an acknowledgement (possibly piggybacked) travels back to S. If the timer expires, the receiver sends an explicit acknowledgement.

This scheme requires small timeout values for the acknowledgement timer, because a sender needs only a few hundred microseconds to allocate all send buffers. Using OS timer signals to implement the timer on Hrx does not work, because the granularity of OS timers is often too coarse (10 ms is a common value). The Myrinet NI, however, provides access to a clock with a granularity of 0.5 µs and is able to send signals to the user process. We therefore use the NI as an intelligent clock, as follows. We choose a clock period TGRAN and let the NI generate an interrupt every TGRAN µs.

To reduce the number of clock signals, we use two optimizations. First, before generating a clock interrupt, the NI reads a flag in host memory that indicates whether any timers are running. No interrupt is generated when no timers are running. Second, each time the host performs a timer operation it records the current time by reading the Pentium Pro's timestamp counter (see Section 1.6.2). This counter is incremented every clock cycle (5 ns). If necessary, the host invokes timeout handlers. Before generating a clock interrupt, the NI reads the time recorded by the host to decide whether the host recently read its clock. If so, no interrupt is generated.
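The NI-side decision might look like the sketch below. The structure layout, the shared-flag mechanism, and the assumption that both timestamps are expressed in the same time unit are illustrative only.

```c
#include <stdint.h>

/* Lives in host memory and is read by the NI before each clock tick. */
struct host_timer_state {
    volatile uint32_t timers_running;  /* nonzero if any host timer is armed   */
    volatile uint64_t last_host_read;  /* host timestamp at its last timer op  */
};

/* Decide, every TGRAN microseconds, whether to interrupt the host.
 * 'now' and 'last_host_read' are assumed to be in the same time unit.  */
static int should_interrupt(const struct host_timer_state *h,
                            uint64_t now, uint64_t recent_threshold)
{
    if (!h->timers_running)
        return 0;   /* no timers armed: suppress the interrupt             */
    if (now - h->last_host_read < recent_threshold)
        return 0;   /* the host looked at its clock recently and will
                       notice expired timers itself, so suppress again     */
    return 1;       /* raise an interrupt so the host runs its handlers    */
}
```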

8.2.6 Protocol Comparison

An important difference between the retransmitting (Irx and Hrx) and the nonretransmitting (Nrx) protocols is that the retransmitting protocols do not reserve buffer space for individual senders. In Irx and Hrx, NI receive buffers can be shared by all sending NIs, because an NI that runs out of receive buffers can drop incoming packets, knowing that the senders will eventually retransmit. In Irx and Hrx, a single sender can fill all of an NI's receive buffers. Nrx, in contrast, may never drop packets and therefore allocates a fixed number of NI receive buffers to each sender. In most cases this is wasteful, but since the small bandwidth-delay product of Nrx allows small window sizes, this strategy does not pose problems in medium-size clusters. Nevertheless, this static buffer allocation scheme prevents Nrx from scaling to very large clusters, unless NI memory scales as well.

Acknowledgements in Nrx have a different meaning than in Irx and Hrx. Nrx implements flow control (at the NI level), while Irx and Hrx implement only reliability. In Nrx, an acknowledgement consists of a count of buffers released since the last acknowledgement. In Irx and Hrx, an acknowledgement consists of a sequence number. (A count instead of a sequence number fails when an acknowledgement is lost.) The sequence number indicates which packets have arrived at the receiver (host or NI), but does not indicate which packets have been released.

An obvious disadvantage of Nrx is that it cannot recover from lost or corrupted packets. On Myrinet, we have observed both types of errors, caused by both hardware and software bugs. On the positive side, careful protocols are easier to implement than retransmitting protocols. There is no need to buffer packets for retransmission, to maintain sequence numbers, or to deal with timers. For the same reasons, the nonretransmitting LFC implementations achieve better latencies than the retransmitting implementations. With 1 Kbyte packets there is also a throughput advantage, but this is partly obscured by the memory-copy bottleneck.

In spite of these differences, the microbenchmarks indicate that all three protocols can be made to perform well, including protocols that perform most of their work on the NI. The keys to good performance are to use NI buffers for retransmission and to overlap DMA transfers with useful work.

8.3 Reliable Multicast

We consider two multicast schemes, which differ in where they implement the forwarding of multicast packets. In the Hmc protocols, multicast packets are forwarded by the host processor. In the Imc protocols, the NI forwards multicast packets. In Section 8.3.1, we first compare host-level and interface-level forwarding. Section 8.3.2 compares individual implementations in more detail. Section 8.3.3 discusses deadlock issues. Finally, Section 8.3.4 analyzes the performance of all multicast implementations.


[Figure: with host-level forwarding (left), a multicast packet travels from the network interface to the host and back to the network interface before it is forwarded; with interface-level forwarding (right), the network interface forwards it directly.]

Fig. 8.4. Host-level (left) and interface-level (right) multicast forwarding.

8.3.1 Host-Level versus Interface-Level Packet Forwarding

Host-level forwarding using a spanning tree is the traditional approach for networks without hardware multicast. The sender is the root of a multicast tree and transmits a packet to each of its children. Each receiving NI passes the packet to its host, which reinjects the packet to forward it to the next level in the multicast tree (see Figure 8.4). All host-level forwarding protocols reinject each multicast packet at most once. Instead of making a separate host-to-NI copy for each forwarding destination, the host creates multiple transmission requests for the same packet.
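A sketch of this forwarding step for a binary tree, under the assumption that node ranks are mapped onto the tree relative to the multicast root; the send call is a placeholder for posting a transmission request to the NI.

```c
/* Host-level forwarding along a binary multicast tree (illustration only). */
#define NODES 64

static void send_packet(int dest, const void *pkt) { (void)dest; (void)pkt; }

/* Position of 'me' in the tree rooted at 'root'. */
static int rel(int me, int root) { return (me - root + NODES) % NODES; }

/* On receipt of a multicast packet, reinject it once, posting one
 * transmission request per child so the host-to-NI copy is shared.   */
static void forward_mcast(int me, int root, const void *pkt)
{
    int r = rel(me, root);
    int left  = 2 * r + 1;
    int right = 2 * r + 2;

    if (left  < NODES) send_packet((root + left)  % NODES, pkt);
    if (right < NODES) send_packet((root + right) % NODES, pkt);
}
```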

Host-level forwarding has three disadvantages. First, since each reinjected packet was already available in the NI's memory, host-level forwarding results in an unnecessary host-to-NI data transfer at each internal node of the multicast tree, which wastes bus bandwidth and processor time. Second, no forwarding takes place unless the host processes incoming packets. If one host does not poll the network in a timely manner, all its children will be affected. Instead of relying on polling, the NI can raise an interrupt, but interrupts are expensive. Third, the critical sender-receiver path includes the host-NI interactions of all the nodes between the sender and the receiver. For each multicast packet, these interactions consist of copying the packet to the host, host processing, and reinjecting the packet.

The NI-level multicast implementations address all three problems. In the Imc protocols, the host does not reinject multicast packets into the network (which saves a data transfer), forwarding takes place even if the host is not polling, and the host-NI interactions are not on the critical path.

Host-level and NI-level multicast forwarding can be combined with the three point-to-point reliability schemes described in the previous section. We implemented all combinations except the combination of interface-level forwarding with host-level retransmission. In such a variant we expect the host to maintain the send and receive window administration, as in Hrx. However, if an NI needs to forward an incoming packet then it needs a sequence number for each of the forwarding destinations. To make this scheme work we would have to duplicate sequence number information (in a consistent way).

8.3.2 Acknowledgement Schemes

For reliability, multicast packets must be acknowledged by their receivers, just like unicast packets. To avoid an acknowledgement implosion, most multicast forwarding protocols use a reverse acknowledgement scheme in which acknowledgements flow back along the multicast tree.

Our protocols use two reverse acknowledgement schemes: ACK-forward and ACK-receive. ACK-forward is used by NrxImc (the default implementation), IrxImc, and HrxHmc. In these protocols, acknowledgements are sent at the level (NI or host) at which multicast forwarding takes place. The main characteristic of ACK-forward is that it does not allow a receiver to acknowledge a multicast packet to its parent in the multicast tree until that packet has been forwarded to the receiver's children.
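The essence of ACK-forward can be sketched as a small piece of per-packet bookkeeping; the names are assumed, and the real protocols track this per sender and per sequence number.

```c
/* ACK-forward bookkeeping for one multicast packet (a sketch). */
struct mcast_pkt_state {
    int forwards_left;     /* children that still need this packet          */
    int acked_to_parent;   /* has the parent been acknowledged yet?         */
};

static void send_ack_to_parent(void) { /* placeholder for the real ack path */ }

/* Called whenever forwarding of this packet to one child has completed. */
static void child_forward_done(struct mcast_pkt_state *p)
{
    if (--p->forwards_left == 0 && !p->acked_to_parent) {
        send_ack_to_parent();          /* only now may the parent be acked  */
        p->acked_to_parent = 1;
    }
}
```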

ACK-forward cannot easily be used by NrxHmc or IrxHmc, because in these protocols acknowledgements are sent by the NI and forwarding is performed by the host. NrxHmc and IrxHmc therefore use a simpler scheme, ACK-receive, which acknowledges multicast packets in the same way as unicast packets (i.e., without waiting for forwarding). In the case of NrxHmc, multicast packets can be acknowledged as soon as they have been delivered to the local host. In the case of IrxHmc, multicast packets can be acknowledged as soon as they have been received by the NI.

A potential problem with ACK-receive is that it creates a flow-control loophole that cannot easily be closed by an LFC client. In NrxHmc and IrxHmc, a host receive buffer containing a multicast packet is not released until the packet has been delivered to the client and has been forwarded to all children. Consequently, after processing a packet, the client cannot be sure that the packet is available for reuse, because it may still have to be forwarded. This makes it difficult for the client to implement a fool-proof flow-control scheme (when needed).


8.3.3 Deadlock Issues

With a spanning-tree multicast protocol, it is easy to create a buffer deadlock cycle in which all NIs (or hosts) have filled their receive pools with packets that need to be forwarded to the next NI (or host) in the cycle, which also has a full receive pool. Such deadlocks can be prevented or recovered from in several ways. The NI-level flow control protocol of NrxImc (the default implementation), for example, allows deadlock-free multicasting for a certain class of multicast trees (including binary trees).

Our other protocols do not implement true flow control at the multicast forwarding level and may therefore suffer from deadlocks. For certain types of multicast trees, including binary trees, the ACK-forward scheme prevents buffer deadlocks if sufficient buffer space is available at the forwarding level. Buffer space is sufficient if each receiver can store a full send window from each sender (see Appendix A). NrxImc's NI-level flow control scheme satisfies this requirement. Unfortunately, reserving buffer space for each sender destroys one of the advantages of retransmission: better utilization of receive buffers (see Section 8.2.4).

NrxHmc and IrxHmc do not provide any flow control at the forwarding (i.e., host) level and are therefore susceptible to deadlock. With our benchmarks and applications, however, the number of host receive buffers is sufficiently large that deadlocks do not occur.

8.3.4 Latency and Throughput Measurements

Multicast latency for 64 processors is shown in Figure 8.5. We define multicast latency as the time it takes to reach the last receiver. The top three curves in the graph correspond to the Hmc protocols, which perform multicast forwarding on the host processor. These protocols clearly perform worse than those that perform multicast forwarding on the NI. The reason is simple: with host-level forwarding two extra data copies occur on the critical path at each internal node of the multicast tree.

Figure 8.6 shows multicast throughput on 64 processors, using 1 Kbyte packets and copying at the receiver side. Throughput in this graph refers to the sender's outgoing data rate, not to the summed receiver-side data rate. In contrast with the latency graph (Figure 8.5), this graph shows that interface-level forwarding does not always yield the best results: for message sizes up to 128 Kbyte, for example, HrxHmc achieves higher throughput than IrxImc.

[Figure: multicast latency in microseconds versus message size in bytes, on 64 processors with 1 Kbyte packets and no receiver copy, for IrxHmc, HrxHmc, NrxHmc, IrxImc, and NrxImc.]

Fig. 8.5. LFC multicast latency.

[Figure: multicast throughput (Mbyte/s) versus message size (256 bytes to 1 Mbyte), on 64 processors with 1 Kbyte packets and a receiver copy, for NrxImc, NrxHmc, HrxHmc, IrxImc, and IrxHmc.]

Fig. 8.6. LFC multicast throughput.

The dip in the HrxHmc curve is the result of HrxHmc's higher receive overhead and a cache effect. For larger messages, the receive buffer no longer fits in the L2 cache, and so the copying of incoming packets becomes more expensive, because these copies now generate extra memory traffic. In HrxHmc, these increased copying costs turn the receiving host into the bottleneck: due to its higher receive overhead, HrxHmc does not manage to copy a packet completely before the next packet arrives. In the other protocols, the receive overhead is lower, so there is enough time to copy an incoming packet before the next packet arrives.

8.4 Parallel-Programming Systems

All applications discussed in this chapter run on one of the following PPSs: Orca, CRL, MPI, or Multigame. Orca, CRL, and MPI were discussed in Chapter 7; Multigame is introduced below. This section describes the communication style of all PPSs and discusses PPS-level performance.

8.4.1 Communication Style

Each of our PPSs implements a small number of communication patterns, which are executed in response to actions in the application. For each PPS, we summarize its main communication patterns and discuss the interactions between these patterns and the communication system.

CRL

In CRL, most communication results from read and write misses. The resulting communication patterns were described in Section 7.3.2 (see Figure 7.17).

CRL's invalidation-based coherence protocol does not allow applications to multicast data. (CRL uses LFC's multicast primitive only during initialization and barrier synchronization.) Also, since CRL's single-threaded implementation does not allow a requesting process to continue while a miss is outstanding, applications are sensitive to roundtrip performance. Roundtrip latency is more important than roundtrip throughput, because CRL mainly sends small messages. Control messages are always small and data messages are small because most applications use fairly small regions.

The roundtrip nature of CRL's communication patterns and the small region sizes used in most CRL applications lead to low acknowledgement rates for the Nrx protocols. The roundtrips make piggybacking effective and due to the small data messages few half-window acknowledgements are needed. Finally, NrxImc and NrxHmc never send delayed acknowledgements. This is important: for one of our applications (Barnes, see Section 8.5), for example, HrxHmc sends 31 times as many acknowledgements as NrxImc, the default LFC implementation.

MPI

Since MPI is a message-passing system, programmers can express many different communication patterns. MPI's main restriction is that incoming messages cannot be processed asynchronously, which occasionally forces programmers to insert polling statements into their programs. The sensitivity of an MPI program to the parameters of the communication architecture depends largely on the application's communication rate and communication pattern.

All our MPI applications (QR, ASP, and SOR) use MPI's collective-communication operations. Internally, these operations (broadcast and reduce-to-all) all use broadcasting.

All MPI measurements in this chapter were performed using the same MPICH-based MPI implementation as described in Section 7.4. Recall that MPICH provides a default spanning-tree broadcast implementation built on top of unicast primitives. This default implementation can be replaced with a more efficient 'native' implementation. All MPI measurements in Section 8.5 were performed with a native implementation that uses Panda's broadcast which, in turn, uses one of our implementations of LFC's broadcast. In Section 7.4, we used microbenchmarks to show that this native implementation is more efficient than MPICH's default implementation. Figure 8.7 shows that this difference is also visible at the application level. The figure shows the relative performance of two broadcasting MPI applications, ASP and QR, which we will discuss in more detail in Section 8.5. The MPICH-default measurements were performed using unicast primitives on top of the default LFC implementation (NrxImc). For both ASP and QR, this default broadcast is significantly slower than the implementation that uses the broadcast of NrxImc: ASP runs 25% slower and QR runs 50% slower.

[Figure: normalized execution times of ASP/MPI and QR/MPI with the NrxImc broadcast and with MPICH's default broadcast.]

Fig. 8.7. Application-level impact of efficient broadcast support.

MPICH's default broadcast implementation suffers from two problems. First, MPICH forwards entire messages rather than individual packets. For large messages, this eliminates pipelining and reduces throughput. ASP pipelines multiple broadcast messages, each of which consists of multiple packets. In QR, there is no such pipelining of messages; only the packets within a single message can be pipelined. Second, the default implementation cannot reuse NI packet buffers. At each internal node of the multicast tree, the message will be copied to the network interface once for each child. The LFC implementations reuse NI packet buffers to avoid this repeated copying.

Orca

Orca programs can perform two types of communication patterns: RPC and totally-ordered broadcast. In their communication behavior, Orca programs resemble message-passing programs, except in that there is no asynchronous one-way message send primitive. Such a primitive can be simulated by means of multithreading, but none of our applications does this. (Our Orca applications are dominated by multicast traffic.) In Orca, messages carry the parameters and results of user-defined operations, so programmers have control over message size and message rate.

Multigame

Multigame (MG) is a parallel game-playing system developed by Romein [122]. Given the rules of a board game and a board evaluation function, Multigame automatically searches for good moves, using one of several search strategies (e.g., IDA*). During a search, processors push search jobs to each other. A job consists of (recursively) evaluating a board position. To avoid re-searching positions, positions are cached in a distributed hash table. When a process needs to read a table entry, it sends the job to the owner of the table entry [122]. The owner looks up the entry and continues to work on the job until it needs to access a remote table entry.


Multigame's job descriptors are small (32 bytes). To avoid the overhead of sending and receiving many small messages, Multigame aggregates job messages. The communication granularity depends on the maximum number of jobs per message.

Multigame runs on Panda-ST and receives all messages through polling. Almost all communication consists of small to medium-size one-way messages, which are sent to random destinations. The maximum message size depends on the message aggregation limit. The message rate also depends on this limit and, additionally, on the application-specific cost of evaluating a board position. Senders need not wait for replies, so as long as each process has a sufficient amount of work to do, latency (in the LogP sense) is unimportant. Communication overhead is dominated by send and receive overhead.

8.4.2 Performance Issues

Table 7.2 summarizes the minimum latency and the maximum throughput of characteristic operations for Orca, CRL, and MPI. Rules in Multigame specifications are not easily tied to communication patterns. The most important pattern, however, consists of sending a number of jobs in a single message to a remote processor. All communication consists of Panda-ST point-to-point messages.

Table 7.2 shows that our PPSs add significant overhead to an LFC-level implementation of the same communication pattern, so we expect that these overheads will reduce the application-level performance differences between different LFC implementations.

Client-level optimizations form another dampening factor. Two optimizations are worth mentioning: message combining and latency hiding.

As described above, Multigame performs message combining by aggregating search jobs in per-processor buffers. Instead of sending out a search job as soon as it has been generated, the Multigame runtime system stores jobs in an aggregation queue for the destination processor. The contents of this queue are not transmitted until a sufficient number of jobs has been placed into the queue. The difference in packet rate between Puzzle-4 and Puzzle-64 in Figure 8.8 shows that message combining significantly reduces the packet rate for all protocols. Of course, the resulting performance gain (see Figure 8.9) is larger for protocols with higher per-packet costs.
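A sketch of this aggregation scheme, with assumed names and a placeholder send routine standing in for the Panda-ST message send:

```c
#include <string.h>

/* Job size and maximum aggregation limit taken from the text above. */
#define MAX_JOBS 64
#define JOB_SIZE 32

struct agg_queue {
    int  count;
    char jobs[MAX_JOBS][JOB_SIZE];
};

/* Placeholder for the real Panda-ST point-to-point send. */
static void send_message(int dest, const void *buf, int len)
{ (void)dest; (void)buf; (void)len; }

/* Buffer a job for 'dest'; flush one combined message when the
 * aggregation limit for this run (4 or 64 in Puzzle) is reached.    */
static void push_job(struct agg_queue *q, int dest,
                     const char job[JOB_SIZE], int limit)
{
    memcpy(q->jobs[q->count++], job, JOB_SIZE);
    if (q->count == limit) {
        send_message(dest, q->jobs, q->count * JOB_SIZE);
        q->count = 0;
    }
}
```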

CRL and Orca implement latency hiding. Both systems perform RPC-style transactions: a processor sends a request to another machine and waits for the reply. Both systems then start polling the network and process incoming packets while waiting for the reply. Consequently, protocols can compensate high latencies by processing other packets in their idle periods. (This holds only if the increased latency is not the result of increased host processor overhead.)

Application  PPS   Problem size                   T1      S64    E64
ASP          MPI   1024 nodes                     49.32   74.73  1.17
Awari        Orca  13 stones                      448.90  31.81  0.50
Barnes       CRL   16,384 bodies                  123.24  23.25  0.36
FFT          CRL   1,048,576 complex floats       4.46    49.56  0.77
LEQ          Orca  1000 equations                 610.90  30.12  0.47
Puzzle-4     MG    15-puzzle, 4 jobs/message      281.00  45.32  0.71
Puzzle-64    MG    15-puzzle, 64 jobs/message     281.00  53.63  0.84
QR           MPI   1024 x 1024 doubles            54.16   45.51  0.71
Radix        CRL   3,000,000 ints, radix 128      3.20    10.32  0.16
SOR          MPI   1536 x 1536 doubles            30.91   51.52  0.80

Table 8.6. Application characteristics and timings. T1 is the execution time (in seconds) on one processor; S64 is the speedup on 64 processors; E64 is the efficiency on 64 processors.

8.5 Application Performance

This section analyzes the performance of several applications on all five LFC implementations. Section 8.5.1 summarizes the performance results. Sections 8.5.2 to 8.5.10 discuss and analyze the performance of individual applications.

8.5.1 Performance Results

Table 8.6 lists, for each application, the PPS it runs on, its input parameters, sequential execution time, speedup, and parallel efficiency. (Parallel efficiency is defined as speedup divided by the number of processors: E64 = S64/64.) The sequential execution time and parallel efficiency were measured using the default implementation (NrxImc). Sequential execution times range from a few seconds to several minutes; speedups on 64 processors range from poor (Radix) to superlinear (ASP). While we use small problems that easily fit into the memories of 64 processors, 7 out of 10 applications achieve an efficiency of at least 50%. Superlinear speedup occurs because we use fixed-size problems and because 64 processors have more cache memory at their disposal than a single processor.

With the exception of Awari and the Puzzle programs, all programs implement well-known parallel algorithms that are frequently used to benchmark PPSs. All CRL applications (Barnes, FFT, and Radix), for example, are adaptations of programs from the SPLASH-2 suite of parallel benchmark programs [153].

We performed the application measurements using the following parameter settings: 64 processors (P = 64), a packet size of 1 Kbyte (PKTSZ = 1 Kbyte), 4096 host receive buffers (HRB = 4096), a send window of 8 packet buffers (W = 8), an interrupt delay of 70 µs (INTD = 70), and a timer granularity of 5000 µs (TGRAN = 5000). For the retransmitting protocols (IrxImc, IrxHmc, and HrxHmc), we use 256 NI send buffers and 384 NI receive buffers (ISB + IRB = 256 + 384 = 640). For the careful protocols (NrxImc and NrxHmc), we use ISB + IRB = 128 + 512 = 640. With these settings, the control program occupies approximately 900 Kbyte of the NI's memory (1 Mbyte). The remaining space is used by the control program's runtime stack.

Communication statistics for all applications are summarized in Figure 8.8, which gives per-processor data and packet rates, broken down according to packet type. The figure distinguishes unicast data packets, multicast data packets, acknowledgements, and synchronization packets. Data and packet rates refer to incoming traffic, so a broadcast is counted as many times as the number of destinations (63).

The goal of the figure is to show that the applications exhibit diverse communication patterns. Compare, for example, Barnes, Radix, and SOR. Barnes has high packet rates and low data rates (i.e., packets are small). Radix, in contrast, has both high packet and data rates. SOR has a low packet rate and a high data rate (i.e., packets are large). These rates were all measured using HrxHmc; the rates on other implementations are different, of course, but show similar differences between applications.

Figure 8.9 shows application performance of the alternative implementations relative to the performance of the default implementation (NrxImc). None of the alternative implementations is faster than the default implementation. In several cases, however, the alternatives are slower. Below, these slowdowns are discussed in more detail.

[Figure: per-processor data rates (Mbyte/processor/s, split into unicast and multicast) and packet rates (packets/processor/s, split into unicast, multicast, acknowledgement, and synchronization packets) for all ten applications.]

Fig. 8.8. Data and packet rates of NrxImc on 64 nodes.

[Figure: slowdown relative to NrxImc of NrxHmc, IrxImc, IrxHmc, and HrxHmc, for all ten applications.]

Fig. 8.9. Normalized application execution times.

8.5.2 All-pairs Shortest Paths (ASP)

ASP solves the All-pairs Shortest Path (ASP) problem using an iterative algorithm (Floyd-Warshall). ASP finds the shortest path between all nodes in a graph. The graph is represented as a distance matrix which is row-wise distributed across all processors. In each iteration, one processor broadcasts one of its 4 Kbyte rows. The algorithm iterates over all of this processor's rows before switching to another processor's rows. Consequently, the current sender can pipeline the broadcasts of its rows.

Figure 8.9 shows that both protocols that use interface-level multicast forwarding perform better than the protocols that use host-level forwarding. This is surprising, because Figure 8.6 showed that IrxImc achieves the worst multicast throughput. Indeed, ASP revealed several problems with host-level forwarding that do not show up in microbenchmarks. First, in ASP, it is essential that the sender can pipeline its broadcasts. Receivers, however, are often still working on one iteration when broadcast packets for the next iteration arrive. These packets are not processed until the receivers invoke a receive primitive. (This is a property of our MPI implementation, which uses only polling; see Section 7.4.) Consequently, the sender is stalled, because acknowledgements do not flow back in time. In the interface-level forwarding protocols, acknowledgements are sent by the NI, not by the host processor. As long as the host supplies a sufficiently large number of free host receive buffers, NIs can continue to deliver and forward broadcast packets.

We augmented the Hmc versions of ASP with application-level polling statements (MPI probe()), which improved the performance of the host-level protocols. Figure 8.9 shows the improved numbers. Nevertheless, the Hmc protocols do not attain the same performance as the Imc protocols. The remaining difference is due to the processor overhead caused by failed polls and host-level forwarding. The effect of forwarding-related processor overhead is visible in the multicast latency benchmark (see Figure 8.5).

8.5.3 Awari

Awari creates an endgame database for Awari, a two-player board game, by means of parallel retrograde analysis [8]. In contrast with top-down search techniques like α-β search, retrograde analysis proceeds bottom-up by making unmoves, starting with the end positions of Awari. The Orca program creates an endgame database DBn, which can be used by a game-playing program. Each entry in the database represents a board position's hash and the board's game-theoretical value. The game-theoretical value represents the outcome of the game in the case that both players make best moves only. DBn contains game-theoretical values for all boards that have at most n stones left on the board. The game-theoretical value is a number between -n and n that indicates the number of pieces that can be won (or lost). We ran Awari with n = 13.

Parallel Awari operates as follows. Each processor stores part of the database. The parents of a board are the boards that can be reached by applying a legal unmove to the board. When a processor updates a board's game-theoretical value it must also update the board's parents. Since the boards are randomly distributed across all processors, it is likely that a parent is located on another processor. A single update may thus result in several remote update operations that are executed by RPCs.

To avoid excessive communication overhead, remote updates are delayed and stored in a queue for their destination processor. As soon as a reasonable number of updates has accumulated, they are transferred to their destination with a single RPC. The performance of Awari is determined by these RPCs; broadcasts do not play a major role in this application.

The updates are transferred by a single communication thread, which can run in parallel with the computation thread. Application-level statistics, however, indicate that most updates are transmitted when there is no work for the computation thread. As a result, performance is dominated by the speed at which work can be distributed. Queued updates that must be sent to different destination processors cannot be sent out at a rate that is higher than the rate at which the communication thread can perform RPCs.

Although Awari's data rate appears low (see Figure 8.8), communication takes place in specific program phases. In these phases, communication is bursty and most messages are small despite message combining. Performance is therefore dominated by occupancy rather than roundtrip latency. The LogP parameters in Table 8.4 show that Irx has the highest gap (g = 9.7 µs), followed by Hrx (g = 8.3 µs) and then Nrx (g = 6.7 µs). This ranking is reflected in the application's performance.

8.5.4 Barnes-Hut

Barnes simulates a galaxy using the hierarchical Barnes-Hut N-body algorithm. The program organizes all simulated bodies in a shared oct-tree. Each node in this tree and each body (stored in a leaf) is represented as a small CRL region (88–108 bytes). Each processor owns part of the bodies. Most communication takes place during the phase that computes the gravitational forces that bodies exert on one another. In this phase, each processor traverses the tree to compute for each of its bodies the interaction with other bodies or subtrees. Barnes has a relatively high packet rate, but due to the small regions the data rate is low (see Figure 8.8).

Even though Barnes runs on a different PPS, its performance profile in Figure 8.9 is similar to that of Awari. Since CRL makes little use of LFC's multicast, there is no large difference between interface-level and host-level forwarding. Retransmission support, however, leads to decreased performance. This effect is strongest for Irx, which has the highest interpacket gap (g = 9.7 µs) and is therefore more likely to suffer from NI occupancy under high loads. These high loads occur because Barnes has a high packet rate (see Figure 8.8). Due to CRL's roundtrip communication style (see Section 7.3), Barnes is sensitive to this increase in occupancy.

8.5.5 Fast Fourier Transform (FFT)

FFT performs a Fast Fourier transform on an array of complex numbers. These numbers are stored in a matrix which is partitioned into P^2 square blocks (where P is the number of processors). Each processor owns a vertical stripe of P blocks and each block is stored in a 2 Kbyte CRL region. The main communication phases consist of matrix transposes. During a transpose a personalized all-to-all exchange of all blocks takes place: each processor exchanges each of its blocks (except the block on the diagonal) with a block owned by another processor. Each block transfer is the result of a write miss. The node that generates the write miss sends a small request message to the block's home node, which replies with a data message that contains the block.

As explained in Section 8.2.5, this communication pattern puts pressure on the number of send buffers for the retransmitting implementations. Since we have configured these implementations with a fairly large number of send buffers (ISB = 256), however, this poses no problems.

The performance results in Figure 8.9 show no large differences between different LFC implementations. The retransmitting protocols perform slightly worse than the careful protocols. Occasionally, these protocols send delayed acknowledgements. If we increase the delayed acknowledgement timeout, no delayed acknowledgements are sent, and the differences become even smaller.

8.5.6 The Linear Equation Solver (LEQ)

LEQ is an iterative solver for linear systems of the form Ax = b. Each iteration refines a candidate solution vector x_i into a better solution x_{i+1}. This is repeated until the difference between x_{i+1} and x_i becomes smaller than a specified bound.

In each iteration, each processor produces a part of x_{i+1}, but needs all of x_i as its input. Therefore, all processors exchange their 128-byte partial solution vectors at the end of each iteration (6906 total). Each processor broadcasts its part of x_{i+1} to all other processors. After this exchange, all processors synchronize to determine whether convergence has occurred.

In LEQ, HrxHmc suffers from its host-level fetch-and-add implementation. Processing F&A requests on the host rather than on the NI increases the response time (see Table 8.5) and the occupancy of the processor that stores the F&A variable (processor 0).

Although LEQ is dominated by broadcast traffic, the cost of executing a retransmission protocol appears to be the main performance factor: increased host and NI occupancy lead to decreased performance. This is due to LEQ's communication behavior. All processes broadcast at the same time, congesting the network and the NIs. In ASP and QR, processes broadcast one at a time.

8.5.7 Puzzle

Puzzle performs a parallel IDA* search to solve the 15-puzzle, a well-known single-player sliding-tile puzzle. We experimented with two versions of Puzzle: in Puzzle-4, Multigame aggregates at most 4 jobs before pushing these jobs to another processor, while in Puzzle-64 up to 64 jobs are accumulated. Both programs solve the same problem, but Puzzle-4 sends many more messages than Puzzle-64.

In Puzzle, all communication is one-way and processes do not wait for incoming messages. As a result, send and receive overhead are more important than NI-level latency and occupancy. HrxHmc has the worst send and receive overheads of our LFC implementations (see Table 8.4) and therefore performs worse than the other implementations. In Puzzle-4, the difference between HrxHmc and the other protocols is larger than in Puzzle-64, because Puzzle-4 needs to send more messages to transfer the same amount of data. Consequently, the higher send and receive overheads are incurred more often during the program's execution.

8.5.8 QR Factorization

QR is a parallel implementation of QR factorization [59] with column distribution. In each iteration, one column, the Householder vector H, is broadcast to all processors, which update their columns using H. The current upper row and H are then deleted from the data set so that the size of H decreases by 1 in each of the 1024 iterations. The vector with maximum norm becomes the Householder vector for the next iteration. This is decided with a Reduce-To-All collective operation to which each processor contributes two integers and two doubles.

The performance characteristics of QR are similar to those of ASP (see Figure 8.9): the implementations based on host-level forwarding perform worse than those based on interface-level forwarding. The performance of QR is dominated by the broadcasts of column H. As in ASP, host-level forwarding leads to increased processor overhead. In addition, the implementations based on host-level forwarding are sensitive to their higher broadcast latency. At the start of each iteration, each receiving processor must wait for an incoming broadcast. In contrast with ASP, the broadcasting processor cannot get ahead of the receivers by pipelining multiple broadcasts, because the Reduce-to-All synchronizes all processors in each iteration. (Also, due to pivoting, the identity of the broadcasting processor changes in almost every iteration.)

8.5.9 Radix Sort

Radix sorts a large array of (random) integers using a parallel version of the radix sort algorithm. Each processor owns a contiguous part of the array and each part is further subdivided into CRL regions, which act as software cache lines. Communication is dominated by the algorithm's permutation phase, in which each processor moves integers in its partition to some other partition of the array. If the keys are distributed uniformly, each processor accesses all other processors' partitions. We chose a region size (1 Kbyte) that minimizes write sharing in this phase. After the permutation phase each processor reads the new values in its partition and starts the next sorting iteration. Radix has the highest unicast data rate of all our applications (see Figure 8.8). In each of the three permutation phases, most of the array to be sorted is transferred across the network.

Although Radix sends larger messages than Awari, Barnes, and LEQ, it has a similar performance profile as these applications. Radix's messages are still relatively small (at most 1 Kbyte of region data) and each message requires at most two LFC packets, a full packet and a nearly empty packet. Consequently, Radix remains sensitive to the small-message bottleneck, so that NI-level retransmission (IrxImc and IrxHmc) performs worst, followed by host-level retransmission (HrxHmc).


8.5.10 Successive Overrelaxation (SOR)

SOR is used to solve discretized Laplace equations. The SOR program uses red-black iteration to update a 1536 x 1536 matrix. Each processor owns an equal number of contiguous rows of the matrix. In each of the 42 iterations, processors exchange neighboring border rows (12 Kbyte) and perform a single-element (one double) reduction to determine whether convergence has occurred.

For SOR, HrxHmc performs worse than the other implementations. With the HrxHmc implementation, the host processor suffers from sliding-window stalls. Each 12 Kbyte message that is sent to a neighbor requires 13 LFC packets, while the window size is only 8 packets. If the destination process is actively polling when the sender's packets arrive, it will send a half-window acknowledgement before the sender is stalled. In SOR, however, all processes send out their messages at approximately the same time, first to their higher-numbered neighbor, then to their lower-numbered neighbor. The destination process will therefore not poll until one of its own sends blocks (due to a closed send window). At that point, the sender has already been stalled.

The implementations that implement NI-level reliability do not suffer from this problem, because they implement the sliding window on the NI. Even if the NI's send window to another NI closes, the host processor can continue to post packets until the supply of free send descriptors is exhausted. Also, if the NI's send window to one NI fills up, it can still send packets to other NIs if the host supplies packets for other destinations.

The performance of HrxHmc improves significantly if the boundary-row exchange code is modified so that not all processes send in the same direction at the same time. In practice, this improves the probability that acknowledgements can be piggybacked and reduces the number of window stalls.

8.6 Classification

Basedon theperformanceanalysisof theindividualapplications,wehave identi-fiedthreeclassesof applicationswith distinctbehavior. Theseclassesandanindi-cationof therelativeperformanceof eachLFC implementationfor eachclassareshown in Table8.7. In this table,a ’ ß ’ indicatesthatanimplementationperformswell for aclassof applications;a ’ Ù ’ indicatesamodestdecreasein performance;and’ ÙõÙ ’ indicatesasignificantdecreasein performance.

Theperformanceof roundtripapplicationsis dominatedby roundtriplatency.

Page 253: Communication Architectures for Parallel-ProgrammingSystems

8.6Classification 231

Class Applications NrxImc NrxHmc IrxImc IrxHmc HrxHmc

Roundtrip Barnes,Radix,Awari + + ÙöÙ ÙöÙ ÙMulticast ASP, QR + Ù + Ù ÙõÙOne-way Puzzle-4,Puzzle-64 + + + + Ù

Table 8.7.Classificationof applications.

Sincemulticastplaysno importantrole in theseapplications,performancedif-ferencesbetweentheLFC implementationsaredeterminedby differencesin thereliability schemes.This classof applicationsshows that the robustnessof re-transmissionhasa price. Both the host-level retransmissionschemesand theinterface-level retransmissionschemesperformworsethantheno-retransmissionscheme,but for differentreasons.Host-level retransmission(HrxHmc) suffersfromits highersendandreceiveoverheadandis up to 11%slower thanthedefault im-plementation(NrxImc). NI-level retransmission(IrxImc and IrxHmc) suffers fromincreased(NI) occupancy and latency. Theseimplementationsare up to 16%slower thanthedefault implementation.For this classof applications,theLogPmeasurementspresentedearlierin thischaptergiveagoodindicationof thecausesof differencesin application-level performance.

Theperformanceof themulticastapplicationsis determinedby theefficiencyof the multicastimplementation.The performanceresultsshow the advantagesof NI-level multicast forwarding. HrxHmc, IrxHmc, and NrxHmc all suffer fromhost-level forwardingoverheadwhichtakesawaytimefrom theapplication.Host-level forwardingis upto 38%slowerthaninterface-level forwardingin thedefaultimplementation.

The’class’of one-wayapplicationscontainsbothvariantsof thePuzzleappli-cation. In contrastwith roundtripapplications,one-way applicationsdo not waitfor replymessages,solatency andNI occupancy canbetoleratedfairly well. Sendandreceive overhead,on the otherhand,cannotbe hidden. As a result,HrxHmc

suffersfrom its increasedsendandreceive overhead;this leadsto a performancelossof up to 16%relative to thedefault implementation.

Theremainingapplications(FFT, LEQ, andSOR)do not fit into any of thesecategories.

Page 254: Communication Architectures for Parallel-ProgrammingSystems

232 Multilevel PerformanceEvaluation

FM/MC

HamlynFMBIP (large)

VMMCTrapeze VMMC-2 AM-II

U-NetBIP (small)

Host retransmitsNo retransmission NI retransmits

Host forwards

NI forwards

PM

Fig. 8.10. Classificationof communicationsystemsfor Myrinet basedon themulticastandreliability designissues.

8.7 RelatedWork

This chapterhasconcentratedon NI protocol issuesandtheir relationshipwithpropertiesof PPSsandapplications.Figure8.10classifiesseveralexisting Myri-netcommunicationsystemsalongtwo protocoldesignaxes:reliability andmulti-cast.Mostsystemsdonotimplementretransmission.Weassumethatthisis partlydueto theprototypenatureof researchsystems.The figurealsoshows that fewsystemsprovide multicastor broadcastsupport. Below, we first discussrelatedwork in thesetwo areas.Next, wediscussotherNI-relatedapplicationstudies.

8.7.1 Reliability

Only a few studieshave comparedcarefulandretransmittingprotocols.Theear-liest studythatwe areawareof is by MosbergerandPeterson[106], which com-paresa carefulanda retransmittingprotocol implementationfor FiberChannel.They notethe scalabilityproblemsof staticbuffer reservation (as in Nrx). Thisproblemexists, but the low bandwidth-delayproductof modernnetworks (andprotocols)allows theuseof staticreservation in medium-sizeclusters.(Another,dynamicsolution,usedin PM’s NI protocol,is describedbelow.) MosbergerandPetersonusedonly oneapplication,Jacobi.Thatapplicationshows muchlargerbenefitsfor carefulprotocolsthanwe find with our applicationsuite. This maybe due to extra copying in their retransmittingprotocol. As explainedin Sec-tion 8.2.2,our retransmittingprotocolsdo not make morecopiesthanour carefulprotocols.

Someof theprotocolsin Figure8.10combinethestrategiesusedby Nrx, Hrx,and Irx. Like Nrx, for example,PM assumesthat the hardware never dropsor

Page 255: Communication Architectures for Parallel-ProgrammingSystems

8.7RelatedWork 233

corruptspackets. Like Irx, though,PM letssendersshareNI receive buffersanddropsincoming data packets when all buffers are full. When a datapacket isdropped,a negative acknowledgementis sentto its sender. PM never dropsac-knowledgements(negative or positive). Sinceit is assumedthat the hardwaredeliversall packetscorrectly, sendersalwaysknow whenoneof their packetshasbeendroppedandcanretransmitthatpacketwithouthaving to setaretransmissiontimer.

Active MessagesII combinesanNI-level (alternating-bit)reliability protocolwith a host-level sliding window protocolwhich is usedboth for reliability andflow control[37].

Our NI-supportedfine-graintimer implementationis similar to thesoft timerschemedescribedby Aron andDruschel[6] —developedindependently—whoalsousepolling to implementafine-graintimer. They optimizekernel-level timersandpoll thehost’s timestampcounteron systemcalls,exceptions,andinterruptsto amortizethestate-saving overheadof theseeventsover multiple actions(e.g.,network interruptprocessingandtimer processing).Soft timersusethe kernel’sclock interruptasa backupmechanism.This interruptoftenhasa granularityofat leastseveralmilliseconds.Hrx usestheNI’s timerasabackup.This timerhasa0 à 5 µsgranularityandis polledin eachiterationof theNI’smainloop. Ourbackupmechanismis thereforemorepreciseandcangenerateaninterruptata timecloseto the desiredtimer expiration time in casethe host’s timestampcounteris notpolled frequentlyenough.This is only a modestadvantage,becauseit doesnotaddressthe interruptoverheadproblemthatcanresultfrom infrequentpolling ofthetimestampcounter.

8.7.2 Multicast

To thebestof our knowledge,this is thefirst studythatcomparesa high-perfor-manceNI-level and high-performancehost-level multicastand their respectiveimpacton applicationperformance.Several papersdescribemulticastprotocolsthatuseNI-level multicastforwarding[16, 56, 68, 80, 146]. This work wasde-scribedin Section4.5.3. Someof thesepapers(e.g., [146]) comparehost-levelforwardingandNI-level forwarding,but mostof thesecomparisonsusea host-level forwardingschemethat copiesdatato the NI multiple times. We arenotawareof any previousstudythatstudiestheapplication-level impactof bothfor-wardingstrategies.

Page 256: Communication Architectures for Parallel-ProgrammingSystems

234 Multilevel PerformanceEvaluation

8.7.3 NI ProtocolStudies

Araki et al. usedmicrobenchmarksand the LogP performancemodel to com-paretheperformanceof GenericActiveMessages,Illinois FastMessages(version2), BIP, PM, andVMMC-2 [5]. Their studycomparescommunicationsystemswith programminginterfacesthat are sometimesfundamentallydifferent (e.g.,memory-mappedcommunicationin VMMC-2 versusmessage-basedcommuni-cation in PM andrendezvous-stylecommunicationin BIP versusasynchronousmessagepassingin Fast Messages). In our study, we comparefive differentimplementationsof the sameinterface. The most importantdifferencewith thework describedin this chapter, however, is that Araki et al. do not considertheapplication-level impactof their results.

In anotherstudy [102], Martin et al. do focus on applications,but evaluateonly a single communicationarchitecture(Active Messages)and a single pro-grammingsystem(Split-C). WhereMartin et al. vary the performancecharac-teristics(theLogGPparameters)of a singlecommunicationsystem,we usefivedifferentsystemswhichhavedifferentperformancecharacteristicsdueto thewaythey divide protocolwork betweenthe hostprocessorandthe NI. An importantcontributionof our work is theevaluationof differentnetwork interfaceprotocolsusing a wide variety of parallel applications(fine-grain to medium-grain,uni-castandmulticast)andPPSs(distributedsearch,messagepassing,update-basedDSM, and invalidation-basedDSM). Eachsystemhasits own, fairly large andcomplex runtime system,which imposessignificantcommunicationoverheads(seeTable7.2). In addition,threeout of four systemsdo not run immediatelyontopof oneNI protocollayer(LFC), but onanintermediatemessage-passinglayer(Panda).

Bilasetal.discussNI supportfor sharedvirtual memorysystems(SVMs)[21].They studyNI supportfor fetchingdatafrom remotememories,for depositingdatainto remotememories,andfor distributedlocks.They find thatthesemecha-nismssignificantlyimprove theperformanceof SVM implementationsandclaimthat they aremorewidely applicable.Using thesemechanisms,andby restruc-turing their SVM, they eliminatedall asynchronoushost-level protocolprocess-ing in their SVM. As a result,interruptsarenot neededandpolling is usedonlywhena messageis expected.This result,however, dependson theability to pushasynchronousprotocolprocessingto theNI. Bilaset al. useaPPSin which asyn-chronousprocessingconsistsof accessingdataat a known addressor accessingalock. Theseactionsarerelatively simpleandcanbeperformedby theNI. PPSssuchasOrcaandManta,however, mustexecuteincominguser-definedoperations,

Page 257: Communication Architectures for Parallel-ProgrammingSystems

8.8Summary 235

which cannoteasilybehandledby theNI.In a simulation study, Bilas et al. identify bottlenecksin software shared-

memorysystems[20]. This study is structuredaroundthe samethreelayersaswe usedin this chapter:low-level communicationsoftwareandhardware,PPS,andapplication. Both the communicationlayer andthe PPSsaredifferentfromtheonesstudiedin this thesis.Bilas et al. assumea communicationlayer basedon virtual memory-mappedcommunication(seeChapter2) andanalyzetheper-formanceof page-basedandfine-grainedDSMs.Ourwork usesacommunicationlayerbasedonlow-level messagepassingandPPSsthatimplementmessagepass-ing or object-basedsharing.(CRL usesasimilarcachecoherenceprotocolasfine-grainedDSMs,but relieson theprogrammerto define’cachelines’ andto triggercoherenceactions.Fine-grainedDSMs suchasTempest[119] andShasta[125]requirelittle or noprogrammerintervention.)

8.8 Summary

In thischapter, wehavestudiedtheperformanceof fiveimplementationsof LFC’scommunicationinterface.Eachimplementationis acombinationof onereliabilityscheme(no retransmission,host-level retransmission,or NI-level retransmission)andonemulticastforwardingscheme(host-level forwardingor NI-level forward-ing). Thesefive implementationsrepresentdifferentassumptionsaboutthecapa-bilities of network interfaces(availability of aprogrammableNI processoranditsspeed)andthe operatingenvironment(reliability of the network hardware). Wecomparedtheperformanceof all fiveimplementationsatmultiple levels.Weusedmicrobenchmarksfor direct comparisonsbetweenthe differentimplementationsandfor comparisonsat thePPSlevel. We usedfour differentPPSs,which repre-sentdifferentprogrammingparadigmsandwhichexhibit differentcommunicationpatterns.Most importantly, wealsoperformedanapplication-level comparison.

Theseperformancecomparisonsreveal several interestingfacts. First, al-thoughperformancedifferencesarevisible, all five LFC implementationscanbemadeto performwell on most LFC-level microbenchmarks.Multicast latencyformsanexception:NI-level multicastforwardingyieldsbettermulticastlatencythanhost-level forwarding.

Second,all PPSsaddlargeoverheadsto LFC implementations.Measuringes-sentiallythesamecommunicationpatternatmultiple levelsshowslargeincreasesin latency andlargereductionsin throughput.While this is a logical consequenceof the layeredstructureof the PPSs,it is important to make this observation,

Page 258: Communication Architectures for Parallel-ProgrammingSystems

236 Multilevel PerformanceEvaluation

becauseonewould expect that theseoverheads,andnot the relative small per-formancedifferencesbetweenLFC implementations,will dominateapplicationperformance.

Third, in spiteof thepreviousobservation,runningapplicationson thediffer-ent LFC implementationsyields significantdifferencesin applicationexecutiontime. Most of our applicationsexhibited oneof threecommunicationpatterns:roundtrip,one-way, or multicast. For eachof thesepatterns,the relative perfor-manceof theLFC implementationsis similar:

÷ Theroundtripapplicationsperformbeston thenonretransmittingLFC im-plementations.Retransmissionsupportaddssufficientoverheadthatit can-not be hiddenby applicationsin this class. The overheadshave a largerimpactwhenretransmissionis implementedon theNI.

÷ The multicastapplicationsperformbeston the implementationsthat per-form NI-level forwarding. NI-level forwardingyields lower multicastla-tency anddoesnotwastehost-processorcycleson packet forwarding.

÷ Theone-way applications—really two variantsof thesameapplications—performwell on all implementations,excepton the implementationsthatimplementhost-level retransmission.This implementationsuffers from itslargersendandreceiveoverheads.

Our original implementationof LFC, NrxImc, performsbestfor all patterns.This implementation,however, optimistically assumesthe presenceof reliablenetwork hardwareandanintelligentNI.

Page 259: Communication Architectures for Parallel-ProgrammingSystems

Chapter 9

Summary and Conclusions

Theimplementationof a high-level programmingmodelconsistsof multiple lay-ers: a runtimesystemor parallel-programmingsystemthat implementsthe pro-grammingmodel and one or more communicationlayers that have no know-ledgeof theparallel-programmingmodel,but thatprovide efficient communica-tion mechanisms.This thesishasconcentratedon the lower communicationlay-ers,in particularon thenetwork interface(NI) protocollayer. A key contributionof this thesis,however, is that it alsoaddressesthe interactionsof the NI proto-col layerwith higher-level communicationlayersandapplications.Thefollowingsectionssummarizeour work asit relatesto eachlayeranddraw conclusions.

9.1 LFC and NI Protocols

We have implementedour ideasin andon top of LFC, a new user-level commu-nicationsystem.In severalways,LFC resemblesa hardwaredevice: communi-cationis packet-basedandall packetsaredeliveredto a singleupcall. However,LFC addstwo servicesthatareusuallynot providedby network hardware: reli-ablepoint-to-pointandmulticastcommunication.The efficient implementationof bothserviceshasbeenthekey to LFC’s success.

Efficient communicationis of obvious importanceto parallel-programmingsystems(PPSs).While many communicationsystemsprovide low-latency, high-throughputpoint-to-pointprimitives,very few systemsprovideefficientmulticastimplementations.In fact,many donotprovidemulticastatall. Thisis unfortunate,becausePPSssuchasOrcaandMPI rely onmulticastingto updatereplicatedob-jectsandto implementcollective-communicationoperations,respectively. More-

237

Page 260: Communication Architectures for Parallel-ProgrammingSystems

238 SummaryandConclusions

over, multicastis notanadd-onfeature:multicastslayeredontopof point-to-pointprimitivesperformconsiderablyworsethanmulticaststhat aresupportedby thebottom-mostcommunicationlayer(seeChapters7 and8).

Thepresenceof reliablecommunicationgreatlyreducestheeffort neededtoimplementa PPS.Comparedto fragmentationanddemultiplexing —servicesnotprovidedby LFC— reliability protocolsaredifficult to implementcorrectlyandefficiently. Nevertheless,all PPSsmustdeliverdatareliably.

For efficiency, the default implementationof LFC usesNI support. The NIperformsfour tasks: flow control for reliability, multicastforwarding, interruptmanagement,and fetch-and-addprocessing.The following paragraphsdiscussthesetasksin moredetail.

The default implementationassumesreliable network hardware and imple-mentsreliablecommunicationchannelsby meansof a simple,NI-level flow con-trol protocol. This protocol,UCAST, is alsothe basisof MCAST andRECOV,LFC’s NI-supportedmulticastprotocols. Sincetheseprotocolsassumereliablehardware,however, they canoperateonly in a controlledenvironmentsuchasaclusterof computersdedicatedto runninghigh-performanceparalleljobs.

MCAST is simpleandefficient,but doesnotwork for all multicasttreetopolo-gies. RECOV works for all topologies,but is morecomplex andrequiresaddi-tional NI buffer space.Both MCAST andRECOV performfewer datatransfersthanhost-level store-and-forwardmulticastprotocols,which reducesthenumberof hostcyclesspenton multicastprocessing.TheperformancemeasurementsinChapter8 show that host-level multicastingreducesapplicationperformancebyup to 38%. Thesemeasurementscomparedtheperformanceof anNI-supportedmulticast to an agressivehost-level multicast. In practice,host-level multicastimplementationsarefrequentlyimplementedon top of unicastprimitives,whichfurtherincreasesthenumberof redundantdatatransfers.

Network interruptsform animportantsourceof overheadsin communicationsystems. To reducethe numberof unnecessaryinterrupts,LFC implementsasoftwarepolling watchdogon theNI. This mechanismdelaysnetwork interruptsfor incoming packets for a certainamountof time. In addition, LFC’s pollingwatchdogmonitorshost-level packet processingprogressto determinewhetheranetwork interruptshouldbegenerated.

We have implementedanNI-supportedfetch-and-addoperation.Pandacom-binesLFC’s fetch-and-addwith LFC’s broadcastto implementa totally-orderedbroadcast,which, in turn, is usedby Orca.Fetch-and-addalsohasotherapplica-tions.KaramchetiandChien,for example,haveusedfetch-and-addto implementpull-basedmessaging[76].

Page 261: Communication Architectures for Parallel-ProgrammingSystems

9.2Panda 239

Othertypesof synchronizationprimitivescanalsbenefitfrom NI support.Bi-laset al. usetheNI to implementdistributedlocking in a sharedvirtual memorysystem[21]. TheseexamplesindicatethatNI supportfor synchronizationprim-itives is a good idea. Without suchsupport,synchronizationrequestsgenerateexpensive interrupts. It is not clearyet, however, which primitivesexactly mustbe supported.Dif ferentsystemsrequiredifferentprimitivesand, in spiteof thegeneralityclaimedby the implementorsof somemechanisms[21], it is unlikelythatoneprimitivewill satisfyeverysystem.Anotherproblemis thevirtualizationof NI-supportedprimitives. LFC, for example,supportsonly onefetch-and-addvariableper NI. This constraintwasintroducedto boundthe spaceoccupiedbyNI variablesandto avoid introducingan extra demultiplexing step. Ideally, thistypeof constraintwouldbehiddenfrom users.ActiveMessagesII virtualizesNI-level communicationendpointsby usinghostmemoryasbackingstoragefor NImemory[37]. Thisstrategy is general,but increasesthecomplexity anddecreasestheefficiency of theimplementation.

It is frequentlyclaimedthatprogrammableNIs are’tooslow’; suchclaimsareoftenaccompaniedwith referencesto thefailureof I/O channels.LFC performsfour different taskson the NI: flow control for reliability, multicastforwarding,interrupt management,and fetch-and-addprocessing.Nevertheless,LFC is anefficient communicationsystem. The main issueis not whetherNIs shouldbeprogrammable,but which mechanismsthe lowest communicationlayer shouldsupplyandwherethey shouldbeimplemented.ProgrammableNIs canbeconsid-eredonetool in aspectrumof toolsthathelpanswerthis typeof questions;othersincludeformal analysisandsimulation.In this thesis,we studiedreliability, mul-ticast, synchronizationand interruptmanagement.Othershave investigatedNIsupportfor addresstranslationin zero-copy protocols[13, 49, 142], distributedlocking [21], andremotememoryaccess[51, 83]). Someof thesemechanisms(e.g., remotememoryaccess)have recentlyfound their way into industrystan-dards(theVirtual InterfaceArchitecture)andcommercialproducts.

9.2 Panda

Many parallel-programmingsystemsaresufficiently complex thatthey canbenefitfrom anintermediatecommunicationlayerthatprovideshigher-level communica-tion servicesthana systemlikeLFC. Pandaprovidesthreads,messages,messagepassing,RPC,andgroupcommunication.Theseservicesareusedin the imple-mentationof threeof thefour PPSsdescribedin Chapter7.

Page 262: Communication Architectures for Parallel-ProgrammingSystems

240 SummaryandConclusions

Panda’s threadpackage,OpenThreads,dynamicallyswitchesbetweenpollingandinterrupts. By default, interruptsareenabled,but whenall threadsareidle,OpenThreadsdisablesinterruptsandpollsthenetwork. Thisstrategy is simplebuteffective. Whena messageis expected,it will oftenbereceivedthroughpolling,which is moreefficient thanreceiving it by meansof an interrupt. Unexpectedmessagesgeneratean interrupt,unlessthe receiving processpolls beforeLFC’spolling watchdoggeneratestheinterrupt.

Panda’s streammessagesallow Pandaclientsto transmitandreceive a mes-sagein a pipelinedmanner:the receiver canstartprocessingan incomingmes-sagebeforeit hasbeenfully received. Streammessagesareimplementedon topof LFC’s packet-basedinterfacewithout introducingextradatacopies.

Blockingin messagehandlersis adifficult problem.Creatinga(popup)threadper messageallows messagehandlersto block whenever convenient,but it alsoremovesthe orderingbetweenmessages,introducesschedulinganomalies,andwastes(stack)space.Disallowing all handlersto block (as in active messages)forcesprogrammersto useasynchronousprimitivesandcontinuationswheneverany form of blocking (e.g., acquiringa lock) is required. Pandausessingle-threadedupcallsto processincomingmessages.Messagehandlersareallowedto block on locks, but cannotwait for the arrival of othermessages.If the re-ceiverof a messagecandecideearlythatwaiting for anothermessagewill not benecessary, thenthemessagecanbe processedwithout creatinga new thread. Inothercases,blockingcanbeavoidedby usingnonblockingprimitivesinsteadofblockingprimitivesor by usingcontinuations.

PandaandLFC work well together. Panda’s threadschedulerusesLFC’s in-terrupt supportto switch dynamicallybetweenpolling and interrupts. Streammessagespipeline the transmissionand consumptionof LFC packets. Finally,Panda’s totally-orderedbroadcastcombinesLFC’s fetch-and-addandbroadcastprimitives.

9.3 Parallel-Programming Systems

DifferentPPSshavedifferentcommunicationstylesandarethereforesensitive todifferentaspectsof theunderlyingcommunicationsystem.This is illustratedbythefour PPSsstudiedin Chapter7.

Orcausestwo communicationprimitivesto implementoperationson sharedobjects:remoteprocedurecall andasynchronous,totally-orderedbroadcast.Theperformanceof theOrcaapplicationsthatwestudiedin Chapter8 dependsonone

Page 263: Communication Architectures for Parallel-ProgrammingSystems

9.3Parallel-ProgrammingSystems 241

of thetwo. Orcaoperationsmayblock on guards.Sinceguardsoccuronly at thebeginning of an operation,it is easyto suspendthe operationwithout blocking.Without blockingandthreadswitching,theOrcaruntimesystemcreatesa smallcontinuationthatrepresentstheblockedoperation.

Like Orca,theJava-basedMantasystemprovidesmultithreadingandsharedobjects.SinceMantadoesnotreplicateobjects,it requiresonly RPCto implementoperationson sharedobjects.In contrastwith Orca,Mantaoperationscanblockhalfway through. This makesit difficult for the RTS to usesmall continuationsto representthe blocked operation. The RTS thereforecreatesa popupthreadfor eachoperation,unlessthe compiler canprove that the operationwill neverblock. Another complicationin Manta is the presenceof a garbagecollector.The spaceoccupiedby unmarshaledparametersusually cannotbe reuseduntiltheparametershave beengarbagecollected.Consequently, unmarshalingleavesa largememoryfootprintandpolluteslargepartsof thedatacache.

In CRL, mostcommunicationpatternsareroundtrips.Wheninvalidationsareneeded,nestedRPCsoccur. On average,CRL’s RPCsaresmallerthanthoseofMantaandOrca,becauseCRL frequentlysendssmall control messages(regionrequests,invalidations,andacknowledgements).MantaandOrcageneratesomecontroltraffic, but far lessthanCRL.

MPI hasthesimplestprogrammingmodelof thefour PPSsstudiedin this the-sis. In mostcases,communicationat theMPI level correspondsin a straightfor-wardway with communicationat lower levels. MPI’s collective-communicationoperationsform an importantexception. Thesehigh-level primitivesareusuallyimplementedby acombinationof point-to-pointandmulticastmessages.

While thesePPSsgeneratedifferentcommunicationpatterns,they often relyon the samecommunicationmechanisms.All PPSsrequirereliablecommuni-cation. Several low-level communicationsystems(e.g.,U-Net andBIP) do notprovide reliablecommunicationprimitives.

Orca,Manta,andCRL all benefitfrom LFC’s andPanda’s mechanismsto re-ducethe numberof network interrupts. OrcaandMantarely on Panda’s threadpackage,OpenThreads,to poll whenall threadsareidle. In, CRL thesameopti-mizationhasbeenhardcoded.

Orca and MPI benefit from an efficient multicast implementation. Orca’stotally-orderedbroadcastis layered(in Panda)on top of LFC’s broadcastandfetch-and-add.MPI’s collective-communicationoperationsuse(throughPanda)LFC’s broadcast.

Orca, Manta, and MPI all usePanda’s streammessages.Streammessagesprovideaconvenientandefficientmessageinterfacewithout introducinginterme-

Page 264: Communication Architectures for Parallel-ProgrammingSystems

242 SummaryandConclusions

diatedatacopies.TheOrcaandMantacompilersgenerateoperation-specificcodeto marshalto andfrom streammessages.

9.4 PerformanceImpact of DesignDecisions

Chapter8 studiedthe impact of different reliability andmulticastprotocolsonapplicationperformanceandshowed that the LFC implementationdescribedinChapters3 and4 (NrxImc) performsbetterthanalternative implementations.Thisimplementation,however, is agressivein two ways.First,it assumesreliablehard-ware,whichmeansit canbeusedonly in dedicatedenvironments.Second,it del-egatessometasks(flow control,multicast,fetch-and-add,andinterruptmanage-ment)to theNI. Thesetasksareperformedby theNI-level protocolsdescribedinChapter4. Additionalrobustnesscanbeobtainedby meansof retransmission.Weinvestigatedboth host-level andNI-level retransmission.The unicast-dominatedapplicationsof Chapter8 show that both typesof retransmissionhave a modestapplication-level cost. The largestoverhead(up to 16%) is incurredby applica-tions in the ’roundtrip’ class.Theperformancecostof moving NI-level compo-nentsto thehostcanbelarge.Performingmulticastforwardingon thehostratherthanon theNI slowsdown applicationsby up to 38%.

Chapter8 alsorevealedinterestinginteractionsbetweenthestructureof appli-cationsandthe implementationof the underlyingcommunicationsystem(s).InSOR,theexactlocation(hostor NI) of thesliding window protocolhada notice-ableimpacton performance.In ASP, we addedpolling statementsto reducethedelaysthat resultedfrom synchronoushost-level multicastforwarding. In somecases,applicationscanberestructuredto work aroundtheweaknessesof apartic-ular communicationsystem.ThePuzzleapplication,for example,canbeconfig-uredto usea larger aggregationbuffer on top of a host-level reliability protocolto reducetheimpactof increasedsendandreceiveoverhead.TheperformanceofSORwasimprovedby changingtheboundaryexchangephase.

9.5 Conclusions

Basedon our experienceswith LFC andthesystemslayeredon top of LFC, wedraw thefollowing conclusions:

1. Low-levelcommunicationsystemsshouldsupportbothpolling andinterrupt-drivenmessagedelivery. Westudiedfour PPSs.All of themusepolling and

Page 265: Communication Architectures for Parallel-ProgrammingSystems

9.6LimitationsandOpenIssues 243

threeof themuseinterrupts(Orca,Manta,andCRL). Transparentemula-tion of interrupt-drivenmessagedeliveryis nontrivial —severalsystemsusebinaryrewriting— or relieson interruptsasabackupmechanism.

2. Low-levelcommunicationsystemsshouldsupportmulticast.Multicastplaysan importantrole in two of the four PPSsstudied(OrcaandMPI). Multi-castimplementationsareoften layeredon top of point-to-pointprimitives.This alwaysintroducesunnecessarydatatransfersthatconsumecyclesthatcouldhavebeenusedby theapplication.Moreover, if multicastforwardingis implementedatthemessageratherthanthepacketlevel, thenthisstrategyalsoreducesthroughputby removing intramessagepipelining.

3. A variety of PPSscanbe implementedefficiently on a simplecommuni-cationsystemif all layerscooperateto avoid unnecessaryinterrupts,threadswitches,anddatacopies.LFCprovidesasimpleinterface:reliable,packet-basedpoint-to-pointandmulticastcommunicationand interruptmanage-ment. Thesemechanismsareusedto implementabstractionsthatpreserveefficiency. In Panda,they areusedto implementOpenThreads’s automaticswitching betweenpolling and interrupts,messagestreams,and totally-orderedmulticast. Orca,Manta,andMPI areall built on top of Panda.Inaddition,OrcaandMantauseoperation-specificmarshalingin combinationwith streammessagesto avoid makingextra datacopies.

4. An aggressive, NI-supportedcommunicationsystemcanyield betterper-formancethana traditionalcommunicationsystemwithoutNI support.Wehaveexperimentedwith a reliability protocol(UCAST), two multicastpro-tocols(MCAST andRECOV), apolling watchdog(INTR), andafetch-and-addprimitive. Additional robustnessandsimplicity canbeobtainedby sys-temsthatimplementall functionalityonthehost,but thesesystemsperformworse,bothatthelevel of microbenchmarksandat thelevel of applications.

9.6 Limitations and Open Issues

Although this thesishasa broadscope,several importantissueshave not beenaddressedandsomeissuesthathavebeenaddressedrequirefurtherresearch.

First, LFC lacks zero-copy support. Zero-copy mechanismsare basedonDMA ratherthan PIO and, for sufficiently large transfers,give betterthrough-put thanPIO (seeChapter2). Sender-side zero-copy supportis relatively easy

Page 266: Communication Architectures for Parallel-ProgrammingSystems

244 SummaryandConclusions

to implementanduse. For optimal performance,however, datashouldalsobemoved directly to its final destinationat the receiver side. Somesystems(e.g.,Hamlyn [32]) achieve this by having the senderspecifya destinationaddressinthe receiver’s addressspace.PPSsneedto be restructuredin nontrivial waystodeterminethis addresswithout introducingextra messages[21, 35]. While zero-copy communicationsupportfor PPSswarrantsfurther investigation,we arenotawareof any study that shows significantapplication-level performanceadvan-tagesof zero-copy communicationmechanismsovermessage-basedmechanisms.Thereare studiesthatshow thatparallelapplicationsarerelatively insensitive tothroughput[102].

Second,more considerationneedsto be given to differentbuffer and timerconfigurationsfor the variousLFC implementationsstudiedin Chapter8. Weselectedbuffer and timer settingsthat yield good performancefor eachimple-mentation.It is not completelyclearhow sensitive thedifferentimplementationsareto differentsettings.

Finally, theevaluationin Chapter8 focusedon theimpactof alternativerelia-bility andmulticastschemes.Additional work is neededto determinetheimpactof NI-supportedsynchronizationand interruptdelivery (i.e., the polling watch-dog). Someof our results(not presentedin this thesis),however, show a goodqualitativematchwith thosepresentedby Maquelinet al. [101].

Page 267: Communication Architectures for Parallel-ProgrammingSystems

Bibliography

[1] A. Agarwal, R. Bianchini, D. Chaiken,K.L. Johnson,D. Kranz,J. Kubi-atowicz, B.H. Lim, K. MacKenzie,andD. Yeung. TheMIT Alewife Ma-chine: ArchitectureandPerformance.In Proc. of the 22ndInt. Symp.onComputerArchitecture, pages2–13,SantaMargheritaLigure, Italy, June1995.

[2] A. Agarwal, J. Kubiatowicz, D. Kranz,B.H. Lim, D. Yeung,G. D’Souza,andM. Parkin.Sparcle:An EvolutionaryProcessorDesignfor Large-ScaleMultiprocessors.IEEE Micro, 13(3):48–61,June1993.

[3] A. Alexandrov, M.F. Ionescu,K.E. Schauser, andC. Scheiman.LogGP:IncorporatingLong Messagesinto theLogPModel – OneStepCloserTo-wardsa RealisticModel for Parallel Computation. In Proc. of the 1995Symp.on Parallel AlgorithmsandArchitectures, pages95–105,SantaBar-bara,CA, July1995.

[4] K.V. Anjan and T.M. Pinkston. An Efficient, Fully Adaptive DeadlockRecovery Scheme:DISHA. In Proc. of the22ndInt. Symp.on ComputerArchitecture, pages201–210,SantaMargheritaLigure, Italy, June1995.

[5] S.Araki, A. Bilas,C. Dubnicki, J.Edler, K. Konishi,andJ.Philbin. User-SpaceCommunication:A QuantitativeStudy. In Supercomputing’98, Or-lando,FL, November1998.

[6] M. Aron andP. Druschel. Soft Timers: Efficient MicrosecondSoftwareTimer Supportfor Network Processing.In Proc.of the17thSymp.on Op-eratingSystemsPrinciples, pages232–246,KiawahIslandResort,SC,De-cember1999.

[7] H.E. Bal. ProgrammingDistributedSystems. PrenticeHall InternationalLtd., HemelHempstead,UK, 1991.

245

Page 268: Communication Architectures for Parallel-ProgrammingSystems

246 Bibliography

[8] H.E. Bal and L.V. Allis. Parallel RetrogradeAnalysis on a DistributedSystem.In Supercomputing’95, SanDiego,CA, December1995.

[9] H.E.Bal, R.A.F. Bhoedjang,R.F.H. Hofman,C. Jacobs,K.G. Langendoen,T. Ruhl, andM.F. Kaashoek.PerformanceEvaluationof theOrcaSharedObjectSystem.ACM Trans.on ComputerSystems, 16(1):1–40,February1998.

[10] H.E. Bal, R.A.F. Bhoedjang,R.F.H. Hofman, C. Jacobs,K.G. Langen-doen,andK. Verstoep. Performanceof a High-Level Parallel LanguageonaHigh-SpeedNetwork. Journalof Parallel andDistributedComputing,40(1):49–64,February1997.

[11] H.E. Bal andM.F. Kaashoek.ObjectDistribution in OrcausingCompile-timeandRun-TimeTechniques.In Conf. onObject-OrientedProgrammingSystems,Languagesand Applications, pages162–177,Washington,DC,September1993.

[12] H.E. Bal, M.F. Kaashoek,andA.S. Tanenbaum.Orca: A LanguageforParallel Programmingof DistributedSystems. IEEE Trans.on SoftwareEngineering, 18(3):190–205,March1992.

[13] A. Basu,M. Welsh,andT. von Eicken. IncorporatingMemory Manage-mentinto User-Level Network Interfaces. In Hot Interconnects’97, Stan-ford, CA, April 1997.

[14] R.A.F. BhoedjangandK.G. Langendoen.FriendlyandEfficient MessageHandling. In Proc. of the 29th AnnualHawaii Int. Conf. on SystemSci-ences, pages121–130,Maui, HI, January1996.

[15] R.A.F. Bhoedjang,J.W. Romein,and H.E. Bal. Optimizing DistributedDataStructuresUsingApplication-SpecificNetwork InterfaceSoftware.InProc. of the1998Int. Conf. on Parallel Processing, pages485–492,Min-neapolis,MN, August1998.

[16] R.A.F. Bhoedjang,T. Ruhl, andH.E. Bal. Efficient Multicast on MyrinetUsingLink-Level Flow Control. In Proc.of the1998Int. Conf. onParallelProcessing, pages381–390,Minneapolis,MN, August1998.

Page 269: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 247

[17] R.A.F. Bhoedjang,T. Ruhl, andH.E. Bal. LFC: A CommunicationSub-stratefor Myrinet. In 4th AnnualConf. of theAdvancedSchool for Com-putingandImaging, pages31–37,Lommel,Belgium,June1998.

[18] R.A.F. Bhoedjang,T. Ruhl, andH.E. Bal. User-Level Network InterfaceProtocols.IEEE Computer, 31(11):53–60,November1998.

[19] R.A.F. Bhoedjang,T. Ruhl, R.F.H. Hofman,K.G. Langendoen,H.E. Bal,andM.F. Kaashoek.Panda:A PortablePlatformto SupportParallelPro-grammingLanguages.In Proc.of theUSENIXSymp.on ExperienceswithDistributedandMultiprocessorSystems(SEDMSIV), pages213–226,SanDiego,CA, September1993.

[20] A. Bilas, D. Jiang,Y. Zhou, and J.P. Singh. Limits to the Performanceof Software SharedMemory: A LayeredApproach. In Proc. of the 5thInt. Symp.on High-PerformanceComputerArchitecture, pages193–202,Orlando,FL, January1999.

[21] A. Bilas,C.Liao,andJ.P. Singh.UsingNetwork InterfaceSupportto AvoidAsynchronousProtocolProcessingin SharedVirtual MemorySystems.InProc. of the 26th Int. Symp.on ComputerArchitecture, pages282–293,Atlanta,GA, May 1999.

[22] K.P. Birman. TheProcessGroupApproachto ReliableDistributedCom-puting. Communicationsof theACM, 36(12):37–53,December1993.

[23] A.D. Birrell andB.J.Nelson.ImplementingRemoteProcedureCalls.ACMTrans.on ComputerSystems, 2(1):39–59,February1984.

[24] M.A. Blumrich,R.D.Alpert, Y. Chen,D.W. Clark,S.N.Damianakis,E.W.Felten,L. Iftode, K. Li, M. Martonosi,andR.A. Shillner. DesignChoicesin the SHRIMP System: An Empirical Study. In Proc. of the 25th Int.Symp.on ComputerArchitecture, pages330–341,Barcelona,Spain,June1998.

[25] M.A. Blumrich,C. Dubnicki,E.W. Felten,andK. Li. Protected,User-levelDMA for theSHRIMPNetwork Interface.In Proc.of the2ndInt. Symp.onHigh-PerformanceComputerArchitecture, pages154–165,SanJose,CA,February1996.

Page 270: Communication Architectures for Parallel-ProgrammingSystems

248 Bibliography

[26] M.A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E.W. Felten,andJ. Sand-berg. Virtual MemoryMappedNetwork Interfacefor theSHRIMPmulti-computer. In Proc.of the21stInt. Symp.on ComputerArchitecture, pages142–153,Chicago,IL, April 1994.

[27] N.J. Boden,D. Cohen,R.E. Felderman,A.E. Kulawik, C.L. Seitz, J.N.Seizovic, andW. Su. Myrinet: A Gigabit-per-secondLocalAreaNetwork.IEEE Micro, 15(1):29–36,February1995.

[28] E.A. Brewer, F.T. Chong,L.T. Liu, S.D. Sharma,andJ.D. Kubiatowicz.RemoteQueues:ExposingMessageQueuesfor OptimizationandAtomic-ity. In Proc. of the1995Symp.on Parallel AlgorithmsandArchitectures,pages42–53,SantaBarbara,CA, July1995.

[29] A. Brown andM. Seltzer. OperatingSystemBenchmarkingin theWakeofLmbench:A CaseStudyof the Performanceof NetBSDon the Intel x86Architecture. In Proc. of the 1997Conf. on Measurementand Modelingof ComputerSystems(SIGMETRICS), pages214–224,Seattle,WA, June1997.

[30] J. Bruck, L. De Coster, N. Dewulf, C.-T. Ho, andR. Lauwereins.On theDesignandImplementationof BroadcastandGlobalCombineOperationsUsingthePostalModel. IEEE Trans.on Parallel andDistributedSystems,7(3):256–265,March1996.

[31] M. Buchanan.Privatecommunication,1997.

[32] G. Buzzard,D. Jacobson,M. MacKey, S.Marovich, andJ.Wilkes.An Im-plementationof the Hamlyn Sender-managedInterfaceArchitecture. In2nd USENIX Symp.on Operating SystemsDesign and Implementation,pages245–259,Seattle,WA, October1996.

[33] J.Carreira,J.GabrielSilva,K.G. Langendoen,andH.E.Bal. ImplementingTuple Spacewith Threads. In 1st Int. Conf. on Parallel and DistributedSystems, pages259–264,Barcelona,Spain,June1997.

[34] C.-C.Chang,G. Czajkowski, C. Hawblitzel, andT. von Eicken. Low La-tency Communicationon the IBM RS/6000SP. In Supercomputing’96,Pitssburgh,PA, November1996.

Page 271: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 249

[35] C.-C. ChangandT. von Eicken. A SoftwareArchitecturefor Zero-CopyRPCin Java. TechnicalReport98-1708,Dept.of ComputerScience,Cor-nell University, Ithaca,NY, September1998.

[36] J.-D.Choi, M. Gupta,M. Serrano,V.C. Sreedhar, andS. Midkif f. EscapeAnalysisfor Java.In Conf. onObject-OrientedProgrammingSystems,Lan-guagesandApplications, pages1–19,Denver, CO,November1999.

[37] B. Chun,A. Mainwaring,andD.E.Culler. Virtual Network TransportPro-tocolsfor Myrinet. In Hot Interconnects’97, Stanford,CA, April 1997.

[38] D.D. Clark. TheStructuringof SystemsUsingUpcalls.In Proc.of the10thSymp.onOperatingSystemsPrinciples, pages171–180,OrcasIsland,WA,December1985.

[39] D.E. Culler, A. Dusseau,S.C. Goldstein,A. Krishnamurty, S. Lumetta,T. von Eicken,andK. Yelick. ParallelProgrammingin Split-C. In Super-computing’93, pages262–273,Portland,OR,November1993.

[40] D.E. Culler, R. Karp, D. Patterson,A. Sahay, K.E. Schauser, E. Santos,R. Subramonian,andT. von Eicken. LogP:Towardsa RealisticModel ofParallelComputation.In 4th Symp.on PrinciplesandPracticeof ParallelProgramming, pages1–12,SanDiego,CA, May 1993.

[41] D.E. Culler, L.T. Liu, R.P. Martin, andC.O. Yoshikawa. AssessingFastNetwork Interfaces.IEEEMicro, 16(1):35–43,February1996.

[42] L. DagumandR. Menon.OpenMP:An Industry-StandardAPI for Shared-Memory Programming. IEEE ComputationalScienceand Engineering,5(1):46–55,January/March1998.

[43] W.J.Dally andC.L.Seitz.Deadlock-FreeMessageRoutingin Multiproces-sorInterconnectionNetworks. IEEETrans.onComputers, 36(5):547–553,May 1987.

[44] E.W. Dijkstra. GuardedCommands. Communicationsof the ACM,18(8):453–457,August1975.

[45] J.J.Dongarra,S.W. Otto, M. Snir, and D.W. Walker. A MessagePass-ing Standardfor MPP and Workstations. Communicationsof the ACM,39(7):84–90,July1996.

Page 272: Communication Architectures for Parallel-ProgrammingSystems

250 Bibliography

[46] R.P. Draves,B.N. Bershad,R.F. Rashid,andR.W. Dean.UsingContinua-tionsto ImplementThreadManagementandCommunicationin OperatingSystems.In Proc.of the13thSymp.onOperatingSystemsPrinciples, pages122–136,Asilomar, Pacific Grove,CA, October1991.

[47] P. DruschelandG. Banga. Lazy Receiver Processing(LRP): A NetworkSubsystemArchitecturefor Server Systems. In 2nd USENIXSymp.onOperating SystemsDesignand Implementation, pages261–275,Seattle,WA, October1996.

[48] P. Druschel,L.L. Peterson,andB.S.Davie. Experienceswith aHigh-SpeedNetwork Adaptor: A SoftwarePerspective. In Proc. of the1994Conf. onCommunicationsArchitectures,Protocols,andApplications(SIGCOMM),pages2–13,London,UK, September1994.

[49] C. Dubnicki, A. Bilas, Y. Chen,S. Damianakis,and K. Li. VMMC-2:Efficient Supportfor Reliable,Connection-OrientedCommunication. InHot Interconnects’97, Stanford,CA, April 1997.

[50] C. Dubnicki,A. Bilas,K. Li, andJ.Philbin. DesignandImplementationofVirtual Memory-MappedCommunicationonMyrinet. In 11thInt. ParallelProcessingSymp., pages388–396,Geneva,Switzerland,April 1997.

[51] D. Dunning,G. Regnier, G. McAlpine, D. Cameron,B. Shubert,F. Berry,A.M. Merritt, E. Gronke,andC. Dodd. TheVirtual InterfaceArchitecture.IEEE Micro, 18(2):66–76,March–April 1998.

[52] A. Fekete, M. F. Kaashoek,and N. Lynch. ImplementingSequentiallyConsistentSharedObjectsUsing Broadcastand Point-to-PointCommu-nication.In Proc.of the15thInt. Conf. onDistributedComputingSystems,pages439–449,Vancouver, Canada,May 1995.

[53] MPI Forum. MPI: A MessagePassingInterfaceStandard.Int. Journal ofSupercomputingApplications, 8(3/4),1994.

[54] I. Foster, C. Kesselman,andS. Tuecke. The NEXUS Approachto Inte-gratingMultithreadingandCommunication.Journal of Parallel andDis-tributedComputing, 37(2):70–82,February1996.

[55] M. Frigo, C.E. Leiserson,andK.H. Randall. The Implementationof theCilk-5 MultithreadedLanguage. In Proc. of the Conf. on Programming

Page 273: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 251

LanguageDesignandImplementation, pages212–223,Montreal,Canada,June1998.

[56] M. Gerla,P. Palnati,andS.Walton.MulticastingProtocolsfor High-Speed,Wormhole-RoutingLocal Area Networks. In Proc. of the 1996Conf. onCommunicationsArchitectures,Protocols,andApplications(SIGCOMM),pages184–193,StanfordUniversity, CA, August1996.

[57] R.B.Gillett. MemoryChannelfor PCI. IEEEMicro, 15(1):12–18,February1996.

[58] S.C.Goldstein,K.E. Schauser, andD.E.Culler. LazyThreads:Implement-ing a FastParallel Call. Journal of Parallel and DistributedComputing,37(1):5–20,August1996.

[59] G.H. GolubandC. F. vanLoan. Matrix Computations. TheJohnHopkinsUniversityPress,3rd edition,1996.

[60] J. Gosling, B. Joy, and G. Steele. The Java Language Specification.Addison-Wesley, Reading,MA, 1996.

[61] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance,PortableImplementationof the MPI MessagePassingInterfaceStandard.Parallel Computing, 22(6):789–828,September1996.

[62] The Pandagroup. ThePanda4.0 InterfaceDocument. Dept. of Mathe-maticsandComputerScience,Vrije Universiteit,April 1998. On-line athttp://www.cs.vu.nl/panda/panda4/panda.html.

[63] M.D. Haines. An Open Implementation Analysis and Design ofLightweightThreads.In Conf. on Object-OrientedProgrammingSystems,LanguagesandApplications, pages229–242,Atlanta,GA, October1997.

[64] M.D. HainesandK.G. Langendoen.Platform-IndependentRuntimeOp-timizationsUsing OpenThreads.In 11th Int. Parallel ProcessingSymp.,pages460–466,Geneva,Switzerland,April 1997.

[65] H.P. Heinzle, H.E. Bal, and K.G. Langendoen. ImplementingObject-BasedDistributedSharedMemory on Transputers.In TransputerAppli-cationsand Systems’94, pages390–405,Villa Erba, Cernobbio,Como,Italy, September1994.

Page 274: Communication Architectures for Parallel-ProgrammingSystems

252 Bibliography

[66] W.C.Hsieh,K.L. Johnson,M.F. Kaashoek,D.A. Wallach,andW.E.Weihl.Efficient Implementationof High-Level Languageson User-Level Com-municationArchitectures.TechnicalReportMIT/LCS/TR-616,MIT Lab-oratoryfor ComputerScience,May 1994.

[67] W.C. Hsieh,M.F. Kaashoek,andW.E. Weihl. DynamicComputationMi-grationin DSM Systems.In Supercomputing’96, Pitssburgh,PA, Novem-ber1996.

[68] Y. HuangandP.M. McKinley. Efficient Collective Operationswith ATMNetwork InterfaceSupport. In Proc. of the 1996 Int. Conf. on ParallelProcessing, volumeI, pages34–43,Bloomingdale,IL, August1996.

[69] G. Iannello,M. Lauria,andS.Mercolino. Cross-PlatformAnalysisof FastMessagesfor Myrinet. In Proc. of the Workshopon Communication,Ar-chitecture, andApplicationsfor Network-basedParallel Computing(LNCS1362), pages217–231,LasVegas,NV, January1998.

[70] Intel Corporation.PentiumPro FamilyDeveloper’sManual, 1995.

[71] Intel Corporation. PentiumPro Family Developer’s Manual, Volume3:OperatingSystemWriter’sGuide, 1995.

[72] SPARC International.TheSPARCArchitectureManual: Version8, 1992.

[73] K.L. Johnson.High-PerformanceAll-Software DistributedShared Mem-ory. PhDthesis,MIT Laboratoryfor ComputerScience,Cambridge,USA,December1995.TechnicalReportMIT/LCS/TR-674.

[74] K.L. Johnson,M.F. Kaashoek,andD.A. Wallach.CRL: High-PerformanceAll-SoftwareDistributedSharedMemory. In Proc. of the 15th Symp.onOperatingSystemsPrinciples, pages213–226,CopperMountain,CO,De-cember1995.

[75] M.F. Kaashoek.GroupCommunicationin DistributedComputerSystems.PhDthesis,Vrije UniversiteitAmsterdam,1992.

[76] V. KaramchetiandA.A. Chien.A Comparisonof ArchitecturalSupportforMessagingin theTMC CM-5 andtheCrayT3D. In Proc.of the22ndInt.Symp.onComputerArchitecture, pages298–307,SantaMargheritaLigure,Italy, June1995.

Page 275: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 253

[77] R.M. Karp, A. Sahay, E.E.Santos,andK.E. Schauser. OptimalBroadcastandSummationin theLogPModel. In Proc.of the1993Symp.onParallelAlgorithmsandArchitectures, pages142–153,Velen,Germany, June1993.

[78] P. Keleher, A.L. Cox, S. Dwarkadas,andW. Zwaenepoel.TreadMarks:DistributedSharedMemoryon StandardWorkstationsandOperatingSys-tems.In Proc.of theWinter 1994UsenixConf., pages115–131,SanFran-cisco,CA, January1994.

[79] D. Keppel. Register Windows and User SpaceThreadson the SPARC.TechnicalReportUWCSE91-08-01,Dept.of ComputerScienceandEn-gineering,Universityof Washington,Seattle,WA, August1991.

[80] R. Kesavan and D.K. Panda. Optimal Multicast with PacketizationandNetwork InterfaceSupport. In Proc. of the 1997 Int. Conf. on ParallelProcessing, pages370–377,Bloomingdale,IL, August1997.

[81] C.H.Koelbel,D.B. Loveman,R. Schreiber, JrG.L. Steele,andM.E. Zosel.TheHigh PerformanceFortran Handbook. MIT Press,Cambridge,MA,1994.

[82] L.I. KontothanassisandM.L. Scott. UsingMemory-MappedNetwork In-terfaceto ImprovethePerformanceof DistributedSharedMemory. In Proc.of the2ndInt. Symp.on High-PerformanceComputerArchitecture, pages166–177,SanJose,CA, February1996.

[83] A. Krishnamurthy, K.E. Schauser, C.J.Scheiman,R.Y. Wang,D.E. Culler,andK. Yelick. Evaluationof ArchitecturalSupportfor Global Address-BasedCommunicationin Large-ScaleParallel Machines. In Proc. of the7th Int. Conf. on Architectural Supportfor ProgrammingLanguagesandOperatingSystems, pages37–48,Cambridge,MA, October1996.

[84] V. Kumar, A. Grama,A. Gupta,andG. Karypsis. Introductionto ParallelComputing:DesignandAnalysisof Algorithms. TheBenjamin/CummingsPublishingCompany, Inc.,RedwoodCity, CA, 1994.

[85] J.Kuskin,D. Ofelt, M. Heinrich,J.Heinlein,R. Simoni,K. Gharachorloo,J.Chapin,D. Nakahira,J.Baxter, M. Horowitz, A. Gupta,M. Rosenblum,andJ.Hennessy. TheStanfordFLASH Multiprocessor.In Proc.of the21stInt. Symp.on ComputerArchitecture, pages302–313,Chicago,IL, April1994.

Page 276: Communication Architectures for Parallel-ProgrammingSystems

254 Bibliography

[86] L. Lamport. How to Make a MultiprocessorComputerThatCorrectlyEx-ecutesMultiprocessPrograms.IEEE Trans.on Computers, C-28(9):690–691,September1979.

[87] K. Langendoen,R.A.F. Bhoedjang,andH.E. Bal. AutomaticDistributionof SharedDataObjects.In Third Workshopon Languages,Compilers andRun-TimeSystemsfor ScalableComputers, Troy, NY, May 1995.

[88] K.G. Langendoen,R.A.F. Bhoedjang,and H.E. Bal. Models for Asyn-chronousMessageHandling. IEEE Concurrency, 5(2):28–37,April–June1997.

[89] K.G. Langendoen,J. Romein,R.A.F. Bhoedjang,andH.E. Bal. Integrat-ing Polling, Interrupts,and ThreadManagement. In The 6th Symp.ontheFrontiersof MassivelyParallel Computation, pages13–22,Annapolis,MD, October1996.

[90] M. LauriaandA.A. Chien.MPI-FM: High PerformanceMPI on Worksta-tion Clusters.Journal of Parallel andDistributedComputing, 40(1):4–18,January1997.

[91] C.E.Leiserson,Z.S.Abuhamdeh,D.C.Douglas,C.R.Feynman,M.N. Gan-mukhi, J.V. Hill, W.D. Hillis, B.C. Kuszmaul,M.A. St.Pierre,D.S.Wells,M.C. Wong-Chan,S.-W. Yang,andR. Zak. TheNetwork Architectureofthe ConnectionMachineCM-5. In Proc. of the 1992Symp.on ParallelAlgorithmsandArchitectures, pages272–285,SanDiego,CA, June1992.

[92] K. Li andP. Hudak. Memory Coherencein SharedVirtual Memory Sys-tems.ACM Trans.onComputerSystems, 7(4):321–359,November1989.

[93] C. Liao, D. Jiang,L. Iftode, M. Martonosi,andD.W. Clark. MonitoringSharedVirtual Memory Performanceon a Myrinet-basedPC Cluster. InSupercomputing’98, Orlando,FL, November1998.

[94] J. Liedtke. On Micro-Kernel Construction. In Proc. of the 15th Symp.on Operating SystemsPrinciples, pages237–250,CopperMountain,CO,December1995.

[95] T. Lindholm and F. Yellin. The Java Virtual Machine Specification.Addison-Wesley, Reading,MA, 1997.

Page 277: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 255

[96] P. Lopez,J.M. Martınez,andJ.Duato. A Very Efficient DistributedDead-lock DetectionMechanismfor WormholeNetworks. In Proc. of the 4thInt. Symp.onHigh-PerformanceComputerArchitecture, pages57–66,LasVegas,NV, February1998.

[97] H. Lu, S.Dwarkadas,A.L. Cox,andW. Zwaenepoel.QuantifyingthePer-formanceDifferencesbetweenPVM andTreadMarks.Journal of ParallelandDistributedComputing, 43(2):65–78,June1997.

[98] J. MaassenandR. van Nieuwpoort. FastParallel Java. Master’s thesis,Dept.of MathematicsandComputerScience,Vrije Universiteit,Amster-dam,TheNetherlands,August1998.

[99] J.Maassen,R.vanNieuwpoort,R.Veldema,H.E.Bal, andA. Plaat.An Ef-ficient Implementationof Java’s RemoteMethodInvocation. In 7th Symp.on PrinciplesandPracticeof Parallel Programming, pages173–182,At-lanta,GA, May 1999.

[100] A.M. Mainwaring,B.N. Chun,S. Schleimer, andD.S.Wilkerson.SystemAreaNetwork Mapping.In Proc.of the1997Symp.onParallel AlgorithmsandArchitectures, pages116–126,Newport,RI, June1997.

[101] O. Maquelin,G.R.Gao,H.H.J.Hum,K.B. Theobald,andX. Tian. PollingWatchdog:CombiningPolling andInterruptsfor Efficient MessageHan-dling. In Proc. of the 23rd Int. Symp.on ComputerArchitecture, pages179–188,Philadelphia,PA, May 1996.

[102] R.P. Martin,A.M. Vahdat,D.E.Culler, andT.E.Anderson.Effectsof Com-municationLatency, Overhead,andBandwidthin a ClusterArchitecture.In Proc. of the 24th Int. Symp.on ComputerArchitecture, pages85–97,Denver, CO,June1997.

[103] M. Martonosi,D. Ofelt, andM. Heinrich. IntegratingPerformanceMon-itoring andCommunicationin Parallel Computers. In Proc. of the 1996Conf. on Measurementand Modeling of ComputerSystems(SIGMET-RICS), pages138–147,Philadelphia,PA, May 1996.

[104] H. McGhanandM. O’Connor. PicoJava: A Direct ExecutionEngineforJavaBytecode.IEEE Computer, 31(10):22–30,October1998.

Page 278: Communication Architectures for Parallel-ProgrammingSystems

256 Bibliography

[105] E. Mohr, D.A. Kranz, andR.H. Halstead. Lazy TaskCreation: A Tech-niquefor IncreasingtheGranularityof ParallelPrograms.IEEE Trans.onParallel andDistributedSystems, 2(3):264–280,July1991.

[106] D. MosbergerandL.L. Peterson.CarefulProtocolsor How to UseHighlyReliableNetworks. In Proc.of theFourthWorkshoponWorkstationOper-ating Systems, pages80–84,Napa,CA, October1993.

[107] S.S.Mukherjee,B. Falsafi,M.D. Hill, andD.A. Wood. CoherentNetworkInterfacesfor Fine-GrainCommunication.In Proc.of the23rd Int. Symp.on ComputerArchitecture, pages247–258,Philadelphia,PA, May 1996.

[108] S.S.MukherjeeandM.D. Hill. TheImpactof DataTransferandBufferingAlternativeson Network InterfaceDesign. In Proc. of the 4th Int. Symp.on High-PerformanceComputerArchitecture, pages207–218,LasVegas,NV, February1998.

[109] G. Muller, R. Marlet,E.-N.Volanschi,C. Consel,C. Pu,andA. Goel.Fast,OptimizedSunRPCUsing Automatic ProgramSpecialization. In Proc.of the18th Int. Conf. on DistributedComputingSystems, pages240–249,Amsterdam,TheNetherlands,May 1998.

[110] B. Nichols, B. Buttlar, and J. Proulx Farrell. PthreadsProgramming.O’Reilly & Associates,Inc.,Newton,MA, 1996.

[111] M. Oey, K. Langendoen,andH.E.Bal. ComparingKernel-SpaceandUser-SpaceCommunicationProtocolsonAmoeba.In Proc.of the15thInt. Conf.on DistributedComputingSystems, pages238–245,Vancouver, Canada,May 1995.

[112] S. Pakin, V. Karamcheti,andA.A. Chien. FastMessages(FM): Efficient,PortableCommunicationfor WorkstationClustersandMassively ParallelProcessors.IEEE Concurrency, 5(2):60–73,April–June1997.

[113] S. Pakin, M. Lauria, andA.A. Chien. High PerformanceMessagingonWorkstations:Illinois FastMessages(FM) for Myrinet. In Supercomputing’95, SanDiego,CA, December1995.

[114] D. Perkovic andP.J.Keleher. Responsivenesswithout Interrupts.In Proc.of theInt. Conf. onSupercomputing, pages101–108,Rhodes,Greece,June1999.

Page 279: Communication Architectures for Parallel-ProgrammingSystems

Bibliography 257

[115] L.L. PetersonandB.S.Davie. ComputerNetworks:A SystemsApproach.MorganKaufmannPublishers,Inc.,SanFrancisco,CA, 1996.

[116] L.L. Peterson,N. Hutchinson,S. O’Malley, andH. Rao. Thex-kernel: APlatformfor AccessingInternetResources.IEEE Computer, 23(5):23–33,May 1990.

[117] M. PhilippsenandM. Zenger. JavaParty— TransparentRemoteObjectsinJava.Concurrency:PracticeandExperience, pages1225–1242,November1997.

[118] L. Prylli andB. Tourancheau.ProtocolDesignfor High PerformanceNet-working: A Myrinet Experience.TechnicalReport97-22,LIP-ENSLyon,July1997.

[119] S.K. Reinhardt,J.R.Larus,andD.A. Wood. TempestandTyphoon:User-Level SharedMemory. In Proc.of the21stInt. Symp.on ComputerArchi-tecture, pages325–337,Chicago,IL, April 1994.

[120] S.H.Rodrigues,T.E.Anderson,andD.E.Culler. High-PerformanceLocal-Area CommunicationWith Fast Sockets. In USENIX Technical Conf.,pages257–274,Anaheim,CA, January1997.

[121] J.W. Romein,H.E Bal, andD. Grune. An Application DomainSpecificLanguagefor DescribingBoardGames.In Parallel andDistributedPro-cessingTechniquesandApplications, volumeI, pages305–314,LasVegas,NV, July1997.

[122] J.W. Romein,A. Plaat,H.E. Bal, andJ. Schaeffer. TranspositionDrivenWork Schedulingin Distributed Search. In AAAI National Conference,pages725–731,Orlando,FL, July1999.

[123] T. Ruhl andH.E. Bal. A PortableCollective CommunicationLibrary us-ing CommunicationSchedules.In Proc. of the 5th Euromicro Workshopon Parallel and DistributedProcessing, pages297–306,London,UnitedKingdom,January1997.

[124] T. Ruhl, H.E. Bal, R. Bhoedjang,K.G. Langendoen,andG. Benson.Ex-periencewith a PortabilityLayer for ImplementingParallelProgrammingSystems. In The1996Int. Conf. on Parallel and DistributedProcessing

Page 280: Communication Architectures for Parallel-ProgrammingSystems

258 Bibliography

Techniquesand Applications, pages1477–1488,Sunnyvale, CA, August1996.

[125] D.J. Scales, K. Gharachorloo, and C.A. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proc. of the 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 174–185, Cambridge, MA, October 1996.

[126] I. Schoinas and M.D. Hill. Address Translation Mechanisms in Network Interfaces. In Proc. of the 4th Int. Symp. on High-Performance Computer Architecture, pages 219–230, Las Vegas, NV, February 1998.

[127] J.T. Schwartz. Ultracomputers. ACM Trans. on Programming Languages and Systems, 2(4):484–521, October 1980.

[128] R. Sivaram, R. Kesavan, D.K. Panda, and C.B. Stunkel. Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch? In Proc. of the 1998 Int. Conf. on Parallel Processing, pages 452–459, Minneapolis, MN, August 1998.

[129] M. Snir, P. Hochschild, D.D. Frye, and K.J. Gildea. The Communication Software and Parallel Environment of the IBM SP2. IBM Systems Journal, 34(2):205–221, 1995.

[130] E. Speight, H. Abdel-Shafi, and J.K. Bennett. Realizing the Performance Potential of the Virtual Interface Architecture. In Proc. of the Int. Conf. on Supercomputing, pages 184–192, Rhodes, Greece, June 1999.

[131] E. Speight and J.K. Bennett. Using Multicast and Multithreading to Reduce Communication in Software DSM Systems. In Proc. of the 4th Int. Symp. on High-Performance Computer Architecture, pages 312–323, Las Vegas, NV, February 1998.

[132] E. Spertus, S.C. Goldstein, K.E. Schauser, T. von Eicken, D.E. Culler, and W.J. Dally. Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5. In Proc. of the 20th Int. Symp. on Computer Architecture, pages 302–313, San Diego, CA, May 1993.

[133] P.A. Steenkiste. A Systematic Approach to Host Interface Design for High-Speed Networks. IEEE Computer, 27(3):47–57, March 1994.


[134] D. Stodolsky, J.B. Chen, and B. Bershad. Fast Interrupt Priority Management in Operating Systems. In Proc. of the USENIX Symp. on Microkernels and Other Kernel Architectures, pages 105–110, San Diego, September 1993.

[135] C.B. Stunkel, J. Herring, B. Abali, and R. Sivaram. A New Switch Chip for IBM RS/6000 SP Systems. In Supercomputing '99, Portland, OR, November 1999.

[136] C.B. Stunkel, R. Sivaram, and D.K. Panda. Implementing Multidestination Worms in Switch-based Parallel Systems: Architectural Alternatives and their Impact. In Proc. of the 24th Int. Symp. on Computer Architecture, pages 50–61, Denver, CO, June 1997.

[137] V.S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315–339, December 1990.

[138] A.S. Tanenbaum. Computer Networks. Prentice Hall PTR, Upper Saddle River, NJ, 3rd edition, 1996.

[139] A.S. Tanenbaum, R. van Renesse, H. van Staveren, G.J. Sharp, S.J. Mullender, A.J. Jansen, and G. van Rossum. Experiences with the Amoeba Distributed Operating System. Communications of the ACM, 33(2):46–63, December 1990.

[140] H. Tang, K. Shen, and T. Yang. Compile/Run-Time Support for Threaded MPI Execution on Multiprogrammed Shared Memory Machines. In 7th Symp. on Principles and Practice of Parallel Programming, pages 107–118, Atlanta, GA, May 1999.

[141] H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System Coordinated High-Performance Communication Library. In High-Performance Computing and Networking (LNCS 1225), pages 708–717, Vienna, Austria, April 1997.

[142] H. Tezuka, F. O'Carroll, A. Hori, and Y. Ishikawa. Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. In 12th Int. Parallel Processing Symp., pages 308–314, Orlando, FL, March 1998.


[143] C.A. Thekkath and H.M. Levy. Hardware and Software Support for Efficient Exception Handling. In Proc. of the 6th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 110–119, San Jose, CA, October 1994.

[144] R. van Renesse, K.P. Birman, and S. Maffeis. Horus, a Flexible Group Communication System. Communications of the ACM, 39(4):76–83, April 1996.

[145] R. Veldema. Jcc, a Native Java Compiler. Master's thesis, Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands, August 1998.

[146] K. Verstoep, K.G. Langendoen, and H.E. Bal. Efficient Reliable Multicast on Myrinet. In Proc. of the 1996 Int. Conf. on Parallel Processing, volume III, pages 156–165, Bloomingdale, IL, August 1996.

[147] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proc. of the 15th Symp. on Operating Systems Principles, pages 303–316, Copper Mountain, CO, December 1995.

[148] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Proc. of the 19th Int. Symp. on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.

[149] T. von Eicken and W. Vogels. Evolution of the Virtual Interface Architecture. IEEE Computer, 31(11):61–68, November 1998.

[150] D.A. Wallach, W.C. Hsieh, K.L. Johnson, M.F. Kaashoek, and W.E. Weihl. Optimistic Active Messages: A Mechanism for Scheduling Communication with Computation. In 5th Symp. on Principles and Practice of Parallel Programming, pages 217–226, Santa Barbara, CA, July 1995.

[151] H. Wasserman, O.M. Lubeck, Y. Luo, and F. Bassetti. Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications. In Supercomputing '97, San Jose, CA, November 1997.

[152] A. Wollrath, J. Waldo, and R. Riggs. Java-Centric Distributed Computing. IEEE Micro, 17(3):44–53, May/June 1997.


[153] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. of the 22nd Int. Symp. on Computer Architecture, pages 24–36, Santa Margherita Ligure, Italy, June 1995.

[154] K. Yocum, J. Chase, A. Gallatin, and A. Lebeck. Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging. In The 6th Int. Symp. on High Performance Distributed Computing, pages 243–252, Portland, OR, August 1997.

[155] Y. Zhou, L. Iftode, and K. Li. Performance Evaluation of Two Home-Based Lazy Release-Consistency Protocols for Shared Virtual Memory Systems. In 2nd USENIX Symp. on Operating Systems Design and Implementation, pages 75–88, Seattle, WA, October 1996.


Appendix A

Deadlock issues

As discussed in Section 4.2, LFC's basic multicast protocol cannot use arbitrary multicast trees. With chains, for example, we can create a buffer deadlock (see Figure 4.8). Below, in Section A.1, we derive (informally) a sufficient condition for deadlock-free multicasting in LFC. The binary trees used by LFC satisfy this condition. The condition applies only to multicasting within a single multicast group. The example in Section A.2 shows that the condition does not guarantee deadlock-free multicasting in overlapping multicast groups.

A.1 Deadlock-Free Multicasting in LFC

Before deriving a condition for deadlock-free multicasting, we first consider the nature of deadlocks in LFC. Recall that LFC partitions each NI's receive buffer space among all possible senders (see Section 4.1). That is, each NI has a separate receive buffer pool for each sender.

A deadlock in LFC consists of a cycle C of NIs such that for each NI S with successor R (both in C):

1. S has a nonempty blocked-sends queue BSQ(S→R) for R, and

2. R has a full NI receive buffer pool RBP(S→R) for S.

We assume that hosts are not involved in the deadlock cycle. This is true only if all hosts regularly drain the network by supplying free host receive buffers to their NI. This requirement, however, is part of the contract between LFC and its clients (see Section 3.6). If all hosts drain the network, the full receive pools in the deadlock cycle do not contain packets that are waiting (only) for a free host receive buffer.


Consequently, each packet p in a full receive pool RBP(S→R) is waiting to be forwarded. That is, p is a multicast packet that has travelled from S to R and is enqueued on at least one blocked-sends queue involved in a deadlock cycle (not necessarily C). Finally, observe that if RBP(S→R) is part of deadlock cycle C, then at least one of the packets in RBP(S→R) has to be enqueued on the blocked-sends queue to the next NI in C (i.e., to the successor of R).

We now formulate a sufficient condition for deadlock-free multicasting in LFC. We assume the system consists of P NIs, numbered 0, ..., P-1. The 'distance' from NI m to NI n is defined as (n + P - m) mod P: the number of hops needed to reach n from m if all NIs are organized in a unidirectional ring. In a multicast group G, the set of multicast trees T_G (one per group member) cannot yield deadlock if the following condition holds:

    For each member g ∈ G, it holds that, for each tree t ∈ T_G, the distance from g's parent to g in t is smaller than each of the distances from g to its children in t.

To see why this is true, assume that the condition is satisfied and that we can construct a deadlock cycle C of length k. According to the nature of deadlocks in LFC, each NI n_i in C holds at least one multicast packet p_i from its predecessor in C that needs to be forwarded to its successor in C. According to the condition, each NI in the cycle must therefore have a larger distance to its successor than to its predecessor. That is, if we denote the deadlock cycle

    n_0 --d_0--> n_1 --d_1--> ... --d_{k-1}--> n_0

where d_i is the distance from n_i to n_{i+1} (indices modulo k), then we must have

    d_0 < d_1 < ... < d_{k-1} < d_0,

which is impossible. Using this condition, we can see why LFC's binary trees are deadlock-free.

In the binary tree for node 0, it is clear that each node's distances to its children are larger than the distance to the node's parent (see for example Figure 4.9). The multicast tree for a node p ≠ 0 is obtained by renumbering each tree node n in the multicast tree of node 0 to (n + p) mod P, where P is the number of processors. This renumbering preserves the distances between nodes, so the condition holds for all trees. Consequently, LFC's binary trees are deadlock-free.
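The condition is easy to check mechanically. The sketch below (an illustration only, not LFC code) does so for all P trees of a group. The breadth-first binary tree shape it uses is an assumption made for the example; the exact shape of LFC's trees is defined in Chapter 4 (Figure 4.9). With this assumed shape the check succeeds for the binary trees and fails for chains, consistent with the chain buffer deadlock of Figure 4.8.

    # Illustration only: check the sufficient condition of Section A.1 for a set
    # of multicast trees.  The breadth-first binary tree below is an assumed
    # shape; LFC's actual trees are defined in Chapter 4.

    P = 16  # number of NIs (example value)

    def dist(m, n, p=P):
        """Ring distance from NI m to NI n: (n + p - m) mod p."""
        return (n + p - m) % p

    def bfs_binary_tree(p=P):
        """Children of each node in a breadth-first binary tree rooted at 0."""
        return {i: [c for c in (2 * i + 1, 2 * i + 2) if c < p] for i in range(p)}

    def chain(p=P):
        """Children of each node in a chain 0 -> 1 -> ... -> p-1."""
        return {i: ([i + 1] if i + 1 < p else []) for i in range(p)}

    def renumber(tree0, root, p=P):
        """Tree for root r: renumber node n of the root-0 tree to (n + r) mod p."""
        return {(n + root) % p: [(c + root) % p for c in kids]
                for n, kids in tree0.items()}

    def satisfies_condition(tree0, p=P):
        """Is dist(parent -> g) < dist(g -> child) for every g in every tree?"""
        for root in range(p):
            t = renumber(tree0, root, p)
            parent = {c: n for n, kids in t.items() for c in kids}
            for g, kids in t.items():
                if g == root:
                    continue
                d_parent = dist(parent[g], g, p)
                if any(dist(g, c, p) <= d_parent for c in kids):
                    return False
        return True

    print("binary trees:", satisfies_condition(bfs_binary_tree()))  # True
    print("chains:      ", satisfies_condition(chain()))            # False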


Fig. A.1. A 16-node binomial spanning tree rooted at node 0.

Previous versions of LFC used binomial instead of binary trees [16].¹ Figure A.1 shows a 16-node binomial tree. In binomial trees the distance (as defined above) between a node and its parent or children is always a power of two. Moreover, and in contrast with binary trees, binomial trees have the property that the distance between a node and its parent is always greater than the distances between a node and its children. The condition for deadlock-free multicasting in LFC can easily be adapted to apply to trees that have the latter property. Using this adapted version, we can show that binomial multicast trees are also deadlock-free.

¹This paper ([16]) contains two errors. First, it incorrectly states that forwarding in binomial trees is equivalent to e-cube routing, which is deadlock-free [43]. This is true for some senders (e.g., node 0), but not for all. Second, the paper assumes that binary trees are not deadlock-free.
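The same kind of check, with the inequality reversed as described above, can be applied to binomial trees. The sketch below (again an illustration, with dist() and renumber() repeated from the previous sketch) builds the 16-node binomial tree of Figure A.1, in which node n's parent is n minus its lowest set bit, and verifies the adapted condition.

    # Illustration only: the 16-node binomial tree of Figure A.1 and the adapted
    # condition (the distance from a node's parent must exceed the distances to
    # the node's children).  dist() and renumber() are as in the previous sketch.

    def dist(m, n, p):
        return (n + p - m) % p

    def renumber(tree0, root, p):
        return {(n + root) % p: [(c + root) % p for c in kids]
                for n, kids in tree0.items()}

    def binomial_tree(p=16):
        """Children per node, rooted at 0; p must be a power of two."""
        children = {0: [1 << k for k in range((p - 1).bit_length())]}
        for n in range(1, p):
            low = n & -n  # lowest set bit of n = distance to n's parent (n - low)
            children[n] = [n + (1 << k) for k in range(low.bit_length() - 1)]
        return children

    def satisfies_adapted_condition(tree0, p=16):
        for root in range(p):
            t = renumber(tree0, root, p)
            parent = {c: n for n, kids in t.items() for c in kids}
            for g, kids in t.items():
                if g == root:
                    continue
                d_parent = dist(parent[g], g, p)
                if any(dist(g, c, p) >= d_parent for c in kids):
                    return False
        return True

    print("binomial trees:", satisfies_adapted_condition(binomial_tree()))  # True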

A.2 Overlapping Multicast Groups

Without deadlock recovery, overlapping multicast groups are unsafe. Consider a 9-node system with three multicast groups G1 = {0, 3, 4, 5, 6}, G2 = {0, 3, 6, 7, 8}, and G3 = {0, 1, 2, 3, 6}. Within each group, we use binary multicast trees. These trees are constructed as if the nodes within each group were numbered 0, ..., 4. Figure A.2 shows the trees for node 0 in G1, for node 3 in G2, and for node 6 in G3. The bold arrows in each tree indicate a forwarding path that overlaps with a forwarding path in another tree. By concatenating these paths we can create a deadlock cycle.


Fig. A.2. Multicast trees for overlapping multicast groups.

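As an aid to reading Figure A.2, the sketch below regenerates the NI-level forwarding hops of per-group trees. Two details are assumptions made only for this illustration: group members are ranked in sorted order to obtain the group-relative numbers 0, ..., 4, and the group-relative trees have the same breadth-first binary shape as in the earlier sketches, which may differ from the trees actually drawn in the figure. The deadlock cycle itself is the one traced by the bold arrows in Figure A.2.

    # Sketch only: list the NI-level forwarding hops of per-group multicast trees.
    # Assumptions (not taken from the thesis): members are ranked in sorted order
    # and the group-relative trees have the breadth-first binary shape used in the
    # earlier sketches; the trees drawn in Figure A.2 may differ.

    def bfs_binary_children(p):
        return {i: [c for c in (2 * i + 1, 2 * i + 2) if c < p] for i in range(p)}

    def group_tree_hops(members, root):
        """Forwarding hops (global NI pairs) of the group tree rooted at 'root'."""
        members = sorted(members)
        p = len(members)
        r = members.index(root)                  # group-relative root number
        hops = []
        for n, kids in bfs_binary_children(p).items():
            src = members[(n + r) % p]           # renumber by the root, map back
            hops += [(src, members[(c + r) % p]) for c in kids]
        return hops

    groups = {"G1": ([0, 3, 4, 5, 6], 0),
              "G2": ([0, 3, 6, 7, 8], 3),
              "G3": ([0, 1, 2, 3, 6], 6)}
    for name, (members, root) in groups.items():
        print(name, "tree for node", root, ":", group_tree_hops(members, root))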


Appendix B

Abbreviations

ADC    Application Device Channel
ADI    Application Device Interface
AM     Active Messages
API    Application Programming Interface
ASCI   Advanced School for Computing and Imaging
ATM    Asynchronous Transfer Mode
BB     Broadcast/Broadcast
BIP    Basic Interface for Parallelism
CRL    C Region Library
DAS    Distributed ASCI Supercomputer
DMA    Direct Memory Access
DSM    Distributed Shared Memory
F&A    Fetch and Add
FM     Fast Messages
GSB    Get Sequence number, then Broadcast
JIT    Just In Time
JVM    Java Virtual Machine
LFC    Link-level Flow Control
MG     Multigame
MPI    Message Passing Interface
MPICH  MPI Chameleon
MPL    Message-Passing Library
MPP    Massively Parallel Processor
NI     Network Interface
NIC    Network Interface Card
OT     OpenThreads


PB     Point-to-point/Broadcast
PIO    Programmed I/O
PPS    Parallel Programming System
RMI    Remote Method Invocation
RPC    Remote Procedure Call
RSR    Remote Service Request
RTS    Run Time System
SAP    Service Access Point
TCP    Transmission Control Protocol
TLB    Translation Lookaside Buffer
UDP    User Datagram Protocol
UTLB   User-managed TLB
VI     Virtual Interface
VMMC   Virtual Memory-Mapped Communication
WWW    World Wide Web


Samenvatting (Summary)

Parallel programs let several computers work simultaneously on a compute-intensive problem, with the goal of solving that problem faster than is possible with a single computer. Since it is difficult to make computers cooperate efficiently and correctly, systems have been developed to simplify the writing and running of parallel programs. We call such systems parallel-programming systems.

Different parallel-programming systems offer different programming paradigms and are used in different hardware environments. This thesis concentrates on programming systems for computer clusters. Such a cluster consists of a collection of computers connected by a network. Most clusters consist of tens of computers, but clusters of hundreds and even thousands of computers exist as well. Characteristic of a computer cluster is that both the computers and the network are relatively inexpensive due to mass production, and that the computers communicate by exchanging messages over the network rather than through a shared memory.

The exchange of messages is supported by communication software. This software largely determines the performance of the parallel programs developed with a parallel-programming system. This thesis, entitled "Communication Architectures for Parallel-Programming Systems", focuses on this communication software and addresses the following questions:

1. Which mechanisms should a communication system provide?

2. How should parallel-programming systems use the mechanisms provided?

3. How should these mechanisms be implemented in the communication system?


With respect to these questions, the thesis makes the following contributions:

1. It shows that a small collection of simple communication mechanisms can be used effectively to implement different parallel-programming systems efficiently. These communication mechanisms have been implemented in a new communication system, LFC, which is described more precisely later in this summary. Several of these mechanisms assume support from the network interface. The network interface is the computer component that connects the computer to the network. Modern network interfaces are often programmable and can therefore carry out a number of communication tasks that are traditionally performed by the host computer.

2. It describes an efficient and reliable multicast algorithm. Multicast is one of the communication mechanisms mentioned above and plays an important role in several parallel-programming systems. It is a form of communication in which one computer sends a message to several other computers.

On networks that do not support multicast in hardware, multicast is implemented by forwarding. The sender first sends a multicast message sequentially to a (usually) small number of computers; each of those computers receives the message, processes it, and forwards it to other computers, and so on. This way of forwarding is reasonably efficient, because the message is processed and forwarded by several computers simultaneously. As described, however, the method is not optimal, because every forwarding computer copies the message back to its network interface. This thesis describes an algorithm that uses the network interface to forward multicast messages more efficiently: as soon as the network interface receives a multicast message, it autonomously forwards the message to other network interfaces and also copies it to the host computer. The receiving computer is no longer involved in forwarding the message.

3. It systematically investigates the influence of different divisions of protocol tasks between the computer and the network interface on the performance of parallel programs. The evaluation concentrates on whether or not the network interface is used to implement reliable communication and to forward multicast messages.


An interesting result is that certain divisions of work improve the performance of some parallel programs but degrade the performance of other programs. The influence of a division of work turns out to depend partly on the communication patterns that a parallel-programming system uses.

Chapter 1 of this thesis describes the above research questions and contributions in detail.

Chapter 2 gives an overview of existing, efficient communication systems. Almost without exception, these are systems that can be accessed directly by user processes, that is, without intervention by the operating system. Although eliminating the operating system leads to good performance in simple tests, none of these communication systems fully satisfies the demands of parallel-programming systems. In particular, many of these communication systems do not support asynchronous message handling or multicast; both are important in parallel-programming systems. Another important observation is that the existing communication systems differ strongly in the way they divide work between the computer and the network interface. Some systems restrict the work on the network interface to a minimum, because the network interface's processor is generally slow. Other systems, in contrast, run complete communication protocols on the network interface. Chapter 8 of this thesis systematically investigates the influence of different divisions of work.

Chapter 3 describes the design and implementation of LFC, a new communication system for parallel-programming systems. LFC supports five parallel-programming systems. Although these programming systems impose very different communication requirements, all of them have been implemented efficiently on top of LFC.

LFC offers the user a simple, low-level communication interface. This interface consists of functions for sending network packets reliably to one or more other computers. These packets can be received both synchronously and asynchronously. The interface also contains a simple but useful synchronization function (fetch-and-add).

The implementation of this interface makes aggressive use of a programmable network interface. Chapter 4 describes in detail the protocols and tasks that LFC executes on the network interface:

- preventing buffer overflow to support reliable communication (flow control)


- forwarding multicast messages

- delaying the generation of interrupts

- handling synchronization messages

LFC assumes that the network hardware neither corrupts nor loses packets, but that alone is not enough to guarantee reliable communication between processes. It is also necessary that message receivers have enough buffer space to store incoming messages. To guarantee this, LFC implements a simple flow-control protocol on the network interface.

LFC lets the network interface forward multicast messages without involving the host computer. This turns out to be implementable as a simple extension of the flow-control protocol mentioned above.

LFC allows users to receive messages synchronously, by letting a computer actively test whether a message has arrived (polling), or asynchronously, by letting the network interface generate an interrupt. A computer can test quickly whether a message has arrived, but handling an interrupt is usually slow. If a message is expected, polling is therefore more efficient than using interrupts. If it is not known when the next message will arrive, polling is generally expensive, because most polls will not find any messages. Because interrupts are expensive, LFC lets the network interface generate an interrupt only after a short delay. This gives the receiving computer the chance to read the message by polling and to avoid the interrupt.
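As an illustration of this delayed-interrupt scheme, consider the following sketch. It is not LFC's firmware; the delay value and the queue structure are assumptions made for the example.

    # Sketch (not LFC code) of delayed interrupt generation: after a packet
    # arrives, the NI waits a short, assumed delay before raising an interrupt,
    # so that a polling host can consume the packet and avoid the interrupt.

    import time
    from collections import deque

    INTERRUPT_DELAY = 50e-6          # assumed value; the real delay is a tuning knob

    class NetworkInterface:
        def __init__(self):
            self.rx_queue = deque()  # packets the host has not consumed yet
            self.deadline = None     # time at which an interrupt will be raised

        def packet_arrived(self, pkt):
            self.rx_queue.append(pkt)
            if self.deadline is None:
                self.deadline = time.monotonic() + INTERRUPT_DELAY

        def host_poll(self):
            """Called by the host; returns a packet or None."""
            if not self.rx_queue:
                return None
            pkt = self.rx_queue.popleft()
            if not self.rx_queue:    # queue drained by polling: cancel the interrupt
                self.deadline = None
            return pkt

        def firmware_loop_iteration(self):
            """Part of the NI's main loop: raise the interrupt once the delay expires."""
            if self.deadline is not None and time.monotonic() >= self.deadline:
                self.deadline = None
                print("interrupt raised: host did not poll in time")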

LFC offers a simple synchronization function: fetch-and-add. This function atomically reads and increments the value of a shared integer variable. LFC stores fetch-and-add variables in the memory of the network interface and lets the network interface handle incoming fetch-and-add requests on its own.

Chapter 5 evaluates LFC's performance. The tasks that LFC executes on the network interface are relatively simple, but since the processor of a network interface is generally slow, it is not a priori clear whether it is wise to execute all these tasks there. The performance measurements in Chapter 5 show, however, that LFC performs excellently, despite its aggressive use of the relatively slow network interface.

Chapter 6 describes Panda, a communication system that offers a higher-level interface than LFC. Panda is implemented on top of LFC and provides the user with threads, message passing, Remote Procedure Call (RPC), and group communication.


These abstractions often simplify the implementation of a parallel-programming system.

Panda uses knowledge present in the thread system to decide automatically whether incoming messages should be received by polling or through interrupts. Normally Panda uses interrupts, but as soon as all threads are blocked, Panda switches to polling. Panda treats every incoming message as a byte stream and allows a receiving process to start reading this stream before the message has arrived in its entirety. Finally, Panda implements totally-ordered group communication with a simple but effective combination of LFC's synchronization and broadcast functions.
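The combination referred to in the last sentence is a "get sequence number, then broadcast" (GSB) scheme: a sender first obtains a global sequence number with fetch-and-add and then broadcasts the message tagged with that number, and receivers deliver messages in sequence-number order. The sketch below illustrates the idea; the function names are placeholders, not the real LFC or Panda API.

    # Sketch of a get-sequence-number-then-broadcast (GSB) scheme; the functions
    # lfc_fetch_and_add() and lfc_broadcast() are placeholders for illustration.

    import heapq

    def send_totally_ordered(msg, lfc_fetch_and_add, lfc_broadcast):
        seq = lfc_fetch_and_add()     # atomic counter, e.g., kept on an NI
        lfc_broadcast((seq, msg))     # every group member receives (seq, msg)

    class OrderedReceiver:
        """Buffers out-of-order broadcasts and delivers them in sequence order."""

        def __init__(self, deliver):
            self.deliver = deliver    # upcall into the parallel-programming system
            self.expected = 0
            self.pending = []         # min-heap of (seq, msg) pairs

        def on_broadcast(self, seq, msg):
            heapq.heappush(self.pending, (seq, msg))
            while self.pending and self.pending[0][0] == self.expected:
                _, m = heapq.heappop(self.pending)
                self.deliver(m)
                self.expected += 1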

Chapter 7 describes four parallel-programming systems that have been implemented using LFC and Panda. Orca, CRL, and Manta are object-based systems that allow processes on different computers to share objects and to communicate through these shared objects. MPI, in contrast, is based on explicitly sending messages between processes on different computers (message passing). Orca and Manta are language-based systems (based on Orca and Java, respectively). CRL and MPI are not supported by a compiler.

Although Orca, CRL, and Manta offer comparable programming abstractions, they implement those abstractions in different ways. Orca replicates some objects and uses RPC and group communication to transport the parameters of operations on shared objects. Manta does not replicate objects and uses only RPC to transport operation parameters. CRL replicates objects, but uses invalidation to keep the values of replicated objects consistent. Moreover, CRL does not transfer operation parameters but object data. Orca, Manta, and MPI have all been implemented using Panda; CRL has been implemented directly on top of LFC.

The chapter shows that all these systems can be implemented efficiently on LFC and Panda. For Orca, a relatively complex system, two optimizations are introduced to achieve this. First, the Orca compiler generates specialized code for marshaling and unmarshaling messages that contain operation parameters. This avoids unnecessary copying of data and unnecessary interpretation of type descriptions. Second, Orca uses continuations instead of threads to represent blocked operations on objects compactly. Earlier Orca implementations used threads instead of continuations. Threads, however, take up more space and can be activated in unpredictable orders.


Chapter 8 describes and evaluates other implementations of LFC's communication interface.

In total, five implementations are studied, including the LFC implementation described in Chapters 3 through 5. These implementations differ in the way they divide protocol tasks between the computer and the network interface, in particular tasks related to reliability and multicast. The chapter concentrates on the influence of these different divisions of work on the performance of parallel programs. Because all LFC implementations implement the same communication interface, this influence can be studied without modifying the parallel-programming systems.

Chapter 8 studies the behavior of nine parallel programs. These programs use Orca, CRL, MPI, and Multigame. Orca, CRL, and MPI have been discussed above. Multigame is a declarative parallel-programming system that automatically searches game trees in parallel. Performance measurements show that the LFC implementation described in Chapters 3 through 5 always yields the best parallel-program performance. This LFC implementation assumes that the network hardware is reliable and uses the network interface to forward multicast messages. The other implementations assume that the network hardware is unreliable and therefore implement a more expensive reliability protocol. Some implementations run this protocol on the network interface, others on the host computer.

An interesting outcome of the measurements is that performing more work on the network interface improves the performance of some parallel programs, but degrades the performance of others. In particular, using the network interface to implement reliability can work out either positively or negatively. In contrast, it appears always to be beneficial to let the network interface, and not the host computer, forward multicast messages. The influence of a division of work turns out to depend partly on the communication patterns that a parallel-programming system uses. For communication patterns that contain many relatively small RPC transactions, increasing the workload of the network interface is unfavorable. For asynchronous communication patterns, on the other hand, it is beneficial to offload the host computer and shift work to the network interface.

Chapter 9 concludes this thesis. The work presented shows that a communication system with a simple interface can efficiently support a variety of parallel-programming systems. Such a communication system must offer interrupts, polling, and multicast.


The network interface plays an important role in the implementation of the communication system. In particular, multicast can be implemented efficiently with the help of the network interface. The performance of parallel applications, however, is determined not only by the communication system, but also by communication libraries such as Panda and by the parallel-programming systems themselves. This thesis describes several techniques to make these layers perform well too, and establishes relationships between properties of communication systems, properties of parallel-programming systems, and the performance of parallel programs.


Curriculum Vitae

Personal data

Name: Raoul A.F. Bhoedjang

Born: 14 December 1970, The Hague, The Netherlands

Nationality: Dutch

Address: Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853–7501, USA

Email: [email protected]

WWW: http://www.cs.cornell.edu/raoul/

Education

November 1992: Master's degree in Computer Science (cum laude), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

August 1988: Athenaeum degree, Han Fortmann College, Heerhugowaard, The Netherlands


Professional Experience

Aug. 1999 – present: Researcher, Dept. of Computer Science, Cornell University, Ithaca, NY, USA

July 1998 – July 1999: Researcher, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

July 1994 – July 1998: Graduate student, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Feb. 1994 – June 1994: Programmer, Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Dec. 1993 – Jan. 1994: Teaching assistant (Compiler Construction), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Nov. 1993: Visiting student, Dept. of Computer Science, Cornell University, Ithaca, NY, USA

Sept. 1993 – Oct. 1993: Visiting student, Laboratory for Computer Science, MIT, Cambridge, MA, USA

Feb. 1993 – Sept. 1993: Teaching assistant (Operating Systems), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

July 1992 – Sept. 1992: Summer Scholarship student, Edinburgh Parallel Computing Centre, Scotland, UK


Jan. 1992 – June 1992: Exchange student, Computing Dept., Lancaster University, Lancaster, UK

Sept. 1991 – Dec. 1991: Teaching assistant (Introduction to Programming), Dept. of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands


Publications

Journal publications

1. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

2. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, T. Ruhl, and M.F. Kaashoek. Performance Evaluation of the Orca Shared Object System. ACM Trans. on Computer Systems, 16(1):1–40, February 1998.

3. H.E. Bal, K.G. Langendoen, R.A.F. Bhoedjang, and F. Breg. Experience with Parallel Symbolic Applications in Orca. J. of Programming Languages, 1998.

4. K.G. Langendoen, R.A.F. Bhoedjang, and H.E. Bal. Models for Asynchronous Message Handling. IEEE Concurrency, 5(2):28–37, April–June 1997.

5. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, and K. Verstoep. Performance of a High-Level Parallel Language on a High-Speed Network. J. of Parallel and Distributed Computing, 40(1):49–64, February 1997.

Conference publications

1. T. Kielmann, R.F.H. Hofman, H.E. Bal, A. Plaat, and R.A.F. Bhoedjang. MAGPIE: MPI's Collective Communication Operations for Clustered Wide Area Systems. In Conf. on Principles and Practice of Parallel Programming (PPoPP'99), pages 131–40, Atlanta, GA, May 1999.

2. T. Kielmann, R.F.H. Hofman, H.E. Bal, A. Plaat, and R.A.F. Bhoedjang. MPI's Reduction Operations in Clustered Wide Area Systems. In Message Passing Interface Developer's and User's Conference (MPIDC'99), pages 43–52, Atlanta, GA, March 1999.

3. R.A.F. Bhoedjang, J.W. Romein, and H.E. Bal. Optimizing Distributed Data Structures Using Application-Specific Network Interface Software. In Int. Conf. on Parallel Processing, pages 485–492, Minneapolis, MN, August 1998.

4. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. Efficient Multicast on Myrinet Using Link-Level Flow Control. In Int. Conf. on Parallel Processing, pages 381–390, Minneapolis, MN, August 1998. (Best paper award).

5. K.G. Langendoen, J. Romein, R.A.F. Bhoedjang, and H.E. Bal. Integrating Polling, Interrupts, and Thread Management. In The 6th Symp. on the Frontiers of Massively Parallel Computation (Frontiers '96), pages 13–22, Annapolis, MD, October 1996.

6. T. Ruhl, H.E. Bal, R. Bhoedjang, K.G. Langendoen, and G. Benson. Experience with a Portability Layer for Implementing Parallel Programming Systems. In The 1996 Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pages 1477–1488, Sunnyvale, CA, August 1996.

7. R.A.F. Bhoedjang and K.G. Langendoen. Friendly and Efficient Message Handling. In Proc. of the 29th Annual Hawaii Int. Conf. on System Sciences, pages 121–130, Maui, HI, January 1996.

8. K.G. Langendoen, R.A.F. Bhoedjang, and H.E. Bal. Automatic Distribution of Shared Data Objects. In B. Szymanski and B. Sinharoy, editors, Languages, Compilers and Run-Time Systems for Scalable Computers, pages 287–290, Troy, NY, May 1995. Kluwer Academic Publishers.

9. H.E. Bal, K.G. Langendoen, and R.A.F. Bhoedjang. Experience with Parallel Symbolic Applications in Orca. In T. Ito, R.H. Halstead, Jr., and C. Queinnec, editors, Parallel Symbolic Languages and Systems (PSLS'95), number 1068 in Lecture Notes in Computer Science, pages 266–285, Beaune, France, October 1995.

10. R.A.F. Bhoedjang, T. Ruhl, R. Hofman, K.G. Langendoen, H.E. Bal, and M.F. Kaashoek. Panda: A Portable Platform to Support Parallel Programming Languages. In Proc. of the USENIX Symp. on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), pages 213–226, San Diego, CA, September 1993.


Technical reports

1. R. Veldema, R.A.F. Bhoedjang, and H.E. Bal. Distributed Shared Memory Management for Java. In 6th Annual Conf. of the Advanced School for Computing and Imaging, Lommel, Belgium, June 2000.

2. R. Veldema, R.A.F. Bhoedjang, and H.E. Bal. Jackal: A Compiler-Based Implementation of Java for Clusters of Workstations. In 5th Annual Conf. of the Advanced School for Computing and Imaging, pages 348–355, Heijen, The Netherlands, June 1999.

3. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal. LFC: A Communication Substrate for Myrinet. In 4th Annual Conf. of the Advanced School for Computing and Imaging, pages 31–37, Lommel, Belgium, June 1998.

4. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, T. Ruhl, and M.F. Kaashoek. Orca: a Portable User-Level Shared Object System. Technical Report IR-408, Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands, July 1996.

5. H.E. Bal, R.A.F. Bhoedjang, R. Hofman, C. Jacobs, K.G. Langendoen, and K. Verstoep. Performance of a High-Level Parallel Language on a High-Speed Network. Technical Report IR-400, Dept. of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands, 1996.

6. K.G. Langendoen, J. Romein, R.A.F. Bhoedjang, and H.E. Bal. Integrating Polling, Interrupts, and Thread Management. In 2nd Annual Conf. of the Advanced School for Computing and Imaging, pages 168–173, Lommel, Belgium, June 1996.

7. R.A.F. Bhoedjang and K.G. Langendoen. User-Space Solutions to Thread Switching Overhead. In 1st Annual Conf. of the Advanced School for Computing and Imaging, pages 397–406, Heijen, The Netherlands, May 1995.
