efficient reliability support for hardware multicast-based ... · efficient reliability support for...

23
Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1 Ching-Hsiang Chu, 1 Khaled Hamidouche, 1 Hari Subramoni, 1 Akshay Venkatesh, 2 Bracy Elton and 1 Dhabaleswar K. (DK) Panda 1 Department of Computer Science and Engineering, The Ohio State University 2 Engility Corporation

Upload: others

Post on 08-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

EfficientReliabilitySupportforHardwareMulticast-basedBroadcastinGPU-enabledStreamingApplications

1Ching-HsiangChu,1KhaledHamidouche,1HariSubramoni,1AkshayVenkatesh,2BracyEltonand1DhabaleswarK.(DK)Panda

1DepartmentofComputerScienceandEngineering,TheOhioStateUniversity2EngilityCorporation

Page 2: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 2NetworkBasedComputingLaboratory

• Introduction

• ProposedDesigns

• PerformanceEvaluation

• ConclusionandFutureWork

Outline

Page 3: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 3NetworkBasedComputingLaboratory

• Multi-coreprocessorsareubiquitous• InfiniBand(IB)isverypopularinHPCclusters• Accelerators/Coprocessorsarebecomingcommoninhigh-endsystems➠ PushingtheenvelopetowardsExascalecomputing

DriversofModernHPCClusterArchitectures

Accelerators/Coprocessorshighcomputedensity,highperformance/watt

>1Tflop/sDPonachip

HighPerformanceInterconnects– InfiniBand<1µslatency,>100GbpsBandwidth

Tianhe– 2 Titan Stampede Tianhe– 1A

Multi-coreProcessors

Page 4: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 4NetworkBasedComputingLaboratory

• GrowthofIBandGPUclustersinthelast3years– IBisthemajorcommoditynetworkadapterused

– NVIDIAGPUsboost18%ofthetop50 ofthe”Top500”systemsasofJune2016

IBandGPUinHPCSystems

8.4 7.8 9 9.8 10.4 13.8 13.2

41 41.4 44.4 44.851.8 47.4

40.8

0102030405060

June2013 Nov2013 June2014 Nov2014 June2015 Nov2015 June2016System

shareinTo

p500(%

)

GPUCluster InfiniBandCluster DatafromTop500list(http://www.top500.org)

Page 5: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 5NetworkBasedComputingLaboratory

• StreamingapplicationsonHPCsystems1. Communication(MPI)• Pipelineofbroadcast-type

operations

2. Computation(CUDA)• MultipleGPU nodesasworkers

– Examples• Deeplearningframeworks

• Protoncomputedtomography(pCT)

MotivationDataSource

DataDistributor

HPCresourcesforreal-timeanalytics

Real-timestreaming

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

Datastreaming-likebroadcastoperations

Page 6: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 6NetworkBasedComputingLaboratory

• High-performanceHeterogeneousBroadcast*

– LeveragesNVIDIAGPUDirect andIBhardwaremulticast(MCAST)features

– Eliminatesunnecessarydatastagingthroughhostmemory

CommunicationforStreamingApplications

*Ching-HsiangChu,KhaledHamidouche,HariSubramoni,Akshay Venkatesh,BracyElton,andD.K.Panda.“DesigningHighPerformanceHeterogeneousBroadcastforStreamingApplicationsonGPUClusters,“SBAC-PAD’16,Oct2016.

NodeNIBHCA

IBHCA

CPUGPU

Source

IBSwitch

GPUCPU

Node1

Multicaststeps

C Data

C

IBSLstep

Data

IBHCAGPUCPU

Data

CIBHCA:InfiniBandHostChannelAdapter

Page 7: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 7NetworkBasedComputingLaboratory

• IBhardwaremulticastsignificantlyimprovestheperformance,however,itisaUnreliableDatagram(UD)-based scheme

Ø Reliabilityneedstobehandledexplicitly

• ExistingNegativeACKnowledgement (NACK)-basedDesign– SendermuststalltocheckreceiptofNACKpackets

Ø Breaksthepipelineofbroadcastoperations– Re-sendMCASTpackets evenifitisnotnecessaryforsomereceivers

ØWastesnetworkresource,degradesthroughput/bandwidth

LimitationsoftheExistingScheme

Page 8: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 8NetworkBasedComputingLaboratory

• HowtoprovidereliabilitysupportwhileleveragingUD-basedIB

hardwaremulticasttoachievehigh-performancebroadcastfor

GPU-enabledstreamingapplications?

• Maintainsthepipelineofbroadcastoperations

• MinimizestheconsumptionofPeripheralComponentInterconnect

Express(PCIe)resources

ProblemStatement

Page 9: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 9NetworkBasedComputingLaboratory

• Introduction

• ProposedDesigns– RemoteMemoryAccess(RMA)-basedDesign

• PerformanceEvaluation

• ConclusionandFutureWork

Outline

Page 10: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 10NetworkBasedComputingLaboratory

• Goalsoftheproposeddesign– AllowsthereceiverstoretrievelostMCASTpacketsthroughtheRMA

operationswithoutinterruptingsender

Ø Maintainspipeliningofbroadcastoperations

Ø MinimizesconsumptionofPCIeresources

• MajorBenefitofMPI-3RemoteMemoryAccess(RMA)*– Supportsone-sidedcommunicationè broadcastsenderwon’tbeinterrupted

• MajorChallenge– HowandwherereceiverscanretrievethecorrectMCASTpackets

throughRMAoperations

Overview:RMA-basedReliabilityDesign

*”MPIForum”,http://mpi-forum.org/

Page 11: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 11NetworkBasedComputingLaboratory

• Maintainsanadditionalwindowofacircularbackupbufferfor

MCASTpackets

• ExposesthiswindowtootherprocessesintheMCASTgroup,e.g.,

performsMPI_Win_create

• UtilizesanadditionalhelperthreadtocopyMCASTpacketstothe

backupbufferè wecanoverlapwithbroadcastcommunication

ImplementingMPI_Bcast:SenderSide

Page 12: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 12NetworkBasedComputingLaboratory

• Whenareceiverexperiencestimeout (lostMCASTpacket)– PerformstheRMAGetoperationtothesender’sbackupbuffertoretrieve

lostMCASTpackets

– Senderisnotinterrupted

ImplementingMPI_Bcast:ReceiverSide

BroadcastreceiverBroadcastsenderIBHCA IBHCA MPI

Timeout

MPITime

Page 13: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 13NetworkBasedComputingLaboratory

• LargeenoughtokeeptheMCASTpacketsavailablewhenitisneeded

• Assmallaspossibletolimitsizeofmemoryfootprint

BackupBufferRequirements

𝑾 >𝑩×(𝑲×𝑹𝑻𝑻)

𝒇

Framesize:SizeofasingleMCASTpacket

Bandwidth Constant Round-TripTimebetweensenderandreceiver

Page 14: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 14NetworkBasedComputingLaboratory

• Pros:– Broadcastsenderisnotinvolvedinretransmission,i.e.,maintainsthe

pipelineofbroadcastoperationsØ Highthroughput,highscalability– NoextraMCASToperation,i.e.,minimizesconsumptionofPCIeresourcesØ Lowoverhead,lowlatency

• Cons:– CongestionmayoccurwhenmultipleRMAGetoperationsfrommultiple

receiversareissuedtoretrievethesamedatainextremeunreliablenetworks(highlyunlikelyforIBclusters)

ProposedRMA-basedReliabilityDesign

Page 15: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 15NetworkBasedComputingLaboratory

• Introduction

• ProposedDesigns

• PerformanceEvaluation– ExperimentalEnvironments

– StreamingBenchmarkLevelEvaluation

• ConclusionandFutureWork

Outline

Page 16: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 16NetworkBasedComputingLaboratory

1. RI2cluster@TheOhioStateUniversity*– Mellanox EDRInfiniBand HCAs

– 2NVIDIAK80GPUspernode

– Usedupto16GPUnodes

2. CSCScluster@SwissNationalSupercomputingCentrehttp://www.cscs.ch/computers/kesch_escha/index.html

– Mellanox FDRInfiniBandHCAs

– CrayCS-Stormsystem,8NVIDIAK80GPUcardspernode

– Usedupto88NVIDIAK80GPUcardsover11nodes

• ModifiedOhioStateUniversity(OSU)Micro-Benchmark(OMB)*– http://mvapich.cse.ohio-state.edu/benchmarks/

– osu_bcast - MPI_Bcast LatencyTest

– Modifiedtosupportheterogeneousbroadcast

• Streamingbenchmark– Mimicsrealstreamingapplications

– ContinuouslybroadcastsdatafromasourcetoGPU-basedcomputenodes

– Includesacomputationphasethatinvolveshost-to-deviceanddevice-to-hostcopies

ExperimentalEnvironments

*ResultsfromRI2andOMBareomittedinthispresentationduetotimeconstraints

Page 17: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 17NetworkBasedComputingLaboratory

OverviewoftheMVAPICH2Project• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and

RDMA over Converged Enhanced Ethernet (RoCE)– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR), Available since 2014– Support for MIC (MVAPICH2-MIC), Available since 2014– Support for Virtualization (MVAPICH2-Virt), Available since 2015– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015– Used by more than 2,675 organizations in 83 countries– More than 400,000 (> 0.4 million) downloads from the OSU site directly– Empowering many TOP500 clusters (June 2016 ranking)

• 12th ranked 462,462-core cluster (Stampede) at TACC• 15th ranked 185,344-core cluster (Pleiades) at NASA• 31th ranked 74520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 Tflop/s)⇒– Stampede at TACC (12th in June 2016, 462,462 cores, 5.168 Pflop/s)

Page 18: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 18NetworkBasedComputingLaboratory

• NegligibleoverheadcomparedtoexistingNACK-baseddesign

• RMA-baseddesignoutperformsNACK-basedschemeforlargemessages• AhelperthreadinthebackgroundperformsbackupsofMCASTpackets

Evaluation:Overhead

051015202530

1 2 4 8 16 32 64 128

256

512 1K 2K 4K 8K

Latency(μs)

MessageSize(Bytes)

w/oreliability NACK RMA

0

1000

2000

3000

4000

16K

32K

64K

128K

256K

512K 1M 2M 4M

Latency(μs)

MessageSize(Bytes)

w/oreliability NACK RMA

Page 19: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 19NetworkBasedComputingLaboratory

LatencyreductionofproposedRMA-baseddesigncomparedtotheexistingNACK-basedscheme

Evaluation:LatencyonStreamingBenchmark

0

0.5

1

1.5

1 2 4 8 16 32 64 128

256

512 1K 2K 4K 8KNormalize

dLatency

MessageSize(Bytes)

0.01% 0.10% 1%

0

0.5

1

1.5

16K

32K

64K

128K

256K

512K 1M 2M 4MNormalize

dLatency

MessageSize(Bytes)

0.01% 0.10% 1% NormalizedtoSL-basedMCASTwithNACK-basedretransmissionscheme

MessageSize8KB 128KB 2MB

ModeledErrorRate

0.01% 16% 31% 11%0.1% 21% 36% 19%1% 24% 21% 10%

ModeledErrorrates: ModeledErrorrates:

Page 20: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 20NetworkBasedComputingLaboratory

• EqualorbetterthantheleadingNACK-baseddesignfordifferentmessagesizesanderrorrates

• Alwaysyields(upto56%)ahigherbroadcastratethantheexistingNACK-baseddesign

Evaluation:BroadcastRate(Throughput)

1

1.2

1.4

1.6

1 2 4 8 16 32 64 128

256

512 1K 2K 4K 8K 16K

32K

64K

128K

256K

512K 1M 2M 4MNormalize

dLatency

MessageSize(Bytes)

0.01% 0.10% 1%

NormalizedtoSL-basedMCASTwithNACK-basedretransmissionscheme

ModeledErrorrates:

Page 21: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 21NetworkBasedComputingLaboratory

• Introduction

• ProposedDesigns

• PerformanceEvaluation

• ConclusionandFutureWork

Outline

Page 22: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 22NetworkBasedComputingLaboratory

• ProposeanRMA-basedreliabilitydesignontopofIBhardwaremulticastbasedbroadcastforstreamingapplications– Maintainspipeliningofbroadcastoperations

– MinimizesconsumptionofPCIeresources

– Providesgoodperformancewithstreamingbenchmarks,whichispromisingforrealstreamingapplications

• Futurework– IncludetheproposeddesigninfuturereleasesoftheMVAPICH2-GDR

library

– Evaluateeffectivenesswithrealstreamingapplications

Conclusion

Page 23: Efficient Reliability Support for Hardware Multicast-based ... · Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications 1Ching-Hsiang

COMHPC@SC16 23NetworkBasedComputingLaboratory

[email protected]

Network-BasedComputingLaboratoryhttp://nowlab.cse.ohio-state.edu/

TheMVAPICH2Projecthttp://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance ComputingModernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity(Contract No. GS04T09DBC0017 through Engility Corporation). The opinions expressed herein are those of theauthors and do not necessarily reflect the views of the DOD or the employer of the author.