Slide 1: An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
[email protected], {subramon, panda}@cse.ohio-state.edu

Presentation at MLHPC '17



Slide 2: CPU-based Deep Learning is not as bad as you think!

• Introduction
  – CPU-based Deep Learning
  – Deep Learning Frameworks
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion

Slide 3: GPUs are great for Deep Learning

• NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
• The ImageNet Challenge (ILSVRC)
  – 90% of the ImageNet teams used GPUs in 2014*
  – DL models like AlexNet, GoogLeNet, and VGG
  – GPUs: a natural fit for DL due to their throughput-oriented nature
  – GPUs are also growing in the HPC arena!

* https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/

Slide 4: But what about CPUs?

• Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
  – Many-core Xeon Phis are increasing
  – Xeon Phi 1st generation: a many-core co-processor
  – Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
• Usually, we hear CPUs are 10x-100x slower than GPUs [1-3]
  – But can we do better?

[Figure: System count for Xeon Phi, from https://www.top500.org/statistics/list/]

1. https://dl.acm.org/citation.cfm?id=1993516
2. http://ieeexplore.ieee.org/abstract/document/5762730/
3. https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1

Slide 5: Deep Learning Frameworks – CPUs or GPUs?

• There are several Deep Learning (DL) or DNN training frameworks
  – Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting....
• Every (almost every) framework has been optimized for NVIDIA GPUs
  – cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
  – So why are we not using them?
  – Performance has been "terrible", and several studies have reported significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope :-)
  – MKL-DNN, just like cuDNN, has definitely rekindled this!!
  – Coupled with Intel Xeon Phi (Knights Landing or KNL) and MCDRAM, the landscape for CPU-based DL looks promising..

Slide 6: The DL framework(s) in discussion: Caffe and friends

• Caffe is a popular and widely used framework; it has many forks (friends)
• NVIDIA-Caffe and BVLC-Caffe (official Caffe) are almost similar
  – NVIDIA-Caffe is cutting edge though! (Tensor Cores, Volta, Drive PX, etc.)
• Intel-Caffe is optimized for CPU-based Deep Learning
• OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU

Caffe Variant | Multi-GPU Support | Multi-node Support | Multi-node Communication
BVLC-Caffe    | Yes               | No                 | N/A
NVIDIA-Caffe  | Yes               | No                 | N/A
Intel-Caffe   | N/A               | Yes                | Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe     | Yes               | Yes                | MVAPICH2-GDR 2.2

Slide 7: Agenda

• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
• Conclusion

Slide 8: The Key Question!

Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures, including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?

Slide 9: Research Challenges

Let us bring HPC and DL "together"!

• Computation and communication characteristics of DL workloads
• Various datasets and networks handled differently in DL frameworks
• Possible strategies to evaluate the performance of DL frameworks
• Performance trends that can be observed for a single node
• Scale-out of DNN training for CPU-based and GPU-based training
• Performance behavior of hardware features like MCDRAM

Slide 10: Agenda

• Introduction
• Research Challenges
• Design Discussion
  – Caffe Architecture
  – Understanding the Impact of Execution Environments
  – Multi-node Training: Intel-Caffe, OSU-Caffe, and MPI
• Performance Characterization
• Conclusion

Slide 11: Caffe Architecture

[Figure: The multi-GPU training loop (Loop{}) runs three steps per iteration. (1) Data Propagation: the model parameters are packed into packed_comm_buff and broadcast from GPU 0 to all GPUs (Bcast(GPU0)). (2) Forward/Backward Pass: each GPU runs the forward (F) and backward (B) passes over layers L1..Ln on its own data. (3) Gradient Aggregation: each GPU's gradients in packed_reduce_buff are reduced onto GPU 0 (Reduce(GPU0)), which then applies the updates. See http://hidl.cse.ohio-state.edu]
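
To make the three-step loop concrete, here is a minimal sketch of the same Bcast / forward-backward / Reduce pattern using mpi4py and NumPy. This is not the actual OSU-Caffe implementation: the buffer names packed_comm_buff and packed_reduce_buff follow the figure, and compute_gradients() is a hypothetical stand-in for the framework's forward/backward pass.

```python
# A minimal sketch of the three-step loop in the figure (mpi4py + NumPy).
# Run with, e.g.: mpirun -np 4 python loop_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def compute_gradients(params):
    # Hypothetical stand-in for the forward (F) / backward (B) pass over L1..Ln.
    return 0.001 * params

num_params = 1 << 20                       # ~4 MB of float32: DL-scale messages
params = (np.random.rand(num_params).astype(np.float32) if rank == 0
          else np.zeros(num_params, dtype=np.float32))

lr = 0.01
for _ in range(100):                       # Loop{}
    # 1. Data Propagation: Bcast(GPU0) of the packed parameter buffer.
    packed_comm_buff = params
    comm.Bcast(packed_comm_buff, root=0)

    # 2. Forward/Backward Pass on this rank's own mini-batch.
    local_grads = compute_gradients(params)

    # 3. Gradient Aggregation: Reduce(GPU0) into packed_reduce_buff, then update.
    packed_reduce_buff = np.empty_like(local_grads) if rank == 0 else None
    comm.Reduce(local_grads, packed_reduce_buff, op=MPI.SUM, root=0)
    if rank == 0:
        params -= lr * packed_reduce_buff / comm.Get_size()
```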

Slide 12: Understanding the Impact of Execution Environments

Performance is dependent on:
1. Hardware architectures
   – GPUs
   – Multi-/many-core CPUs
2. Software libraries
   – cuDNN (for GPUs)
   – MKL-DNN / MKL 2017 (for CPUs)
3. Hardware/software co-design
   – Software libraries optimized for one platform will not help the other!
   – cuDNN vs. MKL-DNN

[Figure: The DL execution stack. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.), which call either a generic convolution layer backed by BLAS libraries (OpenBLAS, ATLAS) on other processors, an MKL-optimized convolution layer backed by MKL 2017 on multi-/many-core CPUs (Xeon, Xeon Phi), or a cuDNN-optimized convolution layer backed by cuDNN/cuBLAS on many-core GPUs (Pascal P100).]
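
In Caffe-style frameworks, the execution environment is picked at the framework level; a minimal pycaffe sketch, assuming local model.prototxt and trained.caffemodel files (with Intel-Caffe the CPU path dispatches to the MKL/MKL-DNN engine, while BVLC/NVIDIA-Caffe's GPU path dispatches to cuDNN/cuBLAS):

```python
import caffe

# CPU path: Intel-Caffe dispatches convolutions to the MKL/MKL-DNN engine here.
caffe.set_mode_cpu()

# GPU path: BVLC/NVIDIA-Caffe dispatch to cuDNN/cuBLAS on the selected device.
# caffe.set_mode_gpu()
# caffe.set_device(0)

net = caffe.Net('model.prototxt', 'trained.caffemodel', caffe.TEST)  # hypothetical files
out = net.forward()   # one forward pass through all layers
```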

Slide 13: Intel-Caffe and Intel MKL

• MKL-DNN: the key performance difference for CPU-based DNN training!
• Does that really work in practice?
• Intel MKL claims to offer much better performance
• Intel MLSL promises multi-node training

[Figure: Multi-node scaling using Intel Omni-Path on AlexNet. Courtesy: http://www.techenablement.com/accelerating-python-deep-learning/]

Slide 14: So what to use for scale-out with Intel-Caffe?

• We need a communication library for scale-out
  – Message Passing Interface (MPI) libraries like MVAPICH, Intel MPI, etc.
  – NVIDIA NCCL, Facebook Gloo, Baidu-allreduce, etc.
  – Intel Machine Learning Scaling Library (MLSL, a higher-level library built on top of MPI)
• How to choose?
  – For GPU-based frameworks: CUDA-aware MPI, NCCL, and Gloo
  – For CPU-based frameworks: any MPI library will do
    • MLSL offers something more
    • MLSL is sort of a DL framework API; it can be used inside the framework
    • But it can be used in a stand-alone format too!

Slide 15: OSU-Caffe: Co-design to tackle new challenges for MPI runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• State of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: Scale-up performance vs. scale-out performance. cuDNN, cuBLAS, and NCCL score high on scale-up; MPI, gRPC, and Hadoop score high on scale-out; the proposed co-designs target both.]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
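
One concrete form of the computation/communication overlap listed above is to start a non-blocking reduction on one layer's gradients while the backward pass of the next layer is still running. A minimal mpi4py sketch of that idea, where backward_layer() is a hypothetical stand-in for the framework's per-layer backward computation:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def backward_layer(i):
    # Hypothetical per-layer backward pass; returns that layer's gradient buffer.
    return np.full(1 << 18, 0.001, dtype=np.float32)

num_layers = 5
recv_bufs = [np.empty(1 << 18, dtype=np.float32) for _ in range(num_layers)]
requests = []

# Walk the layers back-to-front; as soon as a layer's gradients are ready,
# start a non-blocking Reduce and immediately continue with the next layer.
for i in reversed(range(num_layers)):
    grads = backward_layer(i)
    requests.append(comm.Ireduce(grads, recv_bufs[i], op=MPI.SUM, root=0))

MPI.Request.Waitall(requests)  # all reductions complete before the update step
```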

Slide 16: Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking):
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

Slide 17: Scale-out for GPU-based training

MVAPICH2-GDR: performance that meets Deep Learning requirements!

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (bytes), comparing MV2 (no GDR) with MV2-GDR 2.3a. GDR delivers up to 11X lower latency (down to 1.88 us) and roughly 9x-10x higher uni- and bi-directional bandwidth for small messages.]

Testbed: MVAPICH2-GDR 2.3a; Intel Haswell (E5-2687W) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA.
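
For reference, latency numbers like these come from OSU-microbenchmark-style ping-pong tests. A minimal host-buffer sketch in mpi4py (the real osu_latency benchmark, including its CUDA-aware variant for GPU buffers, ships with the OSU Micro-Benchmarks suite):

```python
# Minimal ping-pong latency sketch between ranks 0 and 1 (host buffers only).
# Run with: mpirun -np 2 python latency_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = np.zeros(8, dtype=np.uint8)          # 8-byte message
iters, warmup = 1000, 100

for i in range(iters + warmup):
    if i == warmup:
        t0 = MPI.Wtime()                   # start timing after warm-up
    if rank == 0:
        comm.Send(msg, dest=1); comm.Recv(msg, source=1)
    elif rank == 1:
        comm.Recv(msg, source=0); comm.Send(msg, dest=0)

if rank == 0:
    # One-way latency = half the round-trip time, averaged over iterations.
    print("latency: %.2f us" % ((MPI.Wtime() - t0) / iters / 2 * 1e6))
```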

Slide 18: Agenda

• Introduction
• Research Challenges
• Design Discussion
• Performance Characterization
  – Single-node Performance
  – Multi-node Performance
• Conclusion

Slide 19: Performance Characterization

• Several GPU generations and CPU architectures
• Single-node results for AlexNet and ResNet-50
  – Impact of the MKL engine
  – Impact of MCDRAM
  – Layer-wise breakdown
  – P100 vs. KNL
• Multi-node results using Intel-Caffe and OSU-Caffe
  – Weak scaling
  – ResNet-50 and AlexNet

Slide 20: Performance Characterization: Various Architectures

Name (Label) | Processor Architecture (Description)     | No. of Cores    | No. of Sockets
Haswell1     | Intel Xeon (Haswell)                      | 20 (2×10)       | 2
Haswell2     | Intel Xeon (Haswell)                      | 20 (2×10)       | 2
Broadwell    | Intel Xeon (Broadwell)                    | 28 (2×14)       | 2
KNL          | Intel Xeon Phi (Knights Landing)          | 68 (1×68)       | 1
K40          | NVIDIA Tesla K40, 11.8 GB @ 0.75 GHz      | 2880 CUDA cores | N/A
K80          | NVIDIA Tesla K80, 11.8 GB @ 0.82 GHz      | 2496 CUDA cores | N/A
P100         | NVIDIA Tesla P100-PCIE, 16 GB @ 1.33 GHz  | 3584 CUDA cores | N/A

Slide 21: Single-node: Impact of the MKL engine in Intel-Caffe

• Comparison of the optimized MKL engine and the default Caffe engine
• The MKL engine is up to 3X better than the default Caffe engine
• Biggest gains are on the Intel Xeon Phi (many-core) architecture
• Both Haswell and Broadwell architectures get significant speedups (up to 1.5X)

[Figure: Training time (ms) across CPU architectures for the default and MKL engines.]
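
The engine is selected in the network definition rather than in code (in Intel-Caffe, convolution layers can specify, e.g., engine: MKL2017 in the prototxt). A small pycaffe sketch for timing one training iteration under whichever engine the prototxt selects; the model file name is hypothetical:

```python
import time
import caffe

caffe.set_mode_cpu()
net = caffe.Net('train_val.prototxt', caffe.TRAIN)   # hypothetical model file

# Warm up once so lazy initialization does not pollute the measurement.
net.forward(); net.backward()

iters = 10
t0 = time.time()
for _ in range(iters):
    net.forward()    # forward pass over all layers
    net.backward()   # backward pass (gradients only; no solver update)
print("training time: %.1f ms/iteration" % ((time.time() - t0) / iters * 1e3))
```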

Slide 22: Single-node: Impact of Utilizing MCDRAM

• "MCDRAM as Cache" and "MCDRAM-All" offer very similar performance
• We chose to use MCDRAM as Cache for all the subsequent results
• On average, DDR-All is up to 1.5X slower than MCDRAM

[Figure: Forward and backward training time (ms) for three memory configurations: DDR-All, MCDRAM-All, and MCDRAM as Cache.]
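
On KNL in flat mode, MCDRAM shows up as a separate NUMA node (commonly node 1), so the DDR-All and MCDRAM-All configurations above can be reproduced by memory-binding the same training command with numactl; a small sketch, where the NUMA node IDs and the caffe command line are assumptions about the local setup:

```python
import subprocess

train_cmd = ["caffe", "train", "--solver=solver.prototxt"]  # hypothetical command

# DDR-All: bind all allocations to the DDR NUMA node (node 0 here).
subprocess.run(["numactl", "--membind=0"] + train_cmd, check=True)

# MCDRAM-All: bind all allocations to the MCDRAM NUMA node (node 1 here).
subprocess.run(["numactl", "--membind=1"] + train_cmd, check=True)

# "MCDRAM as Cache" needs no binding: the cache memory mode set in the BIOS
# transparently uses MCDRAM as a last-level cache in front of DDR.
```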

Slide 23: Diving Deeper: Layer-wise Breakdown

• The full landscape for AlexNet: forward and backward pass
• Faster convolutions -> faster training
• Most performance gains come from conv2 and conv3 for AlexNet

[Figure: Per-layer time (ms) for conv1-conv5 in the forward and backward passes across architectures.]
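
A layer-wise breakdown like this can be collected directly from pycaffe by running the forward pass one layer at a time. A minimal sketch, where the model file is hypothetical and the layer names follow AlexNet's conv1..conv5:

```python
import time
import caffe

caffe.set_mode_cpu()
net = caffe.Net('alexnet.prototxt', caffe.TEST)      # hypothetical model file
net.forward()                                        # warm-up pass

for layer in ['conv1', 'conv2', 'conv3', 'conv4', 'conv5']:
    t0 = time.time()
    for _ in range(10):
        net.forward(start=layer, end=layer)          # run only this layer
    print("%s: %.2f ms" % (layer, (time.time() - t0) / 10 * 1e3))
```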

Slide 24: Diving Deeper: P100 vs. KNL (AlexNet)

• Fully connected layers are much slower on KNL compared to P100
• conv1 and conv3 also contribute to the degradation on KNL
• conv2 is faster on KNL compared to P100
• ResNet-50 has some surprises (not shown on this slide)
  – KNL performs significantly better than P100
  – Difficult to visualize, as there are several layers in ResNet-50

[Figure: Per-layer time (ms) on P100 vs. KNL (optimized) for conv1-conv5, fc6, and fc7.]

Slide 25: Multi-node Results: ResNet-50 (Intel-Caffe)

• All results are weak scaling
  – The batch size remains constant per solver
  – But the overall batch size increases by: batch-size × (#nodes or #gpus)
• Images/second is a derived metric, but it is more meaningful for understanding scalability
• Efficiency is another story [1]
  – Larger DNN architectures -> less scalability due to communication overhead

[Figure: ResNet-50 with Intel-Caffe: training time (seconds) and images/second vs. number of nodes (2-32).]

1. Experiences of Scaling TensorFlow on up to 512 Nodes on the CORI Supercomputer, Intel HPC Dev. Con., https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html
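
Since images/second is a derived metric, it is worth spelling out the weak-scaling arithmetic; a short sketch of how it follows from the per-solver batch size and the measured elapsed time (the numbers are illustrative, not measurements from the slide):

```python
def images_per_second(batch_per_solver, num_nodes, iterations, elapsed_s):
    # Weak scaling: each node keeps its own batch, so the effective global
    # batch grows as batch_per_solver * num_nodes.
    images = batch_per_solver * num_nodes * iterations
    return images / elapsed_s

# Illustrative example: batch 32 per node, 100 iterations on 16 nodes in 60 s.
print(images_per_second(32, 16, 100, 60.0))   # -> ~853 images/second
```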

Slide 26: Multi-node Results: AlexNet Comparison

• OSU-Caffe vs. Intel-Caffe
  – Different frameworks, so not directly comparable
  – A rough comparison can still help in understanding scalability trends
  – The design of the framework can affect performance for distributed training
    • MPI (or the communication runtime) can cause a marked difference

[Figure: AlexNet training time (seconds) and images per second vs. number of nodes (1-32) for OSU-Caffe (GPU) and Intel-Caffe (CPU).]

Slide 27: Agenda

• Introduction
• Research Challenges
• Design Comparisons
• Performance Characterization
• Conclusion

Slide 28: Conclusion

• CPUs are very comparable to GPUs for DNN training workloads if appropriate optimizations are exploited
• GPUs are still faster than CPUs in general
• KNL beats P100 in one case, but P100 beats KNL in most cases
• Evaluating the performance of a DL framework:
  – The hardware architecture matters
  – But the software stack has a higher and more significant impact than the hardware
  – The full execution environment and communication runtime need to be evaluated to ensure fairness in comparisons

Slide 29: Future Work

• Evaluate with upcoming architectures
  – Volta GPUs
  – DGX-1V system
  – Intel Nervana Neural Network Processor
• Verify the hypothesis using other DL frameworks
  – TensorFlow
  – Intel Neon
  – Nervana Graph
• Investigate new designs with MVAPICH2 and other MPI stacks to support faster DNN training

Slide 30: Thank You!

[email protected]
http://web.cse.ohio-state.edu/~awan.10

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/

Slide 31: Please join us for other events at SC '17

• Workshops
  – ESPM2 2017: Third International Workshop on Extreme Scale Programming Models and Middleware
• Tutorials
  – InfiniBand, Omni-Path, and High-Speed Ethernet for Dummies
  – InfiniBand, Omni-Path, and High-Speed Ethernet: Advanced Features, Challenges in Designing HEC Systems, and Usage
• BoFs
  – MPICH BoF: MVAPICH2 Project: Latest Status and Future Plans
• ACM SRC Posters
  – Co-designing MPI Runtimes and Deep Learning Frameworks for Scalable Distributed Training on GPU Clusters
  – High-Performance and Scalable Broadcast Schemes for Deep Learning on GPU Clusters
• Booth Talks
  – The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
  – Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning: An MVAPICH2-Based Approach
  – Accelerating Deep Learning with MVAPICH
  – MVAPICH2-GDR Library: Pushing the Frontier of HPC and Deep Learning

Please refer to http://mvapich.cse.ohio-state.edu/talks/ for more details.