Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes from Cloud to Edge Applications

TRANSCRIPT

Work with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe

12/29/17 1

The 15th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2017), Guangzhou, China, December 12-15, 2017
http://trust.gzhu.edu.cn/conference/ISPA2017/
Geoffrey Fox, December 13, 2017
Department of Intelligent Systems, [email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Abstract
• We look again at Big Data programming environments such as Hadoop, Spark, Flink, Heron, Pregel; HPC concepts such as MPI and Asynchronous Many-Task runtimes; and Cloud/Grid/Edge ideas such as event-driven computing, serverless computing, workflow and Services.
• These cross many research communities including distributed systems, databases, cyber-physical systems and parallel computing, which sometimes have inconsistent worldviews.
• There are many common capabilities across these systems which are often implemented differently in each packaged environment. For example, communication can be bulk synchronous processing or dataflow; scheduling can be dynamic or static; state and fault tolerance can have different models; execution and data can be streaming or batch, distributed or local.
• We suggest that one can usefully build a toolkit (called Twister2 by us) that supports these different choices and allows fruitful customization for each application area. We illustrate the design of Twister2 by several point studies.
• Supercomputers will be essential for large simulations and will run other applications
• HPC Clouds or Next-Generation Commodity Systems will be a dominant force
• Merge Cloud HPC and (support of) Edge computing
• Federated Clouds running in multiple giant data centers offering all types of computing
• Distributed data sources associated with device and Fog processing resources
• Server-hidden computing and Function as a Service (FaaS) for user pleasure: "No server is easier to manage than no server"
• Support a distributed event-driven serverless dataflow computing model covering batch and streaming data as HPC-FaaS
• Needing parallel and distributed (Grid) computing ideas
• Span Pleasingly Parallel to Data management to Global Machine Learning

Predictions/Assumptions
• Use of public clouds increasing rapidly
• Clouds becoming diverse, with subsystems containing GPUs, FPGAs, high performance networks, storage, memory …
• Rich software stacks:
  • HPC (High Performance Computing) for Parallel Computing, less used than(?)
  • Apache for Big Data Software Stack ABDS, including center and edge computing (streaming)
• Surely Big Data requires High Performance Computing?
• Service-oriented Systems, Internet of Things and Edge Computing growing in importance
• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed up (unclear) requirements

Background Remarks
• On general principles, parallel and distributed computing have different requirements, even if they sometimes have similar functionalities
• The Apache stack ABDS typically uses distributed computing concepts
  • For example, the Reduce operation is different in MPI (Harp) and Spark
• Large scale simulation requirements are well understood
• Big Data requirements are not agreed, but there are a few key use types:
  1) Pleasingly parallel processing (including local machine learning LML), as of different tweets from different users, with perhaps MapReduce style of statistics and visualizations; possibly Streaming
  2) Database model with queries, again supported by MapReduce for horizontal scaling
  3) Global Machine Learning GML with a single job using multiple nodes as classic parallel computing
  4) Deep Learning certainly needs HPC – possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to ABDS (with no HPC)
• This explains why Spark, with poor GML performance, is so successful and why it can ignore MPI, even though MPI uses the best technology for parallel computing

Requirements
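The difference between the MPI (Harp) and Spark Reduce operations noted above can be sketched in a few lines. This is a simplified single-process simulation, not real MPI or Spark code: MPI-style allreduce combines in place so every rank ends up holding the result, while Spark-style reduce treats partitions as immutable and produces a new value at the driver.

```python
def mpi_style_allreduce(partitions):
    """MPI/Harp style: combine in place; every rank's buffer ends up holding the result."""
    total = sum(sum(p) for p in partitions)
    for p in partitions:              # overwrite each rank's buffer in place
        p[:] = [total]
    return partitions

def spark_style_reduce(partitions):
    """Spark style: partitions are immutable; reduce creates a new value at the driver."""
    return sum(sum(p) for p in partitions)   # inputs left untouched

ranks = [[1, 2], [3, 4], [5]]
driver_result = spark_style_reduce([list(p) for p in ranks])  # 15; ranks unchanged
mpi_style_allreduce(ranks)                                    # every rank now holds [15]
```

The in-place semantics are why MPI can avoid the data movement and object creation that the dataflow model implies.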
HPC Runtime versus ABDS distributed Computing Model on Data Analytics

Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest

Need a Polymorphic Reduction capability, choosing the best implementation

Use HPC architecture with Mutable model, Immutable data
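The "polymorphic reduction" idea above can be illustrated as one reduce() API with several interchangeable implementations chosen per call site. This is a hypothetical sketch, not Twister2 code; the selection heuristic (a size threshold) is an illustrative assumption.

```python
def tree_reduce(values, op):
    """Log-depth pairwise combine, as a collective library might do for large data."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])   # odd element carried to the next round
        vals = nxt
    return vals[0]

def serial_reduce(values, op):
    """Simple left fold, fine for small local data."""
    it = iter(values)
    acc = next(it)
    for v in it:
        acc = op(acc, v)
    return acc

def reduce_poly(values, op, size_hint=None):
    """One interface, multiple implementations: pick by (assumed) size heuristic."""
    values = list(values)
    n = size_hint if size_hint is not None else len(values)
    return tree_reduce(values, op) if n > 64 else serial_reduce(values, op)
```

A real runtime would also dispatch on data placement (in-memory vs. distributed) and network hardware, not just element count.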
Use Case Analysis
• Very short, as described in previous talks and papers
• Started with the NIST collection of 51 use cases
• "Version 2" https://bigdatawg.nist.gov/V2_output_docs.php just released August 2017
• 64 Features of Data and Model for large scale big data or simulation use cases
NIST Big Data Public Working Group Standards Best Practice: https://bigdatawg.nist.gov/V2_output_docs.php
Indiana Cloudmesh launching Twister2
[Figure: "Convergence Diamonds" views and facets. Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Execution View: Performance Metrics; Flops per Byte/Memory IO/Flops per watt; Execution Environment and core libraries; Data Volume, Model Size, Data Velocity, Data Variety, Model Variety, Veracity; Communication Structure; Dynamic=D/Static=S and Regular=R/Irregular=I for both Data and Model; Iterative/Simple; Data and Model Abstraction; Data Metric=M/Non-Metric=N; O(N^2)=NN / O(N)=N. Data Source and Style View: SQL/NoSQL/NewSQL; Enterprise Data Model; Files/Objects; HDFS/Lustre/GPFS; Archived/Batched/Streaming – S1, S2, S3, S4, S5; Shared/Dedicated/Transient/Permanent; Metadata/Provenance; Internet of Things; HPC Simulations; Geospatial Information System. Processing View: Micro-benchmarks; Local and Global (Analytics/Informatics/Simulations); Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels and many subclasses; Graph Algorithms; Core Libraries; Visualization. Simulation (Exascale) Processing Diamonds: Multiscale Method; Iterative PDE Solvers; Evolution of Discrete Systems; Particles and Fields; N-body Methods; Spectral Methods; Nature of mesh if used. Facets are marked as Data, Model, or both (All Model, Nearly all Data+Model, Nearly all Data, Mix of Data and Model). Of the 51 use cases: 41 Streaming, 26 Pleasingly Parallel, 25 MapReduce.]

64 Features in 4 views for Unified Classification of Big Data and Simulation Applications
1. Pleasingly Parallel – as in BLAST, protein docking, some (bio-)imagery, including Local Analytics or Machine Learning – ML or filtering pleasingly parallel, as in bio-imagery, radar images (pleasingly parallel but sophisticated local analytics)
2. Classic MapReduce: Search, Index and Query, and Classification algorithms like collaborative filtering (G1 for MRStat in Features, G7)
3. Map-Collective: Iterative maps + communication dominated by "collective" operations as in reduction, broadcast, gather, scatter. Common data mining pattern
4. Map Point-to-Point: Iterative maps + communication dominated by many small point-to-point messages, as in graph algorithms
5. Map-Streaming: Describes streaming, steering and assimilation problems
6. Shared Memory: Some problems are asynchronous and are easier to parallelize on shared rather than distributed memory – see some graph algorithms
7. SPMD: Single Program Multiple Data, common parallel programming feature
8. BSP or Bulk Synchronous Processing: well-defined compute-communication phases
9. Fusion: Knowledge discovery often involves fusion of multiple methods.
10. Dataflow: Important application feature, often occurring in composite Ogres
11. Use Agents: as in epidemiology (swarm approaches). This is Model only
12. Workflow: All applications often involve orchestration (workflow) of multiple components

Problem Architecture View (Meta or Macro Patterns)
Most (11 of the total 12) are properties of Data+Model
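The BSP pattern in facet 8 above can be shown in miniature. This is a sketch under the assumption of simulated workers in a single process: well-defined compute phases operate on local partitions only, separated by a barrier/communication phase where results are exchanged before anyone continues.

```python
def bsp_run(partitions, supersteps):
    """Simulate BSP supersteps: local compute, then barrier + global combine."""
    state = [0] * len(partitions)             # one local value per worker
    for _ in range(supersteps):
        # Compute phase: each worker touches only its own partition and state.
        local = [sum(p) + s for p, s in zip(partitions, state)]
        # Barrier + communication phase: exchange before the next superstep.
        total = sum(local)
        state = [total] * len(partitions)     # every worker sees the combined value
    return state[0]

result = bsp_run([[1, 2], [3, 4]], supersteps=2)
```

Real BSP runtimes (Pregel, Hama, MPI with barriers) distribute the workers, but the phase structure is the same.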
These 3 are the focus of Twister2, but we need to preserve capability on the first 2 paradigms
Classic Cloud Workload
Global Machine Learning
Note Problem and System Architecture, as efficient execution says they must match
• Need to discuss Data and Model, as problems have both intermingled, but we can get insight by separating them, which allows better understanding of Big Data - Big Simulation "convergence" (or differences!)
• The Model is a user construction: it has a "concept" and parameters, and gives results determined by the computation. We use the term "model" in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
  • For clustering, the model parameters are the cluster centers, while the data is the set of points to be clustered
  • For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried and the SQL query
  • For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights. The data is the set of images used for training or classification

Data and Model in Big Data and Simulations I
Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
  • The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values
  • Data could be small, when just boundary conditions
  • Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Big Data implies Data is large, but the Model varies in size
  • e.g. LDA (Latent Dirichlet Allocation) with many topics, or deep learning, has a large model
  • Clustering or dimension reduction can be quite small in model size
• Data is often static between iterations (unless streaming); Model parameters vary between iterations
• Data and Model parameters are often confused in papers, as the term data is used to describe the parameters of models.
• Models in Big Data and Simulations have many similarities and allow convergence
• Applications – Divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features).
  • Identify importance of streaming data, pleasingly parallel, global/local machine learning
• Software – Single model of High Performance Computing (HPC) Enhanced Big Data Stack HPC-ABDS: 21 layers adding a high performance runtime to Apache systems; HPC-FaaS Programming Model
  • Serverless Infrastructure as a Service IaaS
• Hardware system designed for functionality and performance of application type, e.g. disks, interconnect, memory, CPU acceleration different for machine learning, pleasingly parallel, data management, streaming, simulations
• Use DevOps to automate deployment of event-driven software defined systems on hardware: HPC Cloud 2.0
• Total System Solutions (wisdom) as a Service: HPC Cloud 3.0

Convergence/Divergence Points for HPC-Cloud-Edge - Big Data-Simulation
Use of DevOps is not discussed in this talk
Parallel Computing: Big Data and Simulations
• All the different programming models (Spark, Flink, Storm, Naiad, MPI/OpenMP) have the same high level approach, but application requirements and system architecture can give a different appearance
• First: Break problem Data and/or Model-parameters into parts assigned to separate nodes, processes, threads
• Then: In parallel, do computations, typically leaving data untouched but changing model-parameters. Called Maps in MapReduce parlance; typically the owner-computes rule.
• If Pleasingly parallel, that's all it is, except for management
• If Globally parallel, need to communicate results of computations between nodes during the job
  • Communication mechanism (TCP, RDMA, Native Infiniband) can vary
  • Communication style (Point-to-Point, Collective, Pub-Sub) can vary
  • Possible need for sophisticated dynamic changes in partitioning (load balancing)
• Computation either on fixed tasks or flow between tasks
• Choices: "Automatic Parallelism or Not"
• Choices: "Complicated Parallel Algorithm or Not"
• Fault-tolerance model can vary
• Output model can vary: RDD or Files or Pipes
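The decomposition steps above ("break into parts", then "maps" under the owner-computes rule) can be sketched as follows. This assumes a simple cyclic block partitioning of a flat list of model parameters; real systems partition by key, range, or graph structure.

```python
def partition(items, n_workers):
    """Break problem data / model-parameters into parts assigned to separate workers
    (cyclic assignment: worker i owns items i, i+n, i+2n, ...)."""
    return [items[i::n_workers] for i in range(n_workers)]

def map_phase(parts, fn):
    """Owner-computes rule: each worker updates only the parameters it owns."""
    return [[fn(x) for x in part] for part in parts]

parts = partition(list(range(10)), 3)        # worker 0 owns [0, 3, 6, 9]
updated = map_phase(parts, lambda x: x * x)  # independent "Maps", no communication
```

If the problem is pleasingly parallel this is essentially the whole program; a globally parallel problem would follow the map phase with the communication step described above.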
Spectrum of Applications and Algorithms
[Figure: applications arranged by Difficulty in Parallelism and Size of Synchronization constraints. Loosely Coupled: Pleasingly Parallel (often independent events) and MapReduce as in scalable databases – the current major Big Data category, run on Commodity Clouds. Structured Adaptive Sparsity, Huge Jobs: large scale simulations on HPC Clouds with High Performance Interconnect and Exascale Supercomputers. Unstructured Adaptive Sparsity, Medium size Jobs: Global Machine Learning (e.g. parallel clustering), Deep Learning, Graph Analytics (e.g. subgraph mining), LDA on HPC Clouds/Supercomputers, where memory access is also critical. Linear Algebra at core (typically not sparse); Disk I/O.]
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Ogres Application Analysis
HPC-ABDS and HPC-FaaS Software; Harp and Twister2 Building Blocks
SPIDAL Data Analytics Library
HPC-ABDS
Integrated wide range of HPC and Big Data technologies.
I gave up updating the list in January 2016!

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST, PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds; Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages, January 29 2016
• Google likes to show a timeline; we can build on (the Apache version of) this
  • 2002 Google File System GFS ~ HDFS (Level 8)
  • 2004 MapReduce: Apache Hadoop (Level 14A)
  • 2006 BigTable: Apache HBase (Level 11B)
  • 2008 Dremel: Apache Drill (Level 15A)
  • 2009 Pregel: Apache Giraph (Level 14A)
  • 2010 FlumeJava: Apache Crunch (Level 17)
  • 2010 Colossus, a better GFS (Level 18)
  • 2012 Spanner, horizontally scalable NewSQL database ~ CockroachDB (Level 11C)
  • 2013 F1, horizontally scalable SQL database (Level 11C)
  • 2013 MillWheel ~ Apache Storm, Twitter Heron (Google not first!) (Level 14B)
  • 2015 Cloud Dataflow: Apache Beam with Spark or Flink (dataflow) engine (Level 17)
• Functionalities not identified: Security (3), Data Transfer (10), Scheduling (9), DevOps (6), serverless computing (where Apache has OpenWhisk) (5)

Components of Big Data Stack
HPC-ABDS Levels in ()
Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software, augmented by HPC when needed to improve performance
16 of 21 layers plus languages
Implementing Twister2 to support a Grid linked to an HPC Cloud
[Figure: deployment patterns – Centralized HPC Cloud + IoT Devices, and Centralized HPC Cloud + Edge = Fog + IoT Devices, with Cloud, HPC and Fog layers shown. The HPC Cloud can be federated.]
Twister2: "Next Generation Grid - Edge – HPC Cloud"
• The original 2010 Twister paper has 914 citations; it was a particular approach to Map-Collective iterative processing for machine learning
• Re-engineer current Apache Big Data and HPC software systems as a toolkit
• Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains.
• Support all types of data analysis from GML to Edge computing
• Build on Cloud best practice, but use HPC wherever possible to get high performance
• Smoothly support current paradigms Hadoop, Spark, Flink, Heron, MPI, DARMA …
• Use interoperable common abstractions but multiple polymorphic implementations.
  • i.e. do not require a single runtime
• Focus on the Runtime, but this implies an HPC-FaaS programming and execution model
• This defines a next generation Grid based on data and edge devices – not computing as in the old Grid
• See paper: http://dsc.soic.indiana.edu/publications/twister2_design_big_data_toolkit.pdf
• Unit of Processing is an Event-driven Function (a microservice); replaces libraries
  • Can have state that may need to be preserved in place (Iterative MapReduce)
  • Functions can be single, or 1 of 100,000 maps in a large parallel code
• Processing units run in HPC clouds, fogs or devices, but these all have a similar software architecture (see AWS Greengrass and Lambda)
  • Universal Programming model, so a Fog (e.g. car) looks like a cloud to a device (radar sensor) while the public cloud looks like a cloud to the fog (car)
• Analyze the runtime of existing systems (more study needed)
  • Hadoop, Spark, Flink, Pregel: Big Data Processing
  • Storm, Heron: Streaming Dataflow
  • Kepler, Pegasus, NiFi: workflow systems
  • Harp Map-Collective, MPI and HPC AMT runtimes like DARMA
  • And approaches such as GridFTP and CORBA/HLA(!) for wide area data links

Proposed Twister2 Approach
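The "event-driven function with preserved state" unit above can be sketched as a tiny class. The names here are hypothetical illustrations, not the Twister2 API: the point is that the function is invoked per event but its state survives in place between invocations, as Iterative MapReduce requires.

```python
class StatefulFunction:
    """A 'microservice' invoked once per event; state is preserved in place
    across invocations (contrast with a pure, stateless FaaS function)."""

    def __init__(self):
        self.model = 0                 # state carried between iterations

    def on_event(self, event):
        self.model += event            # e.g. accumulate a partial model update
        return self.model

fn = StatefulFunction()
results = [fn.on_event(e) for e in [1, 2, 3]]   # state carries over: 1, 3, 6
```

In a large parallel code this function would be one of many thousands of map instances, each owning its slice of the model.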
Comparing Spark, Flink, Heron and MPI
• On Global Machine Learning GML.
• Note I said Spark and Flink are successful on LML, not GML, and currently LML is more common than GML
Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (dropped as no time)
• Implementation in Java
  • MDS is the most complex algorithm - three nested parallel loops
  • K-Means - one parallel loop
  • Terasort - no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)
Multidimensional Scaling: 3 Nested Parallel Sections
[Figure: MDS execution time on 16 nodes with 20 processes in each node with a varying number of points; and MDS execution time with 32000 points on a varying number of nodes, each node running 20 parallel tasks. Curves shown for Flink, Spark and MPI.]
MPI is a factor of 20-200 faster than Spark/Flink
Terasort: Sorting 1TB of data records
[Figure: Terasort execution time on 64 and 32 nodes. Only MPI shows the sorting time and communication time separately, as the other two frameworks don't provide a viable method to accurately measure them. Sorting time includes data save time. MPI-IB = MPI with Infiniband.]
• Partition the data using a sample, and regroup
• Transfer data using MPI
Dataflow at Different Grain Sizes
[Figure: internal execution dataflow nodes (Maps, Reduce, Iterate) linked by HPC communication.]
• Coarse-grain dataflow links jobs in a pipeline: Data preparation, Clustering, Dimension Reduction, Visualization
• But internally to each job you can also elegantly express the algorithm as dataflow, although with more stringent performance constraints
P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}
(Iterate)

Corresponding to the classic Spark K-means dataflow
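The K-means dataflow above can be made runnable in miniature. This is a sketch under the assumption that plain Python lists stand in for Spark RDDs and loadPoints/loadInitCenters are made-up stand-ins; the loop body is the map-with-broadcast followed by a reduce into a new model C.

```python
def kmeans_iteration(points, centers):
    """One dataflow step: map() assigns each point to its nearest center
    (centers 'broadcast' to the maps); reduce() combines sums into new centers."""
    sums = {i: [0.0, 0] for i in range(len(centers))}
    for p in points:                                   # the map phase
        i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        sums[i][0] += p
        sums[i][1] += 1
    return [s / n if n else centers[i]                 # the reduce phase -> new C
            for i, (s, n) in sorted(sums.items())]

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]                # P = loadPoints()
centers = [0.0, 10.0]                                  # C = loadInitCenters()
for _ in range(10):                                    # C = reduce(map(P, broadcast C))
    centers = kmeans_iteration(points, centers)
```

In Spark the loop creates a new immutable dataset each iteration, whereas an MPI/Harp version would update the centers buffer in place with an allreduce.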
Implementing Twister2 in detail I
This breaks the 2012-2017 rule of not "competing" with, but rather "enhancing", Apache
Look at Communication in detail
http://www.iterativemapreduce.org/
Twister2 Components I
9/25/2017 33
Area | Component | Implementation | Comments (User API)
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark, Flink, Hadoop, Pregel, MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |
Twister2 Components II
Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse-grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication, Map-Collective | Conventional MPI, Harp | MPI Point-to-Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts | Data API
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging, execution | Research needed | Crosses all Components
Scheduling Choices
• Scheduling is one key area where dataflow systems differ
• Dynamic Scheduling (Spark)
  • Fine grain control of dataflow graph
  • Graph cannot be optimized
• Static Scheduling (Flink)
  • Less control of the dataflow graph
  • Graph can be optimized
• Twister2 will allow either
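The two scheduling modes above can be contrasted with a toy sketch (a hypothetical API, not Twister2 or Spark/Flink code): a static schedule fixes all task placements before execution, so the whole plan can be optimized as one graph, while a dynamic scheduler decides placement per task as it becomes ready.

```python
def static_schedule(tasks, workers):
    """Plan the whole graph up front (round-robin here); enables global optimization
    but cannot react to runtime load."""
    return {t: workers[i % len(workers)] for i, t in enumerate(tasks)}

def dynamic_schedule(tasks, workers):
    """Decide per task at runtime: place each ready task on the currently
    least-loaded worker; fine-grain control, but no whole-graph optimization."""
    load = {w: 0 for w in workers}
    placement = {}
    for t, cost in tasks:                  # tasks arrive with their observed cost
        w = min(load, key=load.get)
        placement[t] = w
        load[w] += cost
    return placement
```

With an expensive task "a" (cost 5) followed by cheap tasks, the dynamic scheduler routes the cheap tasks away from the busy worker, which the static round-robin plan cannot do.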
Communication Models
• MPI characteristics: tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In-place communications and computations (process scope for state)
• Basic dataflow: model a computation as a graph
  • Nodes do computations, with Tasks as the computations and edges as asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events is important; group data into time windows
• Machine Learning dataflow: iterative computations that keep track of state
  • There is both Model and Data, but only communicate the model
  • Collective communication operations such as AllReduce, AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI-style communication
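The streaming-dataflow point above (unbounded, ordered tuples grouped into time windows) can be shown minimally. This sketch assumes tumbling (non-overlapping, fixed-width) windows keyed by event timestamp; real systems such as Heron also support sliding windows and out-of-order handling.

```python
def tumbling_windows(events, width):
    """Group (timestamp, value) tuples into fixed-width time windows."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // width, []).append(value)   # window index = ts // width
    return dict(sorted(windows.items()))

events = [(0, "a"), (1, "b"), (5, "c"), (6, "d"), (11, "e")]
by_window = tumbling_windows(events, width=5)   # {0: ['a','b'], 1: ['c','d'], 2: ['e']}
```

Windowing is what turns an unbounded stream into finite batches that collective or reduce operations can act on.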
Mahout and SPIDAL
• Mahout was the Hadoop machine learning library, but it was largely abandoned as Spark outperformed Hadoop
• SPIDAL outperforms Spark MLlib and Flink due to better communication and in-place dataflow.
• SPIDAL also has community algorithms
  • Biomolecular Simulation
  • Graphs for Network Science
  • Image processing for pathology and polar science
Qiu/Fox Core SPIDAL Parallel HPC Library, with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)

DAAL implies integrated on node with the Intel DAAL Optimized Data Analytics Library (runs on KNL!)
Map-Collective runtime merges MapReduce and HPC
[Figure: Harp collective operations – allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast.]
Qiu MIDAS runtime software for Harp
Harp v. Spark:
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib

Harp v. Torch:
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on a single KNL7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups

Harp v. MPI:
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution
Systems State: Spark K-means Dataflow
P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}
Save State at Coordination Point: Store C in RDD
• State is handled differently in different systems
  • CORBA, AMT, MPI and Storm/Heron have long-running tasks that preserve state
  • Spark and Flink preserve datasets across dataflow nodes using in-memory databases
  • All systems agree on coarse-grain dataflow; there, state is only kept by exchanging data
Fault Tolerance and State
• A similar form of check-pointing mechanism is already used in HPC and Big Data
  • although HPC is informal, as it doesn't typically specify the computation as a dataflow graph
• Flink and Spark do better than MPI due to their use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs
• Checkpoint after each stage of the dataflow graph (at the location of intelligent dataflow nodes)
  • Natural synchronization point
  • Let's allow the user to choose when to checkpoint (not every stage)
  • Save state as the user specifies; Spark just saves Model state, which is insufficient for complex algorithms
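The checkpoint-at-coordination-points idea above can be sketched as follows. This is a hypothetical design illustration, not MPI or Spark code: stages of a dataflow graph run in order, and a full state snapshot is taken only at the stages the user chooses.

```python
def run_stages(stages, state, checkpoint_at, store):
    """Run dataflow stages in order; snapshot the full user-specified state only
    at the user-chosen coordination points (not after every stage)."""
    for i, stage in enumerate(stages):
        state = stage(state)              # one stage of the dataflow graph
        if i in checkpoint_at:            # natural synchronization point
            store[i] = state              # save full state, not just the model
    return state

store = {}
result = run_stages([lambda s: s + 1, lambda s: s * 2, lambda s: s - 3],
                    state=0, checkpoint_at={1}, store=store)
# result == -1, store == {1: 2}: recovery could restart from stage 2 with state 2
```

Letting the user pick checkpoint_at is the point made above: checkpointing every stage is safe but wasteful, while Spark's model-only snapshot can lose state a complex algorithm needs.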
Initial Twister2 Performance
• Eventually test lots of choices of task managers and communication models; threads versus processes; languages etc.
• Here 16 Haswell nodes, each with 1 process running 20 tasks as threads; Java
• Reduce microbenchmark for Apache Flink and Twister2; Flink's poor performance is due to a non-optimized reduce operation
• Twister2 has a new dataflow communication library based on MPI – in this case 1000 times faster than Flink
Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We suggest an event-driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications
  • Highly parallel on the cloud; possibly sequential at the edge
• Integrate current technology of FaaS (Function as a Service) and server-hidden (serverless) computing with HPC and Apache batch/streaming systems
• We have built a high performance data analysis library, SPIDAL
• We have integrated HPC into many Apache systems with HPC-ABDS
• We have done a very preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, DARMA (HPC Asynchronous Many-Task)
  • There are different technologies for different circumstances, but they can be unified by high level abstractions such as communication collectives
  • Obviously MPI is best for parallel computing (by definition)
  • Apache systems use dataflow communication, which is natural for distributed systems but inevitably slow for classic parallel computing
    • No standard dataflow library (why?). Add Dataflow primitives in MPI-4?
  • MPI could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and State management with RDD (datasets)