Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes from Cloud to Edge Applications

TRANSCRIPT

Work with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe

12/29/17 1

The 15th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2017), Guangzhou, China, December 12-15, 2017
http://trust.gzhu.edu.cn/conference/ISPA2017/
Geoffrey Fox, December 13, 2017
Department of Intelligent Systems, [email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Abstract
• We look again at Big Data programming environments such as Hadoop, Spark, Flink, Heron, Pregel; HPC concepts such as MPI and Asynchronous Many-Task runtimes; and Cloud/Grid/Edge ideas such as event-driven computing, serverless computing, workflow and Services.
• These cross many research communities including distributed systems, databases, cyber-physical systems and parallel computing, which sometimes have inconsistent worldviews.
• There are many common capabilities across these systems which are often implemented differently in each packaged environment. For example, communication can be bulk synchronous processing or dataflow; scheduling can be dynamic or static; state and fault tolerance can have different models; execution and data can be streaming or batch, distributed or local.
• We suggest that one can usefully build a toolkit (called Twister2 by us) that supports these different choices and allows fruitful customization for each application area. We illustrate the design of Twister2 by several point studies.
• Supercomputers will be essential for large simulations and will run other applications
• HPC Clouds or Next-Generation Commodity Systems will be a dominant force
• Merge Cloud HPC and (support of) Edge computing
• Federated Clouds running in multiple giant data centers offering all types of computing
• Distributed data sources associated with device and Fog processing resources
• Server-hidden computing and Function as a Service (FaaS) for user pleasure: "No server is easier to manage than no server"
• Support a distributed event-driven serverless dataflow computing model covering batch and streaming data as HPC-FaaS
• Needing parallel and distributed (Grid) computing ideas
• Span Pleasingly Parallel to Data management to Global Machine Learning

Predictions/Assumptions
• Use of public clouds increasing rapidly
• Clouds becoming diverse, with subsystems containing GPUs, FPGAs, high performance networks, storage, memory …
• Rich software stacks:
  • HPC (High Performance Computing) for Parallel Computing, less used than(?)
  • Apache for Big Data Software Stack ABDS, including center and edge computing (streaming)
• Surely Big Data requires High Performance Computing?
• Service-oriented Systems, Internet of Things and Edge Computing growing in importance
• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed up (unclear) requirements

Background Remarks
• On general principles, parallel and distributed computing have different requirements, even if they sometimes have similar functionalities
• The Apache stack ABDS typically uses distributed computing concepts
  • For example, the Reduce operation is different in MPI (Harp) and Spark
• Large scale simulation requirements are well understood
• Big Data requirements are not agreed, but there are a few key use types:
  1) Pleasingly parallel processing (including local machine learning LML), as of different tweets from different users, with perhaps MapReduce style of statistics and visualizations; possibly Streaming
  2) Database model with queries, again supported by MapReduce for horizontal scaling
  3) Global Machine Learning GML with a single job using multiple nodes as classic parallel computing
  4) Deep Learning certainly needs HPC – possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to ABDS (with no HPC)
• This explains why Spark, with poor GML performance, is so successful and why it can ignore MPI, even though MPI uses the best technology for parallel computing

Requirements
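The difference between the MPI (Harp) and Spark Reduce operations noted above can be sketched in a few lines. This is a simplified single-process simulation, not real MPI or Spark code: MPI-style allreduce combines in place so every rank ends up holding the result, while Spark-style reduce treats partitions as immutable and produces a new value at the driver.

```python
def mpi_style_allreduce(partitions):
    """MPI/Harp style: combine in place; every rank's buffer ends up holding the result."""
    total = sum(sum(p) for p in partitions)
    for p in partitions:              # overwrite each rank's buffer in place
        p[:] = [total]
    return partitions

def spark_style_reduce(partitions):
    """Spark style: partitions are immutable; reduce creates a new value at the driver."""
    return sum(sum(p) for p in partitions)   # inputs left untouched

ranks = [[1, 2], [3, 4], [5]]
driver_result = spark_style_reduce([list(p) for p in ranks])  # 15; ranks unchanged
mpi_style_allreduce(ranks)                                    # every rank now holds [15]
```

The in-place semantics are why MPI can avoid the data movement and object creation that the dataflow model implies.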
HPC Runtime versus ABDS distributed Computing Model on Data Analytics

Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest

Need a Polymorphic Reduction capability, choosing the best implementation

Use HPC architecture with Mutable model, Immutable data
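The "polymorphic reduction" idea above can be illustrated as one reduce() API with several interchangeable implementations chosen per call site. This is a hypothetical sketch, not Twister2 code; the selection heuristic (a size threshold) is an illustrative assumption.

```python
def tree_reduce(values, op):
    """Log-depth pairwise combine, as a collective library might do for large data."""
    vals = list(values)
    while len(vals) > 1:
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])   # odd element carried to the next round
        vals = nxt
    return vals[0]

def serial_reduce(values, op):
    """Simple left fold, fine for small local data."""
    it = iter(values)
    acc = next(it)
    for v in it:
        acc = op(acc, v)
    return acc

def reduce_poly(values, op, size_hint=None):
    """One interface, multiple implementations: pick by (assumed) size heuristic."""
    values = list(values)
    n = size_hint if size_hint is not None else len(values)
    return tree_reduce(values, op) if n > 64 else serial_reduce(values, op)
```

A real runtime would also dispatch on data placement (in-memory vs. distributed) and network hardware, not just element count.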
Use Case Analysis
• Very short, as described in previous talks and papers
• Started with the NIST collection of 51 use cases
• "Version 2" https://bigdatawg.nist.gov/V2_output_docs.php just released August 2017
• 64 Features of Data and Model for large scale big data or simulation use cases
NIST Big Data Public Working Group Standards Best Practice: https://bigdatawg.nist.gov/V2_output_docs.php
Indiana Cloudmesh launching Twister2
[Figure: "Convergence Diamonds" views and facets. Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Execution View: Performance Metrics; Flops per Byte/Memory IO/Flops per watt; Execution Environment and core libraries; Data Volume, Model Size, Data Velocity, Data Variety, Model Variety, Veracity; Communication Structure; Dynamic=D/Static=S and Regular=R/Irregular=I for both Data and Model; Iterative/Simple; Data and Model Abstraction; Data Metric=M/Non-Metric=N; O(N^2)=NN / O(N)=N. Data Source and Style View: SQL/NoSQL/NewSQL; Enterprise Data Model; Files/Objects; HDFS/Lustre/GPFS; Archived/Batched/Streaming – S1, S2, S3, S4, S5; Shared/Dedicated/Transient/Permanent; Metadata/Provenance; Internet of Things; HPC Simulations; Geospatial Information System. Processing View: Micro-benchmarks; Local and Global (Analytics/Informatics/Simulations); Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels and many subclasses; Graph Algorithms; Core Libraries; Visualization. Simulation (Exascale) Processing Diamonds: Multiscale Method; Iterative PDE Solvers; Evolution of Discrete Systems; Particles and Fields; N-body Methods; Spectral Methods; Nature of mesh if used. Facets are marked as Data, Model, or both (All Model, Nearly all Data+Model, Nearly all Data, Mix of Data and Model). Of the 51 use cases: 41 Streaming, 26 Pleasingly Parallel, 25 MapReduce.]

64 Features in 4 views for Unified Classification of Big Data and Simulation Applications
1. Pleasingly Parallel – as in BLAST, protein docking, some (bio-)imagery, including Local Analytics or Machine Learning – ML or filtering pleasingly parallel, as in bio-imagery, radar images (pleasingly parallel but sophisticated local analytics)
2. Classic MapReduce: Search, Index and Query, and Classification algorithms like collaborative filtering (G1 for MRStat in Features, G7)
3. Map-Collective: Iterative maps + communication dominated by "collective" operations as in reduction, broadcast, gather, scatter. Common data mining pattern
4. Map Point-to-Point: Iterative maps + communication dominated by many small point-to-point messages, as in graph algorithms
5. Map-Streaming: Describes streaming, steering and assimilation problems
6. Shared Memory: Some problems are asynchronous and are easier to parallelize on shared rather than distributed memory – see some graph algorithms
7. SPMD: Single Program Multiple Data, common parallel programming feature
8. BSP or Bulk Synchronous Processing: well-defined compute-communication phases
9. Fusion: Knowledge discovery often involves fusion of multiple methods.
10. Dataflow: Important application feature, often occurring in composite Ogres
11. Use Agents: as in epidemiology (swarm approaches). This is Model only
12. Workflow: All applications often involve orchestration (workflow) of multiple components

Problem Architecture View (Meta or Macro Patterns)
Most (11 of the total 12) are properties of Data+Model
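The BSP pattern in facet 8 above can be shown in miniature. This is a sketch under the assumption of simulated workers in a single process: well-defined compute phases operate on local partitions only, separated by a barrier/communication phase where results are exchanged before anyone continues.

```python
def bsp_run(partitions, supersteps):
    """Simulate BSP supersteps: local compute, then barrier + global combine."""
    state = [0] * len(partitions)             # one local value per worker
    for _ in range(supersteps):
        # Compute phase: each worker touches only its own partition and state.
        local = [sum(p) + s for p, s in zip(partitions, state)]
        # Barrier + communication phase: exchange before the next superstep.
        total = sum(local)
        state = [total] * len(partitions)     # every worker sees the combined value
    return state[0]

result = bsp_run([[1, 2], [3, 4]], supersteps=2)
```

Real BSP runtimes (Pregel, Hama, MPI with barriers) distribute the workers, but the phase structure is the same.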
These 3 are the focus of Twister2, but we need to preserve capability on the first 2 paradigms
Classic Cloud Workload
Global Machine Learning
Note Problem and System Architecture, as efficient execution says they must match
• Need to discuss Data and Model, as problems have both intermingled, but we can get insight by separating them, which allows better understanding of Big Data - Big Simulation "convergence" (or differences!)
• The Model is a user construction: it has a "concept" and parameters, and gives results determined by the computation. We use the term "model" in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
  • For clustering, the model parameters are the cluster centers, while the data is the set of points to be clustered
  • For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried and the SQL query
  • For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights. The data is the set of images used for training or classification

Data and Model in Big Data and Simulations I
Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
  • The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values
  • Data could be small, when just boundary conditions
  • Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Big Data implies Data is large, but the Model varies in size
  • e.g. LDA (Latent Dirichlet Allocation) with many topics, or deep learning, has a large model
  • Clustering or dimension reduction can be quite small in model size
• Data is often static between iterations (unless streaming); Model parameters vary between iterations
• Data and Model parameters are often confused in papers, as the term data is used to describe the parameters of models.
• Models in Big Data and Simulations have many similarities and allow convergence
• Applications – Divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features).
  • Identify importance of streaming data, pleasingly parallel, global/local machine learning
• Software – Single model of High Performance Computing (HPC) Enhanced Big Data Stack HPC-ABDS: 21 layers adding a high performance runtime to Apache systems; HPC-FaaS Programming Model
  • Serverless Infrastructure as a Service IaaS
• Hardware system designed for functionality and performance of application type, e.g. disks, interconnect, memory, CPU acceleration different for machine learning, pleasingly parallel, data management, streaming, simulations
• Use DevOps to automate deployment of event-driven software defined systems on hardware: HPC Cloud 2.0
• Total System Solutions (wisdom) as a Service: HPC Cloud 3.0

Convergence/Divergence Points for HPC-Cloud-Edge - Big Data-Simulation
Use of DevOps is not discussed in this talk
Parallel Computing: Big Data and Simulations
• All the different programming models (Spark, Flink, Storm, Naiad, MPI/OpenMP) have the same high level approach, but application requirements and system architecture can give a different appearance
• First: Break problem Data and/or Model-parameters into parts assigned to separate nodes, processes, threads
• Then: In parallel, do computations, typically leaving data untouched but changing model-parameters. Called Maps in MapReduce parlance; typically the owner-computes rule.
• If Pleasingly parallel, that's all it is, except for management
• If Globally parallel, need to communicate results of computations between nodes during the job
  • Communication mechanism (TCP, RDMA, Native Infiniband) can vary
  • Communication style (Point-to-Point, Collective, Pub-Sub) can vary
  • Possible need for sophisticated dynamic changes in partitioning (load balancing)
• Computation either on fixed tasks or flow between tasks
• Choices: "Automatic Parallelism or Not"
• Choices: "Complicated Parallel Algorithm or Not"
• Fault-tolerance model can vary
• Output model can vary: RDD or Files or Pipes
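The decomposition steps above ("break into parts", then "maps" under the owner-computes rule) can be sketched as follows. This assumes a simple cyclic block partitioning of a flat list of model parameters; real systems partition by key, range, or graph structure.

```python
def partition(items, n_workers):
    """Break problem data / model-parameters into parts assigned to separate workers
    (cyclic assignment: worker i owns items i, i+n, i+2n, ...)."""
    return [items[i::n_workers] for i in range(n_workers)]

def map_phase(parts, fn):
    """Owner-computes rule: each worker updates only the parameters it owns."""
    return [[fn(x) for x in part] for part in parts]

parts = partition(list(range(10)), 3)        # worker 0 owns [0, 3, 6, 9]
updated = map_phase(parts, lambda x: x * x)  # independent "Maps", no communication
```

If the problem is pleasingly parallel this is essentially the whole program; a globally parallel problem would follow the map phase with the communication step described above.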
Spectrum of Applications and Algorithms
[Figure: applications arranged by Difficulty in Parallelism and Size of Synchronization constraints. Loosely Coupled: Pleasingly Parallel (often independent events) and MapReduce as in scalable databases – the current major Big Data category, run on Commodity Clouds. Structured Adaptive Sparsity, Huge Jobs: large scale simulations on HPC Clouds with High Performance Interconnect and Exascale Supercomputers. Unstructured Adaptive Sparsity, Medium size Jobs: Global Machine Learning (e.g. parallel clustering), Deep Learning, Graph Analytics (e.g. subgraph mining), LDA on HPC Clouds/Supercomputers, where memory access is also critical. Linear Algebra at core (typically not sparse); Disk I/O.]
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Ogres Application Analysis
HPC-ABDS and HPC-FaaS Software; Harp and Twister2 Building Blocks
SPIDAL Data Analytics Library
HPC-ABDS
Integrated wide range of HPC and Big Data technologies.
I gave up updating the list in January 2016!

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST, PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds; Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages, January 29 2016
• Google likes to show a timeline; we can build on (the Apache version of) this
  • 2002 Google File System GFS ~ HDFS (Level 8)
  • 2004 MapReduce: Apache Hadoop (Level 14A)
  • 2006 BigTable: Apache HBase (Level 11B)
  • 2008 Dremel: Apache Drill (Level 15A)
  • 2009 Pregel: Apache Giraph (Level 14A)
  • 2010 FlumeJava: Apache Crunch (Level 17)
  • 2010 Colossus, a better GFS (Level 18)
  • 2012 Spanner, horizontally scalable NewSQL database ~ CockroachDB (Level 11C)
  • 2013 F1, horizontally scalable SQL database (Level 11C)
  • 2013 MillWheel ~ Apache Storm, Twitter Heron (Google not first!) (Level 14B)
  • 2015 Cloud Dataflow: Apache Beam with Spark or Flink (dataflow) engine (Level 17)
• Functionalities not identified: Security (3), Data Transfer (10), Scheduling (9), DevOps (6), serverless computing (where Apache has OpenWhisk) (5)

Components of Big Data Stack
HPC-ABDS Levels in ()
Different choices in software systems in Clouds and HPC. HPC-ABDS takes cloud software, augmented by HPC when needed to improve performance
16 of 21 layers plus languages
Implementing Twister2 to support a Grid linked to an HPC Cloud
[Figure: deployment patterns – Centralized HPC Cloud + IoT Devices, and Centralized HPC Cloud + Edge = Fog + IoT Devices, with Cloud, HPC and Fog layers shown. The HPC Cloud can be federated.]
Twister2: "Next Generation Grid - Edge – HPC Cloud"
• The original 2010 Twister paper has 914 citations; it was a particular approach to Map-Collective iterative processing for machine learning
• Re-engineer current Apache Big Data and HPC software systems as a toolkit
• Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains.
• Support all types of data analysis from GML to Edge computing
• Build on Cloud best practice, but use HPC wherever possible to get high performance
• Smoothly support current paradigms Hadoop, Spark, Flink, Heron, MPI, DARMA …
• Use interoperable common abstractions but multiple polymorphic implementations.
  • i.e. do not require a single runtime
• Focus on the Runtime, but this implies an HPC-FaaS programming and execution model
• This defines a next generation Grid based on data and edge devices – not computing as in the old Grid
• See paper: http://dsc.soic.indiana.edu/publications/twister2_design_big_data_toolkit.pdf
• Unit of Processing is an Event-driven Function (a microservice); replaces libraries
  • Can have state that may need to be preserved in place (Iterative MapReduce)
  • Functions can be single, or 1 of 100,000 maps in a large parallel code
• Processing units run in HPC clouds, fogs or devices, but these all have a similar software architecture (see AWS Greengrass and Lambda)
  • Universal Programming model, so a Fog (e.g. car) looks like a cloud to a device (radar sensor) while the public cloud looks like a cloud to the fog (car)
• Analyze the runtime of existing systems (more study needed)
  • Hadoop, Spark, Flink, Pregel: Big Data Processing
  • Storm, Heron: Streaming Dataflow
  • Kepler, Pegasus, NiFi: workflow systems
  • Harp Map-Collective, MPI and HPC AMT runtimes like DARMA
  • And approaches such as GridFTP and CORBA/HLA(!) for wide area data links

Proposed Twister2 Approach
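The "event-driven function with preserved state" unit above can be sketched as a tiny class. The names here are hypothetical illustrations, not the Twister2 API: the point is that the function is invoked per event but its state survives in place between invocations, as Iterative MapReduce requires.

```python
class StatefulFunction:
    """A 'microservice' invoked once per event; state is preserved in place
    across invocations (contrast with a pure, stateless FaaS function)."""

    def __init__(self):
        self.model = 0                 # state carried between iterations

    def on_event(self, event):
        self.model += event            # e.g. accumulate a partial model update
        return self.model

fn = StatefulFunction()
results = [fn.on_event(e) for e in [1, 2, 3]]   # state carries over: 1, 3, 6
```

In a large parallel code this function would be one of many thousands of map instances, each owning its slice of the model.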
Comparing Spark, Flink, Heron and MPI
• On Global Machine Learning GML.
• Note I said Spark and Flink are successful on LML, not GML, and currently LML is more common than GML
Machine Learning with MPI, Spark and Flink
• Three algorithms implemented in three runtimes
  • Multidimensional Scaling (MDS)
  • Terasort
  • K-Means (dropped as no time)
• Implementation in Java
  • MDS is the most complex algorithm - three nested parallel loops
  • K-Means - one parallel loop
  • Terasort - no iterations
• With care, Java performance ~ C performance
• Without care, Java performance << C performance (details omitted)
Multidimensional Scaling: 3 Nested Parallel Sections
[Figure: MDS execution time on 16 nodes with 20 processes in each node with a varying number of points; and MDS execution time with 32000 points on a varying number of nodes, each node running 20 parallel tasks. Curves shown for Flink, Spark and MPI.]
MPI is a factor of 20-200 faster than Spark/Flink
Terasort: Sorting 1TB of data records
[Figure: Terasort execution time on 64 and 32 nodes. Only MPI shows the sorting time and communication time separately, as the other two frameworks don't provide a viable method to accurately measure them. Sorting time includes data save time. MPI-IB = MPI with Infiniband.]
• Partition the data using a sample, and regroup
• Transfer data using MPI
Dataflow at Different Grain Sizes
[Figure: internal execution dataflow nodes (Maps, Reduce, Iterate) linked by HPC communication.]
• Coarse-grain dataflow links jobs in a pipeline: Data preparation, Clustering, Dimension Reduction, Visualization
• But internally to each job you can also elegantly express the algorithm as dataflow, although with more stringent performance constraints
P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}
(Iterate)

Corresponding to the classic Spark K-means dataflow
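The K-means dataflow above can be made runnable in miniature. This is a sketch under the assumption that plain Python lists stand in for Spark RDDs and loadPoints/loadInitCenters are made-up stand-ins; the loop body is the map-with-broadcast followed by a reduce into a new model C.

```python
def kmeans_iteration(points, centers):
    """One dataflow step: map() assigns each point to its nearest center
    (centers 'broadcast' to the maps); reduce() combines sums into new centers."""
    sums = {i: [0.0, 0] for i in range(len(centers))}
    for p in points:                                   # the map phase
        i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        sums[i][0] += p
        sums[i][1] += 1
    return [s / n if n else centers[i]                 # the reduce phase -> new C
            for i, (s, n) in sorted(sums.items())]

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]                # P = loadPoints()
centers = [0.0, 10.0]                                  # C = loadInitCenters()
for _ in range(10):                                    # C = reduce(map(P, broadcast C))
    centers = kmeans_iteration(points, centers)
```

In Spark the loop creates a new immutable dataset each iteration, whereas an MPI/Harp version would update the centers buffer in place with an allreduce.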
Implementing Twister2 in detail I
This breaks the 2012-2017 rule of not "competing" with, but rather "enhancing", Apache
Look at Communication in detail
http://www.iterativemapreduce.org/
Twister2 Components I
9/25/2017 33
Area | Component | Implementation | Comments (User API)
Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state
Architecture Specification | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why?
Architecture Specification | Parallel Computing | Spark, Flink, Hadoop, Pregel, MPI modes | Owner Computes Rule
Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management
Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, KNL)
Task System | Elasticity | OpenWhisk |
Task System | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ |
Task System | Task Execution | Process, Threads, Queues |
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms |
Task System | Task Graph | Static Graph, Dynamic Graph Generation |
Twister2 Components II
Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse-grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library
Communication API | BSP Communication, Map-Collective | Conventional MPI, Harp | MPI Point-to-Point and Collective API
Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API
Data Access | Streaming Data | Message Brokers, Spouts | Data API
Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet
Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components
Security | Storage, Messaging, execution | Research needed | Crosses all Components
Scheduling Choices
• Scheduling is one key area where dataflow systems differ
• Dynamic Scheduling (Spark)
  • Fine grain control of dataflow graph
  • Graph cannot be optimized
• Static Scheduling (Flink)
  • Less control of the dataflow graph
  • Graph can be optimized
• Twister2 will allow either
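The two scheduling modes above can be contrasted with a toy sketch (a hypothetical API, not Twister2 or Spark/Flink code): a static schedule fixes all task placements before execution, so the whole plan can be optimized as one graph, while a dynamic scheduler decides placement per task as it becomes ready.

```python
def static_schedule(tasks, workers):
    """Plan the whole graph up front (round-robin here); enables global optimization
    but cannot react to runtime load."""
    return {t: workers[i % len(workers)] for i, t in enumerate(tasks)}

def dynamic_schedule(tasks, workers):
    """Decide per task at runtime: place each ready task on the currently
    least-loaded worker; fine-grain control, but no whole-graph optimization."""
    load = {w: 0 for w in workers}
    placement = {}
    for t, cost in tasks:                  # tasks arrive with their observed cost
        w = min(load, key=load.get)
        placement[t] = w
        load[w] += cost
    return placement
```

With an expensive task "a" (cost 5) followed by cheap tasks, the dynamic scheduler routes the cheap tasks away from the busy worker, which the static round-robin plan cannot do.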
Communication Models
• MPI characteristics: tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In-place communications and computations (process scope for state)
• Basic dataflow: model a computation as a graph
  • Nodes do computations, with Tasks as the computations and edges as asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events is important; group data into time windows
• Machine Learning dataflow: iterative computations that keep track of state
  • There is both Model and Data, but only communicate the model
  • Collective communication operations such as AllReduce, AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI-style communication
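The streaming-dataflow point above (unbounded, ordered tuples grouped into time windows) can be shown minimally. This sketch assumes tumbling (non-overlapping, fixed-width) windows keyed by event timestamp; real systems such as Heron also support sliding windows and out-of-order handling.

```python
def tumbling_windows(events, width):
    """Group (timestamp, value) tuples into fixed-width time windows."""
    windows = {}
    for ts, value in events:
        windows.setdefault(ts // width, []).append(value)   # window index = ts // width
    return dict(sorted(windows.items()))

events = [(0, "a"), (1, "b"), (5, "c"), (6, "d"), (11, "e")]
by_window = tumbling_windows(events, width=5)   # {0: ['a','b'], 1: ['c','d'], 2: ['e']}
```

Windowing is what turns an unbounded stream into finite batches that collective or reduce operations can act on.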
Mahout and SPIDAL
• Mahout was the Hadoop machine learning library, but it was largely abandoned as Spark outperformed Hadoop
• SPIDAL outperforms Spark MLlib and Flink due to better communication and in-place dataflow.
• SPIDAL also has community algorithms
  • Biomolecular Simulation
  • Graphs for Network Science
  • Image processing for pathology and polar science
Qiu/Fox Core SPIDAL Parallel HPC Library, with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering (Deterministic Annealing): Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)

DAAL implies integrated on node with the Intel DAAL Optimized Data Analytics Library (runs on KNL!)
Map-Collective runtime merges MapReduce and HPC
[Figure: Harp collective operations – allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast.]
Qiu MIDAS runtime software for Harp
Harp v. Spark:
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib

Harp v. Torch:
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on a single KNL7250 (Harp-DAAL) vs. a single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups

Harp v. MPI:
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over the state-of-the-art MPI-Fascia solution
Systems State: Spark K-means Dataflow
P = loadPoints()
C = loadInitCenters()
for (int i = 0; i < 10; i++) {
    T = P.map().withBroadcast(C)
    C = T.reduce()
}
Save State at Coordination Point: Store C in RDD
• State is handled differently in different systems
  • CORBA, AMT, MPI and Storm/Heron have long-running tasks that preserve state
  • Spark and Flink preserve datasets across dataflow nodes using in-memory databases
  • All systems agree on coarse-grain dataflow; there, state is only kept by exchanging data
Fault Tolerance and State
• A similar form of check-pointing mechanism is already used in HPC and Big Data
  • although HPC is informal, as it doesn't typically specify the computation as a dataflow graph
• Flink and Spark do better than MPI due to their use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs
• Checkpoint after each stage of the dataflow graph (at the location of intelligent dataflow nodes)
  • Natural synchronization point
  • Let's allow the user to choose when to checkpoint (not every stage)
  • Save state as the user specifies; Spark just saves Model state, which is insufficient for complex algorithms
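The checkpoint-at-coordination-points idea above can be sketched as follows. This is a hypothetical design illustration, not MPI or Spark code: stages of a dataflow graph run in order, and a full state snapshot is taken only at the stages the user chooses.

```python
def run_stages(stages, state, checkpoint_at, store):
    """Run dataflow stages in order; snapshot the full user-specified state only
    at the user-chosen coordination points (not after every stage)."""
    for i, stage in enumerate(stages):
        state = stage(state)              # one stage of the dataflow graph
        if i in checkpoint_at:            # natural synchronization point
            store[i] = state              # save full state, not just the model
    return state

store = {}
result = run_stages([lambda s: s + 1, lambda s: s * 2, lambda s: s - 3],
                    state=0, checkpoint_at={1}, store=store)
# result == -1, store == {1: 2}: recovery could restart from stage 2 with state 2
```

Letting the user pick checkpoint_at is the point made above: checkpointing every stage is safe but wasteful, while Spark's model-only snapshot can lose state a complex algorithm needs.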
Initial Twister2 Performance
• Eventually test lots of choices of task managers and communication models; threads versus processes; languages etc.
• Here 16 Haswell nodes, each with 1 process running 20 tasks as threads; Java
• Reduce microbenchmark for Apache Flink and Twister2; Flink's poor performance is due to a non-optimized reduce operation
• Twister2 has a new dataflow communication library based on MPI – in this case 1000 times faster than Flink
Summary of Twister2: Next Generation HPC Cloud + Edge + Grid
• We suggest an event-driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications
  • Highly parallel on the cloud; possibly sequential at the edge
• Integrate current technology of FaaS (Function as a Service) and server-hidden (serverless) computing with HPC and Apache batch/streaming systems
• We have built a high performance data analysis library, SPIDAL
• We have integrated HPC into many Apache systems with HPC-ABDS
• We have done a very preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, DARMA (HPC Asynchronous Many-Task)
  • There are different technologies for different circumstances, but they can be unified by high level abstractions such as communication collectives
  • Obviously MPI is best for parallel computing (by definition)
  • Apache systems use dataflow communication, which is natural for distributed systems but inevitably slow for classic parallel computing
    • No standard dataflow library (why?). Add Dataflow primitives in MPI-4?
  • MPI could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and State management with RDD (datasets)