big data analytics: the apache spark approach · 2017-08-12 · apache spark meetups(august 2017) 8...

33
Big Data Analytics: The Apache Spark Approach Michael Franklin ATPESC August 2017

Upload: others

Post on 28-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

BigDataAnalytics:TheApacheSparkApproach

Michael FranklinATPESC

August 2017

Page 2: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

Nearlyeveryfieldofendeavoristransitioningfrom“datapoor”to“datarich”

Astronomy:LSST

2

Physics:LHCOceanography

Sociology:TheWeb

Biology:SequencingEconomics:mobile,

POSterminals

Neuroscience:EEG,fMRI

Data-DrivenMedicine Sports

Page 3: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

The Fourth Paradigm of Science1. Empirical+ experimental2. Theoretical3. Computational4. Data-Intensive

3

Page 4: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

OpenSourceEcosystem&Context

4

���

2006-2010 Autonomic Computing & Cloud

UC BERKELEY

2011-2016 Big Data Analytics

Usenix HotCloud Workshop 2010

Page 5: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

AMPLabProjectVision“MakingSenseofDataatScale”

Algorithms

• MachineLearning,StatisticalMethods• Prediction,BusinessIntelligence

Machines

• ClustersandClouds• WarehouseScaleComputing

People

• Crowdsourcing,HumanComputation• DataScientists,Analysts

Page 6: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

BerkeleyDataAnalyticsStack

In House Applications – Genomics, IoT, Energy, Cosmology

Access and Interfaces

Processing Engines

Resource Virtualization

Storage

Page 7: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

SomeAMPLabnumbers• Funding– roughly50/50Govt/IndustrySplit

– NSFCISEExpeditions,DARPA,DOE,DHS– Google,SAP,Amazon,IBM(FoundingSponsors)+dozensmore

• Nearly2Mvisitstoamplab.cs.berkeley.edu• 200+PapersinSys,ML,DB,…3ACMDissertationAwards

(1+2HM);NumerousBestPaperandBestDemoAwards• 40+Ph.D.s granted(sofar);AlumnionfacultyatBerkeley,

HarveyMudd,Michigan,MIT,Stanford,Texas,Wisconsin,…• 3SpinoutcompaniesdirectlyfromAMPLab:

– Databricks,Mesosphere,Alluxio– Nearly$250Mraisedtodate

• Manyindustrialproducts&servicesbasedonorusingSpark• 3Marriages(andnumerouslong-termrelationships)

7

Page 8: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

ApacheSparkMeetups (August2017)

8

618 groups with 391,371 membersspark.meetup.com

Page 9: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

WeHitADataManagementInflectionPoint

• Massivelyscalable processingandstorage• Pay-as-you-go processingandstorage

(a.k.a.thecloud)• Flexible schemaonreadvs.schemaonwrite• Integration ofsearch,queryandanalysis• Sophisticated machinelearning/prediction• Human-in-the-loop analytics• Opensourceecosystem drivinginnovation

Page 10: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

BDASUnificationStrategy• SpecializingMapReduce leadstostovepipedsystems

• Instead,generalizeMapReduce:

1.RicherProgrammingModelèFewerSystemstoMaster

2.DataSharingèLessDataMovementleads

toBetterPerformanceSparkshowed10xperformanceimprovementonexistingHDFSdatawithnomigration.

Spark

Stre

aming

Gra

phX

…Spar

kSQ

L

MLb

ase

10

Page 11: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

Abstraction:DataflowOperators

• map

• filter

• groupBy

• sort

• union

• join

• leftOuterJoin

• rightOuterJoin

• reduce

• count

• fold

• reduceByKey

• groupByKey

• cogroup

• cross

• zip

sample

take

first

partitionBy

mapWith

pipe

save

...

11

Page 12: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

IterationinMap-Reduce

TrainingData

Map Reduce LearnedModel

w(1)

w(2)

w(3)

w(0)

InitialModel

12

Page 13: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

CostofIterationinMap-ReduceMap Reduce Learned

Model

w(1)

w(2)

w(3)

w(0)

InitialModel

TrainingData

Read 2Repeatedlyload same data

13

Page 14: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

CostofIterationinMap-ReduceMap Reduce Learned

Model

w(1)

w(2)

w(3)

w(0)

InitialModel

TrainingDataRedundantly saveoutput between

stages

14

Page 15: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

DataflowView

Training Data

(HDFS)

Map

Reduce

Map

Reduce

Map

Reduce

15

Page 16: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

MemoryOpt.Dataflow

Training Data

(HDFS)

Map

Reduce

Map

Reduce

Map

Reduce

CachedLoad

16

Page 17: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

MemoryOpt.DataflowView

Training Data

(HDFS)

Map

Reduce

Map

Reduce

Map

Reduce

Efficientlymove data betweenstages

Spark:10-100× faster than Hadoop MapReduce17

Page 18: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

SparkFaultTolerance• RDDs:Immutable collectionsofobjectsthatcanbestoredinmemoryordiskacrossacluster– Builtviaparalleltransformations(map,filter,…)– Automaticallyrebuilton(partial)failure

M.Zaharia,etal,ResilientDistributedDatasets:Afault-tolerantabstractionforin-memoryclustercomputing,NSDI2012. 18

messages = textFile(...).filter(_.contains(“error”)).map(_.split(‘\t’)(2))

HadoopRDDpath=hdfs://…

FilteredRDDfunc =_.contains(...)

MappedRDDfunc =_.split(…)

Page 19: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

DataFrames(mainabstractioninSpark2.0)

employees

.join(dept,employees("deptId")=== dept("id"))

.where(employees("gender")==="female")

.groupBy(dept("id"),dept("name"))

.agg(count("name"))

Notes:1) Some people think this is an improvement over SQL J2) Dataframes can be typed

19

Page 20: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

CatalystOptimizer• TypicalDBoptimizationsacrossSQLandDF– ExtensibilityviaOptimizationRuleswritteninScala– OpenSourceoptimizerevolution!

• Codegenerationforinner-loops,iteratorremoval• ExtensibleDataSources:CSV,Avro,Parquet,JDBC,…viaTableScan (allcols),PrunedScan (project),FilteredPrunedScan(pushadvisoryselectsandprojects)CatalystScan (pushadvisoryfullCatalystexpressiontrees)• Extensible(UserDefined)Types

20

M.Armbrust,etal,SparkSQL:RelationalDataProcessinginSpark,SIGMOD2015.

Page 21: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

AninterestingthingaboutSparkSQLPerformance

21

Page 22: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

LambdaArchitecture:onewaytocombineReal-Time+Batch

• lambda-architecture.net22

Page 23: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

SparkStreaming• Microbatch approachprovideslowlatency

Additional operators provide windowed operations

M.Zaharia,etal,DiscretizedStreams:Fault-Tollerant StreamingComputationatScale,SOSP2013S.Venketaraman etal,Azkar:FastandAdaptableStreamProcessingatScale,SOSP2017 23

Page 24: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

SparkStructuredStreams(unified)

24

Batch Analytics

Streaming Analytics

Page 25: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

25

SQL

MachineLearning

Streaming

PuttingitallTogether:Multi-modalAnalytics

Page 26: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code
Page 27: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

27

From:SparkUserSurvey2016,1615respondentsfrom900organizationshttp://go.databricks.com/2016-spark-survey

Page 28: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

28

Page 29: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

29

Page 30: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

30

Page 31: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

SparkEcosystemAttributes

• Sparkfocuswasinitiallyon– Performance +Scalability withFaultTolerance

• Eventually,easeofdevelopment wasakeyfeature– especiallyacrossmultiplemodalities:DB,Graph,Stream,etc.

• ThiswastrueofmostBigDatasoftwareofthatgeneration

• LowLatency(streaming)andDeepLearning arealsogarneringsignificantattentionlately

Page 32: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

What’sNext?Innovationin(opensource)BigDataSoftwarecontinues.Performance,Scalability,andFaultToleranceremainimportant,butwefacenewchallenges,including:DataScienceLifecycle

• DataAcquisition,Integration,Cleaning(i.e.,wrangling)• DataIntegrationremainsa“wickedproblem”• ModelBuilding• Communicatingresults,Curation,“TranslationalDataScience”

EaseofDevelopmentandDeployment• Canleveragedatabaseideas(e.g.,declarativequeryoptimization)• Newcomponentsfor“modelserving”and“modelmanagement”

“Safe”DataScience• end-to-endBiasMitigation• Security,EthicsandDataPrivacy• Explainingandinfluencingdecisions• Human-in-the-loop

Page 33: Big Data Analytics: The Apache Spark Approach · 2017-08-12 · Apache Spark Meetups(August 2017) 8 618 groups with 391,371members ... –Open Source optimizer evolution! •Code

33

Thanks and for More Info

[email protected]