Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Instructor: Duen Horng (Polo) Chau
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (NCR)


Source: poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-620-ScalingUp-spark.pdf

What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org
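A minimal sketch of that storage compatibility (assuming a SparkContext sc; the paths are placeholders): read from a Hadoop-supported store and write results back through the same storage API:

val events = sc.textFile("hdfs:///logs/raw")                        // any Hadoop-supported path
events.filter(_.nonEmpty).saveAsTextFile("hdfs:///logs/cleaned")    // write back via the Hadoop storage API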

What is Spark SQL? (Formerly called Shark)

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Project History [latest: v2.2]

Spark project started in 2009 at the UC Berkeley AMPLab, open sourced in 2010

Became an Apache Top-Level Project in Feb 2014

Shark / Spark SQL started summer 2011

Built by 250+ developers and people from 50 companies

Scales to 1000+ nodes in production

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …

http://en.wikipedia.org/wiki/Apache_Spark

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

Both require faster data sharing across parallel jobs

Data Sharing in MapReduce

[Diagram: iterative and interactive workloads on MapReduce; each iteration (iter. 1, iter. 2, …) and each query (query 1, query 2, query 3, …) reads its input from HDFS and writes its result back to HDFS.]

Slow due to replication, serialization, and disk I/O

Data Sharing in Spark

[Diagram: the input is processed once and kept in distributed memory; iterations (iter. 1, iter. 2, …) and ad-hoc queries (query 1, query 2, query 3, …) then share that in-memory data instead of re-reading HDFS.]

10-100× faster than network and disk

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala and Python consoles
» Supported languages: Java, Scala, Python, R
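As a minimal, self-contained sketch of this model (not from the slides; the app name and numbers are made up), the snippet below builds an RDD, applies transformations, caches it, and runs actions. The same lines also work one at a time in spark-shell:

import org.apache.spark.{SparkConf, SparkContext}

object RDDSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; on a cluster the master would point at YARN/Mesos
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val nums  = sc.parallelize(1 to 1000000)         // an RDD: a distributed collection of objects
    val evens = nums.filter(_ % 2 == 0).cache()      // transformation, result cached in memory

    println(evens.count())                           // action: triggers the computation
    println(evens.map(_ * 2).take(5).mkString(", ")) // later operators reuse the cached RDD

    sc.stop()
  }
}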

Scala background links:

http://www.scala-lang.org/old/faq/4

Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html

Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")             // base RDD, backed by blocks in HDFS
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))          // transformed RDD
cachedMsgs = messages.cache()                    // keep the messages in memory across queries

cachedMsgs.filter(_.contains("foo")).count       // action: the driver ships tasks to the workers,
cachedMsgs.filter(_.contains("bar")).count       // which build the cache and send results back
. . .

[Diagram: a Driver program sends tasks to Workers; each Worker reads its input block (Block 1-3) from HDFS, keeps its filtered messages in memory (Cache 1-3), and returns results to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

On Scala's underscore shorthand used above:
http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
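A hedged sketch of inspecting that lineage yourself (assuming a SparkContext sc and the same pipeline): RDD.toDebugString prints the chain of RDDs Spark would replay to rebuild lost partitions.

val messages = sc.textFile("hdfs://...")        // HadoopRDD
  .filter(_.contains("error"))                  // filtered RDD
  .map(_.split('\t')(2))                        // mapped RDD
println(messages.toDebugString)                 // prints the lineage graph used for recovery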

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once

var w = Vector.random(D)                                 // initial parameter vector

for (i <- 1 to ITERATIONS) {                             // repeated MapReduce steps
  val gradient = data.map(p =>                           // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
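The slide code is pseudocode: readPoint, Vector, D, and ITERATIONS are not defined. Below is a hedged, self-contained sketch of the same loop for Spark 1.x using plain arrays; the input path, the "label followed by D features per line" format, and all names are assumptions for illustration, not the course's reference code.

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp
import scala.util.Random

case class Point(x: Array[Double], y: Double)

object LogRegSketch {
  val D = 10                                   // number of features (assumed)
  val ITERATIONS = 20

  // assumed input format: label, then D features, space-separated
  def readPoint(line: String): Point = {
    val nums = line.split(" ").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logreg-sketch").setMaster("local[*]"))
    val data = sc.textFile("hdfs:///data/lr_points.txt").map(readPoint).cache()  // load once, keep in memory

    var w = Array.fill(D)(2 * Random.nextDouble - 1)             // initial parameter vector
    for (_ <- 1 to ITERATIONS) {
      val gradient = data.map { p =>
        val margin = (w, p.x).zipped.map(_ * _).sum              // w dot p.x
        val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => (a, b).zipped.map(_ + _))               // sum per-point gradients across the cluster
      w = (w, gradient).zipped.map(_ - _)                        // gradient step
    }
    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}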

Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark.]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
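For illustration only (the file paths and tab-separated layout are assumptions), a short pipeline combining several of the operators above, assuming a SparkContext sc:

val views  = sc.textFile("pageviews.tsv")                       // hypothetical "pageId<TAB>..." lines
  .map(_.split("\t"))
  .map(fields => (fields(0), 1))
val counts = views.reduceByKey(_ + _)                           // total views per pageId
val titles = sc.textFile("titles.tsv")                          // hypothetical "pageId<TAB>title" lines
  .map(_.split("\t"))
  .map(fields => (fields(0), fields(1)))
counts.join(titles)                                             // (pageId, (count, title))
  .sortBy(_._2._1, ascending = false)
  .take(10)
  .foreach(println)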

Spark Users

Spark SQL: Hive on Spark

Motivation

Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Hive Architecture

[Diagram: a Client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution stages; queries execute as MapReduce jobs; table metadata lives in the Metastore and data in HDFS.]

Spark SQL Architecture

[Diagram: same structure as Hive, with a Client (CLI, JDBC), a Driver containing the SQL Parser, Query Optimizer, Physical Plan, and Execution stages, plus a Cache Manager; execution runs on Spark instead of MapReduce, while the Metastore and HDFS are unchanged.]

[Engle et al, SIGMOD 2012]

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark
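A hedged sketch of the "call from Scala" path using the Spark 1.x HiveContext API (the table and column names below are made up for illustration):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)               // reuses the existing Hive metastore
// hypothetical tables and columns, standard HiveQL:
hiveCtx.sql("CREATE TABLE IF NOT EXISTS mydata_cached AS SELECT * FROM mydata WHERE year = 2018")
val perCategory = hiveCtx.sql("SELECT category, COUNT(*) AS n FROM mydata_cached GROUP BY category")
perCategory.rdd                                 // drop to the RDD API to mix with Spark code
  .map(row => (row.getString(0), row.getLong(1)))
  .take(10)
  .foreach(println)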

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of the working set that fits in memory.]

Cache disabled: 68.8 s | 25%: 58.1 s | 50%: 40.7 s | 75%: 29.7 s | Fully cached: 11.5 s

What's Next?

Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: input batches at T=1, T=2, …; each batch is mapped and then combined by reduceByWindow.]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

[Zaharia et al, HotCloud 2012]

map() vs flatMap()

The best explanation:
https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey

flatMap = map + flatten
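A tiny illustration of the difference (assuming spark-shell with sc defined):

val lines = sc.parallelize(Seq("to be", "or not to be"))
lines.map(_.split(" ")).collect()        // nested: Array(Array(to, be), Array(or, not, to, be))
lines.flatMap(_.split(" ")).collect()    // flattened: Array(to, be, or, not, to, be)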


Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as a part of processing or analyzed with higher level transforms
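A hedged, self-contained sketch of this batching model with the Spark 1.x Streaming API; the socket source, host/port, and the 5-second window are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))                 // divide the stream into 1 s batches

val words  = ssc.socketTextStream("localhost", 9999)              // hypothetical live text source
  .flatMap(_.toLowerCase.split(" "))
val counts = words.map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))    // windowed, higher-level transform
counts.print()                                                    // print a few counts per batch

ssc.start()
ssc.awaitTermination()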

Spark Platform

[Diagram of the stack:]
Data Storage: standard FS / HDFS / CFS / S3
Resource Management: YARN / Spark / Mesos
Execution: RDDs, with Scala / Python / Java APIs
Libraries on top: Spark SQL (Shark), Spark Streaming, GraphX, MLlib

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms
» PageRank
» Connected Components
» Triangle Counts

Alpha component
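A hedged sketch of one of the listed algorithms (PageRank) with the Spark 1.x GraphX API; the edge-list path is a placeholder:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // one "srcId dstId" pair per line
val ranks = graph.pageRank(0.001).vertices                          // (vertexId, rank); 0.001 is the convergence tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)      // top-ranked vertices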

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
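A hedged sketch of one algorithm from the list (k-means) with the RDD-based MLlib API; the input path and its space-separated numeric format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///data/points.txt")                 // hypothetical numeric input
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()                                                          // k-means is iterative, so cache the data
val model = KMeans.train(points, 3, 20)                             // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)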

MLlib 2.0 (part of Spark 2.0)

https://spark.apache.org/docs/latest/mllib-guide.html

Spark 2 (very new still)

New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Spark 2.0.0 has API-breaking changes

Partly why HW3 uses Spark 1.6 (also, the Cloudera distribution's Spark 2 support is in beta)

More details: https://spark.apache.org/releases/spark-release-2-0-0.html