Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Instructor: Duen Horng (Polo) Chau
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (NCR)


Source: poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-620-ScalingUp-spark.pdf

What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org
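A minimal sketch of that storage compatibility (assuming a SparkContext sc; the paths are placeholders): read from a Hadoop-supported store and write results back through the same storage API:

val events = sc.textFile("hdfs:///logs/raw")                        // any Hadoop-supported path
events.filter(_.nonEmpty).saveAsTextFile("hdfs:///logs/cleaned")    // write back via the Hadoop storage API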

What is Spark SQL? (Formerly called Shark)

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Project History [latest: v2.2]

Spark project started in 2009 at the UC Berkeley AMPLab, open sourced in 2010

Became an Apache Top-Level Project in Feb 2014

Shark / Spark SQL started summer 2011

Built by 250+ developers and people from 50 companies

Scales to 1000+ nodes in production

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …

http://en.wikipedia.org/wiki/Apache_Spark

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

Both require faster data sharing across parallel jobs

Data Sharing in MapReduce

[Diagram: iterative and interactive workloads on MapReduce; each iteration (iter. 1, iter. 2, …) and each query (query 1, query 2, query 3, …) reads its input from HDFS and writes its result back to HDFS.]

Slow due to replication, serialization, and disk I/O

Data Sharing in Spark

[Diagram: the input is processed once and kept in distributed memory; iterations (iter. 1, iter. 2, …) and ad-hoc queries (query 1, query 2, query 3, …) then share that in-memory data instead of re-reading HDFS.]

10-100× faster than network and disk

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala and Python consoles
» Supported languages: Java, Scala, Python, R
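As a minimal, self-contained sketch of this model (not from the slides; the app name and numbers are made up), the snippet below builds an RDD, applies transformations, caches it, and runs actions. The same lines also work one at a time in spark-shell:

import org.apache.spark.{SparkConf, SparkContext}

object RDDSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; on a cluster the master would point at YARN/Mesos
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    val nums  = sc.parallelize(1 to 1000000)         // an RDD: a distributed collection of objects
    val evens = nums.filter(_ % 2 == 0).cache()      // transformation, result cached in memory

    println(evens.count())                           // action: triggers the computation
    println(evens.map(_ * 2).take(5).mkString(", ")) // later operators reuse the cached RDD

    sc.stop()
  }
}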

Scala background links:

http://www.scala-lang.org/old/faq/4

Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html

Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")             // base RDD, backed by blocks in HDFS
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))          // transformed RDD
cachedMsgs = messages.cache()                    // keep the messages in memory across queries

cachedMsgs.filter(_.contains("foo")).count       // action: the driver ships tasks to the workers,
cachedMsgs.filter(_.contains("bar")).count       // which build the cache and send results back
. . .

[Diagram: a Driver program sends tasks to Workers; each Worker reads its input block (Block 1-3) from HDFS, keeps its filtered messages in memory (Cache 1-3), and returns results to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

On Scala's underscore shorthand used above:
http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
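A hedged sketch of inspecting that lineage yourself (assuming a SparkContext sc and the same pipeline): RDD.toDebugString prints the chain of RDDs Spark would replay to rebuild lost partitions.

val messages = sc.textFile("hdfs://...")        // HadoopRDD
  .filter(_.contains("error"))                  // filtered RDD
  .map(_.split('\t')(2))                        // mapped RDD
println(messages.toDebugString)                 // prints the lineage graph used for recovery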

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once

var w = Vector.random(D)                                 // initial parameter vector

for (i <- 1 to ITERATIONS) {                             // repeated MapReduce steps
  val gradient = data.map(p =>                           // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
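The slide code is pseudocode: readPoint, Vector, D, and ITERATIONS are not defined. Below is a hedged, self-contained sketch of the same loop for Spark 1.x using plain arrays; the input path, the "label followed by D features per line" format, and all names are assumptions for illustration, not the course's reference code.

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp
import scala.util.Random

case class Point(x: Array[Double], y: Double)

object LogRegSketch {
  val D = 10                                   // number of features (assumed)
  val ITERATIONS = 20

  // assumed input format: label, then D features, space-separated
  def readPoint(line: String): Point = {
    val nums = line.split(" ").map(_.toDouble)
    Point(nums.tail, nums.head)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logreg-sketch").setMaster("local[*]"))
    val data = sc.textFile("hdfs:///data/lr_points.txt").map(readPoint).cache()  // load once, keep in memory

    var w = Array.fill(D)(2 * Random.nextDouble - 1)             // initial parameter vector
    for (_ <- 1 to ITERATIONS) {
      val gradient = data.map { p =>
        val margin = (w, p.x).zipped.map(_ * _).sum              // w dot p.x
        val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => (a, b).zipped.map(_ + _))               // sum per-point gradients across the cluster
      w = (w, gradient).zipped.map(_ - _)                        // gradient step
    }
    println("Final w: " + w.mkString(", "))
    sc.stop()
  }
}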

Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark.]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
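For illustration only (the file paths and tab-separated layout are assumptions), a short pipeline combining several of the operators above, assuming a SparkContext sc:

val views  = sc.textFile("pageviews.tsv")                       // hypothetical "pageId<TAB>..." lines
  .map(_.split("\t"))
  .map(fields => (fields(0), 1))
val counts = views.reduceByKey(_ + _)                           // total views per pageId
val titles = sc.textFile("titles.tsv")                          // hypothetical "pageId<TAB>title" lines
  .map(_.split("\t"))
  .map(fields => (fields(0), fields(1)))
counts.join(titles)                                             // (pageId, (count, title))
  .sortBy(_._2._1, ascending = false)
  .take(10)
  .foreach(println)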

Spark Users

Spark SQL: Hive on Spark

Motivation

Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Hive Architecture

[Diagram: a Client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution stages; queries execute as MapReduce jobs; table metadata lives in the Metastore and data in HDFS.]

Spark SQL Architecture

[Diagram: same structure as Hive, with a Client (CLI, JDBC), a Driver containing the SQL Parser, Query Optimizer, Physical Plan, and Execution stages, plus a Cache Manager; execution runs on Spark instead of MapReduce, while the Metastore and HDFS are unchanged.]

[Engle et al, SIGMOD 2012]

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark
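A hedged sketch of the "call from Scala" path using the Spark 1.x HiveContext API (the table and column names below are made up for illustration):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)               // reuses the existing Hive metastore
// hypothetical tables and columns, standard HiveQL:
hiveCtx.sql("CREATE TABLE IF NOT EXISTS mydata_cached AS SELECT * FROM mydata WHERE year = 2018")
val perCategory = hiveCtx.sql("SELECT category, COUNT(*) AS n FROM mydata_cached GROUP BY category")
perCategory.rdd                                 // drop to the RDD API to mix with Spark code
  .map(row => (row.getString(0), row.getLong(1)))
  .take(10)
  .foreach(println)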

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of the working set that fits in memory.]

Cache disabled: 68.8 s | 25%: 58.1 s | 50%: 40.7 s | 75%: 29.7 s | Fully cached: 11.5 s

What's Next?

Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: input batches at T=1, T=2, …; each batch is mapped and then combined by reduceByWindow.]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

[Zaharia et al, HotCloud 2012]

map() vs flatMap()

The best explanation:
https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey

flatMap = map + flatten
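A tiny illustration of the difference (assuming spark-shell with sc defined):

val lines = sc.parallelize(Seq("to be", "or not to be"))
lines.map(_.split(" ")).collect()        // nested: Array(Array(to, be), Array(or, not, to, be))
lines.flatMap(_.split(" ")).collect()    // flattened: Array(to, be, or, not, to, be)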


Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as a part of processing or analyzed with higher level transforms
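A hedged, self-contained sketch of this batching model with the Spark 1.x Streaming API; the socket source, host/port, and the 5-second window are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))                 // divide the stream into 1 s batches

val words  = ssc.socketTextStream("localhost", 9999)              // hypothetical live text source
  .flatMap(_.toLowerCase.split(" "))
val counts = words.map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5))    // windowed, higher-level transform
counts.print()                                                    // print a few counts per batch

ssc.start()
ssc.awaitTermination()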

Spark Platform

[Diagram of the stack:]
Data Storage: standard FS / HDFS / CFS / S3
Resource Management: YARN / Spark / Mesos
Execution: RDDs, with Scala / Python / Java APIs
Libraries on top: Spark SQL (Shark), Spark Streaming, GraphX, MLlib

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms
» PageRank
» Connected Components
» Triangle Counts

Alpha component
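A hedged sketch of one of the listed algorithms (PageRank) with the Spark 1.x GraphX API; the edge-list path is a placeholder:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // one "srcId dstId" pair per line
val ranks = graph.pageRank(0.001).vertices                          // (vertexId, rank); 0.001 is the convergence tolerance
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)      // top-ranked vertices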

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
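A hedged sketch of one algorithm from the list (k-means) with the RDD-based MLlib API; the input path and its space-separated numeric format are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///data/points.txt")                 // hypothetical numeric input
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()                                                          // k-means is iterative, so cache the data
val model = KMeans.train(points, 3, 20)                             // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)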

MLlib 2.0 (part of Spark 2.0)

https://spark.apache.org/docs/latest/mllib-guide.html

Spark 2 (very new still)

New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Spark 2.0.0 has API-breaking changes

Partly why HW3 uses Spark 1.6 (also, the Cloudera distribution's Spark 2 support is in beta)

More details: https://spark.apache.org/releases/spark-release-2-0-0.html