Spark SQL Deep Dive @ Melbourne Spark Meetup


Page 1: Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark SQL Deep Dive

Michael Armbrust
Melbourne Spark Meetup – June 1st, 2015

Page 2: Spark SQL Deep Dive @ Melbourne Spark Meetup

What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop and included in all major distros.

Improves efficiency through:
>  In-memory computing primitives
>  General computation graphs

Improves usability through:
>  Rich APIs in Scala, Java, Python
>  Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 3: Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark Model

Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs):

>  Collections of objects that can be stored in memory or on disk across a cluster
>  Parallel functional transformations (map, filter, …)
>  Automatically rebuilt on failure

Page 4: Spark SQL Deep Dive @ Melbourne Spark Meetup

More than Map & Reduce

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...
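For a rough feel of a few of these operations chained together (a minimal sketch, not from the slides; the data, names, and spark-shell context are assumptions):

    // Sketch: a few RDD operations beyond map/reduce, in the spark-shell (Spark 1.3+).
    val clicks = sc.parallelize(Seq(("alice", 3), ("bob", 1), ("alice", 2)))   // toy pair RDDs
    val users  = sc.parallelize(Seq(("alice", "AU"), ("bob", "NZ")))

    val totals = clicks.reduceByKey(_ + _)              // ("alice", 5), ("bob", 1)
    val joined = totals.join(users)                      // ("alice", (5, "AU")), ("bob", (1, "NZ"))
    val sampled = joined.sample(false, 0.5, 42)          // random subset, fixed seed

    joined.count()                                       // action: 2
    joined.take(1)                                       // action: first element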

Page 5: Spark SQL Deep Dive @ Melbourne Spark Meetup

On-Disk Sort Record: Time to sort 100 TB

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 6: Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE: Tencent (1 PB+/day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1 PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1 PB)
MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
LARGEST CLUSTER: Tencent (8000+ nodes)

Based on Reynold Xin’s personal knowledge

Page 7: Spark SQL Deep Dive @ Melbourne Spark Meetup

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")             // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
    val messages = errors.map(_.split("\t")(2))
    messages.cache()

    messages.filter(_.contains("foo")).count()           // Action
    messages.filter(_.contains("bar")).count()
    . . .

[Diagram: the Driver ships tasks to three Workers; each Worker reads its HDFS block of lines (Block 1–3), caches its partition of messages (Cache 1–3), and returns results.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 8: Spark SQL Deep Dive @ Melbourne Spark Meetup

A General Stack

Spark core, with specialized libraries on top:

>  Spark Streaming (real-time)
>  Spark SQL
>  GraphX (graph)
>  MLlib (machine learning)
>  …

This talk focuses on Spark SQL.

Page 9: Spark SQL Deep Dive @ Melbourne Spark Meetup

[Bar chart: non-test, non-example source lines (0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark.]

Powerful Stack – Agile Development

Page 10: Spark SQL Deep Dive @ Melbourne Spark Meetup

[Same bar chart, now breaking out the Streaming portion of the Spark code base.]

Powerful Stack – Agile Development

Page 11: Spark SQL Deep Dive @ Melbourne Spark Meetup

[Same bar chart, now breaking out the SparkSQL and Streaming portions of the Spark code base.]

Powerful Stack – Agile Development

Page 12: Spark SQL Deep Dive @ Melbourne Spark Meetup

Powerful Stack – Agile Development

[Same bar chart, now breaking out the GraphX, Streaming, and SparkSQL portions of the Spark code base.]

Page 13: Spark SQL Deep Dive @ Melbourne Spark Meetup

Powerful Stack – Agile Development

[Same bar chart, with GraphX, Streaming, SparkSQL, and "Your App?" called out.]

Page 14: Spark SQL Deep Dive @ Melbourne Spark Meetup

About SQL

Spark SQL:
>  Part of the core distribution since Spark 1.0 (April 2014)

[Charts: # of commits per month and # of contributors, by month from 2014-03 through 2015-06.]

Page 15: Spark SQL Deep Dive @ Melbourne Spark Meetup

SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)

Spark SQL:
>  Part of the core distribution since Spark 1.0 (April 2014)
>  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes

About SQL

Page 16: Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark SQL:
>  Part of the core distribution since Spark 1.0 (April 2014)
>  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
>  Connect existing BI tools to Spark through JDBC

About SQL

Page 17: Spark SQL Deep Dive @ Melbourne Spark Meetup

Spark SQL:
>  Part of the core distribution since Spark 1.0 (April 2014)
>  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
>  Connect existing BI tools to Spark through JDBC
>  Bindings in Python, Scala, and Java

About SQL

Page 18: Spark SQL Deep Dive @ Melbourne Spark Meetup

The not-so-secret truth…

SQL is not about SQL.

Page 19: Spark SQL Deep Dive @ Melbourne Spark Meetup

SQL: The whole story

Create and run Spark programs faster:
>  Write less code
>  Read less data
>  Let the optimizer do the hard work

Page 20: Spark SQL Deep Dive @ Melbourne Spark Meetup

DataFrame  noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
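As a quick illustration of "rows organized into named columns" (a minimal sketch, not from the slides; the Person case class and its data are assumptions, run in a Spark 1.3+ spark-shell):

    // Sketch: build a DataFrame from a hypothetical case class and use its named columns.
    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val df = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 29))).toDF()

    df.printSchema()                  // name: string, age: int
    df.filter(df("age") > 30).show()  // select rows by a named column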

Page 21: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Input & Output

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Diagram: built-in and external data sources, including { JSON }, JDBC, and more…]

Page 22: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

    df = sqlContext.read \
        .format("json") \
        .option("samplingRatio", "0.1") \
        .load("/home/michael/data.json")

    df.write \
        .format("parquet") \
        .mode("append") \
        .partitionBy("year") \
        .saveAsTable("fasterData")

Page 23: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Input & Output

(Same snippet as the previous slide.) Callout: the read and write functions create new builders for doing I/O.

Page 24: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Input & Output

(Same snippet.) Callout: builder methods are used to specify the format, partitioning, handling of existing data, and more.

Page 25: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Input & Output

(Same snippet.) Callout: calling load(…), save(…), or saveAsTable(…) completes the builder and triggers the actual I/O.
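For completeness, the same builder pattern looks essentially identical from Scala. This is a sketch, not from the slides, assuming the Spark 1.4-era DataFrameReader/DataFrameWriter API and reusing the paths and table name from the Python example:

    // Sketch: the read/write builders from Scala (Spark 1.4-style API).
    val df = sqlContext.read
      .format("json")
      .option("samplingRatio", "0.1")
      .load("/home/michael/data.json")

    df.write
      .format("parquet")
      .mode("append")
      .partitionBy("year")
      .saveAsTable("fasterData")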

Page 26: Spark SQL Deep Dive @ Melbourne Spark Meetup

ETL Using Custom Data Sources

    sqlContext.read
        .format("com.databricks.spark.git")
        .option("url", "https://github.com/apache/spark.git")
        .option("numPartitions", "100")
        .option("branches", "master,branch-1.3,branch-1.2")
        .load()
        .repartition(1)
        .write
        .format("json")
        .save("/home/michael/spark.json")

Page 27: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Powerful Operations

Common operations can be expressed concisely as calls to the DataFrame API:
•  Selecting required columns
•  Joining different data sources
•  Aggregation (count, sum, average, etc.)
•  Filtering
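For instance, a minimal sketch combining all four (not from the slides; the users and events tables and their user_id, duration, and country columns are hypothetical):

    // Sketch: select, join, aggregate, and filter with the DataFrame API (Spark 1.3+).
    import org.apache.spark.sql.functions.avg

    val users  = sqlContext.table("users")     // hypothetical tables
    val events = sqlContext.table("events")

    events
      .join(users, events("user_id") === users("user_id"))   // join two data sources
      .filter(events("duration") > 0)                         // filter rows
      .groupBy(users("country"))                              // aggregate per group
      .agg(avg(events("duration")).as("avg_duration"))
      .select("country", "avg_duration")                      // keep required columns
      .show()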


Page 28: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Compute an Average

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().split("\t");
        output.set(Integer.parseInt(fields[1]));
        context.write(one, output);
    }

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        average.set(sum / (double) count);
        context.write(key, average);
    }

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Page 29: Spark SQL Deep Dive @ Melbourne Spark Meetup

Write Less Code: Compute an Average

Using RDDs  

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using DataFrames  

    sqlCtx.table("people") \
        .groupBy("name") \
        .agg(avg("age")) \
        .collect()

Using SQL  

SELECT name, avg(age) FROM people GROUP BY name

Page 30: Spark SQL Deep Dive @ Melbourne Spark Meetup

Not Just Less Code: Faster Implementations

[Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]

Page 31: Spark SQL Deep Dive @ Melbourne Spark Meetup


Demo: DataFrames. Using Spark SQL to read, write, slice and dice your data with simple functions.

Page 32: Spark SQL Deep Dive @ Melbourne Spark Meetup

Read Less Data

Spark SQL can help you read less data automatically:

•  Converting to more efficient formats
•  Using columnar formats (e.g., Parquet)
•  Using partitioning (e.g., /year=2014/month=02/…)
•  Skipping data using statistics (e.g., min, max)
•  Pushing predicates into storage systems (e.g., JDBC)
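A hedged sketch of the partitioning point (not from the slides; the events DataFrame, its year/month columns, and the output path are assumptions): data written with partitionBy lands in /year=…/month=… directories, so a filter on those columns only reads the matching directories.

    // Sketch: write partitioned Parquet, then let a filter prune partitions at read time.
    events.write
      .format("parquet")
      .partitionBy("year", "month")          // produces /year=2014/month=02/... directories
      .save("/data/events_parquet")

    val feb2014 = sqlContext.read
      .format("parquet")
      .load("/data/events_parquet")
      .filter("year = 2014 AND month = 2")   // only the matching partitions are scanned

    feb2014.count()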

Page 33: Spark SQL Deep Dive @ Melbourne Spark Meetup

Optimization happens as late as possible, so Spark SQL can optimize across functions.

Page 34: Spark SQL Deep Dive @ Melbourne Spark Meetup

    def add_demographics(events):
        u = sqlCtx.table("users")                      # Load Hive table
        return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # UDF adds city column

    events = add_demographics(sqlCtx.load("/data/events", "json"))

    training_data = events.where(events.city == "Palo Alto") \
                          .select(events.timestamp).collect()

Logical Plan:
    filter
        join                      (expensive)
            events file, users table

Physical Plan:
    join
        scan (events)
        filter                    (only join relevant users)
            scan (users)

Page 35: Spark SQL Deep Dive @ Melbourne Spark Meetup

    def add_demographics(events):
        u = sqlCtx.table("users")                      # Load partitioned Hive table
        return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # Run UDF to add city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))

    training_data = events.where(events.city == "Palo Alto") \
                          .select(events.timestamp).collect()

Logical Plan:
    filter
        join
            events file, users table

Physical Plan:
    join
        scan (events)
        filter
            scan (users)

Physical Plan with Predicate Pushdown and Column Pruning:
    join
        optimized scan (events)
        optimized scan (users)

Page 36: Spark SQL Deep Dive @ Melbourne Spark Meetup

Machine Learning Pipelines

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    df = sqlCtx.load("/path/to/data")
    model = pipeline.fit(df)

[Pipeline diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; fitting the Pipeline produces a Pipeline Model containing lr.model.]

Page 37: Spark SQL Deep Dive @ Melbourne Spark Meetup


So how does it all work?

Page 38: Spark SQL Deep Dive @ Melbourne Spark Meetup

Plan Optimization & Execution

SQL AST or DataFrame
    → Unresolved Logical Plan
    → (Analysis, using the Catalog)
    → Logical Plan
    → (Logical Optimization)
    → Optimized Logical Plan
    → (Physical Planning)
    → Physical Plans
    → (Cost Model selects one)
    → Selected Physical Plan
    → (Code Generation)
    → RDDs

DataFrames and SQL share the same optimization/execution pipeline.
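To see these phases for a concrete query, one option (not on the slide) is DataFrame.explain; a small sketch assuming a hypothetical people table:

    // Sketch: inspect the plans Catalyst produces for a query (Spark 1.3+).
    val people = sqlContext.table("people")                 // hypothetical table
    val q = people.filter(people("id") === 1).select("name")

    q.explain(true)   // prints the parsed, analyzed, and optimized logical plans and the physical plan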

Page 39: Spark SQL Deep Dive @ Melbourne Spark Meetup

An Example Query

SELECT name
FROM (
    SELECT id, name
    FROM People) p
WHERE p.id = 1

Logical Plan:
    Project name
        Filter id = 1
            Project id, name
                People

Page 40: Spark SQL Deep Dive @ Melbourne Spark Meetup

Naïve Query Planning

SELECT name
FROM (
    SELECT id, name
    FROM People) p
WHERE p.id = 1

Logical Plan:
    Project name
        Filter id = 1
            Project id, name
                People

Physical Plan:
    Project name
        Filter id = 1
            Project id, name
                TableScan People

Page 41: Spark SQL Deep Dive @ Melbourne Spark Meetup

Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Logical Plan:
    Project name
        Filter id = 1
            Project id, name
                People

Physical Plan:
    IndexLookup id = 1, return: name

Instead, write simple rules:
•  Each rule makes one change
•  Run many rules together to a fixed point

Page 42: Spark SQL Deep Dive @ Melbourne Spark Meetup

Prior Work: Optimizer Generators

Volcano / Cascades:
•  Create a custom language for expressing rules that rewrite trees of relational operators.
•  Build a compiler that generates executable code for these rules.

Cons: developers need to learn this custom language, and the language might not be powerful enough.

Page 43: Spark SQL Deep Dive @ Melbourne Spark Meetup

TreeNode Library

Easily transformable trees of operators:
•  Standard collection functionality – foreach, map, collect, etc.
•  transform function – recursive modification of tree fragments that match a pattern.
•  Debugging support – pretty printing, splicing, etc.

Page 44: Spark SQL Deep Dive @ Melbourne Spark Meetup

Tree Transformations

Developers express tree transformations as PartialFunction[TreeType, TreeType]:
1.  If the function applies to an operator, that operator is replaced with the result.
2.  When the function does not apply to an operator, that operator is left unchanged.
3.  The transformation is applied recursively to all children.
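To make the mechanics concrete, a toy sketch (these are not Catalyst's real classes; Node, Project, Filter, and Table here are illustrative stand-ins): a tiny tree whose transform applies a partial function where it matches and then recurses into the children.

    // Toy sketch of the TreeNode idea (not Catalyst's actual implementation).
    sealed trait Node {
      def children: Seq[Node]
      def withChildren(newChildren: Seq[Node]): Node
      // Apply the rule at this node if it matches, then recurse into the children.
      def transform(rule: PartialFunction[Node, Node]): Node = {
        val afterRule = if (rule.isDefinedAt(this)) rule(this) else this
        afterRule.withChildren(afterRule.children.map(_.transform(rule)))
      }
    }
    case class Project(columns: Seq[String], child: Node) extends Node {
      def children = Seq(child)
      def withChildren(c: Seq[Node]) = copy(child = c.head)
    }
    case class Filter(condition: String, child: Node) extends Node {
      def children = Seq(child)
      def withChildren(c: Seq[Node]) = copy(child = c.head)
    }
    case class Table(name: String) extends Node {
      def children = Nil
      def withChildren(c: Seq[Node]) = this
    }

    // A toy "push Filter below Project" rule (ignores the attribute check):
    val plan   = Filter("id = 1", Project(Seq("id", "name"), Table("People")))
    val pushed = plan.transform {
      case Filter(cond, Project(cols, child)) => Project(cols, Filter(cond, child))
    }
    // pushed == Project(List(id, name), Filter(id = 1, Table(People)))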

Page 45: Spark SQL Deep Dive @ Melbourne Spark Meetup

Writing Rules as Tree Transformations

1.  Find filters on top of projections.
2.  Check that the filter can be evaluated without the result of the project.
3.  If so, switch the operators.

Original Plan:
    Project name
        Filter id = 1
            Project id, name
                People

Filter Push-Down:
    Project name
        Project id, name
            Filter id = 1
                People

Page 46: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

    val newPlan = queryPlan transform {
      case f @ Filter(_, p @ Project(_, grandChild))
          if (f.references subsetOf grandChild.output) =>
        p.copy(child = f.copy(child = grandChild))
    }

Page 47: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule as the previous slide.) Callouts: the { case … => … } block is a Partial Function; queryPlan is the Tree it is applied to.

Page 48: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: find Filter on Project.

Page 49: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: check that the filter can be evaluated without the result of the project.

Page 50: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: if so, switch the order.

Page 51: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: Scala pattern matching.

Page 52: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: Catalyst attribute reference tracking.

Page 53: Spark SQL Deep Dive @ Melbourne Spark Meetup

Filter Push Down Transformation  

(Same rule.) Callout: Scala copy constructors.

Page 54: Spark SQL Deep Dive @ Melbourne Spark Meetup

Optimizing with Rules

Original Plan:
    Project name
        Filter id = 1
            Project id, name
                People

Filter Push-Down:
    Project name
        Project id, name
            Filter id = 1
                People

Combine Projection:
    Project name
        Filter id = 1
            People

Physical Plan:
    IndexLookup id = 1, return: name

Page 55: Spark SQL Deep Dive @ Melbourne Spark Meetup

Future Work – Project Tungsten

Consider “abcd”: 4 bytes with UTF-8 encoding, but as a java.lang.String:

    java.lang.String object internals:
     OFFSET  SIZE    TYPE    DESCRIPTION        VALUE
          0     4            (object header)    ...
          4     4            (object header)    ...
          8     4            (object header)    ...
         12     4    char[]  String.value       []
         16     4    int     String.hash        0
         20     4    int     String.hash32      0
    Instance size: 24 bytes (reported by Instrumentation API)

Page 56: Spark SQL Deep Dive @ Melbourne Spark Meetup

Project Tungsten

Overcome JVM limitations:

•  Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
•  Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
•  Code generation: using code generation to exploit modern compilers and CPUs

Page 57: Spark SQL Deep Dive @ Melbourne Spark Meetup

Questions?

Learn more at: http://spark.apache.org/docs/latest/
Get involved: https://github.com/apache/spark