Spark SQL versus Apache Drill: Different Tools with Different Rules


Upload: hadoop-summit

Post on 09-Jan-2017


Page 1: Spark SQL versus Apache Drill: Different Tools with Different Rules

© 2014 MapR Technologies

Spark SQL versus Apache Drill: Different Tools with Different Rules


Contact Information

Ted Dunning
Chief Applications Architect at MapR Technologies

Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation

Email [email protected] [email protected]

Twitter @ted_dunning


What is Drill?


A query engine that has…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible


Table Can Be an Entire Directory Tree

// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Janpart0001.parquet`
group by errorLevel;

// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
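The idea that one query shape works on a single file or on a whole directory tree can be sketched outside Drill. This is a minimal Python illustration, not Drill's implementation: it assumes newline-delimited JSON files rather than Parquet, and `count_by_error_level`, `write_demo_tree` and the demo file layout are hypothetical names made up for the example.

```python
import json
import os
import tempfile
from collections import Counter


def count_by_error_level(path):
    """Count records per errorLevel in one file or in every file under a directory."""
    if os.path.isdir(path):
        files = [os.path.join(root, name)
                 for root, _dirs, names in os.walk(path)
                 for name in names]
    else:
        files = [path]
    counts = Counter()
    for file_name in files:
        with open(file_name) as handle:
            for line in handle:
                if line.strip():
                    counts[json.loads(line)["errorLevel"]] += 1
    return counts


def write_demo_tree(root):
    """Create a tiny year/month tree of log files (hypothetical layout)."""
    data = {
        "2014/Jan/part0001.json": ["ERROR", "WARN"],
        "2014/Feb/part0001.json": ["ERROR"],
    }
    for rel, levels in data.items():
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w") as handle:
            for level in levels:
                handle.write(json.dumps({"errorLevel": level}) + "\n")


demo_root = tempfile.mkdtemp()
write_demo_tree(demo_root)
# Same function, two scopes: one file, then the whole tree.
one_file = count_by_error_level(os.path.join(demo_root, "2014/Jan/part0001.json"))
whole_tree = count_by_error_level(demo_root)
```

The caller only changes the path, which is the point of the slide: the directory is treated as the table.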


Basic Process

[Diagram: Zookeeper; three Drillbits, each with a Distributed Cache, each over DFS/HBase]

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node


Stages of Query Planning

[Diagram: SQL Query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Query Foreman → plan fragments sent to drill bits]


Query Execution

[Diagram: query execution components
 Endpoints: RPC Endpoint, JDBC Endpoint, ODBC Endpoint
 Parsers: SQL Parser, HiveQL Parser, Pig Parser
 Planning and execution: Logical Plan, Optimizer, Physical Plan, Scheduler, Foreman, Operators
 Storage Interface: HDFS, HBase, Mongo, Cassandra
 Distributed Cache]


Batches of Values

• Value vectors
  – List of values, with same schema
  – With the 4-value semantics for each value
• Shipped around in batches
  – max 256k bytes in a batch
  – max 64K rows in a batch
• RPC designed for multiple replies to a request
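The two batch limits above can be sketched as a simple batching loop. This is an illustration only, not Drill's code: it assumes "256k" and "64K" mean 256 × 1024 bytes and 64 × 1024 rows, and `batches` and its `row_size` callback are hypothetical names.

```python
MAX_BATCH_BYTES = 256 * 1024   # "max 256k bytes in a batch" (assumed to be KiB)
MAX_BATCH_ROWS = 64 * 1024     # "max 64K rows in a batch"


def batches(rows, row_size, max_bytes=MAX_BATCH_BYTES, max_rows=MAX_BATCH_ROWS):
    """Yield lists of rows, closing a batch before it would exceed either limit.

    row_size is a caller-supplied function returning the byte size of a row.
    """
    batch, used = [], 0
    for row in rows:
        size = row_size(row)
        if batch and (used + size > max_bytes or len(batch) >= max_rows):
            yield batch
            batch, used = [], 0
        batch.append(row)
        used += size
    if batch:
        yield batch


# 100,000 eight-byte values: the byte limit is reached before the row limit.
wide = [len(b) for b in batches(range(100_000), row_size=lambda _row: 8)]
# 70,000 one-byte values: the row limit is reached before the byte limit.
narrow = [len(b) for b in batches(range(70_000), row_size=lambda _row: 1)]
```

Whichever limit is reached first closes the batch, so wide rows are bounded by bytes and narrow rows by row count.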


Fixed Value Vectors


Vectorization
• Drill operates on more than one record at a time
  – Word-sized manipulations
  – SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
  – Manually code algorithms
• Logical Vectorization
  – Bitmaps allow lightning fast null-checks
  – Avoid branching to speed CPU pipeline
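The bitmap idea behind fast null-checks can be shown in a few lines. This is a sketch of the general technique, assuming one validity bit per value packed into an integer; it is not Drill's ValueVector layout, and the function names are invented for the example.

```python
def make_validity_bitmap(values):
    """Pack one bit per value into an int: bit i is 1 when values[i] is not None."""
    bitmap = 0
    for i, value in enumerate(values):
        if value is not None:
            bitmap |= 1 << i
    return bitmap


def count_nulls(bitmap, length):
    """Count nulls by counting set bits over the whole word, not value by value."""
    return length - bin(bitmap & ((1 << length) - 1)).count("1")


def both_valid(bitmap_a, bitmap_b):
    """Rows where both columns are non-null: a single AND over the whole bitmap."""
    return bitmap_a & bitmap_b


col_a = [3, None, 7, 7]
col_b = [1, 2, None, 4]
bits_a = make_validity_bitmap(col_a)    # 0b1101
bits_b = make_validity_bitmap(col_b)    # 0b1011
usable_rows = both_valid(bits_a, bits_b)  # 0b1001: rows 0 and 3
```

The null check for a pair of columns becomes one word-sized operation instead of a branch per row.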


Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler

From http://bit.ly/16Xk32x


Drill compiler

[Diagram: CodeModel generates code; Janino compiles runtime byte-code; precompiled byte-code templates; merge byte-code of the two classes; loaded class]


Optimistic

[Chart: Speed vs. check-pointing. Horizontal axis: cmd pipeline, small db, med db, large db, dw, compilation, hadoop. Vertical axis: 0 to 160. Annotations: "No need to checkpoint", "Checkpoint frequently", "Apache Drill"]


Optimistic Execution
• Recovery code trivial
  – Running instances discard the failed query’s intermediate state
• Pipelining possible
  – Send results as soon as batch is large enough
  – Requires barrier-less decomposition of query
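The two points above, trivial recovery and batch-at-a-time pipelining, can be sketched with Python generators. This is an analogy under stated assumptions rather than Drill's execution engine: `scan`, `filter_batches` and `run_query` are hypothetical operators, and "recovery" here simply means dropping partial state and letting the caller rerun.

```python
def scan(rows, batch_size):
    """Source operator: emit fixed-size batches as soon as each one is full."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def filter_batches(batches, predicate):
    """Barrier-less operator: handles one batch at a time, never the whole input."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def run_query(rows, batch_size=4):
    """Run the pipeline; on any failure, discard partial results and re-raise."""
    results = []
    try:
        for batch in filter_batches(scan(rows, batch_size), lambda r: r % 2 == 0):
            results.extend(batch)
    except Exception:
        results.clear()   # no checkpoint to restore: intermediate state is dropped
        raise
    return results


# The first filtered batch is available after reading only the first four rows.
first_batch = next(filter_batches(scan(list(range(10)), 4), lambda r: r % 2 == 0))
all_rows = run_query(list(range(10)))
```

Because no operator waits for its whole input, downstream work starts as soon as the first batch exists.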


Pipelining
• Record batches are pipelined between nodes
  – ~256kB usually
• Unit of work for Drill
  – Operators work on a batch
• Operator reconfiguration happens at batch boundaries

[Diagram: three DrillBits]


Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
  – when data larger than memory

[Diagram: Drill Bit; memory overflow uses disk]


Cost-based Optimization
• Using Optiq, an extensible framework
  – Pluggable rules, and cost model
• Rules for distributed plan generation
  – Insert Exchange operator into physical plan
  – Optiq enhanced to explore parallel query plans
• Pluggable cost model
  – CPU, IO, memory, network cost (data locality)
  – Storage engine features (HDFS vs HIVE vs HBase)

[Diagram: Query Optimizer with pluggable rules and pluggable cost model]
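What "pluggable cost model" means in practice can be sketched as a cost function passed into plan selection. This is a toy illustration, not the Optiq API: the `Plan` fields follow the four cost components named above, while the plan names, numbers and weights are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    cpu: float
    io: float
    memory: float
    network: float


def weighted_cost(weights):
    """Build a cost function from per-resource weights (a hypothetical cost model)."""
    def cost(plan):
        return (weights["cpu"] * plan.cpu + weights["io"] * plan.io
                + weights["memory"] * plan.memory + weights["network"] * plan.network)
    return cost


def choose_plan(candidates, cost):
    """Pick the candidate with the lowest cost under the plugged-in model."""
    return min(candidates, key=cost)


candidates = [
    Plan("broadcast_join", cpu=2.0, io=1.0, memory=4.0, network=8.0),
    Plan("local_hash_join", cpu=3.0, io=2.0, memory=5.0, network=1.0),
]
network_heavy = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 10})
network_free = weighted_cost({"cpu": 1, "io": 1, "memory": 1, "network": 0})
```

Swapping the cost function changes the chosen plan without touching the selection logic, which is the property the slide attributes to the optimizer.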


What is SparkSQL?


What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
  – Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
  – Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable objects
• Embedded in a real language!


In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly


Many Similarities

[Diagram: SQL Parser, Java, Scala, Python; Logical Plan; Optimizer; Physical Plan]


Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed cost models
• Most RDDs don’t have a zero-copy capability
• Spark inherits JVM memory model, very limited use of off-heap


scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
|  a|     b|   c|
+---+------+----+
|  3|[3, 2]| xyz|
|  7|  null| wxy|
|  7|    []|null|
+---+------+----+


scala> sqlContext.sql("select a, explode(b) b_v from json.`bug.json`").show
+---+---------+
|  a|      b_v|
+---+---------+
|  3|        3|
|  3|        2|
+---+---------+
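Why only two rows come back can be reproduced with a small model of `explode`. This is a Python sketch of the semantics, not Spark code, and it assumes `bug.json` holds the same three records shown for `foo.json` (the slides do not say so). Rows whose array is null or empty produce no output rows at all.

```python
def explode(rows, column, output_name):
    """Emit one output row per array element; null or empty arrays emit nothing."""
    for row in rows:
        for element in row.get(column) or []:
            out = {k: v for k, v in row.items() if k != column}
            out[output_name] = element
            yield out


# Assumed contents, copied from the three-row table printed for foo.json.
records = [
    {"a": 3, "b": [3, 2], "c": "xyz"},
    {"a": 7, "b": None, "c": "wxy"},
    {"a": 7, "b": [], "c": None},
]
exploded = [{"a": r["a"], "b_v": r["b_v"]} for r in explode(records, "b", "b_v")]
```

Both rows with `a = 7` disappear, which matches the printed result and is easy to miss when the arrays are optional.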


First Synthesis
• Drill has a more nuanced optimizer, better code generation
  – This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
  – This leads to much less memory pressure
• Drill has much stricter memory life-cycle
  – Query is done and gone, no need for big GCs even on big memory
• Drill is all about SQL execution


But …
• Spark can optimize across entire program
  – This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
  – This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
  – RDDs can be cached, persisted or simply recomputed as necessary
• Spark is not all about SQL execution


The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
  – Strong impersonation semantics
  – Cascading rights via views
  – Queries co-exist in a cluster and reserve only their momentary resource requirements
• Spark focuses heavily on fully integrated execution models
  – Any Spark function works with (almost) any RDD
  – Memory residency of RDDs is the highest goal


Drill security
➢ End to end security from BI tools to Hadoop
➢ Standards-based PAM Authentication
➢ 2 level user Impersonation
➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required


Granular security permissions through Drill views

Raw File (/raw/cards.csv)
Owner: Admins
Permission: Admins

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.view.drill)
Owner: Admins
Permission: Data Scientists
Not a physical data copy

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins
Permission: Business Analysts

Name  City      State
Dave  San Jose  CA
John  Boulder   CO


Ownership Chaining
Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv): John (Owner)

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)

Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)

Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Jack queries the view V_Analyst
1. Does Jack have access to V_Analyst? -> YES
2. Who is the owner of V_Analyst? -> Jane
3. Drill accesses V_Analyst as Jane (Impersonation hop 1)
4. Does Jane have access to V_Scientist? -> YES
5. Who is the owner of V_Scientist? -> John
6. Drill accesses V_Scientist as John (Impersonation hop 2)
7. Does John have permissions on raw file? -> YES
8. Who is the owner of raw file? -> John
9. Drill accesses source file as John (no impersonation here)

*Ownership chain length (# hops) is configurable

[Diagram legend: ownership chaining (RAW FILE, V_Scientist, V_Analyst); access path]
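The walk from Jack's query down to the raw file can be sketched as a loop over an ownership catalog. This is an illustration of the chain described on the slide, not Drill's implementation: `CATALOG`, `resolve_access` and the `max_hops` parameter are hypothetical, with `max_hops` standing in for the configurable chain length.

```python
# Hypothetical catalog mirroring the slide: each object has an owner, readers, a source.
CATALOG = {
    "V_Analyst": {"owner": "Jane", "readers": {"Jack"}, "source": "V_Scientist"},
    "V_Scientist": {"owner": "John", "readers": {"Jane"}, "source": "/raw/cards.csv"},
    "/raw/cards.csv": {"owner": "John", "readers": set(), "source": None},
}


def resolve_access(user, target, max_hops=2):
    """Walk the ownership chain; return (object, user whose permission was checked)."""
    steps, hops, acting = [], 0, user
    while target is not None:
        entry = CATALOG[target]
        if acting != entry["owner"] and acting not in entry["readers"]:
            raise PermissionError(f"{acting} cannot read {target}")
        steps.append((target, acting))
        source = entry["source"]
        if source is not None and acting != entry["owner"]:
            hops += 1   # the underlying object is accessed as this object's owner
            if hops > max_hops:
                raise PermissionError("ownership chain too long")
            acting = entry["owner"]
        target = source
    return steps


def is_denied(user, target, max_hops=2):
    """Convenience wrapper: True when the walk ends in a PermissionError."""
    try:
        resolve_access(user, target, max_hops)
    except PermissionError:
        return True
    return False


jack_path = resolve_access("Jack", "V_Analyst")
```

Jack never needs rights on the raw file; each hop switches to the owner of the view just passed, and a shorter hop limit cuts the chain off.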


But was that the right question?


Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
  – compare to SqlContext
• Define Datasets as Drill data sources and sinks
  – Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small


What does the Spark and Drill integration look like?

Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines


Is unification valuable?


Example of Unification


Simple Session Protocol
• Calls started at random intervals
• During calls, reconnection is done periodically
• Many log events are buffered and sent to current tower during active state


The Resulting Data
• Signal strength reports
  – Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another due to caller buffering of diagnostic data

*Location isn’t quite location … poetic license applied for


What can we do with it?


Baby Steps

• What does signal propagation look like?

select x, y, signal from cdr_stream where tower = 3

• Plot results to get a map of signal strength around a tower


Baby Steps

• What does tower coverage look like?

select x, y from cdr_stream where tower = 3 and event_type = 'CONNECT'

• Plot results to get a map of coverage area for a tower


What about anomaly detection?


Detecting Tower Loss

It’s important to know if traffic is stopped or delayed because of a problem…

But events from towers come at irregular intervals

How long after the last event should you begin to worry?


Event Stream (timing)
• Events of various types arrive at irregular intervals
  – we can assume Poisson distribution
• The key question is whether frequency has changed relative to expected values
  – This shows up as a change in interval
• Want alert as soon as possible
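Under the Poisson assumption, "how long should I wait before worrying" has a closed form: gaps between events are exponentially distributed, so the probability of seeing no event for time t at rate λ is exp(-λt). The sketch below turns a percentile into a waiting-time threshold. The rate of two events per minute is an invented example value, and the function names are hypothetical.

```python
import math


def gap_tail_probability(elapsed, rate):
    """P(no event for at least `elapsed`) when events arrive as a Poisson process."""
    return math.exp(-rate * elapsed)


def alert_threshold(rate, percentile):
    """Gap length that a healthy source exceeds with probability 1 - percentile."""
    return -math.log(1.0 - percentile) / rate


rate = 2.0                               # assumed: two events per minute on average
t_999 = alert_threshold(rate, 0.999)     # roughly 3.45 minutes
t_9999 = alert_threshold(rate, 0.9999)   # roughly 4.61 minutes
```

Tightening the percentile from 99.9 to 99.99 only lengthens the wait by about a third, which is why an alert can fire soon after traffic stops while still being rare on a healthy source.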


Converting Event Times to Anomaly

[Chart: 99.9%-ile and 99.99%-ile levels marked]


But in the real world, event rates often change


Time Intervals Are Key to Modeling Sporadic Events


Time Intervals Are Key to Modeling Sporadic Events


After Rate Correction


Detecting Anomalies in Sporadic Events


Propagation Anomalies
• What happens when something shadows part of the coverage field?
  – Can happen in urban areas with a construction crane
• Can solve heuristically
  – Subtract from reference image composed by long term averages
  – Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
  – Compute anomaly for each measurement, use mean of log(p)


Variable Signal/Noise Makes Heuristic Tricky

Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.


Other Issues
• Finding anomalies in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
  – Rush hour
  – Event mobs


Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
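Combining per-cluster anomalies by summing log(p) can be sketched as follows. The sign convention (summing -log p so that larger means more surprising), the probability floor, and the example probabilities are all assumptions made for illustration; the slides only state that the overall anomaly is a sum of log(p) terms.

```python
import math


def anomaly_score(p_values, floor=1e-12):
    """Overall anomaly as the sum of -log(p); larger means more surprising.

    The floor keeps a probability of exactly zero from producing an infinite score.
    """
    return sum(-math.log(max(p, floor)) for p in p_values)


# Invented per-cluster probabilities for one tower.
normal_day = [0.5, 0.8, 0.6]
shadowed_day = [0.5, 0.001, 0.002]   # two clusters report very unlikely values

normal_score = anomaly_score(normal_day)
shadowed_score = anomaly_score(shadowed_day)
```

Working in log space lets many mildly unusual clusters add up, and a few very unlikely ones dominate, without any single subtraction against an average signal level.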


Coverage Areas


Just One Tower


Cluster Reports for That Tower


Cluster Reports for That Tower


General Dataflow


Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
  – But important distinctions remain
• Projects can work together to share key technology
  – Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark
• Systems can work together even more deeply
  – DrillContext makes integration first class


e-book available courtesy of MapR

http://bit.ly/1jQ9QuL

A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)


Thank you for coming today!


…helping you put data technology to work

● Find answers

● Ask technical questions

● Join on-demand training course discussions

● Follow release announcements

● Share and vote on product ideas

● Find Meetup and event listings

Connect with fellow Apache Hadoop and Spark professionals

community.mapr.com