Transcript
Page 1: New Analytics Toolbox

The New Analytics Toolbox

Going beyond Hadoop

Page 2: New Analytics Toolbox

whoami

Robbie StricklandSoftware Dev Managerlinkedin.com/in/robbiestrickland@[email protected]

Page 3: New Analytics Toolbox

Hadoop works fine, right?

Page 4: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)

Page 5: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient

Page 6: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk

Page 7: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks

Page 8: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks

Page 9: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate

Page 10: New Analytics Toolbox

Hadoop works fine, right?

● It’s 11 years old (!!)● It’s relatively slow and inefficient● … because everything gets written to disk● Writing mapreduce code sucks● Lots of code for even simple tasks● Boilerplate● Configuration isn’t for the faint of heart

Page 11: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Page 12: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

MPP queries for Hadoop data

Page 13: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Varies by vendor

Page 14: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark

Generic in-memory analysis

Page 15: New Analytics Toolbox

Landscape has changed ...

Lots of new tools available:● Cloudera Impala● Apache Drill● Proprietary (Splunk, keen.io, etc.)● Spark / Spark SQL● Shark Hive queries on Spark, replaced by Spark SQL

Page 16: New Analytics Toolbox

Spark wins?

Page 17: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement

Page 18: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat

Page 19: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop

Page 20: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis

Page 21: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model

Page 22: New Analytics Toolbox

Spark wins?

● Can be a Hadoop replacement● Works with any existing InputFormat● Doesn’t require Hadoop● Supports batch & streaming analysis● Functional programming model● Direct Cassandra integration

Page 23: New Analytics Toolbox

Spark vs. Hadoop

Page 24: New Analytics Toolbox

MapReduce:

Page 25: New Analytics Toolbox

MapReduce:

Page 26: New Analytics Toolbox

MapReduce:

Page 27: New Analytics Toolbox

MapReduce:

Page 28: New Analytics Toolbox

MapReduce:

Page 29: New Analytics Toolbox

MapReduce:

Page 30: New Analytics Toolbox

Spark (angels sing!):

Page 31: New Analytics Toolbox

What is Spark?

Page 32: New Analytics Toolbox

What is Spark?

● In-memory cluster computing

Page 33: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce

Page 34: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets

Page 35: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java

Page 36: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing

Page 37: New Analytics Toolbox

What is Spark?

● In-memory cluster computing● 10-100x faster than MapReduce● Collection API over large datasets● Scala, Python, Java● Stream processing● Supports any existing Hadoop input / output

format

Page 38: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX

Page 39: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib

Page 40: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL

Page 41: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR

Page 42: New Analytics Toolbox

What is Spark?

● Native graph processing via GraphX● Native machine learning via MLlib● SQL queries via SparkSQL● Works out of the box on EMR● Easily join datasets from disparate sources

Page 43: New Analytics Toolbox

Spark on Cassandra

Page 44: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

Page 45: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft

Page 46: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)

Page 47: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware

Page 48: New Analytics Toolbox

Spark on Cassandra

● Direct integration via DataStax driver - cassandra-driver-spark on github

● No job config cruft● Supports server-side filters (where clauses)● Data locality aware● Uses HDFS, CassandraFS, or other

distributed FS for checkpointing

Page 49: New Analytics Toolbox

Cassandra with Hadoop

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

CassandraDataNode

TaskTracker

NameNode2ndaryNN

JobTracker

Page 50: New Analytics Toolbox

Cassandra with Spark (using HDFS)

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

CassandraSpark Worker

DataNode

Spark MasterNameNode2ndaryNN

Page 51: New Analytics Toolbox

A Typical Spark Application

Page 52: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

Page 53: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

Page 54: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

Page 55: New Analytics Toolbox

A Typical Spark Application● SparkContext + SparkConf

● Data Source to RDD[T]

● Transformations/Actions

● Saving/Displaying

Page 56: New Analytics Toolbox

Resilient Distributed Dataset (RDD)

Page 57: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

Page 58: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

Page 59: New Analytics Toolbox

Resilient Distributed Dataset (RDD)● A distributed collection of items

● Transformations

○ Similar to those found in scala collections

○ Lazily processed

● Can recalculate from any point of failure

Page 60: New Analytics Toolbox

RDD Transformations vs Actions

Transformations: Produce new RDDs

Actions: Require the materialization of the records to produce a value

Page 61: New Analytics Toolbox

RDD Transformations/Actions

Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...

Actions: collect:Array[T], count, fold, reduce ...

Page 62: New Analytics Toolbox

Resilient Distributed Dataset (RDD)

val numOfAdults = persons.filter(_.age > 17).count()

Transformation Action

Page 63: New Analytics Toolbox

Example

Page 64: New Analytics Toolbox

Demo

Page 65: New Analytics Toolbox

Spark SQL Example

Page 66: New Analytics Toolbox

Spark Streaming

Page 67: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

Page 68: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs

Page 69: New Analytics Toolbox

Spark Streaming

● Creates RDDs from stream source on a defined interval

● Same ops as “normal” RDDs● Supports a variety of sources

○ Queues○ Twitter○ Flume○ etc.

Page 70: New Analytics Toolbox

Spark Streaming

Page 71: New Analytics Toolbox

The Output

Page 72: New Analytics Toolbox

Real World Task Distribution

Page 73: New Analytics Toolbox

Real World Task Distribution

Transformations and Actions:

Similar to Scala but …

Your choice of transformations need be made with task distribution and memory in mind

Page 74: New Analytics Toolbox

Real World Task Distribution

How do we do that?

● Partition the data appropriately for number of cores (at least 2 x number of cores)

● Filter early and often (Queries, Filters, Distinct...)● Use pipelines

Page 75: New Analytics Toolbox

Real World Task Distribution

How do we do that?

● Partial Aggregation● Create an algorithm to be as simple/efficient as it can be

to appropriately answer the question

Page 76: New Analytics Toolbox

Real World Task Distribution

Page 77: New Analytics Toolbox

Real World Task Distribution

Some Common Costly Transformations:

● sorting● groupByKey● reduceByKey● ...

Page 78: New Analytics Toolbox

Thank you!

Robbie Strickland@[email protected]://github.com/rstrickland/sparkdemo


Top Related