Introduction to Apache Spark

Introduction to Spark Eric Eijkelenboom - UserReport - userreport.com

Upload: userreport

Post on 26-Jan-2015


DESCRIPTION

Intro

TRANSCRIPT

Page 1: Introduction to apache spark

Introduction to Spark

Eric Eijkelenboom - UserReport - userreport.com

Page 2: Introduction to apache spark

• What is Spark and why should I care?

• Architecture and programming model

• Examples

• Mini demo

• Related projects

Page 3: Introduction to apache spark
Page 4: Introduction to apache spark

RTFM

• A general-purpose computation framework that leverages distributed memory

• More flexible than MapReduce (it supports general execution graphs)

• Linear scalability and fault tolerance

• It supports a rich set of higher-level tools including

• Shark (Hive on Spark) and Spark SQL

• MLlib for machine learning

• GraphX for graph processing

• Spark Streaming

Page 5: Introduction to apache spark

Who cares?

Page 6: Introduction to apache spark


Limitations of MapReduce

• Slow due to serialisation & replication

• Inefficient for iterative computing & interactive querying

[Diagram: each iteration (iter. 1, iter. 2, …) of a Map/Reduce job reads its input from HDFS and writes its output back to HDFS]

Page 7: Introduction to apache spark

Leveraging memory

[Diagram: the iterative flow again — input, iter. 1, iter. 2, …, with an HDFS read and an HDFS write around every iteration]

Page 8: Introduction to apache spark

Leveraging memory

[Diagram: the same flow, contrasted with Spark passing data between iterations in memory, with a single HDFS read at the start]

Page 9: Introduction to apache spark

Leveraging memory

[Same diagram as the previous slide]

Not tied to the 2-stage MapReduce paradigm:

1. Extract a working set
2. Cache it
3. Query it repeatedly

Page 10: Introduction to apache spark

So, Spark is…

• In-memory analytics, many times faster than Hadoop/Hive

• Designed for running iterative algorithms & interactive querying

• Highly compatible with Hadoop’s Storage APIs

• Can run on your existing Hadoop Cluster Setup

• Programming in Scala, Python or Java

Page 11: Introduction to apache spark

Spark stack

Page 12: Introduction to apache spark

Architecture

[Diagram: a Spark Driver (Master) connects to a Cluster Manager, which schedules Spark Workers; each worker keeps an in-memory Cache and is co-located with an HDFS Datanode holding data Blocks]

Page 13: Introduction to apache spark

Architecture

[Same diagram as the previous slide]

The Cluster Manager can be:

• YARN

• Mesos

• Standalone

Page 14: Introduction to apache spark

Programming model

• Resilient Distributed Datasets (RDDs) are the basic building blocks

• Distributed collection of objects, cached in-memory across cluster nodes

• Automatically rebuilt on failure

• RDD operations

• Transformations: create new RDDs from existing ones

• Actions: return a value to the master node after running a computation on the dataset
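The distinction can be sketched as follows (assuming a SparkContext named `sc` is already available, as in the Spark shell):

```scala
// Transformations are lazy: they only describe a new RDD.
val nums    = sc.parallelize(1 to 1000)      // distributed collection
val squares = nums.map(n => n * n)           // transformation: no work happens yet
val evens   = squares.filter(_ % 2 == 0)     // transformation: still no work

// Actions trigger the computation and return a value to the driver.
val total = evens.count()                    // action: runs on the cluster
```

Because transformations are lazy, Spark can see the whole execution graph before scheduling any work.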

Page 15: Introduction to apache spark

As you know…

• … Hadoop is a distributed system for counting words

• Here is how it's done in Spark


Blue code: Spark operations
Red code: functions (closures) that get passed to the cluster automatically
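The word-count code on the slide is an image; the canonical Spark version has this shape (a sketch — the HDFS paths are placeholders):

```scala
val file   = sc.textFile("hdfs://...")                 // Spark operation
val counts = file.flatMap(line => line.split(" "))     // closure shipped to the cluster
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)                   // Spark operation
counts.saveAsTextFile("hdfs://...")
```

Compare this to the Mapper/Reducer classes and job wiring the same program needs in Hadoop MapReduce.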

Page 17: Introduction to apache spark

Text search

Page 18: Introduction to apache spark

Text search

In-memory text search: caches the RDD in memory for faster reuse
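A sketch of the cached text-search pattern (the input path and search terms are illustrative):

```scala
val lines  = sc.textFile("hdfs://.../logs")     // placeholder input path
val errors = lines.filter(_.contains("ERROR"))
errors.cache()                                   // keep the working set in memory

// Repeated queries now hit the in-memory copy instead of re-reading HDFS.
val mysqlErrors = errors.filter(_.contains("MySQL")).count()
val phpErrors   = errors.filter(_.contains("PHP")).count()
```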

Page 19: Introduction to apache spark

Logistic regression


• 100 GB of data on a 100 node cluster
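The slide's code is a screenshot; the classic Spark logistic-regression example is roughly this shape (a sketch — `sc`, the input path, the feature dimension, and the `parsePoint` helper are assumptions):

```scala
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

val points = sc.textFile("hdfs://.../points")
               .map(parsePoint)     // hypothetical line -> Point parser
               .cache()             // working set stays in memory across iterations

val dims = 10                       // number of features (assumption)
var w = Array.fill(dims)(0.0)       // initial weights

for (_ <- 1 to 100) {               // every iteration reuses the cached RDD
  val gradient = points.map { p =>
    val s = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * s)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
```

Only the first iteration touches HDFS; the rest run against memory — which is why the 100 GB / 100-node benchmark favours Spark so heavily.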

Page 20: Introduction to apache spark

Easy unit testing
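Tests can run Spark in local mode, so no cluster is needed — a sketch (the ScalaTest style is an assumption, not from the slides):

```scala
import org.apache.spark.SparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("counts words") {
    val sc = new SparkContext("local", "test")   // local mode: runs in-process
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
                     .flatMap(_.split(" "))
                     .map((_, 1))
                     .reduceByKey(_ + _)
                     .collectAsMap()
      assert(counts("a") == 2)
    } finally {
      sc.stop()                                  // always release the context
    }
  }
}
```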

Page 21: Introduction to apache spark

Spark shell
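The shell (launched with `./bin/spark-shell`) is a Scala REPL with a SparkContext `sc` pre-created; a typical session looks like:

```scala
scala> val lines = sc.textFile("README.md")
scala> lines.count()
scala> lines.filter(_.contains("Spark")).count()
```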

Page 22: Introduction to apache spark

Mini demo

Page 23: Introduction to apache spark
Page 24: Introduction to apache spark

Hive on Spark = Shark

• A large-scale data warehouse system, just like Hive

• Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs)

• Built on top of Spark (thus a faster execution engine)

• Provision of creating in-memory materialized tables (Cached Tables)

• Cached tables utilise columnar storage instead of raw storage

Page 25: Introduction to apache spark

Shark

Shark uses the existing Hive client and metastore

Page 26: Introduction to apache spark

MLlib

• Machine learning library based on Spark


• Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
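For example, clustering with MLlib's k-means looks like this (a sketch — the input path and data format are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors, one per line (assumed format).
val data   = sc.textFile("hdfs://.../kmeans_data")
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(parsed, 2, 20)   // k = 2 clusters, 20 iterations

model.clusterCenters.foreach(println)     // inspect the learned centroids
```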

Page 27: Introduction to apache spark

Spark Streaming

• Write streaming applications in the same way as batch applications

• Reuse code between batch processing and streaming

• Write more than analytics applications:

• Join streams against historical data

• Run ad-hoc queries on stream state
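A minimal streaming sketch, showing the same RDD-style API applied to a live stream (the socket source and port are assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(1))      // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// Same flatMap/map/reduceByKey shape as the batch word count.
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```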

Page 28: Introduction to apache spark

Spark Streaming

• Count tweets on a sliding window


• Find words with higher frequency than historic data
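The sliding-window count has roughly this shape (a sketch — `tweets` is assumed to be a `DStream[String]` of tweet texts; the slide's source setup is an image):

```scala
import org.apache.spark.streaming.Seconds

// Extract hashtags from the incoming tweet stream.
val hashtags = tweets.flatMap(_.split(" ")).filter(_.startsWith("#"))

// Count hashtags over the last 60 seconds, recomputed every 5 seconds.
val counts = hashtags.map((_, 1))
                     .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))
counts.print()
```

Comparing windowed counts against a batch-computed historical baseline is what the "higher frequency than historic data" bullet refers to.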

Page 29: Introduction to apache spark

GraphX: graph computing

Page 30: Introduction to apache spark