Introduction to Apache Spark

Introduction to Spark Eric Eijkelenboom - UserReport - userreport.com

Upload: userreport

Post on 26-Jan-2015


DESCRIPTION

Intro

TRANSCRIPT

Page 1: Introduction to apache spark

Introduction to Spark

Eric Eijkelenboom - UserReport - userreport.com

Page 2: Introduction to apache spark

• What is Spark and why should I care?

• Architecture and programming model

• Examples

• Mini demo

• Related projects

Page 3: Introduction to apache spark
Page 4: Introduction to apache spark

RTFM

• A general-purpose computation framework that leverages distributed memory

• More flexible than MapReduce (it supports general execution graphs)

• Linear scalability and fault tolerance

• It supports a rich set of higher-level tools including

• Shark (Hive on Spark) and Spark SQL

• MLlib for machine learning

• GraphX for graph processing

• Spark Streaming

Page 5: Introduction to apache spark

Who cares?

Page 6: Introduction to apache spark


Limitations of MapReduce

• Slow due to serialisation & replication

• Inefficient for iterative computing & interactive querying

[Diagram: each iteration (iter. 1, iter. 2, …) of a Map/Reduce job reads its input from HDFS and writes its output back to HDFS]

Page 7: Introduction to apache spark

Leveraging memory

[Diagram: the iterative flow again — input, iter. 1, iter. 2, …, with an HDFS read and an HDFS write around every iteration]

Page 8: Introduction to apache spark

Leveraging memory

[Diagram: the same flow, contrasted with Spark passing data between iterations in memory, with a single HDFS read at the start]

Page 9: Introduction to apache spark

Leveraging memory

[Same diagram as the previous slide]

Not tied to the 2-stage MapReduce paradigm:

1. Extract a working set
2. Cache it
3. Query it repeatedly

Page 10: Introduction to apache spark

So, Spark is…

• In-memory analytics, many times faster than Hadoop/Hive

• Designed for running iterative algorithms & interactive querying

• Highly compatible with Hadoop’s Storage APIs

• Can run on your existing Hadoop Cluster Setup

• Programming in Scala, Python or Java

Page 11: Introduction to apache spark

Spark stack

Page 12: Introduction to apache spark

Architecture

[Diagram: a Spark Driver (Master) connects to a Cluster Manager, which schedules Spark Workers; each worker keeps an in-memory Cache and is co-located with an HDFS Datanode holding data Blocks]

Page 13: Introduction to apache spark

Architecture

[Same diagram as the previous slide]

The Cluster Manager can be:

• YARN

• Mesos

• Standalone

Page 14: Introduction to apache spark

Programming model

• Resilient Distributed Datasets (RDDs) are the basic building blocks

• Distributed collection of objects, cached in-memory across cluster nodes

• Automatically rebuilt on failure

• RDD operations

• Transformations: create new RDDs from existing ones

• Actions: return a value to the master node after running a computation on the dataset
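The distinction can be sketched as follows (assuming a SparkContext named `sc` is already available, as in the Spark shell):

```scala
// Transformations are lazy: they only describe a new RDD.
val nums    = sc.parallelize(1 to 1000)      // distributed collection
val squares = nums.map(n => n * n)           // transformation: no work happens yet
val evens   = squares.filter(_ % 2 == 0)     // transformation: still no work

// Actions trigger the computation and return a value to the driver.
val total = evens.count()                    // action: runs on the cluster
```

Because transformations are lazy, Spark can see the whole execution graph before scheduling any work.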

Page 15: Introduction to apache spark

As you know…

• … Hadoop is a distributed system for counting words

• Here is how it's done in Spark


Blue code: Spark operations
Red code: functions (closures) that get passed to the cluster automatically
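The word-count code on the slide is an image; the canonical Spark version has this shape (a sketch — the HDFS paths are placeholders):

```scala
val file   = sc.textFile("hdfs://...")                 // Spark operation
val counts = file.flatMap(line => line.split(" "))     // closure shipped to the cluster
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)                   // Spark operation
counts.saveAsTextFile("hdfs://...")
```

Compare this to the Mapper/Reducer classes and job wiring the same program needs in Hadoop MapReduce.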

Page 17: Introduction to apache spark

Text search

Page 18: Introduction to apache spark

Text search

In-memory text search: caches the RDD in memory for faster reuse
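A sketch of the cached text-search pattern (the input path and search terms are illustrative):

```scala
val lines  = sc.textFile("hdfs://.../logs")     // placeholder input path
val errors = lines.filter(_.contains("ERROR"))
errors.cache()                                   // keep the working set in memory

// Repeated queries now hit the in-memory copy instead of re-reading HDFS.
val mysqlErrors = errors.filter(_.contains("MySQL")).count()
val phpErrors   = errors.filter(_.contains("PHP")).count()
```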

Page 19: Introduction to apache spark

Logistic regression


• 100 GB of data on a 100 node cluster
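The slide's code is a screenshot; the classic Spark logistic-regression example is roughly this shape (a sketch — `sc`, the input path, the feature dimension, and the `parsePoint` helper are assumptions):

```scala
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

val points = sc.textFile("hdfs://.../points")
               .map(parsePoint)     // hypothetical line -> Point parser
               .cache()             // working set stays in memory across iterations

val dims = 10                       // number of features (assumption)
var w = Array.fill(dims)(0.0)       // initial weights

for (_ <- 1 to 100) {               // every iteration reuses the cached RDD
  val gradient = points.map { p =>
    val s = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * s)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
```

Only the first iteration touches HDFS; the rest run against memory — which is why the 100 GB / 100-node benchmark favours Spark so heavily.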

Page 20: Introduction to apache spark

Easy unit testing
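Tests can run Spark in local mode, so no cluster is needed — a sketch (the ScalaTest style is an assumption, not from the slides):

```scala
import org.apache.spark.SparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("counts words") {
    val sc = new SparkContext("local", "test")   // local mode: runs in-process
    try {
      val counts = sc.parallelize(Seq("a b", "a"))
                     .flatMap(_.split(" "))
                     .map((_, 1))
                     .reduceByKey(_ + _)
                     .collectAsMap()
      assert(counts("a") == 2)
    } finally {
      sc.stop()                                  // always release the context
    }
  }
}
```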

Page 21: Introduction to apache spark

Spark shell
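The shell (launched with `./bin/spark-shell`) is a Scala REPL with a SparkContext `sc` pre-created; a typical session looks like:

```scala
scala> val lines = sc.textFile("README.md")
scala> lines.count()
scala> lines.filter(_.contains("Spark")).count()
```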

Page 22: Introduction to apache spark

Mini demo

Page 23: Introduction to apache spark
Page 24: Introduction to apache spark

Hive on Spark = Shark

• A large-scale data warehouse system, just like Hive

• Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs)

• Built on top of Spark (thus a faster execution engine)

• Provision of creating in-memory materialized tables (Cached Tables)

• Cached tables utilise columnar storage instead of raw storage

Page 25: Introduction to apache spark

Shark

Shark uses the existing Hive client and metastore

Page 26: Introduction to apache spark

MLlib

• Machine learning library based on Spark


• Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
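For example, clustering with MLlib's k-means looks like this (a sketch — the input path and data format are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors, one per line (assumed format).
val data   = sc.textFile("hdfs://.../kmeans_data")
val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(parsed, 2, 20)   // k = 2 clusters, 20 iterations

model.clusterCenters.foreach(println)     // inspect the learned centroids
```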

Page 27: Introduction to apache spark

Spark Streaming

• Write streaming applications in the same way as batch applications

• Reuse code between batch processing and streaming

• Write more than analytics applications:

• Join streams against historical data

• Run ad-hoc queries on stream state
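A minimal streaming sketch, showing the same RDD-style API applied to a live stream (the socket source and port are assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(1))      // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// Same flatMap/map/reduceByKey shape as the batch word count.
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```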

Page 28: Introduction to apache spark

Spark Streaming

• Count tweets on a sliding window


• Find words with higher frequency than historic data
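The sliding-window count has roughly this shape (a sketch — `tweets` is assumed to be a `DStream[String]` of tweet texts; the slide's source setup is an image):

```scala
import org.apache.spark.streaming.Seconds

// Extract hashtags from the incoming tweet stream.
val hashtags = tweets.flatMap(_.split(" ")).filter(_.startsWith("#"))

// Count hashtags over the last 60 seconds, recomputed every 5 seconds.
val counts = hashtags.map((_, 1))
                     .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))
counts.print()
```

Comparing windowed counts against a batch-computed historical baseline is what the "higher frequency than historic data" bullet refers to.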

Page 29: Introduction to apache spark

GraphX: graph computing

Page 30: Introduction to apache spark