
Page 1: Berkeley Data Analysis Stack (BDAS)

Berkeley Data Analysis Stack (BDAS)

Mesos, Spark, Shark, Spark Streaming

Page 2: Berkeley Data Analysis Stack (BDAS)

Current Data Analysis Open Stack

Application

Storage

Data Processing

Infrastructure

Characteristics:
• Batch processing of on-disk data.
• Not very efficient for "interactive" and "streaming" computations.

Page 3: Berkeley Data Analysis Stack (BDAS)

Goal

Page 4: Berkeley Data Analysis Stack (BDAS)

Berkeley Data Analytics Stack (BDAS)

Infrastructure

Storage

Data Processing

Application

Resource Management

Data Management

Share infrastructure across frameworks (multi-programming for datacenters)

Efficient data sharing across frameworks

Data Processing:
• In-memory processing
• Trade between time, quality, and cost

Application: new apps such as AMP-Genomics, Carat, …

Page 5: Berkeley Data Analysis Stack (BDAS)

BDAS Components

Page 6: Berkeley Data Analysis Stack (BDAS)

Mesos

• A platform for sharing commodity clusters between diverse computing frameworks.

B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," technical report, UC Berkeley, 2010

Page 7: Berkeley Data Analysis Stack (BDAS)

Mesos

• "Resource offers" publish available resources to frameworks.
• Mesos has to deal with framework-specific constraints (without knowing the specific constraints), e.g. data locality.
• The framework scheduler is allowed to reject an offer if its constraints are not met.

B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," technical report, UC Berkeley, 2010
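The offer/reject loop is easy to model. Below is a toy Scala sketch of the decision a framework scheduler makes for each offer; the types and the locality rule are illustrative, not the real org.apache.mesos API. The point is that the master advertises resources while the framework enforces constraints the master never sees.

case class Offer(host: String, cpus: Double, memGB: Double)
case class Task(id: Int, preferredHosts: Set[String], cpus: Double, memGB: Double)

// Accept the offer with a matching pending task, or reject it.
def respond(offer: Offer, pending: List[Task]): (Option[Task], List[Task]) =
  pending.find(t =>
    t.preferredHosts.contains(offer.host) &&       // data-locality constraint
    t.cpus <= offer.cpus && t.memGB <= offer.memGB // resource fit
  ) match {
    case Some(t) => (Some(t), pending.filterNot(_.id == t.id)) // launch t on this offer
    case None    => (None, pending)                            // "reject offer"
  }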

Page 8: Berkeley Data Analysis Stack (BDAS)

Mesos: Other Issues

• Resource allocation strategies: pluggable (a fair-sharing plugin is implemented).
• Revocation.
• Isolation: existing OS isolation techniques, e.g. Linux Containers.
• Fault tolerance:
  – Master: standby master nodes and ZooKeeper.
  – Slaves: task/slave failures are reported to the framework, which handles them.
  – Framework scheduler failure: replicate the scheduler.

B. Hindman et al., "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," technical report, UC Berkeley, 2010

Page 9: Berkeley Data Analysis Stack (BDAS)

Spark

Current popular programming models for clusters transform data flowing from stable storage to stable storage.

• Spark: in-memory cluster computing for iterative and interactive applications.
• Assumptions:
  – The same working data set is used across iterations or for a number of interactive queries.
  – Commodity cluster.
  – Local data partitions fit in memory.

Some slides are taken from an earlier Spark presentation.

Page 10: Berkeley Data Analysis Stack (BDAS)

Spark

• Acyclic data flows are powerful abstractions, but they are not efficient for iterative/interactive applications that repeatedly use the same working data set.
• Solution: augment the data flow model with in-memory "resilient distributed datasets" (RDDs).

Page 11: Berkeley Data Analysis Stack (BDAS)

RDDs

• An RDD is an immutable, partitioned, logical collection of records.
  – It need not be materialized; it contains the information needed to rebuild the dataset from stable storage (lazy loading and lineage).
  – A lost partition can be rebuilt (transform once, read many).
• Partitioning can be based on a key in each record (using hash or range partitioning).
• RDDs are created by transforming data in stable storage using data flow operators (map, filter, group-by, …).
• RDDs can be cached for future reuse.
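As a concrete illustration of laziness, caching, and lineage, here is a minimal sketch (assuming an existing SparkContext sc; the HDFS path is hypothetical):

val words = sc.textFile("hdfs://namenode/docs.txt")  // recipe: read from stable storage
              .flatMap(_.split(" "))                 // data flow operator
              .filter(_.nonEmpty)
              .cache()                               // mark for in-memory reuse

println(words.count())  // first action materializes and caches the partitions
println(words.count())  // served from cache; a lost partition is rebuilt from lineage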

Page 12: Berkeley Data Analysis Stack (BDAS)

Generality of RDDs

• Claim: Spark's combination of data flow with RDDs unifies many proposed cluster programming models.
  – General data flow models: MapReduce, Dryad, SQL.
  – Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MapReduce), Continuous Bulk Processing.
• Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

Page 13: Berkeley Data Analysis Stack (BDAS)

Programming Model

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations (return a result to the driver): reduce, collect, count, save, lookup(key), …
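To make the split concrete, a short sketch (again assuming a SparkContext sc): transformations only define new RDDs, while a parallel operation actually runs the job and returns a result to the driver.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // base RDD
val sums  = pairs.reduceByKey(_ + _)  // transformation: defines a new RDD, still lazy
val out   = sums.collect()            // parallel operation: executes, returns to driver
out.foreach(println)                  // (a,4) and (b,2), in some order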

Page 14: Berkeley Data Analysis Stack (BDAS)

Example: Log Mining

• Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

[Diagram: the driver ships tasks to workers; each worker reads one block of the file (Block 1-3), caches its partition of messages (Cache 1-3), and returns results. Labels mark the base RDD, transformed RDD, cached RDD, and parallel operation.]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
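For readers who want to run the example, here is a self-contained version using the modern SparkContext API rather than the research prototype's spark handle (the log path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[*]"))
    val lines = sc.textFile("hdfs://namenode/logs/app.log")  // hypothetical path
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))  // third tab-separated field
    val cachedMsgs = messages.cache()            // kept in memory across queries
    println(cachedMsgs.filter(_.contains("foo")).count())
    println(cachedMsgs.filter(_.contains("bar")).count())
    sc.stop()
  }
}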

Page 15: Berkeley Data Analysis Stack (BDAS)

RDD Fault Tolerance

• RDDs maintain lineage information that can be used to reconstruct lost partitions.

• Example:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD

Page 16: Berkeley Data Analysis Stack (BDAS)

Benefits of the RDD Model

• Consistency is easy due to immutability.
• Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data).
• Locality-aware scheduling of tasks on partitions.
• Despite being restricted, the model seems applicable to a broad variety of applications.

Page 17: Berkeley Data Analysis Stack (BDAS)

Example: Logistic Regression

• Goal: find best line separating two sets of points

[Figure: two classes of points (+ and −) in the plane, with a random initial line iteratively adjusted toward the target separating line.]

Page 18: Berkeley Data Analysis Stack (BDAS)

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Page 19: Berkeley Data Analysis Stack (BDAS)

Logistic Regression Performance

Hadoop: 127 s per iteration

Spark: 174 s for the first iteration, 6 s per further iteration

Page 20: Berkeley Data Analysis Stack (BDAS)

PageRank: Scala Implementation

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
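A self-contained, runnable sketch of the same loop, with a small hypothetical link graph and the modern SparkContext API:

import org.apache.spark.{SparkConf, SparkContext}

object PageRank {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRank").setMaster("local[*]"))
    // Hypothetical four-page web graph
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a")), ("d", Seq("a", "c"))
    )).cache()                             // reused every iteration
    var ranks = links.mapValues(_ => 1.0)  // initial rank 1.0 per page
    for (i <- 1 to 10) {
      val contribs = links.join(ranks).flatMap {
        case (_, (urls, rank)) => urls.map(dest => (dest, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.collect().sorted.foreach { case (url, rank) => println(f"$url: $rank%.3f") }
    sc.stop()
  }
}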

Page 21: Berkeley Data Analysis Stack (BDAS)

Spark Summary

• Fast, expressive cluster computing system compatible with Apache Hadoop.
  – Works with any Hadoop-supported storage system (HDFS, S3, Avro, …).
• Improves efficiency through in-memory computing primitives and general computation graphs: up to 100× faster.
• Improves usability through rich APIs in Java, Scala, and Python, plus an interactive shell: often 2-10× less code.

Page 22: Berkeley Data Analysis Stack (BDAS)

Spark Streaming

• Framework for large-scale stream processing:
  – Scales to hundreds of nodes.
  – Can achieve second-scale latencies.
  – Integrates with Spark's batch and interactive processing.
  – Provides a simple batch-like API for implementing complex algorithms.
  – Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

Page 23: Berkeley Data Analysis Stack (BDAS)

Requirements

Scalable to large clusters

Second-scale latencies

Simple programming model

Integrated with batch & interactive processing

Page 24: Berkeley Data Analysis Stack (BDAS)

Stateful Stream Processing

Traditional streaming systems have an event-driven, record-at-a-time processing model:
• Each node has mutable state.
• For each record, the node updates its state and sends new records.

State is lost if a node dies! Making stateful stream processing fault-tolerant is challenging.

[Diagram: input records flow through nodes 1-3, each holding mutable state.]

Page 25: Berkeley Data Analysis Stack (BDAS)

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs:

• Chop up the live stream into batches of X seconds.
• Spark treats each batch of data as an RDD and processes it using RDD operations.
• Finally, the processed results of the RDD operations are returned in batches.

[Diagram: a live data stream enters Spark Streaming, which hands batches of X seconds to Spark; Spark emits the processed results.]

Page 26: Berkeley Data Analysis Stack (BDAS)

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

• DStream: a sequence of RDDs representing a stream of data.
• Each batch of the tweets DStream (batch @ t, @ t+1, @ t+2, …) is stored in memory as an RDD (immutable, distributed), fed by the Twitter Streaming API.

Page 27: Berkeley Data Analysis Stack (BDAS)

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

• Transformation: modifies the data of one DStream to create a new DStream.
• flatMap runs on every batch of the tweets DStream, creating the new RDDs of the hashTags DStream ([#cat, #dog, …]) for every batch.

Page 28: Berkeley Data Analysis Stack (BDAS)

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

• Output operation: pushes data to external storage.
• save runs on every batch of the hashTags DStream, so every batch is saved to HDFS.
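The twitterStream source shown here is from the early Spark Streaming API. As a runnable stand-in, here is a self-contained sketch of the same DStream pattern using the standard TCP socket source (feed it with e.g. nc -lk 9999; the HDFS path is hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HashTags {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HashTags").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))     // 1-second batches
    val lines = ssc.socketTextStream("localhost", 9999)  // live text stream
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
    hashTags.saveAsTextFiles("hdfs://namenode/tags/batch")  // one directory per batch
    ssc.start()
    ssc.awaitTermination()
  }
}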

Page 29: Berkeley Data Analysis Stack (BDAS)

Fault Tolerance

• RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
• Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant.
• Data lost due to a worker failure can be recomputed from the replicated input data.

[Diagram: input data replicated in memory; flatMap links the tweets RDD to the hashTags RDD, and lost partitions are recomputed on other workers.]

Page 30: Berkeley Data Analysis Stack (BDAS)

Key Concepts

• DStream: a sequence of RDDs representing a stream of data.
  – Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets.
• Transformations: modify data from one DStream to another.
  – Standard RDD operations: map, countByValue, reduce, join, …
  – Stateful operations: window, countByValueAndWindow, … (see the sketch below)
• Output operations: send data to an external entity.
  – saveAsHadoopFiles: saves to HDFS.
  – foreach: do anything with each batch of results.
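A minimal sketch of a stateful windowed operation (the window and slide durations are illustrative; windowed state needs a checkpoint directory):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedTags {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedTags").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoint")  // required for windowed state
    val tags = ssc.socketTextStream("localhost", 9999)
                  .flatMap(_.split(" ")).filter(_.startsWith("#"))
    // Count each hashtag over the last 60 s, recomputed every 10 s
    tags.countByValueAndWindow(Seconds(60), Seconds(10)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}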

Page 31: Berkeley Data Analysis Stack (BDAS)

Comparison with Storm and S4

Higher throughput than Storm:
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100 and 10,000 bytes) for the WordCount and Grep workloads, comparing Spark and Storm.]