Resilient Distributed Datasets (NSDI 2012)



Page 1: Resilient Distributed Datasets (NSDI 2012)

Resilient Distributed Datasets (NSDI 2012): A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Piccolo (OSDI 2010): Building Fast, Distributed Programs with Partitioned Tables

Discretized Streams (HotCloud 2012): An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

MapReduce Online (NSDI 2010)

Page 2: Resilient Distributed Datasets (NSDI 2012)

Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:

» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Page 3: Resilient Distributed Datasets (NSDI 2012)

Motivation

Complex apps and interactive queries both need two things that MapReduce lacks:

• Efficient primitives for data sharing
• A means for pipelining or continuous processing

Page 4: Resilient Distributed Datasets (NSDI 2012)

MapReduce System Model

• Designed for batch-oriented computations over large data sets
– Each operator runs to completion before producing any output
– Barriers between stages
– Only way to share data is by using stable storage
• Map output to local disk, reduce output to HDFS

Page 5: Resilient Distributed Datasets (NSDI 2012)

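Examples

[Figure: an iterative job (iter. 1, iter. 2, ...) that reads its input from HDFS and performs an HDFS write and an HDFS read between iterations; and interactive queries (query 1, 2, 3, ...) that each perform an HDFS read of the same input to produce result 1, 2, 3]

Slow due to replication and disk I/O, but necessary for fault tolerance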

Page 6: Resilient Distributed Datasets (NSDI 2012)

Goal: In-Memory Data Sharing

[Figure: an iterative job (iter. 1, iter. 2, ...) and interactive queries (query 1, 2, 3, ...) sharing data in memory after one-time processing of the input]

10-100× faster than network/disk, but how to get fault tolerance (FT)?

Page 7: Resilient Distributed Datasets (NSDI 2012)

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 8: Resilient Distributed Datasets (NSDI 2012)

Approach 1: Fine-grained

E.g., Piccolo (Others: RAMCloud; DSM)

Distributed Shared Table

Implemented as an in-memory (distributed) key-value store

Page 9: Resilient Distributed Datasets (NSDI 2012)

Kernel Functions

Operate on in-memory state concurrently on many machines

Sequential code that reads from and writes to the distributed table

Page 10: Resilient Distributed Datasets (NSDI 2012)

Using the store

get(key)
put(key, value)
update(key, value)
flush()
get_iterator(partition)

Page 11: Resilient Distributed Datasets (NSDI 2012)

User-specified policies … for partitioning

Helps programmers express data locality preferences

Piccolo ensures all entries in a partition reside on the same machine

E.g., the user can co-locate a kernel with a partition, and/or co-locate partitions of different related tables

Page 12: Resilient Distributed Datasets (NSDI 2012)

User-specified policies

… for resolving conflicts (multiple kernels writing)

User defines an accumulation function (works if results are independent of update order)

… for checkpointing and restore

Piccolo stores a global state snapshot; relies on the user to checkpoint kernel execution state
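The table operations and the accumulation policy described on these slides can be sketched in a few lines of Scala. This is a single-process toy, not Piccolo's implementation: the class name `AccumTable`, the hash-based `partitionOf`, and the demo object are assumptions made for illustration, and `flush()` is omitted because nothing is buffered here.

```scala
import scala.collection.mutable

// Hypothetical, single-process stand-in for a Piccolo-style table.
// `accumulate` plays the role of the user-defined accumulation function.
final class AccumTable[K, V](numPartitions: Int, accumulate: (V, V) => V) {
  private val partitions = Vector.fill(numPartitions)(mutable.Map.empty[K, V])

  // All entries with the same partition index live together, mirroring
  // "all entries in a partition reside on the same machine".
  def partitionOf(key: K): Int = Math.floorMod(key.hashCode, numPartitions)

  def get(key: K): Option[V] = partitions(partitionOf(key)).get(key)

  // Unconditional overwrite.
  def put(key: K, value: V): Unit = partitions(partitionOf(key))(key) = value

  // Merge with any existing value using the accumulation function.
  def update(key: K, value: V): Unit = {
    val part = partitions(partitionOf(key))
    part(key) = part.get(key).map(accumulate(_, value)).getOrElse(value)
  }

  def iterator(partition: Int): Iterator[(K, V)] = partitions(partition).iterator
}

object AccumTableDemo {
  // Apply the updates in the given order and return the final value for "a".
  def run(updates: Seq[(String, Int)]): Option[Int] = {
    val table = new AccumTable[String, Int](4, _ + _)
    updates.foreach { case (k, v) => table.update(k, v) }
    table.get("a")
  }
}
```

Because addition is commutative and associative, `AccumTableDemo.run` returns the same value for any ordering of the same updates, which is the condition the slide states for accumulation to resolve write conflicts.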

Page 13: Resilient Distributed Datasets (NSDI 2012)

Fine-Grained: Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state

Requires replicating data or logs across nodes for fault tolerance

» Costly for data-intensive apps
» 10-100x slower than memory write

Page 14: Resilient Distributed Datasets (NSDI 2012)

Coarse-Grained: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory

» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage

» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

Page 15: Resilient Distributed Datasets (NSDI 2012)

RDD Recovery

[Figure: an iterative job (iter. 1, iter. 2, ...) and interactive queries (query 1, 2, 3, ...) over data held in memory after one-time processing of the input]

Page 16: Resilient Distributed Datasets (NSDI 2012)

Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms

» These naturally apply the same operation to many items

Unify many current programming models

» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t

Page 17: Resilient Distributed Datasets (NSDI 2012)

Tradeoff Space

[Figure: tradeoff space. Axes: granularity of updates (fine to coarse) and write throughput (low to high, with network bandwidth and memory bandwidth marked). Plotted systems: K-V stores, databases, RAMCloud (best for transactional workloads); HDFS and RDDs (best for batch workloads)]

Page 18: Resilient Distributed Datasets (NSDI 2012)

Spark Programming Interface

DryadLINQ-like API in the Scala language

Usable interactively from the Scala interpreter

Provides:

» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Page 19: Resilient Distributed Datasets (NSDI 2012)

Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program): collect, reduce, count, save, lookupKey

Page 20: Resilient Distributed Datasets (NSDI 2012)

Task Scheduler

Dryad-like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-aware to avoid shuffles

[Figure: DAG of RDDs A through G built with map, union, groupBy and join, divided into Stage 1, Stage 2 and Stage 3; the legend marks cached data partitions]

Page 21: Resilient Distributed Datasets (NSDI 2012)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")   // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count   // action
messages.filter(_.contains("bar")).count

[Figure: the master sends tasks to three workers and collects results; each worker reads one block of the file (Block 1-3) and keeps its partition of messages in memory (Msgs. 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 22: Resilient Distributed Datasets (NSDI 2012)

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data

E.g.:

messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
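To make lineage-based recovery concrete, here is a minimal single-process sketch in Scala. It does not use Spark and does not reproduce Spark's real classes: `Lineage`, `Source`, `Filtered`, `Mapped` and the sample log lines are all hypothetical names and data chosen for illustration. Each dataset stores only its parent and the function applied, so any one partition can be recomputed on demand.

```scala
// A toy model of lineage (not Spark's real classes): each dataset knows
// its parent and the function that derives it from that parent.
sealed trait Lineage[A] {
  def numPartitions: Int
  // Recompute one partition from scratch by walking the lineage.
  def compute(partition: Int): Seq[A]
}

// Stands in for data in stable storage (e.g. a file in HDFS).
final case class Source[A](blocks: Vector[Seq[A]]) extends Lineage[A] {
  def numPartitions: Int = blocks.size
  def compute(partition: Int): Seq[A] = blocks(partition)
}

final case class Filtered[A](parent: Lineage[A], pred: A => Boolean) extends Lineage[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Seq[A] = parent.compute(partition).filter(pred)
}

final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
}

object LineageDemo {
  // Made-up log lines, two blocks, tab-separated fields.
  val lines: Lineage[String] = Source(Vector(
    Seq("error\tdisk\tfull", "info\tok\tfine"),
    Seq("error\tnet\ttimeout")
  ))

  val messages: Lineage[String] =
    Mapped(
      Filtered(lines, (s: String) => s.contains("error")),
      (s: String) => s.split('\t')(2)
    )

  // Pretend partition 1 of `messages` was cached on a node that failed:
  // only that partition is recomputed, from the logged transformations.
  def recoverPartition1(): Seq[String] = messages.compute(1)
}
```

Only the two functions and the source location need to be remembered, not the data itself, which is why nothing is paid for fault tolerance until a partition is actually lost.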

Page 23: Resilient Distributed Datasets (NSDI 2012)

Fault Recovery Results

[Figure: iteration time (s) for iterations 1 to 10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59; the slower iteration 6 (81 s) is annotated “Failure happens”]

Page 24: Resilient Distributed Datasets (NSDI 2012)

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to $\sum_{i \in \text{neighbors}} \text{rank}_i / |\text{neighbors}_i|$

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

Page 25: Resilient Distributed Datasets (NSDI 2012)

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to $\sum_{i \in \text{neighbors}} \text{rank}_i / |\text{neighbors}_i|$

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
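The same update rule can be tried without a cluster. The sketch below uses plain Scala collections in place of RDDs, so it only illustrates the join / flatMap / reduceByKey data flow, not Spark's API; the object name `PageRankSketch` and the use of `Map` for `links` and `ranks` are assumptions. Like the slide, it applies the rule without a damping factor.

```scala
object PageRankSketch {
  // links: page -> pages it links to.
  // Pages that never receive a contribution drop out of `ranks`,
  // just as they would after an inner join followed by reduceByKey.
  def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    // 1. Start each page with a rank of 1.
    var ranks: Map[String, Double] = links.keys.map(url => url -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // "join" links with ranks, then emit one contribution per outgoing link.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (url, dests) =>
        ranks.get(url).toSeq.flatMap(rank => dests.map(dest => dest -> rank / dests.size))
      }
      // "reduceByKey(_ + _)": sum the contributions per destination page.
      ranks = contribs.groupBy(_._1).map { case (url, cs) => url -> cs.map(_._2).sum }
    }
    ranks
  }
}
```

For a three-page graph where `a` links to `b` and `c`, `b` links to `c`, and `c` links to `a`, one iteration gives `a` the full rank of `c`, gives `b` half the rank of `a`, and gives `c` the other half of `a` plus all of `b`.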

Page 26: Resilient Distributed Datasets (NSDI 2012)

Optimizing Placement

links & ranks repeatedly joined

Can co-partition them (e.g. hash both on URL) to avoid shuffles

Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())

[Figure: dataflow in which Links (url, neighbors) is joined with Ranks0 (url, rank) to produce Contribs0, which is reduced to Ranks1; the join and reduce steps then repeat to produce Ranks2, and so on]
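Why co-partitioning avoids a shuffle can be shown with a small sketch on plain Scala collections. Nothing here is Spark's partitioner API: `partitionOf`, `partitionBy` and `localJoin` are hypothetical helpers, and the join assumes each key appears at most once on the right-hand side. When both datasets are split with the same key-to-partition function, partition p of one side only ever needs partition p of the other.

```scala
object CoPartitionSketch {
  // Hypothetical stand-in for a hash partitioner: key -> partition index.
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Split a keyed dataset into partitions using the shared partitioner.
  def partitionBy[V](data: Seq[(String, V)], numPartitions: Int): Vector[Seq[(String, V)]] =
    Vector.tabulate(numPartitions) { p =>
      data.filter { case (k, _) => partitionOf(k, numPartitions) == p }
    }

  // Join two datasets that were partitioned the same way. Each output
  // partition is built from exactly one partition of each input, so no
  // record has to move between partitions.
  def localJoin[A, B](
      left: Vector[Seq[(String, A)]],
      right: Vector[Seq[(String, B)]]
  ): Vector[Seq[(String, (A, B))]] =
    left.zip(right).map { case (l, r) =>
      val rightByKey = r.toMap // assumes unique keys on the right side
      l.flatMap { case (k, a) => rightByKey.get(k).map(b => (k, (a, b))) }
    }
}
```

Swapping `partitionOf` for a function of the URL's host name would correspond to the slide's suggestion of hashing on DNS name, so that pages from one site land in the same partition.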

Page 27: Resilient Distributed Datasets (NSDI 2012)

PageRank Performance

[Figure: bar chart of time per iteration (s) for Hadoop, Basic Spark, and Spark + Controlled Partitioning; bar labels that survive in the transcript: 72 and 23]

Page 28: Resilient Distributed Datasets (NSDI 2012)

Programming Models Implemented on Spark

RDDs can express many existing parallel models

» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark) [in progress]

Enables apps to efficiently intermix these models

All are based on coarse-grained operations

Page 29: Resilient Distributed Datasets (NSDI 2012)

Spark: Summary

RDDs offer a simple and efficient programming model for a broad range of applications

Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Issues?

Page 30: Resilient Distributed Datasets (NSDI 2012)

Discretized Streams

Putting in-memory frameworks to work…

Page 31: Resilient Distributed Datasets (NSDI 2012)

Motivation

• Many important applications need to process large data streams arriving in real time
– User activity statistics (e.g. Facebook’s Puma)
– Spam detection
– Traffic estimation
– Network intrusion detection
• Target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency

Page 32: Resilient Distributed Datasets (NSDI 2012)

Challenge

• To run at large scale, system has to be both:
– Fault-tolerant: recover quickly from failures and stragglers
– Cost-efficient: do not require significant hardware beyond that needed for basic processing
• Existing streaming systems don’t have both properties

Page 33: Resilient Distributed Datasets (NSDI 2012)

Traditional Streaming Systems

• “Record-at-a-time” processing model
– Each node has mutable state
– For each record, update state & send new records

[Figure: input records are pushed to node 1 and node 2, each holding mutable state, which send new records on to node 3]

Page 34: Resilient Distributed Datasets (NSDI 2012)

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: replication: nodes 1, 2, 3 and replicas 1’, 2’, 3’ both receive the input and are kept in step by synchronization; upstream backup: nodes 1, 2, 3 receive the input, with a single standby node]

Page 35: Resilient Distributed Datasets (NSDI 2012)

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: replication: nodes 1, 2, 3 and replicas 1’, 2’, 3’ both receive the input and are kept in step by synchronization; upstream backup: nodes 1, 2, 3 receive the input, with a single standby node]

Replication: fast recovery, but 2x hardware cost

Upstream backup: only need 1 standby, but slow to recover

Page 36: Resilient Distributed Datasets (NSDI 2012)

Traditional Streaming Systems

Fault tolerance via replication or upstream backup:

[Figure: replication: nodes 1, 2, 3 and replicas 1’, 2’, 3’ both receive the input and are kept in step by synchronization; upstream backup: nodes 1, 2, 3 receive the input, with a single standby node]

Neither approach tolerates stragglers

Page 37: Resilient Distributed Datasets (NSDI 2012)

Observation

• Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently
– Divide job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes
• Idea: run a streaming computation as a series of very small, deterministic batches
– Same recovery schemes at much smaller timescale
– Work to make batch size as small as possible
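The idea of running a stream as a series of small deterministic batches can be sketched on plain Scala collections. This is an illustration only, not the D-Streams API: `Event`, `discretize`, `countByKey` and the per-key count are assumptions, and timestamps are assumed to be non-negative.

```scala
object MicroBatchSketch {
  final case class Event(timeMs: Long, key: String)

  // Discretize: assign each record to a fixed-length interval by timestamp.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(e => e.timeMs / batchMs).toSeq.sortBy(_._1)

  // A deterministic batch operation: count records per key within one batch.
  def countByKey(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.key).map { case (k, es) => k -> es.size }

  // The streaming job is the same batch operation applied to every interval.
  def run(events: Seq[Event], batchMs: Long): Seq[(Long, Map[String, Int])] =
    discretize(events, batchMs).map { case (interval, batch) =>
      interval -> countByKey(batch)
    }
}
```

Since `countByKey` depends only on the contents of its batch, rerunning it for one interval after a failure, or speculatively for a straggler, yields the same result as the original run.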

Page 38: Resilient Distributed Datasets (NSDI 2012)

Discretized Stream Processing

[Figure: at each time step (t = 1, t = 2, …), the input of stream 1 and stream 2 is pulled into an immutable dataset (stored reliably); batch operations turn it into an immutable dataset (output or state) that is stored in memory without replication]

Page 39: Resilient Distributed Datasets (NSDI 2012)

Parallel Recovery

• Checkpoint state datasets periodically
• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes

[Figure: a map operation from an input dataset to an output dataset]

Faster recovery than upstream backup, without the cost of replication

Page 40: Resilient Distributed Datasets (NSDI 2012)

Programming Model

• A discretized stream (D-stream) is a sequence of immutable, partitioned datasets
– Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
• Deterministic transformation operators produce new streams

Page 41: Resilient Distributed Datasets (NSDI 2012)

D-Streams Summary

• D-Streams forgo traditional streaming wisdom by batching data in small timesteps
• Enable efficient, new parallel recovery scheme

Page 42: Resilient Distributed Datasets (NSDI 2012)

MapReduce Online

… pipelining in MapReduce

Page 43: Resilient Distributed Datasets (NSDI 2012)

Stream Processing with HOP

• Run MR jobs continuously, and analyze data as it arrives
• Map and reduce tasks run continuously
• Reduce function divides stream into windows
– “Every 30 seconds, compute the 1, 5, and 15 minute average network utilization; trigger an alert if …”
– Window management done by user (reduce)
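Since window management is left to the user's reduce function, the quoted example amounts to logic like the following sketch. It is plain Scala rather than HOP or Hadoop code; `Sample`, `windowAverage` and `report` are hypothetical names, and the alert condition is left out because the slide elides it.

```scala
object WindowSketch {
  // One utilization sample: (timestamp in seconds, utilization in [0, 1]).
  type Sample = (Long, Double)

  // Average of the samples in the window (now - windowSec, now]; None if empty.
  def windowAverage(samples: Seq[Sample], now: Long, windowSec: Long): Option[Double] = {
    val inWindow = samples.collect { case (t, v) if t > now - windowSec && t <= now => v }
    if (inWindow.isEmpty) None else Some(inWindow.sum / inWindow.size)
  }

  // What a user-written reduce function might emit on each 30-second tick:
  // the 1, 5 and 15 minute averages, keyed by window length in minutes.
  def report(samples: Seq[Sample], now: Long): Map[Int, Option[Double]] =
    Seq(1, 5, 15).map(m => m -> windowAverage(samples, now, m * 60L)).toMap
}
```

A continuously running reduce task would call something like `report` every 30 seconds with the samples received so far and discard samples older than the largest window.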

Page 44: Resilient Distributed Datasets (NSDI 2012)

Dataflow in Hadoop

[Figure: two map tasks and two reduce tasks; map output is written to the local FS and fetched by the reduce tasks via HTTP GET]

Page 45: Resilient Distributed Datasets (NSDI 2012)

Hadoop Online Prototype

• HOP supports pipelining within and between MapReduce jobs: push rather than pull
– Preserve simple fault tolerance scheme
– Improved job completion time (better cluster utilization)
– Improved detection and handling of stragglers
• MapReduce programming model unchanged
– Clients supply same job parameters
• Hadoop client interface backward compatible
– No changes required to existing clients
• E.g., Pig, Hive, Sawzall, Jaql
– Extended to take a series of jobs

Page 46: Resilient Distributed Datasets (NSDI 2012)

Pipelining Batch Size

• Initial design: pipeline eagerly (for each row)
– Prevents use of combiner
– Moves more sorting work to mapper
– Map function can block on network I/O
• Revised design: map writes into buffer
– Spill thread: sort & combine buffer, spill to disk
– Send thread: pipeline spill files => reducers
• Simple adaptive algorithm
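The revised design can be pictured with a small single-threaded sketch. It is not HOP's code: `SpillBuffer`, its fixed `capacity`, and the `send` callback (standing in for the send thread) are assumptions, spills are handed over in memory rather than written to disk, and the adaptive algorithm mentioned on the slide is not modeled.

```scala
import scala.collection.mutable

object SpillSketch {
  // Buffers map output and, when the buffer is full, sorts and combines it
  // into one "spill" that is handed to `send`.
  final class SpillBuffer(capacity: Int, send: Seq[(String, Int)] => Unit) {
    private val buffer = mutable.ArrayBuffer.empty[(String, Int)]

    def emit(key: String, value: Int): Unit = {
      buffer += ((key, value))
      if (buffer.size >= capacity) spill()
    }

    // Sort & combine: sum values per key, then order by key.
    def spill(): Unit = if (buffer.nonEmpty) {
      val combined = buffer.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
      send(combined.toSeq.sortBy(_._1))
      buffer.clear()
    }
  }

  // Word-count style mapper over `words`, returning the spills in send order.
  def run(words: Seq[String], capacity: Int): Seq[Seq[(String, Int)]] = {
    val sent = mutable.ArrayBuffer.empty[Seq[(String, Int)]]
    val buf = new SpillBuffer(capacity, spill => { sent += spill; () })
    words.foreach(w => buf.emit(w, 1))
    buf.spill() // flush whatever is left at the end of the map task
    sent.toList
  }
}
```

Batching records before sending is what lets the combiner run again: with per-row pipelining there is never more than one record to combine.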

Page 47: Resilient Distributed Datasets (NSDI 2012)

Dataflow in HOP

[Figure: two map tasks and two reduce tasks; labels: Schedule, Schedule + Location, Pipeline request]

Page 48: Resilient Distributed Datasets (NSDI 2012)

Online Aggregation

• Traditional MR: poor UI for data analysis
• Pipelining means that data is available at consumers “early”
– Can be used to compute and refine an approximate answer
– Often sufficient for interactive data analysis, developing new MapReduce jobs, ...
• Within a single job: periodically invoke reduce function at each reduce task on available data
• Between jobs: periodically send a “snapshot” to consumer jobs

Page 49: Resilient Distributed Datasets (NSDI 2012)

Intra-Job Online Aggregation

• Approximate answers published to HDFS by each reduce task
• Based on job progress: e.g. 10%, 20%, …
• Challenge: providing statistically meaningful approximations
– How close is an approximation to the final answer?
– How do you avoid biased samples?
• Challenge: reduce functions are opaque
– Ideally, computing 20% approximation should reuse results of 10% approximation
– Either use combiners, or HOP does redundant work
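The difference between reusing earlier results and doing redundant work can be illustrated for one simple aggregate, the mean. The sketch below is plain Scala, not HOP: `Partial`, `combine` and `snapshots` are hypothetical names, and the combinable (sum, count) state plays the role of a combiner so that each snapshot extends the previous one instead of rescanning the input.

```scala
object SnapshotSketch {
  // Running state a combiner could maintain: (sum, count).
  type Partial = (Double, Long)

  def combine(a: Partial, b: Partial): Partial = (a._1 + b._1, a._2 + b._2)

  // Consume the input in `steps` roughly equal chunks; after each chunk,
  // publish (fraction of input processed, current estimate of the mean).
  def snapshots(values: Seq[Double], steps: Int): Seq[(Double, Double)] = {
    val chunkSize = math.max(1, math.ceil(values.size.toDouble / steps).toInt)
    val partials = values.grouped(chunkSize).toSeq.map(c => (c.sum, c.size.toLong))
    // Each running total reuses the previous one via `combine`.
    val running = partials.scanLeft((0.0, 0L))(combine).tail
    running.map { case (sum, n) => (n.toDouble / values.size, sum / n) }
  }
}
```

With sorted input such as 1, 2, 3, 4 the first snapshot underestimates the final mean, a small instance of the biased-sample problem raised on the slide: an early snapshot is only a good estimate if the data seen so far is representative.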

Page 50: Resilient Distributed Datasets (NSDI 2012)

Inter-Job Online Aggregation

[Figure: Job 1 reducers (two reduce tasks) write their answer to HDFS and feed Job 2 mappers (two map tasks)]

Page 51: Resilient Distributed Datasets (NSDI 2012)

Inter-Job Online Aggregation

• Like intra-job OA, but approximate answers are pipelined to map tasks of next job
– Requires co-scheduling a sequence of jobs
• Consumer job computes an approximation
– Can be used to feed an arbitrary chain of consumer jobs with approximate answers
• Challenge: how to avoid redundant work
– Output of reduce for 10% progress vs. for 20%