Dryad and dataflow systems


Michael Isard, Microsoft Research

4th June, 2014


Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions

Computation on large datasets
• Performance is mostly efficient resource use
• Locality
• Data placed correctly in memory hierarchy
• Scheduling
• Get enough work done before being interrupted
• Decompose into independent batches
• Parallel computation
• Control communication and synchronization
• Distributed computation
• Writes must be explicitly shared

Computational model
• Vertices are independent
• State and scheduling
• Dataflow very powerful
• Explicit batching and communication

Processing vertices

Channels

Inputs

Outputs

Why dataflow now?
• Collection-oriented programming model
• Operations on collections of objects
• Turn spurious (unordered) for into foreach
• Not every for is foreach
• Aggregation (sum, count, max, etc.)
• Grouping
• Join, Zip
• Iteration
• LINQ since ca. 2008, now Spark via Scala, Java
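The difference between a spurious for and a genuine one can be illustrated with a minimal sketch, written here in Python rather than LINQ. The data and variable names are illustrative assumptions and do not come from the talk.

```python
from itertools import accumulate

values = [3, 1, 4, 1, 5]

# A "spurious" for loop: each iteration is independent of the others,
# so it is really a foreach and could run in any order or in parallel.
squares = [v * v for v in values]

# Aggregation: the result does not depend on the order of the elements.
total, count, largest = sum(values), len(values), max(values)

# Not every for is a foreach: each running total depends on the previous
# one, so the iterations cannot simply be reordered.
running_totals = list(accumulate(values))
```

Only the first two computations are candidates for the collection operators listed above; the running total carries a dependency from one iteration to the next.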

int SortKey(KeyValuePair<string,int> x){ return x.count;}

int SortKey(void* x){ return ((KeyValuePair<string,int>*)x)->count;}

Given some lines of text, find the most commonly occurring words.

1. Read the lines from a file
2. Split each line into its constituent words
3. Count how many times each word appears
4. Find the words with the highest counts

1. var lines = FS.ReadAsLines(inputFileName);
2. var words = lines.SelectMany(x => x.Split(' '));
3. var counts = words.CountInGroups();
4. var highest = counts.OrderByDescending(x => x.count).Take(10);
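For readers who want to run the four steps locally, here is a minimal single-machine sketch in Python rather than LINQ. The function name most_common_words and the sample lines are assumptions made for illustration, and step 1 is replaced by an in-memory list instead of a file read.

```python
from collections import Counter

def most_common_words(lines, n=10):
    # Step 2: split each line into words (the role of SelectMany over Split).
    # Note: str.split() splits on any whitespace, not only single spaces.
    words = (word for line in lines for word in line.split())
    # Step 3: count occurrences of each word (the role of CountInGroups).
    counts = Counter(words)
    # Step 4: order by descending count and keep the first n entries.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Step 1 stand-in: lines already in memory instead of read from a file.
lines = ["red blue blue", "yellow blue red", "yellow yellow blue"]
top = most_common_words(lines, n=2)
```

Each step consumes a whole collection and produces a new one, which is the property that lets a system turn the same program into a dataflow graph.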

Type inference

Collection<KeyValuePair<string,int>>

Lambda expressions

Generics and extension methods

FooCollection FooTake(FooCollection c, int count) { … }

Well-chosen syntactic sugar

[Figure: CountInGroups example. The input words red, red, blue, blue, blue, blue, yellow, yellow, yellow produce the counts red,2; blue,4; yellow,3.]

Collection<T> Take<T>(this Collection<T> c, int count) { … }

Collections compile to dataflow
• Each operator specifies a single data-parallel step
• Communication between steps explicit
• Collections reference collections, not individual objects!
• Communication under control of the system
• Partition, pipeline, exchange automatically
• LINQ innovation: embedded user-defined functions
var words = lines.SelectMany(x => x.Split(' '));
• Very expressive
• Programmer ‘naturally’ writes pure functions

Distributed sorting

var sorted = set.OrderBy(x => x.key)

[Figure: dataflow graph for the distributed sort, with stages labelled set, sample, compute histogram, range partition by key, sort locally, sorted.]
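The stages named in the figure can be sketched on a single machine as follows. This is a Python illustration under stated assumptions, not Dryad's implementation: the function name distributed_sort, the sample_size parameter, and the choice of evenly spaced sample keys as range boundaries are all hypothetical.

```python
import bisect
import random

def distributed_sort(partitions, num_ranges, sample_size=100, seed=0):
    rng = random.Random(seed)
    # Sample: draw a few keys from every input partition.
    sample = []
    for part in partitions:
        sample.extend(rng.sample(part, min(sample_size, len(part))))
    sample.sort()
    # Compute histogram: choose num_ranges - 1 boundaries from the sample so
    # that each key range should receive a similar number of records.
    boundaries = []
    if sample:
        boundaries = [sample[(i * len(sample)) // num_ranges]
                      for i in range(1, num_ranges)]
    # Range partition by key: route each record to the range that owns it.
    ranges = [[] for _ in range(num_ranges)]
    for part in partitions:
        for key in part:
            ranges[bisect.bisect_right(boundaries, key)].append(key)
    # Sort locally: every range is sorted independently of the others.
    return [sorted(r) for r in ranges]

parts = [[5, 3, 9], [1, 8, 2], [7, 4, 6]]
sorted_ranges = distributed_sort(parts, num_ranges=3)
flattened = [key for r in sorted_ranges for key in r]
```

Because the ranges do not overlap and are ordered by boundary, concatenating the locally sorted ranges gives a globally sorted result without a final merge step.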

Quiet revolution in parallelism
• Programming model is more attractive
• Simpler, more concise, readable, maintainable
• Program is easier to optimize
• Programmer separates computation and communication
• System can re-order, distribute, batch, etc.

What is Dryad?
• General-purpose DAG execution engine ca. 2005
• Cited as inspiration for e.g. Hyracks, Tez
• Engine behind Microsoft Cosmos/SCOPE
• Initially MSN Search/Bing, now used throughout MSFT
• Core of research batch cluster environment ca. 2009
• DryadLINQ
• Quincy scheduler
• TidyFS

What Dryad does
• Abstracts cluster resources
• Set of computers, network topology, etc.
• Recovers from transient failures
• Rerun computations on machine or network fault
• Speculate duplicates for slow computations
• Schedules a local DAG of work at each vertex

Scheduling and fault tolerance
• DAG makes things easy
• Schedule from source to sink in any order
• Re-execute subgraph on failure
• Execute “duplicates” for slow vertices
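A toy version of source-to-sink scheduling with re-execution is sketched below in Python. It is an assumption-laden illustration rather than Dryad's scheduler: run_dag, the RuntimeError used to model a transient fault, and the in-memory outputs dictionary standing in for persisted channel data are all hypothetical, and only the failed vertex is re-run rather than a larger subgraph.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, inputs_of, max_attempts=3):
    """vertices: name -> function(list of input values) -> output value.
    inputs_of: name -> list of upstream vertex names."""
    outputs = {}    # stands in for persisted channel data
    attempts = {}   # how many times each vertex was executed
    # static_order yields every vertex after all of its upstream vertices.
    for name in TopologicalSorter(inputs_of).static_order():
        args = [outputs[up] for up in inputs_of.get(name, [])]
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                outputs[name] = vertices[name](args)
                break
            except RuntimeError:
                # Transient failure: the vertex holds no state, so running
                # it again on the same stored inputs is safe.
                if attempt == max_attempts:
                    raise
    return outputs, attempts

remaining_faults = {"count": 1}

def flaky_count(args):
    # Fails once to simulate a machine fault, then succeeds.
    if remaining_faults["count"] > 0:
        remaining_faults["count"] -= 1
        raise RuntimeError("simulated machine fault")
    return len(args[0])

vertices = {
    "read": lambda args: ["red", "blue", "blue"],
    "count": flaky_count,
    "report": lambda args: f"{args[0]} words",
}
inputs_of = {"read": [], "count": ["read"], "report": ["count"]}
outputs, attempts = run_dag(vertices, inputs_of)
```

The retry needs no cooperation from the vertex code, which is the sense in which the DAG makes fault tolerance easy.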

Resources are virtualized
• Each graph vertex is a process
• Writes outputs to disk (usually)
• Reads inputs from upstream nodes’ output files
• Graph generally larger than cluster RAM
• 1 TB partitioned input, 250 MB part size, 4000 parts
• Cluster is shared
• Don’t size program for exact cluster
• Use whatever share of resources is available

Integrated system
• Collection-oriented programming model (LINQ)
• Partitioned file system (TidyFS)
• Manages replication and distribution of large data
• Cluster scheduler (Quincy)
• Jointly schedule multiple jobs at a time
• Fine-grain multiplexing between jobs
• Balance locality and fairness
• Monitoring and debugging (Artemis)
• Within job and across jobs

Dryad Cluster Scheduling

[Figure: cluster scheduling diagram showing a Scheduler and nodes labelled R.]

Quincy without preemption

Quincy with preemption

Dryad features
• Well-tested at scales up to 15k cluster computers
• In heavy production use for 8 years
• Dataflow graph is mutable at runtime
• Repartition to avoid skew
• Specialize matrices dense/sparse
• Harden fault-tolerance

Stateless DAG dataflow
• MapReduce, Dryad, Spark, …
• Stateless vertex constraint hampers performance
• Iteration and streaming overheads
• Why does this design keep repeating?

Software engineering
• Fault tolerance well understood
• E.g., Chandy-Lamport, rollback recovery, etc.
• Basic mechanism: checkpoint plus log
• Stateless DAG: no checkpoint!
• Programming model “tricked” user
• All communication on typed channels
• Only channel data needs to be persisted
• Fault tolerance comes without programmer effort
• Even with UDFs

What about stateful dataflow?
• Naiad
• Add state to vertices
• Support streaming and iteration
• Opportunities
• Much lower latency
• Can model mutable state with dataflow
• Challenges
• Scheduling
• Coordination
• Fault tolerance

Batch processing

Stream processing

Graph processing

Timely dataflow

Batching vs. streaming
• Batching (synchronous): requires coordination, supports aggregation
• Streaming (asynchronous): no coordination needed, aggregation is difficult

Batch DAG execution

Central coordinator

Streaming DAG execution

Inline coordination

Batch iteration

Central coordinator

Streaming iteration

Messages

[Figure: dataflow vertices B, C and D.]

B.SENDBY(edge, message, time)

C.ONRECV(edge, message, time)

Messages are delivered asynchronously

Notifications

[Figure: dataflow vertices B, C and D.]

D.NOTIFYAT(time)

D.ONNOTIFY(time)

Notifications support batching

C.SENDBY(_, _, time)

No more messages at time or earlier

D.ONRECV(_, _, time)
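The interplay of asynchronous messages and notifications can be sketched with a toy single-threaded runtime, written in Python. This is not Naiad's API or its progress-tracking protocol: the Runtime and CountPerTime classes are hypothetical, the edge argument is simplified to the destination vertex itself, and the sketch assumes an acyclic graph in which all external input is queued before run() and vertices never send messages at earlier times.

```python
from collections import defaultdict, deque

class Runtime:
    """Single-threaded toy scheduler for an acyclic graph of vertices."""
    def __init__(self):
        self.queue = deque()   # undelivered (vertex, message, time) triples
        self.pending = []      # requested notifications: (time, stage, vertex)

    def send_by(self, edge, message, time):
        # Simplification: an "edge" is just the destination vertex.
        self.queue.append((edge, message, time))

    def notify_at(self, vertex, time):
        self.pending.append((time, vertex.stage, vertex))

    def run(self):
        while self.queue or self.pending:
            if self.queue:
                vertex, message, time = self.queue.popleft()
                vertex.on_recv(message, time)
            else:
                # Nothing is in flight, so the earliest pending time can
                # receive no more messages: deliver that notification.
                self.pending.sort(key=lambda p: (p[0], p[1]))
                time, _, vertex = self.pending.pop(0)
                vertex.on_notify(time)

class CountPerTime:
    """Counts the messages in each logical time; emits the count on notify."""
    def __init__(self, runtime, stage, output):
        self.runtime, self.stage, self.output = runtime, stage, output
        self.counts = defaultdict(int)

    def on_recv(self, message, time):
        if time not in self.counts:
            # First message for this time: ask to hear when it is complete.
            self.runtime.notify_at(self, time)
        self.counts[time] += 1

    def on_notify(self, time):
        # No more messages at `time` or earlier, so the count is final.
        self.output.append((time, self.counts.pop(time)))

runtime = Runtime()
results = []
counter = CountPerTime(runtime, stage=1, output=results)
for word, time in [("red", 0), ("blue", 1), ("red", 0), ("blue", 0)]:
    runtime.send_by(counter, word, time)
runtime.run()
```

Messages arrive interleaved across times, yet each count is emitted only once its time is complete, which is how notifications recover batch-style aggregation inside a streaming system.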

Coordination in timely dataflow
• Local scheduling with global progress tracking
• Coordination with a shared counter, not a scheduler
• Efficient, scalable implementation

32K tweets/s

10 queries/s

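Interactive graph analysis

[Figure: dataflow graph for the interactive graph analysis, built from join (⋈) and max operators over inputs labelled #x, @y and z?.]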

Query latency

[Figure: query latency (ms, axis from 1 to 1000) against experiment time (s, axis from 30000 to 50000).]

32 8-core 2.1 GHz AMD Opteron
16 GB RAM per server
Gigabit Ethernet

Max: 140 ms
99th percentile: 70 ms
Median: 5.2 ms

Mutable state
• In batch DAG systems collections are immutable
• Functional definition in terms of preceding subgraph
• Adding streaming or iteration introduces mutability
• Collection varies as function of epoch, loop iteration

Key-value store as dataflow

var lookup = data.join(query, d => d.key, q => q.key)

• Modeled random access with dataflow…
• Add/remove key is streaming update to data
• Look up key is streaming update to query
• High throughput requires batching
• But that was true anyway, in general
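The idea of a key-value store expressed as a join over two update streams can be sketched as below, in Python rather than LINQ. The KeyValueJoin class, its process_batch method, and the convention that a value of None removes a key are assumptions made for illustration; the sketch also assumes that updates within a batch are applied before that batch's lookups.

```python
class KeyValueJoin:
    """Joins a stream of data updates with a stream of queries on key."""
    def __init__(self):
        self.data = {}   # state held by the join vertex

    def process_batch(self, data_updates, queries):
        # Streaming update to data: (key, value) adds or overwrites a key,
        # (key, None) removes it.
        for key, value in data_updates:
            if value is None:
                self.data.pop(key, None)
            else:
                self.data[key] = value
        # Streaming update to query: the join emits one (key, value) pair
        # for every queried key that is currently present.
        return [(key, self.data[key]) for key in queries if key in self.data]

store = KeyValueJoin()
first = store.process_batch([("a", 1), ("b", 2)], ["a", "c"])
second = store.process_batch([("a", None)], ["a", "b"])
```

Grouping updates and lookups into batches amortizes the per-message cost, which matches the observation that high throughput requires batching.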

What can’t dataflow do?
• Programming model for mutable state?
• Not as intuitive as functional collection manipulation
• Policies for placement still primitive
• Hash everything and hope
• Great research opportunities
• Intersection of OS, network, runtime, language

Conclusions
• Dataflow is a great structuring principle
• We know good programming models
• We know how to write high-performance systems
• Dataflow is the status quo for batch processing
• Mutable state is the current research frontier

Apache 2.0 licensed source on GitHub
http://research.microsoft.com/en-us/um/siliconvalley/projects/BigDataDev/

