Cluster Computing with DryadLINQ. Mihai Budiu, Microsoft Research Silicon Valley. IEEE Cloud Computing Conference, July 19, 2008

Page 1: Cluster Computing with DryadLINQ

Cluster Computing with DryadLINQ

Mihai Budiu, Microsoft Research Silicon Valley

IEEE Cloud Computing Conference, July 19, 2008

Page 2: Cluster Computing with DryadLINQ

Goal

Page 3: Cluster Computing with DryadLINQ

Design Space

[Figure: design-space diagram. Labels: Throughput, Latency; Internet, Private data center; Data-parallel, Shared memory. Systems placed in the space: Dryad, Search, HPC, Grid, Transaction.]

Page 4: Cluster Computing with DryadLINQ

Data Partitioning

[Figure labels: RAM, DATA]

Page 5: Cluster Computing with DryadLINQ

Data-Parallel Computation

Application: Sawzall, Pig (Map-Reduce); DryadLINQ, Scope, PSQL (Dryad)
Execution: Parallel Databases; Map-Reduce; Dryad
Storage: GFS, BigTable (Map-Reduce); Cosmos, NTFS (Dryad)

Page 6: Cluster Computing with DryadLINQ

Outline

• Introduction
• Dryad
• DryadLINQ
• Applications

Page 7: Cluster Computing with DryadLINQ

2-D Piping

• Unix Pipes: 1-D
grep | sed | sort | awk | perl

• Dryad: 2-D
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50

Page 8: Cluster Computing with DryadLINQ

Virtualized 2-D Pipelines

Page 9: Cluster Computing with DryadLINQ

Virtualized 2-D Pipelines

Page 10: Cluster Computing with DryadLINQ

Virtualized 2-D Pipelines

Page 11: Cluster Computing with DryadLINQ

Virtualized 2-D Pipelines

Page 12: Cluster Computing with DryadLINQ

Virtualized 2-D Pipelines

• 2D DAG
• multi-machine
• virtualized

Page 13: Cluster Computing with DryadLINQ

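Architecture

[Figure: Dryad architecture. Control plane: the job manager holds the job schedule and talks to the cluster's name server (NS) and process daemons (PD). Data plane: vertices (V) exchange data over channels (Files, TCP, FIFO, Network).]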

Page 14: Cluster Computing with DryadLINQ

Fault Tolerance

Page 15: Cluster Computing with DryadLINQ

Outline

• Introduction
• Dryad
• DryadLINQ
• Applications

Page 16: Cluster Computing with DryadLINQ

DryadLINQ

Dryad

Page 17: Cluster Computing with DryadLINQ

LINQ = C# + Queries

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
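The query on this slide is shorthand: `IsLegal` and `Hash` are declared without bodies and the element type is left open. A minimal sketch that runs locally with LINQ to Objects is shown here. The tuple element type, the stand-in bodies of `IsLegal` and `Hash`, and the class name `LinqQueryDemo` are assumptions made for illustration and do not come from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqQueryDemo
{
    // Stand-in predicate (assumption): a key is "legal" when it is non-empty.
    public static bool IsLegal(string key) => !string.IsNullOrEmpty(key);

    // Stand-in "hash" (assumption): upper-case the key. A real hash would differ.
    public static string Hash(string key) => key.ToUpperInvariant();

    // Runs the slide's query over an in-memory collection and formats each result.
    public static List<string> Run(IEnumerable<(string key, int value)> collection)
    {
        var results = from c in collection
                      where IsLegal(c.key)
                      select new { hash = Hash(c.key), value = c.value };
        return results.Select(r => $"{r.hash}:{r.value}").ToList();
    }

    public static void Main()
    {
        var data = new List<(string key, int value)> { ("a", 1), ("", 2), ("b", 3) };
        foreach (var line in Run(data))
            Console.WriteLine(line);   // prints A:1 then B:3
    }
}
```

The anonymous-type member is given an explicit name (`hash = ...`) because C# cannot infer a member name from a method call.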

Page 18: Cluster Computing with DryadLINQ

DryadLINQ = LINQ + Dryad

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Figure: the C# query over "collection" is turned into a query plan (a Dryad job); C# vertex code runs over the data and produces "results".]

Page 19: Cluster Computing with DryadLINQ

Data Model

[Figure: a collection is split into partitions; each partition holds C# objects.]

Page 20: Cluster Computing with DryadLINQ

Language Summary

• Where
• Select
• GroupBy
• OrderBy
• Aggregate
• Join
• Apply
• Materialize
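Most of the operators listed on this slide also exist in standard LINQ to Objects, so their behavior can be tried on a single machine. The sketch below chains Where, GroupBy, Select, Join, OrderBy and Aggregate over made-up order data. The data, the class name `OperatorsDemo` and the method names are hypothetical. Apply and Materialize are not part of standard LINQ and are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OperatorsDemo
{
    // Per-customer totals of positive orders, joined to customer names and sorted by name.
    public static List<string> TotalsByCustomer(
        IEnumerable<(int customerId, int amount)> orders,
        IEnumerable<(int id, string name)> customers)
    {
        return orders
            .Where(o => o.amount > 0)                                   // Where: filter
            .GroupBy(o => o.customerId)                                 // GroupBy: one group per customer
            .Select(g => (id: g.Key, total: g.Sum(o => o.amount)))      // Select: project each group
            .Join(customers, t => t.id, c => c.id,
                  (t, c) => (name: c.name, total: t.total))             // Join: attach the name
            .OrderBy(x => x.name)                                       // OrderBy: sort
            .Select(x => $"{x.name}={x.total}")
            .ToList();
    }

    // Aggregate: fold all amounts (including negative ones) into a single sum.
    public static int GrandTotal(IEnumerable<(int customerId, int amount)> orders)
        => orders.Aggregate(0, (sum, o) => sum + o.amount);

    public static void Main()
    {
        var orders = new[] { (1, 10), (2, 5), (1, 7), (2, -3) }
            .Select(o => (customerId: o.Item1, amount: o.Item2)).ToList();
        var customers = new[] { (id: 1, name: "ann"), (id: 2, name: "bob") };

        foreach (var line in TotalsByCustomer(orders, customers))
            Console.WriteLine(line);              // ann=17, bob=5
        Console.WriteLine(GrandTotal(orders));    // 19
    }
}
```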

Page 21: Cluster Computing with DryadLINQ

Demo

Done

Page 22: Cluster Computing with DryadLINQ

Example: Histogram

public static IQueryable<Pair> Histogram(
    IQueryable<LineRecord> input, int k)
{
    var words = input.SelectMany(x => x.line.Split(' '));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    return top;
}

“A line of words of wisdom”

[“A”, “line”, “of”, “words”, “of”, “wisdom”]

[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]

[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]

[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]

[{“of”, 2}, {“A”, 1}, {“line”, 1}]
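The trace above can be reproduced on one machine with LINQ to Objects. This is a local sketch, not the distributed DryadLINQ version: it takes plain strings instead of `LineRecord`, returns tuples instead of `Pair`, and the class name `HistogramDemo` is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistogramDemo
{
    // Same operator chain as the slide, over in-memory strings.
    public static List<(string word, int count)> Histogram(IEnumerable<string> lines, int k)
    {
        var words = lines.SelectMany(line => line.Split(' '));             // split lines into words
        var groups = words.GroupBy(w => w);                                // group identical words
        var counts = groups.Select(g => (word: g.Key, count: g.Count()));  // count each group
        var ordered = counts.OrderByDescending(p => p.count);              // most frequent first
        return ordered.Take(k).ToList();                                   // keep the top k
    }

    public static void Main()
    {
        var top = Histogram(new[] { "A line of words of wisdom" }, 3);
        foreach (var (word, count) in top)
            Console.WriteLine($"{word}: {count}");   // of: 2, A: 1, line: 1
    }
}
```

In LINQ to Objects, `GroupBy` yields groups in order of first appearance and `OrderByDescending` is a stable sort, which is why the ties come out as "A" then "line", matching the slide.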

Page 23: Cluster Computing with DryadLINQ

Histogram Plan

[Figure: execution plan, stage by stage:]
• SelectMany, HashDistribute
• Merge, GroupBy
• Select
• OrderByDescending, Take
• MergeSort, Take

Page 24: Cluster Computing with DryadLINQ

Map-Reduce in DryadLINQ

public static IQueryable<S> MapReduce<T,M,K,S>(
    this IQueryable<T> input,
    Expression<Func<T, IEnumerable<M>>> mapper,
    Expression<Func<M,K>> keySelector,
    Expression<Func<IGrouping<K,M>,S>> reducer)
{
    var map = input.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.Select(reducer);
    return result;
}
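To show how such an operator would be called, here is an in-memory analogue with a word-count usage. It is a sketch under stated assumptions: it uses `IEnumerable` and plain `Func` delegates rather than the slide's `IQueryable` and `Expression` trees, so it runs locally and does not generate a Dryad job. The class name `MapReduceDemo` and the `WordCount` helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceDemo
{
    // Local analogue of the slide's operator: map, group by key, reduce each group.
    public static IEnumerable<S> MapReduce<T, M, K, S>(
        this IEnumerable<T> input,
        Func<T, IEnumerable<M>> mapper,
        Func<M, K> keySelector,
        Func<IGrouping<K, M>, S> reducer)
    {
        var map = input.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.Select(reducer);
    }

    // Word count: the mapper emits words, the key is the word, the reducer counts.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines
            .MapReduce(
                line => line.Split(' '),
                word => word,
                g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToDictionary(p => p.Key, p => p.Value);
    }

    public static void Main()
    {
        foreach (var pair in WordCount(new[] { "a b a", "b c" }))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```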

Page 25: Cluster Computing with DryadLINQ

Map-Reduce Plan

[Figure: Map-Reduce execution plan, shown in static and dynamic form. Vertex legend: M = map, Q = sort, G1 = groupby, R = reduce, D = distribute, MS = mergesort, G2 = groupby, R = reduce, X = consumer. Stage groups: map, partial aggregation, reduce.]

Page 26: Cluster Computing with DryadLINQ

Outline

• Introduction
• Dryad
• DryadLINQ
• Applications

Page 27: Cluster Computing with DryadLINQ

Applications

[Figure: layered stack. Bottom layers: Dryad, DryadLINQ, Distributed Data Structures. Applications on top: Machine learning, Graphs, Log analysis, Image processing, Combinatorial optimization, Ray tracing, Data analysis.]

Page 28: Cluster Computing with DryadLINQ

E.g: Linear Algebra

[Equation: element types T, U, V with dimensions n and m; the original notation was lost in extraction.]

Page 29: Cluster Computing with DryadLINQ

Expectation Maximization (Gaussians)

• 160 lines
• 3 iterations shown

Page 30: Cluster Computing with DryadLINQ

Conclusions

• Dryad = distributed execution environment
• Supports rich software ecosystem

• DryadLINQ = Compiles LINQ to Dryad
• C# objects and declarative programming
• .Net and Visual Studio for distributed programming

Page 31: Cluster Computing with DryadLINQ

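Dryad Job Structure

[Figure: a Dryad job graph. Input files feed vertices (processes) such as grep, sed, sort, awk and perl; vertices of the same kind form a stage, vertices are connected by channels, and the final vertices write output files.]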

Page 32: Cluster Computing with DryadLINQ

Linear Regression

• Data: $x_t \in \mathbb{R}^{n},\ y_t \in \mathbb{R}^{m},\ t \in \{1, \dots, n\}$

• Find: $A \in \mathbb{R}^{m \times n}$

• S.t.: $A x_t \approx y_t$

Page 33: Cluster Computing with DryadLINQ

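Analytic Solution

$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$

[Figure: dataflow over partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2]. Map: each partition computes X×X^T and Y×X^T. Reduce: each set of products is summed (Σ), the X×X^T sum is inverted ([ ]^-1) and multiplied (*) with the Y×X^T sum to give A.]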

Page 34: Cluster Computing with DryadLINQ

Linear Regression Code

Vectors x = input(0), y = input(1);
Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));
OneMatrix xxs = xx.Sum();
Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));

$A = \left(\sum_t y_t x_t^T\right)\left(\sum_t x_t x_t^T\right)^{-1}$
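The formula can be checked numerically on a small case. The sketch below is a single-machine illustration, not the distributed vector library from the slide: it fixes $x_t \in \mathbb{R}^2$ and scalar $y_t$ (so $A$ is a 1×2 row vector), inverts the 2×2 sum in closed form, and uses a hypothetical class name `RegressionDemo`.

```csharp
using System;
using System.Collections.Generic;

public static class RegressionDemo
{
    // Fits y ≈ A x for x in R^2 and scalar y; returns A as a row vector of length 2.
    public static double[] Fit(IEnumerable<(double[] x, double y)> data)
    {
        var xx = new double[2, 2];   // running sum of x x^T
        var yx = new double[2];      // running sum of y x^T
        foreach (var (x, y) in data)
        {
            for (int i = 0; i < 2; i++)
            {
                yx[i] += y * x[i];
                for (int j = 0; j < 2; j++)
                    xx[i, j] += x[i] * x[j];
            }
        }

        // Closed-form inverse of the 2x2 sum.
        double det = xx[0, 0] * xx[1, 1] - xx[0, 1] * xx[1, 0];
        if (Math.Abs(det) < 1e-12)
            throw new InvalidOperationException("Sum of x x^T is singular.");
        var inv = new double[2, 2]
        {
            {  xx[1, 1] / det, -xx[0, 1] / det },
            { -xx[1, 0] / det,  xx[0, 0] / det }
        };

        // A = (sum of y x^T) * (sum of x x^T)^-1
        return new[]
        {
            yx[0] * inv[0, 0] + yx[1] * inv[1, 0],
            yx[0] * inv[0, 1] + yx[1] * inv[1, 1]
        };
    }

    public static void Main()
    {
        // Data generated from y = 3*x0 + 2*x1, so the fit should recover (3, 2).
        var data = new List<(double[] x, double y)>
        {
            (new[] { 1.0, 0.0 }, 3.0),
            (new[] { 0.0, 1.0 }, 2.0),
            (new[] { 1.0, 1.0 }, 5.0)
        };
        var a = Fit(data);
        Console.WriteLine($"{a[0]:F3} {a[1]:F3}");
    }
}
```

The two running sums correspond to `xxs` and `yxs` in the slide's code; in the distributed version each partition would compute partial sums that are then added together.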

Page 35: Cluster Computing with DryadLINQ

Dryad = Execution Layer

Job (application) ≈ Pipeline
Dryad ≈ Shell
Cluster ≈ Machine

Page 36: Cluster Computing with DryadLINQ

Dryad vs. Map-Reduce

• Many similarities

Dryad
• Execution layer
• Job = arbitrary DAG
• Plug-in policies
• Program=graph gen.
• Complex ( features)
• New (< 2 years)
• Still growing
• Internal

Map-Reduce
• Exe + app. model
• Map+sort+reduce
• Few policies
• Program=map+reduce
• Simple
• Mature (> 4 years)
• Widely deployed
• Hadoop

Page 37: Cluster Computing with DryadLINQ

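PLINQ

public static IEnumerable<TSource> DryadSort<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer,
    bool isDescending)
{
    // Sort in parallel on the local machine, honoring the requested direction.
    var query = source.AsParallel();
    return isDescending
        ? query.OrderByDescending(keySelector, comparer)
        : query.OrderBy(keySelector, comparer);
}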

Page 38: Cluster Computing with DryadLINQ

Operations on Large Vectors: Map 1

[Figure: f : T → U is applied to every element of a partitioned vector of T, producing a vector of U.]

f preserves partitioning

Page 39: Cluster Computing with DryadLINQ

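Map 2 (Pairwise)

[Figure: f : (T, U) → V is applied to corresponding elements of a vector of T and a vector of U, producing a vector of V.]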

Page 40: Cluster Computing with DryadLINQ

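Map 3 (Vector-Scalar)

[Figure: f : (T, U) → V is applied to every element of a vector of T together with a single scalar U, producing a vector of V.]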

Page 41: Cluster Computing with DryadLINQ

Reduce (Fold)

[Figure: f : (U, U) → U combines the elements of a vector of U, within and then across partitions, into a single U.]
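The four vector operations (Map, pairwise Map, vector-scalar Map and Reduce) can be sketched on one machine by modelling a large vector as a list of partitions. This representation, the class name `PartitionedVectorDemo` and the method names are assumptions made for illustration; the talk does not show the library's actual types. The sketch also assumes non-empty partitions, identical partitioning for the pairwise case, and an associative `f` for Reduce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedVectorDemo
{
    // Map 1: apply f to every element; the partition layout is unchanged.
    public static List<U[]> Map<T, U>(List<T[]> v, Func<T, U> f)
        => v.Select(p => p.Select(f).ToArray()).ToList();

    // Map 2 (pairwise): combine corresponding elements of two identically partitioned vectors.
    public static List<V[]> Map2<T, U, V>(List<T[]> a, List<U[]> b, Func<T, U, V> f)
        => a.Zip(b, (pa, pb) => pa.Zip(pb, f).ToArray()).ToList();

    // Map 3 (vector-scalar): combine every element with the same scalar.
    public static List<V[]> MapScalar<T, U, V>(List<T[]> v, U scalar, Func<T, U, V> f)
        => v.Select(p => p.Select(x => f(x, scalar)).ToArray()).ToList();

    // Reduce (fold): fold each partition, then fold the per-partition results.
    public static U Reduce<U>(List<U[]> v, Func<U, U, U> f)
        => v.Select(p => p.Aggregate(f)).Aggregate(f);

    public static void Main()
    {
        var v = new List<int[]> { new[] { 1, 2 }, new[] { 3 } };
        var doubled = Map(v, x => x * 2);                 // partitions [2, 4] and [6]
        var summed = Map2(v, doubled, (x, y) => x + y);   // partitions [3, 6] and [9]
        var shifted = MapScalar(v, 10, (x, s) => x + s);  // partitions [11, 12] and [13]
        Console.WriteLine(Reduce(v, (x, y) => x + y));    // 6
        Console.WriteLine(string.Join(" | ", shifted.Select(p => string.Join(",", p))));
    }
}
```

Every Map variant works partition by partition, which is the sense in which f preserves partitioning; only Reduce needs to combine results across partitions.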

Page 42: Cluster Computing with DryadLINQ

Software Stack

[Figure: layered software stack. Labels:]
• Windows Server (one per machine)
• Cluster Services
• Distributed Filesystem (Cosmos), CIFS/NTFS
• Dryad
• Job queueing, monitoring
• Distributed Shell (Nebula), PSQL, DryadLINQ, Scope, SSIS, SQL server
• C++, C#, Perl
• legacy code: sed, awk, grep, etc.
• Data structures
• Applications