cluster computing with dryadlinq
DESCRIPTION
Cluster Computing with DryadLINQ. Mihai Budiu Microsof t Research Silicon Valley IEEE Cloud Computing Conference July 19, 2008. Goal. Design Space. Grid. Internet. Data- parallel. Dryad. Search. Shared memory. Private d ata center. Transaction. HPC. Latency. Throughput. - PowerPoint PPT PresentationTRANSCRIPT
Cluster Computing with DryadLINQ
Mihai BudiuMicrosoft Research Silicon Valley
IEEE Cloud Computing ConferenceJuly 19, 2008
2
Goal
3
Design Space
ThroughputLatency
Internet
Privatedata
center
Data-parallel
Sharedmemory
DryadSearch
HPC
Grid
Transaction
4
Data Partitioning
RAM
DATA
DATA
5
Data-Parallel Computation
Storage
Execution
Application
Parallel Databases
Map-Reduce
GFSBigTable
CosmosNTFS
Dryad
DryadLINQScope,PSQL
Sawzall, Pig
6
• Introduction• Dryad • DryadLINQ• Applications
Outline
7
2-D Piping• Unix Pipes: 1-D
grep | sed | sort | awk | perl
• Dryad: 2-D grep1000 | sed500 | sort1000 | awk500 | perl50
8
Virtualized 2-D Pipelines
9
Virtualized 2-D Pipelines
10
Virtualized 2-D Pipelines
11
Virtualized 2-D Pipelines
12
Virtualized 2-D Pipelines• 2D DAG• multi-machine• virtualized
13
Architecture
Files, TCP, FIFO, Networkjob schedule
data plane
control plane
NS PD PDPD
V V V
Job manager cluster
Fault Tolerance
15
• Introduction• Dryad • DryadLINQ• Applications
Outline
16
DryadLINQ
Dryad
17
LINQ = C# + Queries
Collection<T> collection;bool IsLegal(Key);string Hash(Key);
var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};
18
Collection<T> collection;bool IsLegal(Key k);string Hash(Key);
var results = from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value};
DryadLINQ = LINQ + Dryad
C#
collection
results
C# C# C#
Vertexcode
Queryplan(Dryad job)Data
19
Data Model
Partition
Collection
C# objects
20
Language Summary
WhereSelectGroupByOrderByAggregateJoinApplyMaterialize
21
Demo
Done
22
Example: Histogrampublic static IQueryable<Pair> Histogram( IQueryable<LineRecord> input, int k){ var words = input.SelectMany(x => x.line.Split(' ')); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); return top;}
“A line of words of wisdom”
[“A”, “line”, “of”, “words”, “of”, “wisdom”]
[[“A”], [“line”], [“of”, “of”], [“words”], [“wisdom”]]
[ {“A”, 1}, {“line”, 1}, {“of”, 2}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}, {“words”, 1}, {“wisdom”, 1}]
[{“of”, 2}, {“A”, 1}, {“line”, 1}]
23
Histogram Plan
SelectManyHashDistribute
MergeGroupBy
Select
OrderByDescendingTake
MergeSortTake
24
Map-Reduce in DryadLINQ
public static IQueryable<S> MapReduce<T,M,K,S>( this IQueryable<T> input, Expression<Func<T, IEnumerable<M>>> mapper, Expression<Func<M,K>> keySelector, Expression<Func<IGrouping<K,M>,S>> reducer) { var map = input.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.Select(reducer); return result;}
25
Map-Reduce Plan
M
R
G
M
Q
G1
R
D
MS
G2
R
static dynamic
X
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
X
M
Q
G1
R
D
MS
G2
R
MS
G2
R
map
sort
groupby
reduce
distribute
mergesort
groupby
reduce
mergesort
groupby
reduce
consumer
map
parti
al a
ggre
gatio
nre
ducedynamic
26
• Introduction• Dryad • DryadLINQ• Applications
Outline
27
Applications
Dryad
DryadLINQ
Distributed Data Structures
Machine learningGraphsLog
analysisImage
processingCombinatorialoptimization
Raytracing
Dataanalysis
28
E.g: Linear Algebra
T U Vnmm ,,=, ,
T
Expectation Maximization (Gaussians)
29
• 160 lines • 3 iterations shown
Conclusions
• Dryad = distributed execution environment• Supports rich software ecosystem
• DryadLINQ = Compiles LINQ to Dryad• C# objects and declarative programming• .Net and Visual Studio
for distributed programming
30
31
Dryad Job Structure
grep
sed
sortawk
perlgrep
grepsed
sort
sort
awk
Inputfiles
Vertices (processes)
Outputfiles
ChannelsStage
32
Linear Regression
• Data
• Find
• S.t.
mt
nt yx ,
mnA
tt yAx
},...,1{ nt
33
Analytic Solution
X×XT X×XT X×XT Y×XT Y×XT Y×XT
Σ
X[0] X[1] X[2] Y[0] Y[1] Y[2]
Σ
[ ]-1
*
A
1))(( Ttt t
Ttt t xxxyA
Map
Reduce
34
Linear Regression Code
Vectors x = input(0), y = input(1);Matrices xx = x.Map(x, (a,b) => a.OuterProd(b));OneMatrix xxs = xx.Sum();Matrices yx = y.Map(x, (a,b) => a.OuterProd(b));OneMatrix yxs = yx.Sum();OneMatrix xxinv = xxs.Map(a => a.Inverse());OneMatrix A = yxs.Map(xxinv, (a, b) => a.Mult(b));
1))(( Ttt t
Ttt t xxxyA
35
Dryad = Execution Layer
Job (application)
Dryad
Cluster
Pipeline
Shell
Machine≈
36
• Many similarities• Exe + app. model• Map+sort+reduce• Few policies• Program=map+reduce• Simple• Mature (> 4 years)• Widely deployed• Hadoop
Dryad Map-Reduce
• Execution layer• Job = arbitrary DAG• Plug-in policies• Program=graph gen.• Complex ( features)• New (< 2 years)• Still growing• Internal
37
PLINQ
public static IEnumerable<TSource> DryadSort<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IComparer<TKey> comparer, bool isDescending){
return source.AsParallel().OrderBy(keySelector, comparer);}
38
Operations on Large Vectors: Map 1
U
T
T Uf
f
f preserves partitioning
39
V
Map 2 (Pairwise)
T Uf
V
U
T
f
40
Map 3 (Vector-Scalar)T U
fV
V
40
U
T
f
Reduce (Fold)
41
U UU
U
f
f f f
fU U U
U
42
Software Stack
Windows Server
Cluster Services
Distributed Filesystem (Cosmos)
Dryad
Distributed Shell (Nebula)
PSQL
DryadLINQ
PerlSQL
server
C++
Windows Server
Windows Server
Windows Server
C++
CIFS/NTFS
legacycode
sed, awk, grep, etc.
SSISScope
C# Data structures
Applications
C#
Job
queu
eing
, mon
itorin
g