![Page 1: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/1.jpg)
DANDELION: A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS
Jon Currey, Microsoft Research
Joint work with Chris Rossbach, Yuan Yu, JP Martin, Dennis Fetterly
![Page 2: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/2.jpg)
Motivation: Programmability for Heterogeneous Distributed Systems
Data volumes increasing
Cluster costs decreasing
Architectural diversity prevalent
• CPU-GPU server: 5X Gflops/$, 4X Gflops/kilowatt v. CPUs
Programming challenges
• Heterogeneity: programming models, architecture expertise
• Distributed resources: data movement, scheduling
• Concurrency: synchronization, consistency
Dandelion GTC 2014 S4221 2
![Page 3: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/3.jpg)
Dandelion *Goal* (holy grail)
Single programming interface for clusters of
• CPUs
• GPUs
• FPGAs
• You name it…
Programmer writes sequential code
Runtime
• Parallelizes computation
• Partitions data
• Runs on all available resources
• Maps computation to best architecture
![Page 4: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/4.jpg)
Dandelion goal (a less holy and attractive vessel: often just as effective, mileage may vary)
Offload data-parallel code fragments
Small cluster of multi-core + GPU
Starting point: LINQ queries
Our 10-node GPU cluster:
-- 24,960 GPU cores
-- 240 CPU HW threads (12 cores x 2 contexts/node)
-- 2560 GB RAM (256 GB/node)
![Page 5: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/5.jpg)
(Very) High Level View
Input: user program & partitioned data files
Dandelion: compile to a mix of CPU and GPU code, then run on the cluster
Output: partitioned data files
![Page 6: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/6.jpg)
Dandelion Architecture
[Diagram, client side: User Program → Dandelion Compiler → Data-flow graphs (cluster, machine, GPU) and Worker Vertex Code (CPU and GPU)]
[Diagram, cluster side: Dandelion Vertex with Cluster Runtime, Machine Runtime, GPU Runtime; Worker Vertex Code; Data-flow graphs]
![Page 7: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/7.jpg)
Wait… why so many different “dataflow” components?
![Page 8: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/8.jpg)
The composition problem
What happens if I want the following?
Matrix D = A x B x C
Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
![Page 9: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/9.jpg)
Composed matrix multiplication
Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
Matrix AxBxC(Matrix A, B, C) {
    Matrix AxB = gemm(A, B);
    Matrix AxBxC = gemm(AxB, C);
    return AxBxC;
}
![Page 10: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/10.jpg)
Composed matrix multiplication (gemm and AxBxC as above)
After the first gemm call, AxB is copied from GPU memory…
![Page 11: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/11.jpg)
Composed matrix multiplication (gemm and AxBxC as above)
…only to be copied right back by the second gemm call!
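The redundant round trip can be made concrete with a toy model. This is only an illustrative sketch: `FakeGPU` and its counters are hypothetical stand-ins (no real GPU API is involved), and `matmul` plays the role of the kernel launched by `invokeGPU()`.

```python
# Toy model of the composition problem: every gemm call copies its inputs
# to the device and its result back, so chaining two calls moves AxB twice.
# FakeGPU is a hypothetical stand-in, not a real GPU API.

class FakeGPU:
    def __init__(self):
        self.to_device = 0    # host -> device copies
        self.from_device = 0  # device -> host copies

    def copy_to_gpu(self, m):
        self.to_device += 1
        return m

    def copy_from_gpu(self, m):
        self.from_device += 1
        return m


def matmul(a, b):
    # Plain nested-list matrix product, standing in for the GPU kernel.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gemm(gpu, a, b):
    da = gpu.copy_to_gpu(a)
    db = gpu.copy_to_gpu(b)
    dc = matmul(da, db)           # "invokeGPU"
    return gpu.copy_from_gpu(dc)


def axbxc(gpu, a, b, c):
    axb = gemm(gpu, a, b)         # AxB comes back to the host...
    return gemm(gpu, axb, c)      # ...and is immediately copied back


gpu = FakeGPU()
I = [[1, 0], [0, 1]]
M = [[1, 2], [3, 4]]
result = axbxc(gpu, M, I, I)
print(result, gpu.to_device, gpu.from_device)
```

The composed call performs four uploads and two downloads. A runtime that could see both calls and keep AxB on the device would need only three uploads and one download.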
![Page 12: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/12.jpg)
What if I have >1 GPU?
What happens if I want the following?
Matrix D = A x B x C
Matrix gemm(GPU dev, Matrix A, Matrix B) {
    copyToGPU(dev, A);
    copyToGPU(dev, B);
    invokeGPU(dev);
    Matrix C = new Matrix();
    copyFromGPU(dev, C);
    return C;
}
![Page 13: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/13.jpg)
Composition with >1 GPU
Matrix gemm(GPU dev, Matrix A, Matrix B) {
    copyToGPU(dev, A);
    copyToGPU(dev, B);
    invokeGPU(dev);
    Matrix C = new Matrix();
    copyFromGPU(dev, C);
    return C;
}
Matrix AxBxC(Matrix A, B, C) {
    Matrix AxB = gemm(???, A, B);
    Matrix AxBxC = gemm(???, AxB, C);
    return AxBxC;
}
![Page 14: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/14.jpg)
Composition with >1 GPU (gemm(GPU dev, …) as above)
Matrix AxBxC(GPU dev, Matrix A, B, C) {
    Matrix AxB = gemm(dev, A, B);
    Matrix AxBxC = gemm(dev, AxB, C);
    return AxBxC;
}
Rats… now I can only use 1 GPU. How to partition computation?
![Page 15: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/15.jpg)
Composition with >1 GPU (gemm(GPU dev, …) as above)
Matrix AxBxC(GPU devA, GPU devB, Matrix A, B, C) {
    Matrix AxB = gemm(devA, A, B);
    Matrix AxBxC = gemm(devB, AxB, C);
    return AxBxC;
}
Rats… this will never scale to many GPUs. Plus, how do I choose which GPUs to use?
Why don’t we have this problem with CPUs?
Device-centric APIs are the wrong abstraction for GPU compute.
![Page 16: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/16.jpg)
Dataflow: program == graph
nodes → computation
edges → communication
Expresses parallelism explicitly
Minimal specification of data movement: runtime does it.
Asynchrony is a runtime concern (not a programmer concern)
No specification of compute → device mapping: like threads!
[Diagram: Matrix A and Matrix B feed a gemm node; its output and Matrix C feed a second gemm node]
Programmer provides algorithms and graph structure; runtime does the rest:
Data movement (with asynchrony), multi-GPU scheduling
Works for distributed compute too!
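A minimal sketch of "program == graph", assuming nothing about Dandelion's real runtime: `run_graph` and the graph encoding are hypothetical, and a plain topological sort stands in for the scheduler. The point is that the program names only computations and their data dependencies, never a device or a copy.

```python
# Minimal dataflow sketch: the program is a graph whose nodes are
# computations and whose edges carry data. The "runtime" (run_graph) picks
# an execution order from the edges alone; nothing names a device.
# All names here are hypothetical, for illustration only.
from graphlib import TopologicalSorter


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def run_graph(nodes, inputs):
    """nodes: name -> (function, [names of upstream nodes or inputs])."""
    deps = {name: [d for d in srcs if d in nodes] for name, (_, srcs) in nodes.items()}
    values = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        fn, srcs = nodes[name]
        values[name] = fn(*(values[s] for s in srcs))
    return values


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

graph = {
    "AxB": (matmul, ["A", "B"]),
    "AxBxC": (matmul, ["AxB", "C"]),
}
out = run_graph(graph, {"A": A, "B": B, "C": C})
print(out["AxBxC"])
```

Because the intermediate `AxB` exists only as an edge between two nodes, a runtime is free to keep it on whichever device runs both nodes, or to move it if the nodes land on different devices.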
![Page 17: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/17.jpg)
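Dandelion Architecture (2)
[Diagram: the user program (a LINQ query) passes through the Dandelion Compiler, which produces three graphs:
• cluster graph: run by the Cluster Runtime across a master and slaves, connected via TCP, caches, files
• machine graph: run by the Machine Runtime over vertices, mixing CPU tasks and GPU tasks
• GPU graph: run by the GPU Runtime, built from a primitive library (relational algebra)
Legend: M = Master, S = Slave; separate markers distinguish CPU from GPU.
A callout labeled "This talk (mostly):" marks part of the diagram.]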
![Page 18: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/18.jpg)
What’s a LINQ query?
Language Integrated Query: relational operators on collections
var res = collection
    .Where(x => x.isRed())
    .GroupBy(x => x)
    .Select(x => f(x));
Why focus on LINQ?
• Expresses many important workloads easily: K-Means, PageRank (MR), Sparse Matrix SVD, …
• Powerful: lambdas embed C#/.NET
• Declarative/data-parallel: natural fit for dataflow
• Lambdas in C++11 and Java 8
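For readers who do not know LINQ, here is a rough Python analogue of the Where/GroupBy/Select chain. It is only a sketch of the operators' meaning: `where`, `group_by`, `select`, `is_red` and `f` are hypothetical helpers written for this example, and real LINQ groups are richer objects than the `(key, members)` pairs used here.

```python
# Rough Python analogue of collection.Where(...).GroupBy(...).Select(...).
# Assumptions: is_red and f are hypothetical stand-ins for the slide's
# x.isRed() and f(x); groups are (key, members) pairs in first-seen order.

def where(items, pred):
    return (x for x in items if pred(x))


def group_by(items, key):
    groups = {}
    for x in items:
        groups.setdefault(key(x), []).append(x)
    return groups.items()


def select(items, fn):
    return [fn(x) for x in items]


def is_red(x):
    return x.startswith("red")


def f(group):
    key, members = group
    return (key, len(members))


collection = ["red apple", "green pear", "red apple", "red cherry"]
res = select(group_by(where(collection, is_red), key=lambda x: x), f)
print(res)
```

Each operator takes a collection and a small function, which is what makes the chain declarative: it says what to compute per element or per group, not how to loop.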
![Page 19: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/19.jpg)
Running Example: K-Means
Partition n points into k clusters
Pick k initial centers
while (not done) {
    1. Assign each point to its nearest center
    2. Each new center = mean(points assigned to the old center)
}
![Page 20: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/20.jpg)
Running Example: K-Means
centers = points
    .GroupBy(point => NearestCenter(point, centers))
    .Select(g => g.Aggregate((x, y) => x+y)/g.Count());
Step 1: Group points by nearest cluster center
Step 2: Each new cluster center = average of points in a group
Callouts on the query: "simple mapping to GPU"; "GPU implementation non-obvious"
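The two steps can be sketched in plain Python to show what the GroupBy/Select query computes. This is an illustration only, assuming points are tuples of floats and squared Euclidean distance; the helper names are made up for the example and are not Dandelion APIs.

```python
# One k-means iteration written as "group by nearest center, then average
# each group", mirroring the GroupBy/Select query. Points are plain tuples;
# the helper names are illustrative, not Dandelion APIs.

def nearest_center(point, centers):
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def kmeans_step(points, centers):
    groups = {}
    for point in points:                                # Step 1: GroupBy
        groups.setdefault(nearest_center(point, centers), []).append(point)
    new_centers = []
    for _, members in sorted(groups.items()):           # Step 2: Select
        total = [sum(coords) for coords in zip(*members)]   # Aggregate (x + y)
        new_centers.append(tuple(t / len(members) for t in total))
    return new_centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_step(points, centers)
print(new_centers)
```

As in the query, a center that attracts no points produces no group, so it simply drops out of the result.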
![Page 21: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/21.jpg)
GroupBy
Group a collection by key
Lambda function maps elements → key
var res = ints.GroupBy(x => x);
ints: 10 30 20 10 20 30 10
res: 10 10 10 | 20 20 | 30 30
foreach(T elem in ints)
{
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}
foreach(T elem in PF(ints))
{
    key = KeyLambda(elem);
    group = GetGroup(key);
    group.Add(elem);
}
![Page 22: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/22.jpg)
Background: GPU Architecture
[Diagram: a kernel is divided into thread blocks, each made of threads Thread(0, 0) … Thread(31, 0), Thread(0, 1) … Thread(31, 1), …; the thread blocks are distributed over a device with 4 SMs (SM 0 to SM 3)]
SM = Streaming Multiprocessor
Wide SIMD (vector) machine: SMs
• code == kernels, 1000s of threads
• explicit subdivision: blocks
• model: all threads run in parallel
• HW maps subsets (warps) to SMs
• warps: concurrent, divergent control flow serialized
• schedule non-deterministic
• locks problematic, despite atomic ops
• exposed u-arch features/warts
• e.g. software-managed caches
• 1st order performance impact
foreach(T elem in PF(ints))
{
key = KeyLambda(elem);
group = GetGroup(key);
group.Add(elem);
}
![Page 23: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/23.jpg)
GPU GroupBy
Process each input element in parallel
• grouping ~ shuffling: input item → output offset s.t. groups are contiguous
• output offset = group offset + item number
• … but how to get the group offset, item number?
ints: 10 30 20 10 20 30 10
res: 10 10 10 | 20 20 | 30 30
Needed:
• Number of groups and input → group mapping
• Number of elements in each group
• Start index of each group in the output sequence
![Page 24: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/24.jpg)
GPU GroupBy: Multiple Stages
Input: 10 30 20 10 20 30 10
1. Assign group IDs: GPU lock-free hash table; hash table lookup gives the group ID
2. Compute group sizes: uses atomic increment
3. Compute start indices: prefix sum of group sizes
4. Write outputs: write to output location, uses atomic increment
Key:               10 20 30
Group ID:           0  1  2
Group Size:         3  2  2
Group Start Index:  0  3  5
Output: 10 10 10 | 20 20 | 30 30
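The four stages can be traced with a sequential sketch. It assumes nothing about the actual GPU kernels: each loop iteration represents the work of one GPU thread, the `+= 1` updates stand in for atomic increments, and a Python dict stands in for the lock-free hash table.

```python
# Sequential sketch of the four stages. Each loop body is the work one GPU
# thread would do for one input element; the "+= 1" updates stand in for
# atomic increments. A dict stands in for the lock-free hash table.

def staged_group_by(items, key_fn):
    # Stage 1: assign a dense group ID to each distinct key.
    group_id = {}
    for x in items:
        group_id.setdefault(key_fn(x), len(group_id))

    # Stage 2: count the elements in each group.
    sizes = [0] * len(group_id)
    for x in items:
        sizes[group_id[key_fn(x)]] += 1

    # Stage 3: exclusive prefix sum of sizes gives each group's start index.
    starts, running = [], 0
    for s in sizes:
        starts.append(running)
        running += s

    # Stage 4: scatter each element to start + (slot within its group).
    out = [None] * len(items)
    cursor = list(starts)
    for x in items:
        g = group_id[key_fn(x)]
        out[cursor[g]] = x
        cursor[g] += 1
    return sizes, starts, out


sizes, starts, out = staged_group_by([10, 30, 20, 10, 20, 30, 10], lambda x: x)
print(sizes, starts, out)
```

This sketch hands out group IDs in first-seen order (10, 30, 20), so the 30s land before the 20s in the output, unlike the slide's ordering. The sizes (3, 2, 2) and start indices (0, 3, 5) come out the same, and either order satisfies the requirement that groups be contiguous.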
![Page 25: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/25.jpg)
GPU GroupBy: Multiple Stages
Assign group IDs → Compute group sizes → Compute start indices → Write outputs
• User types/functions not needed at every step
• The dataflow is abstract → generic primitives
![Page 26: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/26.jpg)
Composed Generic Primitives
GroupBy<K,T,keyfn,eqfn> is composed of one primitive per stage:
Assign group IDs:      buildHT<K,T,keyfn,eqfn>
Compute group sizes:   groupsizes
Compute start indices: prefixsum
Write outputs:         shuffle<K,T,keyfn>
Compile time: how to cross-compile/marshal these?
How to build a LINQ → GPU compiler: repeat this process for all LINQ operators
![Page 27: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/27.jpg)
Compiling C# → GPU code
int NearestCenter(Vector point, IEnumerable<Vector> centers) {
    int minIndex = 0, curIndex = 0;
    double minValue = Double.MaxValue;
    foreach (Vector center in centers) {
        double curValue = (center - point).Norm2();
        minIndex = (minValue > curValue) ? curIndex : minIndex;
        minValue = (minValue > curValue) ? curValue : minValue;
        curIndex++;
    }
    return minIndex;
}
centers = points
    .GroupBy(pnt => NearestCenter(pnt, centers))
    .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
Marshalling for user types:
1. Decide GPU-side layout
2. Generate serialization code
Also cross-compile all referenced functions
![Page 28: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/28.jpg)
Compiling C# → GPU code
Translation performed at .NET byte-code (‘CIL’) level
• Map C# types to CUDA structs
• Translate C# methods into CUDA kernel functions
• Generate C# code for CPU-GPU serialization/transfer
Main constraint: dynamic memory allocation
• Convert to stack allocation if object size can be inferred
• Fail parallelization, fall back to host otherwise
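The marshalling step (decide a layout, generate serialization code) can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the fixed dimension `DIM`, the little-endian float32 layout, and the helper names. The slides do not specify Dandelion's actual layout or generated code.

```python
# Marshalling sketch: a user type with a statically known size is flattened
# into a fixed-layout buffer (what a GPU-side struct would see) and rebuilt
# on the way back. DIM and the float32 layout are assumptions chosen for the
# example; the slides do not specify Dandelion's actual layout.
import struct

DIM = 3                              # vector length known at "compile time"
LAYOUT = struct.Struct(f"<{DIM}f")   # DIM little-endian float32 values


def serialize(vectors):
    buf = bytearray(LAYOUT.size * len(vectors))
    for i, v in enumerate(vectors):
        LAYOUT.pack_into(buf, i * LAYOUT.size, *v)
    return bytes(buf)


def deserialize(buf):
    return [LAYOUT.unpack_from(buf, off) for off in range(0, len(buf), LAYOUT.size)]


points = [(1.0, 2.0, 3.0), (0.5, 0.25, 0.125)]
wire = serialize(points)
print(len(wire), deserialize(wire))
```

The sketch only works because the element size is known up front, which mirrors the slide's constraint: objects whose size cannot be inferred cannot be given a fixed device-side layout.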
![Page 29: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/29.jpg)
Generated CUDA Kernel Code
__device__ __host__ int NearestCenter_Kernel(KernelStruct_0 point, KernelStruct_0 *centers, int centers_n) {
KernelStruct_0 local_6;
int local_0 = 0;
double local_1 = 1.79769313486232E+308;
int local_2 = 0;
int centers_n_idx = -1;
goto IL_0041;
{
IL_0018:
KernelStruct_0 local_3 = centers[centers_n_idx];
local_6 = op_Subtraction_Kernel(local_3, point);
double local_4 = ((double)(Norm2_Kernel(local_6)));
if (((local_1) > (local_4))) {
local_1 = local_4;
local_0 = local_2;
}
local_2 = ((local_2) + (1));
IL_0041:
if (((++centers_n_idx) < centers_n)) {
goto IL_0018;
}
goto IL_0058;
}
IL_0058:
return local_0;
}
C# source: NearestCenter, as shown on the earlier "Compiling C#" slide.
struct KernelStruct_0 {
    float arr[N];
    __device__ int GetLength() { return N; }
};
![Page 30: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/30.jpg)
Leveraging lazy evaluation
void KMeans(IQueryable<Vector> points,
            IQueryable<Vector> centers) {
    var newCenters =
        points.GroupBy(point => NearestCenter(point, centers))
              .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
    ... // other stuff
    foreach (Vector center in newCenters) {
        do_something(center);
    }
}
newCenters is an expression tree: GroupBy → Select
Dandelion invoked (on enumeration in the foreach):
1. load binary, find IL
2. generate C#, CUDA
3. compile *.dll, *.ptx
4. build dataflow graphs
5. deploy bin, graphs
…
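Deferred execution is what gives a system like this its hook: building the query only records operators, and the work starts when the result is enumerated. The sketch below illustrates that idea with a hypothetical `Query` class; it is not LINQ's or Dandelion's machinery, and it interprets the recorded operators where a real system would compile and deploy them.

```python
# Deferred-execution sketch: building the query only records an expression
# tree; work happens when the result is enumerated. Query is a hypothetical
# class for illustration, not LINQ's or Dandelion's real machinery.

class Query:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = ops                  # recorded operators, outermost last
        self.executed = False

    def group_by(self, key_fn):
        return Query(self.source, self.ops + (("GroupBy", key_fn),))

    def select(self, fn):
        return Query(self.source, self.ops + (("Select", fn),))

    def tree(self):
        return [name for name, _ in self.ops]

    def __iter__(self):
        # Enumeration is the trigger: a real system would compile and
        # deploy here; this sketch just interprets the recorded operators.
        self.executed = True
        data = list(self.source)
        for name, fn in self.ops:
            if name == "GroupBy":
                groups = {}
                for x in data:
                    groups.setdefault(fn(x), []).append(x)
                data = list(groups.values())
            else:
                data = [fn(x) for x in data]
        return iter(data)


q = Query([1, 2, 3, 4, 5]).group_by(lambda x: x % 2).select(lambda g: sum(g) / len(g))
print(q.tree(), q.executed)   # the tree exists, but nothing has run yet
print(list(q), q.executed)    # enumeration runs the recorded operators
```

Because the whole operator chain is visible before anything runs, the trigger point can inspect it as a unit, which is what makes whole-query compilation possible.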
![Page 31: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/31.jpg)
K-Means Dataflow Graphs
[Diagram: cluster graph with 10 x vector-partition inputs and centers (through a Tee) feeding 10 x GroupBy, drawn as a row of ten G1 vertices and a row of ten G2 vertices, followed by a merge that produces new_centers. A GroupBy vertex is expanded into its machine graph/GPU graph.]
![Page 32: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/32.jpg)
Evaluation
Programmability
Performance: single-machine & cluster
Benchmarks: kmeans, pagerank, terasort, skyserver
Black-Scholes, ID3 decision trees, BM25F (local-only)
Platform: 10-machine cluster
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit
• Mellanox ConnectX-3 10 Gigabit Ethernet
![Page 33: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/33.jpg)
K-Means in C#
class KMeans {
    int NearestCenter(Vector point, IEnumerable<Vector> centers) {
        int minIndex = 0, curIndex = 0;
        double minValue = Double.MaxValue;
        foreach (Vector center in centers) {
            double curValue = (center - point).Norm2();
            minIndex = (minValue > curValue) ? curIndex : minIndex;
            minValue = (minValue > curValue) ? curValue : minValue;
            curIndex++;
        }
        return minIndex;
    }
    IQueryable<Vector> Steps(int nSteps, IQueryable<Vector> points, IQueryable<Vector> centers) {
        for (int i = 0; i < nSteps; i++)
            centers = points
                .GroupBy(point => NearestCenter(point, centers))
                .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
        return centers;
    }
    IQueryable<Vector> KMeans() {
        IQueryable<Vector> points = new Vector[N];
        IQueryable<Vector> centers = new Vector[K];
        return Steps(s, points, centers);
    }
}
![Page 34: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/34.jpg)
K-Means in Dandelion
class KMeans {
    // NearestCenter and Steps: unchanged from the C# version
    IQueryable<Vector> KMeans() {
        IQueryable<Vector> points = new Vector[N].AsDandelion();
        IQueryable<Vector> centers = new Vector[K].AsDandelion();
        return Steps(s, points, centers);
    }
}
![Page 35: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/35.jpg)
K-Means Shootout
[Chart: speedup over sequential C++ (log scale, 0.1 to 1000) and SLOC (0 to 1000) for each implementation]
• Speedup: log-scale, higher is better
• SLOC: lower is better
• Other input sizes similar
Platform: single machine
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit
Callouts on the chart:
• low SLOC but slow; 24 threads only 7x
• fast, complex; expertise required
• 20X SLOC reduction; ~17X speedup v. seq.; 2..7X slower v. hand opt.
![Page 36: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/36.jpg)
Single-machine performance
[Chart: speedup over sequential LINQ/CPU (0 to 20); series: LINQ-seq, Multi-thread CPU, GPU]
• Higher is better
• Other input sizes: same trends
Platform:
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit
Callouts on the chart:
• 15-20X v. seq, 2x v. 24 cpus; high compute:data ratio
• pyrrhic victory: ~2X v. seq, 1X v. 24 cpus; low arithmetic intensity
![Page 37: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/37.jpg)
Cluster performance
[Chart: speedup over 1 thread/node x 10 nodes (log scale, 1 to 100) for kmeans, pagerank, skyserver, terasort; series: Multi-thread CPU, GPU]
• Speedup is log-scale, higher is better
• Larger inputs for cluster
10 machines:
• NVIDIA Tesla K20m, 5GB GDDR5
• 2 Xeon E5 2.3GHz, 24 hw threads
• 256 GB RAM; L1: 32K I + 32K D, 256K L2, 15M L3
• Windows Server 2008 R2 64-bit
• Mellanox ConnectX-3 10 Gigabit Ethernet
Callouts on the chart:
• 66X v. 1 cpu/node; 4X v. 24 cpus/node; data streamable
• intermediate data > GPU mem; GPU runtime thrashing
• dist. overheads narrow GPU v. CPU gap
![Page 38: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/38.jpg)
Related Work
LINQits: Dandelion compiler with FPGA backend [ISCA '13]
GPU programming models/cross-compilation
• Delite [Chafi, Brown '11], Liszt [DeVito '11], Halide [Ragan-Kelley '13], Legion [Bauer '12], OptiML [Sujeeth '11]
• Accelerator [ASPLOS '06], Amp/C++, CUDA, OpenCL
• StreamIt CUDA [CGO '09, LCTES '09], Flextream [Hormati '09], Lime [Auerbach '10]
• Copperhead [Catanzaro '11], JCUDA [Yan '09], Rootbeer [Pratt-Szeliga '12], pycuda [Kloeckner '12]
• Jacket, MATLAB CUDA compiler [Prasad '11]
GPU scheduling/GPU engines
• TimeGraph [Kato '11], Maestro [Spafford '10], Pegasus [Gupta '11], StarPU [Augonnet], Merge [Linderman '08]
Graph-based programming models
• Synthesis [Massalin '89], Monsoon/Id [Arvind], Dryad [Isard '07]
• StreamIt [Thies '02], DirectShow, TCP Offload [Currid '04]
• PTask [Rossbach '11], PipesFS [de Bruijn '08], FFPF [Bos '04], Ruler [Hruby '07]
Relational algebra on GPUs [He '08, He '09, Govindaraju '05], Thrust
More… please see paper
![Page 39: Dandelion: A Unified Programming Model for GPU Clusterson-demand.gputechconf.com/gtc/2014/presentations/S4221-dandelion... · A UNIFIED PROGRAMMING MODEL FOR GPU CLUSTERS ... Device](https://reader031.vdocuments.net/reader031/viewer/2022022010/5b008c1d7f8b9a952f8d113b/html5/thumbnails/39.jpg)
Conclusion
Dandelion
High-level abstractions for heterogeneous systems
Improved programmability
Current results promising, incomplete
Future work:
Query planning, scheduling, applications
Support more accelerators/architectures
Move beyond LINQ
Dataflow: an important key
Enables composition of multiple runtimes
Thank you! Questions?