TRANSCRIPT
PTASK + DANDELION: DATA-FLOW PROGRAMMING SUPPORT FOR HETEROGENEOUS PLATFORMS
Chris Rossbach and Jon Currey Microsoft Research Silicon Valley NVIDIA GTC 5/17/2012
Motivation/Overview
GPU programming is super-important (dissenters: are you at the right conference?)
still difficult despite amazing progress
Technology stacks (at least partly) to blame:
OS thinks GPU is I/O device
Data movement + algorithms coupled
High level language support too low level
PTask+Dandelion: NVIDIA GTC 5/17/2012 2
1. These properties limit GPUs
2. OS-level abstractions are needed
3. OS support enables better language support
The case for OS support
The case for dataflow abstractions
PTask: better OS support for GPUs
Dandelion: language+runtime support
Related and Future Work
Conclusion
Outline
Layered abstractions + OS support
[Diagram: applications → user-mode runtimes/libs (LIBC/CLR), which expose the programmer-visible interface → OS-level abstractions (process, files, pipes), which form the OS interface → HAL and vendor drivers over diverse hardware (CPU, I/O devices, disk, NIC), the hardware interface.]
* 1:1 correspondence between OS-level and user-level abstractions
* Diverse HW support enabled by the HAL
GPU abstractions
[Diagram: applications → GPGPU APIs, shaders/kernels, language integration → GPU runtime (CUDA, DirectX, OpenCL), all user-mode runtime support → vendor-specific driver, reached via ioctl/mmap — the one OS-level abstraction! — → GPU hardware interface.]
1. No kernel-facing API
2. Too much runtime support in the OS
3. Poor composability
Fat driver, proprietary interfaces
No OS support, no isolation
• Image convolution in CUDA
• Windows 7 x64, 8GB RAM
• Intel Core 2 Quad 2.66GHz
• NVIDIA GeForce GT230
[Chart: GPU benchmark throughput (higher is better), no CPU load vs. high CPU load — throughput drops sharply under CPU load.]
CPU+GPU schedulers not integrated! …other pathologies abundant
Composition: Gestural Interface
[Pipeline: capture camera images → geometric transformation (xform) → noise filtering → detect gestures. Raw images become a noisy point cloud, and "Hand" events come out the end. High data rates throughout.]
Data-parallel algorithms … a good fit for the GPU
What We’d Like To Do
#> capture | xform | filter | detect &
Modular design → flexibility, reuse
Utilize heterogeneous hardware:
  Data-parallel components → GPU
  Sequential components → CPU
Using OS-provided tools: processes, pipes
GPU Execution model
GPUs cannot run the OS: different ISA
Memories are disjoint, or have different coherence guarantees
The host CPU must "manage" GPU execution:
  program inputs explicitly transferred/bound at runtime
  device buffers pre-allocated
[Diagram: CPU/main memory ↔ GPU/GPU memory — copy inputs, send commands, copy outputs]
User-mode apps must implement all of this
#> capture | xform | filter | detect &
[Diagram: every stage crosses the user/kernel boundary repeatedly — capture: read() through HIDdrv/camdrv IRPs; xform: write()/read() plus copy-to-GPU and copy-from-GPU over PCI transfers; filter: the same again; detect: write()/read(). The OS executive and GPU driver mediate every hop, so data migrates through main memory between every pair of stages.]
The case for OS support
The case for dataflow abstractions
PTask: Dataflow for GPUs
Dandelion: LINQ on GPUs
Related and Future Work
Conclusion
Outline
Why dataflow?
What happens if I want the following? Matrix D = A x B x C

Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
Composed matrix multiplication

Matrix gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);    // AxB copied from GPU memory…
    return C;
}

Matrix AxBxC(Matrix A, B, C) {
    Matrix AxB = gemm(A,B);
    Matrix AxBxC = gemm(AxB,C);    // …only to be copied right back!
    return AxBxC;
}

We need different abstractions that help get around this…
Dataflow: the runtime manages data movement
  nodes = computation
  edges = communication
Leaves flexibility for the runtime:
  minimal specification of data movement
  asynchrony is a runtime concern (not a programmer concern)
[Graph: Matrix A and Matrix B feed gemm; its result and Matrix C feed a second gemm.]
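The idea can be sketched in a few lines of Python (a toy illustration, not the PTask API — `Node`, `Graph`, and the stand-in `gemm` are ours): the program only declares the graph, and the runtime decides when values cross a memory boundary, so the AxB intermediate is never copied back between the two multiplies.

```python
# Minimal dataflow sketch: nodes are computation, edges are values.
# "Device-resident" intermediates are simulated by the memo table.

class Node:
    def __init__(self, fn, inputs):
        self.fn, self.inputs = fn, inputs

class Graph:
    def run(self, node, memo=None):
        # Evaluate the graph; intermediates stay in memo ("on device")
        # instead of being copied out after every node.
        memo = {} if memo is None else memo
        if id(node) in memo:
            return memo[id(node)]
        args = [self.run(n, memo) if isinstance(n, Node) else n
                for n in node.inputs]
        memo[id(node)] = node.fn(*args)
        return memo[id(node)]

def gemm(a, b):
    # Stand-in for a GPU matrix multiply (2x2 matrices as nested lists).
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 0], [0, 1]]; B = [[1, 2], [3, 4]]; C = [[2, 0], [0, 2]]
ab  = Node(gemm, [A, B])      # A x B
abc = Node(gemm, [ab, C])     # (A x B) x C — ab never leaves the runtime
result = Graph().run(abc)     # [[2, 4], [6, 8]]
```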
The case for OS support
The case for dataflow abstractions
PTask: Dataflow for GPUs
Dandelion: LINQ on GPUs
Related and Future Work
Conclusion
Outline
PTask OS abstractions: dataflow!
ptask (parallel task): analogous to a process, for GPU execution; has priority for fairness; list of input/output resources (e.g. stdin, stdout…)
ports: a data source or sink; can be mapped to ptask inputs/outputs
channels: similar to pipes; connect arbitrary ports; specialize to eliminate double-buffering
graph: directed, connected ptasks, ports, channels
datablocks: memory-space-transparent buffers
• data: specify where, not how
• OS objects → OS resource management becomes possible
PTask Graph: Gestural Interface
#> capture | xform | filter | detect &
[Graph: capture (CPU process) → raw img channel → xform (GPU ptask) → cloud channel → filter (GPU ptask) → f-out/f-in channels → detect (CPU process); ports connect each ptask to its channels.]
Optimized data movement: raw img arrives in mapped memory; intermediates stay in GPU memory
Data arrival triggers computation (datablocks flow through the graph)
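The graph above can be expressed with a toy Python mini-API (hypothetical names — the real PTask interface is a native systems API): tasks expose named ports, channels connect an output port to an input port, and the routing is declared once instead of hand-coded per stage.

```python
# Sketch of the graph abstractions: Task, Port, Channel are our own
# illustrative classes, not the PTask API.

class Port:
    def __init__(self, name):
        self.name, self.channel = name, None

class Channel:
    def __init__(self, src, dst):
        self.src, self.dst = src, dst
        src.channel = dst.channel = self

class Task:
    def __init__(self, name, ins, outs):
        self.name = name
        self.inputs = {p: Port(p) for p in ins}
        self.outputs = {p: Port(p) for p in outs}

# capture | xform | filter | detect, as a graph
capture = Task("capture", [], ["raw_img"])
xform   = Task("xform",   ["raw_img"], ["cloud"])
filt    = Task("filter",  ["cloud"],   ["f_out"])
detect  = Task("detect",  ["f_in"],    [])

graph = [
    Channel(capture.outputs["raw_img"], xform.inputs["raw_img"]),
    Channel(xform.outputs["cloud"],     filt.inputs["cloud"]),
    Channel(filt.outputs["f_out"],      detect.inputs["f_in"]),
]
```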
Location Transparency: Datablocks
Logical buffer backed by multiple physical buffers (main memory, GPU 0 memory, GPU 1 memory, …)
  buffers created/updated lazily
  mem-mapping used to share across process boundaries
Track buffer validity per memory space; writes invalidate other views
Flags for access control/data placement
Enables OS-mediated IPC through GPU memory

Datablock:
space  V  M  RW  data
main   1  1  1   1
gpu0   0  1  1   0
gpu1   1  1  1   0
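The validity protocol described above can be sketched in Python (illustrative names, not PTask's): each datablock keeps one view per memory space with a valid bit; a write in one space invalidates all other views, and a read in a space whose view is stale triggers a lazy transfer.

```python
# Per-memory-space validity tracking, as described on the slide:
# write -> invalidate other views; read -> lazy transfer on demand.

class Datablock:
    def __init__(self, spaces=("main", "gpu0", "gpu1")):
        self.valid = {s: False for s in spaces}
        self.data = {s: None for s in spaces}

    def write(self, space, value):
        self.data[space] = value
        for s in self.valid:              # writes invalidate other views
            self.valid[s] = (s == space)

    def read(self, space):
        if not self.valid[space]:         # lazily materialize this view
            src = next(s for s, v in self.valid.items() if v)
            self.data[space] = self.data[src]
            self.valid[space] = True
        return self.data[space]

db = Datablock()
db.write("gpu0", [1, 2, 3])   # only the gpu0 view is valid now
x = db.read("main")           # transfers gpu0 -> main; both views valid
```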
Datablock Action Zone
#> capture | xform | filter …
[Animation: datablocks (raw img, cloud) flow from the capture process through ports and channels into the xform and filter ptasks. Each datablock starts with all state bits clear:

space  V  M  RW  data
main   0  0  0   0
gpu    0  0  0   0

and the per-space valid/map/read-write bits flip to 1 as buffers are materialized and views become valid along the way.]
The case for OS support
The case for dataflow abstractions
PTask: Dataflow for GPUs
Some numbers
Dandelion: LINQ on GPUs
Related and Future Work
Conclusion
Outline
Implementation
Windows 7
Two PTask API implementations:
1. Stacked UMDF/KMDF driver
   Kernel component: mem-mapping, signaling
   User component: wraps DirectX, CUDA, OpenCL
   syscalls → DeviceIoControl() calls
2. User-mode library
Gestural Interface evaluation
Implementations:
  pipes: capture | xform | filter | detect
  modular: capture+xform+filter+detect, 1 process
  handcode: data movement optimized, 1 process
  ptask: ptask graph
Configurations:
  real-time: driven by cameras
  unconstrained: driven by in-memory playback
Gestural Interface Performance
[Chart: runtime, user, and sys time for handcode, modular, pipes, and ptask, relative to handcode (0-3.5x); lower is better.]
• Windows 7 x64, 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)
Compared to pipes:
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage
Compared to handcode:
• 11.6% higher throughput
• lower CPU utilization: no driver program
PTask schedules GPUs like CPUs
PTask provides performance isolation
Locality-aware scheduling
PTask transparently uses multiple GPUs
Proof-of-concept in Linux
Generality: micro-benchmarks
…
Things I didn’t show you
The case for OS support
The case for dataflow abstractions
PTask: better OS support for GPUs
Dandelion: language+runtime support
Related and Future Work
Conclusion
Outline
Dandelion Goal (holy grail)
A single programming interface for heterogeneous platforms — clusters comprising:
  CPUs
  GPUs
  FPGAs
  you name it…
Programmer: writes sequential code once
Runtime: partitions it, runs it in parallel on available resources
…achievable?
Dandelion goal
Offload data-parallel code fragments
Small cluster of multi-core + multi-GPU
Starting point:
LINQ queries
(a less holy and attractive vessel: usually just as effective, depending on use case)
Wait! What's a LINQ Query?
Language INtegrated Query
Relational operators over objects:

var res = collection
    .Where(c => c.hasSomeProperty())
    .Select(c => new {c, quack});

Why is LINQ important?
  Expresses many important workloads easily: K-Means, PageRank, Sparse Matrix SVD, …
  LINQ queries are data-parallel — a natural fit for dataflow
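For readers less familiar with LINQ, the query above maps directly onto a filter and a map; here is the equivalent in plain Python (`hasSomeProperty` and `quack` are placeholder names from the slide, modeled on a small record type of our own).

```python
# Where == filter, Select == projection/map. Duck is an illustrative
# record type standing in for the slide's element type.

class Duck:
    def __init__(self, name, noisy):
        self.name, self.noisy = name, noisy
    def hasSomeProperty(self):
        return self.noisy
    @property
    def quack(self):
        return f"{self.name} says quack"

collection = [Duck("a", True), Duck("b", False), Duck("c", True)]

# var res = collection.Where(c => c.hasSomeProperty())
#                     .Select(c => new {c, quack});
res = [(c.name, c.quack) for c in collection if c.hasSomeProperty()]
# res == [("a", "a says quack"), ("c", "c says quack")]
```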
Dandelion Overview
[Diagram: a LINQ query (or other parallel code) becomes a distributed graph; the distributed graph management runtime places vertices onto machines with CPUs and GPUs, each running a local PTask graph of CPU tasks and GPU tasks.]
Distribution by DryadLINQ — c.f. MapReduce/Apache Hadoop
Dandelion: Local View
[Diagram: on one machine with CPUs and GPUs, the Dandelion compiler turns the LINQ query (or other parallel code) into a dataflow graph executed by the PTask runtime.]
Constructs dataflow graphs using templated 'PTask primitives'
Generates GPU code for user types and functions
Example: K-Means Clustering
Partition n points into k clusters
Step 0: Pick k initial cluster centers
Step 1: Assign each point to the cluster with the nearest center
Step 2: Recompute the centers (calculate cluster means)
Iterate steps 1 & 2 until stable

centers = points
    .GroupBy(point => NearestCenter(point, centers))
    .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
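One iteration of the query above can be sketched in plain Python (our own illustration with 2-D points as tuples, not Dandelion-generated code): group points by nearest center, then average each group.

```python
# Mirrors the LINQ query: GroupBy(NearestCenter) then Select(mean).

def nearest_center(point, centers):
    return min(range(len(centers)),
               key=lambda i: sum((c - p) ** 2
                                 for c, p in zip(centers[i], point)))

def iteration(points, centers):
    groups = {}
    for p in points:                      # GroupBy(NearestCenter)
        groups.setdefault(nearest_center(p, centers), []).append(p)
    return [tuple(sum(xs) / len(g) for xs in zip(*g))  # Select(mean)
            for g in groups.values()]

points  = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0)]
centers = [(1.0, 1.0), (3.0, 3.0)]
new_centers = iteration(points, centers)  # ≈ [(0.1, 0.0), (4.1, 4.0)]
```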
GroupBy
Groups an input sequence by key
A custom function maps input elements → key

List<Group<int>> res = ints.GroupBy(x => x);

ints: 10 30 20 10 20 30 10
res:  10 10 10 | 20 20 | 30 30
Consider a CPU Implementation

foreach (T elem in InputSequence) {
    key = KeySelector(elem);
    group = GetGroup(key);
    group.Add(elem);
}

• Mapping to GPUs is not obvious
  – How to assign work across 1000s of threads?
  – Synchronizing group creation is problematic
  – "Append" is problematic
Parallel GroupBy
Process each input element in parallel; grouping ~ shuffling
  item output offset = group offset + item number
… but how to get the group offset? We need:
  the number of groups, and integer group IDs
  the number of elements in each group
  the start index of each group in the output sequence
GPU GroupBy: Multiple Passes
Uses a lock-free hash table
Input: 10 30 20 10 20 30 10

Pass 1 — Assign group IDs (look up the hash table to get each element's group ID):
  GroupID:           0   1   2   (for keys 10, 20, 30)
Pass 2 — Compute group sizes (atomically increment group counters):
  Group Size:        3   2   2
Pass 3 — Compute start indices (exclusive prefix sum on group counts):
  Group Start Index: 0   3   5
Pass 4 — Write outputs (each element written to its correct location, using atomic increment):
  Output: 10 10 10 20 20 30 30
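The four passes can be sketched sequentially in Python (on the GPU each loop body runs as one thread per element; the dict stands in for the lock-free hash table and `+=` for the atomic increments). Note that group order falls out of hash-table insertion order, so our sketch orders groups by first occurrence rather than by key as in the slide's example.

```python
# Multi-pass GroupBy: IDs, counts, exclusive prefix sum, scatter.

ints = [10, 30, 20, 10, 20, 30, 10]

# Pass 1: assign group IDs (hash-table lookup/insert)
group_id, ids = {}, []
for x in ints:
    ids.append(group_id.setdefault(x, len(group_id)))

# Pass 2: compute group sizes (atomicAdd per group on the GPU)
sizes = [0] * len(group_id)
for g in ids:
    sizes[g] += 1

# Pass 3: exclusive prefix sum over sizes -> group start indices
starts, total = [], 0
for s in sizes:
    starts.append(total)
    total += s

# Pass 4: scatter each element to start[group] + slot (atomicInc)
out = [None] * len(ints)
cursor = starts[:]
for x, g in zip(ints, ids):
    out[cursor[g]] = x
    cursor[g] += 1

# out == [10, 10, 10, 30, 30, 20, 20]  (groups contiguous,
# ordered by first occurrence: 10, 30, 20)
```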
GroupBy As Composed Primitives
[The same multi-pass pipeline, expressed as a graph of templated primitives:
  buildGroupKeyMap → countGroupMembers → prefixSum → shuffle
together forming the GroupBy primitive.]
K-Means Dandelion graph
[Graph variants: Simple GroupBy vs. Streaming GroupBy]
K-Means in C#/LINQ

class KMeans {
int NearestCenter(Vector point, IEnumerable<Vector> centers) {
int minIndex = 0, curIndex = 0;
double minValue = Double.MaxValue;
foreach (Vector center in centers) {
double curValue = (center - point).Norm2();
minIndex = (minValue > curValue) ? curIndex : minIndex;
minValue = (minValue > curValue) ? curValue : minValue;
curIndex++;
}
return minIndex;
}
IQueryable<Vector> Iteration(IQueryable<Vector> points,
IQueryable<Vector> centers) {
return points
.GroupBy(point => NearestCenter(point, centers))
.Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}
}
K-Means in Dandelion
[Exactly the same code as the previous slide — unchanged]
This is kind of a big deal! • managed sequential code • running on GPUs • zero programmer effort
K-Means Performance
[Chart: execution time (s, 0-18) vs. number of points (millions, 0-7) for K-Means with 40 centers — Streaming (GPU) vs. CPU.]
23x faster than single-threaded CPU … and we think we can do even better
• 4-core Intel Xeon W3550 @ 3.07 GHz, 6GB RAM
• NVIDIA GTX580 GPU with 1.5GB GDDR
• PTask on NVIDIA CUDA version 4.0
PageRank in Dandelion
[Graph: Streaming Join → Streaming GroupBy]

public struct Rank { public UInt64 PageID; public UInt64 PageRank; … }
public struct Page { public UInt64 PageID; public UInt64 LinkedPageID; int numLinkedPages; … }

// Join Pages with Ranks, and disperse updates
var partialRanks = from rank in ranks
                   join page in pages on rank.PageID equals page.PageID
                   select new Rank(page.LinkedPageID, rank.PageRank / page.NumLinks);

// Re-accumulate Ranks
var newRanks = from partialRank in partialRanks
               group partialRank.PageRank by partialRank.PageID into g
               select new Rank(g.Key, g.Sum());
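The join-then-group step above can be sketched in plain Python (our own dict-based representation, and no damping factor, matching the simple disperse/accumulate query).

```python
# One PageRank step: disperse each page's rank over its outlinks
# (the join), then sum the partial ranks per page (the group-by).
# pages maps PageID -> list of linked PageIDs; ranks maps PageID -> rank.

def pagerank_step(pages, ranks):
    # Join pages with ranks, dispersing rank across each page's links
    partial = [(dst, ranks[src] / len(links))
               for src, links in pages.items()
               for dst in links]
    # Re-accumulate: group partial ranks by PageID and sum
    new_ranks = {}
    for page_id, r in partial:
        new_ranks[page_id] = new_ranks.get(page_id, 0.0) + r
    return new_ranks

pages = {1: [2, 3], 2: [3], 3: [1]}
ranks = {1: 1.0, 2: 1.0, 3: 1.0}
out = pagerank_step(pages, ranks)
# out == {2: 0.5, 3: 1.5, 1: 1.0}
```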
Status
Dandelion is a research prototype
Working end to end … now add more workloads
Current results promising … but incomplete
Many areas of research remain unexplored
Optimal work partitioning and scheduling
When to expose/hide architectural features
We will need programmer hints sometimes. When?
The case for OS support
The case for dataflow abstractions
PTask: Dataflow for GPUs
Dandelion: LINQ on GPUs
Related and Future Work
Conclusion
Outline
Related Work
Prior work: regular (scientific, HPC) apps
  Accelerator: array computations [ASPLOS '06], C++ AMP
  StreamIt → CUDA [CGO '09, LCTES '09]
  MATLAB → CUDA compiler [PLDI '11]
OS support for heterogeneous platforms:
  Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
GPU scheduling:
  TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models:
  Synthesis [Massalin 89], Monsoon/Id [Arvind], Dryad [Isard 07]
  StreamIt [Thies 02], DirectShow, TCP offload [Currid 04]
Tasking: Tessellation, Apple GCD, …
Conclusion and Future Work
Better abstractions for GPUs are critical:
  enable fairness & priority
  let the OS use the GPU
Dataflow is a good-fit abstraction:
  the system manages data movement
  the performance benefits are significant
  it enables a more flexible technology stack
Proof of concept: LINQ on GPUs
Future work:
  automatic task partitioning across CPU and GPU
  finding the best parallelization for a given LINQ query
Thank you! Questions?