PTASK + DANDELION: DATA-FLOW PROGRAMMING SUPPORT FOR HETEROGENEOUS PLATFORMS Chris Rossbach and Jon Currey Microsoft Research Silicon Valley NVIDIA GTC 5/17/2012


Motivation/Overview

GPU programming is super-important (dissenters: are you at the right conference?)

still difficult despite amazing progress

Technology stacks (at least partly) to blame:

OS thinks GPU is I/O device

Data movement + algorithms coupled

High level language support too low level


1. These properties limit GPUs
2. OS-level abstractions are needed
3. OS support enables better language support

Outline

The case for OS support

The case for dataflow abstractions

PTask: better OS support for GPUs

Dandelion: language+runtime support

Related and Future Work

Conclusion

Layered abstractions + OS support

[Figure: layered system stack, split into user, kernel, and HW levels. Applications reach user-mode runtimes/libs (LIBC/CLR: process, files, pipes) through the programmer-visible interface; the runtimes reach OS-level abstractions (process, files, pipes) through the OS interface; the HAL and vendor drivers reach the hardware (CPU, I/O dev, DISK, NIC) through the hardware interface.]

* 1:1 correspondence between OS-level and user-level abstractions
* Diverse HW support enabled by HAL

GPU abstractions

[Figure: the GPU stack, split into user, kernel, and HW levels. Applications use GPGPU APIs, shaders/kernels, and language integration (programmer-visible interface), provided as runtime support by the GPU runtime (CUDA, DirectX, OpenCL). The runtime reaches the vendor-specific driver through ioctl and mmap; the driver reaches the GPU through the hardware interface.]

1 OS-level abstraction!

1. No kernel-facing API
2. Too much runtime support in OS
3. Poor composability

Fat driver, proprietary interfaces

No OS support → No isolation

[Figure: bar chart, "GPU benchmark throughput" (axis 0 to 1200, higher is better), comparing no CPU load with high CPU load.]

• Image-convolution in CUDA
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230

CPU+GPU schedulers not integrated! …other pathologies abundant

Composition: Gestural Interface

[Figure: pipeline capture → xform → filter → detect. capture (capture camera images) emits raw images; xform (geometric transformation) emits a noisy point cloud; filter (noise filtering) feeds detect (detect gestures), which emits “Hand” events.]

High data rates

Data-parallel algorithms … good fit for GPU

What We’d Like To Do

#> capture | xform | filter | detect &

[Figure labels under the pipeline: two stages on the CPU, two on the GPU]

Modular design
  flexibility, reuse

Utilize heterogeneous hardware
  Data-parallel components → GPU
  Sequential components → CPU

Using OS provided tools
  processes, pipes

GPU Execution model

GPUs cannot run OS: different ISA

Memories disjoint, or have different coherence guarantees

Host CPU must “manage” GPU execution
  Program inputs explicitly transferred/bound at runtime
  Device buffers pre-allocated

[Figure: CPU with main memory and GPU with GPU memory. User-mode apps must implement: copy inputs, send commands, copy outputs.]

Data migration

#> capture | xform | filter | detect &

[Figure: each stage (capture, xform, filter, detect) calls read() and write() through the OS executive. capture reads from camdrv; xform and filter each copy to GPU, run, and copy from GPU through the GPU driver, with a PCI-xfer for every copy; detect sends an IRP to HIDdrv.]


Why dataflow?

What happens if I want the following?

Matrix D = A x B x C

Matrix gemm(Matrix A, Matrix B) {
  copyToGPU(A);
  copyToGPU(B);
  invokeGPU();
  Matrix C = new Matrix();
  copyFromGPU(C);
  return C;
}

Composed matrix multiplication

Matrix AxBxC(Matrix A, Matrix B, Matrix C) {
  Matrix AxB = gemm(A,B);
  Matrix AxBxC = gemm(AxB,C);
  return AxBxC;
}


AxB copied from GPU memory…

…only to be copied right back!

We need different abstractions that help get around this…
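The cost of composing gemm this way can be made concrete with a toy model. The sketch below is only an illustration: FakeGPU and its methods are hypothetical stand-ins that count transfers, not a real GPU API, and the "kernel" is a plain Python matrix product.

```python
# Toy model of the copies made by the composed gemm above.
# FakeGPU and its method names are hypothetical stand-ins, not a real GPU API.

class FakeGPU:
    def __init__(self):
        self.to_device = 0    # host -> GPU copies
        self.from_device = 0  # GPU -> host copies

    def copy_to_gpu(self, matrix):
        self.to_device += 1
        return matrix  # pretend this is now a device buffer

    def copy_from_gpu(self, matrix):
        self.from_device += 1
        return matrix  # pretend this is now a host buffer

    def matmul(self, a, b):
        # Plain-Python matrix product standing in for the GPU kernel.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


def gemm(gpu, a, b):
    """Self-contained gemm: every call moves inputs in and the result out."""
    da = gpu.copy_to_gpu(a)
    db = gpu.copy_to_gpu(b)
    return gpu.copy_from_gpu(gpu.matmul(da, db))


def composed(gpu, a, b, c):
    """A x B x C built from two independent gemm calls."""
    return gemm(gpu, gemm(gpu, a, b), c)


def fused(gpu, a, b, c):
    """Same product, but the intermediate A x B never leaves the device."""
    da, db, dc = (gpu.copy_to_gpu(m) for m in (a, b, c))
    return gpu.copy_from_gpu(gpu.matmul(gpu.matmul(da, db), dc))


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

g1, g2 = FakeGPU(), FakeGPU()
r1 = composed(g1, A, B, C)
r2 = fused(g2, A, B, C)
copies_composed = g1.to_device + g1.from_device  # 4 in + 2 out
copies_fused = g2.to_device + g2.from_device     # 3 in + 1 out
```

Both versions compute the same product; the composed one pays two extra transfers for the intermediate. Writing fused by hand breaks modularity, which is the gap a runtime-managed abstraction is meant to close.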

Dataflow: runtime manages data movement

nodes → computation

edges → communication

leaves flexibility for the runtime

minimal specification of data movement

asynchrony is a runtime concern (not programmer concern)

[Figure: dataflow graph for A x B x C. Matrix A and Matrix B feed the first gemm node; its output and Matrix C feed the second gemm node.]


PTask OS abstractions: dataflow!

ptask (parallel task)
  Has priority for fairness
  Analogous to a process for GPU execution
  List of input/output resources (e.g. stdin, stdout…)

ports
  Can be mapped to ptask inputs/outputs
  A data source or sink

channels
  Similar to pipes, connect arbitrary ports
  Specialize to eliminate double-buffering

graph
  Directed, connected ptasks, ports, channels

datablocks
  Memory-space transparent buffers

• data: specify where, not how
• OS objects → OS resource management (RM) possible
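To show how these pieces could fit together, here is a minimal single-threaded sketch. Every class and method name in it is hypothetical and chosen for illustration; it is not the PTask API, and it omits priorities, devices, and datablocks entirely.

```python
# Minimal, single-threaded sketch of ptasks, ports, channels and a graph.
# All class and method names here are hypothetical, not the PTask API.
from collections import deque


class Channel:
    """FIFO connecting a producer to a consumer port, similar to a pipe."""
    def __init__(self):
        self.queue = deque()

    def push(self, block):
        self.queue.append(block)

    def pull(self):
        return self.queue.popleft()

    def ready(self):
        return bool(self.queue)


class PTask:
    """A unit of computation with named input ports and one output port."""
    def __init__(self, name, fn, inputs):
        self.name, self.fn = name, fn
        self.inputs = {port: None for port in inputs}  # port -> Channel
        self.output = None                             # Channel or None


class Graph:
    def __init__(self):
        self.tasks = []

    def add(self, task):
        self.tasks.append(task)
        return task

    def connect(self, src, dst, port):
        """Bind src's output and dst's input port to one shared channel."""
        chan = Channel()
        src.output = chan
        dst.inputs[port] = chan
        return chan

    def run(self):
        """Fire any task whose input ports all have data, until none can."""
        fired = True
        while fired:
            fired = False
            for t in self.tasks:
                chans = list(t.inputs.values())
                if chans and all(c is not None and c.ready() for c in chans):
                    t.output.push(t.fn(*[c.pull() for c in chans]))
                    fired = True


g = Graph()
xform = g.add(PTask("xform", lambda img: [p * 2 for p in img], ["rawimg"]))
filt = g.add(PTask("filter", lambda cloud: [p for p in cloud if p > 2], ["f-in"]))

source = Channel()             # would be fed by the capture process
xform.inputs["rawimg"] = source
g.connect(xform, filt, "f-in")
sink = Channel()               # would be drained by the detect process
filt.output = sink

source.push([1, 2, 3])
g.run()
result = sink.pull()
```

The program only states which ports are connected; when and where each task runs is left to the graph's run loop, which is the "specify where, not how" idea in miniature.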

PTask Graph: Gestural Interface

#> capture | xform | filter | detect &

[Figure: ptask graph for the pipeline. capture and detect are processes (CPU); xform and filter are ptasks (GPU). Ports (rawimg, cloud, f-in, f-out) are connected by channels that carry datablocks. Buffers along the graph are labeled mapped mem, GPU mem, GPU mem.]

Optimized data movement

Data arrival triggers computation

Location Transparency: Datablocks

Logical buffer backed by multiple physical buffers
  buffers created/updated lazily
  mem-mapping used to share across process boundaries

Track buffer validity per memory space
  writes invalidate other views

Flags for access control/data placement

[Figure: a datablock with buffers in Main Memory, GPU 0 Memory, GPU 1 Memory, …]

space  V  M  RW  data
main   1  1  1   1
gpu0   0  1  1   0
gpu1   1  1  1   0

Enables OS-mediated IPC through GPU memory
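The validity tracking described above can be sketched as follows. This is an assumption-laden illustration: the Datablock class, its fields, and its methods are hypothetical, only the V (valid) flag is modeled, and Python lists stand in for buffers in each memory space.

```python
# Sketch of a datablock: one logical buffer, one lazily created copy per
# memory space, with a validity flag per space. Names are hypothetical.

class Datablock:
    def __init__(self, spaces, data, home="main"):
        self.buffers = {s: None for s in spaces}   # physical buffers
        self.valid = {s: False for s in spaces}    # V flag per space
        self.copies = 0                            # transfers performed
        self.buffers[home] = list(data)
        self.valid[home] = True

    def read(self, space):
        """Return a view in `space`, copying only if it is stale."""
        if not self.valid[space]:
            src = next(s for s, ok in self.valid.items() if ok)
            self.buffers[space] = list(self.buffers[src])
            self.valid[space] = True
            self.copies += 1
        return self.buffers[space]

    def write(self, space, data):
        """Write in `space`; every other view becomes invalid."""
        self.buffers[space] = list(data)
        for s in self.valid:
            self.valid[s] = (s == space)


blk = Datablock(["main", "gpu0", "gpu1"], [1, 2, 3])
blk.read("gpu0")              # first use on gpu0: one copy
blk.read("gpu0")              # still valid: no copy
blk.write("gpu0", [4, 5, 6])  # invalidates main and gpu1
on_gpu1 = blk.read("gpu1")    # refreshed from gpu0: second copy
```

Callers never name a transfer; copies happen only when a stale view is actually read, which is what lets a runtime skip the round trip through main memory.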

Datablock Action Zone

#> capture | xform | filter …

[Figure: animation of one datablock moving through the capture process and the xform and filter ptasks (ports rawimg, cloud, f-in), with buffers in Main Memory and GPU Memory. The per-space flags start as shown and are updated as the datablock moves.]

space  V  M  RW  data
main   0  0  0   0
gpu    0  0  0   0


Implementation

Windows 7

Two PTask API implementations:

1. Stacked UMDF/KMDF driver
   Kernel component: mem-mapping, signaling
   User component: wraps DirectX, CUDA, OpenCL
   syscalls → DeviceIoControl() calls

2. User-mode library

Gestural Interface evaluation

Implementations
  pipes: capture | xform | filter | detect
  modular: capture+xform+filter+detect, 1 process
  handcode: data movement optimized, 1 process
  ptask: ptask graph

Configurations
  real-time: driven by cameras
  unconstrained: driven by in-memory playback

Gestural Interface Performance

[Figure: bar chart of runtime, user, and sys time relative to handcode (axis 0 to 3.5, lower is better) for handcode, modular, pipes, and ptask.]

• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)

compared to pipes
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage

compared to hand-code
• 11.6% higher throughput
• lower CPU util: no driver program

Things I didn’t show you

PTask schedules GPUs like CPUs

PTask provides performance isolation

Locality-aware scheduling

PTask transparently uses multiple GPUs

Proof-of-concept in Linux

Generality: micro-benchmarks


Dandelion *Goal* (holy grail)

A single programming interface for heterogeneous platforms:

Clusters comprising:
  CPUs
  GPUs
  FPGAs
  You name it…

Programmer: writes sequential code once

Runtime: partitions, runs in parallel on available resources

…achievable?

Dandelion goal (a less holy and attractive vessel: usually just as effective, depending on use case)

Offload data-parallel code fragments

Small cluster of multi-core + multi-GPU

Starting point: LINQ queries

Wait! What’s a LINQ Query?

Language Integrated Query

Relational operators over objects:

var res = collection
    .Where(c => c.hasSomeProperty())
    .Select(c => new {c, quack});

Why is LINQ important?
  Expresses many important workloads easily
    K-Means, PageRank, Sparse Matrix SVD, …
  LINQ queries are data-parallel
  Natural fit for data flow

Dandelion Overview

[Figure: a LINQ query (or other parallel code) is turned into a distributed graph of vertices (V) under distributed graph management. On each machine with CPUs and GPUs, the runtime executes a local PTask graph (nodes A, B, C, D) made of CPU tasks and GPU tasks.]

Distribution by DryadLINQ, c.f. MapReduce/Apache Hadoop

Dandelion: Local View

[Figure: a LINQ query (or other parallel code) goes through the Dandelion compiler, which emits a dataflow graph (nodes A, B, C, D); the PTask runtime executes it on a machine with CPUs and GPUs.]

Dandelion Compiler
  Construct dataflow graphs using templated ‘PTask Primitives’
  Generate GPU code for user types and functions

Example: K-Means Clustering

Partition n points into k clusters

Step 0: Pick k initial cluster centers

Step 1: Assign each point to the cluster with the nearest center

Step 2: Recompute centers (calculate cluster means)

Iterate steps 1 & 2 until stable

centers = points
    .GroupBy(point => NearestCenter(point, centers))
    .Select(g => g.Aggregate((x, y) => x+y)/g.Count());
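For readers who do not use LINQ, the same steps can be sketched in plain Python. The helper names are hypothetical, and one detail is an assumption that differs from the query above: a cluster that receives no points keeps its previous center, whereas the GroupBy/Select form simply produces no entry for it.

```python
# One k-means step written as "group by nearest center, then average",
# mirroring the GroupBy/Select query. Helper names are hypothetical.
from collections import defaultdict


def nearest_center(point, centers):
    """Index of the center with the smallest squared distance to point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))


def iteration(points, centers):
    """Step 1 (assign points to clusters) and step 2 (recompute means)."""
    groups = defaultdict(list)
    for point in points:                      # GroupBy(NearestCenter)
        groups[nearest_center(point, centers)].append(point)
    new_centers = list(centers)               # an empty cluster keeps its center
    for idx, members in groups.items():       # Select(mean of each group)
        new_centers[idx] = tuple(sum(axis) / len(members)
                                 for axis in zip(*members))
    return new_centers


def kmeans(points, centers, max_iters=100):
    """Iterate until the centers stop changing (or max_iters is reached)."""
    for _ in range(max_iters):
        updated = iteration(points, centers)
        if updated == centers:
            break
        centers = updated
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

With these four points the two clusters settle after one update, at (0, 1) and (10, 11).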

GroupBy

Groups an input sequence by key

Custom function maps input elements → key

List<Group<int>> res = ints.GroupBy(x => x);

ints: 10 30 20 10 20 30 10

GroupBy(x => x)

res: 10 10 10 20 20 30 30

Consider CPU Implementation

ints: 10 30 20 10 20 30 10

foreach(T elem in InputSequence)
{
    key = KeySelector(elem);
    group = GetGroup(key);
    group.Add(elem);
}

• Mapping to GPUs is not obvious
  – How to assign work across 1000’s of threads?
  – Synchronizing group creation is problematic
  – “Append” is problematic

Parallel GroupBy

Process each input element in parallel
  grouping ~ shuffling
  item output offset = group offset + item number
  … but how to get the group offset?

ints: 10 30 20 10 20 30 10

res: 10 10 10 20 20 30 30

[Figure: three intermediate results sit between ints and res]
  Number of groups and integer group IDs
  Number of elements in each group
  Start index of each group in the output sequence

GPU GroupBy: Multiple Passes

Input: 10 30 20 10 20 30 10

1. Assign group IDs
   Uses a lock-free hash table

   GroupID : 0 1 2
   Key     : 10 20 30

2. Compute group sizes
   Looks up hash table to get group ID
   Atomically increments group counters

   GroupID    : 0 1 2
   Key        : 10 20 30
   Group Size : 3 2 2

3. Compute start indices
   Exclusive prefix sum on group counts

   GroupID           : 0 1 2
   Key               : 10 20 30
   Group Start Index : 0 3 5

4. Write Outputs
   Write output to correct location
   – Uses atomic increment
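The four passes can be traced with a sequential simulation. This is a sketch under stated assumptions, not the GPU code: plain loops stand in for one-thread-per-element kernels, a Python dict stands in for the lock-free hash table, ordinary increments stand in for atomics, and the function name is hypothetical.

```python
# Sequential simulation of the four GroupBy passes. On a GPU each loop body
# would run as one thread per element; here plain loops stand in for that.
# Function and variable names are hypothetical.

def group_by_passes(items, key=lambda x: x):
    # Pass 1: assign a dense integer ID to every distinct key.
    # (Sorted order is an assumption made to match the slide's numbering;
    # a hash table would typically hand out IDs in arrival order.)
    group_id = {k: i for i, k in enumerate(sorted({key(x) for x in items}))}

    # Pass 2: count members per group (atomic increments on a GPU).
    sizes = [0] * len(group_id)
    for x in items:
        sizes[group_id[key(x)]] += 1

    # Pass 3: exclusive prefix sum of the sizes gives each group's start.
    starts, running = [], 0
    for n in sizes:
        starts.append(running)
        running += n

    # Pass 4: scatter each item to start-of-group + its slot in the group.
    # The per-group cursor plays the role of the atomic counter.
    cursor = list(starts)
    out = [None] * len(items)
    for x in items:
        g = group_id[key(x)]
        out[cursor[g]] = x
        cursor[g] += 1
    return sizes, starts, out


sizes, starts, out = group_by_passes([10, 30, 20, 10, 20, 30, 10])
```

On the slide's input this reproduces the group sizes 3 2 2, the start indices 0 3 5, and the grouped output 10 10 10 20 20 30 30.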

GroupBy As Composed Primitives

[Figure: the same four passes and example values as on the previous slide, each pass implemented by one primitive inside the GroupBy primitive.]

Assign group IDs: buildGroupKeyMap
Compute group sizes: countGroupMembers
Compute start indices: prefixSum
Write Outputs: shuffle

K-Means Dandelion graph

[Figure: dataflow graph generated for K-Means, with labels Simple GroupBy and Streaming GroupBy.]

K-Means in C#/LINQ

class KMeans {
  // Index of the center closest to point.
  int NearestCenter(Vector point, IEnumerable<Vector> centers) {
    int minIndex = 0, curIndex = 0;
    double minValue = Double.MaxValue;
    foreach (Vector center in centers) {
      double curValue = (center - point).Norm2();
      minIndex = (minValue > curValue) ? curIndex : minIndex;
      minValue = (minValue > curValue) ? curValue : minValue;
      curIndex++;
    }
    return minIndex;
  }

  // One step: group points by nearest center, then average each group.
  IQueryable<Vector> Iteration(IQueryable<Vector> points,
                               IQueryable<Vector> centers) {
    return points
      .GroupBy(point => NearestCenter(point, centers))
      .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
  }
}

K-Means in Dandelion

[Code: the same KMeans class as in “K-Means in C#/LINQ”, line for line.]

This is kind of a big deal!
• managed sequential code
• running on GPUs
• zero programmer effort

K-Means Performance

[Figure: chart titled "K-Means (40 Centers)": execution time (s, axis 0 to 18) against number of points (millions, axis 0 to 7) for two series, Streaming and CPU.]

23x faster than single-threaded CPU … and we think we can do even better

4-core Intel Xeon W3550 @ 3.07 GHz, 6GB RAM

NVIDIA GTX580 GPU with 1.5GB GDDR

PTask on NVIDIA CUDA version 4.0

PageRank in Dandelion

public struct Rank {
  public UInt64 PageID;
  public UInt64 PageRank;
  …
}

public struct Page {
  public UInt64 PageID;
  public UInt64 LinkedPageID;
  int numLinkedPages;
  …
}

// Join Pages with Ranks, and disperse updates (Streaming Join)
var partialRanks =
  from rank in ranks
  join page in pages on rank.PageID equals page.PageID
  select new Rank(page.LinkedPageID, rank.PageRank / page.NumLinks);

// Re-accumulate Ranks (Streaming GroupBy)
var newRanks =
  from partialRank in partialRanks
  group partialRank.PageRank by partialRank.PageID into g
  select new Rank(g.Key, g.Sum());
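The join-then-group structure of the two queries can be followed in a few lines of Python. This is a sketch with assumed data shapes: ranks is a dict, pages is a list of (page_id, linked_id, num_links) tuples, and pagerank_step is a hypothetical name. Like the queries above, it applies no damping factor.

```python
# One PageRank-style step mirroring the two queries: a join that spreads each
# page's rank over its outgoing links, then a group-by that sums the shares.
# The tuple layout and function name are hypothetical.
from collections import defaultdict


def pagerank_step(ranks, pages):
    """ranks: {page_id: rank}; pages: list of (page_id, linked_id, num_links)."""
    # Join pages with ranks on page_id and emit one partial rank per link.
    partial = [(linked_id, ranks[page_id] / num_links)
               for page_id, linked_id, num_links in pages
               if page_id in ranks]

    # Re-accumulate: group partial ranks by target page and sum each group.
    new_ranks = defaultdict(float)
    for target, share in partial:
        new_ranks[target] += share
    return dict(new_ranks)


pages = [(1, 2, 2), (1, 3, 2), (2, 3, 1), (3, 1, 1)]
ranks = {1: 1.0, 2: 1.0, 3: 1.0}
new_ranks = pagerank_step(ranks, pages)
```

Page 1 splits its rank between pages 2 and 3, so page 3 ends the step with its half share from page 1 plus all of page 2's rank.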

Status

Dandelion is a research prototype

Working end to end … now add more workloads

Current results promising … but incomplete

Many areas of research remain unexplored

Optimal work partitioning and scheduling

When to expose/hide architectural features

We will need programmer hints sometimes. When?


Related Work

Prior work: regular (scientific, HPC) apps
  Accelerator: array computations [ASPLOS ’06]
  Amp/C++
  StreamIt → CUDA [CGO ’09, LCTES ’09]
  MATLAB → CUDA compiler [PLDI ’11]

OS support for heterogeneous platforms:
  Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]

GPU Scheduling
  TimeGraph [Kato 11], Pegasus [Gupta 11]

Graph-based programming models
  Synthesis [Massalin 89], Monsoon/Id [Arvind], Dryad [Isard 07]
  StreamIt [Thies 02], DirectShow, TCP Offload [Currid 04]

Tasking
  Tessellation, Apple GCD, …

Conclusion and Future Work

Better abstractions for GPUs are critical
  Enable fairness & priority
  OS can use the GPU

Dataflow: a good fit abstraction
  system manages data movement
  performance benefits significant
  enables more flexible technology stack

Proof of concept: LINQ on GPUs

Future work
  Automatic task partitioning across CPU and GPU
  Finding best parallelization for a given LINQ query

Thank you! Questions?