Divergence Analysis with Affine Constraints


Post on 23-Feb-2016


Page 1: Divergence Analysis with Affine Constraints

λ

[email protected]

Divergence Analysis with Affine Constraints

Diogo Sampaio, Sylvain Collange and Fernando Pereira

The Federal University of Minas Gerais - Brazil

Programming Languages Laboratory

Page 2: Divergence Analysis with Affine Constraints


The objective of this work is to speed up code that runs on GPUs.

We will achieve this goal via two contributions:

• Divergence analysis with affine constraints, which enables…
• Divergence-aware register allocation.

Page 3: Divergence Analysis with Affine Constraints


Motivation

• General-purpose programming on Graphics Processing Units (GPGPU) is a reality today.
– Lots of academic research.
– Many industrial applications.

• Yet, programming efficient GPGPU applications is hard.
– Complex interplay with the hardware.
– Threads execute in lock step, but divergences may happen.

Page 4: Divergence Analysis with Affine Constraints


What are Divergences?

• Below we have a simple kernel, and its Control Flow Graph:

    __global__ void ex(float* v) {
      int tid = threadIdx.x;  // assumed: the slide uses tid as the thread identifier
      if (v[tid] < 0.0) {
        v[tid] /= 2;
      } else {
        v[tid] = 0.0;
      }
    }

• Why do we have divergences in this kernel?

Page 5: Divergence Analysis with Affine Constraints


Why are Divergences a Problem?
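One way to see the cost is to simulate a warp running the kernel `ex` from the previous slide in lock step. The sketch below is only an illustrative model: the function name `run_warp` and the idea of counting "passes" are assumptions, not something taken from the slides. When the threads disagree on the branch, the warp goes through both sides one after the other, with part of the threads switched off each time.

```python
def run_warp(v):
    """Simulate one warp running `ex` in lock step over the list v.

    Returns the updated values and the number of serialized passes
    needed for the if/else (1 if all threads agree, 2 if they diverge).
    """
    # Each thread evaluates the branch predicate on its own element.
    takes_then = [x < 0.0 for x in v]

    out = list(v)
    passes = 0

    # "then" side: only the threads whose predicate is true are active.
    if any(takes_then):
        passes += 1
        for tid, active in enumerate(takes_then):
            if active:
                out[tid] = v[tid] / 2

    # "else" side: the remaining threads are active.
    if not all(takes_then):
        passes += 1
        for tid, active in enumerate(takes_then):
            if not active:
                out[tid] = 0.0

    return out, passes


uniform_out, uniform_passes = run_warp([-4.0, -2.0, -8.0, -6.0])
divergent_out, divergent_passes = run_warp([-4.0, 2.0, -8.0, 6.0])
print(uniform_passes, divergent_passes)  # 1 2
```

In this model the divergent warp needs twice as many passes for the same amount of useful work, which is the effect that a divergence analysis tries to predict at compile time.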

Page 6: Divergence Analysis with Affine Constraints


Uniform and Divergent Variables

• If a variable always has the same value for all the threads in execution, then we call it uniform.

• If different threads in execution may see the same variable name with different values, this variable is called divergent.

• Which variables are divergent?
– The thread identifier is always divergent.
– Variables that depend on divergent variables are also divergent.
  • Data dependences.
  • Control dependences.

Page 7: Divergence Analysis with Affine Constraints


Data Dependences

• If a variable v is defined by an instruction that uses a variable u, then v is data-dependent on u.

• In the figure, %r1 depends on v and on %tid.

The value of %r1 may be different for different threads.

Page 8: Divergence Analysis with Affine Constraints


Control Dependences

• If the value assigned to a variable v is controlled by a variable u, then v is control-dependent on u.

• In the figure, %f2 is control dependent on %p1.

Depending on how each thread branches at the end of B0, %f2 may be %f1/2 or 0.0 at BST.
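The rule "variables that depend on divergent variables are also divergent" amounts to a reachability computation over data and control dependences. The sketch below is a minimal illustration under stated assumptions: the function `divergent_variables` and the dictionary encoding are hypothetical, and only the edges for %r1 and %f2 come from the two figures; the edges for %f1 and %p1 are assumed so that the two examples connect.

```python
def divergent_variables(depends_on, seeds):
    """Variables reachable from the seeds through dependence edges.

    depends_on maps each variable to the variables it is data- or
    control-dependent on.
    """
    divergent = set(seeds)
    changed = True
    while changed:
        changed = False
        for var, sources in depends_on.items():
            if var not in divergent and any(u in divergent for u in sources):
                divergent.add(var)
                changed = True
    return divergent


deps = {
    "%r1": ["v", "%tid"],   # data dependence, from the slide
    "%f1": ["%r1"],         # assumed: %f1 is loaded through %r1
    "%p1": ["%f1"],         # assumed: %p1 tests %f1
    "%f2": ["%p1", "%f1"],  # control dependence on %p1, from the slide
}
divergent = divergent_variables(deps, seeds=["%tid"])
print(sorted(divergent))
```

The parameter v is not reached from %tid, so it stays uniform in this model.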

Page 9: Divergence Analysis with Affine Constraints


Affine Variables

• Some divergent variables are special: they are affine expressions of the thread identifier, e.g., v = C×Tid + N.
• Example: the kernel below computes the average of each column of a matrix:

The loop always executes the same number of iterations for all the threads.

Page 10: Divergence Analysis with Affine Constraints


Affine Variables

• Variable i is divergent, yet it is very regular: each thread sees it as "Tid + N × c", where N is the current loop iteration.

• We say that i is an affine variable.

In this case, i = Tid + 10 * c
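This regularity can be checked with a small simulation. The sketch below assumes, as the slide suggests, that each thread starts with i = Tid and adds the same stride c once per iteration; the function name `values_of_i` and the concrete numbers are only illustrative.

```python
def values_of_i(num_threads, c, iterations):
    """Value of i seen by each thread after `iterations` loop iterations."""
    values = []
    for tid in range(num_threads):
        i = tid                      # i starts as the thread identifier
        for _ in range(iterations):
            i += c                   # every thread adds the same uniform stride
        values.append(i)
    return values


c = 7
after_10 = values_of_i(num_threads=4, c=c, iterations=10)
print(after_10)  # [70, 71, 72, 73]

# Every thread sees i = 1 * Tid + 10 * c: same coefficient, same offset.
assert all(i == tid + 10 * c for tid, i in enumerate(after_10))
```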

Page 11: Divergence Analysis with Affine Constraints


The Divergence Analysis with Affine Constraints

• This analysis classifies variables as uniform, affine or divergent.

• Our divergence analysis is a dataflow analysis.
– We associate an abstract state with each variable.
– This abstract state is a pair (a, b), which means a × Tid + b.
– Each element in the pair can be:
  • A constant, which we denote by 'C'
  • A non-initialized value, which we denote by '?'
  • An unknown value, which we denote by 'D'
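One possible encoding of these abstract states is sketched below. The names `State` and `concretize`, and the choice of Python integers for known constants and the strings "?" and "D" for the other two cases, are assumptions made for illustration, not the representation used in the actual implementation.

```python
from typing import NamedTuple, Union

UNINIT = "?"    # non-initialized element
UNKNOWN = "D"   # unknown element

Element = Union[int, str]  # a known constant, UNINIT or UNKNOWN


class State(NamedTuple):
    """Abstract state (a, b), read as a * Tid + b."""
    a: Element
    b: Element

    def __str__(self):
        return f"({self.a}, {self.b})"


def concretize(state, tid):
    """Runtime value for thread `tid`, or None if the state does not fix it."""
    if isinstance(state.a, int) and isinstance(state.b, int):
        return state.a * tid + state.b
    return None


s = State(10, 3)
print(s, concretize(s, tid=2))                                   # (10, 3) 23
print(State(0, UNKNOWN), concretize(State(0, UNKNOWN), tid=2))   # (0, D) None
```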

Page 12: Divergence Analysis with Affine Constraints


Uniform Variables

• A uniform variable v is bound to the state (0, X), which means 0 × Tid + X.
– If X is a known constant, then v is a constant.

No worries: we shall explain how we find these abstract states!

Page 13: Divergence Analysis with Affine Constraints


Divergent Variables

• A divergent variable v is bound to the state (D, D), which means that we do not know anything about the runtime values that this variable can assume.

No worries: we shall explain how we find these abstract states!

Page 14: Divergence Analysis with Affine Constraints


Affine Variables

• An affine variable v is bound to the state (c, X), which means c × Tid + X. The factor c is always a known constant; X can be either a known constant or D.

Ok: it is about time to explain how we find these abstract states.
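The last three slides can be summarized as a small classification function. This is only a sketch: the name `classify`, the tuple encoding of states and the strings "?" and "D" are assumptions for illustration. The five class names follow the legend that the talk uses when it classifies spilled variables (Constant, Uniform, Cnst. Affine, Affine, Divergent).

```python
UNINIT = "?"    # non-initialized element
UNKNOWN = "D"   # unknown element


def classify(state):
    """Map an abstract state (a, b), read as a * Tid + b, to a class name."""
    a, b = state
    if UNINIT in (a, b):
        return "non-initialized"
    if a == UNKNOWN:
        return "divergent"            # (D, D): nothing is known
    known_offset = b != UNKNOWN
    if a == 0:
        # 0 * Tid + b: the same value for every thread
        return "constant" if known_offset else "uniform"
    # c * Tid + b with a known, non-zero c
    return "constant affine" if known_offset else "affine"


for state in [(0, 10), (0, UNKNOWN), (10, 3), (1, UNKNOWN), (UNKNOWN, UNKNOWN)]:
    print(state, "->", classify(state))
```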

Page 15: Divergence Analysis with Affine Constraints


Solving Divergence Analysis

• Initially every variable is bound to the abstract state (?, ?), unless…
• It is initialized with a constant, e.g., if we have the assignment v = 10, then [v] = (0, 10). Unless…
• It is initialized with a constant expression of Tid, e.g., if v = 10 * Tid + 3, then [v] = (10, 3). Unless…
• The variable is a function parameter, in which case its abstract state is (0, D).

Once we have initialized every variable, we iterate a few propagation rules until we reach a fixed point.
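The four initialization cases can be written down directly. In the sketch below the function name `initial_state` and the `kind` labels are hypothetical; only the resulting pairs come from the slide.

```python
UNINIT = "?"    # non-initialized element
UNKNOWN = "D"   # unknown element


def initial_state(kind, coeff=0, offset=0):
    """Initial abstract state (a, b) of a variable.

    kind is one of (labels assumed for this sketch):
      "constant"  : v = offset
      "tid_expr"  : v = coeff * Tid + offset
      "parameter" : function parameter, uniform but unknown
      anything else: not initialized yet
    """
    if kind == "constant":
        return (0, offset)
    if kind == "tid_expr":
        return (coeff, offset)
    if kind == "parameter":
        return (0, UNKNOWN)
    return (UNINIT, UNINIT)


print(initial_state("constant", offset=10))           # (0, 10)
print(initial_state("tid_expr", coeff=10, offset=3))  # (10, 3)
print(initial_state("parameter"))                     # (0, 'D')
print(initial_state("other"))                         # ('?', '?')
```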

Page 16: Divergence Analysis with Affine Constraints


The Propagation Rules

• There are many different propagation tables (we call them dataflow equations).
– We have one table for each different program instruction.
– Let's consider, for instance, that the program contains an instruction v = v1 + v2. The abstract state of v1, i.e., [v1], selects the column (blue on the slide), and [v2] selects the row (cantaloupe).

+          (0, b1)      (0, D)    (a1, b1)          (a1, D)      (D, D)
(0, b2)    (0, b1+b2)   (0, D)    (a1, b1+b2)       (a1, D)      (D, D)
(0, D)     (0, D)       (0, D)    (a1, D)           (a1, D)      (D, D)
(a2, b2)   (a2, b1+b2)  (a2, D)   (a1+a2, b1+b2)    (a1+a2, D)   (D, D)
(a2, D)    (a2, D)      (a2, D)   (a1+a2, D)        (a1+a2, D)   (D, D)
(D, D)     (D, D)       (D, D)    (D, D)            (D, D)       (D, D)
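The table for addition collapses into a few lines of code once we notice that D absorbs everything and that the two components are added independently. The sketch below is one way to write it, with assumed names (`add_states`, `add_elem`) and an assumed convention that an operand still at (?, ?) leaves the result at (?, ?); the slide does not say how '?' is handled.

```python
UNINIT = "?"    # non-initialized element
UNKNOWN = "D"   # unknown element


def add_elem(x, y):
    """Add two elements of an abstract state; D absorbs any constant."""
    if UNKNOWN in (x, y):
        return UNKNOWN
    return x + y


def add_states(s1, s2):
    """Abstract state of v = v1 + v2, given [v1] = s1 and [v2] = s2."""
    (a1, b1), (a2, b2) = s1, s2
    if UNINIT in (a1, b1, a2, b2):
        return (UNINIT, UNINIT)        # assumption: wait until both operands are known
    a = add_elem(a1, a2)
    if a == UNKNOWN:
        return (UNKNOWN, UNKNOWN)      # an unknown coefficient makes v divergent
    return (a, add_elem(b1, b2))


print(add_states((0, 1), (0, 2)))              # (0, 3)
print(add_states((2, 5), (0, UNKNOWN)))        # (2, 'D')
print(add_states((2, 5), (3, 1)))              # (5, 6)
print(add_states((UNKNOWN, UNKNOWN), (0, 1)))  # ('D', 'D')
```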

Page 17: Divergence Analysis with Affine Constraints


Applying the Rules

• We work on the program dependence graph.
• Variables to be processed are placed in a worklist.

Page 18: Divergence Analysis with Affine Constraints


Applying the Rules

• While there is any variable v in the worklist, we try to process the instructions that use v.

Page 19: Divergence Analysis with Affine Constraints


Applying the Rules

• If all the dependences of a variable v have been processed, then we can remove v from the worklist.
• If we process an instruction that defines variable w, then we add w to the worklist.

We have removed Tid from the worklist, and added i0 to it.

Page 20: Divergence Analysis with Affine Constraints


Reaching a Fixed Point

• We keep performing this abstract interpretation until the worklist is empty.
– This happens once we reach a fixed point.
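The worklist iteration can be sketched as follows, under several assumptions that are not on the slides: the program is given as a dictionary from each variable to a tuple ("init", state), ("add", x, y) or ("phi", x, y); the rule used to merge values at a phi-function (`meet_states`) is a guess at a reasonable merge, since the talk only shows the table for addition; and the loop in the example (i1 = phi(tid, i2), i2 = i1 + c) is a hypothetical stand-in for the figure.

```python
from collections import deque

UNINIT, UNKNOWN = "?", "D"


def add_states(s1, s2):
    """Rule for v = v1 + v2, following the addition table."""
    (a1, b1), (a2, b2) = s1, s2
    if UNINIT in (a1, b1, a2, b2):
        return (UNINIT, UNINIT)          # operand not processed yet (assumption)
    if UNKNOWN in (a1, a2):
        return (UNKNOWN, UNKNOWN)
    b = UNKNOWN if UNKNOWN in (b1, b2) else b1 + b2
    return (a1 + a2, b)


def meet_elem(x, y):
    """Merge two elements: '?' yields to the other side, a conflict gives D."""
    if x == UNINIT:
        return y
    if y == UNINIT:
        return x
    return x if x == y else UNKNOWN


def meet_states(s1, s2):
    """Assumed merge rule for phi-functions (not shown on the slides)."""
    a = meet_elem(s1[0], s2[0])
    if a == UNKNOWN:
        return (UNKNOWN, UNKNOWN)
    return (a, meet_elem(s1[1], s2[1]))


def analyze(program):
    """Iterate the propagation rules until the worklist is empty."""
    state = {v: (UNINIT, UNINIT) for v in program}
    users = {v: [] for v in program}
    worklist = deque()
    for v, (op, *operands) in program.items():
        if op == "init":
            state[v] = operands[0]
            worklist.append(v)           # initialized variables seed the worklist
        else:
            for x in operands:
                users[x].append(v)

    while worklist:
        v = worklist.popleft()
        for w in users[v]:               # process the instructions that use v
            op, x, y = program[w]
            rule = add_states if op == "add" else meet_states
            new = rule(state[x], state[y])
            if new != state[w]:
                state[w] = new
                worklist.append(w)       # w changed, so its users must be revisited
    return state


program = {
    "tid": ("init", (1, 0)),             # the thread identifier: 1 * Tid + 0
    "c": ("init", (0, UNKNOWN)),         # a function parameter
    "i1": ("phi", "tid", "i2"),          # loop header
    "i2": ("add", "i1", "c"),            # i = i + c
}
result = analyze(program)
print(result["i1"], result["i2"])  # (1, 'D') (1, 'D')
```

In this example the loop variable ends up as (1, D): affine, with a known coefficient and an unknown offset. Termination relies on each element only ever moving from '?' to a constant to D.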

Page 21: Divergence Analysis with Affine Constraints


How to Use the Divergence Analysis

• There are many compiler optimizations that need the information provided by the divergence analysis.

• We are using the results of our divergence analysis with affine constraints to guide a register allocator.
– We call it the Divergence Aware Register Allocator.

Divergence analysis with affine constraints.

Divergence aware register allocation.

Page 22: Divergence Analysis with Affine Constraints


What is Register Allocation?

• Register allocation is the problem of finding locations for the variables in a program.

• Variables can stay in registers or in memory.– Variables sent to memory are called spills.

• In Graphics Processing Units we have roughly three types of memory:
– Local: off-chip and private to each thread.
– Global: off-chip and visible to every thread.
– Shared: on-chip and visible to every thread (in the same warp; let's abstract this detail away).

Page 23: Divergence Analysis with Affine Constraints


The Key Insight: where to place spills

• A traditional allocator moves every spilled variable to the local memory. However, we can do much better:
– Uniform spilled variables can be placed in the shared memory.
– Affine spilled variables can also be placed in the shared memory.
  • But this is a bit trickier, and I shall explain it later.
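The placement decision only needs the first component of the abstract state. The sketch below uses assumed names (`spill_space`, `slots_per_warp`) and takes the warp size as a parameter; it also counts how many memory slots a warp needs for a small, made-up set of spilled variables, with and without the analysis.

```python
UNKNOWN = "D"   # unknown element of an abstract state


def spill_space(state):
    """Memory space for a spilled variable with abstract state (a, b)."""
    a, _ = state
    # A known coefficient means uniform (a == 0) or affine (a != 0).
    return "local" if a == UNKNOWN else "shared"


def slots_per_warp(space, warp_size=32):
    """Local memory keeps one copy per thread, shared memory a single copy."""
    return warp_size if space == "local" else 1


# Hypothetical spilled variables and their abstract states.
spills = {"x": (0, UNKNOWN), "i": (1, UNKNOWN), "f": (UNKNOWN, UNKNOWN)}
placement = {name: spill_space(state) for name, state in spills.items()}
print(placement)  # {'x': 'shared', 'i': 'shared', 'f': 'local'}

traditional = sum(slots_per_warp("local") for _ in spills)
divergence_aware = sum(slots_per_warp(space) for space in placement.values())
print(traditional, divergence_aware)  # 96 34
```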

Page 24: Divergence Analysis with Affine Constraints


Example

Uniform: 0×Tid + D
Affine: c×Tid + D
Divergent: D×Tid + D

Page 26: Divergence Analysis with Affine Constraints


Redundancy: uniform variables always have the same value for all the threads. Would it not be better to keep only one image of each spilled uniform variable?

Moreover, we can also share affine variables, as we will explain soon.

Page 27: Divergence Analysis with Affine Constraints


This is what we get with divergence-aware allocation

Page 28: Divergence Analysis with Affine Constraints


The Benefits of Our Allocator

• A traditional allocator spills everything to the local memory.

• The divergence-aware allocator makes more use of the shared memory. This has many advantages:
– Shared memory is faster.
– Less memory is used to spill variables.

Page 29: Divergence Analysis with Affine Constraints


How to Spill Affine Values?

• An affine value is like C×Tid + N, where C is a constant known at compilation time. Let's assume an expression like: N = 2*tid + t0

Store: st.local N 0xFFFFFC32
changes to: st.shared t0 0xFFFFFC32

Load: ld.local N 0xFFFFFC32
changes to: ld.shared t0 0xFFFFFC32
            N = 2*tid + t0
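The transformation is safe because every thread can rebuild its own N from the single shared copy of t0. The sketch below models this with a dictionary standing in for the shared-memory spill area; the function name `spill_and_reload` and the concrete values are assumptions for illustration.

```python
def spill_and_reload(t0, num_threads, coeff=2):
    """Spill N = coeff * tid + t0 through a single shared slot."""
    shared = {}                    # models the shared-memory spill area

    # Store: only the uniform part travels to memory (st.shared t0).
    shared["slot"] = t0

    # Load: each thread reads t0 back and rebuilds its own N.
    reloaded = []
    for tid in range(num_threads):
        offset = shared["slot"]                # ld.shared t0
        reloaded.append(coeff * tid + offset)  # N = 2*tid + t0
    return reloaded, len(shared)


original = [2 * tid + 100 for tid in range(4)]   # N before the spill, with t0 = 100
reloaded, slots = spill_and_reload(t0=100, num_threads=4)
print(reloaded, slots)  # [100, 102, 104, 106] 1
```

All four threads recover their original value of N, and only one slot was written.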

Page 30: Divergence Analysis with Affine Constraints


Implementation

• We have implemented the affine analysis and the divergence-aware register allocator in Ocelot, an open source PTX optimizer.
– More than 10,000 lines of code!
– This compiler is used in the industry.

• We have successfully tested our divergence analysis on all the 177 different CUDA kernels that we took from the Rodinia and NVIDIA SDK 3.1 benchmark suites.

Page 31: Divergence Analysis with Affine Constraints


Performance

[Figure: bar chart, "% faster than naive linear scan execution", with two series per benchmark: Divergent and Affine. Vertical axis: % faster (execution time), from -15 to 65. Benchmarks: Average, backprop, bfs, cfd, hotspot, lavaMD, nw, particlefilter_float, particlefilter_naive, srad, streamcluster, bilateralFilter, binomialOptions, BlackScholes, convolutionFFT2D, dwtHaar1D, eigenvalues, fastWalshTransform, histogram, HSOpticalFlow, matrixMul, mergeSort, nbody, scan, SobolQRNG, sortingNetworks, threadFenceReduction, volumeRender.]

Setup: GTX 570 / NVIDIA CUDA driver and toolkit 3.2 / 32-bit Linux / 8 registers per thread.

Page 32: Divergence Analysis with Affine Constraints


Conclusions

• New directions to divergence aware optimizations.
– So far, optimizations have been focusing on branch fusion and synchronization of divergent threads.
• Open source implementation already being used by the Ocelot community.

• To know more:
– http://code.google.com/p/gpuocelot/
– http://simdopt.wordpress.com

Questions?

Page 33: Divergence Analysis with Affine Constraints


What if the affine expression is formed by constants only?

• If the affine expression is like C0×Tid + C1, where C0 and C1 are constants, then we need neither loads nor stores (this is rematerialization). For instance, assume N = 2*tid + 3

Store: st.local N 0xFFFFFC32
the store is completely removed

Load: ld.local N 0xFFFFFC32
changes to: N = 2*tid + 3

We have all the information to reconstruct N!
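Putting the three cases together, an allocator could pick the spill code from the abstract state alone. The sketch below is hypothetical: `spill_code` is an assumed name, the strings it returns only mimic the notation of the slides (addresses omitted) and are not real PTX, and the `_off` suffix for the spilled offset is made up.

```python
UNKNOWN = "D"   # unknown element of an abstract state


def spill_code(name, state):
    """Return (store, reload) pseudo-instructions for spilling `name`."""
    a, b = state
    if a == UNKNOWN:
        # Divergent: every thread keeps its own copy in local memory.
        return f"st.local {name}", f"ld.local {name}"
    if b == UNKNOWN:
        # Uniform or affine: only the uniform offset goes through shared memory.
        off = f"{name}_off"
        return f"st.shared {off}", f"ld.shared {off}; {name} = {a}*tid + {off}"
    # Both parts are compile-time constants: rematerialize, no memory traffic.
    return None, f"{name} = {a}*tid + {b}"


print(spill_code("N", (UNKNOWN, UNKNOWN)))  # ('st.local N', 'ld.local N')
print(spill_code("N", (2, UNKNOWN)))
print(spill_code("N", (2, 3)))              # (None, 'N = 2*tid + 3')
```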

Page 34: Divergence Analysis with Affine Constraints


Classification of Spilled Variables

[Figure: chart with one entry per benchmark. Legend: Constant, Uniform, Cnst. Affine, Affine, Divergent. Axis labels: Uses (loads), Definitions (stores). Benchmarks: backprop, bfs, cfd, hotspot, lavaMD, nw, particlefilter_float, particlefilter_naive, srad, streamcluster, bilateralFilter, binomialOptions, BlackScholes, convolutionFFT2D, dwtHaar1D, eigenvalues, fastWalshTransform, histogram, HSOpticalFlow, matrixMul, mergeSort, nbody, scan, SobolQRNG, sortingNetworks, threadFenceReduction, volumeRender.]