GPU Accelerated Domain Decomposition


Upload: richard-southern

Post on 13-Apr-2017


Page 1: GPU Accelerated Domain Decomposition

Introduction Domain Decomposition DD on the GPU Conclusions

How To Use Your Desktop Supercomputer: GPU Accelerated Domain Decomposition

Richard Southern

GPGPU DDM (Richard Southern) 1

Page 2: GPU Accelerated Domain Decomposition

Overview

Purpose of this talk: To demonstrate by example how the GPU can be used for solving general computing problems.

The example: A Domain Decomposition Method for solving common Boundary Value Problems.

Page 3: GPU Accelerated Domain Decomposition

An Introduction to General Purpose GPU programming

Page 4: GPU Accelerated Domain Decomposition

The Evolution of the Desktop Supercomputer

In the 1970s, most supercomputers used parallel vector processors.

Single Instruction, Multiple Data (SIMD).

Real-time graphics systems worked the same way.

March 2001: NVIDIA releases the GeForce 3, a programmable vector-processing (SIMD) graphics chip.

Page 5: GPU Accelerated Domain Decomposition

The NVIDIA nfiniteFX Engine!

From the original press release: “With the GeForce3 and its nfiniteFX™ engine, NVIDIA introduces the world’s first programmable 3D graphics chip architecture. By combining programmable vertex and pixel shading capabilities, and 3D texture technology, the nfiniteFX engine delivers unprecedented visual realism on your PC.”

Page 6: GPU Accelerated Domain Decomposition

Graphics Cards pre-GeForce 3

Page 7: GPU Accelerated Domain Decomposition

The GeForce3

Page 8: GPU Accelerated Domain Decomposition

Result: Ugly Zombies

Page 9: GPU Accelerated Domain Decomposition

sin wave water effect

    /* Vertex shader */
    uniform float waveTime;
    uniform float waveWidth;
    uniform float waveHeight;

    void main()
    {
        // Displace each vertex in z by a product of a sine and a cosine wave.
        vec4 v = vec4(gl_Vertex);
        v.z = sin(waveWidth * v.x + waveTime)
            * cos(waveWidth * v.y + waveTime) * waveHeight;
        gl_Position = gl_ModelViewProjectionMatrix * v;
    }

    /* Fragment shader */
    void main()
    {
        // Red and green follow the window-space x and y position; blue is constant.
        gl_FragColor[0] = gl_FragCoord[0] / 400.0;
        gl_FragColor[1] = gl_FragCoord[1] / 400.0;
        gl_FragColor[2] = 1.0;
    }

Page 10: GPU Accelerated Domain Decomposition

sin wave water effect

Page 11: GPU Accelerated Domain Decomposition

General Purpose GPU programming

This commodity chip can be used for solving traditional supercomputing problems.

• Fast Fourier Transform
• Bioinformatics (database queries, visualization, etc.)
• Neural networks
• Video processing
• . . .

Advantages: cheap, ubiquitous.

Changes had to be made to the hardware and drivers, especially for texture read / write.

Page 12: GPU Accelerated Domain Decomposition

Result: Tesla, the dedicated GPGPU card

Page 13: GPU Accelerated Domain Decomposition

DGEMM Performance

Page 14: GPU Accelerated Domain Decomposition

Folding@Home Results

OS Type             Native TFLOPS   x86 TFLOPS   Active CPUs   Total CPUs
Windows                       190          190        199965      3405324
Mac OS X/PowerPC                3            3          4237       139764
Mac OS X/Intel                 20           20          6592       129084
Linux                          59           59         34492       508701
ATI GPU                       642          677          6296       134815
NVIDIA GPU                   1325         2796         11131       211480
PLAYSTATION 3                 755         1593         26784      1006896

Page 15: GPU Accelerated Domain Decomposition

Programming on the GPU

Originally assembler (nightmare).

Shader languages:
• Cg (NVIDIA),
• GLSL (OpenGL),
• HLSL (Microsoft)

General GPU programming languages:
• CUDA (NVIDIA),
• Brook (Stanford),
• OpenCL (Khronos Group)

Page 16: GPU Accelerated Domain Decomposition

About CUDA

Stands for Compute Unified Device Architecture.

Abstracts stream processing and memory access.

Data can be passed around using C pointers.

Some graphics concepts still need to be understood: textures, surfaces, vector types.

Only works on NVIDIA cards.

Page 17: GPU Accelerated Domain Decomposition

Simplest CUDA example: Add two vectors in parallel.

    // Kernel: each thread adds one pair of elements.
    __global__ void VecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        . . .
        // Invoke the kernel from the main program: 1 block of N threads.
        VecAdd<<<1, N>>>(A, B, C);
        . . .
    }
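The launch above uses a single block, so N cannot exceed the per-block thread limit (the talk later quotes maxThreadsPerBlock=512 for its card). Larger vectors are usually split over several blocks, with each thread computing its global index as blockIdx.x * blockDim.x + threadIdx.x and checking it against N. The following is a minimal Python stand-in for that index arithmetic, not CUDA code: `launch` and `vec_add` are hypothetical helper names, and the "threads" run one after another.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for kernel<<<grid_dim, block_dim>>>(args)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vec_add(block_idx, block_dim, thread_idx, a, b, c, n):
    # Global element index, as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    i = block_idx * block_dim + thread_idx
    # The last block may be only partly used, so guard against overrun.
    if i < n:
        c[i] = a[i] + b[i]


n = 1000
block_dim = 512                               # assumed per-block thread limit
grid_dim = (n + block_dim - 1) // block_dim   # ceil(n / block_dim) blocks
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n
launch(vec_add, grid_dim, block_dim, a, b, c, n)
```

With n = 1000 and 512 threads per block, two blocks are launched and the last 24 threads of the second block do nothing.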

Page 18: GPU Accelerated Domain Decomposition

Domain Decomposition for Boundary Value Problems

Page 19: GPU Accelerated Domain Decomposition

Discrete Boundary Value Problems

Boundary Value Problems have numerous applications:
• Solutions to general PDEs,
• Finite Element Methods (solving heat equations, antenna simulations, deformation),
• Fluid simulations (Navier-Stokes equations, Smoothed Particle Hydrodynamics),
• Radial Basis Functions (smooth data interpolation),
• . . .

Basic form:

$f(x) = \sum_i q_i \Phi(\|x - x_i\|)$

• $\Phi(r)$ is some smooth kernel function.
• $x_i$ is an interpolation center / observation site.

Page 21: GPU Accelerated Domain Decomposition

Solving Boundary Value Problems

Given a set of observation sites and observed values $f(x_i) = b_i$, compute coefficients $q_i$.

Define matrix $A$ with $A_{i,j} = \Phi(\|x_i - x_j\|)$.

Solve for $q$ in $Aq = b$.

Good for about 5000 observations. What about 1,000,000?
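A small CPU sketch of the direct approach may help show why it stops scaling. It assumes 1D sites, a Gaussian kernel with sigma = 1 and 8-byte entries; all function names are illustrative and not taken from the talk.

```python
import math


def gaussian(r, sigma=1.0):
    """Gaussian kernel Phi(r); sigma = 1 is an illustrative choice."""
    return math.exp(-(r * r) / (sigma * sigma))


def kernel_matrix(sites, sigma=1.0):
    """Dense A with A[i][j] = Phi(|x_i - x_j|) for 1D sites."""
    return [[gaussian(abs(xi - xj), sigma) for xj in sites] for xi in sites]


def solve(A, b):
    """Solve A q = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    q = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        s = sum(M[i][j] * q[j] for j in range(i + 1, n))
        q[i] = (M[i][n] - s) / M[i][i]
    return q


def evaluate(x, sites, q, sigma=1.0):
    """f(x) = sum_i q_i Phi(|x - x_i|)."""
    return sum(qi * gaussian(abs(x - xi), sigma) for qi, xi in zip(q, sites))


def dense_bytes(n, bytes_per_entry=8):
    """Storage for a dense n x n matrix, assuming 8-byte entries."""
    return n * n * bytes_per_entry


sites = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 3.0, 2.0, 5.0, 4.0]
q = solve(kernel_matrix(sites), values)
```

Under these assumptions a dense matrix for 5000 sites needs 200 MB, while 1,000,000 sites would need 8 TB, before counting the cubic cost of the elimination itself.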

Page 22: GPU Accelerated Domain Decomposition

Our method

Method of Yokota et al. (2010), PetRBF.

GMRES: Generalised Minimal Residual method.

Solves the following problem:

$\min_q \|b - Aq\|$.

Given a preconditioner $M$, compute iteratively

$q_{n+1} = q_n + M (b - A q_n), \quad q_0 = 0$.

A perfect preconditioner would be

$M = A^{-1} \implies q_1 = A^{-1} b = q$.

We derive an approximation of $A^{-1}$ using the Schwarz method.
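The role of the preconditioner can be seen on a tiny system. The sketch below implements only the update written above, which on its own is a preconditioned fixed-point iteration (a full GMRES solver additionally minimises the residual over a Krylov subspace). The 2 × 2 matrix and the diagonal (Jacobi) preconditioner are assumptions chosen for illustration; they are not the Schwarz preconditioner of the talk.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def iterate(A, M, b, steps):
    """Apply q_{n+1} = q_n + M (b - A q_n) `steps` times from q_0 = 0."""
    q = [0.0] * len(b)
    for _ in range(steps):
        residual = [bi - yi for bi, yi in zip(b, matvec(A, q))]
        correction = matvec(M, residual)
        q = [qi + ci for qi, ci in zip(q, correction)]
    return q


def residual_norm(A, q, b):
    """Largest absolute entry of b - A q."""
    return max(abs(bi - yi) for bi, yi in zip(b, matvec(A, q)))


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

# Perfect preconditioner: the exact inverse of A (determinant 11).
A_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]
# Cheap preconditioner: inverse of the diagonal of A only.
jacobi = [[1.0 / 4.0, 0.0], [0.0, 1.0 / 3.0]]

q_perfect = iterate(A, A_inv, b, 1)
q_jacobi_1 = iterate(A, jacobi, b, 1)
q_jacobi_25 = iterate(A, jacobi, b, 25)
```

With the exact inverse a single step reaches the solution; with the diagonal approximation one step is still far off and the residual shrinks gradually over further steps. A better approximation of $A^{-1}$ sits between these two extremes.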

Page 23: GPU Accelerated Domain Decomposition

Kernel properties

$\Phi(r)$ must have compact support.

Gaussian: $\Phi(r) = \exp(-r^2 / \sigma^2)$.

Set any value $\Phi(r) < \varepsilon$ to 0.

Now $A$ is a sparse matrix (store index and value of non-zero entries).
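The threshold translates directly into a cutoff radius beyond which the kernel is treated as zero. A short derivation, where the value $\varepsilon = 10^{-5}$ is only an illustrative assumption:

```latex
\Phi(r) < \varepsilon
\iff \exp\!\left(-\frac{r^2}{\sigma^2}\right) < \varepsilon
\iff r^2 > \sigma^2 \ln\frac{1}{\varepsilon}
\iff r > r_c = \sigma \sqrt{\ln\frac{1}{\varepsilon}},
\qquad
\varepsilon = 10^{-5} \;\Rightarrow\; r_c \approx 3.39\,\sigma .
```

Only pairs of points closer than $r_c$ contribute a non-zero entry to $A$, so the number of stored entries per row depends on the local point density rather than on the total number of points.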

[Figure: kernel plot; horizontal axis from −3 to 3, vertical axis from 0 to 1.]

Page 27: GPU Accelerated Domain Decomposition

Break the problem down

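Each sub-matrix $A_{\Omega_i}$ is the matrix $A$ computed only from the subset of points in $\Omega_i$.

Construct sparse restriction matrices

$R_i x = [I \; 0] \begin{bmatrix} x_{\Omega_i} \\ x_{\Omega \setminus \Omega_i} \end{bmatrix}$ and $\tilde{R}_i x = [I \; 0] \begin{bmatrix} x_{\tilde{\Omega}_i} \\ x_{\Omega \setminus \tilde{\Omega}_i} \end{bmatrix}$,

where $\tilde{\Omega}_i$ is the padded (overlapping) version of $\Omega_i$.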

Page 28: GPU Accelerated Domain Decomposition

Additive Schwarz Method (ASM)

Compute the inverse of each sub-matrix, and add them together.

Matrix is symmetric, but convergence is slower and less stable.

Preconditioner:

$M = A^{-1}_{\mathrm{ASM}} = \sum_i R_i^T A_i^{-1}$.

Page 29: GPU Accelerated Domain Decomposition

Restricted Additive Schwarz Method (RASM)

Compute the inverse of each sub-matrix, but restrict rows to the original domain $\Omega_i$.

Matrix is non-symmetric, but convergence time is improved.

Makes it a bit more fiddly to calculate.

Preconditioner:

$M = A^{-1}_{\mathrm{RASM}} = \sum_i R_i^T A_i^{-1}$.
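The difference between the two variants is easiest to see by assembling both on a toy problem. The sketch below assumes the usual formulation in which each sub-matrix is built on a padded, overlapping index set: the additive variant scatters every row of each sub-inverse into M, while the restricted variant keeps only the rows belonging to the original, non-overlapping set. The 1D points, the index sets and the helper names (`invert`, `schwarz`) are illustrative assumptions, and a Gaussian kernel with sigma = 1 stands in for the kernel.

```python
import math


def kernel_matrix(points, sigma=1.0):
    """Dense Gaussian kernel matrix for 1D points (sigma is assumed)."""
    return [[math.exp(-((p - q) ** 2) / sigma ** 2) for q in points]
            for p in points]


def invert(A):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def schwarz(points, owned, padded, restricted):
    """Assemble M from the inverses of the padded sub-matrices.

    owned[i]   : global indices in the original domain (a partition)
    padded[i]  : global indices in the padded domain (superset of owned[i])
    restricted : False keeps all rows (additive), True keeps owned rows only
    """
    n = len(points)
    M = [[0.0] * n for _ in range(n)]
    for own, pad in zip(owned, padded):
        sub_inv = invert(kernel_matrix([points[k] for k in pad]))
        keep = set(own) if restricted else set(pad)
        for a, row in enumerate(pad):
            if row in keep:
                for c, col in enumerate(pad):
                    M[row][col] += sub_inv[a][c]
    return M


points = [float(i) for i in range(8)]
owned = [[0, 1, 2, 3], [4, 5, 6, 7]]
padded = [[0, 1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7]]
M_asm = schwarz(points, owned, padded, restricted=False)
M_rasm = schwarz(points, owned, padded, restricted=True)

# Sanity check: one subdomain covering everything gives the exact inverse.
everything = [list(range(8))]
M_full = schwarz(points, everything, everything, restricted=True)
```

In the restricted variant every row of M is written by exactly one subdomain, so rows in the overlap are not summed twice. That is also why the result is in general no longer symmetric.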

Page 30: GPU Accelerated Domain Decomposition

Domain Decomposition on the GPU

Page 31: GPU Accelerated Domain Decomposition

Summary of problem components

1. Domain decomposition of input points into $\Omega_i$ and $\tilde{\Omega}_i$.

2. Compute kernel matrix $A$.

3. Compute Schwarz preconditioner $M = A^{-1}_{\mathrm{RASM}}$.

4. Solve GMRES

$q_{n+1} = q_n + M (b - A q_n)$.

Page 33: GPU Accelerated Domain Decomposition

1. Decompose the domain

Given points x and

Define $\tilde{\Omega}_i = \Omega_i + \Delta_i$, where $\Delta_i$ is some padding applied to the domain.

For each point, classify it as either OUTSIDE_BOTH, INSIDE_OVERLAP, or INSIDE_BOTH.

Performance is terrible: about 45s to sort 100,000 points.

Page 34: GPU Accelerated Domain Decomposition

1. Fast domain decomposition

Alternative overlap

$\tilde{\Omega}_i = \Omega_i + \sum_{j \in N_i} \Omega_j$.

Given $x$ in dimension $d$ and resolution vector $\mathrm{res} \in \mathbb{N}^d$, bucket sort into a grid.

Consists of two GPU passes:
• pointHash(): called per point, determines which cell each point is in, and
• buildGrid(): called per cell, inverts the point hash structure into a grid.

2.5 million points sorted in 1.25s!
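The two passes can be sketched on the CPU as a counting sort. The names `point_hash` and `build_grid` mirror the pass names above, but their bodies are assumptions: the talk does not show the kernels, the bounding box `lo`/`hi` is a hypothetical parameter, and this version runs sequentially rather than one thread per point or per cell.

```python
def point_hash(point, lo, hi, res):
    """Flat index of the grid cell containing `point`.

    The grid has res[k] cells along axis k over the box [lo, hi].
    """
    index = 0
    for k in range(len(res)):
        t = (point[k] - lo[k]) / (hi[k] - lo[k])
        # Clamp so that points on the box boundary land in a valid cell.
        cell = min(max(int(t * res[k]), 0), res[k] - 1)
        index = index * res[k] + cell
    return index


def build_grid(hashes, num_cells):
    """Invert point -> cell into cell -> points.

    Returns (cell_start, point_ids); the points of cell c are
    point_ids[cell_start[c]:cell_start[c + 1]].
    """
    counts = [0] * num_cells
    for h in hashes:
        counts[h] += 1
    cell_start = [0] * (num_cells + 1)
    for c in range(num_cells):
        cell_start[c + 1] = cell_start[c] + counts[c]
    cursor = cell_start[:-1]          # next free slot in each cell
    point_ids = [0] * len(hashes)
    for p, h in enumerate(hashes):
        point_ids[cursor[h]] = p
        cursor[h] += 1
    return cell_start, point_ids


points = [(0.1, 0.1), (0.9, 0.9), (0.2, 0.8), (0.15, 0.05), (1.0, 1.0)]
lo, hi, res = (0.0, 0.0), (1.0, 1.0), (2, 2)
hashes = [point_hash(p, lo, hi, res) for p in points]
cell_start, point_ids = build_grid(hashes, res[0] * res[1])
```

With the grid in place, the overlap region of a cell is simply the union of its neighbouring cells, so no per-point classification against every subdomain is needed.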

Page 35: GPU Accelerated Domain Decomposition

2. Compute the kernel matrix

Matrix is normally too big for main memory.

Solution: Compute each row (in parallel) and pack into a sparse matrix structure.

A is stored in a Compressed Sparse Row (CSR) structure.

CSR makes pre-multiplying a vector by A fast.

Improve computation using the existing domain decomposition.
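As an illustration of the storage scheme, the sketch below packs the rows of a thresholded Gaussian kernel matrix into the three CSR arrays and multiplies the result by a vector. It is a sequential CPU version with assumed parameters (1D points, sigma = 1, threshold 0.1), and it tests every pair of points rather than using the grid to limit the candidates.

```python
import math


def build_csr(points, sigma, eps):
    """Pack the truncated kernel matrix row by row into CSR arrays."""
    values, col_index, row_ptr = [], [], [0]
    for xi in points:                  # each row is independent of the others
        for j, xj in enumerate(points):
            phi = math.exp(-((xi - xj) ** 2) / sigma ** 2)
            if phi >= eps:             # entries below the threshold are dropped
                values.append(phi)
                col_index.append(j)
        row_ptr.append(len(values))    # end of this row in `values`
    return values, col_index, row_ptr


def csr_matvec(values, col_index, row_ptr, x):
    """y = A x, touching only the stored non-zero entries."""
    y = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        y.append(sum(values[k] * x[col_index[k]] for k in range(start, end)))
    return y


points = [0.0, 1.0, 2.0, 3.0]
values, col_index, row_ptr = build_csr(points, sigma=1.0, eps=0.1)
y = csr_matvec(values, col_index, row_ptr, [1.0, 1.0, 1.0, 1.0])
```

Each output row of the product reads one contiguous slice of `values`, which is what makes the row-wise layout convenient for assigning one row to one thread.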

Page 36: GPU Accelerated Domain Decomposition

3. Compute the preconditioner

Construct each kernel sub-matrix $A_i$ from the restricted point set $R_i x$.

Compute each matrix inverse $A_i^{-1}$ using the CUDA-accelerated library CULA.

Combine restricted rows into preconditioner.

M is packed in CSR format.

Each matrix $A_i$ can be inverted in parallel on multiple CPUs.

Page 37: GPU Accelerated Domain Decomposition

4. Solve GMRES

Observe that

$q_{n+1} = q_n + M (b - A q_n)$

can be simplified.

Define $g(A, x, b, \alpha) = b + \alpha A x$.

Then GMRES becomes a two-step process:

$v = g(A, q_n, b, -1)$

$q_{n+1} = g(M, v, q_n, 1)$
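A direct transcription of the two steps shows that a single multiply-add routine covers the whole update. The sketch uses dense lists and a made-up 2 × 2 system with a diagonal M purely for illustration; in the talk both A and M are sparse, so g would be a sparse matrix-vector product.

```python
def g(A, x, b, alpha):
    """Return b + alpha * A x for a dense matrix A given as a list of rows."""
    return [bi + alpha * sum(a * xj for a, xj in zip(row, x))
            for row, bi in zip(A, b)]


def step(A, M, b, q):
    """One update q_{n+1} = q_n + M (b - A q_n), written with g only."""
    v = g(A, q, b, -1.0)      # v = b - A q_n  (the residual)
    return g(M, v, q, 1.0)    # q_{n+1} = q_n + M v


A = [[5.0, 2.0], [2.0, 4.0]]
M = [[0.2, 0.0], [0.0, 0.25]]   # assumed diagonal approximation of A^{-1}
b = [3.0, 2.0]

q_first = step(A, M, b, [0.0, 0.0])
q = [0.0, 0.0]
for _ in range(60):
    q = step(A, M, b, q)
```

Starting from zero, the first step returns M b, and repeated steps approach the solution (0.5, 0.25) of this small system.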

Page 38: GPU Accelerated Domain Decomposition

Results and Conclusions

Page 39: GPU Accelerated Domain Decomposition

Smooth image scaling

$x$ is the vector of pixel positions. $b_{r,g,b}$ is the colour value vector.

Solve for the RBF coefficients $q_{r,g,b}$ of each colour channel.

[Figure: the image at 100 × 100, 256 × 256 (original) and 500 × 500.]

Page 40: GPU Accelerated Domain Decomposition

Lagergren et al. 2010, about 1 fps

Page 42: GPU Accelerated Domain Decomposition

Results

On my Quadro FX 3700, maxThreadsPerBlock=512.

100,000 random vertices in 3D, computed in 102.83s.

Task                                Properties                          Time (s)
Segmentation Ω_i                    Old method                          ≈ 45
Constructing coefficient matrix A   Row occupancy 0.001%                46.98
Constructing preconditioner M       2744 submatrices, average 40 × 40   16.39
Running GMRES                       RMS < 0.00001 in 5 steps            0.01

Not particularly impressive performance.

Should expect 1,000,000 at ≈ 1 fps.

Problem is still limited by hardware constraints.

Page 44: GPU Accelerated Domain Decomposition

Solution: Throw more hardware at the problem

Problem is ridiculously parallel.
• Grid partition $\Omega_i$ bucket sorting on multiple GPUs,
• Kernel matrix $A$ in chunks on separate GPUs,
• Matrix inversion $A_i^{-1}$ on multiple GPUs,
• . . .

Page 46: GPU Accelerated Domain Decomposition

Conclusions

Parallelized Domain Decomposition problems are a good use of graphics hardware.

Can expect an exponential performance improvement.

Interface is simple. . .

BUT memory management is very difficult.

Debugging is a nightmare.
