Linear Algebra on GPUs

Jens Krüger Technische Universität München

Linear algebra?

1. Why are we interested in Linear Algebra?

It is THE machinery to solve PDEs.

PDEs are at the core of many graphics applications:

Physics-based simulation, animation, mesh fairing …

LA on GPUs?

2. … and why put LA on GPU?

A perfect couple …

GPUs are fast stream processors, and many LA operations are “streamable”

… which goes hand in hand

The solution is already on the GPU and ready for display

Getting started …

GPU as workhorse for numerical computations

• Computer graphics applications
• Visual simulation
• Visual computing
• Education and training
• Basic linear algebra operators
• General linear algebra package
• Programmable GPUs
• High bandwidth
• Parallel computing

Internal affairs

Vector representation

• Per-pixel vs. per-vertex operations
– 6 gigapixels/second vs. 0.7 gigavertices/second
– Efficient texture memory cache
– Texture read-write access

– 2D textures are even better
– 2D RGBA textures really rock

– Textures best we can do

[Figure: a vector of N elements stored in a 2D texture]

Representation (cont.)

Dense Matrix representation
– Treat a dense matrix as a set of column vectors
– Again, store these vectors as 2D textures

[Figure: an N x N matrix is split into N column vectors (1 … i … N); each vector is stored as one of N 2D textures]

Representation (cont.)

Banded Sparse Matrix representation
– Treat a banded matrix as a set of diagonal vectors
– Combine opposing vectors to save space

[Figure: an N x N banded matrix with 2 diagonal vectors, stored as 2 2D textures; diagonal i is combined with the opposing diagonal N-i]

Operations 1

• Vector-Vector Operations
– Reduced to 2D texture operations
– Coded in pixel shaders

• Example: Vector1 + Vector2 → Vector3

[Figure: a static quad is rendered. The vertex shader is a pass-through. The pixel shader reads Vector 1 from TexUnit 0 and Vector 2 from TexUnit 1 and returns tex0 + tex1. The result, Vector 3, is written to a render texture.]
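The vector addition above can be imitated on the CPU to make the data flow concrete. The following is only a sketch under assumptions: `Texel`, `PackedVector`, `pack`, `unpack`, `addShader` and `add` are hypothetical names that are not part of the framework, the texture is assumed to be square, and the loop over texels stands in for rasterizing the static quad.

```cpp
#include <cstddef>
#include <vector>

// One RGBA texel holds four consecutive vector elements.
struct Texel { float c[4]; };

// Hypothetical stand-in for a vector stored as a 2D RGBA texture.
struct PackedVector {
    std::size_t n;              // logical vector length
    std::size_t width, height;  // texture size in texels
    std::vector<Texel> texels;  // row-major storage, zero padded
};

// Pack a vector into the smallest square RGBA texture that holds it.
PackedVector pack(const std::vector<float>& v) {
    std::size_t texelCount = (v.size() + 3) / 4;
    std::size_t side = 1;
    while (side * side < texelCount) ++side;
    PackedVector p{v.size(), side, side, std::vector<Texel>(side * side)};
    for (std::size_t i = 0; i < v.size(); ++i) p.texels[i / 4].c[i % 4] = v[i];
    return p;
}

// Read the logical vector back out of the texture.
std::vector<float> unpack(const PackedVector& p) {
    std::vector<float> v(p.n);
    for (std::size_t i = 0; i < p.n; ++i) v[i] = p.texels[i / 4].c[i % 4];
    return v;
}

// The "pixel shader": return tex0 + tex1 for one texel.
Texel addShader(const Texel& tex0, const Texel& tex1) {
    Texel out;
    for (int k = 0; k < 4; ++k) out.c[k] = tex0.c[k] + tex1.c[k];
    return out;
}

// The "render pass": run the shader once per texel of the target texture.
// Both inputs are assumed to have the same size.
PackedVector add(const PackedVector& a, const PackedVector& b) {
    PackedVector out = a;
    for (std::size_t y = 0; y < a.height; ++y)
        for (std::size_t x = 0; x < a.width; ++x) {
            std::size_t idx = y * a.width + x;
            out.texels[idx] = addShader(a.texels[idx], b.texels[idx]);
        }
    return out;
}
```

Calling `unpack(add(pack(v1), pack(v2)))` returns the element-wise sum. Because each texel carries four floats, one shader invocation processes four vector elements at once.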

Operations 2 (reduce)

Reduce operation for scalar products

Reduce m x n region in fragment shader

[Figure: original texture → 1st pass → 2nd pass → …]
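The multi-pass reduction can be sketched on the CPU as follows. This is an illustration under assumptions rather than the framework's implementation: the region size is fixed at 2 x 2, the texture is square with a power-of-two side, and `Texture`, `reducePass`, `reduceAdd` and `dot` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// A square single-channel texture, row-major.
struct Texture {
    std::size_t size;         // side length in texels
    std::vector<float> data;  // size * size values
};

// One render pass: every output texel sums a 2 x 2 block of the input.
Texture reducePass(const Texture& in) {
    std::size_t half = in.size / 2;
    Texture out{half, std::vector<float>(half * half)};
    for (std::size_t y = 0; y < half; ++y)
        for (std::size_t x = 0; x < half; ++x) {
            std::size_t base = (2 * y) * in.size + 2 * x;
            out.data[y * half + x] = in.data[base] + in.data[base + 1] +
                                     in.data[base + in.size] +
                                     in.data[base + in.size + 1];
        }
    return out;
}

// Repeat the pass until a 1 x 1 texture is left.
Texture reduceAdd(Texture t) {
    while (t.size > 1) t = reducePass(t);
    return t;
}

// Scalar product: multiply component-wise, then reduce.
Texture dot(const Texture& a, const Texture& b) {
    Texture product{a.size, std::vector<float>(a.data.size())};
    for (std::size_t i = 0; i < a.data.size(); ++i)
        product.data[i] = a.data[i] * b.data[i];
    return reduceAdd(product);
}
```

The result stays a 1 x 1 texture instead of being converted to a plain `float`, so later operations can consume it without a read-back.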

The “single float” on GPUs

Some operations generate single float values, e.g. reduce

Read-back to main memory is slow

Keep single floats on the GPU as 1x1 textures

Operations (cont.)

Matrix-Vector Operations
– Split it up into Vector-Vector operations

[Figure: the dense matrix as N column vectors in N 2D textures, and the banded matrix as 2 diagonal vectors in 2 2D textures]
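For the dense case, one way to split the product into vector-vector operations is to accumulate the columns of the matrix, each scaled by the matching element of x. The sketch below assumes this column-wise accumulation; `Vec`, `addScaled` and `denseMatVec` are hypothetical names, and plain arrays stand in for textures.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Vector-vector operation: out = out + s * v.
void addScaled(Vec& out, const Vec& v, float s) {
    for (std::size_t j = 0; j < out.size(); ++j) out[j] += s * v[j];
}

// columns[i] is the i-th column of the matrix.
// The product A*x is the sum over i of x[i] * columns[i].
Vec denseMatVec(const std::vector<Vec>& columns, const Vec& x) {
    Vec result(columns.empty() ? 0 : columns[0].size(), 0.0f);
    for (std::size_t i = 0; i < columns.size(); ++i)
        addScaled(result, columns[i], x[i]);
    return result;
}
```

Each call to `addScaled` corresponds to one vector-vector pass, so an N x N dense matrix needs N such passes.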

Operations

In depth example: Vector / Banded-Matrix Multiplication

A x = b

Example (cont.)

Vector / Banded-Matrix Multiplication

A x = b

[Figure: Pass 2, with labels A and b]

Example (cont.)

Compute the result in 2 passes:

[Figure: Pass 1, with labels A, b, b′ and x]
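The arithmetic behind the banded product can be sketched as follows. The storage convention is an assumption made for this sketch: a diagonal with offset k is stored as a full-length vector whose entry j holds A[j][(j + k) mod N], which is one way to combine a diagonal with its opposing diagonal in a single vector. `Diagonal` and `bandedMatVec` are hypothetical names, and the split into two render passes is not modelled.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// One stored diagonal: values[j] = A[j][(j + offset) % N].
// Entries 0 .. N-offset-1 belong to diagonal +offset, the remaining
// entries belong to the opposing diagonal offset-N.
struct Diagonal {
    std::size_t offset;
    Vec values;
};

// b = A * x, accumulated one diagonal at a time. Each diagonal contributes
// a component-wise product of its values with a shifted copy of x.
Vec bandedMatVec(const std::vector<Diagonal>& diagonals, const Vec& x) {
    const std::size_t n = x.size();
    Vec b(n, 0.0f);
    for (const Diagonal& d : diagonals)
        for (std::size_t j = 0; j < n; ++j)
            b[j] += d.values[j] * x[(j + d.offset) % n];
    return b;
}
```

With this layout a matrix with a handful of diagonals costs a handful of vector-vector operations, independent of how many zero entries the full matrix would contain.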

Building a Framework

Presented so far:

• Representations on the GPU for
– Single float values
– Vectors
– Matrices
  • Dense
  • Banded
  • Random sparse (see SIGGRAPH ‘03)

• Operations on these representations
– Add, multiply, reduce, …
– Upload, download, clear, clone, …

Framework Classes (UML)

[UML class diagram; classes and their public methods]

• clClass: getDevice
• clVector: getSize, clear
• clPackedMatrix
• clAbstractMatrix: getSize, matrixVectorOp
• clMemMan: getRenderSurface, getTexture, releaseTexture, releaseRenderSurface
• clUnpackedVector
• clPackedVector: unpack, repack
• clFragmentVector: reduce, multiply, copy, add, sub, vectorOp
• clFragmentMatrix: setRow, getRow
• clVertexMatrix: setData, setDataSorted
• clUnpackedMatrix
• clFloat: setData, getData, add, mul, sub, div, invert

Framework Example: CG

Encapsulate into classes for more complex algorithms
– Example use: Conjugate Gradient Method, complete source:

void clCGSolver::solveInit() {
    Matrix->matrixVectorOp(CL_SUB,X,B,R);      // R = A*x-b
    R->multiply(-1);                           // R = -R
    R->clone(P);                               // P = R
    R->reduceAdd(R, Rho);                      // rho = sum(R*R);
}

void clCGSolver::solveIteration() {
    Matrix->matrixVectorOp(CL_NULL,P,NULL,Q);  // Q = Ap;
    P->reduceAdd(Q,Temp);                      // temp = sum(P*Q);
    Rho->div(Temp,Alpha);                      // alpha = rho/temp;
    X->addVector(P,X,1,Alpha);                 // X = X + alpha*P
    R->subtractVector(Q,R,1,Alpha);            // R = R - alpha*Q
    R->reduceAdd(R,NewRho);                    // newrho = sum(R*R);
    NewRho->divZ(Rho,Beta);                    // beta = newrho/rho
    R->addVector(P,P,1,Beta);                  // P = R+beta*P;
    clFloat *temp; temp=NewRho;
    NewRho=Rho; Rho=temp;                      // swap rho and newrho pointers
}

void clCGSolver::solve(int maxI) {
    solveInit();
    for (int i = 0; i < maxI; i++) solveIteration();
}

int clCGSolver::solve(float rhoTresh, int maxI) {
    solveInit();
    Rho->clone(NewRho);
    int i;                                     // declared outside the loop so it can be returned
    for (i = 0; i < maxI && NewRho->getData() > rhoTresh; i++) solveIteration();
    return i;
}

Example 1: 2D Waves (explicit)

• Finite difference discretization:

$$\frac{\partial^2 x}{\partial t^2} = c^2 \left( \frac{\partial^2 x}{\partial u^2} + \frac{\partial^2 x}{\partial v^2} \right)$$

$$x^{t+1}_{i,j} = \beta \left( x^{t}_{i-1,j} + x^{t}_{i,j-1} + x^{t}_{i+1,j} + x^{t}_{i,j+1} \right) + (2 - 4\beta)\, x^{t}_{i,j} - x^{t-1}_{i,j}$$

• You could write a custom shader for this filter

• Think about this as a matrix-vector operation

[Figure: for a grid with 9 unknowns, $x^{t+1} = A\,x^{t} - x^{t-1}$, where $x^{t}$ is the vector $(x^{t}_{1}, \dots, x^{t}_{9})$ and A is a banded 9 x 9 matrix]

2D Waves (cont.)

One Time Matrix Initialization:

for (i=sY;i<sX*sY;i++) data[i] = beta; // setup diagonal-sY

matrix->getRow(sX*(sY-1))->setData(data);

for (i=0;i<sX*sY;i++) data[i] = (i%sX) ? beta : 0; // setup diagonal-1

matrix->getRow(sX*sY-1)->setData(data);

for (i=0;i<sX*sY;i++) data[i] = 2-4*beta; // setup diagonal

matrix->getRow(sX*sY)->setData(data);

for (i=0;i<sX*sY;i++) data[i] = ((i+1)%sX) ? beta : 0; // setup diagonal+1

matrix->getRow(sX*sY+1)->setData(data);

for (i=0;i<sX*(sY-1);i++) data[i] = beta; // setup diagonal+sY

matrix->getRow(sX*(sY+1))->setData(data);

Per Frame Iteration:

clMatrix->matrixVectorOp(CL_SUB,clCurrent,clLast,clNext); // next = matrix*current-last;

clLast->copyVector(clCurrent); // save for next iteration

clCurrent->copyVector(clNext); // save for next iteration

cluNext->unpack(clNext); // unpack for rendering

renderHF(cluNext->m_pVectorTexture); // render as heightfield
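One explicit time step can also be written directly as a stencil on the CPU, which is a convenient reference when checking the matrix setup. This is a sketch under assumptions: `Grid` and `waveStep` are hypothetical names, the grid is stored row by row with sX values per row, and values outside the grid are treated as zero.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<float>;  // sX * sY height values, row-major

// next = A*current - last, written out as a five-point stencil.
// Neighbours outside the grid contribute zero.
Grid waveStep(const Grid& current, const Grid& last,
              std::size_t sX, std::size_t sY, float beta) {
    Grid next(sX * sY);
    for (std::size_t j = 0; j < sY; ++j) {
        for (std::size_t i = 0; i < sX; ++i) {
            std::size_t idx = j * sX + i;
            float left  = (i > 0)      ? current[idx - 1]  : 0.0f;
            float right = (i + 1 < sX) ? current[idx + 1]  : 0.0f;
            float up    = (j > 0)      ? current[idx - sX] : 0.0f;
            float down  = (j + 1 < sY) ? current[idx + sX] : 0.0f;
            next[idx] = beta * (left + right + up + down)
                      + (2.0f - 4.0f * beta) * current[idx]
                      - last[idx];
        }
    }
    return next;
}
```

Per frame, the caller would compute `next`, then shift `current` into `last` and `next` into `current`, in the same way as the per-frame iteration in the listing.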

Example 2: 2D Waves (implicit)

Key Idea
– Use a different time discretization (e.g. Crank-Nicolson)
– Results in a system of linear equations
– Iterative solution using CG

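[Figure: for a grid with 9 unknowns, $A\,x^{t+1} = c^{t}$, where A is a banded 9 x 9 matrix, $x^{t+1}$ is the vector $(x^{t+1}_{1}, \dots, x^{t+1}_{9})$ and $c^{t}$ is the right-hand side $(c^{t}_{1}, \dots, c^{t}_{9})$]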

2D Waves (cont.)

One Time Matrix Initialization:

for (i=sY;i<sX*sY;i++) data[i] = -alpha; // setup diagonal-sY

matrix->getRow(sX*(sY-1))->setData(data);

for (i=0;i<sX*sY;i++) data[i] = (i%sX) ? -alpha : 0; // setup diagonal-1

matrix->getRow(sX*sY-1)->setData(data);

for (i=0;i<sX*sY;i++) data[i] = 4*alpha+1; // setup diagonal

matrix->getRow(sX*sY)->setData(data);

for (i=0;i<sX*sY;i++) data[i] = ((i+1)%sX) ? -alpha : 0; // setup diagonal+1

matrix->getRow(sX*sY+1)->setData(data);

for (i=0;i<sX*(sY-1);i++) data[i] = -alpha; // setup diagonal+sY

matrix->getRow(sX*(sY+1))->setData(data);

Per Frame Iteration:

cluRHS->computeRHS(cluLast, cluCurrent); // generate c(i,j,t)

clRHS->repack(cluRHS); // encode into RGBA

iSteps = pCGSolver->solve(iMaxSteps); // solve using CG

cluLast->copyVector(cluCurrent); // save for next iteration

clNext->unpack(cluCurrent); // unpack for rendering

renderHF(cluCurrent->m_pVectorTexture); // render as heightfield

The End

Thank you!

Questions?

For more information, browse to: http://wwwcg.in.tum.de/GPU

http://www.gpgpu.org
