linear algebra on gpus
DESCRIPTION
Linear Algebra on GPUs. Jens Krüger Technische Universität München. Linear algebra?. Why are we interested in Linear Algebra? It is THE machinery to solve PDEs PDEs are at the core of many graphics applications Physics based simulation, Animation, Mesh fairing …. LA on GPUs?. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/1.jpg)
Linear Algebra on GPUsLinear Algebra on GPUs
Jens Krüger Technische Universität München
![Page 2: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/2.jpg)
Linear algebra?
1. Why are we interested in Linear Algebra?
It is THE machinery to solve PDEs1. PDEs are at the core of many graphics applications
Physics based simulation, Animation, Mesh fairing …
![Page 3: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/3.jpg)
LA on GPUs?
2. … and why put LA on GPU?
A perfect couple…GPUs are fast stream processors,and many LA operations are “streamable”
…which goes hand in handThe solution is already on the GPU and ready for display
![Page 4: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/4.jpg)
Getting started …
GPU as workhorse for numerical computations
Computergraphics
applications
Visual simulationVisual computing
Education and Training
Basic linearalgebra operators
General linearalgebra package
Programmable GPUsHighbandwidth
Parallelcomputing
![Page 5: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/5.jpg)
Getting started …
GPU as workhorse for numerical computations
Computergraphics
applications
Visual simulationVisual computing
Education and Training
Basic linearalgebra operators
General linearalgebra package
Programmable GPUsHighbandwidth
Parallelcomputing
![Page 6: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/6.jpg)
• Per-pixel vs. per-vertex operations– 6 gigapixels/second vs. 0.7 gigavertices/second– Efficient texture memory cache– Texture read-write access
– 2D Textures are even better– 2D RGBA textures really rock
– Textures best we can do
Internal affairs
1 N1
N
Vector representation
![Page 7: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/7.jpg)
Representation (cont.)
Dense Matrix representation– Treat a dense matrix as a set of column vectors– Again, store these vectors as 2D textures
Matrix
N
N
i N Vectors
......N
1 i N
N 2D-Textures
... ...1 i N
![Page 8: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/8.jpg)
N
i
Representation (cont.)
Banded Sparse Matrix representation– Treat a banded matrix as a set of diagonal vectors– Combine opposing vectors to save space
N
Matrix 2 Vectors
N
1 2
2 2D-Textures1 2
N-i
![Page 9: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/9.jpg)
Operations 1
• Vector-Vector Operations– Reduced to 2D texture operations– Coded in pixel shaders
• Example: Vector1 + Vector2 Vector3
return tex0 + tex1Pass through
Vector 1 Vector 2
TexUnit 0 TexUnit 1
Vertex Shader Pixel Shader
Vector 3
Render TextureStatic quad
![Page 10: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/10.jpg)
Operations 2 (reduce)
Reduce operation for scalar products
Reduce m x n regionin fragment shader
2 passnd
......
st1 pass
...
original Texture
...
![Page 11: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/11.jpg)
The “single float” on GPUs
Some operations generate single float values e.g. reduce
Read-back to main-mem is slow Keep single floats on the GPU as 1x1 textures
...
![Page 12: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/12.jpg)
Operations (cont.)
Matrix-Vector Operations– Split it up into Vector – Vector operations
Matrix
N
N
i N Vectors
......N
1 i N
N 2D-Textures
... ...1 i N
N
N
Matrixi 2 Vectors
N
1 2
2 2D-Textures1 2
N-i
![Page 13: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/13.jpg)
Operations
In depth example: Vector / Banded-Matrix MultiplicationA xb
=
![Page 14: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/14.jpg)
Example (cont.)
Vector / Banded-Matrix Multiplication
A xb
=
A b
![Page 15: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/15.jpg)
Pass 2Pass 2
Example (cont.)
Compute the result in 2 Passes:
AA
bb
=
b‘b‘
Pass 1Pass 1
xx
![Page 16: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/16.jpg)
Building a Framework
Presented so far:
• Representations on the GPU for– Single float values– Vectors– Matrices
• Dense• Banded• Random sparse (see SIGGRAPH ‘03)
• Operations on these representations– Add, multiply, reduce, …– Upload, download, clear, clone, …
![Page 17: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/17.jpg)
Framework Classes (UML)
+getDevice
clClass
+getSize+clear
clVector
clPackedMatrix
+getSize+matrixVectorOp
clAbstractMatrix+getRenderSurface+getTexture+releaseTexture+releaseRenderSurface
clMemMan
clUnpackedVector+unpack+repack
clPackedVector
+reduce+multiply+copy+add+sub+vectorOp
clFragmentVector
+setRow+getRow
clFragmentMatrix
+setData+setDataSorted
clVertexMatrixclUnpackedMatrix
+setData+getData+add+mul+sub+div+invert
clFloat
![Page 18: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/18.jpg)
Framework Example: CG
Encapsulate into Classes for more complex algorithms– Example use: Conjugate Gradient Method, complete source:
void clCGSolver::solveInit() {Matrix->matrixVectorOp(CL_SUB,X,B,R); // R = A*x-bR->multiply(-1); // R = -RR->clone(P); // P = R R->reduceAdd(R, Rho); // rho = sum(R*R);
}void clCGSolver::solveIteration() {
Matrix->matrixVectorOp(CL_NULL,P,NULL,Q);// Q = Ap;P->reduceAdd(Q,Temp); // temp = sum(P*Q);Rho->div(Temp,Alpha); // alpha = rho/temp;X->addVector(P,X,1,Alpha); // X = X + alpha*PR->subtractVector(Q,R,1,Alpha); // R = R - alpha*QR->reduceAdd(R,NewRho); // newrho = sum(R*R);NewRho->divZ(Rho,Beta); // beta = newrho/rhoR->addVector(P,P,1,Beta); // P = R+beta*P;clFloat *temp; temp=NewRho;NewRho=Rho; Rho=temp; // swap rho and newrho pointers
}void clCGSolver::solve(int maxI) {
solveInit();for (int i = 0;i< maxI;i++) solveIteration();
}int clCGSolver::solve(float rhoTresh, int maxI) {
solveInit(); Rho->clone(NewRho);for (int i = 0;i< maxI && NewRho.getData() > rhoTresh;i++) solveIteration();return i;
}
![Page 19: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/19.jpg)
Example 12D Waves (explicit)
• Finite difference discretization:
• You could write a custom shader for this filter
• Think about this as a matrix-vector operation
t 1 t t t t t t 1i, j i 1, j i, j 1 i 1, j i, j 1 i, j i, jx x x x x (2 4 ) x x
2 2 22
2 2 2
x x xct u v
tx 3tx 4tx 5tx 6tx 7
tx 9
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
tx 1tx 2
tx 7
tx 3tx 4tx 5tx 6tx 7
tx 9
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
tx 1
tx 3tx 4tx 5tx 6tx 7
tx 9
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
tx 1tx 2
tx 7
2
2
2
2
2
2
2
2
2
=-
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
11tx1
2tx1
3tx1
4tx1
5tx1
6tx1
7t-x1
8tx1
9tx
![Page 20: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/20.jpg)
2D Waves (cont.)
clMatrix->matrixVectorOp(CL_SUB,clCurrent,clLast,clNext); // next = matrix*current-last;
clLast->copyVector(clCurrent); // save for next iteration
clCurrent->copyVector(clNext); // save for next iteration
cluNext->unpack(clNext); // unpack for rendering
renderHF(cluNext->m_pVectorTexture); // render as heightfield
for (i=sY;i<sX*sY;i++) data[i] = ß; // setup diagonal-sY
matrix->getRow(sX*(sY-1))->setData(data);
for (i=0;i<sX*sY;i++) data[i] = (i%sX) ? ß : 0; // setup diagonal-1
matrix->getRow(sX*sY-1)->setData(data);
for (i=0;i<sX*sY;i++) data[i] = 2-4ß; // setup diagonal
matrix->getRow(sX*sY)->setData(data);
for (i=0;i<sX*sY;i++) data[i] = ((i+1)%sX) ? ß : 0; // setup diagonal+1
matrix->getRow(sX*sY+1)->setData(data);
for (i=0;i<sX*(sY-1);i++) data[i] = ß; // setup diagonal+sY
matrix->getRow(sX*(sY+1))->setData(data);
One Time Matrix Initialization:
Per Frame Iteration
2
2
2
2
2
2
2
2
2
![Page 21: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/21.jpg)
Key Idea– Use different time discretization (e.g. Crank Nicholson)– Results in system of linear equations– Iterative solution using CG
Example 22D Waves (implicit)
=
11
tx
12tx1
3tx1
4tx1
5tx1
6tx1
7tx1
8tx1
9tx
tc3tc4
tc5tc 6tc 7
tc 9
tc1
tc2
tc 7
![Page 22: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/22.jpg)
2D Waves (cont.)
cluRHS->computeRHS(cluLast, cluCurrent); // generate c(i,j,t)
clRHS->repack(cluRHS); // encode into RGBA
iSteps = pCGSolver->solve(iMaxSteps); // solve using CG
cluLast->copyVector(cluCurrent); // save for next iteration
clNext->unpack(cluCurrent); // unpack for rendering
renderHF(cluCurrent->m_pVectorTexture); // render as heightfield
for (i=sY;i<sX*sY;i++) data[i] = -alpha; // setup diagonal-sY
matrix->getRow(sX*(sY-1))->setData(data);
for (i=0;i<sX*sY;i++) data[i] = (i%sX) ? - alpha : 0; // setup diagonal-1
matrix->getRow(sX*sY-1)->setData(data);
for (i=0;i<sX*sY;i++) data[i] = 4*alpha+1 // setup diagonal
matrix->getRow(sX*sY)->setData(data);
for (i=0;i<sX*sY;i++) data[i] = ((i+1)%sX) ? -alpha:0; // setup diagonal+1
matrix->getRow(sX*sY+1)->setData(data);
for (i=0;i<sX*(sY-1);i++) data[i] = -alpha // setup diagonal+sY
matrix->getRow(sX*(sY+1))->setData(data);
One Time Matrix Initialization:
Per Frame Iteration
![Page 23: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/23.jpg)
Demos
![Page 24: Linear Algebra on GPUs](https://reader036.vdocuments.net/reader036/viewer/2022062310/56815daa550346895dcbdbef/html5/thumbnails/24.jpg)
The End
Thank you!
Questions?
For more infos, browse to:http://wwwcg.in.tum.de/GPU
http://www.gpgpu.org