CS 267 Dense Linear Algebra: Possible Class Projects


Page 1: CS 267  Dense Linear Algebra: Possible Class Projects


CS 267 Dense Linear Algebra:

Possible Class Projects

James Demmel

www.cs.berkeley.edu/~demmel/cs267_Spr09

Page 2: CS 267  Dense Linear Algebra: Possible Class Projects


Kinds of class projects

• Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
• Add missing functionality to these libraries
  - Possible impact: lots of users want it
• Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there bottlenecks or other problems in the architecture? Could they be fixed?
• Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while getting most of the performance? Should we produce future versions of the libraries this way?
• Experiment with new algorithms
  - Possible impact: find a better one!

Page 3: CS 267  Dense Linear Algebra: Possible Class Projects

Challenges to Libraries (and parallel SW in general)

• Minimizing communication costs
  - The cost of bandwidth and latency (to main memory or over a network) is growing exponentially relative to arithmetic
• Heterogeneous platforms
  - Different communication costs depending on destination
    • Same chip vs. different socket vs. different board …
  - CPU + GPU
    • Perform different operations at very different rates
• Dynamic scheduling & load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - May be faster to grab the next available task than to use a predesigned "perfectly balanced" schedule
  - The OS may give and take away resources on the fly
• Fault tolerance: how to recover when one processor fails

Page 4: CS 267  Dense Linear Algebra: Possible Class Projects

Strassen’s Matmul on Multicore or GPU

• Why is there no Strassen in most libraries?
  - See "Baleful Effect of Benchmarks…" by Prof. Kahan
• Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover?
• May want a hybrid: switch to the O(n^3) algorithm below a certain size (sketched after this list)
  - Autotuning?
• Lots of "blocking" opportunities, as for standard matmul
  - What is the least amount of data movement possible?
• How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  - Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
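A minimal NumPy sketch of the hybrid idea, assuming square matrices whose dimension halves evenly down to the crossover; the crossover value of 128 is illustrative and is exactly the kind of parameter an autotuner would pick per machine.

import numpy as np

def strassen(A, B, crossover=128):
    """Hybrid Strassen: recurse on half-size blocks, then switch to the
    standard O(n^3) kernel at or below the crossover size."""
    n = A.shape[0]
    if n <= crossover:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # Seven half-size products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, crossover)
    M2 = strassen(A21 + A22, B11, crossover)
    M3 = strassen(A11, B12 - B22, crossover)
    M4 = strassen(A22, B21 - B11, crossover)
    M5 = strassen(A11 + A12, B22, crossover)
    M6 = strassen(A21 - A11, B11 + B12, crossover)
    M7 = strassen(A12 - A22, B21 + B22, crossover)
    C = np.empty_like(A)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C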

Page 5: CS 267  Dense Linear Algebra: Possible Class Projects

Review: Alternative recursive GE formulation

• Toledo (1997)
  - Described without pivoting for simplicity
  - "Do left half of matrix, then right half"

  function [L,U] = RLU(A)                     … assume A is m by n
    if (n = 1)
      L = A / A(1,1),  U = A(1,1)
    else
      [L1,U1] = RLU( A(1:m, 1:n/2) )          … do left half of A
                                              … let L11 denote the top n/2 rows of L1
      A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                              … update top n/2 rows of right half of A
      A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n)
                            - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                              … update the rest of the right half of A
      [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )    … do right half of A
      return L = [ L1, [0; L2] ]  and  U = [ U1, [ A(.,.) ; U2 ] ]

  A = L * U
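A runnable NumPy translation of the recursion, as a sketch under the same assumptions (m ≥ n, no pivoting, so the leading pivots must be nonzero); explicit copies replace the slide's in-place updates for readability.

import numpy as np
from scipy.linalg import solve_triangular

def rlu(A):
    """Recursive LU without pivoting: A = L @ U, with L m-by-n
    (unit lower triangular on its top n rows) and U n-by-n upper."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    if n == 1:
        return A / A[0, 0], A[:1, :1].copy()
    k = n // 2
    L1, U1 = rlu(A[:, :k])                    # do left half of A
    L11 = L1[:k, :]                           # top k rows of L1
    # Update top k rows of right half: A12 <- L11^{-1} A12
    A[:k, k:] = solve_triangular(L11, A[:k, k:], lower=True, unit_diagonal=True)
    # Schur complement update of the trailing block
    A[k:, k:] -= L1[k:, :] @ A[:k, k:]
    L2, U2 = rlu(A[k:, k:])                   # do right half of A
    L = np.block([[L11, np.zeros((k, n - k))], [L1[k:, :], L2]])
    U = np.block([[U1, A[:k, k:]], [np.zeros((n - k, k)), U2]])
    return L, U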

Page 6: CS 267  Dense Linear Algebra: Possible Class Projects

Register-file resident Linear Algebra on GPUs

• Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices, too large to fit in the "fast memory" (shared memory + registers) of the GPU
• There is also demand for solving many smaller problems in parallel, e.g. A(i) * x(i) = b(i) for many different A(1),…,A(k) and b(1),…,b(k) (see the batched sketch below)
• Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
  - e.g. a single-precision square matrix of dimension n = 128
• Question: Does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
• Question: Do we need BLAS3 code versions for such small matrices, or is BLAS2 enough?
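For orientation, a minimal NumPy sketch of the batched pattern on the CPU; a GPU implementation would map each small system to one multiprocessor. The batch count and matrix size are illustrative, and the diagonal shift just keeps the random systems well conditioned.

import numpy as np

# Many small independent systems A[i] x[i] = b[i], solved in one batched call;
# np.linalg.solve broadcasts over the leading "batch" dimension.
k, n = 1000, 128
A = np.random.rand(k, n, n).astype(np.float32) + n * np.eye(n, dtype=np.float32)
b = np.random.rand(k, n, 1).astype(np.float32)   # right-hand sides as n-by-1 matrices
x = np.linalg.solve(A, b)                        # shape (k, n, 1)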

Page 7: CS 267  Dense Linear Algebra: Possible Class Projects

Extend Vasily's GPU analysis and code to ATI

• Vasily's Best Student Paper Award from SC08 had two parts:
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR, and Cholesky
• What about ATI GPUs?
  - Both of the above aspects are interesting
  - An ATI GPU is available in the ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize algorithms differently for each, or does one algorithm (perhaps with different block sizes or other parameters) work for both (which would be simpler)?
• Other BLAS-like operations on the GPU
  - Needed for finite-element analysis

Page 8: CS 267  Dense Linear Algebra: Possible Class Projects

Missing Drivers in Sca/LAPACK

                             LAPACK            ScaLAPACK
Linear Equations
  LU                         xGESV             PxGESV
  Cholesky                   xPOSV             PxPOSV
  LDL^T                      xSYSV             missing
Least Squares (LS)
  QR                         xGELS             PxGELS
  QR + pivot                 xGELSY            missing driver
  SVD / QR                   xGELSS            missing driver
  SVD / D&C                  xGELSD            missing (intent)
  SVD / MRRR                 missing (oops)    missing (oops)
  QR + iterative refine.     missing           missing
Generalized LS
  LS + equality constr.      xGGLSE            missing
  Generalized LM             xGGGLM            missing
  Above + iterative ref.     missing           missing
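As a concrete example of the driver naming, the least-squares drivers in the table are selectable from SciPy's LAPACK wrappers; a small sketch using the divide-and-conquer SVD driver (xGELSD), with 'gelss' and 'gelsy' selecting the others:

import numpy as np
from scipy.linalg import lstsq

A = np.random.rand(100, 10)
b = np.random.rand(100)
# Solve min ||Ax - b|| via dGELSD (SVD, divide-and-conquer)
x, residues, rank, sv = lstsq(A, b, lapack_driver='gelsd')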

Page 9: CS 267  Dense Linear Algebra: Possible Class Projects

More missing drivers

                             LAPACK            ScaLAPACK
Symmetric EVD
  QR / Bisection+Invit       xSYEV / X         PxSYEV / X
  D&C                        xSYEVD            missing (intent)
  MRRR                       xSYEVR            missing
Nonsymmetric EVD
  Schur form                 xGEES / X         missing driver
  Vectors too                xGEEV / X         missing driver
SVD
  QR                         xGESVD            PxGESVD
  D&C                        xGESDD            missing (intent)
  MRRR                       missing (oops)    missing (oops)
  Jacobi                     xGESVJ            missing
Generalized Symmetric EVD
  QR / Bisection+Invit       xSYGV / X         PxSYGV / X
  D&C                        xSYGVD            missing (intent)
  MRRR                       missing           missing
Generalized Nonsymmetric EVD
  Schur form                 xGGES / X         missing
  Vectors too                xGGEV / X         missing
Generalized SVD
  Kogbetliantz               xGGSVD            missing (intent)
  MRRR                       missing (oops)    missing (oops)
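Similarly for the symmetric EVD rows: assuming SciPy 1.5 or later, the drivers xSYEV/xSYEVX, xSYEVD, and xSYEVR map onto the driver argument of scipy.linalg.eigh:

import numpy as np
from scipy.linalg import eigh

A = np.random.rand(200, 200)
A = (A + A.T) / 2                  # symmetrize
lam, U = eigh(A, driver='evr')     # MRRR driver dSYEVR
lam2, U2 = eigh(A, driver='evd')   # divide-and-conquer dSYEVD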

Page 10: CS 267  Dense Linear Algebra: Possible Class Projects

Missing matrix types in ScaLAPACK

• Symmetric, Hermitian, triangular
  - Band, packed
• Positive definite
  - Packed
• Orthogonal, unitary
  - Packed
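For reference, "band" storage keeps only the nonzero diagonals. A minimal SciPy sketch of the corresponding serial functionality (solveh_banded dispatches to LAPACK's symmetric positive definite banded/tridiagonal solvers); the 4-by-4 tridiagonal system is illustrative:

import numpy as np
from scipy.linalg import solveh_banded

# Lower band storage: row 0 = main diagonal, row 1 = subdiagonal (last entry is padding)
ab = np.array([[ 2.,  2.,  2.,  2.],
               [-1., -1., -1.,  0.]])
b = np.ones(4)
x = solveh_banded(ab, b, lower=True)   # solves the SPD tridiagonal system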

Page 11: CS 267  Dense Linear Algebra: Possible Class Projects

Tuning the data layout

[Figure: execution time (seconds, 0 to 100) of PDGESV for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, and 6x10 at problem sizes 1000 through 10000.]

Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.

Speedups from using a 2D processor grid range from 2x to 8x.

The layout depends on the block size b and the processor grid Pr x Pc. Simple layouts are easy for the user, but bad for performance (the mapping is sketched below).
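A minimal sketch of the 2D block-cyclic mapping this tuning chooses parameters for (zero grid offsets assumed): global entry (i, j) lands on one process of the Pr x Pc grid.

def owner(i, j, b, Pr, Pc):
    """Process coordinates (pr, pc) owning global entry (i, j) under a
    2D block-cyclic layout with square block size b."""
    return (i // b) % Pr, (j // b) % Pc

print(owner(1000, 5000, 64, 6, 10))   # 6x10 grid, block size 64 -> (3, 8)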

Page 12: CS 267  Dense Linear Algebra: Possible Class Projects

Cost of tuning the data layout, compared to runtime

[Figure: for PDGESV on the optimal grid (6x10), calculation time vs. the time to redistribute the data from a linear grid, over problem sizes 1000 through 10000 (log scale, 0.01 to 100 seconds).]

Times obtained on 60 processors: dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory.

The cost of redistributing the matrix to the optimal layout is small.

Possible project: build a "wrapper" that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user.

Page 13: CS 267  Dense Linear Algebra: Possible Class Projects

Parallel Eigenvalue Algorithms on GPU

• Harder to use all BLAS3 than for solving Ax = b or least squares
• Symmetric eigenvalue problem for A = A^T (SVD is similar)
  - Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and right above and below it)
  - Find eigenvalues Λ = diag(λ1,…,λn) and orthogonal eigenvectors U of T, so that T = U Λ U^T
    • Good parallel algorithms exist; cheaper than the first step
  - Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
• A = Q T Q^T is the proposed challenge
  - Use "Successive Band Reduction" (Sun, Bischof et al.)
  - Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    • All BLAS3, fast on the GPU
  - Go from B to tridiagonal T via B = W T W^T, W orthogonal
    • BLAS1 and BLAS2; do it on the CPU
  - Find T = U Λ U^T as above; then A = (VWU) Λ (VWU)^T
• Prospect of minimizing communication in theory
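For orientation, a SciPy sketch of the reduce / solve / back-transform structure, using the standard one-stage tridiagonal reduction rather than the two-stage band reduction the slide proposes (for symmetric A, Hessenberg form is tridiagonal):

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

n = 500
A = np.random.rand(n, n)
A = (A + A.T) / 2                                     # symmetrize

T, Q = hessenberg(A, calc_q=True)                     # A = Q T Q^T, T tridiagonal
lam, U = eigh_tridiagonal(np.diag(T), np.diag(T, 1))  # T = U diag(lam) U^T
V = Q @ U                                             # eigenvectors of A
assert np.allclose(A @ V, V * lam, atol=1e-8)         # columns of V scaled by lam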

Page 14: CS 267  Dense Linear Algebra: Possible Class Projects

Experiment with PLASMA for Multicore

• PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  - icl.cs.utk.edu/plasma/

Page 15: CS 267  Dense Linear Algebra: Possible Class Projects

Fork-Join vs. Dynamic Execution on Multicore

[Figure: a task DAG and two execution traces over time, "Fork-Join – parallel BLAS" vs. "DAG-based – dynamic scheduling"; the dynamic schedule overlaps independent tasks, showing the time saved.]

Experiments on Intel's quad-core Clovertown with 2 sockets / 8 threads.

Source: Jack Dongarra

Page 16: CS 267  Dense Linear Algebra: Possible Class Projects

Experiment with PLASMA for Multicore

• PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  - icl.cs.utk.edu/plasma/
• Experiment with PLASMA (a tiled-factorization sketch of the DAG idea follows below):
  - Implement other factorizations
  - Compare performance
    • to LAPACK with parallel BLAS
    • to ScaLAPACK
  - Evaluate expressiveness for eigenvalue problems
  - Study the interaction of its scheduler with the higher-level scheduler being designed in the ParLab
    • Can PLASMA "gracefully" accept and give up resources?
• Perform analogous experiments with UPC, Titanium, or other PGAS languages
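To make the DAG concrete, a minimal NumPy/SciPy sketch of tiled Cholesky, the kind of algorithm PLASMA expresses as a graph of POTRF/TRSM/SYRK/GEMM tile tasks. Here the tasks run sequentially in one valid DAG order; a dynamic scheduler would run independent tile tasks concurrently. The tile size of 64 is illustrative.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tiled_cholesky(A, nb=64):
    """Right-looking tiled Cholesky of symmetric positive definite A.
    Returns lower-triangular L with A = L @ L.T."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # POTRF(k): factor the diagonal tile
        L[k:ke, k:ke] = cholesky(L[k:ke, k:ke], lower=True)
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM(i,k): panel solve; depends only on POTRF(k)
            L[i:ie, k:ke] = solve_triangular(
                L[k:ke, k:ke], L[i:ie, k:ke].T, lower=True).T
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            for i in range(j, n, nb):
                ie = min(i + nb, n)
                # SYRK/GEMM(i,j,k): depends on TRSM(i,k) and TRSM(j,k)
                L[i:ie, j:je] -= L[i:ie, k:ke] @ L[j:je, k:ke].T
    return L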

Page 17: CS 267  Dense Linear Algebra: Possible Class Projects

Investigate the role of the "Dense Motif" in ParLab Apps

• An initial study showed dense linear algebra in the Image, Speech, and Music applications
• Determine what is really needed: functions, problem sizes, performance requirements
• What do we still need to optimize?