Targeting Multi-Core systems in Linear Algebra applications
Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou
Petascale Applications Symposium, Pittsburgh Supercomputing Center, June 22-23, 2007
The free lunch is over
Problem
• power consumption
• heat dissipation
• pins
Solution
reduce the clock frequency and increase the number of execution units = Multicore
Consequence
Non-parallel software won't run any faster. A new approach to programming is required.
What is a Multicore processor, BTW?
“a processor that combines two or more independent processors into a single package” (wikipedia)
What about:
• types of core? homogeneous (AMD Opteron, Intel Woodcrest, ...) or heterogeneous (STI Cell, Sun Niagara, ...)
• memory? how is it arranged?
• bus? is it going to be fast enough?
• cache? shared (Intel/AMD)? not present at all (STI Cell)?
• communications?
Parallelism in Linear Algebra software so far
Shared Memory parallelism: LAPACK → Threaded BLAS → PThreads / OpenMP
Distributed Memory parallelism: ScaLAPACK → PBLAS → BLACS + MPI
Parallelism in LAPACK: Cholesky factorization
DPOTF2 (BLAS-2): non-blocked factorization of the panel
DTRSM (BLAS-3): updates the panel by applying the transformation computed in DPOTF2
DGEMM/DSYRK (BLAS-3): updates the trailing submatrix
U = L^T
Parallelism in LAPACK: Cholesky factorization
BLAS2 operations cannot be efficiently parallelized because they are bandwidth bound.
• strict synchronizations
• poor parallelism
• poor scalability
Parallelism in LAPACK: Cholesky factorization
The execution flow is filled with stalls due to synchronizations and sequential operations.

[Figure: execution trace over time.]
Parallelism in LAPACK: Cholesky factorization
Tiling operations:

for k = 1 .. NT
    do DPOTF2 on tile (k,k)
    for all i > k: do DTRSM on tile (i,k)
    for all i ≥ j > k: do DGEMM on tile (i,j)
end
Parallelism in LAPACK: Cholesky factorization
Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them.
As long as dependencies are not violated, tasks can be scheduled in any order.
[Figure: the DAG of the tiled Cholesky factorization; nodes are subtasks labeled i:j by tile.]
Parallelism in LAPACK: Cholesky factorization

• higher flexibility
• some degree of adaptivity
• no idle time
• better scalability

Cost: 1/3 n³, n³, 2 n³
Parallelism in LAPACK: block data layout

Column-Major vs. Block data layout
Parallelism in LAPACK: block data layout

[Figure: blocking speedup of DGEMM and DTRSM vs. block size (64, 128, 256); y-axis: speedup, 0–2.]

The use of block data layout storage can significantly improve performance.
Cholesky: performance

[Figure: Cholesky on a dual Clovertown; Gflop/s (0–60) vs. problem size (0–12000). Series: async. 2D blocking, LAPACK + threaded BLAS.]
Cholesky: performance

[Figure: Cholesky on an 8-way dual Opteron; Gflop/s (0–40) vs. problem size (0–16000). Series: async. 2D blocking, LAPACK + threaded BLAS.]
Parallelism in LAPACK: LU/QR factorizations

DGETF2 (BLAS-2): non-blocked panel factorization
DTRSM (BLAS-3): updates U with the transformation computed in DGETF2
DGEMM (BLAS-3): updates the trailing submatrix
Parallelism in LAPACK: LU/QR factorizations

The LU and QR factorization algorithms in LAPACK don't allow for a 2D distribution and block storage format.
LU: pivoting takes the whole panel into account and cannot be split in a block fashion.
QR: the computation of the Householder reflectors acts on the whole panel; the application of the transformation can only be sliced, not blocked.
Parallelism in LAPACK: LU/QR factorizations

[Figure: execution trace over time.]
LU factorization: performance

[Figure: LU on a dual Clovertown; Gflop/s (0–40) vs. problem size (0–12000). Series: async. 1D, LAPACK + threaded BLAS.]
Multicore friendly, “delightfully parallel”*, algorithms

Computer Science can't go any further on old algorithms. We need some math...

* quote from Prof. S. Kale
The QR factorization in LAPACK

The QR factorization decomposes a matrix A into the factors Q and R, where Q is unitary and R is upper triangular. It is based on Householder reflections.

Assume that [the shaded part of the matrix] has already been factorized and contains the Householder reflectors that determine the matrix Q.
The QR factorization in LAPACK
=DGEQR2( )
The QR factorization in LAPACK
=DLARFB( )
The QR factorization in LAPACK

How does it compare to LU?
• It is stable because it uses Householder transformations, which are orthogonal.
• It is more expensive than LU: its operation count is 4/3 n³ versus 2/3 n³.
Multicore friendly algorithms

A different algorithm can be used where operations can be broken down into tiles.

= DGEQR2( )

The QR factorization of the upper left tile is performed. This operation returns a small R factor and the corresponding Householder reflectors.
Multicore friendly algorithms

= DLARFB( )
All the tiles in the first block-row are updated by applying the transformation computed at the previous step.
Multicore friendly algorithms

= DGEQR2( )
The R factor computed at the first step is coupled with one tile in the block-column and a QR factorization is computed. Flops can be saved due to the shape of the matrix resulting from the coupling.
Multicore friendly algorithms

= DLARFB( )
Each pair of tiles along the corresponding block rows is updated by applying the transformations computed in the previous step. Flops can be saved by exploiting the shape of the Householder vectors.
Multicore friendly algorithms

= DGEQR2( )
The last two steps are repeated for all the tiles in the first block-column.
Multicore friendly algorithms

= DLARFB( )
25% more Flops than the LAPACK version!!!*
*we are working on a way to remove these extra flops.
![Page 34: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/34.jpg)
Multicore friendly algorithms
![Page 35: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/35.jpg)
Multicore friendly algorithms

• very fine granularity
• few dependencies, i.e., high flexibility for the scheduling of tasks
• block data layout is possible
![Page 36: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/36.jpg)
Multicore friendly algorithms

[Figure: execution flow over time on an 8-way dual-core Opteron.]
![Page 37: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/37.jpg)
Multicore friendly algorithms

[Figure: QR factorization scaling on an 8-way dual Opteron; performance vs. number of processes (0–18). Series: LAPACK + threaded BLAS, async. 1D, async. 2D blocking.]
![Page 38: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/38.jpg)
Multicore friendly algorithmsMulticore friendly algorithms
0 2 4 6 8 10 12 14 16 180
5000
10000
15000
20000
25000
QR Factorization: Scaling -- 8-way Dual Opteron
LAPACK + Th. BLASasync. 1Dasync. 2D blocking
n. of processes
Gfl
op
/s
![Page 39: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/39.jpg)
Multicore friendly algorithms

[Figure: QR factorization on an 8-way dual Opteron; Gflop/s (0–25) vs. problem size (0–12000). Series: LAPACK + threaded BLAS, async. 1D, async. 2D blocking.]
![Page 40: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/40.jpg)
Multicore friendly algorithms

[Figure: QR factorization on a dual Clovertown; Gflop/s (0–40) vs. problem size (0–12000). Series: async. 2D blocking, async. 1D, LAPACK + threaded BLAS.]
![Page 41: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/41.jpg)
Current work and future plans
![Page 42: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/42.jpg)
• Implement the LU factorization on multicores
• Is it possible to apply the same approach to two-sided transformations (Hessenberg, bidiagonal, tridiagonal)?
• Explore techniques to avoid the extra flops
• Implement the new algorithms on distributed memory architectures (J. Langou and J. Demmel)
• Implement the new algorithms on the Cell processor
• Explore automatic exploitation of parallelism through graph-driven programming environments
![Page 43: Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium](https://reader035.vdocuments.net/reader035/viewer/2022062518/56649e9f5503460f94ba0c0d/html5/thumbnails/43.jpg)
AllReduce algorithms
The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.
Input: A is block-distributed by rows (A1, A2, A3).
Output: Q is block-distributed by rows (Q1, Q2, Q3); R is global.
AllReduce algorithms

They are used in:
• iterative methods with multiple right-hand sides (block iterative methods):
  – Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk): BlockGMRES, BlockGCR, BlockCG, BlockQMR, ...
• iterative methods with a single right-hand side:
  – s-step methods for linear systems of equations (e.g. A. Chronopoulos)
  – LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder), implemented in PETSc
  – recent work from M. Hoemmen and J. Demmel (U. California at Berkeley)
• iterative eigenvalue solvers:
  – PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC)
  – HYPRE (Lawrence Livermore National Lab.) through BLOPEX
  – Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk)
  – PRIMME (A. Stathopoulos, Coll. William & Mary)
AllReduce algorithms

[Figures: the reduction is shown step by step, one row per process, with time flowing to the right.]

1. Each process i computes the QR factorization of its local block: QR(Ai) → (Vi(0), Ri(0)), where Vi(0) holds the Householder reflectors.
2. The R factors are stacked pairwise and factorized: QR([R0(0); R1(0)]) → (V0(1), R0(1)). With more processes this pairwise reduction forms a binary tree; with four processes, QR([R0(1); R2(1)]) produces the final R.
3. To build Q, the tree is traversed back down: V0(1) is applied to [In; 0n] to produce Q0(1) and Q1(1).
4. Finally, each process applies its local reflectors: Q0 = apply(V0(0) to [Q0(1); 0n]) and Q1 = apply(V1(0) to [Q1(1); 0n]).
AllReduce algorithms: performance

[Figures: Mflop/s per processor vs. number of processors on a Pentium III cluster with a Dolphin interconnect, comparing rhh_qr3 against qrf. Left (weak scalability): N=50, locM=100,000, up to 64 processors. Right (strong scalability): N=50, M=100,000, up to 32 processors.]
CellSuperScalar and SMPSuperScalar

http://www.bsc.es/cellsuperscalar

• uses source-to-source translation to determine dependencies among tasks
• scheduling of tasks is performed automatically by means of the features provided by a library
• it is easily possible to explore different scheduling policies
• all of this is obtained by annotating the code with pragmas and, thus, is transparent to other compilers
CellSuperScalar and SMPSuperScalar

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
            for (k = 0; k < j; k++) {
                sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
            strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
            ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
    }

    void sgemm_tile(float *A, float *B, float *C)

    void strsm_tile(float *T, float *B)

    void ssyrk_tile(float *A, float *C)
CellSuperScalar and SMPSuperScalar

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
            for (k = 0; k < j; k++) {
                sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
            strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
            ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
    }

    #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
    void sgemm_tile(float *A, float *B, float *C)

    #pragma css task input(T[64][64]) inout(B[64][64])
    void strsm_tile(float *T, float *B)

    #pragma css task input(A[64][64]) inout(C[64][64])
    void ssyrk_tile(float *A, float *C)
Thank you
http://icl.cs.utk.edu