The University of Texas at Austin

Programming Dense Matrix Computations Using Distributed and Off-Chip Shared-Memory on Many-Core Architectures
Ernie Chan
MARC Symposium, November 9, 2010

How to Program SCC?
• 48 cores in a 6×4 mesh with 2 cores per tile
• 4 DDR3 memory controllers
[Figure: SCC die layout — a 6×4 mesh of tiles, each attached to a router (R), with four DDR3 memory controllers on the periphery and a System I/F; each tile contains Core 0 and Core 1, their L2 caches (L2$0, L2$1), a router, and a message passing buffer (MPB).]
Outline
• How to Program SCC?
• Elemental
• Collective Communication
• Off-Chip Shared-Memory
• Conclusion
Elemental
• New, modern distributed-memory dense linear algebra library
  – Replacement for PLAPACK and ScaLAPACK
  – Object-oriented data structures for matrices
  – Coded in C++
  – Torus-wrap/elemental mapping of matrices to a two-dimensional process grid
  – Implemented entirely using bulk synchronous communication
Elemental
• Two-dimensional process grid:
  – Tile the process grid over the matrix to assign each matrix element to a process

    0 2 4
    1 3 5
Elemental
• Redistributing the matrix over a process grid
  – Collective communication
Collective Communication
• RCCE message passing API
  – Blocking send and receive

    int RCCE_send( char *buf, size_t num, int dest );
    int RCCE_recv( char *buf, size_t num, int src );

  – Potential for deadlock

    0 2 4
    1 3 5
Collective Communication
• Avoiding deadlock
  – Even number of cores in cycle: even-ranked cores send while odd-ranked cores receive, then the roles swap, so the exchange completes in two steps

    0 2 4
    1 3 5
Collective Communication
• Avoiding deadlock
  – Odd number of cores in cycle: the rank-parity pairing leaves one exchange unmatched per pass, so a third step is needed

    0 2 4
    1 3
Collective Communication
• Scatter

    int RCCE_scatter( char *inbuf, char *outbuf, size_t num, int root, RCCE_COMM comm );

[Figure: before/after — the root's input buffer is split into equal pieces and one piece is delivered to each core's output buffer.]
Collective Communication
• Allgather

    int RCCE_allgather( char *inbuf, char *outbuf, size_t num, RCCE_COMM comm );

[Figure: before/after — each core contributes its input buffer and every core ends with the concatenation of all contributions.]
Collective Communication
• Minimum spanning tree algorithm
  – Scatter: the root sends the data destined for the other half of the cores to one core in that half, and both halves recurse until every core has its piece
Collective Communication
• Cyclic (bucket) algorithm
  – Allgather: the cores form a ring, and in each of p − 1 steps every core forwards the bucket it most recently received to its neighbor
Off-Chip Shared-Memory
• Distributed vs. shared-memory

[Figure: SCC die layout with the off-chip memory partitioned — per-core private regions behind the four memory controllers (distributed memory) alongside a region visible to all cores (shared-memory).]
Off-Chip Shared-Memory
• SuperMatrix
  – Map dense matrix computation to a directed acyclic graph
  – No matrix distribution
  – Store DAG and matrix in off-chip shared-memory

[Figure: task DAG for a blocked Cholesky factorization — CHOL0 → {TRSM1, TRSM2} → {SYRK3, GEMM4, SYRK5} → CHOL6 → TRSM7 → SYRK8 → CHOL9.]
Off-Chip Shared-Memory
• Non-cacheable vs. cacheable shared-memory
  – Non-cacheable
    • Allows a simple programming interface
    • Poor performance
  – Cacheable
    • Needs a software-managed cache coherency mechanism
    • Executes on data stored in cache
• Interleave distributed and shared-memory programming concepts
Conclusion
• Distributed vs. shared-memory
  – Elemental vs. SuperMatrix?
• A collective communication library for SCC
  – RCCE_comm: released under the LGPL and available in the public Intel SCC software repository
    http://marcbug.scc-dc.com/svn/repository/trunk/rcce_applications/UT/RCCE_comm/
Acknowledgments
• We thank the other members of the FLAME team for their support
  – Bryan Marker, Jack Poulson, and Robert van de Geijn
• We thank Intel for access to SCC and for their help
  – Timothy G. Mattson and Rob F. Van Der Wijngaart
• Funding
  – Intel Corporation
  – National Science Foundation
Conclusion
• More information
  http://www.cs.utexas.edu/~flame