
  • Page 1

    Algebraic Multigrid (AMG)

    Multigrid Methods and Parallel Computing

    Klaus Stüben
    Fraunhofer Institute SCAI
    Schloss Birlinghoven
    53754 St. Augustin, Germany
    [email protected]

    2nd International FEFLOW User Conference, Potsdam, September 14-16, 2009


  • Page 2

    Contents

    • What are we talking about?
    • (Algebraic) Multigrid
    • Parallel computing nowadays
    • Performance of Algebraic Multigrid
    • Conclusions

  • Page 3

    What Are We Talking About?

  • Page 4

    Large-Scale Applications

    Casting and Molding

    Fluid Dynamics

    Circuit Simulation

    Groundwater modeling

    Oil reservoir simulation

    Electrochemical Plating

    Process & Device Simulation

  • Page 5

    Bottleneck: Linear Equation Solver

    Required: "scalable" solvers, i.e. work = O(N)

    Huge but sparse "elliptic" systems

    ∑_{j=1}^N a_ij u_j = f_i   (i = 1, ..., N),   or   A u = f
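    For illustration, a minimal sketch (assuming NumPy/SciPy) of such a sparse system: a 2D Poisson problem on an n x n grid with the standard 5-point stencil. The matrix has only O(N) nonzeros, which is what makes O(N) solvers conceivable in the first place.

```python
# Sketch: assemble a sparse 2D Poisson system A u = f (5-point stencil) with SciPy.
import numpy as np
import scipy.sparse as sp

def poisson_2d(n):
    """5-point Laplacian on an n x n grid, returned as a sparse CSR matrix."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))  # 1D second difference
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()             # 2D operator via Kronecker sum

n = 64
A = poisson_2d(n)            # N = n*n = 4096 unknowns
f = np.ones(A.shape[0])      # right-hand side
print(A.shape, A.nnz)        # nnz ~ 5*N: one matrix-vector product costs O(N) work
```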

    Scalability requires
    • locally "operating" numerical methods
    • exploitation of the origin and nature of the discrete linear systems

    Hierarchical solvers: multigrid, algebraic multigrid

    Calculation repeated many times over
    • to follow the time evolution
    • to resolve non-linearities
    • for history matching
    • for optimization processes

  • Page 6

    Additional aspects affecting numerical performance (besides size)
    • extremely unstructured or locally refined grids, highly heterogeneous

    Bottleneck: Linear Equation Solver

  • Page 7

    Additional aspects affecting numerical performance (besides size)
    • extremely unstructured or locally refined grids, highly heterogeneous
    • not optimally shaped elements

    Bottleneck: Linear Equation Solver

    Basin flow model:
    • pentahedral prismatic mesh
    • several fault zones
    • vertical distortion of the prisms

  • Page 8

    Additional aspects affecting numerical performance (besides size)
    • extremely unstructured or locally refined grids, highly heterogeneous
    • not optimally shaped elements
    • extreme parameter contrasts, discontinuities and anisotropies

    Bottleneck: Linear Equation Solver

    High parameter contrast: permeabilities vary by 7 orders of magnitude

  • Page 9

    Additional aspects affecting numerical performance (besides size)
    • extremely unstructured or locally refined grids, highly heterogeneous
    • not optimally shaped elements
    • extreme parameter contrasts, discontinuities and anisotropies
    • non-symmetry and indefiniteness
    • multiple (varying) degrees of freedom, ...

    Bottleneck: Linear Equation Solver

    Nothing is for free! Compromise?

    Requested: the solver should
    • be fast (scalable)
    • be general
    • require low memory
    • be easy to integrate

  • Page 10

    (Algebraic) Multigrid

  • Page 11

    Classical one-level approach

    Geometric versus Algebraic Multigrid

  • Page 12

    [Figure: GMG grid hierarchy Ω_h → Ω_2h → Ω_4h → Ω_8h with smoothers S_h, S_2h, S_4h and inter-grid transfer operators I_h^2h, I_2h^4h, I_4h^8h (fine to coarse) and I_2h^h, I_4h^2h, I_8h^4h (coarse to fine)]

    Geometric Multigrid (GMG)

    Geometric versus Algebraic Multigrid

  • Page 13

    [Figure: GMG grid hierarchy with smoothers and transfer operators, as on the previous slide]

    Geometric Multigrid (GMG)

    Basic idea: combine
    • relaxation for smoothing
    • coarse-grid correction
    (a V-cycle sketch follows below)

    Numerically scalable. Ideally suited for
    • structured meshes
    • "smooth" problems

    Difficult to integrate into existing software:
    • no "plug-in" software
    • "technical" limitations

    Hardly used in industry!

    Geometric versus Algebraic Multigrid
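    A minimal, illustrative sketch of such a multigrid V-cycle (assuming a given matrix hierarchy A[0..L] with restriction operators R[i] and interpolation operators P[i], and weighted Jacobi as the "simple" smoother):

```python
# Illustrative V-cycle sketch; A, R, P are lists of sparse matrices (one entry per level).
import numpy as np
import scipy.sparse.linalg as spla

def smooth(A, u, f, iters=2, omega=0.8):
    """Weighted Jacobi relaxation as a simple smoother."""
    Dinv = 1.0 / A.diagonal()
    for _ in range(iters):
        u = u + omega * Dinv * (f - A @ u)
    return u

def v_cycle(A, R, P, f, u, level=0):
    if level == len(A) - 1:                       # coarsest level: solve directly
        return spla.spsolve(A[level].tocsc(), f)
    u = smooth(A[level], u, f)                    # pre-smoothing
    r = f - A[level] @ u                          # residual
    rc = R[level] @ r                             # restrict residual to the coarser level
    ec = v_cycle(A, R, P, rc, np.zeros_like(rc), level + 1)  # coarse-grid correction
    u = u + P[level] @ ec                         # interpolate correction back and add
    return smooth(A[level], u, f)                 # post-smoothing
```

    Each level removes the error components its smoother handles well and passes the rest to the coarser level, so the cost per cycle stays proportional to the number of unknowns.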

  • Page 14

    Geometric versus Algebraic Multigrid

    Geometric Multigrid (GMG):
    [Figure: grid hierarchy Ω_h → Ω_2h → Ω_4h → Ω_8h with smoothers and transfer operators, as before]

    Algebraic Multigrid (AMG):
    given linear system A_1 u = f; the coarse-level matrices A_2, A_3, A_4 and the transfer operators I_1^2, I_2^3, I_3^4 and I_2^1, I_3^2, I_4^3 are constructed automatically as part of the AMG algorithm
    [Figure: AMG matrix hierarchy A_1 → A_2 → A_3 → A_4]

  • Page 15

    Two-stage process:

    Straightforward solution phase:
    • select "simple" smoother (e.g., relaxation)
    • cycling as before

    Geometric versus Algebraic Multigrid
    Algebraic Multigrid (AMG)

    [Figure: AMG matrix hierarchy A_1 u = f with automatically constructed A_2, A_3, A_4 and transfer operators, as on the previous slide]

    Recursive setup phase, i = 1, 2, ...:
    1. Find subset of variables representing level i+1
    2. Construct interpolation I_{i+1}^i
    3. Construct restriction I_i^{i+1}
    4. Compute coarse-level matrix A_{i+1} = I_i^{i+1} A_i I_{i+1}^i (sparse matrix product, Galerkin)
    (a setup sketch follows below)
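    A minimal sketch of this setup loop (the coarsening and interpolation routines below are naive placeholders, not the actual AMG components; only the Galerkin triple product is spelled out as such):

```python
# Illustrative AMG setup sketch; naive_coarsen/naive_interpolation stand in for
# real strength-of-connection based coarsening and interpolation.
import numpy as np
import scipy.sparse as sp

def naive_coarsen(A):
    """Placeholder coarsening: pick every other variable."""
    return np.arange(0, A.shape[0], 2)

def naive_interpolation(A, coarse):
    """Placeholder interpolation: piecewise constant from the nearest preceding coarse variable."""
    n = A.shape[0]
    rows = np.arange(n)
    cols = np.clip(np.searchsorted(coarse, rows, side="right") - 1, 0, len(coarse) - 1)
    return sp.csr_matrix((np.ones(n), (rows, cols)), shape=(n, len(coarse)))

def amg_setup(A1, max_levels=10, min_size=50):
    """Build hierarchy A[i], interpolation P[i], restriction R[i]; A_{i+1} = R_i A_i P_i (Galerkin)."""
    A, P, R = [sp.csr_matrix(A1)], [], []
    while len(A) < max_levels and A[-1].shape[0] > min_size:
        Ai = A[-1]
        coarse = naive_coarsen(Ai)              # 1. subset of variables for level i+1
        Pi = naive_interpolation(Ai, coarse)    # 2. interpolation
        Ri = Pi.T.tocsr()                       # 3. restriction (here simply the transpose)
        A.append((Ri @ Ai @ Pi).tocsr())        # 4. Galerkin sparse triple product
        P.append(Pi); R.append(Ri)
    return A, P, R
```

    With real strength-of-connection based coarsening and interpolation in place of the naive placeholders, this is the hierarchy consumed by the V-cycle sketched earlier.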

  • Page 16

    Relies on the same basic idea:
    • relaxation for smoothing
    • coarse-"grid" correction

    "Virtually" scalable. Also suited for
    • unstructured meshes, 2-3D
    • "non-smooth" problems

    Easy to integrate into existing software:
    • "plug-in" software
    • requires only matrices

    Very popular in industry!

    Geometric versus Algebraic Multigrid
    Algebraic Multigrid (AMG)

    [Figure: AMG matrix hierarchy A_1 u = f with automatically constructed A_2, A_3, A_4 and transfer operators, as on the previous slides]

  • Page 17

    Scalability

    Durlofsky problem:

    • flow from left to right

    • heterogeneous conductivity (red = 1 m/s, blue = 10^-6 m/s)

    • steady-state

    • basic mesh: 20x20 (refined by halving the mesh size)

    [Figure: Durlofsky problem, heads h = 1 (left) and h = 0 (right)]

  • Page 18

    Scalability

    [Plot: computational work per grid node versus number of elements (1,600; 6,400; 25,600; 102,400; 409,600; 1,638,400), comparing SAMG and standard PCG]

  • Page 19

    Steady State Test Cases (USGS, MODFLOW)

    Case 1: standard solver 662 iterations, 489 s CPU; SAMG 13 iterations, 59.8 s CPU; speedup = 8.2
    Case 2: standard solver 4474 iterations, 1479 s CPU; SAMG 11 iterations, 23.4 s CPU; speedup = 63.2 (!)

  • Page 20

    Distorted Mesh Case (Wasy)

    Basin flow model:
    • pentahedral prismatic mesh
    • a number of fault zones
    • vertical distortion of the prisms
    • transient
    • parameter contrast: 3 orders of magnitude

  • Page 21

    Distorted Mesh Case (Wasy)

    Standard one-level method:
    • failed to converge below 10^-8
    • solution fully unstable

    SAMG:
    • stable solution
    • sufficiently accurate

  • Page 22

    Electrochemical Machining - Extreme Aspect Ratios

    Diesel injection system, ~4.2 million tetrahedra
    Aspect ratios up to 1:10,000!!

  • Page 23

    Electrochemical Machining - Extreme Aspect Ratios

    Diesel injection system, ~4.2 million tetrahedra; aspect ratios up to 1:10,000!!

    [Plot: convergence history, log(residual) from 0 down to -4.5 versus iteration count (0 to 1000), comparing AMG/cg and ILU/cg]

  • Page 24

    Summary of AMG

    Steady state applications (elliptic)
    • highly efficient
    • scalable
    • increased robustness
    • in particular: discontinuities, anisotropies, ...

    Time dependent applications
    • applicable in each time step
    • increased robustness
    • efficiency depends on various aspects such as
      - requested accuracy per time step
      - mesh size
      - "behavior" of classical solvers
      - time discretization, in particular, time step size

    How does AMG perform on parallel computers?

  • Page 25

    Parallel Computing Nowadays

  • Page 26

    Hardware

    Distributed memory
    • MPI (process-based)

    [Diagram: several nodes, each with CPU, RAM and cache, connected by a network (Gigabit, InfiniBand, ...)]

  • Page 27

    Hardware

    Shared memory (multicore)
    • MPI (process-based)
    • OpenMP (thread-based)
    • mixed MPI/OpenMP

    [Diagram: two quad-core sockets connected by a cross-node link]

  • Page 28

    Hardware

    Hybrid architectures
    • mere MPI
    • mixed MPI/OpenMP

    [Diagram: multicore nodes (CPU, RAM, cache) connected by a network (Gigabit, InfiniBand, ...)]

  • Page 29

    Hardware

    Distributed memory
    • MPI (process-based)

    Shared memory (multicore)
    • MPI (process-based)
    • OpenMP (thread-based)
    • mixed MPI/OpenMP

    Hybrid architectures
    • mere MPI
    • mixed MPI/OpenMP

    Specialized hardware
    • GPUs (CUDA, OpenCL)
    • Cell

  • Page 30

    Parallel Computing

    General goal
    • combine serial scalability with the scalability of parallel hardware

    However, this is not trivial at all
    • there is a conflict
      - really parallel solvers are usually not serially scalable
      - serially scalable solvers are usually not really parallel
    • in particular, important AMG modules are not parallel
    • parallelization requires changes in the algorithm/numerics
    • this requires compromises
      - ensure "stable" convergence (essentially independent of # processors)
      - minimize fill-in / complexity / overhead
      - minimize communication / memory access

    Classical (A)MG parallelization: MPI-based

  • Page 31

    MPI-based Parallelization

    [Figure: the grid-based GMG hierarchy (smoothers S_h, S_2h, S_4h with transfer operators) next to the matrix-based AMG hierarchy A_1, A_2, A_3, A_4 with transfer operators I_1^2, I_2^3, I_3^4 and I_2^1, I_3^2, I_4^3]

  • Page 32

    [Figure: grid and matrix hierarchies distributed over processes]

    Partitioning of the 1st grid induces a partitioning on all levels!

    The user just provides the 1st-level set of equations along with a partitioning.

    The setup phase has to be parallelized!

    MPI-based Parallelization

  • Page 33

    MPI-based Parallelization

    Parallel setup phase:
    1. Find subset of variables: inherently sequential, most critical in the parallelization! Various attempts have been made to construct parallel analogs (e.g., LLNL, Griebel). Unfortunately, ...
    2. Construct interpolation
    3. Construct restriction
    4. Compute coarse-level matrix
    Steps 2-4: fully parallel. However: highly communication-intensive.

    Solution phase:
    1. Smoothing: generally, good smoothers are serial (e.g., GS, ILU)! Compromise: hybrid smoothers (e.g., GS/Jacobi), as sketched below.
    2. Restriction
    3. Interpolation
    Steps 2-3: fully parallel.
    4. Stepping from fine to coarse and back: inherently sequential! Towards coarser levels: increasingly bad computation/communication ratio. Various attempts have been made to improve this (e.g., McCormick, Griebel). Unfortunately, ...
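    A minimal sketch of such a hybrid GS/Jacobi sweep (assuming a simple row-wise partitioning of a CSR matrix; the halo exchange of off-process values between sweeps is left out, and this is an illustration rather than the actual SAMG smoother):

```python
# Hybrid Gauss-Seidel/Jacobi smoother sketch for a row-wise partitioned CSR matrix.
# Rows in `local_rows` are owned by this process; `u_global` holds the most recently
# received values for all unknowns, so off-process couplings use one-sweep-old data.
import numpy as np
import scipy.sparse as sp

def hybrid_gs_jacobi_sweep(A, f, u_global, local_rows):
    A = A.tocsr()
    for i in local_rows:                          # Gauss-Seidel ordering inside the process
        start, end = A.indptr[i], A.indptr[i + 1]
        cols, vals = A.indices[start:end], A.data[start:end]
        diag, s = 0.0, 0.0
        for j, a in zip(cols, vals):
            if j == i:
                diag = a
            else:
                s += a * u_global[j]              # off-process j: old value (Jacobi part)
        u_global[i] = (f[i] - s) / diag           # immediate update of local rows (GS part)
    # In an MPI code, off-process entries of u_global would now be refreshed via halo exchange.
    return u_global

# Demo on one "process" owning all rows (serial limit = plain Gauss-Seidel):
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100)).tocsr()
u = hybrid_gs_jacobi_sweep(A, np.ones(100), np.zeros(100), range(100))
```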

  • Page 34

    Performance of Algebraic Multigrid

  • Page 35

    Test case: 3D Poisson problem, 27-point FE stencil

    Mesh sizes:
    p0: 884,736
    p1: 1,560,896
    p2: 3,176,523
    p3: 6,331,625

    Partitioning: Metis

    Hardware: 64 nodes connected by InfiniBand; 2 dual-core sockets/node (AMD Opteron).
    Used as: 64-core distributed memory hardware, or as: 64-core hybrid system.

    Hybrid Architecture (MPI)

  • Page 36

    Hybrid Architecture (MPI)

    [Plots: speedup versus number of cores for the solution phase, once with 1 core/node and once with 4 cores/node]

    Speedup reduced: cores within a node have to share the internal bandwidth!

  • Page 37

    Hybrid Architecture (mixed MPI/OpenMP)

    Mixed MPI/OpenMP programming? Performance on 32 cores ....

    Configurations compared:
    • 32:1 (32 MPI processes with 1 OpenMP thread each)
    • 16:2 (16 MPI processes with 2 OpenMP threads each)
    • 8:4 (8 MPI processes with 4 OpenMP threads each)

    OpenMP on the nodes seems advantageous: 8:4 is most efficient!

  • Page 38

    Shared memory (multicore): AMD Opteron Processor 8384, 2.7 GHz
    8 quad-core sockets; 4 CPUs, 8 x 64 KB L1 cache (instr./data), 4 x 512 KB L2 cache, 1 x 6144 KB L3 cache (shared), 1 x 16144 MB memory (shared)

    [Diagram: quad-core sockets connected by cross-node links]

    Programming: MPI or (straightforward!) OpenMP, mixed MPI/OpenMP

    Shared Memory (mixed MPI/OpenMP)

    Comparison: (MPI processes : threads per process) = (1:16), (2:8), (4:4), (8:2), (16:1), ranging from pure OpenMP (1:16) through truly hybrid to pure MPI (16:1)

    Remark: we use only 16 cores in order to investigate thread-/memory binding

  • Page 39

    Shared Memory (mixed MPI/OpenMP)

    Process-/thread binding (coh: coherent; scat: scattered; sys: system-provided)

    [Plots: run times for the different (MPI : OpenMP) configurations with sys-binding]

  • Page 40

    Shared Memory (mixed MPI/OpenMP)

    Process-/thread binding (coh: coherent; scat: scattered; sys: system-provided)

    [Plots: run times for the different (MPI : OpenMP) configurations with coh-binding]

  • Page 41

    Shared Memory (mixed MPI/OpenMP)

    [Plots: run times for three test cases (2D FD Poisson, 3D FE Poisson, 3D FE Elasticity) at small, medium and large problem sizes]

    Influence of different thread-bindings: fairly unpredictable, practically does not pay

  • Page 42

    Shared Memory (OpenMP)

    Problem with pure OpenMP:
    • Setup phase scales nicely
    • Scalability of solution phase is limited

    [Plot: speedup of the OpenMP solution phase]

    Reasons:
    • Sparse matrix operations
      - matrix-matrix, matrix-vector
    • Lots of integer computations
      - indirect addressing, CSR data storage, ... (see the sketch below)
    • Efficient cache utilization very difficult
      - for multicore much more difficult than unicore
      - not sufficient to just consider matrix-vector
      - efficient tools are required

    Conclusions:
    • Pure MPI seems a reasonable choice
    • Truly hybrid is faster only rarely
    • Manual thread binding does not really pay
    • Pure OpenMP is always slowest
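    A minimal sketch of the CSR matrix-vector kernel behind these operations, written out to expose the indirect addressing through the index arrays that makes cache utilization hard, however the outer loop is threaded:

```python
# CSR sparse matrix-vector product y = A*x, spelled out to show the indirect,
# irregular accesses into x via the indices/indptr arrays.
import numpy as np
import scipy.sparse as sp

def csr_matvec(data, indices, indptr, x):
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):                            # rows could be split over threads
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]          # indirect access into x: cache-unfriendly
        y[i] = s
    return y

# Usage: compare against SciPy's built-in product.
A = sp.random(1000, 1000, density=0.01, format="csr")
x = np.random.rand(1000)
assert np.allclose(csr_matvec(A.data, A.indices, A.indptr, x), A @ x)
```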

  • Page 43

    Shared Memory (OpenMP)

    Remedies (unicore):
    • Improve cache utilization by, e.g.,
      - register blocking
      - pre-fetching
      - cache blocking
    • Change data access
      - improve re-use of matrix entries, e.g.,
      - "wave-like" relaxation, restriction, ....
      - potential conflict with modular programming

    Remedies (multicore):
    • Data blocking, re-ordering
      - minimize memory bank conflicts
    • Explicit thread and/or memory bindings
      - not (yet) really reliable
      - not enough information available
      - too hardware-dependent


  • Page 44

    Conclusions

    Clearly, results and conclusions strongly depend on
    • hardware, network, memory access, cache hierarchy, ....
    • application, in particular, discretization, grid size, sparsity, ....

    Distributed memory architecture
    • complex programming via MPI
    • easy to achieve high scalability (if the problem size per node is sufficiently large)

    Hybrid architecture (nodes consisting of multicores)
    • mere MPI suffers from a reduced bandwidth inside the nodes
    • mixed MPI/OpenMP programming seems best (if adjusted to the architecture)

    Shared memory architecture (one or more multicore sockets)
    • mere MPI is generally superior to pure (straightforward!) OpenMP
    • using reasonable thread-binding, mixed programming is occasionally better
    • however, without substantial effort, OpenMP performance is currently limited
      - e.g., in the solution phase (dominated by matrix-vector operations)
      - demanding developments still required (tools as well as algorithms)
      - dynamic decisions are actually required at run time!

  • Page 45

    Conclusions / Outlook

    One should not forget numerics (solvers)!
    • it is important to consider numerically "optimal" solvers (such as AMG)
    • it does not make sense to use sub-optimal solvers just because they scale better

    Unfortunately,
    • depending on the situation, AMG may not be suitable
      - not applicable, problem too small, insufficient convergence
    • there exists no solver which works always and is always efficient!
    • in a time-dependent simulation, a good solver may even depend on time
    • solvers should be adjusted at run time

    Challenge: dynamic and automatic decisions at run time
    • automatic switching between solvers and parameters ("intelligent solver")
    • automatic adjustment of the parallelization strategy

  • Page 46

    [email protected]
    http://www.scai.fraunhofer.de/samg

    Thank You!

    • Reports
    • User Manuals
    • Benchmarks
