petsc: high-performance software for engineering and...
TRANSCRIPT
1
PETSc: High-Performance Software for Engineering and Science
Hong Zhang
Computer Science Dept., Illinois Institute of Technology Mathematics and Computer Science Div.
Argonne National Laboratory, USA Sept., 2010
PETSc – Portable Extensible Toolkit for Scientific computation
• High-performance software for the scalable (parallel) solution of scientific applications.
• PETSc History • Begun Sept. 1991 • Over 20,000 downloads since 1995 (version 2) • Currently 400 per month
• PETSc Funding and Support • Department of Energy, USA • National Science Foundation, USA
2
Portable Extensible Toolkit for Scientific computation
• Portable to any parallel system supporting MPI • Tightly coupled systems
• Cray T3E, SGI Origin, IBM SP, HP 9000,…
• Loosely coupled systems • PCs running Linux or Windows
• Free for everyone, including industrial users. • Download from www.mcs.anl.gov/petsc • Extensive documentation • Hundreds of tutorial-style examples • Support via email: [email protected] • Usable from Fortran90, C, C++ and Python
4
5
Computation and Communication Kernels"MPI, MPI-IO, BLAS, LAPACK"
Profiling Interface"
Application Codes!
Matrices, Vectors, Indices" Grid"Management"
Linear Solvers"Preconditioners + Krylov Methods"
Nonlinear Solvers"
ODE Integrators" Visualization"
Interface
Portable Extensible Toolkit for Scientific computation
PETSc Structure
Level of Abstraction
Portable Extensible Toolkit for Scientific computation • Parallel vector/array operations • Numerous (parallel) sparse matrix formats • Numerous linear solvers • Nonlinear solvers • ODE integrators • Parallel grid/data management • Provides common interface for most linear solver software
(Hypre, SuperLU, MUMPS, Trilinos) • Allows switching of virtually all solvers at runtime
6
7
Portable Extensible Toolkit for Scientific computation Interfaced Packages
Interfaced Packages:
• LU, Cholesky, ILU, ICC SuperLU, SuperLU_DIST, ESSL, Matlab, PLAPACK, UMFPACK, SPOOLES, MUMPS PLAPACK, BlockSolve95
• Algebraic multigrid Parallel BoomerAMG (part of Hypre, LLNL), ML (part of Trilinos, SNL)
• Parititioning: Parmetis, Chaco, Jostle, Party, Scotch • ODE integrators: Sundials (LLNL) • Eigenvalue solvers: BLOPEX • FFTW • SPRN • …
Child Packages of PETSc: All have PETSc’s style of programming
• SIPs: Shift-and-Invert Parallel Spectral Transformations • SLEPc: scalable eigenvalue/eigenvector solver packages. • TAO: scalable optimization algorithms
Portable Extensible Toolkit for Scientific computation
Algorithms, (parallel) debugging aids, low-overhead profiling
• It is not possible to pick the solver a priori. – What will deliver best/competitive performance for a given physics,
problem size, discretization, and architecture?
• PETSc was developed as a platform for experimentation – models, discretization, algorithms, solvers – algebra of composition so new solvers can be created at runtime.
• Important to keep solvers decoupled from physics and discretization because we also experiment with those.
9
Portable Extensible Toolkit for Scientific computation
Who Uses PETSc? • Computational Scientists
– PyLith (TECTON), Underworld, Columbia group, PFLOTRAN
• Algorithm Developers – Iterative methods and Preconditioning researchers
• Package Developers – SIPs, SLEPc, TAO, MagPar, StGermain, Dealll, PETSc-FEM
10
Portable Extensible Toolkit for Scientific computation Applications of PETSc • Nano-simulations • Biology/Medical • Cardiology • Imaging and Surgery • Fusion • Geosciences • Environmental/Subsurface Flow • Computational Fluid Dynamics • Wave propagation and the Helmholz equation • Optimization • Fast Algorithms • Software engineering • Algorithm analysis and design
12
What Can We Handle? • PETSc has run implicit problem with over 1 billion unknowns
• PFLOTRAN for flow in porous media
• PETSc has run on over 130,000 cores efficiently • UNIC on IBM BG/P Intrepid at ANL • PFLOTRAN on the Gray XT5 Jaguar at ORNL
• PETSc applications have run at 3 Teraflops • LANL PFLOTRAN code
• PETSc also runs on your laptop
• Only a handful of our users ever go over 64 processors
14
Matrices are
• large: ultimate goal 50,000 atoms with electronic structure ~ N=200,000
• sparse: non-zero density -> 0 as N increases
• dense solutions are requested: 60% eigenvalues and eigenvectors
Dense solutions of large sparse problems!
15
DFTB-eigenvalue problem is distinguished by
• (A, B) is large and sparse Iterative method
• A large number of eigensolutions (60%) are requested Iterative method + multiple shift-and-invert
• The spectrum has - poor average eigenvalue separation O(1/N), - cluster with hundreds of tightly packed eigenvalues - gap >> O(1/N) Iterative method + multiple shift-and-invert + robusness
• The matrix factorization of (A-σB)=LDLT : not-very-sparse(7%) <= nonzero density <= dense(50%) Iterative method + multiple shift-and-invert + robusness + efficiency
• Ax=λBx is solved many times (possibly 1000’s) Iterative method + multiple shift-and-invert + robusness + efficiency
+ initial approximation of eigensolutions
16
Software Structure
MPI
PETSc
SLEPc
MUMPS
ARPACK
Shift-and-Invert Parallel Spectral Transforms (SIPs)
• Select shifts
• Bookkeep and validate eigensolutions
• Balance parallel jobs
• Ensure global orthogonality of eigenvectors
• Manage matrix storage
17
FACETS: Framework Application for Core-Edge Transport Simulations • https://facets.txcorp.com/facets • PI: John Cary, Tech-X Corporation
• Goal: Providing modeling of a fusion device from the core to the wall
• TOPS Emphasis in FACETS
– Incorporate TOPS expertise in scalable nonlinear algebraic solvers into the base physics codes that provide the foundation for the coupled models
– Study mathematical challenges that arise in coupled core-edge and transport-turbulence systems
IU NYU
Lodestar
18
The edge-plasma region is a key component to include for integrated modeling of fusion devices
• Edge-pedestal temperature has large impact on fusion gain
• Plasma exhaust can damage walls • Impurities from wall can dilute core fuel
and radiate substantial energy
2D mesh for DIII-D
19
UEDGE is a 2D plasma/neutral transport code • Features of UEDGE
Physics: – Multispecies plasma; var. ni,e, u||i,e, Ti,e for particle
density, parallel momentum, and energy balances – Reaction-diffusion-convection type eqstions – Reduced Navier-Stokes or Monte Carlo for wall-
recycled/sputtered neutrals – Multi-step ionization and recombination Numerics: – Finite volume discretization – Preconditioned Newton-Krylov implicit solver – Non-orthogonal mesh for fitting divertor – Steady-state or time dependent – Parallel version – PYTHON or BASIS scripting control
• T. D. Rognlien and M. E. Rensink, Edge-Plasma Models and Characteristics for Magnetic Fusion Energy Devices, Fusion Engineering and Design, 60: 497-514, 2002. • T. D. Rognlien, D. D. Ryutov, N. Mattor, and G. D. Porter, Two-Dimensional Electric Fields and Drifts Near the Magnetic Separatrix in Divertor Tokamaks, Physics of Plasmas, 6 (5): 1851-1857, 1999. • T. D. Rognlien, X. Q. Xu, and A. C. Hindmarsh, Application of Parallel Implicit Methods to Edge-Plasma Numerical Simulations, Journal of Computational Physics, 175: 249-268, 2002.
References:
20
The Role of PETSc
Developing parallel, nontrivial PDE solvers that deliver high performance is still difficult and requires months (or even years) of concentrated effort.
PETSc is a tool that can ease these difficulties and reduce the development time, but it is not a black-box PDE solver, nor a silver bullet.
21
Outline
• Overview of PETSc – Linear solver interface: KSP – Nonlinear solver interface: SNES – Profiling and debugging
• Ongoing research and developments
22
The PETSc Programming Model • Distributed memory, “shared-nothing”
• Requires only a standard compiler • Access to data on remote machines through MPI
• Hide within objects the details of the communication
• User orchestrates communication at a higher abstract level than direct MPI calls
PETSc Structure
23
Getting Started
PetscInitialize(); ObjCreate(MPI_comm,&obj); ObjSetType(obj, ); ObjSetFromOptions(obj, );
ObjSolve(obj, ); ObjGetxxx(obj, );
ObjDestroy(obj); PetscFinalize()
Integration
24
Compressed Sparse Row
(AIJ)
Blocked Compressed Sparse Row
(BAIJ)
Block Diagonal (BDIAG)
Dense Other
Indices Block Indices Stride Other Index Sets (IS)
Vectors (Vec)
Line Search Trust Region
Newton-based Methods Other
Nonlinear Solvers (SNES)
Additive Schwartz
Block Jacobi Jacobi ILU ICC LU
(Sequential only) Others
Preconditioners (PC)
Euler Backward Euler
Pseudo Time Stepping Other
Time Steppers (TS)
GMRES CG CGS Bi-CG-STAB TFQMR Richardson Chebychev Other
Krylov Subspace Methods (KSP)
Matrices (Mat)
PETSc Numerical Components
Distributed Arrays(DA)
Matrix-free
25
PETSc
Application Initialization Evaluation of A and b Post-
Processing
Solve Ax = b PC
Linear Solvers (KSP)
PETSc code User code
Linear Solver Interface: KSP Main Routine
solvers:linear beginner
26
• -ksp_type [cg,gmres,bcgs,tfqmr,…] • -pc_type [lu,ilu,jacobi,sor,asm,…]
• -ksp_max_it <max_iters> • -ksp_gmres_restart <restart> • -pc_asm_overlap <overlap> • -pc_asm_type [basic,restrict,interpolate,none] • etc ...
Setting Solver Options at Runtime
solvers:linear beginner
1 intermediate
2
1
2
27
Recursion: Specifying Solvers for Schwarz Preconditioner Blocks
• Specify KSP solvers and options with “-sub” prefix, e.g., – Full or incomplete factorization
-sub_pc_type lu -sub_pc_type ilu -sub_pc_ilu_levels <levels>
– Can also use inner Krylov iterations, e.g., -sub_ksp_type gmres -sub_ksp_rtol <rtol> -sub_ksp_max_it <maxit>
solvers: linear: preconditioners beginner
28
PETSc code User code
Application Initialization
Function Evaluation
Jacobian Evaluation
Post- Processing
PC PETSc
Main Routine
Linear Solvers (KSP)
Nonlinear Solvers (SNES)
Timestepping Solvers (TS)
Flow of Control for PDE Solution
PETSc Structure
29
UEDGE and PETSc: Solve F(u) = 0
Post- Processing
Application Initialization Function
Evaluation
Jacobian Evaluation
PETSc
Nonlinear Solvers (SNES)
PETSc code
Application code
UEDGE finite differencing Jacobian for preconditioning matrix; PETSc code for matrix-free Jacobian-vector products
Matrices Vectors
Krylov Solvers Preconditioners
GMRES
TFQMR
BCGS
CGS
BCG
Others…
ASM
ILU
B-Jacobi
SSOR
Multigrid
Others…
AIJ
B-AIJ
Diagonal
Dense
Matrix-free
Others…
Sequential
Parallel
Others…
UEDGE Driver + Timestepping
Algorithms and data structures originally employed by UEDGE
30
Nonlinear Solver Interface: SNES
Goal: For problems arising from PDEs, support the general solution of F(u) = 0
User provides: – Code to evaluate F(u) – Code to evaluate Jacobian of F(u) (optional)
• or use sparse finite difference approximation • or use automatic differentiation
– AD support via collaboration with P. Hovland and B. Norris – Coming in next PETSc release via automated interface to
ADIFOR and ADIC (see http://www.mcs.anl.gov/autodiff)
solvers: nonlinear
31
SNES: Review of Basic Usage • SNESCreate( ) - Create SNES context • SNESSetFunction( ) - Set function eval. routine • SNESSetJacobian( ) - Set Jacobian eval. routine • SNESSetFromOptions( ) - Set runtime solver options
for [SNES,SLES, KSP,PC] • SNESSolve( ) - Run nonlinear solver • SNESView( ) - View solver options
actually used at runtime (alternative: -snes_view)
• SNESDestroy( ) - Destroy solver
solvers: nonlinear
32
Uniform access to all linear and nonlinear solvers
• -ksp_type [cg,gmres,bcgs,tfqmr,…] • -pc_type [lu,ilu,jacobi,sor,asm,…] • -snes_type [ls,…]
• -snes_line_search <line search method> • -sles_ls <parameters> • -snes_convergence <tolerance> • etc...
solvers: nonlinear
1
2
33
PETSc Programming Aids • Correctness Debugging
– Automatic generation of tracebacks
– Detecting memory corruption and leaks
– Optional user-defined error handlers • Performance Profiling
– Integrated profiling using -log_summary – Profiling by stages of an application – User-defined events
Integration
34
Conclusions
PETSc can help you
• easily construct a code to test your ideas – lots of code construction, management, debugging tools
• scale an existing code to large or distributed machines
• incorporate more scalable or higher performance algorithms – such as domain decomposition or multigrid
• tune your code to new architectures – using profiling tools and specialized implementations
Ongoing Research and Developments
• Coupled multiphysics system
• Preconditioning using splitting methods
• Memory performance for sparse kernels – Sparse Matrix-Vector products – Triangular solves
• ILU with drop tolerance
• Time integration – Differential Algebraic Equations – Strong stability preserving methods
• …
36
• Two or more models which are not traditionally solved fully coupled:
F(x, y) = 0 G(x,y) = 0
• Our approach: - simple user specification - abstractions for managing the composition of solvers - flexible solver options and efficient solution
Coupled Multiphysics System
37
A model "multi-physics" solver based on the Vincent Mousseau's reactor core pilot code: There are three grids
Framework for Multi-model Algebraic System ~petsc/src/snes/examples/tutorials/ex31.c
DA1
DA2
DA3
Fluid
Thermal conduction
(cladding and core)
Fission (core)
38
/* Create the DMComposite object to manage the three grids/physics. */ DMCompositeCreate(app.comm,&app.pack); DACreate1d(app.comm,DA_XPERIODIC,app.nxv,6,3,0,&da1); DMCompositeAddDA(app.pack,da1); DACreate2d(app.comm,DA_YPERIODIC,DA_STENCIL_STAR,…,&da2); DMCompositeAddDA(app.pack,da2); DACreate2d(app.comm,DA_XYPERIODIC,DA_STENCIL_STAR,…,&da3); DMCompositeAddDA(app.pack,da3);
/* Create the solver object and attach the grid/physics info */ DMMGCreate(app.comm,1,0,&dmmg); DMMGSetDM(dmmg,(DM)app.pack); DMMGSetSNES(dmmg,FormFunction,0);
/* Solve the nonlinear system */ DMMGSolve(dmmg);
/* Free work space */ DMCompositeDestroy(app.pack); DMMGDestroy(dmmg);
39
/* Unwraps the input vector and passes its local ghosted pieces into the user function */ FormFunction(SNES snes,Vec X,Vec F,void *ctx) … DMCompositeGetEntries(dm,&da1,&da2,&da3); DAGetLocalInfo(da1,&info1);
/* Get local vectors to hold ghosted parts of X; then fill in the ghosted vectors from the unghosted global vector X */ DMCompositeGetLocalVectors(dm,&X1,&X2,&X3); DMCompositeScatter(dm,X,X1,X2,X3);
/* Access subvectors in F - not ghosted and directly access the memory locations in F */ DMCompositeGetAccess(dm,F,&F1,&F2,&F3);
/* Evaluate local user provided function */ FormFunctionLocalFluid(&info1,x1,f1); FormFunctionLocalThermal(&info2,x2,f2); FormFunctionLocalFuel(&info3,x3,f3); …
Three phase instantaneous time domain simulation of electric power systems using PETSc by Shri Abhyankar, IIT
Network subsystem
Generator subsystem
Load subsystem
Equations for the power system model Generator subsystem equations
Network subsystem equations
Load subsystem equations
Multi-physics modeling approach – - Solve the entire system at once. - Unpack the System X and pack the subsystem F at each iteration.
This is done in PETSc using the DMMG object.
Using PETSc multi-physics data structures and functions
Unpack the composite Vector X
DMMGGetLocalVectors
Compute the subsystem functions
Pack the subsystem Functions
DMMGRestoreLocalAccess
Plus (+) • vectors and matrix structures. • Wide range of linear and nonlinear solvers, preconditioners. • Parallel data structures and solvers. • Efficient multiphysics data structures and functions to access system
and subsystem vectors. • Run time options!!
Minus (-) • Scattering of vectors locally in a subsystem.
The PETSc Experience
44
Memory performance for sparse kernels Computational kernels:"• SpMv: y = A*x"• Solve A*x = b"
– Sparse factorization: A = L U or A ~ L U"– Sparse triangular solves: "
Ly = b and Ux = y"
• Have been implemented in numerical software libraries without serious consideration of computer architecture, e.g., memory layout, data-fetching behavior, …"
• Counts for 70-80% of total computational time for most of applications"
45
Sparse LU factor
Input: A"Output: L and U"For i= 1, 2, …, n Do: " w := A(i, :) " For k=1,2, …, i-1 Do: " mul := A(i,k)/A(k,k)" update w := w - mul*U(k,:) " EndDo" store w in L(i,:) and U(i,:)"EndDo "
Old storage: L and U are stored in the order of data creation: L(1,:), U(1,:), L(2,:), U(2,:), L(3,:), U(3,:)
Sparse Triangular Solve Input: L, U, b"Output: x, s.t., LUx ~ b"
Forward substitution"For i=1, 2,…,n Do" x(i):=b(i)-L(i,1:i-1)*[x(1), …,x(i-1)]^{T}"EndDo"
Backward substitution" For i=n, n-1,…,1 Do " x(i):=(x(i)-U(i,i+1:n)*[x(i+1),…,x(n)]^{T} )/U(i,i)" EndDo"
Old storage: the array of factored values is accessed twice:
L(1,:), U(1,:), L(2,:), U(2,:), L(3,:), U(3,:)
47
Memory access can be improved through a simple reorganization: Sparse LU factor
Input: A"Output: L and U"For i= 1, 2, …, n Do: " w := A(i, :) " For k=1,2, …, i-1 Do: " mul := A(i,k)/A(k,k)" update w := w - mul*U(k,:) " EndDo" store w in L(i,:) and U(i,:)"EndDo "
New storage: L and U are stored in the order of data accessing:
L(1,:), L(2,:), L(3,:), U(3,:), U(2,:), U(1,:)
48
Sparse Triangular Solve Input: L, U, b"Output: x, s.t., LUx ~ b"
Forward substitution"For i=1, 2,…,n Do" x(i):=b(i)-L(i,1:i-1)*[x(1), …,x(i-1)]^{T}"EndDo"
Backward substitution" For i=n, n-1,…,1 Do " x(i):=(x(i)-U(i,i+1:n)*[x(i+1),…,x(n)]^{T} )/U(i,i)" EndDo"
New storage: The array is accessed in a linear fashion:
L(1,:), L(2,:), L(3,:), U(3,:), U(2,:), U(1,:)
-> Performance improvement of PETSc applications
Processor Matrix SpMV (Mflps/s)
Triangular Solve Old Storage
Triangular Solve New Storage
Core 2 Duo 7-point Laplace 3D Euler AIJ
3D Euler BAIJ
537 620 890
261 (49%) 260 (42%) 468 (53%)
447 (83%) 660 (106%) 758 (85%)
BlueGene/P 7-point Laplace 3D Euler AIJ
3D Euler BAIJ
98 148 303
46 (47%) 99 (66%) 198 (65%)
62 (63%) 126 (85%) 260 (86%)
49
Barry Smith and Hong Zhang, “Sparse Triangular Solvers for ILU Revisited: Data Layout Crucial to Better Performance” , Intl. J. of High Performance Computing Applications, 2010
Improvement from better storage
Select data structure based on how it will be accessed efficiently and frequently, not on how it is created first!
50
How will we solve numerical applications in 20 years?
• Not with the algorithms we use today?
• Not with the software (development) we use today?
51
How Can We Help? • Provide documentation:"
– http://www.mcs.anl.gov/petsc"• Quickly answer questions"• Help install"• Guide large scale flexible code development"• Answer email at [email protected]"