Stochastic Optimization of Complex Energy Systems on High-Performance Computers
Cosmin G. Petra
Mathematics and Computer Science Division
Argonne National Laboratory, [email protected]
SIAM CSE 2013
Joint work with Olaf Schenk (USI Lugano), Miles Lubin (MIT), and Klaus Gaertner (WIAS Berlin)
Outline
Application of HPC to power-grid optimization under uncertainty
Parallel interior-point solver (PIPS-IPM) – structure exploiting
Revisiting linear algebra
Experiments on BG/P with the new features
Stochastic unit commitment with wind power
Wind forecast – WRF (Weather Research and Forecasting) model
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz@Argonne)
Minimize expected operating cost over the wind scenarios, subject to demand and reserve constraints in every scenario:

\[
\begin{aligned}
\min\ & \frac{1}{N_S} \sum_{s=1}^{N_S} \sum_{k=1}^{N_T} \sum_{j=1}^{N} \left( c_j\, p_{s,j,k} + c_j^{u}\, u_{j,k} + c_j^{d}\, d_{j,k} \right) \\
\text{s.t. } & \sum_{j} p_{s,j,k} + \sum_{j} p^{\mathrm{wind}}_{s,j,k} = D_k, \quad \forall s, k, \\
& \sum_{j} p_{s,j,k} + \sum_{j} p^{\mathrm{wind}}_{s,j,k} \ge D_k + R_k, \quad \forall s, k, \\
& \text{ramping constr., min. up/down constr.,} \\
& 0 \le p^{\mathrm{wind}}_{s,j,k} \le \bar{p}^{\mathrm{wind}}_{s,j,k}, \quad \forall s, j, k.
\end{aligned}
\]
Slide courtesy of V. Zavala & E. Constantinescu
(Figure: wind farm and thermal generator.)
Stochastic Formulation
Discrete distribution leads to block-angular LP
Large-scale (dual) block-angular LPs
• In terminology of stochastic LPs:
  – First-stage variables (decision now): x0
  – Second-stage variables (recourse decisions): x1, …, xN
• Each diagonal block is a realization of a random variable (scenario)
Extensive form
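A standard dual block-angular extensive form consistent with this terminology (scenario probabilities p_i, recourse matrices W_i as referenced on the next slide; the linking-matrix name T_i is a notational assumption here):

\[
\begin{aligned}
\min\ & c_0^T x_0 + \sum_{i=1}^{N} p_i\, c_i^T x_i \\
\text{s.t. } & A_0 x_0 = b_0, \\
& T_i x_0 + W_i x_i = b_i, \quad i = 1, \dots, N, \\
& x_0 \ge 0,\ x_i \ge 0.
\end{aligned}
\]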
Computational challenges and difficulties
May require many scenarios (100s, 1,000s, 10,000s …) to accurately model uncertainty
“Large” scenarios (Wi up to 250,000 x 250,000)
“Large” 1st stage (1,000s, 10,000s of variables)
Easy to build a practical instance that requires 100+ GB of RAM to solve
Requires distributed memory
Real-time solution needed in our applications
Linear algebra of primal-dual interior-point methods (IPM)
Convex quadratic problem:

\[
\min_x\ \tfrac{1}{2}\, x^T Q x + c^T x
\quad \text{subject to} \quad A x = b,\ x \ge 0.
\]

IPM linear system (per iteration):

\[
\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \mathrm{rhs}.
\]

Two-stage SP: the KKT matrix is an arrow-shaped linear system (modulo a permutation),

\[
K = \begin{bmatrix}
K_1 & & & & B_1 \\
& K_2 & & & B_2 \\
& & \ddots & & \vdots \\
& & & K_N & B_N \\
B_1^T & B_2^T & \cdots & B_N^T & K_0
\end{bmatrix},
\]

with one diagonal block K_i per scenario, first-stage block K_0, and border blocks B_i.
Multi-stage SP: nested arrow structure.
N is the number of scenarios.
2 solves per IPM iteration: predictor directions and corrector directions.
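Both direction computations reuse the same matrix K (a standard Mehrotra-type predictor-corrector detail; only the right-hand sides differ), so each IPM iteration pays for one factorization and two sets of triangular solves:

\[
K\, \Delta z_{\mathrm{pred}} = r_{\mathrm{pred}}, \qquad
K\, \Delta z_{\mathrm{corr}} = r_{\mathrm{corr}}.
\]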
Special Structure of KKT System (Arrow-shaped)
Parallel Solution Procedure for KKT System
Steps 1 and 5 are trivially parallel – “scenario-based decomposition” (see the sketch below)
Steps 1, 2, and 3 account for >95% of total execution time.
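A minimal dense Python sketch of this Schur-complement procedure for the arrow-shaped system above (the five steps are marked in comments; function and variable names are illustrative, not PIPS-IPM code):

import numpy as np

def solve_arrow_kkt(K_blocks, B_blocks, K0, r_blocks, r0):
    # Step 1 (independent per scenario): local Schur complement contributions.
    KinvB = [np.linalg.solve(Ki, Bi) for Ki, Bi in zip(K_blocks, B_blocks)]
    Kinvr = [np.linalg.solve(Ki, ri) for Ki, ri in zip(K_blocks, r_blocks)]
    # Step 2: form the "global" Schur complement and the reduced right-hand side.
    C = K0 - sum(Bi.T @ X for Bi, X in zip(B_blocks, KinvB))
    r0_hat = r0 - sum(Bi.T @ x for Bi, x in zip(B_blocks, Kinvr))
    # Steps 3-4: factorize C and solve for the first-stage direction
    # (done here in one call; the factors are kept for reuse in practice).
    dz0 = np.linalg.solve(C, r0_hat)
    # Step 5 (independent per scenario): back-substitute for the scenario directions.
    dz = [np.linalg.solve(Ki, ri - Bi @ dz0)
          for Ki, Bi, ri in zip(K_blocks, B_blocks, r_blocks)]
    return dz0, dz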
Components of Execution Time
(Figure: components of execution time; note the break in the y-axis scale.)
Scenario Calculations – Steps 1 and 5
Each scenario is assigned to an MPI process, which locally performs steps 1 and 5 (see the toy MPI sketch below).
Matrices are sparse and symmetric indefinite (symmetric with positive and negative eigenvalues).
Computing $B_i^T K_i^{-1} B_i$ is very expensive: it requires solving with the factors of $K_i$ against the non-zero columns of $B_i$ and multiplying the result from the left with $B_i^T$.
4 hours 10 minutes wall time to solve a 4h-horizon problem with 8k scenarios on 8k nodes.
Need to run under strict time requirements – for example, solve a 24h-horizon problem in less than 1h.
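A toy mpi4py sketch of this mapping, assuming one scenario per MPI process and random stand-in data (the reduction used to assemble the global Schur complement is shown with Allreduce purely for illustration; this is not the PIPS-IPM implementation):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                    # one scenario per MPI process

# Toy sizes and random stand-in data; the real K_i are large, sparse,
# symmetric indefinite matrices, and B_i are the first-stage borders.
n2, n1 = 50, 10
rng = np.random.default_rng(rank)
A = rng.standard_normal((n2, n2)); Ki = A @ A.T + np.eye(n2)
Bi = rng.standard_normal((n2, n1)); ri = rng.standard_normal(n2)
K0 = np.eye(n1); r0 = np.ones(n1)         # first-stage data, replicated everywhere

# Steps 1 and 5 are purely local; Step 2 is a sum over scenarios, i.e. a reduction.
Si = Bi.T @ np.linalg.solve(Ki, Bi)       # Step 1: local SC contribution
wi = Bi.T @ np.linalg.solve(Ki, ri)
S = np.zeros_like(Si); w = np.zeros_like(wi)
comm.Allreduce(Si, S, op=MPI.SUM)         # Step 2: assemble the global SC data
comm.Allreduce(wi, w, op=MPI.SUM)
dz0 = np.linalg.solve(K0 - S, r0 - w)     # Steps 3-4 (replicated here for brevity)
dzi = np.linalg.solve(Ki, ri - Bi @ dz0)  # Step 5: local back-substitution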
Revisiting scenario computations for shared-memory
“Compute SC” (Step 1): $B_i^T K_i^{-1} B_i$
Multiple sparse right-hand sides
Triangular-solves phase is hard to parallelize in shared memory (multi-core)
Factorization phase speeds up very well and achieves considerable peak performance
Our approach: incomplete factorization of the augmented matrix

\[
M_i = \begin{bmatrix} K_i & B_i \\ B_i^T & 0 \end{bmatrix}.
\]

Stop the factorization after the elimination of the (1,1) block; the Schur complement $-B_i^T K_i^{-1} B_i$ will sit in the (2,2) block (sketched below).
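A dense, purely illustrative Python sketch of that idea (the real work happens inside the sparse factorization of PARDISO-SC; the explicit solve below stands in for the partial elimination, and the helper name is made up):

import numpy as np

def scenario_sc_via_partial_elimination(Ki, Bi):
    # Build the augmented matrix M_i = [[K_i, B_i], [B_i^T, 0]].
    n, m = Bi.shape
    M = np.block([[Ki, Bi], [Bi.T, np.zeros((m, m))]])
    # Eliminate the (1,1) block; the (2,2) block then holds -B_i^T K_i^{-1} B_i.
    M[n:, n:] -= M[n:, :n] @ np.linalg.solve(M[:n, :n], M[:n, n:])
    return -M[n:, n:]

# Quick check against the straightforward multiple-right-hand-side computation.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30))
Ki = A + A.T + 40.0 * np.eye(30)          # symmetric, nonsingular stand-in for K_i
Bi = rng.standard_normal((30, 5))
assert np.allclose(scenario_sc_via_partial_elimination(Ki, Bi),
                   Bi.T @ np.linalg.solve(Ki, Bi))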
Implementation
Requires modification of the linear solver PARDISO (Schenk) -> PARDISO-SC
Pivot perturbations during factorization are needed to maintain numerical stability.
Errors due to perturbations are normally absorbed by iterative refinement.
This would be extremely expensive in our case (many right-hand sides).
We instead let the errors propagate into the “global” Schur complement C (Step 2):

\[
C = K_0 - \sum_{i=1}^{N} B_i^T K_i^{-1} B_i .
\]

Factorize the perturbed C, denoted by $\widetilde{C}$ (Step 3).
After Steps 1, 2, and 3 we therefore have the factorization of an approximation $\widetilde{K}$ of the KKT matrix:

\[
\widetilde{K} = \begin{bmatrix}
\widetilde{K}_1 & & & B_1 \\
& \ddots & & \vdots \\
& & \widetilde{K}_N & B_N \\
B_1^T & \cdots & B_N^T & K_0
\end{bmatrix},
\qquad \widetilde{K}_i \approx K_i .
\]
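Written out per scenario, with $E_i$ denoting the accumulated pivot perturbations (the $E_i$ notation is introduced here only for exposition):

\[
\widetilde{K}_i = L_i D_i L_i^T = K_i + E_i, \qquad \|E_i\| \text{ small},
\]

so the factors produced in Steps 1-3 are exact factors of $\widetilde{K}$, not of $K$.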
Pivot error absorption by preconditioned BiCGStab
Still, we have to solve with the exact matrix K.
“Absorb the errors” by solving Kz = r using preconditioned BiCGStab.
– Numerical experiments showed it is more robust than iterative refinement.
The preconditioner is $\widetilde{K}$. Each BiCGStab iteration requires
– 2 mat-vecs: Kz
– 2 applications of the preconditioner: $\widetilde{K}^{-1} z$
One application of the preconditioner amounts to performing “solve” steps 4 and 5 with the factors of $\widetilde{K}$.
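A minimal SciPy sketch of this error-absorption step, assuming K is available only through a matrix-vector product and $\widetilde{K}^{-1}$ through the already-computed factors (function names are illustrative):

import numpy as np
from scipy.sparse.linalg import LinearOperator, bicgstab

def solve_with_error_absorption(K_matvec, apply_Ktilde_inv, r):
    # K_matvec(v):         multiply by the exact arrow-shaped KKT matrix K.
    # apply_Ktilde_inv(v): apply the preconditioner, i.e. solve with the
    #                      perturbed factors (the "solve" steps 4 and 5).
    n = r.shape[0]
    K = LinearOperator((n, n), matvec=K_matvec)
    M = LinearOperator((n, n), matvec=apply_Ktilde_inv)
    z, info = bicgstab(K, r, M=M)
    if info != 0:
        raise RuntimeError("BiCGStab did not converge (info=%d)" % info)
    return z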
Summary of the new approach
1. Calculate $B_i^T K_i^{-1} B_i$, $i = 1, \dots, N$ (“Compute S.C.”)
2. Form $C = K_0 - \sum_{i=1}^{N} B_i^T K_i^{-1} B_i$ (“Form S.C.”)
3. Factorize $\widetilde{C} = L D L^T$ (“Factor S.C.”)
do BiCGStab iteration
   … apply preconditioner (steps 4 and 5): $\widetilde{K}^{-1} z$
   … mat-vec: $K z$
   … apply preconditioner (steps 4 and 5)
   … mat-vec
while BiCGStab has not converged
Test architecture
“Intrepid” Blue Gene/P supercomputer
– 40,960 nodes
– Custom interconnect
– Each node has a quad-core 850 MHz PowerPC processor, 2 GB RAM
DOE INCITE Award 2012-2013 – 24 million core hours
Numerical experiments
4h (UC4), 12h (UC12), 24h (UC24) horizon problems
1 scenario per node (4 cores per scenario)
Large-scale: 12h horizon, up to 32k scenarios and 128k cores (k = 1,024)
– 16k scenarios – 2.08 billion variables, 1.81 billion constraints, KKT system size = 3.89 billion
LAPACK + SMP ESSL BLAS for first-stage linear systems
PARDISO-SC for second-stage linear systems
Compute SC Times
Time per IPM iteration UC12, 32k scenarios, 32k nodes (128k cores)
BiCGStab iteration count ranges from 0 to 1.5.
Cost of absorbing factorization perturbation errors is between 10 and 30% of total iteration cost.
Solve to completion – UC12
Nodes/scens   Wall time (sec)   IPM iterations   Time per IPM iteration (sec)
4096          3548.5            103              33.57
8192          3883.7            112              34.67
16384         4208.8            123              34.80
32768         4781.7            133              35.95
Before: 4 hours 10 minutes wall time to solve the UC4 problem with 8k scenarios on 8k nodes.
Now: UC12 with up to 32k scenarios on 32k nodes solves in under 80 minutes, and the time per IPM iteration grows only ~7% as the scenario count grows 8x.
Weak scaling
Strong scaling
Conclusions and Future Considerations
Multicore-friendly reformulation of sparse linear algebra computations leads to one order of magnitude faster execution times.
Fast factorization-based computation of the SC
Robust and cheap absorption of pivot errors via Krylov iterative methods
Parallel efficiency of PIPS remains good.
Performance evaluation on today’s supercomputers
– IBM BG/Q
– Cray XK7, XC30