

Large-Scale GW Calculations on Pre-Exascale HPC Systems
BerkeleyGW: Method Developments and Code Optimization

M. Del Ben1, F.H. da Jornada1,2, A. Canning1, N. Wichmann3, K. Raman4, R. Sasanka4, C. Yang1, S.G. Louie1,2 and J. Deslippe1

1Lawrence Berkeley National Laboratory, 2University of California at Berkeley, 3CRAY, 4Intel Corporation

The GW Method

Solve Dyson's equation:

    [ -\tfrac{1}{2}\nabla^2 + V_{loc} + \Sigma(E_n) ] \phi_n = E_n \phi_n    (1)

Σ(E_n) → self-energy (non-Hermitian, non-local, energy-dependent operator)

- Perturbative expansion in the screened Coulomb interaction W
- First approximation: Σ = iGW
- W obtained from the inverse dielectric matrix ε⁻¹ of the system

In BerkeleyGW [1]:

- Epsilon code → compute ε⁻¹
- Sigma code → compute W from ε⁻¹ and solve Eq. (1)
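As a minimal, purely illustrative sketch (not the BerkeleyGW solver, and with a hypothetical scalar model self-energy), the quasiparticle energy of a single state can be found as the fixed point E_n = e_n + Σ(E_n) of Eq. (1) by simple iteration:

```python
# Toy fixed-point solve of the quasiparticle equation E = e_n + Sigma(E).
# Sigma here is a fabricated smooth scalar model, for illustration only.
def sigma(E):
    return 0.3 / (E + 2.0)   # hypothetical model self-energy

e_n = 1.0                    # mean-field eigenvalue (made up)
E = e_n
for _ in range(50):
    E_new = e_n + sigma(E)   # one fixed-point update
    if abs(E_new - E) < 1e-10:
        break
    E = E_new
```

In practice the self-energy is the full non-local operator of Eq. (1) and the iteration is replaced by more robust schemes, but the self-consistency structure is the same.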

Epsilon code: Inverse Dielectric Matrix ε−1

Input: ψ_mk, ε_mk, {q-points}, {ω_i}

1. Calculate plane-wave matrix elements (FFTs): O(N_v N_c N_G log N_G)

    M^G_{jak}(q) = \langle \psi_{j,k+q} | e^{i(G+q)\cdot r} | \psi_{ak} \rangle

2. Calculate the RPA polarizability (matrix multiplication): O(N_ω N_v N_c N_G^2)

    \chi(q, \omega_i) = M(q)^\dagger \, \Delta_{jak}(\varepsilon_{jk}, \varepsilon_{ak}, q, \omega_i) \, M(q)

Δ: diagonal matrix containing the frequency dependence

3. Dielectric matrix ε and inversion: O(N_ω N_G^3)

    \varepsilon_{GG'}(q, \omega_i) = \delta_{GG'} - v(q+G) \, \chi_{GG'}(q, \omega_i)
    \varepsilon^{-1}(q, \omega_i) = (I - v\chi(q, \omega_i))^{-1}

Parameters: N_v number of valence bands, N_c number of conduction (empty) bands, N_G plane-wave basis set size, N_ω number of frequencies.
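The three Epsilon steps can be sketched with NumPy on toy data. Everything here is a stand-in (random M instead of FFT-derived matrix elements, fabricated energies and Coulomb kernel); only the matrix structure of the algorithm is faithful:

```python
import numpy as np

# Toy sketch of the Epsilon-code pipeline: chi = M^dagger Delta M,
# eps = I - v*chi, then eps^-1 by direct inversion. All inputs fabricated.
rng = np.random.default_rng(0)
Nv, Nc, NG = 4, 6, 20                      # valence/conduction bands, PW basis
M = rng.normal(size=(Nv * Nc, NG)) + 1j * rng.normal(size=(Nv * Nc, NG))

# Delta: diagonal in the transition index, carries the frequency dependence
e_v = rng.uniform(-5.0, -1.0, Nv)          # fake valence energies
e_c = rng.uniform(1.0, 5.0, Nc)            # fake conduction energies
de = (e_v[:, None] - e_c[None, :]).ravel() # transition energies
omega = 0.0
delta = 1.0 / (de - omega)                 # schematic energy denominator

# Step 2: RPA polarizability as one big matrix multiplication
chi = M.conj().T @ (delta[:, None] * M)

# Step 3: dielectric matrix and inversion, eps^-1 = (I - v*chi)^-1
v = 1.0 / (1.0 + np.arange(NG))            # schematic Coulomb kernel v(q+G)
eps = np.eye(NG) - v[:, None] * chi
eps_inv = np.linalg.inv(eps)
```

The N_G × N_G inversion in step 3 is what the static subspace approximation below shrinks to N_b × N_b for the finite-frequency points.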

Static Subspace Approximation [2]: Speed Up the Calculation of ε⁻¹(ω_i) for ω_i ≠ 0

- For ω_i = 0: standard calculation of the symmetrized polarizability

    \bar{\chi}(0) = v^{1/2} \chi(0) v^{1/2}, \qquad \chi(0) = M^\dagger \Delta_{jak}(0) M

  Eigendecomposition \bar{\chi}(0) = C_0^\dagger x \, C_0; define C_0^s according to the threshold t_eigen

- Projection of M into the subspace spanned by C_0^s:

    M_s^0 = M v^{1/2} C_0^s

- For ω_i ≠ 0: direct computation of \bar{\chi}_s(ω_i)

    \bar{\chi}_s(\omega_i) = (M_s^0)^\dagger \Delta_{jak}(\omega_i) \, M_s^0

- Final evaluation of ε⁻¹(ω_i) from \bar{\chi}_s(ω_i)

                             Execution                  Memory
Matrix elements              O(N_v N_c N_G log N_G)     O(N_v N_c N_G)
Polarizability (ω = 0)       O(N_v N_c N_G^2)           O(N_G^2)
Eigendecomposition C_0^s     O(N_G^3)                   O(N_G^2)
M_s^0                        O(N_b N_v N_c N_G)         O(N_v N_c N_b)
Polarizability (ω ≠ 0)       O(N_ω N_v N_c N_b^2)       O(N_ω N_b^2)
Inversion                    O(N_ω N_b^3)               O(N_ω N_b^2)
I/O                          O(N_G N_b + N_ω N_b^2)     O(N_G N_b + N_ω N_b^2)
Evaluation of ε⁻¹(ω_i)       O(N_ω N_b N_G^2)           O(N_ω N_G^2)
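The subspace construction above can be sketched schematically in NumPy. Toy sizes and fabricated inputs throughout; in particular, the relative eigenvalue cutoff used here is an assumption standing in for the t_eigen threshold:

```python
import numpy as np

# Schematic static subspace approximation: eigendecompose the symmetrized
# static polarizability, keep the dominant eigenvectors, and build the
# finite-frequency polarizability in that small Nb-dimensional basis.
rng = np.random.default_rng(1)
Nvc, NG = 40, 30
M = rng.normal(size=(Nvc, NG)) + 1j * rng.normal(size=(Nvc, NG))
v = 1.0 / (1.0 + np.arange(NG))
v_half = np.sqrt(v)

# chi_bar(0) = v^1/2 chi(0) v^1/2 is Hermitian, so eigh applies
delta0 = -1.0 / (1.0 + rng.uniform(size=Nvc))    # schematic Delta_jak(0) < 0
chi0 = M.conj().T @ (delta0[:, None] * M)
chi_bar0 = v_half[:, None] * chi0 * v_half[None, :]

# Keep eigenvectors above a (hypothetical) relative threshold t_eigen
x, C0 = np.linalg.eigh(chi_bar0)
t_eigen = 0.01
keep = np.abs(x) > t_eigen * np.abs(x).max()
C0s = C0[:, keep]                                # NG x Nb, with Nb << NG
Nb = C0s.shape[1]

# Project the matrix elements once: M_s^0 = M v^1/2 C_0^s
M0s = (M * v_half[None, :]) @ C0s

# For omega != 0: the polarizability is now only Nb x Nb
delta_w = -1.0 / (2.0 + rng.uniform(size=Nvc))   # schematic Delta_jak(omega)
chi_s = M0s.conj().T @ (delta_w[:, None] * M0s)
```

This mirrors the cost table: the O(N_G^3) eigendecomposition is paid once at ω = 0, after which every finite-frequency polarizability and inversion scales with N_b instead of N_G.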

References

1. J. Deslippe, G. Samsonidze, D. A. Strubbe, M. Jain, M. L. Cohen, and S. G. Louie, Comput. Phys. Commun. 183, 1269 (2012).
2. M. Govoni and G. Galli, J. Chem. Theory Comput. 11, 2680 (2015); T. A. Pham, H.-V. Nguyen, D. Rocca, and G. Galli, Phys. Rev. B 87, 155148 (2013); H.-V. Nguyen, T. A. Pham, D. Rocca, and G. Galli, Phys. Rev. B 85, 081101 (2012); H. F. Wilson, D. Lu, F. Gygi, and G. Galli, Phys. Rev. B 79, 245106 (2009); H. F. Wilson, F. Gygi, and G. Galli, Phys. Rev. B 78, 113303 (2008).
3. P. Umari, G. Stenuit, and S. Baroni, Phys. Rev. B 79, 201104(R) (2009).
4. M. Del Ben, F. H. da Jornada, A. Canning, N. Wichmann, K. Raman, R. Sasanka, C. Yang, S. G. Louie, and J. Deslippe, in preparation.

Benchmark Calculations

Silicon Carbide (β−SiC):

Figure: (a) band structure obtained without (solid blue) and with (red dotted, t_eigen = 0.01) the static subspace approximation; (b) mean absolute error between the reference and approximate results (196 E_QP calculated); (c) percentage reduction in time to solution (red) and in number of eigenvectors (blue) for the evaluation of ε⁻¹.

Single Vacancy V_1^0 (1000-Si-Atom Benchmark)

              ΔE_v   ΔE_g   ΔE_c   ΔE_Si   #Nodes   Time (min)
LDA           0.29   0.07   0.23   0.60    -        -
G0W0 (6 Ry)   0.41   0.27   0.46   1.15    480      73
G0W0 (12 Ry)  0.42   0.26   0.49   1.18    2048     157

Energies in eV. G0W0 = full-frequency contour deformation.

Calculations performed on Edison@NERSC (Cray XC30)

- Approximation tested for insulators, semiconductors, metals, clusters, and slabs
- Direct correlation between accuracy and t_eigen (5-25% of the eigenstates → ~1 meV accuracy)

Sigma code: Calculate Σlm(ω) and Solve Dyson’s Equation

1. Calculate plane-wave matrix elements (FFTs): O(N_Σ N_n N_G log N_G)

    M^{-G}_{nm} = \langle \phi_n | e^{-iG \cdot r} | \phi_m \rangle

2. For any given pair of orbital functions {φ_l, φ_m} calculate:

    \Sigma_{lm}(E) = \frac{i}{2\pi} \int_0^\infty d\omega \sum_n \sum_{GG'} M^{-G}_{nl} \, \frac{\varepsilon^{-1}_{GG'}(\omega) \, v(G')}{E - E_n - \omega} \, M^{-G'}_{nm}

Matrix multiplication + 2D dot product: O(N_Σ N_ω N_n N_G^2)
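At a single frequency, the contraction above reduces to one ZGEMM-like matrix multiply over G' followed by a weighted 2D dot product over n and G. A schematic NumPy check on random data (Ml, Mm, ε⁻¹, v, and all energies are fabricated):

```python
import numpy as np

# Schematic single-frequency Sigma_lm contraction, checked against a
# brute-force triple loop over n, G, G'. Toy sizes, random complex inputs.
rng = np.random.default_rng(2)
Nn, NG = 40, 25
Ml = rng.normal(size=(Nn, NG)) + 1j * rng.normal(size=(Nn, NG))  # M^-G_nl
Mm = rng.normal(size=(Nn, NG)) + 1j * rng.normal(size=(Nn, NG))  # M^-G'_nm
eps_inv = rng.normal(size=(NG, NG)) + 1j * rng.normal(size=(NG, NG))
v = 1.0 / (1.0 + np.arange(NG))            # schematic Coulomb kernel v(G')
E, omega = 0.5, 0.2
E_n = rng.uniform(1.0, 4.0, Nn)            # fake band energies

# Matrix multiply over G': T[G, n] = sum_G' eps^-1[G, G'] v[G'] Mm[n, G']
T = (eps_inv * v[None, :]) @ Mm.T
# 2D dot product over n and G with the 1/(E - E_n - omega) weight
sigma_lm = (1j / (2 * np.pi)) * np.sum(
    (Ml.T * T).sum(axis=0) / (E - E_n - omega))

# Brute-force reference for the same contraction
ref = 0.0
for n in range(Nn):
    for G in range(NG):
        for Gp in range(NG):
            ref += (Ml[n, G] * eps_inv[G, Gp] * v[Gp] * Mm[n, Gp]
                    / (E - E_n[n] - omega))
ref *= 1j / (2 * np.pi)
```

The matmul-plus-dot-product structure is what makes the kernel map onto ZGEMM and reach a high fraction of peak, as measured below.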

Parallelization Strategy

Σ_lm(ω) matrix elements are distributed over pools of processes. For each pool:

Data distribution layout (N_G ≈ 10^5; N_n ≈ 10^4; N_E ≈ 10^2)

- Left matrix: ε⁻¹(ω) distributed over rows (N_G × N_ω combined index)
- Right matrix: M^G_nl distributed over columns (FFTs performed locally)
- At each cycle the matrix contraction is performed locally (ZGEMM)
- Ring communication restricted to contiguous MPI tasks

Non-blocking cyclic communication layout

- Communication can be performed over blocks of processes
- N_n'' block size stays roughly constant, independent of the OMP×MPI ratio
- Good ZGEMM performance independent of the number of MPI tasks employed

The same algorithm is used in the low-rank approximation case (matrix size N_G → N_b):

- Speed-up proportional to (N_G/N_b)^2
- Good performance achieved by using smaller pool sizes
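The cyclic layout can be simulated serially: the ε⁻¹ row blocks stay fixed on their "rank" while the column blocks of M circulate around the ring, with one local ZGEMM-like multiply per cycle. A toy NumPy model of the bookkeeping only (the real code overlaps non-blocking MPI communication with computation):

```python
import numpy as np

# Serial simulation of a ring-shifted block matrix multiply C = A @ B:
# A's row blocks are pinned to ranks, B's column blocks travel the ring.
rng = np.random.default_rng(3)
P, NG, Nn = 4, 24, 16                      # pool size, matrix dimensions
A = rng.normal(size=(NG, NG))              # stand-in for eps^-1(omega)
B = rng.normal(size=(NG, Nn))              # stand-in for M^G_nl

rows = np.array_split(np.arange(NG), P)    # A rows owned by each rank
cols = np.array_split(np.arange(Nn), P)    # B column blocks on the ring
local_B = [B[:, c] for c in cols]          # block each rank starts with
C = np.zeros((NG, Nn))

for step in range(P):
    for rank in range(P):
        # After `step` ring shifts, this rank holds the block of `owner`
        owner = (rank + step) % P
        C[np.ix_(rows[rank], cols[owner])] = A[rows[rank], :] @ local_B[owner]
    # Ring shift to the next rank is implicit in the owner indexing above;
    # in MPI this is a non-blocking send/recv between contiguous tasks.
```

After P cycles every rank has multiplied its rows against every column block exactly once, so C equals the full product without any rank ever holding all of B.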

Performance Measurement on Cori-KNL@NERSC

Systems: divacancy states in silicon supercells containing 998 and 1726 atoms. Quantities measured:

- FLOPs per node
- Best/worst performing node
- Strong scaling
- Time to solution
- Comparison to peak performance

Single Pool Performance: 200 KNL Nodes

Figure: (a) poor performance due to a small computation/communication ratio (varying OMP×MPI ratios, no message blocking); (b) improved performance for 64 OMP×MPI by using a different process block size; (c) OMP×MPI ratio and process block size adjusted together to give a roughly constant n''.

Strong Scaling: Individual Pool and Full Sigma

Figure: (d) individual pool scaling, total execution (red squares) and computationally intensive part (blue circles); (e) full Sigma, 200 KNL nodes per pool.

Full Sigma: Best Performance

                           998 Si    1726 Si
Number of KNL nodes         9600      9500
Number of cores           633,600   627,000
Number of E_qp evaluated      48        38
Time to solution (s)         160       201
PetaFLOP/s                  11.8      11.3
% of peak performance         47        46

- The Sigma kernel is capable of scaling to the full Cori machine
- Sigma achieves a high fraction of peak performance
- Excellent time to solution (on the order of 100 seconds) for systems of thousands of atoms

Acknowledgments

This work was supported by the Center for Computational Study of Excited-State Phenomena in Energy Materials at the Lawrence Berkeley National Laboratory, which is funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division under Contract No. DE-AC02-05CH11231, as part of the Computational Materials Sciences Program.

[email protected] http://www.berkeleygw.org/