performance optimization of quantum espresso on knl · performance optimization of quantum espresso...
TRANSCRIPT
![Page 1: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/1.jpg)
Performance Optimization of Quantum Espresso on KNL
[email protected] NERSC November 3 2016
Taylor Barnes, Thorsten Kurth, Paul Kent, Pierre Carrier, Nathan Wichmann, David Prendergast, Jack Deslippe
![Page 2: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/2.jpg)
Introduction
Approximate exchange functional Exact exchange operator
Local DFT: Hybrid DFT:
Cost of Hybrid DFT
Goal: Prepare QE for large-scale execution on the KNL architecture, with a particular focus on improving the implementation of hybrid exchange
0
200
400
600
800
1000
1200
PBE PBE0
Wal
ltim
e (s
)
Local DFT Hybrid DFT
Total walltime for an SCF calculation on 16 waters:
![Page 3: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/3.jpg)
NERSC Whitebox, Quad Cache Mode; Intel 17.0.042 Compiler:
Threads
Original OpenMP Threading
![Page 4: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/4.jpg)
Increasing the amount of work performed in each OMP loop dramatically reduces overhead costs.
NERSC Whitebox, Quad Cache Mode; Intel 17.0.042 Compiler:
Threads
Improved OpenMP Threading
![Page 5: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/5.jpg)
Flat Mode
With the original code, running KNL in Cache mode is approximately 2x faster than Flat mode. Using FASTMEM directives enables Flat mode to outperform Cache mode.
KNL NUMA Mode Performance
![Page 6: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/6.jpg)
Using the improved OpenMP code is the fastest way to run on KNL, and is about 1.6x faster than the best Haswell walltimes. OMP is significantly improved on both Haswell and KNL.
KNL vs. Haswell
![Page 7: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/7.jpg)
Strong scaling of QE on Ivy Bridge, using pure MPI mode:
Strong Scaling of QE
![Page 8: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/8.jpg)
FFT-1[ ψi(g) ]
Calculate ck(g)
v(g) = ck(g) ρij(g) FFT-1[ v(g) ]
FFT[ ρij(r) ]
FFT[ Hψi(r) ]
Reduce Hψi(r)
Local Potential
Update ψ
Broadcast ψ(g) SCF Loop
(duplicated on each bgrp) Loop i
(duplicated on each bgrp)
Hψi(r) += v(r) ψj(r)
Loop j (bgrp parallelized)
ρij(r) = ψi(r) ψj(r)
Loop k
Overview of Existing Code
![Page 9: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/9.jpg)
FFT-1[ ψi(g) ]
v(g) = ck(g) ρij(g) FFT-1[ v(g) ]
FFT[ ρij(r) ]
Local Potential
Update ψ
Broadcast ψ(g) SCF Loop
(duplicated on each bgrp)
Hψi(r) += v(r) ψj(r)
Loop j (bgrp parallelized)
ρij(r) = ψi(r) ψj(r)
Loop k
FFT[ Hψi(r) ]
Reduce Hψi(g)
Re-use ck(g)
Loop i (bgrp parallelized)
Pair Parallelization
![Page 10: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/10.jpg)
Strong scaling of the exact exchange part of the code on Ivy Bridge, with 1 band group per node:
Parallelization over band pairs both improves the strong scaling of the code and also improves the load balancing.
Pair Parallelization Performance
![Page 11: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/11.jpg)
FFT-1[ ψi(g) ]
v(g) = ck(g) ρij(g) FFT-1[ v(g) ]
FFT[ ρij(r) ]
Local Potential
Update ψ
Broadcast ψ(g) SCF Loop
(duplicated on each bgrp)
Hψi(r) += v(r) ψj(r)
Loop j (bgrp parallelized)
ρij(r) = ψi(r) ψj(r)
Loop k
FFT[ Hψi(r) ]
Reduce Hψi(g)
Re-use ck(g)
Loop i (bgrp parallelized)
Code Overview
![Page 12: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/12.jpg)
Fraction of total walltime spent in local regions of the code, using 1 band group per node on Haswell:
Unintuitively, local regions of the code dominate the cost of the calculation when using large numbers of nodes. This is because the local regions of the code a run in serial with respect to band groups.
Cost of the Local Calculation
![Page 13: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/13.jpg)
Independent Parallelization of the Local Code
Node 1
Band 1
Data structure of Ψ(g), in local code:
Band 2
Band 3
Band 4
Band 5
Band 6
g5 g2 g4 g6 g1 g3
Node 1
Band 1
Band 2
Band 3
Band 4
Band 5
Band 6
g1 g2 g3 g4 g5 g6
Node 2 Node 3
Data structure of Ψ(g), in exact exchange code:
Node 2
Node 3
Enabling parallelization of the local regions of the code across band groups requires on-the-fly transformation of the data structures between local and exact exchange regions of the code.
![Page 14: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/14.jpg)
Full Calculation Exact Exchange Part Strong scaling on Ivy Bridge, with 1 band group per node:
Improved Strong Scaling
![Page 15: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul](https://reader034.vdocuments.net/reader034/viewer/2022042804/5f548bca9d248524475fc875/html5/thumbnails/15.jpg)
Thank You
Taylor Barnes, NERSC, November 2016 - 15