TRANSCRIPT
-
< Panagiotis Adamidis> (DKRZ)
Deutsches Klimarechenzentrum (DKRZ)
Porting Climate Models to NEC SX-Aurora TSUBASA
-
NEC SX-Aurora TSUBASA @ DKRZ
People who contributed:
Panagiotis Adamidis, Jan Frederik Engels, Hendryk Bockelmann (DKRZ)
Günther Zängl (DWD)
Rene Redler (MPI-Met)
-
Climate Models @ DKRZ
High Resolution: resolving small-scale physical processes
Coarse Resolution: simulating longer periods (80000 – 100000 years), complete glacial cycles
-
ICON Grid Resolutions
grid     number of cells   avg. resolution
R2B04            20480     158 km
R2B05            81920     79 km
R2B06           327680     40 km
R2B07          1310720     20 km
R2B09         20971520     5 km
R2B10         83886080     2.5 km
R2B11        335544320     1.25 km
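The cell counts listed for the ICON grids are consistent with the usual RnBk naming convention, in which an icosahedron's 20 faces are split n times along each edge and then bisected k times, giving 20 * n^2 * 4^k triangular cells. The slide does not state this relation, so the sketch below is an assumption-based check: the function names and the mean Earth radius of 6371 km are illustrative choices, and the resolution is estimated as the square root of the mean cell area.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the slides


def icon_cells(n: int, k: int) -> int:
    """Number of triangular cells in an RnBk grid: 20 * n^2 * 4^k."""
    return 20 * n * n * 4 ** k


def avg_resolution_km(n: int, k: int) -> float:
    """Estimate the grid spacing as the square root of the mean cell area."""
    sphere_area = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return math.sqrt(sphere_area / icon_cells(n, k))


# Print the grids that appear in the table on the slide.
for k in (4, 5, 6, 7, 9, 10, 11):
    print(f"R2B{k:02d}  {icon_cells(2, k):>10d}  {avg_resolution_km(2, k):7.2f} km")
```

The estimated resolutions land close to the rounded values on the slide; small differences are expected because the slide values are rounded.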
-
Very High Resolution Climate Modelling HD(CP)2
-
HDCP2 Model - Sustained Performance
[Figure: simulated days per day (0.0 to 0.7) versus number of cores (0 to 64800). Series: DE 3 Domains, DE-3Dom-Shifted-Night, DE-3Dom-Shifted-Day, Estimation, Without Output.]
-
Some Statistics (01.2016-02.2018)
Total number of / amount of:
simulated days: 26
variables written out: 169
meteorological stations: 36
data output: 1.5 PB + meteogram: 1 TB
node hours used: 3.25 million node hours
-
NEC SX-Aurora TSUBASA: ICON-NWP (R2B06L90)
• Successful compilation with -O2 -finline-functions
Time in sec   Aurora without IVDEP   Aurora with IVDEP   DWD-Cray 36core-BDW
total                2412.74               1633.41              433.21
nh_solve              902.22                124.89              226.43
physics              1170.16               1177.07              128.46
nh_solve on Aurora with IVDEP: 7.2x faster than without IVDEP, 1.8x faster than on the Cray node
(G. Zängl, DWD)
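The 7.2x and 1.8x annotations on this slide appear to be nh_solve ratios. As a minimal sketch, the timings can be recomputed from the table; the dictionary layout and the helper name `ratio` are illustrative assumptions, while the numbers are the ones on the slide.

```python
# Timings in seconds, copied from the slide (ICON-NWP, R2B06L90).
timings = {
    "aurora_no_ivdep": {"total": 2412.74, "nh_solve": 902.22, "physics": 1170.16},
    "aurora_ivdep": {"total": 1633.41, "nh_solve": 124.89, "physics": 1177.07},
    "cray_bdw_36core": {"total": 433.21, "nh_solve": 226.43, "physics": 128.46},
}


def ratio(slower: str, faster: str, timer: str) -> float:
    """How many times longer `slower` takes than `faster` for one timer."""
    return timings[slower][timer] / timings[faster][timer]


# Effect of IVDEP on the dynamical core, and Aurora versus the Cray node.
ivdep_gain = ratio("aurora_no_ivdep", "aurora_ivdep", "nh_solve")
vs_cray = ratio("cray_bdw_36core", "aurora_ivdep", "nh_solve")
# Physics is essentially unchanged by IVDEP and stays far slower than on Cray.
physics_gap = ratio("aurora_ivdep", "cray_bdw_36core", "physics")

print(f"nh_solve, IVDEP gain:      {ivdep_gain:.1f}x")
print(f"nh_solve, Aurora vs Cray:  {vs_cray:.1f}x")
print(f"physics, Aurora vs Cray:   {physics_gap:.1f}x slower")
```

Read this way, the dynamical core benefits strongly from IVDEP, while the physics time on Aurora barely changes and dominates the total.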
-
NEC SX-Aurora TSUBASA
Compiler shows reluctant behaviour towards vectorization
Many unvectorizable dependencies detected
Using pointer arrays with the CONTIGUOUS attribute leads to unvectorizable dependencies
-
NEC SX-Aurora TSUBASA: ICON-NWP (R2B06L90)
nh_solve timer                                        1 node NEC SX-Aurora TSUBASA,   1 node Cray Broadwell,
                                                      16 cores (time in sec)          36 cores (time in sec)
nh_solve.cellcomp (direct addressing, memory-bound)   3.16                            36.05
nh_solve.veltend (indirect addressing)                14                              31.7
nh_solve.vimpl (good cache utilization)               11.9                            53.5
(G. Zängl, DWD)
-
DYAMOND : ICON-AES / R2B09L90 / 1 Simulated Day
mistral BDW   100 nodes   200 nodes   400 nodes   600 nodes   900 nodes
total           8731.34     4072.34     2192.81     1446.71     1007.58
[Figure: Strong Scaling. Speedup versus # nodes (100, 200, 400, 600, 900), plotted against a linear reference.]
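The strong-scaling behaviour can be recomputed from the "total" row. The sketch below assumes the totals are wall-clock seconds (the slide does not state the unit) and takes the 100-node run as the baseline; the function name `strong_scaling` is illustrative.

```python
# "total" timings per node count, copied from the slide (unit assumed: seconds).
totals = {100: 8731.34, 200: 4072.34, 400: 2192.81, 600: 1446.71, 900: 1007.58}


def strong_scaling(times: dict, base_nodes: int) -> dict:
    """Return {nodes: (speedup, parallel efficiency)} relative to base_nodes."""
    base_time = times[base_nodes]
    result = {}
    for nodes, elapsed in sorted(times.items()):
        speedup = base_time / elapsed
        ideal = nodes / base_nodes  # linear speedup for this node count
        result[nodes] = (speedup, speedup / ideal)
    return result


scaling = strong_scaling(totals, base_nodes=100)
for nodes, (speedup, efficiency) in scaling.items():
    print(f"{nodes:4d} nodes  speedup {speedup:5.2f}  efficiency {efficiency:6.1%}")
```

With these numbers the 200-node run comes out slightly better than linear, and efficiency stays above 95% up to 900 nodes.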
-
DYAMOND : ICON-AES / R2B09L90 / 1 Simulated Day
Time Distribution
# nodes          100   200   400   600   900
nh_solve/total   56%   57%   58%   57%   57%
physics/total    25%   28%   31%   30%   31%
rest             18%   15%   11%   13%   12%
-
NEC SX-Aurora TSUBASA: ICON-coupled ATM/OCE
• Successful compilation with optimization level O3
Hardware                         MPI tasks                 Total           echam_bcs       Difference of total   Slowdown
                                                           (time in sec)   (time in sec)   time in sec           factor
1 mistral node: 24 HSW cores     22 Atmosphere + 2 Ocean   65.8            4.12
1 Aurora node: 16 vector cores   14 Atmosphere + 2 Ocean   296.7           16.7            230.9                 4.5x
• 1 Simulated Day
• Input seems to be a bottleneck
(R. Redler, MPI-Met)
-
NEC SX-Aurora TSUBASA: ECHAM6
• Successful compilation only when using low optimization level O1
Number of MPI tasks   HSW, O3 + hiopt   NEC SX-Aurora, O1   Difference      Slowdown
                      (time in sec)     (time in sec)       (time in sec)
16                    48.657            210.49              161.83          13.15x
32                    26.122            166.43              140.30          6.37x
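As a consistency check, the difference and slowdown columns can be recomputed from the two timing columns. This is a sketch under the assumption that slowdown means Aurora time divided by HSW time; the slide does not define it. With that definition the 32-task row reproduces the printed 6.37, while the 16-task row gives roughly 4.3 rather than the printed 13.15, so the printed value may follow a different definition or contain a typo.

```python
# (MPI tasks, HSW seconds, Aurora seconds, slowdown as printed on the slide)
rows = [
    (16, 48.657, 210.49, 13.15),
    (32, 26.122, 166.43, 6.37),
]


def slowdown(hsw_seconds: float, aurora_seconds: float) -> float:
    """Assumed definition: Aurora run time divided by Haswell run time."""
    return aurora_seconds / hsw_seconds


for tasks, hsw, aurora, printed in rows:
    diff = aurora - hsw
    factor = slowdown(hsw, aurora)
    print(f"{tasks} tasks: difference {diff:.2f} s, "
          f"recomputed slowdown {factor:.2f}x, printed {printed:.2f}x")
```

The recomputed differences match the slide in both rows, which suggests the timing columns themselves are transcribed correctly.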
-
NEC SX-Aurora TSUBASA : ftrace
Overhead of ftrace is huge
               with ftrace     no ftrace
               (time in sec)   (time in sec)
ICON-AES       478             28
ICON-Coupled   1600            297
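To put a number on the ftrace overhead, the two columns can be divided. This is a minimal sketch using only the timings from the slide; the function name `overhead_factor` is an illustrative choice.

```python
# (seconds with ftrace, seconds without ftrace), copied from the slide.
runs = {
    "ICON-AES": (478.0, 28.0),
    "ICON-Coupled": (1600.0, 297.0),
}


def overhead_factor(with_ftrace: float, without_ftrace: float) -> float:
    """How many times longer the instrumented run takes."""
    return with_ftrace / without_ftrace


for model, (traced, plain) in runs.items():
    factor = overhead_factor(traced, plain)
    print(f"{model}: {factor:.1f}x slower with ftrace "
          f"({traced - plain:.0f} s extra)")
```

On these figures the instrumented ICON-AES run takes about 17 times as long as the plain run, and the coupled run about 5 times as long.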
-
Conclusion & Outlook
• NEC SX-Aurora architecture has good potential to deliver sustained performance
• Compiler with efficient vectorization is essential
• Good scaling over hundreds/thousands of nodes is necessary
• Efficient parallel I/O is vital for high resolution simulations
• The performance of the file system is very important