TRANSCRIPT
Deutscher Wetterdienst
Porting Operational Models to Multi- and Many-Core Architectures
Oliver Fuhrer, MeteoSchweiz
Xavier Lapillonne, MeteoSchweiz
Ulrich Schättler, Deutscher Wetterdienst
Contents
Strong Scalability of the Operational Models
GPU Developments for the COSMO-Model
Details of the Porting
Contributing Scientists
COSMO-Model: „The COSMO Development Team“
COSMO GPU developments: Xavier Lapillonne, Oliver Fuhrer, Carlos Osuna, Thomas Schulthess, and many more
ICON: „The ICON Development Team“, esp. Florian Prill, Daniel Reinert, Günther Zängl
Thank you all for providing material for this presentation!
Strong Scalability for the Operational Models
The COSMO-Model
Operational since December 1999
Original DWD code, now further developed and maintained by the COSMO consortium
Used at about 30 national weather services (for operational production) and at many universities and institutes (for research)
Also used for climate modelling (COSMO-CLM) and to simulate aerosols and reactive trace gases (COSMO-ART)
Flat MPI implementation
Old Scalability Results
From the HP2C report (June 2010): „Performance Analysis and Prototyping of the COSMO Regional Atmospheric Model“ (Matthew Cordery et al.):
„We note the poor parallel scaling characteristics of COSMO beyond 1000 cores“
[Figure: parallel speedup of COSMO for a 1-hour simulation on a 1 km grid]
Tests based on COSMO version 4.10 (from September 2009)
Scalability Tests with COSMO-D2
New domain: 651 × 716 × 65 grid points
Test Characteristics
12-hour forecast should run in ≤ 1200 s in ensemble mode and in ≤ 400 s in deterministic mode
„Nudgecast“ run: nudging and latent heat nudging during the first 3 h
SynSat pictures every 15 minutes
Amount of output data per hour: 1.6 GByte; asynchronous output is used with 4 or 5 output cores
Based on COSMO 5.1 (Nov. 2014)
Scalability of COSMO Components (incl. Comm.)
[Figure: speedup (0.25 to 32) vs. number of cores (200 to 6400) for the components Dynamics, Physics, Nudging, LHN, I/O and Total, compared against ideal scaling]
Is this good or bad?
The COSMO-Model scales reasonably well up to 1600 cores for the COSMO-D2 domain size. Dynamics and physics also scale beyond that, up to 6400 cores (which is interesting for climate applications).
Meeting the operational (DWD / NWP) requirements:
for ensemble mode, about 650 cores are necessary to run a 12-hour forecast in less than 1200 seconds.
for deterministic mode, 6400 cores are needed to run in less than 400 seconds.
But this is one of the more demanding applications. Typical NWP and CLM configurations run well on several hundred to a few thousand cores.
ICON: ICOsahedral Nonhydrostatic Model
New development: replaced GME as the operational global model at DWD in January 2015
Hybrid implementation: MPI + OpenMP
Joint development project of DWD and the Max Planck Institute for Meteorology
about 40 active developers from meteorology and computer science
~ 600,000 lines of Fortran code
Lately joined by KIT to implement ICON-ART (environmental prediction)
A regional mode is under development (will replace COSMO around 2020)
[Diagram: ICON is multi-scale, covering global and regional weather prediction, climate prediction and environmental predictions]
ICON Parallel Scaling on ECMWF's XC30 („ccb“)
Real-data test setup (date: 01.06.2014):
5 km global resolution, 20,971,520 grid cells
hybrid run, 4 threads/task
1000-step forecast, without reduced radiation grid, no output
ICON in the High-Q Club
To join the High-Q Club, an application has to scale across the full JUQUEEN, the 28-rack BlueGene/Q system at the Jülich Supercomputing Centre (JSC) with 458,752 cores (1,835,008 possible threads).
A cloud-resolving (large-eddy simulation, LES) version of ICON, developed within the project „High Definition Clouds and Precipitation for advancing Climate Prediction“ (HD(CP)2), has been tested, using a horizontal grid spacing of approximately 100 m.
http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/_node.html
The group was able to show that the LES physics and the dynamical core scale well up to the full JUQUEEN machine.
GPU Developments for the COSMO-Model
The COSMO GPU Project
Started as part of the larger Swiss HP2C initiative (High-Performance High-Productivity Computing). Aims of HP2C were (among others):
to develop applications that run efficiently on different architectures: multi-core CPUs (x86) and GPUs
to allow domain scientists to easily bring in new developments
co-design of the applications
Porting strategy:
All COSMO components have low compute intensity: transferring data only for selected components would be too costly
Full port strategy [1]
Work is going on in the COSMO priority project POMPA (Performance on Massively Parallel Architectures) to implement all changes in the operational code
[1] Fuhrer, O. et al., Supercomputing Frontiers and Innovations, 1, 2014
Porting Strategy
Avoid CPU-GPU copies by executing all computations of the time loop on the GPU
Complete rewrite of the dynamical core with a domain-specific language (STELLA) based on C++
The rest of the model is ported with OpenACC
Communication library (GCL) for GPU-GPU communications
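The following is a minimal sketch of the idea behind GPU-GPU communication, not the GCL API itself: the halo of a device-resident field is exchanged with the MPI neighbours via an OpenACC host_data region and a GPU-aware MPI, so no explicit device-host copies are needed. The routine and variable names are hypothetical.

! Minimal sketch (hypothetical, not the GCL API): exchange the j-halo rows of a
! device-resident 2D field slab with the south/north MPI neighbours.
! Assumes a GPU-aware MPI implementation so that device addresses can be passed to MPI.
subroutine exchange_j_halo(a, ni, nj, south, north, comm)
  use mpi
  implicit none
  integer, intent(in)    :: ni, nj, south, north, comm
  real,    intent(inout) :: a(ni, nj)      ! field kept on the GPU (full port strategy)
  integer :: ierr, req(4), stats(MPI_STATUS_SIZE, 4)

  !$acc data present(a)
  !$acc host_data use_device(a)
  ! Receive into the halo rows j=1 and j=nj, send the inner rows j=2 and j=nj-1;
  ! inside host_data the addresses handed to MPI are device addresses.
  call MPI_Irecv(a(1, 1),      ni, MPI_REAL, south, 0, comm, req(1), ierr)
  call MPI_Irecv(a(1, nj),     ni, MPI_REAL, north, 1, comm, req(2), ierr)
  call MPI_Isend(a(1, 2),      ni, MPI_REAL, south, 1, comm, req(3), ierr)
  call MPI_Isend(a(1, nj - 1), ni, MPI_REAL, north, 0, comm, req(4), ierr)
  call MPI_Waitall(4, req, stats, ierr)
  !$acc end host_data
  !$acc end data
end subroutine exchange_j_halo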
Ongoing Work / Applications
The GPU version of the code is used for cloud-resolving climate simulations at European scale (D. Leutwyler, ETHZ)
Test case: 7 days in Jan. 2007, including winter storm Kyrill (18.01.2007), on 144 nodes of Piz Daint
Also used for the COSMO project “CALMO” for calibration / tuning of the model
The GPU version of the code has been run daily for more than 1.5 years in Switzerland for validation: 33 h simulation on the prototype OPCODE system (8 GPUs)
Merging the changes back into the official COSMO trunk (ongoing)
The GPU version of COSMO will be used for pre-operational forecasting at MeteoSwiss starting November 1, 2015
Performance Results
COSMO-E (2.2 km), the MeteoSwiss ensemble configuration: 582 × 390 × 60 grid points, 120 h, 21 members
Timed components: dynamics; physics (microphysics, radiation, SSO, turbulence, soil, shallow convection); output
Comparison (chip to chip):
CPU: 8 Intel Xeon E5-2690 (Haswell) @ 2.6 GHz (each with 12 cores) on the Cray XC40 Piz Dora at CSCS
GPU: 4 NVIDIA Tesla K80 cards (each with 2 GK210 GPUs) on the Cray CS-Storm cluster Piz Kesch at CSCS
Performance Results
Timing for one COSMO-E member (120 h, 582 × 390 × 60 grid points):
[Figure: time (s) for the CPU-new DP and GPU-new DP runs; the GPU run is about 2.1× faster (lower is better)]
• CPU: 8 Intel Xeon E5-2690
• GPU: 4 NVIDIA Tesla K80
Overall run 2.1× faster on GPU
Physics (OpenACC and optimizations): 2.4× faster on GPU
Dynamics (using the STELLA library): 2.4× faster on GPU (also faster on the CPU than the original code)
Complete Re-Write of the Dynamics
STELLA is a domain-specific language based on C++ (template metaprogramming)
It has been developed by computer scientists in cooperation with MeteoSwiss and C2SM
Backends are available for
x86 CPUs: C++
NVIDIA GPUs: CUDA
Xeon Phi: a first implementation exists, but it does not yet give satisfying performance results (waiting for KNL)
It is going to be updated (and will then be called GridTools):
for better usability
to also cover other applications (e.g. global weather and climate models such as ICON)
Complete Re-Write of the Dynamics (II)
Advantages:
You can learn new methods and a new programming style
Programming with a DSL is architecture independent
The CPU version runs faster than the Fortran version (usage of iterators, blocking)
Disadvantages:
You have to learn new methods and a new programming style
Low involvement of the actual developers of the dynamical core
Backends have to be supported by specialists; it is not yet clear whether the actual developers can do this
The Fortran and C++ dynamical cores will co-exist for the next 2-3 years!
Physics
Code optimized for GPU [2]:
loop restructuring (reduce kernel overhead, improve reuse)
scalar replacements
on-the-fly computations (reduce memory accesses)
manual caching (using scalars)
replacement of local automatic arrays with global arrays (to avoid frequent memory allocation on the GPU)
Some GPU optimizations degrade performance on the CPU: keep separate routines where required (using ifdef), as in the sketch below
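A minimal sketch of this ifdef separation, with a hypothetical routine name and loop body (not actual COSMO code): the GPU path fuses the loops into one kernel and replaces the temporary array by a scalar, while the CPU path keeps the original small, easily vectorized loops.

subroutine adjust_profile(a, ni, nk)
  implicit none
  integer, intent(in)    :: ni, nk
  real,    intent(inout) :: a(ni, nk)
  real, parameter :: D = 0.5
  real    :: c(ni), zc
  integer :: i, k

#ifdef _OPENACC
  ! GPU path: one large kernel over i, k run sequentially inside each thread,
  ! the temporary array c(:) replaced by the scalar zc
  ! (a is assumed to be device resident, kept there by an enclosing data region)
  !$acc parallel loop gang vector present(a) private(zc)
  do i = 1, ni
    !$acc loop seq
    do k = 2, nk
      zc = D * exp(a(i, k-1))
      a(i, k) = zc * a(i, k)
    end do
  end do
#else
  ! CPU path: original small loops, friendly to compiler auto-vectorization
  do k = 2, nk
    do i = 1, ni
      c(i) = D * exp(a(i, k-1))
    end do
    do i = 1, ni
      a(i, k) = c(i) * a(i, k)
    end do
  end do
#endif
end subroutine adjust_profile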
[2] Lapillonne and Fuhrer, Parallel Processing Letters, 24, 2014
Code Example

Original code:
do k=2,nk
  do i=1,ni
    ! ... some code 1 ...
    c(i) = D*exp(a(i,k-1))
  end do
  do i=1,ni
    a(i,k) = c(i)*a(i,k)
    ! ... some code 2 ...
  end do
end do

OpenACC 1 (keeps the performance on the CPU, but is not optimal on the GPU: two small kernels are launched per k iteration):
do k=2,nk
  !$acc parallel
  !$acc loop gang vector
  do i=1,ni
    ! ... some code 1 ...
    c(i) = D*exp(a(i,k-1))
  end do
  !$acc end parallel
  !$acc parallel
  !$acc loop gang vector
  do i=1,ni
    a(i,k) = c(i)*a(i,k)
    ! ... some code 2 ...
  end do
  !$acc end parallel
end do

OpenACC 2 (optimal performance on the GPU, but may decrease performance on the CPU: the loops are reordered and fused into one large kernel, the parallel loop runs over i, k is handled sequentially within each thread, and the temporary array c(i) is replaced by the scalar zc):
!$acc parallel
!$acc loop gang vector
do i=1,ni
  !$acc loop seq
  do k=2,nk
    ! ... some code 1 ...
    zc = D*exp(a(i,k-1))
    a(i,k) = zc*a(i,k)
    ! ... some code 2 ...
  end do
end do
!$acc end parallel
Example in the Radiation
• Considering a key kernel of the radiation scheme
• Test domain 128 × 128 × 60, 1 Sandy Bridge CPU vs. 1 K20x GPU
[Figure: speedup with respect to the reference code on the CPU (higher is better), comparing the CPU-optimized (reference) and GPU-optimized versions of the kernel run on CPU and GPU]
Radiation is the strongest example; averaged over the physics, code optimized for the GPU is 1.3× slower when run on the CPU
Different Optimization Requirements: CPU vs. GPU
CPU: compute bound in the physics
Compiler auto-vectorization: easier with small loop constructs
Pre-computation
GPU: memory-bandwidth limited
Benefits from large kernels: reduced kernel launch overhead, better overlap of computation and memory accesses
Loop re-ordering and scalar replacement
On-the-fly computation
Problematic for a single source code, for example:
The cloud-ice microphysics in ICON still uses the small kernels (for vectorization), while COSMO uses the version with large kernels, although in principle they are the same code
Global automatic arrays for the CPU version (where available memory might be a problem)
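A minimal sketch of the pre-computation vs. on-the-fly trade-off (the routine, variable names and the weighting function below are hypothetical, not COSMO code): the CPU build pre-computes a k-dependent factor once and keeps the inner loop simple for auto-vectorization, while the GPU build recomputes the factor inside the kernel to save memory traffic.

subroutine apply_weights(t, w, ni, nk)
  implicit none
  integer, intent(in)    :: ni, nk
  real,    intent(in)    :: w(nk)
  real,    intent(inout) :: t(ni, nk)
  real    :: fac(nk)
  integer :: i, k

#ifdef _OPENACC
  ! GPU: memory-bandwidth limited, so recompute exp(w(k)) on the fly inside the
  ! kernel instead of storing and re-reading a pre-computed array
  ! (t and w are assumed to be device resident already)
  !$acc parallel loop gang vector collapse(2) present(t, w)
  do k = 1, nk
    do i = 1, ni
      t(i, k) = t(i, k) * exp(w(k))
    end do
  end do
#else
  ! CPU: compute bound, so pre-compute the k-dependent factor once and keep
  ! the inner loop simple for compiler auto-vectorization
  do k = 1, nk
    fac(k) = exp(w(k))
  end do
  do k = 1, nk
    do i = 1, ni
      t(i, k) = t(i, k) * fac(k)
    end do
  end do
#endif
end subroutine apply_weights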
Developers Wish List
We want to use a single source code for all architectures
and do not want to use too many ifdefs
keep the code simple, portable, readable and similar to Fortran
Dynamic memory management on GPUs
Common / unified support for OpenACC by more compilers
performance portability for OpenACC
what is Intel's opinion on OpenACC?