TRANSCRIPT
Deutscher Wetterdienst
Porting Operational Models to Multi- and Many-Core Architectures
Oliver Fuhrer, MeteoSchweiz
Xavier Lapillonne, MeteoSchweiz
Ulrich Schättler, Deutscher Wetterdienst
Contents
Strong Scalability of the Operational Models
GPU Developments for the COSMO-Model
Details of the Porting
Contributing Scientists
COSMO-Model: „The COSMO Development Team“
COSMO GPU developments: Xavier Lapillonne, Oliver Fuhrer, Carlos Osuna, Thomas Schulthess, and many more
ICON: „The ICON Development Team“, esp. Florian Prill, Daniel Reinert, Günther Zängl
Thank you all for providing material for this presentation!
Strong Scalability for the Operational Models
The COSMO-Model
Operational since December 1999
Original DWD code, now further developed and maintained by the COSMO consortium
Used at about 30 national weather services (for operational production) and at many universities and institutes (for research)
Also used for climate modelling (COSMO-CLM) and to simulate aerosols and reactive trace gases (COSMO-ART)
Flat MPI implementation
Old Scalability Results
From the HP2C report (June 2010): „Performance Analysis and Prototyping of the COSMO Regional Atmospheric Model“ (Matthew Cordery et al.):
„We note the poor parallel scaling characteristics of COSMO beyond 1000 cores“
[Figure: parallel speedup of COSMO for a 1-hour simulation on a 1 km grid]
Tests based on COSMO version 4.10 (from September 2009)
Scalability Tests with COSMO-D2
New domain: 651 × 716 × 65 grid points
Test Characteristics
12-hour forecast should run in ≤ 1200 s in ensemble mode and in ≤ 400 s in deterministic mode
„Nudgecast“ run: nudging and latent heat nudging during the first 3 h
SynSat pictures every 15 minutes
Amount of output data per hour: 1.6 GByte; asynchronous output is used with 4 or 5 output cores
Based on COSMO 5.1 (Nov. 2014)
Scalability of COSMO Components (incl. Comm.)
[Figure: speedup (0.25 to 32) vs. number of cores (200 to 6400) for the components Dynamics, Physics, Nudging, LHN, I/O and Total, compared against ideal scaling]
Is this good or bad?
The COSMO-Model scales reasonably well up to 1600 cores for the COSMO-D2 domain size. Dynamics and physics also scale beyond that, up to 6400 cores (which is interesting for climate applications).
Meeting the operational (DWD / NWP) requirements:
for ensemble mode, about 650 cores are necessary to run a 12-hour forecast in less than 1200 seconds.
for deterministic mode, 6400 cores are needed to run in less than 400 seconds.
But this is one of the more demanding applications. Typical NWP and CLM configurations run well on several hundred to a few thousand cores.
ICON: ICOsahedral Nonhydrostatic Model
New development: replaced GME as the operational global model at DWD in January 2015
Hybrid implementation: MPI + OpenMP
Joint development project of DWD and the Max Planck Institute for Meteorology
about 40 active developers from meteorology and computer science
~ 600,000 lines of Fortran code
Lately joined by KIT to implement ICON-ART (environmental prediction)
A regional mode is under development (will replace COSMO around 2020)
[Diagram: ICON is multi-scale, covering global and regional weather prediction, climate prediction and environmental predictions]
ICON Parallel Scaling on ECMWF's XC30 („ccb“)
Real-data test setup (date: 01.06.2014):
5 km global resolution, 20,971,520 grid cells
hybrid run, 4 threads/task
1000-step forecast, without reduced radiation grid, no output
ICON in the High-Q Club
To join the High-Q Club, an application has to scale across the full JUQUEEN, the 28-rack BlueGene/Q system at the Jülich Supercomputing Centre (JSC) with 458,752 cores (1,835,008 possible threads).
A cloud-resolving (large-eddy simulation, LES) version of ICON, developed within the project „High Definition Clouds and Precipitation for advancing Climate Prediction“ (HD(CP)2), has been tested, using a horizontal grid spacing of approximately 100 m.
http://www.fz-juelich.de/ias/jsc/EN/Expertise/High-Q-Club/_node.html
The group was able to show that the LES physics and the dynamical core scale well up to the full JUQUEEN machine.
GPU Developments for the COSMO-Model
The COSMO GPU Project
Started as part of the larger Swiss HP2C initiative (High-Performance High-Productivity Computing). Aims of HP2C were (among others):
to develop applications that run efficiently on different architectures: multi-core CPUs (x86) and GPUs
to allow domain scientists to easily bring in new developments
co-design of the applications
Porting strategy:
All COSMO components have low compute intensity: transferring data only for selected components would be too costly
Full port strategy [1]
Work is going on in the COSMO priority project POMPA (Performance on Massively Parallel Architectures) to implement all changes in the operational code
[1] Fuhrer, O. et al., Supercomputing Frontiers and Innovations, 1, 2014
Porting Strategy
Avoid CPU-GPU copies by executing all computations of the time loop on the GPU
Complete rewrite of the dynamical core with a domain-specific language (STELLA) based on C++
The rest of the model is ported with OpenACC
Communication library (GCL) for GPU-GPU communications
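The following is a minimal sketch of the idea behind GPU-GPU communication, not the GCL API itself: the halo of a device-resident field is exchanged with the MPI neighbours via an OpenACC host_data region and a GPU-aware MPI, so no explicit device-host copies are needed. The routine and variable names are hypothetical.

! Minimal sketch (hypothetical, not the GCL API): exchange the j-halo rows of a
! device-resident 2D field slab with the south/north MPI neighbours.
! Assumes a GPU-aware MPI implementation so that device addresses can be passed to MPI.
subroutine exchange_j_halo(a, ni, nj, south, north, comm)
  use mpi
  implicit none
  integer, intent(in)    :: ni, nj, south, north, comm
  real,    intent(inout) :: a(ni, nj)      ! field kept on the GPU (full port strategy)
  integer :: ierr, req(4), stats(MPI_STATUS_SIZE, 4)

  !$acc data present(a)
  !$acc host_data use_device(a)
  ! Receive into the halo rows j=1 and j=nj, send the inner rows j=2 and j=nj-1;
  ! inside host_data the addresses handed to MPI are device addresses.
  call MPI_Irecv(a(1, 1),      ni, MPI_REAL, south, 0, comm, req(1), ierr)
  call MPI_Irecv(a(1, nj),     ni, MPI_REAL, north, 1, comm, req(2), ierr)
  call MPI_Isend(a(1, 2),      ni, MPI_REAL, south, 1, comm, req(3), ierr)
  call MPI_Isend(a(1, nj - 1), ni, MPI_REAL, north, 0, comm, req(4), ierr)
  call MPI_Waitall(4, req, stats, ierr)
  !$acc end host_data
  !$acc end data
end subroutine exchange_j_halo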
Ongoing Work / Applications
The GPU version of the code is used for cloud-resolving climate simulations at European scale (D. Leutwyler, ETHZ)
Test case: 7 days in Jan. 2007, including winter storm Kyrill (18.01.2007), on 144 nodes of Piz Daint
Also used for the COSMO project “CALMO” for calibration / tuning of the model
The GPU version of the code has been run daily for more than 1.5 years in Switzerland for validation: 33 h simulation on the prototype OPCODE system (8 GPUs)
Merging the changes back into the official COSMO trunk (ongoing)
The GPU version of COSMO will be used for pre-operational forecasting at MeteoSwiss starting November 1, 2015
Performance Results
COSMO-E (2.2 km), the MeteoSwiss ensemble configuration: 582 × 390 × 60 grid points, 120 h, 21 members
Timed components: dynamics; physics (microphysics, radiation, SSO, turbulence, soil, shallow convection); output
Comparison (chip to chip):
CPU: 8 Intel Xeon E5-2690 (Haswell) @ 2.6 GHz (each with 12 cores) on the Cray XC40 Piz Dora at CSCS
GPU: 4 NVIDIA Tesla K80 cards (each with 2 GK210 GPUs) on the Cray CS-Storm cluster Piz Kesch at CSCS
Performance Results
Timing for one COSMO-E member (120 h, 582 × 390 × 60 grid points):
[Figure: time (s) for the CPU-new DP and GPU-new DP runs; the GPU run is about 2.1× faster (lower is better)]
• CPU: 8 Intel Xeon E5-2690
• GPU: 4 NVIDIA Tesla K80
Overall run 2.1× faster on GPU
Physics (OpenACC and optimizations): 2.4× faster on GPU
Dynamics (using the STELLA library): 2.4× faster on GPU (also faster on the CPU than the original code)
Complete Re-Write of the Dynamics
STELLA is a domain-specific language based on C++ (template metaprogramming)
It has been developed by computer scientists in cooperation with MeteoSwiss and C2SM
Backends are available for
x86 CPUs: C++
NVIDIA GPUs: CUDA
Xeon Phi: a first implementation exists, but it does not yet give satisfying performance results (waiting for KNL)
It is going to be updated (and will then be called GridTools):
for better usability
to also cover other applications (e.g. global weather and climate models such as ICON)
Complete Re-Write of the Dynamics (II)
Advantages:
You can learn new methods and a new programming style
Programming with a DSL is architecture independent
The CPU version runs faster than the Fortran version (usage of iterators, blocking)
Disadvantages:
You have to learn new methods and a new programming style
Low involvement of the actual developers of the dynamical core
Backends have to be supported by specialists; it is not yet clear whether the actual developers can do this
The Fortran and C++ dynamical cores will co-exist for the next 2-3 years!
Physics
Code optimized for GPU [2]:
loop restructuring (reduce kernel overhead, improve reuse)
scalar replacements
on-the-fly computations (reduce memory accesses)
manual caching (using scalars)
replacement of local automatic arrays with global arrays (to avoid frequent memory allocation on the GPU)
Some GPU optimizations degrade performance on the CPU: keep separate routines where required (using ifdef), as in the sketch below
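A minimal sketch of this ifdef separation, with a hypothetical routine name and loop body (not actual COSMO code): the GPU path fuses the loops into one kernel and replaces the temporary array by a scalar, while the CPU path keeps the original small, easily vectorized loops.

subroutine adjust_profile(a, ni, nk)
  implicit none
  integer, intent(in)    :: ni, nk
  real,    intent(inout) :: a(ni, nk)
  real, parameter :: D = 0.5
  real    :: c(ni), zc
  integer :: i, k

#ifdef _OPENACC
  ! GPU path: one large kernel over i, k run sequentially inside each thread,
  ! the temporary array c(:) replaced by the scalar zc
  ! (a is assumed to be device resident, kept there by an enclosing data region)
  !$acc parallel loop gang vector present(a) private(zc)
  do i = 1, ni
    !$acc loop seq
    do k = 2, nk
      zc = D * exp(a(i, k-1))
      a(i, k) = zc * a(i, k)
    end do
  end do
#else
  ! CPU path: original small loops, friendly to compiler auto-vectorization
  do k = 2, nk
    do i = 1, ni
      c(i) = D * exp(a(i, k-1))
    end do
    do i = 1, ni
      a(i, k) = c(i) * a(i, k)
    end do
  end do
#endif
end subroutine adjust_profile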
[2] Lapillonne and Fuhrer, Parallel Processing Letters, 24, 2014
Code Example

Original code:
do k=2,nk
  do i=1,ni
    ! ... some code 1 ...
    c(i) = D*exp(a(i,k-1))
  end do
  do i=1,ni
    a(i,k) = c(i)*a(i,k)
    ! ... some code 2 ...
  end do
end do

OpenACC 1 (keeps the performance on the CPU, but is not optimal on the GPU: two small kernels are launched per k iteration):
do k=2,nk
  !$acc parallel
  !$acc loop gang vector
  do i=1,ni
    ! ... some code 1 ...
    c(i) = D*exp(a(i,k-1))
  end do
  !$acc end parallel
  !$acc parallel
  !$acc loop gang vector
  do i=1,ni
    a(i,k) = c(i)*a(i,k)
    ! ... some code 2 ...
  end do
  !$acc end parallel
end do

OpenACC 2 (optimal performance on the GPU, but may decrease performance on the CPU: the loops are reordered and fused into one large kernel, the parallel loop runs over i, k is handled sequentially within each thread, and the temporary array c(i) is replaced by the scalar zc):
!$acc parallel
!$acc loop gang vector
do i=1,ni
  !$acc loop seq
  do k=2,nk
    ! ... some code 1 ...
    zc = D*exp(a(i,k-1))
    a(i,k) = zc*a(i,k)
    ! ... some code 2 ...
  end do
end do
!$acc end parallel
Example in the Radiation
• Considering a key kernel of the radiation scheme
• Test domain 128 × 128 × 60, 1 Sandy Bridge CPU vs. 1 K20x GPU
[Figure: speedup with respect to the reference code on the CPU (higher is better), comparing the CPU-optimized (reference) and GPU-optimized versions of the kernel run on CPU and GPU]
Radiation is the strongest example; averaged over the physics, code optimized for the GPU is 1.3× slower when run on the CPU
Different Optimization Requirements: CPU vs. GPU
CPU: compute bound in the physics
Compiler auto-vectorization: easier with small loop constructs
Pre-computation
GPU: memory-bandwidth limited
Benefits from large kernels: reduced kernel launch overhead, better overlap of computation and memory accesses
Loop re-ordering and scalar replacement
On-the-fly computation
Problematic for a single source code, for example:
The cloud-ice microphysics in ICON still uses the small kernels (for vectorization), while COSMO uses the version with large kernels, although in principle they are the same code
Global automatic arrays for the CPU version (where available memory might be a problem)
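A minimal sketch of the pre-computation vs. on-the-fly trade-off (the routine, variable names and the weighting function below are hypothetical, not COSMO code): the CPU build pre-computes a k-dependent factor once and keeps the inner loop simple for auto-vectorization, while the GPU build recomputes the factor inside the kernel to save memory traffic.

subroutine apply_weights(t, w, ni, nk)
  implicit none
  integer, intent(in)    :: ni, nk
  real,    intent(in)    :: w(nk)
  real,    intent(inout) :: t(ni, nk)
  real    :: fac(nk)
  integer :: i, k

#ifdef _OPENACC
  ! GPU: memory-bandwidth limited, so recompute exp(w(k)) on the fly inside the
  ! kernel instead of storing and re-reading a pre-computed array
  ! (t and w are assumed to be device resident already)
  !$acc parallel loop gang vector collapse(2) present(t, w)
  do k = 1, nk
    do i = 1, ni
      t(i, k) = t(i, k) * exp(w(k))
    end do
  end do
#else
  ! CPU: compute bound, so pre-compute the k-dependent factor once and keep
  ! the inner loop simple for compiler auto-vectorization
  do k = 1, nk
    fac(k) = exp(w(k))
  end do
  do k = 1, nk
    do i = 1, ni
      t(i, k) = t(i, k) * fac(k)
    end do
  end do
#endif
end subroutine apply_weights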
Developers Wish List
We want to use a single source code for all architectures
and do not want to use too many ifdefs
keep the code simple, portable, readable and similar to Fortran
Dynamic memory management on GPUs
Common / unified support for OpenACC by more compilers
performance portability for OpenACC
what is Intel's opinion on OpenACC?