Manno, 4.5.2011, © by Supercomputing Systems

COSMO - Dynamical Core Rewrite Approach, Rewrite and Status

Tobias Gysi

POMPA Workshop, Manno, 3.5.2011

Supercomputing Systems AG, Technopark 1, 8005 Zürich

Phone +41 43 456 16 00, Fax +41 43 456 16 10, www.scs.ch


Approach


COSMO Dynamical Core Rewrite

Challenge

•Assuming that the COSMO code will continue to run on commodity processors in the next couple of years, what performance improvement can we achieve by rewriting the dynamical core?

Boundary Conditions

•Do not touch the underlying physical model (i.e. the equations that are being solved)

– Formulas must remain as they are

– The ordering of computations and similar implementation details may change

– Results must remain ‘identical’ to ‘a very high level of accuracy’

•Part of an initiative looking at all parts of the COSMO code.

Support

•Support from & direct interaction with MeteoSwiss, DWD, CSCS, C2SM


Approach

[Figure: project timeline over ~2 years with the phases Feasibility Study, Library Design, Rewrite, and Test & Tune, shown for CPU and GPU; a “You Are Here” marker indicates the current status.]


Feasibility Study


Feasibility Study - Overview

• Get to know the code

• Understand performance characteristics

• Find computational motifs

– Stencil

– Tri-Diagonal Solver

• Implement a prototype code

– Relevant part of the dynamical core (Fast Wave Solver, ~30% of total runtime)

– Try to optimize for x86

– No MPI parallelization


Feasibility Study - Prototype

• Implemented in C++

• Optimize for memory-bandwidth utilization

– Avoid pre-computation, do computation on the fly

– Merge loops that access the same variables

– Use iterators rather than full index calculation on the 3D grid

– Store data contiguously in the ‘k-direction’ (vertical columns)
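The last two points can be illustrated with a minimal C++ sketch. It is not the prototype's code: the type `Field3D` and the function `column_sum` are hypothetical names chosen for illustration, and the layout formula `(i * nj + j) * nk + k` is one possible way to make the k index contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D field stored k-first: the k index varies fastest in memory,
// so one vertical column (fixed i, j) occupies a contiguous range.
struct Field3D {
    std::size_t ni, nj, nk;
    std::vector<double> data;

    Field3D(std::size_t ni_, std::size_t nj_, std::size_t nk_)
        : ni(ni_), nj(nj_), nk(nk_), data(ni_ * nj_ * nk_, 0.0) {}

    // Full index calculation: multiplications on every access.
    double& at(std::size_t i, std::size_t j, std::size_t k) {
        assert(i < ni && j < nj && k < nk);
        return data[(i * nj + j) * nk + k];
    }

    // Pointer to the first element of column (i, j).
    double* column(std::size_t i, std::size_t j) {
        assert(i < ni && j < nj);
        return data.data() + (i * nj + j) * nk;
    }
};

// Iterator-style traversal: the column offset is computed once,
// then the pointer is simply incremented along k.
double column_sum(Field3D& f, std::size_t i, std::size_t j) {
    double sum = 0.0;
    const double* p = f.column(i, j);
    for (const double* end = p + f.nk; p != end; ++p) {
        sum += *p;
    }
    return sum;
}
```

With this layout a loop over k touches consecutive memory addresses, which is the access pattern the bandwidth-oriented optimizations above aim for.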


Fast Wave Solver - Speedup

The performance difference is NOT due to the programming language but due to code optimizations!


Feasibility Study - Conclusion

• A performance increase of 2x has been achieved on a representative part of the code

• Main optimizations identified (for scalar processors)

– Avoid pre-calculation whenever possible

– Merge loops

– Change the storage order to k-first

• Performance is all about memory bandwidth


Rewrite


Design Targets

Write a code that

•Delivers the right results

– Dedicated unit-tests & verification framework

•Applies the performance optimization strategies used in the prototype

•Can be developed within a year to run on x86 and GPU platforms

– Mandatory: support three-level parallelism in a very flexible way

• Vector processing units (e.g. SSE)

• Multi-core node (sub-domain)

• Multiple nodes (domain) - not part of the SCS project

– Optional: write one single code that can be compiled for both platforms


Design Targets

Write a code that

•Facilitates future improvements in terms of

– New models / algorithms

– Portability to new computer architectures

•Can and will be integrated by the COSMO consortium into the main branch


Stencil Library - Ideas

• It is challenging to develop a stencil library

– There is no big chunk of work that can be hidden behind an API call (e.g. matrix multiplication)

– The actual update function of the stencil is heavily application specific and performance critical

• We use a DSEL-like approach (Domain Specific Embedded Language)

– “Stencil language” embedded in C++

– Separate description of loop logic and update function

– Optimized C++ code is generated at compile time (possible due to C++ metaprogramming capabilities)
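The separation of loop logic and update function can be sketched with plain C++ templates. This is only an illustration of the idea, not the stencil library's interface: `apply_stencil`, `Laplace1D`, and `Average3` are hypothetical names, and the example is reduced to a 1D column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop logic, written once: visits every interior point of a 1D column.
// The update function is a template parameter, so the call is resolved at
// compile time and the compiler can inline it into the loop body.
template <typename UpdateFunction>
void apply_stencil(std::vector<double>& out, const std::vector<double>& in) {
    assert(out.size() == in.size());
    for (std::size_t k = 1; k + 1 < in.size(); ++k) {
        out[k] = UpdateFunction::eval(in.data() + k);
    }
}

// Update function (hypothetical): second difference along k.
struct Laplace1D {
    static double eval(const double* center) {
        return center[-1] - 2.0 * center[0] + center[1];
    }
};

// Another update function (hypothetical): three-point average.
struct Average3 {
    static double eval(const double* center) {
        return (center[-1] + center[0] + center[1]) / 3.0;
    }
};
```

A call such as `apply_stencil<Laplace1D>(out, in)` reuses the same loop for a different update function without a run-time function-pointer call, which is the property the compile-time approach relies on.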


Stencil Library - Parallelization

• Parallelization on the node level is done by

– Splitting the calculation domain into blocks (IJ-Plane)

– Parallelizing the work over the blocks

– Double buffering avoids concurrency issues
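A minimal sketch of the block decomposition and of double buffering follows, under stated assumptions: the names `Block`, `split_ij_plane`, and `update_block` are hypothetical, the field is reduced to a single IJ-plane, and a four-neighbour average stands in for a real update function. Threads are left out; the point is that blocks only read `in` and only write their own part of `out`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index ranges [i0, i1) x [j0, j1) of one block in the IJ-plane.
struct Block {
    std::size_t i0, i1, j0, j1;
};

// Split an ni x nj plane into blocks of at most bi x bj points.
std::vector<Block> split_ij_plane(std::size_t ni, std::size_t nj,
                                  std::size_t bi, std::size_t bj) {
    assert(bi > 0 && bj > 0);
    std::vector<Block> blocks;
    for (std::size_t i = 0; i < ni; i += bi) {
        for (std::size_t j = 0; j < nj; j += bj) {
            blocks.push_back({i, std::min(i + bi, ni), j, std::min(j + bj, nj)});
        }
    }
    return blocks;
}

// Double buffering: the block reads only from `in` and writes only to its own
// points of `out`, so blocks can be processed in any order or concurrently.
// Boundary points of the plane are skipped.
void update_block(const Block& b, const std::vector<double>& in,
                  std::vector<double>& out, std::size_t ni, std::size_t nj) {
    assert(ni >= 2 && nj >= 2);
    assert(in.size() == ni * nj && out.size() == ni * nj);
    for (std::size_t i = std::max<std::size_t>(b.i0, 1); i < std::min(b.i1, ni - 1); ++i) {
        for (std::size_t j = std::max<std::size_t>(b.j0, 1); j < std::min(b.j1, nj - 1); ++j) {
            out[i * nj + j] = 0.25 * (in[(i - 1) * nj + j] + in[(i + 1) * nj + j] +
                                      in[i * nj + j - 1] + in[i * nj + j + 1]);
        }
    }
}
```

Because no block ever reads a value that another block writes, the result does not depend on the order in which blocks are processed, so the loop over blocks could be distributed over the cores of a node.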


Stencil Library – Loop Merging

• The library allows the definition of multiple stages per stencil

– Stages are update functions applied consecutively to one block

– As a block is typically much smaller than the complete domain we can leverage the caches of the CPU
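The effect of applying stages block by block can be sketched as follows. The two stages (`stage_scale`, `stage_shift`) are hypothetical pointwise updates chosen to keep the example short; stages that read neighbouring points would additionally need halo handling at block boundaries, which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical): pointwise update on the index range [begin, end).
inline void stage_scale(std::vector<double>& x, std::size_t begin, std::size_t end) {
    for (std::size_t n = begin; n < end; ++n) x[n] *= 2.0;
}

// Stage 2 (hypothetical): pointwise update on the index range [begin, end).
inline void stage_shift(std::vector<double>& x, std::size_t begin, std::size_t end) {
    for (std::size_t n = begin; n < end; ++n) x[n] += 1.0;
}

// Unmerged: each stage sweeps the complete domain, so for a large domain the
// data has left the cache again by the time the second stage starts.
void run_unmerged(std::vector<double>& x) {
    stage_scale(x, 0, x.size());
    stage_shift(x, 0, x.size());
}

// Merged: both stages are applied to one block before moving on to the next,
// so the second stage finds the block's data still in cache.
void run_merged(std::vector<double>& x, std::size_t block_size) {
    assert(block_size > 0);
    for (std::size_t begin = 0; begin < x.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, x.size());
        stage_scale(x, begin, end);
        stage_shift(x, begin, end);
    }
}
```

Both variants compute the same values; only the order of memory accesses differs.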


Stencil Library – Calculation On The Fly

• Calculation on the fly is supported using a combination of stages and column buffers

– Column buffers are fields with the size of one block local to every CPU core

– A first stage writes to a buffer while a second stage consumes the pre-calculated values
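A simplified C++ sketch of this pattern is given below. It assumes a buffer of one column instead of one block, and the function name `two_stage_columns` and its parameters are hypothetical. The weights `e[0]`, `e[1]`, `e[2]` stand for the offsets k-1, k, k+1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Processes ncol columns of length nk that are stored one after another
// (k-first). The buffer holds a = b + c for a single column and is reused
// for every column, so it stays small enough to remain in cache.
void two_stage_columns(const std::vector<double>& b, const std::vector<double>& c,
                       const double e[3], std::vector<double>& d,
                       std::size_t ncol, std::size_t nk) {
    assert(b.size() == ncol * nk && c.size() == b.size() && d.size() == b.size());
    std::vector<double> buffer(nk);
    for (std::size_t col = 0; col < ncol; ++col) {
        const std::size_t base = col * nk;
        // Stage 1: write the pre-calculated values to the column buffer.
        for (std::size_t k = 0; k < nk; ++k) {
            buffer[k] = b[base + k] + c[base + k];
        }
        // Stage 2: consume the buffered values at k-1, k, k+1 (interior only).
        for (std::size_t k = 1; k + 1 < nk; ++k) {
            d[base + k] = buffer[k - 1] * e[0] + buffer[k] * e[1] + buffer[k + 1] * e[2];
        }
    }
}
```

Compared with a full temporary field, only `nk` values are kept at a time, while each sum `b + c` is still computed once instead of three times.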


Stencil Code – My Toy Example

1. Naive

for k
  a(k) := b(k) + c(k)
end
...
for k
  d(k) := a(k-1)*e(-1) +
          a(k)*e(0) +
          a(k+1)*e(+1)
end
...
for k
  f(k) := a(k)*g(k) + d(k)
end


Stencil Code – My Toy Example

2. No pre-calculation

for k
  d(k) := (b(k-1)+c(k-1))*e(-1) +
          (b(k)+c(k))*e(0) +
          (b(k+1)+c(k+1))*e(+1)
  f(k) := (b(k)+c(k))*g(k) + d(k)
end


Stencil Code – My Toy Example

3. Pre-calculation with temporary variables

x := b(k-1) + c(k-1)   (for the first k)
y := b(k) + c(k)       (for the first k)
for k
  z := b(k+1) + c(k+1)
  d(k) := x*e(-1) +
          y*e(0) +
          z*e(+1)
  f(k) := y*g(k) + d(k)
  x := y
  y := z
end
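As a cross-check, the naive variant and the variant with temporary variables can be written in C++ and compared. This is a sketch under assumptions: the names `Weights`, `naive`, and `rolling` are hypothetical, the loops cover interior points only, and the temporaries `x` and `y` are initialized before the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stencil weights e(-1), e(0), e(+1).
struct Weights {
    double m1, c0, p1;
};

// Variant 1 (naive): three loops and a full temporary field a.
void naive(const std::vector<double>& b, const std::vector<double>& c,
           const std::vector<double>& g, const Weights& e,
           std::vector<double>& d, std::vector<double>& f) {
    const std::size_t n = b.size();
    assert(c.size() == n && g.size() == n && d.size() == n && f.size() == n);
    std::vector<double> a(n);
    for (std::size_t k = 0; k < n; ++k) a[k] = b[k] + c[k];
    for (std::size_t k = 1; k + 1 < n; ++k)
        d[k] = a[k - 1] * e.m1 + a[k] * e.c0 + a[k + 1] * e.p1;
    for (std::size_t k = 1; k + 1 < n; ++k) f[k] = a[k] * g[k] + d[k];
}

// Variant 3: one loop; x, y, z hold a(k-1), a(k), a(k+1) and are shifted
// by one position at the end of every iteration.
void rolling(const std::vector<double>& b, const std::vector<double>& c,
             const std::vector<double>& g, const Weights& e,
             std::vector<double>& d, std::vector<double>& f) {
    const std::size_t n = b.size();
    assert(c.size() == n && g.size() == n && d.size() == n && f.size() == n);
    if (n < 3) return;  // no interior points
    double x = b[0] + c[0];
    double y = b[1] + c[1];
    for (std::size_t k = 1; k + 1 < n; ++k) {
        const double z = b[k + 1] + c[k + 1];
        d[k] = x * e.m1 + y * e.c0 + z * e.p1;
        f[k] = y * g[k] + d[k];
        x = y;
        y = z;
    }
}
```

The second function reads `b`, `c`, and `g` once, writes `d` and `f` once, and never stores the field `a`, which is where the memory-bandwidth saving comes from.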


Stencil Code – My Toy Example

4. Pre-calculation with column buffer

for k
  a(k) := b(k) + c(k)
end
for k
  d(k) := a(k-1)*e(-1) +
          a(k)*e(0) +
          a(k+1)*e(+1)
  f(k) := a(k)*g(k) + d(k)
end


Stencil Code – My Toy Example

5. Pre-calculation with stages & column buffer

Stencil
  Stage 1
    a := b + c
  Stage 2
    d := a*e (k:-1,0,1)
  Stage 3
    f := a*g + d
Apply Stencil


Status


Status

• So far the following stencils have been implemented:

– Fast wave solver (w bottom boundary initialization missing)

– Advection

• 5th order advection

• Bott 2 advection (cri implementation missing)

– Complete tendencies

– Horizontal Diffusion

– Coriolis

• The next steps are:

– Implicit vertical diffusion

– Put it all together

– Performance optimization


Discussion

Acknowledgements to all our collaborators at

•C2SM (Center for Climate Systems Modeling)

•MeteoSwiss

•DWD (Deutscher Wetterdienst)

•CSCS