
ERLANGEN REGIONAL

COMPUTING CENTER

Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein

Erlangen Regional Computing Center & Dept. of Computer Science

Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Performance Engineering for Stencil Updates

on Modern Processors

Partially funded by DFG Priority Programme 1648


Motivation

• There are so many
  • potential stencil structures (2D/3D, long-/short-range, …),
  • textbook optimizations (register / spatial / temporal blocking, …),
  • parameters for optimization (blocking / unrolling factor, …),
  • parameters for execution (OpenMP schedule, clock speed, #cores).

• Basic questions addressed by the ECM model
  • What is the bottleneck of my stencil implementation?
    Choose an appropriate optimization technique
  • What is the next bottleneck I will hit, and “how far” is it from the first?
    This yields the performance potential of the optimization
  • Impact of processor frequency and #cores used on performance

Performance Engineering for Stencil Kernels


Agenda

ECM model – basics

STREAM-like kernel

Why does a single core not get full BW?

Case study 1: 5-pt 2D stencil

Case study 2: Long range 3D stencil

Case study 3: 3D stencil with DIVide

Summary



Execution-Cache-Memory (ECM) model

Assumptions
  Single-core execution time is determined by
    in-core execution time and
    data delay in the memory hierarchy
  Scaling within a socket is linear until a relevant shared bottleneck is hit

Insights
  Single-core performance & socket scaling
  Relevant bottlenecks & performance impact

Input
  Similar to roofline (e.g. STREAM)
  Data transfer times of all memory levels

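[Figure: memory hierarchy (Registers, L1, L2 / L3, Memory). Arithmetic in the registers contributes the “core time”; transfers through the hierarchy contribute the “data delay”. Transfer times: 2 cy/CL and 2 cy/CL between the cache levels, 4.3 cy/CL (40 GB/s @ 2.7 GHz) for memory.]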
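The memory figure of 4.3 cy/CL follows directly from the bandwidth and the clock speed. A minimal sketch of that conversion, assuming a 64-byte cache line (the line size is not stated on the slide):

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size, not stated on the slide


def cycles_per_cache_line(bandwidth_bytes_per_s, clock_hz,
                          cl_bytes=CACHE_LINE_BYTES):
    """Cycles needed to move one cache line at a sustained bandwidth."""
    bytes_per_cycle = bandwidth_bytes_per_s / clock_hz
    return cl_bytes / bytes_per_cycle


# 40 GB/s at 2.7 GHz, as quoted for the memory interface
mem = cycles_per_cache_line(40e9, 2.7e9)
print(f"memory: {mem:.2f} cy/CL")
```

The same function turns any measured bandwidth (for example a full-socket STREAM result) into the per-cache-line transfer time the model takes as input.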



Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:) ] @ double precision
Analysis: Sandy Bridge core w/ AVX (unit of work: 1 Cache Line, 16 Flops/CL)

Machine characteristics:
  Registers – L1: 1 LD/cy + 0.5 ST/cy
  L1 – L2: 32 B/cy (2 cy/CL)
  Arithmetic: 1 ADD/cy + 1 MULT/cy

Triad analysis (per CL):
  Registers – L1: 6 cy/CL
  L1 – L2: 10 cy/CL (LD LD LD WA ST)
  Arithmetic: AVX: 2 cy/CL

Roofline prediction: 16/10 F/cy
Measurement: ≈ 16/17 F/cy

[Figure: timeline of the LD, ST/2, ADD, and MULT instructions for one cache line update; 16 Flops/CL (AVX)]
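The gap between the roofline prediction and the measurement can be reproduced with a few lines of arithmetic. This is a sketch under the assumption, stated later in the talk, that transfers between memory levels do not run concurrently, so the Registers–L1 and L1–L2 times add up; the variable names are illustrative only.

```python
FLOPS_PER_CL = 16   # 8 doubles per cache line, 1 ADD + 1 MULT each

# Cycle counts per cache line, taken from the triad analysis
t_reg_l1 = 6        # loads between registers and L1
t_l1_l2 = 5 * 2     # LD LD LD WA ST: 5 cache lines at 2 cy/CL
t_arith = 2         # AVX ADD and MULT

# Roofline: only the slowest single contribution counts
roofline = FLOPS_PER_CL / max(t_arith, t_reg_l1, t_l1_l2)

# ECM-style estimate: the two data transfer contributions are serialized
ecm = FLOPS_PER_CL / (t_reg_l1 + t_l1_l2)

measured = 16 / 17
print(f"roofline {roofline:.2f} F/cy, ECM {ecm:.2f} F/cy, "
      f"measured {measured:.2f} F/cy")
```

The serialized estimate of 16 cy per cache line is much closer to the measured 17 cy than the 10 cy implied by roofline.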


Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) with AVX

Data in   Model    Measurement
L1         6 cy     6.04 cy
L2        16 cy    17.2 cy
L3        26 cy    26.3 cy
MEM       50 cy    52.3 cy

[Figure: time in cycles for a single cache line update. Contributions: LD 6 cy, ST 4 cy, L2–L1 10 cy, L3–L2 10 cy, MEM–L3 24 cy. Legend: LOAD/STORE CL transfer, write-allocate CL transfer. Labels: hardware specification; full-socket STREAM BW.]
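The model column is simply the running sum of the per-level contributions. A short sketch that rebuilds it and compares it against the measurements, using only the numbers given for this example:

```python
from itertools import accumulate

# Per-cache-line contributions in cycles (LD, L2-L1, L3-L2, MEM-L3)
contributions = [("L1", 6), ("L2", 10), ("L3", 10), ("MEM", 24)]
measured = {"L1": 6.04, "L2": 17.2, "L3": 26.3, "MEM": 52.3}

levels = [name for name, _ in contributions]
# Data residing in level k pays for every transfer between k and the registers
model = dict(zip(levels, accumulate(cy for _, cy in contributions)))

for level in levels:
    err = (measured[level] - model[level]) / model[level] * 100
    print(f"{level:>3}: model {model[level]:2d} cy, "
          f"measured {measured[level]:5.2f} cy ({err:+.1f} %)")
```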


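Multicore scaling in the ECM model

Identify relevant (shared) bandwidth bottlenecks
  L3 cache
  Memory interface

Scale single-thread performance until the first bottleneck is hit:

  $P(t) = \min(t \cdot P_0, P_\mathrm{roof})$, with $P_\mathrm{roof} = \min(P_\mathrm{max}, I \cdot b_S)$

  where $t$ is the number of cores, $P_0$ the single-core performance, $I$ the computational intensity, and $b_S$ the saturated bandwidth of the shared bottleneck.

Sandy Bridge: scalable L3
Bottleneck ($P_\mathrm{roof}$): memory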
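The scaling rule is easy to evaluate. The sketch below applies it to the vector triad with data in memory, under two assumptions that are not spelled out on the slide: the single-core time is the 50 cy per cache line from the model table, and the memory roof corresponds to the 24 cy memory transfer time per cache line. The 8-core socket size is also an assumption.

```python
import math


def ecm_scaling(n_cores, p0, p_roof):
    """P(t) = min(t * P0, P_roof) for t = 1 .. n_cores."""
    return [min(t * p0, p_roof) for t in range(1, n_cores + 1)]


FLOPS_PER_CL = 16
p0 = FLOPS_PER_CL / 50       # F/cy on one core (assumed: 50 cy per CL)
p_roof = FLOPS_PER_CL / 24   # F/cy at saturation (assumed: 24 cy per CL)

perf = ecm_scaling(8, p0, p_roof)     # assumed 8-core socket
n_sat = math.ceil(p_roof / p0)        # first core count that reaches the roof

for t, p in enumerate(perf, start=1):
    print(f"{t} cores: {p:.2f} F/cy")
print("saturation at", n_sat, "cores")
```

With these inputs the curve is linear for two cores and flat from the third core on.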



Saturation point (# cores) well predicted

Measurement: scaling not perfect

Single core performance accurately

predicted

ECM model works well for this

architecture and this benchmark!

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)


Intel Sandy Bridge EP

Only “measurement”: full-socket STREAM BW


Case study 1

ECM modelling for stencil structures:

2D Jacobi (double precision) with SSE2 on SNB



Case study 1: 2D Jacobi in DP with SSE2 on SNB

Instruction count

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

4-way unrolling: 8 LUP per iteration
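The kernel itself is not reproduced in this text. As a plain reference for what one lattice site update (LUP) computes, here is a minimal, unoptimized sketch of a 5-point Jacobi sweep; the function name and the list-of-lists grid are illustrative choices, not the authors' code. Each update costs 3 ADD and 1 MUL, which with two doubles per SSE register matches the 12 ADD and 4 MUL counted per 8 LUP.

```python
def jacobi_sweep(src):
    """One 5-point Jacobi sweep over the interior of a 2D grid."""
    ny, nx = len(src), len(src[0])
    dst = [row[:] for row in src]          # boundary values are copied as-is
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # 3 ADD + 1 MUL per lattice site update
            dst[j][i] = (src[j][i - 1] + src[j][i + 1]
                         + src[j - 1][i] + src[j + 1][i]) * 0.25
    return dst


# A linear field is a fixed point of the sweep
grid = [[float(i + j) for i in range(4)] for j in range(4)]
print(jacobi_sweep(grid) == grid)
```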




Code characteristics

(SSE instructions per CL)

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Machine characteristics

(SSE instructions per cycle)

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

- Clock speed: 3.5 GHz

Port utilization in cycles (CPU ports):
  LOAD:  LD LD LD LD 2LD 2LD 2LD 2LD L
  STORE: ST ST ST ST
  ADD:   + + + + + + + + + + + +
  MUL:   * * * *

Core execution: 12 cycles
Bottleneck: ADD
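The 12-cycle core time follows from dividing the instruction counts by the per-cycle throughput of each port. A small sketch of that bookkeeping; pairing every store with one load in the same cycle is an assumption about how the load/store ports are scheduled.

```python
# SSE instruction counts per 8 LUP (code characteristics)
counts = {"LOAD": 13, "STORE": 4, "ADD": 12, "MUL": 4}

t_add = counts["ADD"] / 1    # 1 ADD per cycle
t_mul = counts["MUL"] / 1    # 1 MUL per cycle

# Load/store ports: each cycle issues either 2 LOADs or 1 LOAD + 1 STORE.
paired = min(counts["LOAD"], counts["STORE"])   # cycles with LOAD + STORE
t_ldst = (paired
          + (counts["LOAD"] - paired) / 2       # remaining loads, 2 per cycle
          + (counts["STORE"] - paired))         # remaining stores, 1 per cycle

times = {"ADD": t_add, "MUL": t_mul, "LD/ST": t_ldst}
bottleneck = max(times, key=times.get)
t_core = times[bottleneck]
print(times, "->", bottleneck, t_core)
```

The load/store ports are busy for 8.5 cycles, the multiplier for 4, and the adder for 12, so ADD sets the core execution time.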




Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache

3 transfer streams from L2 to L1

(LD T0 + LD/ST T1)

ECM prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.6 GLUP/s

Measurement: 1.9 GLUP/s

What's wrong?

[Figure: timeline with 12 cy core execution followed by 6 cy L2–L1 transfer]




Port utilization in cycles:
  LOAD:  LD LD LD LD 2LD 2LD 2LD 2LD L
  STORE: ST ST ST ST
  ADD:   + + + + + + + + + + + +
  MUL:   * * * *

Core execution: 12 cycles
Register – L1: 8.5 cycles
L1 – L2: 6 cycles

L1 / L2 transfer overlaps with arithmetic

ECM prediction:
(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s
Measurement: 1.9 GLUP/s

“If the model fails, we learn something”

No concurrent data transfer between memory levels
Data transfer may overlap with other in-core instructions
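One way to read the two rules is that the data delay adds to the load/store cycles but runs concurrently with arithmetic. The sketch below encodes that reading as a maximum over the two paths; the function and its parameter names are illustrative, not taken from the talk.

```python
CLOCK_GHZ = 3.5
LUPS = 8    # lattice site updates per unit of work


def ecm_glups(t_arith, t_ldst, t_data):
    """GLUP/s when t_data serializes with load/store cycles only."""
    cycles = max(t_arith, t_ldst + t_data)
    return LUPS / cycles * CLOCK_GHZ


in_l1 = ecm_glups(12, 8.5, 0)                # data in L1: arithmetic bound
in_l2_naive = LUPS / (12 + 6) * CLOCK_GHZ    # first attempt: 12 cy + 6 cy
in_l2 = ecm_glups(12, 8.5, 6)                # refined: 8.5 cy + 6 cy

print(f"L1: {in_l1:.2f}, L2 naive: {in_l2_naive:.2f}, "
      f"L2 refined: {in_l2:.2f} GLUP/s")
```

The refined L2 estimate reproduces the measured 1.9 GLUP/s, while the L1 case stays at 12 cycles because the adder remains the limit there.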



ECM modelling for stencil structures: Long range 3D 25-pt stencil

Background:

Three-dimensional seismic simulations in the oil industry

Finite-Difference-Time-Domain (FDTD) discretization

Stencil operator is

Isotropic (symmetric),

With constant coefficients,

2nd order in time

8th order in space


Innermost loop shown only

Collaboration with

D. Keyes & T. Malas

(KAUST)
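The innermost loop is not reproduced in this text. As a rough illustration of such an operator, the sketch below assumes the common update form U_new = 2·U − U_old + V·ΔU with the standard 8th-order central-difference weights on a unit grid. The function names, the callable-based grid access, and the exact update form are assumptions, not the code from the collaboration.

```python
from fractions import Fraction as F

# Standard 8th-order central-difference weights for a second derivative
# (unit grid spacing); the index is the distance from the centre point.
C = [F(-205, 72), F(8, 5), F(-1, 5), F(8, 315), F(-1, 560)]


def laplacian_25pt(u, i, j, k):
    """Isotropic 25-point stencil applied to u, a callable u(i, j, k)."""
    acc = 3 * C[0] * u(i, j, k)              # centre point, once per axis
    for r in range(1, 5):                    # 4 neighbours in each of 6 directions
        acc += C[r] * (u(i + r, j, k) + u(i - r, j, k)
                       + u(i, j + r, k) + u(i, j - r, k)
                       + u(i, j, k + r) + u(i, j, k - r))
    return acc


def wave_update(u_now, u_old, v, i, j, k):
    """One update at a single grid point, 2nd order in time (assumed form)."""
    return (2 * u_now(i, j, k) - u_old(i, j, k)
            + v(i, j, k) * laplacian_25pt(u_now, i, j, k))


quadratic = lambda i, j, k: i * i + j * j + k * k
print(laplacian_25pt(quadratic, 5, 6, 7))    # Laplacian of x^2 + y^2 + z^2
```

The centre point plus 4 neighbours in each of the 6 axis directions gives the 25 points of the stencil.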



Data delay below L1 is easy to calculate, assuming spatial blocking: 65 cy/CL

Data transfer L1 – register for V depends on the compiler

How to calculate / estimate the core execution time and the actual loads / stores?

Use the Intel Architecture Code Analyzer (IACA)

[Figure: IACA output annotated with the core execution time, the LD cycles (L1 to Reg.), and the core bottleneck]



Unit of work: 1 CL = 16 LUP (sp)
IACA throughput: 34.1 cy / 8 LUP

Core execution (non-LD/ST cycles):
  ADD: 52 cy
  MULT: 39 cy
  Reg-Reg transfers: 48 cy
  FrontEnd stalls overlap: 2 * (34.1 - 32) = 4.2 cy

Stores: 4 cy

Data delay:
  L1–REG (Load): 64 cy
  L2–L1: 24 cy
  L3–L2: 24 cy
  MEM–L3: 17 cy
  Total: 129 cy

Single-core performance (ECM):
2.7 GHz / (129 cy / 16 LUP) = 335 MLUP/s
Measurement: 320 MLUP/s
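The single-core prediction can be checked by summing the data delay contributions and converting cycles per cache line into updates per second. A short sketch using only the numbers quoted for this case study; the dictionary keys are illustrative labels.

```python
CLOCK_HZ = 2.7e9
LUP_PER_CL = 16    # single precision: 16 values per cache line

# Data delay contributions in cycles per cache line
data_delay = {"L1-REG": 64, "L2-L1": 24, "L3-L2": 24, "MEM-L3": 17}

below_l1 = sum(cy for name, cy in data_delay.items() if name != "L1-REG")
t_ecm = sum(data_delay.values())

# IACA reports 34.1 cy per 8 LUP, i.e. two such blocks per cache line
frontend_stall = 2 * (34.1 - 32)

mlups = CLOCK_HZ / (t_ecm / LUP_PER_CL) / 1e6
print(f"below L1: {below_l1} cy, total: {t_ecm} cy, "
      f"prediction: {mlups:.0f} MLUP/s")
```

The prediction lands within about 5 % of the measured 320 MLUP/s.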



Case study 3: 3D stencil with DIVide

ECM modelling for stencil structures: Impact of complex arithmetic

Background:

Three-dimensional Seismic simulations

Anelastic Wave Propagation code

(http://hpgeoc.sdsc.edu/personnel.html)

Finite-Difference-Time-Domain (FDTD) discretization


Collaboration with

O. Schenk

(USI Lugano)

DIV operation


Case study 3: 3D stencil with DIV

Data delay:
  L1–REG: 38 cy
  L2–L1: 20 cy
  L3–L2: 20 cy
  MEM–L3: 26 cy
  Total: 104 cy

Double precision core execution: DIVider, 84 cy
Single precision core execution: FrontEnd stalls

Single-core performance (SP), ECM (104 cy): 415 MLUP/s
Single-core performance (DP), ECM (104 cy): 208 MLUP/s

Compiler avoids DIV instruction
DIV becomes bottleneck for temporal blocking!
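Both predictions come from the same 104 cy per cache line; only the number of updates per cache line differs between single and double precision. A minimal sketch, assuming a 64-byte cache line (16 SP or 8 DP values) and the 2.7 GHz clock used in the previous case study:

```python
CLOCK_HZ = 2.7e9   # assumption: same clock as in the previous case study

data_delay = {"L1-REG": 38, "L2-L1": 20, "L3-L2": 20, "MEM-L3": 26}
t_ecm = sum(data_delay.values())


def mlups(lup_per_cl):
    """Predicted MLUP/s for a given number of updates per cache line."""
    return CLOCK_HZ * lup_per_cl / t_ecm / 1e6


sp = mlups(16)     # single precision: 16 values per 64-byte cache line
dp = mlups(8)      # double precision: 8 values per 64-byte cache line

t_div_dp = 84      # divider cycles per cache line in double precision
div_limits_now = t_div_dp > t_ecm

print(f"SP {sp:.0f} MLUP/s, DP {dp:.0f} MLUP/s, "
      f"DIV limiting: {div_limits_now}")
```

At 84 cy the divider stays below the 104 cy data delay, so it does not limit performance until an optimization such as temporal blocking shrinks the data delay.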



Conclusions

ECM allows a good estimate/prediction of single-core performance and socket scalability for “streaming kernels”

ECM reveals computational/hardware bottlenecks and quantifies their impact on performance

ECM allows quantifying the impact of clock speed or vectorization

ECM could be integrated into autotuning frameworks or DSLs

Work is funded by DFG Priority Programme 1648 and the Bavarian Network for HPC


High Performance Computing

is Computing at a

Bottleneck!*

*Georg Hager


References – ECM model basics & applications

1. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (accepted), DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908

2. M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations. Submitted. Preprint: arXiv:1304.7664

3. J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering. Proc. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par 2012. Euro-Par 2012: Parallel Processing Workshops, LNCS 7640, 451-460 (2013), Springer, ISBN 978-3-642-36948-3. DOI: 10.1007/978-3-642-36949-0_50. Preprint: arXiv:1206.3738

