the ecm (execution-cache-memory) performance model · the ecm (execution-cache-memory) performance...

40
The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop “Memory issues on Multi- and Manycore Platforms” at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science Volume 6067, 2010, pp 615-624. DOI: 10.1007/978-3-642-14390-8_64. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908 H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: arXiv:1410.5010

Upload: others

Post on 09-Oct-2019

57 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

The ECM (Execution-Cache-Memory)

Performance Model

J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop “Memory issues on Multi- and Manycore Platforms” at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science Volume 6067, 2010, pp 615-624. DOI: 10.1007/978-3-642-14390-8_64. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908 H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: arXiv:1410.5010

Page 2: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Assumptions and shortcomings of the roofline model

Assumes one of two bottlenecks

1. In-core execution

2. Bandwidth of a single hierarchy level

Latency effects are not modeled pure data streaming assumed

In-core execution is sometimes hard to

model

Saturation effects in multicore

chips are not explained

ECM model gives more insight

A(:)=B(:)+C(:)*D(:)

Roofline predicts full socket BW

ECM model (c) RRZE 2014 2

Page 3: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

The Execution-Cache-Memory (ECM)

model

Page 4: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM Model

ECM = “Execution-Cache-Memory”

Observations:

Single-core execution time is not the maximum of 1. In-core execution

2. Data transfers through a single bottleneck

Data transfers may or may not overlap with each

other or with in-core execution

Scaling is linear until the relevant bottleneck is

reached

ECM model Input:

Same as for Roofline

+ data transfer times in hierarchy

ECM model (c) RRZE 2014 4

Page 5: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision

Analysis for Sandy Bridge core w/ AVX (unit of work: 1 cache line)

ECM model

1 LD/cy + 0.5 ST/cy

Registers

L1

L2

32 B/cy (2 cy/CL)

Machine characteristics:

Arithmetic: 1 ADD/cy+ 1 MULT/cy

Registers

L1

L2

Triad analysis (per CL):

6 cy/CL

10 cy/CL

Arithmetic: AVX: 2 cy/CL

LD LD ST/2

LD ST/2 LD LD

ST/2 LD

ST/2

LD

ADD MULT

ADD MULT

LD LD WA ST

Roofline prediction: 16/10 F/cy

Timeline:

16 F/CL (AVX)

Measurement: 16F / ≈17cy

(c) RRZE 2014 5

Page 6: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX

ECM model

CL transfer

Write-allocate CL transfer

(c) RRZE 2014 6

Page 7: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Testing different overlap hypotheses

ECM model

Results suggest no overlap!

(c) RRZE 2014 7

Page 8: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Multicore scaling in the ECM model

Identify relevant bandwidth bottlenecks

L3 cache

Memory interface

Scale single-thread performance until first bottleneck is hit:

ECM model

𝑛 threads: 𝑃 𝑛 = min(𝑛𝑃0, 𝐼 ∙ 𝑏𝑆 )

. . . Example: Scalable L3

on Sandy Bridge

(c) RRZE 2014 8

Page 9: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)

on a Sandy Bridge socket (no-overlap assumption)

Model: Scales until saturation

sets in

Saturation point (# cores) well

predicted

Measurement: scaling not perfect

Caveat: This is specific for this

architecture and this benchmark!

Check: Use “overlappable” kernel

code

ECM model (c) RRZE 2014 9

Page 10: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:)

on a Sandy Bridge socket (full overlap assumption)

ECM model

In-core execution is dominated by

divide operation

(44 cycles with AVX, 22 scalar)

Almost perfect agreement with

ECM model

General observation:

If the L1 cache is 100% occupied

by LD, there is no overlap

throughout the hierarchy

If there is “slack” at the L1, there is

overlap in the hierarchy

(c) RRZE 2014 10

Page 11: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 1: A 2D Jacobi stencil in DP with SSE2 on Sandy Bridge

ECM model (c) RRZE 2014 11

Page 12: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 1: 2D Jacobi in DP with SSE2 on SNB

ECM model

Instruction count - 13 LOAD - 4 STORE - 12 ADD - 4 MUL

4-way unrolling 8 LUP / iteration

(c) RRZE 2014 12

Page 13: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 1: 2D Jacobi in DP with SSE2 on SNB

ECM model

Code characteristics

(SSE instructions per iteration)

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Processor characteristics

(SSE instructions per cycle)

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

LD LD LD LD 2LD 2LD 2LD 2LD L

ST ST ST ST

+ + + + + + + + + + + +

* * * * core

execution:

12 cy

(c) RRZE 2014 13

Page 14: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 1: 2D Jacobi in DP with SSE2 on SNB

ECM model

Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache (not into L1)

3 additional transfer streams from L2 to L1 (data delay)

Prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s

Measurement: 1.9 GLUP/s

Overlap?

12 cy

6 cy t0 RFO t1

(c) RRZE 2014 14

Page 15: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 1: 2D Jacobi in DP with SSE2 on SNB

ECM model

LD LD LD LD 2LD 2LD 2LD 2LD L

ST ST ST ST

+ + + + + + + + + + + +

* * * *

core execution: 12 cycles

ECM prediction w/ overlap:

(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s

Measurement: 1.9 GLUP/s

L1 „single ported“ no overlap during LD/ST

L2 delay: 6 cycles

12 cy

6 cy RFO t0 t1

“If the model fails, we learn something”

(c) RRZE 2014 15

LOAD bottleneck:

8.5 cy

Page 16: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM model – the rules

1. LOADs in the L1 cache do not overlap with any other data transfer in the memory hierarchy

2. Everything else in the core overlaps perfectly with data transfers

3. The scaling limit is set by the ratio of

# cycles per CL overall

# cycles per CL at the bottleneck

4. The Roofline Model is recovered when assuming full overlap of all contributions

ECM model

LOAD

L2-L1

L3-L2

MEM-L3

STORE

ADD MULT …

tim

e [c

y]

6 cy

9 cy

9 cy

19 cy

Example:

Single-core (data in L1): 8 cy (ADD)

Single-core (data in memory):

6+9+9+19 cy = 43 cy

Scaling limit: 43 / 19 = 2.3 cores

8 cy 3 cy 43 cy 4 cy

(c) RRZE 2014 16

Page 17: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Core time = overlapping and non-overlapping contributions

ECM prediction = maximum of overlapping time and sum of all other

contributions

Convenient shorthand notation for contributions:

Example from prev. slide:

Predictions for data in different memory hierarchy levels:

Experimental data (measured) notation:

Saturation assumption for memory bottleneck:

ECM model – notation

(c) RRZE 2014 17 ECM model

Page 18: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM Model for DAXPY (AVX) on SNB 2.7 GHz (phinally)

Loop:

Contributions:

Predictions:

(c) RRZE 2014 18 ECM model

Page 19: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM Model and measurements for array sum on SNB 2.7 GHz

(phinally)

Loop:

Naive = scalar, no unrolling (full 3 cy penalty per ADD)

(c) RRZE 2014 19 ECM model

Page 20: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM Model and measurements for 2D Jacobi (AVX)

on SNB 2.7 GHz (phinally)

Loop:

LC = layer condition satisfied in

(c) RRZE 2014 20 ECM model

Page 21: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Jacobi 2D impact of inner loop blocking on SNB (phinally)

(c) RRZE 2014 21 ECM model

ECM

Page 22: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Jacobi 2D: Why outer loop blocking?

(c) RRZE 2014 22 ECM model

Extra data prefetched from memory at block

boundaries

Page 23: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Kahan dot product

Page 24: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Kahan dot product

Goal: Compute large sums (many operands) with controlled numerical

error

(c) RRZE 2014 24 ECM model

__attribute__((optimize("no-tree-vectorize")))

void ddot_kahan_scalar_comp(

int N, const double* a, const double* b, double* r)

{

int i;

double sum = 0.0;

double c = 0.0;

for (i=0; i<N; ++i) {

double prod = a[i]*b[i];

double y = prod-c;

double t = sum+y;

c = (t-sum)-y;

sum = t;

}

(*r) = sum;

}

Page 25: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example (from Wikipedia)

6-digit FP, initial sum = 10000.0, adding 3.14159 and 2.71828

(c) RRZE 2014 25 ECM model

y = 3.14159 - 0 y = input[i] - c

t = 10000.0 + 3.14159

= 10003.1 Many digits have been lost!

c = (10003.1 - 10000.0) - 3.14159 This must be evaluated as written!

= 3.10000 - 3.14159 Assimilated part of y recovered, vs. full y.

= -.0415900

sum = 10003.1 Inaccurate result

On the next step, c gives the error.

y = 2.71828 - -.0415900 Shortfall from previous stage included.

= 2.75987 It is of a size similar to y: most digits meet.

t = 10003.1 + 2.75987 But few meet the digits of sum.

= 10005.85987, rounds to 10005.9

c = (10005.9 - 10003.1) - 2.75987 This extracts whatever went in.

= 2.80000 - 2.75987 In this case, too much.

= .040130 The excess would be subtracted off next time.

sum = 10005.9 Exact result is 10005.85987,

this is correctly rounded to 6 digits.

Page 26: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM Model and measurements on Emmy

(IVB 2.2 GHz, 3 cy/CL from memory)

Standard DP ddot:

Scalar:

AVX:

Kahan ddot:

Scalar:

AVX:

Conclusion: DP Kahan ddot saturates even in scalar mode

SP Kahan will not saturate

(c) RRZE 2014 26 ECM model

Page 27: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Performance Modeling of Stencil Codes

Applying the ECM model to stencil updates:

- 3D Jacobi smoother (DP, AVX)

- Long-range stencil (SP, AVX)

(H. Stengel, RRZE)

Page 28: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 2: A 3D Jacobi smoother

with AVX vectorization

on an Intel Ivy Bridge processor

ECM model (c) RRZE 2014 28

Page 29: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Jacobi 3D Manual Analysis

Cycle Count

(4x unroll + AVX = 16 LUP)

MUL 4

ADD 20

LOAD 24

STORE 8

ECM model

Operation Count

(1 LUP)

MUL 1

ADD 5

LOAD 6

STORE 1

(c) RRZE 2014 29

Page 30: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Interlude: Intel Architecture Code Analyzer (IACA)

Performs architecture-specific code analysis

Prerequisite: Mark start and end of dominant work loop

In high-level code (documented)

In assembly code (see iacaMarks.h)

Does not influence code optimization (e.g. vectorization)

Assembly loop might perform multiple updates per iteration (unrolling, SIMD)

Important reports (throughput mode):

Block throughput: runtime of one loop iteration ( core-time)

Throughput bottleneck: limiting resource for code execution

Port pressure: dominant pipeline port

ECM model (c) RRZE 2014 30

Page 31: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

16 updates (4x unroll + AVX) = 2 cache lines per loop iteration #pragma vector aligned

ECM model (c) RRZE 2014 31

Page 32: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Jacobi 3D ECM

ECM model

Non-LD/ST time Data transfers

L1-R

EG

(LD

12

cy)

L2-L

1

(10

cy)

L3

-L2

(1

0cy

)

M-L

3

(1

2cy

)

44cy

FrontEnd stalls

0.5*(24.1 - 24) =0.05cy

AD

D

(1

0cy

)

Stores (4cy)

Times [cy] for 8 LUP (DP) = 1 CL update = 0.5 loop iterations (ASM) = 0.5 * IACA output

Single-core performance 3.0GHz / (44cy/ 8LUP) = 545MLUP/s Measurement (N=400): 542MLUP/s (~44cy)

IACA throughput: 24.1cy/16LUP

MUL (2cy) Reg-Reg

(6cy)

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz Memory Bandwidth 47 GB/s

#pragma vector aligned

(c) RRZE 2014 32

Page 33: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Socket Scaling

ECM model

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz Memory Bandwidth 47 GB/s

(c) RRZE 2014 34

Page 34: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 3: 3D long-range stencil in single precision

with AVX on Sandy Bridge

ECM model (c) RRZE 2014 35

Page 35: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 3: 3D long-range stencil in SP with AVX on SNB

Core execution

4 neighbors per direction

Operations per update (code)

27 LOAD (25 V, 1 ROC, 1 U)

1 STORE (U)

26 ADD

15 MUL

Core time &

actual LOAD count

IACA

ECM model

Collaboration with D. Keyes & T. Malas

(KAUST)

(c) RRZE 2014 36

Page 36: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

IACA example output – Core execution

ECM model

AVX vectorization, no unrolling: One iteration updates 8 SP (float) elements Multiply all numbers by 2X to get time for updating 1 CacheLine (16 floats)

128 Bit Loads

Data transfer: LOAD ports REG – L1: 2*30.5 cy = 61 cy

Core Execution time (16 LUP) = 2*34.25 cy = 68.5 cy

(c) RRZE 2014 37

Page 37: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 3: Data delay

Problem size: 2603 (single precision) – cy/CL

Spatial blocking Layer condition at L3 and row condition in L1: OK

ECM model

61 cy

From IACA analysis

Minimum data transfer to main memory: 4 WORD/LUP (LD: U,V,ROC – ST:U)

MemBW=40 GB/s

17 cy

8 LOADS to V can be served directly by L3 cache + 1 from main memory

24 cy

24 cy

(c) RRZE 2014 38

Page 38: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Example 3: Putting it all together

Core execution (Non-LD/ST cycles) Data delay

L1-R

EG (

Load

) 6

1 c

y L2

-L1

2

4 c

y L3

-L2

2

4 c

y

M-L

3

17

cy

12

6 c

y

FrontEnd stalls overlap: (68.5-61) cy =7.5cy

AD

D

52

cy MU

LT

38

cy

Re

g-R

eg

tran

sfe

rs

48

cy

Stores 4cy

Single-core performance (ECM Model) 2.7GHz / (126cy / 16LUP) = 343 MLUP/s

Measurement: 320 MLUP/s

IACA throughput 68.5 cy / CL (sp)

ECM model (c) RRZE 2014 39

optimization target!

temporal blocking useless!

Page 39: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

Socket scaling

ECM model

memory bandwidth limit

(c) RRZE 2014 41

Page 40: The ECM (Execution-Cache-Memory) Performance Model · The ECM (Execution-Cache-Memory) Performance Model J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited

ECM model: Conclusions & outlook

Saturation effects are ubiquitous; understanding them gives us

opportunity to

Find out about optimization opportunities

Save energy by letting cores idle see power model later on

Putting idle cores to better use communication, functional decomposition

Simple models work best. Do not try to complicate things unless it

is really necessary!

Possible extensions to the ECM model

Accommodate latency effects

Model simple “architectural hazards”

ECM model (c) RRZE 2014 42