Multicore Scaling: The ECM Model


Post on 01-Dec-2021


Page 1: Multicore Scaling: The ECM Model

Multicore Scaling:

The ECM Model

Single-core performance prediction

The saturation point

Three stencil code examples

Page 2: Multicore Scaling: The ECM Model

Assumptions and shortcomings of the roofline model

Assumes one of two bottlenecks

1. In-core execution

2. Bandwidth of a single hierarchy level

Latency effects are not modeled: pure data streaming is assumed

In-core execution is sometimes hard to model

Saturation effects in multicore chips are not explained

ECM model gives more insight

A(:)=B(:)+C(:)*D(:)

Roofline predicts full socket BW

(c) RRZE 2014 ECM Model 2

Page 3: Multicore Scaling: The ECM Model

The Execution-Cache-Memory (ECM) model

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908

Page 4: Multicore Scaling: The ECM Model

ECM Model

ECM = “Execution-Cache-Memory”

Assumptions:

Single-core execution time is composed of

1. In-core execution

2. Data transfers in the memory hierarchy

Data transfers may or may not overlap with each other or with in-core execution

Scaling is linear until the relevant bottleneck is reached

Input:

Same as for Roofline, plus data transfer times in the hierarchy


Page 5: Multicore Scaling: The ECM Model

Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision

Analysis for Sandy Bridge core w/ AVX (unit of work: 1 cache line)

Machine characteristics:

Registers ↔ L1: 1 LD/cy + 0.5 ST/cy

L1 ↔ L2: 32 B/cy (2 cy/CL)

Arithmetic: 1 ADD/cy + 1 MULT/cy

Triad analysis (per CL):

Registers ↔ L1: 6 cy/CL

L1 ↔ L2: 10 cy/CL

Arithmetic: AVX: 2 cy/CL

Work: 16 F/CL (AVX)

[Timeline figure: LD, ST/2, ADD, and MULT slots in the core; LD, WA, and ST cache line transfers between L1 and L2]

Roofline prediction: 16/10 F/cy

Measurement: 16 F / ≈17 cy
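The per-cache-line numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not part of the original slides: the variable names are made up, a 64-byte cache line is assumed, and the L1 ↔ L2 traffic is assumed to consist of three loaded lines (B, C, D) plus one write-allocate and one store for A. The final no-overlap sum is the reading suggested by the measurement of about 17 cy.

```python
# Hypothetical sketch: per-cache-line cycle accounting for
# A(:) = B(:) + C(:) * D(:) in double precision on Sandy Bridge with AVX.
CL_BYTES = 64    # assumed cache line size
DP_BYTES = 8     # one double precision element
AVX_BYTES = 32   # one AVX register

elems_per_cl = CL_BYTES // DP_BYTES    # 8 updates per cache line
avx_per_cl = CL_BYTES // AVX_BYTES     # 2 AVX instructions of each kind per CL

# Registers <-> L1: 1 LD/cy and 0.5 ST/cy, issued concurrently.
loads = 3 * avx_per_cl                 # B, C, D
stores = 1 * avx_per_cl                # A
t_l1 = max(loads / 1.0, stores / 0.5)  # 6 cy/CL

# Arithmetic: 1 ADD/cy + 1 MULT/cy, issued concurrently.
t_arith = max(avx_per_cl / 1.0, avx_per_cl / 1.0)  # 2 cy/CL

# L1 <-> L2: 32 B/cy, so 2 cy per cache line moved.
cls_moved = 3 + 1 + 1                  # 3 loads, 1 write-allocate, 1 store
t_l2 = cls_moved * (CL_BYTES / 32)     # 10 cy/CL

flops_per_cl = 2 * elems_per_cl        # one ADD and one MULT per element

roofline_flops_per_cy = flops_per_cl / t_l2   # 16/10 F/cy
no_overlap_cycles = t_l1 + t_l2               # 16 cy, close to the measured 17

print(t_l1, t_arith, t_l2, flops_per_cl, roofline_flops_per_cy, no_overlap_cycles)
```

The gap between the Roofline estimate (10 cy per cache line) and the measurement (about 17 cy) is what motivates adding the in-L1 time to the transfer time instead of taking only the slower of the two.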

Page 6: Multicore Scaling: The ECM Model

Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX


CL transfer

Write-allocate CL transfer

Page 7: Multicore Scaling: The ECM Model

Full vs. partial vs. no overlap


Results suggest no overlap!

Page 8: Multicore Scaling: The ECM Model

Multicore scaling in the ECM model

Identify relevant bandwidth bottlenecks

L3 cache

Memory interface

Scale single-thread performance until first bottleneck is hit:


$P(t) = \min(t\,P_0,\ P_\mathrm{roof})$, with $P_\mathrm{roof} = \min(P_\mathrm{max},\ I \cdot b_S)$

Example: Scalable L3 on Sandy Bridge
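The scaling rule above is short enough to write down directly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, `p_max` defaults to infinity because the slides do not give a value for it, and the example numbers (322 MLUP/s single-core performance, 40 GB/s memory bandwidth, 16 B/LUP) are taken from the long-range stencil example later in this deck.

```python
import math

def ecm_scaling(n_cores, p0, i_times_bs, p_max=math.inf):
    """P(t) = min(t * P0, P_roof) with P_roof = min(P_max, I * b_S).

    p0         : single-core performance
    i_times_bs : bandwidth-bound performance limit I * b_S (same unit as p0)
    p_max      : in-core peak limit; unknown here, so infinite by default
    """
    p_roof = min(p_max, i_times_bs)
    return min(n_cores * p0, p_roof)

def saturation_point(p0, i_times_bs, p_max=math.inf):
    """Smallest core count at which the linear scaling hits the roof."""
    p_roof = min(p_max, i_times_bs)
    return math.ceil(p_roof / p0)

# Numbers from the long-range stencil slide: 40 GB/s at 16 B/LUP.
P0_MLUPS = 322.0
ROOF_MLUPS = 40e9 / 16 / 1e6    # 2500 MLUP/s

for t in range(1, 9):
    print(t, ecm_scaling(t, P0_MLUPS, ROOF_MLUPS))
print("saturation at", saturation_point(P0_MLUPS, ROOF_MLUPS), "cores")
```

With these inputs the linear part reaches the bandwidth roof only at the eighth core, which is consistent with the "Saturation at 8 cores" statement on the stencil slide.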

Page 9: Multicore Scaling: The ECM Model

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)

on a Sandy Bridge socket (no-overlap assumption)

Model: Scales until saturation sets in

Saturation point (# cores) well predicted

Measurement: scaling not perfect

Caveat: This is specific for this architecture and this benchmark!

Check: Use “overlappable” kernel code

Page 10: Multicore Scaling: The ECM Model

ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:)

on a Sandy Bridge socket (full overlap assumption)

In-core execution is dominated by divide operation (44 cycles with AVX, 22 scalar)

Almost perfect agreement with ECM model

General observation:

If the L1 cache is 100% occupied by LD/ST, there is no overlap throughout the hierarchy

If there is “slack” at the L1, there is overlap in the hierarchy

Page 11: Multicore Scaling: The ECM Model

Performance Modeling of Stencil Codes

Applying the ECM model to stencil updates:

- 2D Jacobi smoother

- Long-range stencil

(H. Stengel, RRZE)

Page 12: Multicore Scaling: The ECM Model

Example 1: 2D Jacobi in DP with SSE2 on Sandy Bridge


Page 13: Multicore Scaling: The ECM Model

Example 1: 2D Jacobi in DP with SSE2 on SNB

Instruction count:

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

4-way unrolling → 8 LUP / iteration

Page 14: Multicore Scaling: The ECM Model

Example 1: 2D Jacobi in DP with SSE2 on SNB

Code characteristics (SSE instructions per iteration):

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Processor characteristics (SSE instructions per cycle):

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

Port schedule (one entry per cycle):

LOAD: LD LD LD LD 2LD 2LD 2LD 2LD LD

STORE: ST ST ST ST

ADD: + + + + + + + + + + + +

MUL: * * * *

core execution: 12 cy
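The 12 cy core time follows from the instruction counts and the per-cycle issue limits listed above. The sketch below is one way to reproduce it; the function name and the scheduling rule (pair each STORE with one LOAD, then issue the remaining LOADs two per cycle) are assumptions made for illustration, not something the slides spell out.

```python
def load_store_cycles(n_load, n_store):
    """Cycles needed on the load/store ports, assuming each cycle issues
    either 2 LOADs or 1 LOAD + 1 STORE (hypothetical scheduling rule)."""
    paired = min(n_load, n_store)     # cycles issuing 1 LOAD + 1 STORE
    stores_alone = n_store - paired   # stores with no load left to pair
    loads_left = n_load - paired      # remaining loads go two per cycle
    return paired + stores_alone + loads_left / 2

# SSE instruction counts per iteration (8 LUP) from the slide.
N_LOAD, N_STORE, N_ADD, N_MUL = 13, 4, 12, 4

t_ldst = load_store_cycles(N_LOAD, N_STORE)  # load/store port time
t_add = N_ADD / 1                            # 1 ADD per cycle
t_mul = N_MUL / 1                            # 1 MUL per cycle

# The busiest resource sets the core execution time.
t_core = max(t_ldst, t_add, t_mul)

print(t_ldst, t_add, t_mul, t_core)
```

Under this rule the load/store ports are busy for 8.5 cy, so the ADD port with 12 cy is the in-core bottleneck.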

Page 15: Multicore Scaling: The ECM Model

Example 1: 2D Jacobi in DP with SSE2 on SNB


Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache (not into L1)

3 additional transfer streams from L2 to L1 (data delay)

ECM prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s

Measurement: 1.9 GLUP/s

Overlap?

[Timeline figure: 12 cy core execution, 6 cy data delay (streams t0, RFO, t1)]

Page 16: Multicore Scaling: The ECM Model

Example 1: 2D Jacobi in DP with SSE2 on SNB

Port schedule (one entry per cycle):

LOAD: LD LD LD LD 2LD 2LD 2LD 2LD LD

STORE: ST ST ST ST

ADD: + + + + + + + + + + + +

MUL: * * * *

core execution: 12 cycles

ECM prediction w/ overlap:

(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s

Measurement: 1.9 GLUP/s

L1 is “single ported” → no overlap during LD/ST

data delay: 6 cycles

[Timeline figure: 12 cy core execution, 6 cy data delay (streams RFO, t0, t1)]

“If the model fails, we learn something”
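The three Jacobi predictions differ only in how the 6 cy data delay is combined with the in-core time. The sketch below puts them side by side. It is an illustration with hypothetical names; in particular, writing the overlapping case as `max(core, ld/st + delay)` is an assumed generalization of the slide's `(8.5+6) cy`, chosen because only the 8.5 cy of LD/ST activity blocks the L1 while the remaining core work may overlap with transfers.

```python
CLOCK_GHZ = 3.5    # clock frequency used on the slides
LUPS = 8           # lattice updates per loop iteration
T_CORE = 12.0      # core execution time per iteration (cycles)
T_LDST = 8.5       # cycles in which LD/ST occupies the L1
T_DELAY = 6.0      # L2 -> L1 data delay: 3 streams at 2 cy each

def glups(cycles):
    """Performance in GLUP/s for one 8-LUP iteration taking `cycles`."""
    return LUPS / cycles * CLOCK_GHZ

# Situation 1: data in L1, only core execution counts.
p_l1 = glups(T_CORE)

# Situation 2, no overlap: data delay is added to the full core time.
p_l2_no_overlap = glups(T_CORE + T_DELAY)

# Situation 2, with overlap: only the LD/ST cycles exclude transfers
# (assumed form; the slide states 8.5 + 6 cycles directly).
p_l2_overlap = glups(max(T_CORE, T_LDST + T_DELAY))

print(f"L1: {p_l1:.2f}  L2 no overlap: {p_l2_no_overlap:.2f}  "
      f"L2 with overlap: {p_l2_overlap:.2f} GLUP/s")
```

The slides quote these values as 2.3, 1.5, and 1.9 GLUP/s; only the last matches the L2 measurement of 1.9 GLUP/s.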

Page 17: Multicore Scaling: The ECM Model

Example 2: 3D long-range stencil in single precision

with AVX on Sandy Bridge


Page 18: Multicore Scaling: The ECM Model

Example 2: 3D long-range stencil in SP with AVX on SNB

4 neighbors per direction

Operations per update

27 LOAD (25 V, 1 ROC, 1 U)

1 STORE (U)

26 ADD

15 MUL

Core time

IACA


Page 19: Multicore Scaling: The ECM Model

Interlude: Intel Architecture Code Analyzer (IACA)

Performs architecture-specific code analysis

Prerequisite: Mark start and end of dominant work loop

In high-level code (documented)

In assembly code (see iacaMarks.h)

Does not influence code optimization (e.g. vectorization)

Assembly loop might perform multiple updates per iteration (unrolling, SIMD)

Important reports (throughput mode):

Block throughput: runtime of one loop iteration (→ core time)

Throughput bottleneck: limiting resource for code execution

Port pressure: dominant pipeline port


Page 20: Multicore Scaling: The ECM Model

IACA example output


AVX vectorization, no unrolling: One iteration updates 8 SP (float) elements

Page 21: Multicore Scaling: The ECM Model

Example 2: 3D long-range stencil in SP with AVX on SNB

4 neighbors per direction

Operations per update

27 LOAD (25 V, 1 ROC, 1 U)

1 STORE (U)

26 ADD

15 MUL

Core time (IACA)

34.25 cy / 8 LUP (SP)

69 cy / CL

LOAD dominated


Page 22: Multicore Scaling: The ECM Model

Example 2: 3D long-range stencil in SP with AVX on SNB

Data delay

Spatial blocking for minimal traffic between L3 cache and memory

8 additional streams for smaller caches

Single-core performance

2.7GHz / (134cy / 16LUP) = 322MLUP/s

Measurement: 320MLUP/s

LOAD dominated → no overlap expected

Socket scaling (8 cores)

8*322MLUP/s = 2576MLUP/s

Limit: 40GB/s / 16B/LUP = 2500MLUP/s

Saturation at 8 cores

Measurement: ~2200MLUP/s (88% max.)

Optimization possibilities

Data transfer well optimized (blocking)

Reduce core time (LD); limit: ADD (2*26cy)

Possible speedup: 69cy-52cy=17cy (~15%)
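The figures on this slide can be checked with a short calculation. The sketch below is illustrative only. It assumes a 64-byte cache line (16 single-precision updates), and it assumes that the 134 cy per cache line decompose as 69 cy core time plus transfer times of 24 cy, 24 cy, and 17 cy, which is a reading of the figure labels that happens to sum to 134. The "~15%" is interpreted here as the ratio of old to new cycle counts.

```python
import math

CLOCK_HZ = 2.7e9
SP_PER_CL = 64 // 4    # 16 single-precision updates per cache line (assumed 64 B CL)

# Core time: IACA reports 34.25 cy for one AVX iteration (8 LUP).
core_cy_per_cl = math.ceil(34.25 * SP_PER_CL / 8)   # 68.5, quoted as 69 cy/CL

# Assumed reading of the figure labels: three transfer contributions.
transfer_cy = [24, 24, 17]
total_cy = core_cy_per_cl + sum(transfer_cy)        # no-overlap sum: 134 cy/CL

single_core_mlups = CLOCK_HZ / (total_cy / SP_PER_CL) / 1e6

# Optimization headroom: the ADD port bounds the core time from below.
add_limit_cy = 2 * 26                        # 26 ADDs per AVX iteration, 2 per CL
saved_cy = core_cy_per_cl - add_limit_cy     # 17 cy
speedup = total_cy / (total_cy - saved_cy)   # 134 / 117

print(f"{single_core_mlups:.0f} MLUP/s, saving {saved_cy} cy, speedup {speedup:.3f}x")
```

Because the transfers are already minimized by blocking, the only remaining lever in this accounting is the LOAD-dominated core time, and even removing all of its excess over the ADD limit buys only about 15%.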

[Figure: per-cycle transfer widths and per-cache-line update times in the memory hierarchy; labels: 69 cy, 24 cy, 24 cy, 24 cy, 17 cy]

Page 23: Multicore Scaling: The ECM Model

Socket scaling


memory bandwidth limit

Page 24: Multicore Scaling: The ECM Model

Example 3: 3D red-black tetrahedral smoother in DP

with AVX on Sandy Bridge

“Terra-Neo” DFG project


Page 25: Multicore Scaling: The ECM Model

Stencil inner loop code and pattern

Notes

Stride-two access pattern on the central line (red-black pattern)

15-point stencil

Compiler “vectorizes” inner loop, but loads and stores are scalar


Page 26: Multicore Scaling: The ECM Model

ECM for Terra-Neo kernel

measured manually, optimization target!


8 LUP (1 CL):

layer condition:

Page 27: Multicore Scaling: The ECM Model

Socket scaling of Terra-Neo kernel


no overlap

max. overlap

“some“ overlap

Page 28: Multicore Scaling: The ECM Model

What we have learned

Code does not saturate the relevant bottleneck

We have to improve single-core performance

Core analysis: Bottleneck is register-register moves

Reason: Vectorized arithmetic, but scalar loads

Optimization goal: Truly vectorized kernel with SIMD loads

Options:

SIMD-friendly data layout (red-black separation) → # iterations?

Different smoother algorithm (e.g. Jacobi) on central line → convergence?

Page 29: Multicore Scaling: The ECM Model

ECM model: Conclusions & outlook

Saturation effects are ubiquitous; understanding them gives us the opportunity to

Find optimization opportunities

Save energy by letting cores idle → see power model later on

Put idle cores to better use → see spMVM case study

Simple models work best. Do not try to complicate things unless it is really necessary!

Possible extensions to the ECM model

Accommodate latency effects

Model simple “architectural hazards”
