application-centric benchmarking and modeling for co … · application-centric benchmarking and...

25
APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference Edinburgh, Scotland, UK All used images belong to the producers.

Upload: others

Post on 27-Jun-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN

Torsten Hoefler Exascale Applications and Software Conference Edinburgh, Scotland, UK

All used images belong to the producers.

Page 2: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Number of PEs grows exponentially

Bottlenecks shift quickly

e.g., to data serialization

Some aspects are over-engineered

At least for the “average application”

How do we know what applications need?

At scale?

WHY MODELING AND CO-DESIGN?

Torsten Hoefler Slide 2 of 25

Page 3: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Allows to:

1. Predict performance at a different scale [1,2]

2. Predict performance on a different machine [3,4]

3. Predict performance for a different problem [5,6]

4. Assess impact of network topology choices [7]

5. Find performance bugs [to appear]

6. Determine application requirements [this talk]

PERFORMANCE MODELING

[1]: Zhai et al.:“Phantom: predicting performance of parallel applications on large-scale *…+”, PPoPP’10 *2+: Lee et al.: “Methods of inference and learning for performance modeling of parallel applications.”, PPoPP’07 [3]: Marin, Mellor-Crummey: “Cross-architecture performance predictions for scientific applications *…+”, SIGMETRICS’04 *4+: Yang et al.: “Cross-platform performance prediction of parallel applications using partial execution.”, SC’05 *5+: Hoefler et al.: “Performance modeling for systematic performance tuning”, SC’11 [6]: Kerbyson et al.: “Predictive performance and scalability modeling of a large-scale application.”, SC’01 *7+: Bauer et al.: “Performance Modeling and Comparative Analysis of the MILC Lattice QCD Application su3 rmd”, CCGrid’12

Torsten Hoefler Slide 3 of 25

Page 4: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

SIMPLEST EXAMPLE – MATRIX MULTIPLICATION

Semi-Analytic Perf. Modeling [1]:

T(N) = tN3

t=2.2ns

Slide 4 of 28 Torsten Hoefler

1 1 3 1 1 4 1 7 9 4 1 2 1 5 1 3

1 3 0 1 3 7 4 1 3 0 9 8 1 2 5 6

5

for(int i=0; i<N; ++i)

for(int j=0; j<N; ++j)

for(int k=0; k<N; ++k)

C[i+j*N] += A[i+k*N] * B[k+j*N];

Algorithmic (analytic) Parameters

Architectural (empirical) Parameters

<0.8% model error

*1+: Hoefler et al.: “Performance modeling for systematic performance tuning”, SC’11

Torsten Hoefler Slide 4 of 25

Page 5: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Dual to performance modeling

What we get vs. what we need!

Allows to model requirements of applications

E.g., FLOPs, Memory Bandwidth, …

Rules of thumb: “1 Byte/Word per FLOP”

Better: “balance principles” *1+

Burton’s interpretation of Little’s law

concurrency = latency x bandwidth

REQUIREMENTS MODELING

[1]: Czechowski et al: “Balance Principles for Algorithm-Architecture Co-design”, HotPar’11

minimum needed to hide latency and keep system busy

Torsten Hoefler Slide 5 of 25

Page 6: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

System features

Memory/Network Latency, Bandwidth, Flops, …

Requirements space of system X

Feature vector F defines limits

Each application run has a requirement vector R

In general R < F

Bottlenecks are easily identified

MODEL REQUIREMENTS

*1+: Hoefler et al.: “Performance modeling for systematic performance tuning”, SC’11

Torsten Hoefler Slide 6 of 25

Page 7: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Requirements vectors are not enough

Back to our simple example: Matrix Multiplication!

2N3 FLOP/s

Memory bandwidth?

LLC Misses:

C(N) = aN3 – bN2

a=3.8e-4

b=2.7e-1

PARAMETRIC REQUIREMENTS MODEL

“Algorithmic Parameters”

“Architectural Parameters”

*1+: Hoefler et al.: “Performance modeling for systematic performance tuning”, SC’11

Torsten Hoefler Slide 7 of 25

Page 8: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

MM needs < ½ memory bandwidth Small problems even less

HOW IS THE BALANCE?

asymptotic ~8.3 in-cach

e

Operational Intensity Roofline Visualization

Torsten Hoefler Slide 8 of 25

Page 9: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

MILC – MIMD Lattice Computation, su3_rmd

Well understood and modeled [1] Five phases: FL, LL, CG, GF, FF

Phase detection can be automated

A REAL (SIMPLE) APPLICATION

*1+: Bauer et al.: “Performance Modeling and Comparative Analysis of the MILC Lattice QCD Application su3 rmd”, CCGrid’12 Torsten Hoefler Slide 9 of 25

Page 10: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

MILC BALANCE?

Thanks to Greg Bauer

Torsten Hoefler Slide 10 of 25

Page 11: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Thanks to Greg Bauer

Cache-aware programming

MILC BALANCE?

Torsten Hoefler Slide 11 of 25

Page 12: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Thanks to Greg Bauer

Micro-Optimization

MILC BALANCE?

Torsten Hoefler Slide 12 of 25

Page 13: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Simple strategy for requirements modeling for co-design, isn’t it?

Can be used by system designers and architects

Issues:

Underestimated requirement functions

Working on it, e.g., bug finding!

Overlooked requirement dimensions

Misguided optimization/-benchmarking!

E.g.: Serialization overheads, Routing issues

SIMPLE! WHAT ARE THE PITFALLS?

*1+: Hoefler et al.: “Performance modeling for systematic performance tuning”, SC’11

Torsten Hoefler Slide 13 of 25

Page 14: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

MILC: FULL PERFORMANCE MODEL

Data from POWER5

Torsten Hoefler Slide 14 of 25

Page 15: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Networks channels are serial!

But you want to communicate this pattern:

Or a face- (Cartesian boundary-) exchange:

MISSED REQUIREMENT I: DATA SERIALIZATION

Recv from proc 1

Send to proc 2

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 15 of 25

Page 16: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

MISSED REQUIREMENT I: DATA SERIALIZATION

Recv from proc 1

Send to proc 2

Manual Packing sbuf = malloc(N/2*sizeof(int)); rbuf = malloc(N/2*sizeof(int)); for (i=1; i<N; i+=2) sbuf[i/2]=data[i]; MPI_Isend(sbuf, …); MPI_Irecv(rbuf, …); MPI_Waitall(…); for (i=0; i<N; i+=2) data[i]=rbuf[i/2]; free(sbuf); free(rbuf);

Allocate extra send / recv Buffers

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 16 of 25

Page 17: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Recv from proc 1

Send to proc 2

Manual Packing sbuf = malloc(N/2*sizeof(int)); rbuf = malloc(N/2*sizeof(int)); for (i=1; i<N; i+=2) sbuf[i/2]=data[i]; MPI_Isend(sbuf, …); MPI_Irecv(rbuf, …); MPI_Waitall(…); for (i=0; i<N; i+=2) data[i]=rbuf[i/2]; free(sbuf); free(rbuf);

Copy data to / from extra send / recv Buffers

MISSED REQUIREMENT I: DATA SERIALIZATION

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 17 of 25

Page 18: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Recv from proc 1

Send to proc 2

Manual Packing sbuf = malloc(N/2*sizeof(int)); rbuf = malloc(N/2*sizeof(int)); for (i=1; i<N; i+=2) sbuf[i/2]=data[i]; MPI_Isend(sbuf, …); MPI_Irecv(rbuf, …); MPI_Waitall(…); for (i=0; i<N; i+=2) data[i]=rbuf[i/2]; free(sbuf); free(rbuf);

Actual sending

MISSED REQUIREMENT I: DATA SERIALIZATION

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 18 of 25

Page 19: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Recv from proc 1

Send to proc 2

Manual Packing sbuf = malloc(N/2*sizeof(int)); rbuf = malloc(N/2*sizeof(int)); for (i=1; i<N; i+=2) sbuf[i/2]=data[i]; MPI_Isend(sbuf, …); MPI_Irecv(rbuf, …); MPI_Waitall(…); for (i=0; i<N; i+=2) data[i]=rbuf[i/2]; free(sbuf); free(rbuf);

MPI Datatypes MPI_Datatype nt; MPI_Type_vector(n/2, 1, 2, MPI_INT, &nt); MPI_Type_commit(&nt); MPI_Isend(&data[1], 1, nt, …); MPI_Irecv(&data[0], 1, nt, …); MPI_Waitall(…); MPI_Type_free(&nt);

• No explicit copying • Less code

MISSED REQUIREMENT I: DATA SERIALIZATION

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 19 of 25

Page 20: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

SERIALIZATION ACCESS PATTERNS

Application Classes

Access Patterns

Atmospheric Science (WRF)

Geophysical Science

(SPECFEM3D)

Quantumchro- Modynamics (MILC_su3)

Molecular Dynamics

(LAMMPS, MiniMD)

Computational Fluiddynamics (NAS LU + MG)

Face Exchanges Unstructured Exchange Matrix Transposition

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 20 of 25

Page 21: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

SERIALIZATION BENCHMARK RESULTS

All packing methods take more than 80% of communication time

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 21 of 25

Page 22: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

SERIALIZATION BENCHMARK RESULTS

Data is contiguous

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 22 of 25

Page 23: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

SERIALIZATION BENCHMARK RESULTS

Unstructured Exchange

*1+: Schneider et al.: “Micro-Applications for Communication Data Access Patterns and MPI Datatypes”, EuroMPI’12

Torsten Hoefler Slide 23 of 25

Page 24: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Observed bandwidth < peak bandwidth

Example [1]:

IB full bisection bandwidth fat tree

Effective bandwidth: 69% of peak

Reason: static routing

Conjecture:

Routing needs to be co-designed [2]

With applications and topologies (complex topic)

MISSED REQUIREMENT II: GLOBAL BANDWIDTH

*1+: Hoefler et al.: “Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks”, Cluster’08 [2]: Prisacari et al.: “Bandwidth-optimal Alltoall Exchanges in Fat Tree Networks”, ICS’13

Torsten Hoefler Slide 24 of 25

Page 25: APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO … · APPLICATION-CENTRIC BENCHMARKING AND MODELING FOR CO-DESIGN Torsten Hoefler Exascale Applications and Software Conference

Co-design is becoming more important

Requirements modeling is a simple strategy!

Not trivial, dangers:

Underestimating requirements function

Overlooking requirements dimension

Tool support on its way

As automatic as possible

Collaboration with GRS, SPPEXA

CONCLUSIONS

Torsten Hoefler Slide 25 of 25