Performance analysis with BSC-Tools

Jesus Labarta, Judit Gimenez (BSC)


VI-HPS Tuning Workshop, Aachen, September 2011

Performance analysis tools objective

Who can I blame?

Fly with instruments

Understand our systems

Insight

Identify most productive efforts

Performance analysis tools objective

Help validate hypotheses

Help generate hypotheses

Qualitatively

Quantitatively

Our Tools

Since 1991

Based on traces

Open Source: http://www.bsc.es/paraver

Core tools:
• Paraver: offline trace analysis
• Dimemas: message passing simulator
• Extrae: instrumentation

Focus: detail, flexibility, intelligence

Why traces?

Detail and variability are important
• along time, across processors

Highly non-linear systems
• microscopic effects may have macroscopic impact

Extremely useful to develop/test analysis techniques

Overview of capabilities of the environment

Analysis
• Trace analysis
• Scaling model

What if
• Simple abstract model

Structure detection

HWC projection and CPI stack model

Extreme detail at low overhead

Scalability
• Data selection and reduction
• Online

Outline

Extrae

Paraver

Dimemas

Scaling model

Structure detection

HWC analyses
• Projection and CPI Stack models
• Folding: Instrumentation + sampling

Scalability

Extrae – traced information

Parallel programming model runtime
• MPI, OpenMP, Pthreads, StarSs, ...

Counters
• CPU counters: PAPI (standard + native)
• Network counters (GM, MX), read at flushes
• OS counters

Links to source
• Callstack at MPI calls
• User functions selected (default: none)

Periodic samples
• PAPI counters + callstack

User events

Paraver – Performance data browser

Timelines

Raw data

2/3D tables (Statistics)

Goal = Flexibility
• No semantics
• Programmable

Configuration files
• Distribution
• Your own

Comparative analyses
• Multiple traces
• Synchronize scales

Multispectral imaging

Different looks at one reality
• Different spectral bands (light sources and filters)

Highlight different aspects

Views: Timelines

Raw events

Piece-wise constant functions of time, shown as plots / colors

• Basic metrics: instructions, L2 misses, useful duration, MPI calls (color encoding)

• Derived metrics

$$IPC = \frac{instr}{cycles} \cdot useful$$

$$MPI\_call\_cost = \frac{MPI\_call\_duration}{\#\_bytes}$$

• Models

$$preempted\ time = elapsed - \frac{cycles}{clock\ freq}$$

$$L2\ miss\ latency = \frac{cycles - instr / idealIPC}{L2\ misses}$$
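The two model formulas on this slide (preempted time and L2 miss latency) can be evaluated directly from counter values. The following is a minimal Python sketch; the function names, units and sample values are illustrative assumptions, not Paraver configuration syntax.

```python
def preempted_time(elapsed_s, cycles, clock_hz):
    """Elapsed time not covered by counted cycles.

    preempted time = elapsed - cycles / clock freq
    """
    return elapsed_s - cycles / clock_hz


def l2_miss_latency(cycles, instr, ideal_ipc, l2_misses):
    """Average extra cycles attributed to each L2 miss.

    L2 miss latency = (cycles - instr / idealIPC) / L2 misses
    """
    return (cycles - instr / ideal_ipc) / l2_misses


if __name__ == "__main__":
    # Hypothetical interval: 1 s elapsed, 2e9 cycles counted at 2.5 GHz.
    print(preempted_time(1.0, 2.0e9, 2.5e9))
    # Hypothetical interval: 1000 cycles, 800 instructions, ideal IPC of 2.
    print(l2_miss_latency(1000, 800, 2.0, 10))
```

Both functions apply to one interval of a timeline at a time, which matches the piece-wise constant view: each interval between events yields one value.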

Tables: Profiles, histograms, correlations

Huge number of statistics computed from timelines

MPI calls profile

Useful Duration

Instructions

IPC

L2 miss ratio

Tables: Profiles, histograms, correlations

By the way: six months later ….

Useful Duration

Instructions

IPC

L2 miss ratio

BSC-Tools Environment

[Figure: tool-chain diagram. Labelled elements: Extrae (Dyninst, PAPI, MRNET, XML control); Paraver (.prv + .pcf traces, .cfg configuration files; time analysis, filters); Paramedir + formatting scripts (.txt, .xls, .cube); Dimemas (.trf trace, machine description, output .prv); VENUS contention analysis environment; instruction-level simulators; data display tools (PeekPerf, .viz; CUBE); predictions/expectations.]

Dimemas: Coarse grain, Trace driven simulation

Simulation: highly non-linear model

Linear components
• Point-to-point communication: $T = \frac{MessageSize}{BW} + L$
• Sequential processor performance
  • Global CPU speed
  • Per block/subroutine

Non-linear components
• Synchronization semantics
  • Blocking receives
  • Rendezvous
• Resource contention
  • CPU
  • Communication subsystem: links (half/full duplex), busses

[Figure: machine model. Nodes with several CPUs and a local memory, annotated with the labels L and B.]
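The linear point-to-point component, T = MessageSize / BW + L, is simple enough to evaluate by hand. A minimal Python sketch follows; the function name and the 1 MB message size are assumptions for illustration, and contention and synchronization (the non-linear components) are deliberately ignored.

```python
def transfer_time(message_size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear point-to-point model: T = MessageSize / BW + L."""
    return message_size_bytes / bandwidth_bytes_per_s + latency_s


if __name__ == "__main__":
    size = 1_000_000  # hypothetical 1 MB message
    # (latency in seconds, bandwidth in bytes/s) for three network settings
    for latency, bandwidth in [(25e-6, 100e6), (1000e-6, 100e6), (25e-6, 10e6)]:
        print(latency, bandwidth, transfer_time(size, latency, bandwidth))
```

For large messages the bandwidth term dominates, for small ones the latency term does, which is why sweeping L and BW separately in the simulator tells the two sensitivities apart.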

Collective communication model

Generic model
• Barrier / Fan-in / Fan-out
• Cost of communication phase

Generic or per call
• Model factor: lin / log / const
• Size of message: min, avg or max over all processes

[Figure: timeline of a collective operation, showing processor time, block time and communication time.]
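One way to read the generic collective model is that each phase (fan-in or fan-out) costs a point-to-point transfer scaled by a model factor that depends on the number of processes. The sketch below encodes that reading in Python. The exact form `factor * (L + size / BW)`, and the mapping lin = P, log = log2(P), const = 1, are assumptions made for illustration; the slide only names the options.

```python
import math


def model_factor(kind, n_procs):
    """Scaling of a phase with the process count (assumed mapping)."""
    if kind == "lin":
        return float(n_procs)
    if kind == "log":
        return math.log2(n_procs)
    if kind == "const":
        return 1.0
    raise ValueError(f"unknown model factor: {kind}")


def message_size(kind, sizes):
    """Message size used for the phase: min, avg or max over all processes."""
    if kind == "min":
        return min(sizes)
    if kind == "avg":
        return sum(sizes) / len(sizes)
    if kind == "max":
        return max(sizes)
    raise ValueError(f"unknown size selector: {kind}")


def phase_cost(n_procs, sizes, latency_s, bandwidth, factor_kind, size_kind):
    """Assumed cost of one communication phase of a collective."""
    size = message_size(size_kind, sizes)
    return model_factor(factor_kind, n_procs) * (latency_s + size / bandwidth)


if __name__ == "__main__":
    # Hypothetical fan-in over 8 processes, logarithmic factor, largest message.
    print(phase_cost(8, [1000, 2000, 3000], 1e-5, 1e8, "log", "max"))
```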

Understanding applications

MPIRE 32 tasks, no network contention

Simulated networks:
• L=25us, BW=100MB/s
• L=1000us, BW=100MB/s
• L=25us, BW=10MB/s

All windows same scale

Insight advice appropriate direction

SPECFEM3D – asynchronous communication?

Real

ideal

MN prediction

Prediction 5MB/s

Prediction 1MB/s

Prediction 10MB/s

Prediction 100MB/s

Courtesy Dimitri Komatitsch

Intrinsic application behavior

Load balance and dependence problems?

Ideal network: BW = ∞, L = 0

[Figure: GADGET @ Nehalem cluster, 256 processes. Timelines of the real run and of the ideal network; labelled MPI phases: waitall, sendrec, alltoall, Allgather + sendrecv, allreduce.]

Impact of architectural parameters

Ideal: speeding up ALL the computation bursts by the CPUratio factor

The more processes, the less speedup (higher impact of bandwidth limitations)!

[Figure: three surface plots of speedup (0 to 140) versus bandwidth (64 to 16384 MB/s) and CPUratio (1 to 64), for GADGET with 64, 128 and 256 procs.]

What if: Potential of hybrid parallelization

Hybrid/accelerator parallelization
• Speed up SELECTED regions by the CPUratio factor

We do need to overcome the hybrid Amdahl's law: asynchrony + load balancing mechanisms!

[Figure: profile of % of computation time per code region (regions 1 to 13), and three surface plots of speedup (0 to 20) versus bandwidth (64 to 16384 MB/s) and CPUratio (1 to 64), labelled 93.67%, 97.49% and 99.11% (% elapsed time of the selected code regions); 128 procs.]
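The "hybrid Amdahl's law" remark can be made concrete with the classic formula: if a fraction f of the time is sped up by a factor r, the overall speedup is 1 / ((1 - f) + f / r). The Python sketch below applies it to the three percentages on the slide, under the assumption that they are the fraction of time spent in the accelerated regions. Communication is ignored, so these are upper bounds rather than the simulated values.

```python
def amdahl_speedup(fraction, ratio):
    """Speedup when `fraction` of the time is accelerated by `ratio`."""
    return 1.0 / ((1.0 - fraction) + fraction / ratio)


def amdahl_limit(fraction):
    """Asymptotic speedup for an infinitely fast accelerator."""
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    for fraction in (0.9367, 0.9749, 0.9911):
        row = [round(amdahl_speedup(fraction, r), 2) for r in (4, 16, 64)]
        print(fraction, row, "limit:", round(amdahl_limit(fraction), 2))
```

Even with 93.67% of the time accelerated, the remaining serial part caps the gain well below the CPUratio, which is why the slide asks for asynchrony and load balancing rather than faster accelerators alone.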

Presenting application performance

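Factors modeling parallel efficiency
• Load balance (LB)
• Micro load balance (μLB) or serialization
• Transfer

Factors describing serial behavior
• Performance: IPC

Scaling model

$$\eta_\parallel = LB \cdot \mu LB \cdot Transfer$$

$$Sup = \frac{P}{P_0} \cdot \frac{\eta_\parallel}{\eta_{\parallel 0}} \cdot \frac{IPC}{IPC_0} \cdot \frac{\#instr_0}{\#instr}$$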

Scaling model

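Directly from real execution metrics:

$$LB = \frac{\sum_{i=1}^{P} eff_i}{P \cdot \max_i(eff_i)} \qquad CommEff = \max_i(eff_i) \qquad eff_i = \frac{T_i}{T}$$

where $T_i$ is the useful computation time of process $i$ (given by its $\#instr$ and $IPC$) and $T$ is the elapsed time.

$$Sup = \frac{P}{P_0} \cdot \frac{LB}{LB_0} \cdot \frac{CommEff}{CommEff_0} \cdot \frac{IPC}{IPC_0} \cdot \frac{\#instr_0}{\#instr}$$

Requires Dimemas simulation:

$$Transfer = \frac{T_{ideal}}{T} \qquad microLB = \frac{\max_i(T_i)}{T_{ideal}}$$

• microLB: migrating/local load imbalance, serialization

$$Sup = \frac{P}{P_0} \cdot \frac{macroLB}{macroLB_0} \cdot \frac{microLB}{microLB_0} \cdot \frac{Transfer}{Transfer_0} \cdot \frac{IPC}{IPC_0} \cdot \frac{\#instr_0}{\#instr}$$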
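The efficiency factors of the scaling model follow mechanically from per-process useful computation times. The Python sketch below assumes eff_i = T_i / T, LB = sum(eff_i) / (P * max(eff_i)) and CommEff = max(eff_i). When an ideal-network elapsed time is available (for instance from a Dimemas simulation) it also splits CommEff into microLB = max(T_i) / T_ideal and Transfer = T_ideal / T. The function name and the sample numbers are hypothetical.

```python
def efficiency_factors(useful_times, elapsed, ideal_elapsed=None):
    """Compute LB, CommEff and, optionally, microLB and Transfer.

    useful_times  -- useful computation time T_i of each process
    elapsed       -- elapsed time T of the real run
    ideal_elapsed -- elapsed time T_ideal on an ideal network (optional)
    """
    n_procs = len(useful_times)
    eff = [t / elapsed for t in useful_times]
    lb = sum(eff) / (n_procs * max(eff))
    comm_eff = max(eff)
    factors = {"LB": lb, "CommEff": comm_eff, "ParallelEff": lb * comm_eff}
    if ideal_elapsed is not None:
        factors["microLB"] = max(useful_times) / ideal_elapsed
        factors["Transfer"] = ideal_elapsed / elapsed
    return factors


if __name__ == "__main__":
    # Hypothetical run: 4 processes, 10 s elapsed, 9 s on an ideal network.
    for name, value in efficiency_factors([8.0, 6.0, 6.0, 4.0], 10.0, 9.0).items():
        print(name, round(value, 3))
```

Note that LB * CommEff equals the average of eff_i, and microLB * Transfer equals CommEff, so the factors multiply back to the overall parallel efficiency by construction.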

Modeling Parallel performance

Platform  Processors  Input  Iteration time (s)  Speedup  Relative efficiency
MN                64  A                   78,71     1,00                 1,00
MN               128  A                   44,22     1,78                 0,89
MN               256  A                   22,78     3,46                 0,86
BGP               64  A                  250,41     1,00                 1,00
BGP              128  A                  131,78     1,90                 0,95
BGP              256  A                   79,10     3,17                 0,79
ICE               64  A                   40,41     1,00                 1,00
ICE              128  A                   21,65     1,87                 0,93
ICE              256  A                   12,15     3,32                 0,83

Almost acceptable scalability

GADGET @ PRACE data set 1
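The speedup and relative efficiency columns derive from the iteration times, taking the 64-process run as the reference. A short Python sketch, using the MN rows as input (the function name is an assumption for illustration):

```python
def scaling_row(ref_procs, ref_time, procs, time):
    """Speedup and relative efficiency with respect to a reference run."""
    speedup = ref_time / time
    relative_efficiency = speedup / (procs / ref_procs)
    return speedup, relative_efficiency


if __name__ == "__main__":
    # MN iteration times from the table, 64 processes as reference.
    for procs, time in [(64, 78.71), (128, 44.22), (256, 22.78)]:
        speedup, eff = scaling_row(64, 78.71, procs, time)
        print(procs, round(speedup, 2), round(eff, 2))
```

A relative efficiency of 0,86 at 256 processes means the run is 14% slower than perfect scaling from 64 processes would give; it says nothing yet about how efficient the 64-process reference itself is.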

Modeling Parallel performance

Platform  Processors  Input  Iteration time (s)  Speedup  Relative efficiency  Parallel efficiency  LB    microLB  Transfer
MN                64  A                   78,71     1,00                 1,00                 0,66  0,90     0,94      0,78
MN               128  A                   44,22     1,78                 0,89                 0,58  0,97     0,92      0,65
MN               256  A                   22,78     3,46                 0,86                 0,56  0,95     0,77      0,76
BGP               64  A                  250,41     1,00                 1,00                 0,87  0,90     0,99      0,97
BGP              128  A                  131,78     1,90                 0,95                 0,86  0,96     0,98      0,91
BGP              256  A                   79,10     3,17                 0,79                 0,75  0,95     0,89      0,90
ICE               64  A                   40,41     1,00                 1,00                 0,88  0,91     0,97      0,83
ICE              128  A                   21,65     1,87                 0,93                 0,68  0,98     0,95      0,73
ICE              256  A                   12,15     3,32                 0,83                 0,61  0,94     0,95      0,68

Annotations:
• Relative efficiency: almost acceptable scalability
• Parallel efficiency: performs not so well
• LB: fair load balance, although we have observed production runs at BSC with horrible load balance
• Transfer: issue; good in BGP

GADGET @ PRACE data set 1
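The multiplicative model, parallel efficiency = LB × microLB × Transfer, can be checked against the table. The Python sketch below does so for the three MN rows; the tuple layout is an assumption for illustration, and because the published factors are rounded to two decimals the comparison is made after rounding the product.

```python
# (processors, parallel efficiency, LB, microLB, Transfer) for the MN rows
MN_ROWS = [
    (64, 0.66, 0.90, 0.94, 0.78),
    (128, 0.58, 0.97, 0.92, 0.65),
    (256, 0.56, 0.95, 0.77, 0.76),
]


def modeled_efficiency(lb, micro_lb, transfer):
    """Parallel efficiency as the product of the three factors."""
    return lb * micro_lb * transfer


if __name__ == "__main__":
    for procs, reported, lb, micro_lb, transfer in MN_ROWS:
        modeled = modeled_efficiency(lb, micro_lb, transfer)
        print(procs, reported, round(modeled, 2))
```

Reading the factors side by side shows where the loss comes from: on MN the Transfer factor is the smallest of the three at 64 and 128 processes.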

Parallel Performance Models

Old to new version: -43% instr, -52% time

[Figure: two plots of performance factors (Parallel_Eff, LB, uLB, Transfer; scale 0 to 1,2) versus number of processors (0 to 300) for GROMACS: original version and v4.5.]

Performance @ serial computation bursts

Burst = continuous computation region between exit of an MPI call and entry to the next

Scatter plot representation of bursts
• Collapse the time dimension
• N-dimensional space of relevant metrics derived from HWC
  • Instructions: idea of computational complexity, computational load imbalance, …
  • IPC: idea of absolute performance and performance imbalance

Structure
• Clouds, clusters: bursts of "similar" characteristics
• How do those similarities spread along the timeline?
• How do they relate to source code?
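The burst definition (computation between the exit of one MPI call and the entry to the next) translates into a simple pass over the MPI calls of one process. The Python sketch below turns such a sequence into scatter-plot points of instructions and IPC. The record layout with cumulative counter readings at call entry and exit is an assumption for illustration, not the Extrae trace format.

```python
from dataclasses import dataclass


@dataclass
class MpiCall:
    entry_time: float     # seconds
    exit_time: float      # seconds
    instr_at_entry: int   # cumulative instruction counter at call entry
    instr_at_exit: int    # cumulative instruction counter at call exit
    cycles_at_entry: int  # cumulative cycle counter at call entry
    cycles_at_exit: int   # cumulative cycle counter at call exit


def computation_bursts(calls):
    """One point per burst: duration, instructions and IPC."""
    points = []
    for previous, following in zip(calls, calls[1:]):
        instr = following.instr_at_entry - previous.instr_at_exit
        cycles = following.cycles_at_entry - previous.cycles_at_exit
        points.append({
            "duration": following.entry_time - previous.exit_time,
            "instr": instr,
            "ipc": instr / cycles if cycles else 0.0,
        })
    return points


if __name__ == "__main__":
    calls = [
        MpiCall(0.0, 0.1, 0, 100, 0, 1000),
        MpiCall(1.1, 1.2, 2_000_100, 2_000_200, 1_001_000, 1_002_000),
    ]
    print(computation_bursts(calls))
```

Each resulting point drops its timestamp, which is the "collapse the time dimension" step; a clustering algorithm can then group the points in the (instructions, IPC) plane.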

Performance @ serial computation bursts

[Figure: burst scatter plots for WRF, GROMACS and SPECFEM3D.]

The importance of joint timeline+scatterplots

GROMACS FFTs balance

Instructions imbalance

IPC Imbalance

Automatic clustering quality assessment

Leverage Multiple Sequence Alignment tools from Life Sciences
• Process == sequence of clusters, analogous to DNA == sequence of amino acids
• CLUSTAL W, T-Coffee, Kalign2

Cluster Sequence Score (0..1)
• Per cluster / Global
• Weighted average

BT.A0043

BT.A0022

HWC projection

Full characterization
• All HWCs and ratios between them
• CPI stack model

From a single run

Folding: Instrumentation + sampling

Extremely detailed time evolution of hardware counts, rates and callstack

Minimal overhead

Based on
• Trace: instrumentation events (iteration, MPI, …) and periodic samples
• Application structure: manual iteration instrumentation, routines, clusters

Folding
• Postprocessing to project all samples into one instance

[Figure: original sampled instructions along the run, and the detailed fine-grain instructions within one iteration obtained by folding.]
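The idea of projecting all samples into one instance can be sketched in a few lines. Assuming each instance (for example one iteration) is delimited by instrumentation events that carry a timestamp and a cumulative counter value, every periodic sample is mapped to its relative position in time and in counter value within its instance. The data layout and function name below are assumptions for illustration, not the format or algorithm of the BSC folding tool.

```python
def fold(instances, samples):
    """Project samples from many instances onto one normalized instance.

    instances -- list of (start_time, end_time, count_at_start, count_at_end)
    samples   -- list of (time, cumulative_count)
    Returns (relative_time, relative_count) pairs sorted by relative time.
    """
    folded = []
    for time, count in samples:
        for start, end, count_start, count_end in instances:
            if start <= time < end:
                folded.append((
                    (time - start) / (end - start),
                    (count - count_start) / (count_end - count_start),
                ))
                break
    folded.sort()
    return folded


if __name__ == "__main__":
    # Two hypothetical iterations of 10 s with 100 instructions each,
    # and a single sample in each of them.
    instances = [(0.0, 10.0, 0, 100), (10.0, 20.0, 100, 200)]
    samples = [(2.0, 30), (15.0, 160)]
    print(fold(instances, samples))
```

With one sample per iteration the individual instances are barely observed, but after folding many iterations the combined points trace the counter evolution inside a single iteration at fine grain, which is why a coarse sampling period is enough.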

Folding profiles of rates

Folded source code line

Folded instructions

PEPC

96 MIPS

Performance metrics

16 MIPS

2.3 M L2 misses/s

0.1 M TLB misses/s

    htable%node = 0
    htable%key = 0
    htable%link = -1
    htable%leaves = 0
    htable%childcode = 0

    do i = 1, n
       htable(i)%node = 0
       htable(i)%key = 0
       htable(i)%link = -1
       htable(i)%leaves = 0
       htable(i)%childcode = 0
    end do

Changes
• -70% time
• -18% instructions
• -63% L2 misses
• -78% TLB misses
• 253 MIPS (+163%)

Data reduction

Data handling/summarization capability
• Software counters, filtering and cutting

Can be automated through signal processing techniques:
• Mathematical morphology to clean up perturbed regions
• Wavelet transform to identify coarse regions
• Spectral analysis for detailed periodic patterns

Algorithms applied to traces and online
• Extrae (stand-alone or on MRNet)
• Paraver

[Figure: WRF-NMM, Peninsula 4km, 128 procs; trace with MPI and HWC. Three traces: 570 s, 2.2 GB; 570 s, 5 MB; 4.6 s, 36.5 MB.]

Spectral analysis

[Figure: signal analysis of a trace. Panels: wavelet (high frequency), autocorrelation, spectral density; the detected period is marked T.]
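Autocorrelation is one way to find the period T of an iterative application from a signal sampled along the trace. The Python sketch below is a deliberately naive version: it returns the lag with the highest normalized autocorrelation. The function names are assumptions, and a production tool would need to be more careful, since for smooth signals very small lags also correlate strongly.

```python
def autocorrelation(signal, lag):
    """Normalized autocorrelation of `signal` at the given lag."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal)
    if variance == 0:
        return 0.0
    shifted = sum(
        (signal[i] - mean) * (signal[i + lag] - mean) for i in range(n - lag)
    )
    return shifted / variance


def dominant_period(signal, max_lag=None):
    """Lag (in samples) with the highest positive autocorrelation, or None."""
    if max_lag is None:
        max_lag = len(signal) // 2
    best_lag, best_value = None, 0.0
    for lag in range(1, max_lag + 1):
        value = autocorrelation(signal, lag)
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


if __name__ == "__main__":
    # Hypothetical signal: one spike every 5 samples, repeated 8 times.
    signal = [1, 0, 0, 0, 0] * 8
    print(dominant_period(signal))
```

Once the period is known, a window of a few periods can be cut from the trace, which is the kind of reduction that keeps only a short representative interval.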

Scalability: online data reduction

[Figure: PFLOTRAN timelines showing the Flow and Tran phases on Jaguar (~47 seconds) and Jugene (~105 seconds); runs with 8K, 12K and 16K cores.]

Scalability: online automatic interval selection

G. Llort et al., "Scalable tracing with dynamic levels of detail", ICPADS 2011

[Figure: back-end threads T0, T1, ..., Tn connected through MRNet to a front-end running the clustering analysis; data is aggregated towards the front-end and results are broadcast back.]

Structure detection

Detailed trace only for a small interval

Scalable display: Non linear rendering

Linpack @ Marenostrum: 10k cores x 1700 s

[Figure: three timelines with their scale limits. Dgemm L1 miss ratio: 0.7 to 0.8; Dgemm duration: 10 s to 11.8 s; Dgemm IPC: 2.85 to 2.95.]

Conclusion

• Extreme flexibility
  • Maximize iterations of the hypothesis-validation loop

• Learning curve
  • "Don't ask whether something can be done, ask how it can be done"

• Detailed and precise analysis
  • Squeeze the information obtainable from a single run

• Insight and correct advice, with estimates of potential gain

• Data analysis techniques applied to performance data

www.bsc.es/paraver