
Page 1:

CS 594: Scientific Computing for Engineers
Performance Analysis Tools: Part I

Gabriel Marin
gmarin@eecs.utk.edu

Page 2:

Rough Outline

1. Part I
   • Motivation
   • Introduction to Computer Architecture
   • Overview of Performance Analysis techniques

2. Part II
   • Introduction to Hardware Counter Events
   • PAPI: Access to hardware performance counters

3. Part III
   • HPCToolkit: Low overhead, full code profiling using hardware counter sampling
   • MIAMI: Performance diagnosis based on machine-independent application modeling


Page 3:

WHAT IS PERFORMANCE?

• Getting results as quickly as possible?
• Getting correct results as quickly as possible?
• What about budget?
• What about development time?
• What about hardware usage?
• What about power consumption?


Page 4:

WHY PERFORMANCE ANALYSIS?

• Large investments in HPC systems
  o Procurement costs: ~$40 million / year
  o Operational costs: ~$5 million / year
  o Electricity costs: 1 MW-year ~ $1 million
• Efficient usage is important because resources are expensive and limited
• Scalability is important to achieve the next bigger simulation
• Embedded systems have strict power and memory constraints


Page 5:

SIMPLE PERFORMANCE EQUATION

• N – number of executed instructions
• C – CPI = cycles per instruction
• f – processor frequency
• I – IPC = instructions per cycle
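• Putting these together (the classic form of the equation):

      T = N × C / f = N / (I × f)

  • Example: N = 10⁹ instructions at C = 1 cycle/instruction on a 2 GHz processor take 10⁹ / (2×10⁹) = 0.5 s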

• Frequency scaling provided “easy” performance gains for many years
  • Power use increases with the cube of frequency


Page 6:

SIMPLE PERFORMANCE EQUATION

• N – affected by the algorithm and its implementation, the compiler, and the machine instruction set (e.g., SIMD instructions)
• f – determined by the architecture; no longer going up
• I – affected by code optimizations (manual or by the compiler) and by micro-architecture optimizations
• Current architectures can issue 6-8 micro-ops per cycle
  • They retire 3-4 instructions per cycle (Itanium can retire 6)
  • IPC > 1.5 is very good, ~1 is OK; many applications get IPC < 1
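As a preview of Part II, a minimal sketch of measuring IPC with PAPI's classic high-level call (this assumes PAPI is installed; PAPI_ipc comes from the older high-level API, and the kernel below is a hypothetical stand-in for real work):

    #include <stdio.h>
    #include <papi.h>              /* link with -lpapi */

    void kernel(void)              /* hypothetical work to be measured */
    {
        volatile double s = 0.0;
        for (long i = 0; i < 100000000L; i++) s += i * 0.5;
    }

    int main(void)
    {
        float rtime, ptime, ipc;
        long long ins;
        /* The first call starts the counters... */
        if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK) return 1;
        kernel();
        /* ...subsequent calls report instructions and IPC since the last call. */
        PAPI_ipc(&rtime, &ptime, &ins, &ipc);
        printf("%lld instructions, IPC = %.2f\n", ins, ipc);
        return 0;
    }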

Page 7:

FACTORS AFFECTING PERFORMANCE

• Algorithm – biggest impact
  • O(N·log N) performs much better than O(N²) for useful values of N
• Code implementation
  • Integer-factor performance differences between efficient and inefficient implementations of the same algorithm
• Compiler and compiler flags
• Architecture


Page 8:

EXAMPLE: MATRIX MULTIPLY

    /* N and the A, B, C accessor macros are defined elsewhere in the
       example (one possible definition is sketched below). */
    void compute(int reps)
    {
        register int i, j, k, r;
        for (r = 0; r < reps; ++r) {
            for (i = 0; i < N; i++) {
                for (j = 0; j < N; j++) {
                    for (k = 0; k < N; k++) {
                        C(i,j) += A(i,k) * B(k,j);
                    }
                }
            }
        }
    }
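The declarations behind the accessor macros are not shown on the slide; a minimal sketch of what they might be (the value of N and the row-major layout are assumptions):

    #define N 512                          /* assumed problem size */
    static double a[N*N], b[N*N], c[N*N];
    #define A(i,j) a[(i)*N + (j)]          /* assumed row-major accessors */
    #define B(i,j) b[(i)*N + (j)]
    #define C(i,j) c[(i)*N + (j)]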


Page 9:

MATRIX MULTIPLY: DIFFERENT COMPILERS

Timing for the full code, not just the compute routine

Can we explain the differences in performance?


Page 10:

MATRIX MULTIPLY: DIFFERENT MACHINES

The two machines are part of the same family


Page 11:

COMPUTER ARCHITECTURE REVIEW

Typically we cannot modify our target architecture, but we can tailor our code to the target architecture.

In very specialized cases we want to tailor an architecture to a specific application / workload.

You do not have to be a computer architect to optimize code … but it really helps


Page 12:

COMPUTER ARCHITECTURE REVIEW

[Diagram: CPU block diagram, divided into the front-end and the back-end]


Page 13:

PROCESSOR FRONT-END MODEL

• Front-end: operates in-order
  • Instruction fetch
  • Instruction decode
  • Branch predictor – speculative instruction fetch and decode
    • ~1 in 5 instructions is a branch
  • Issue buffer – stores decoded instructions for the scheduler
  • Reorder buffer – instructions decoded but not yet retired

[Diagram: front-end pipeline model – the branch predictor (mispredict rate) and I-cache (miss rate) feed a fetch buffer, then a decode pipeline (pipeline depth), an issue buffer (# entries), and a reorder buffer (# entries), which lead to the scheduler and the back-end]


Page 14:

PROCESSOR FRONT-END STALLS

• Possible front-end stall events
  • I-cache or I-TLB miss
  • Branch misprediction
  • Full reorder buffer

[Front-end pipeline diagram repeated from page 13]


Page 15:

FRONT-END STALLS: I-CACHE OR I-TLB MISS

• Instruction fetch stops
• Instructions already in the front-end buffers continue to be dispatched until the front-end drains (this hides part of the penalty)
• The front-end pipeline starts to refill after the miss event is handled; front-end refill time ~ front-end drain time
• Penalty ~= miss event delay

[Front-end pipeline diagram repeated from page 13]


Page 16:

FRONT-END STALLS: I-CACHE OR I-TLB MISS

• Possible causes
  • Execution is spread over large regions of code, with branchy, unpredictable control flow
    • Not typical for HPC
  • Large loop footprint + small I-cache
    • Older Itanium 2: 16KB I-cache, no hardware prefetcher
    • Space-inefficient VLIW instruction set
    • Loop fusion / loop unrolling can create large loop footprints
• Possible solutions (see the build sketch below)
  • Feedback-directed compilation can change code layout
  • Limit loop unrolling or fusion
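For instance, feedback-directed compilation with GCC is a two-step build (the file and input names here are hypothetical):

    gcc -O2 -fprofile-generate app.c -o app    # instrumented build
    ./app training_input                       # run once to collect a profile
    gcc -O2 -fprofile-use app.c -o app         # rebuild; the profile guides layout, unrolling, inlining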


Page 17:

FRONT-END STALLS: BRANCH MISPREDICTION

• Instruction fetch continues along a wrong path
• Useful instructions before the mispredicted branch continue to be dispatched until the branch enters the window
• The pipeline is flushed when the branch is resolved and the misprediction is detected
• Instruction fetch restarts on the correct path; the front-end pipeline starts to refill
• Penalty ~= branch resolution time + pipeline refill

[Front-end pipeline diagram repeated from page 13]
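To make the cost concrete, a small hypothetical C example (not from the slides): on random input the branch below is mispredicted roughly half the time, while the branchless variant trades the branch for a cheap multiply.

    #include <stddef.h>

    /* Branchy: data-dependent branch, hard to predict on random input. */
    long sum_above(const int *v, size_t n, int thresh)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i] >= thresh)            /* ~50% taken on uniform random data */
                s += v[i];
        return s;
    }

    /* Branchless: the comparison yields 0 or 1, so there is no control flow
       to mispredict. */
    long sum_above_branchless(const int *v, size_t n, int thresh)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += (long)v[i] * (v[i] >= thresh);
        return s;
    }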


Page 18:

BRANCH MISPREDICTION PENALTY

• This is the minimum penalty, proportional to the processor pipeline depth
• Bulldozer has a deeper pipeline than K10 -> higher penalty
• Sandy Bridge added a micro-op cache, which can lower the misprediction penalty compared to Nehalem

Architecture                                  Branch misprediction penalty (cycles)
AMD K10 (Barcelona, Istanbul, Magny-Cours)    12
AMD Bulldozer                                 20
Pentium 4                                     20
Core 2 (Conroe, Penryn)                       15
Nehalem                                       17
Sandy Bridge                                  14-17


Page 19:

BRANCH MISPREDICTION PENALTY

• Branch predictors have improved over time
  • Both Intel and AMD
• Modern branch predictors have very good accuracy on typical workloads, 95%+
• Is there room for improvement?
  • Does it matter if we go from 95% to 96%?


Page 20:

BRANCH MISPREDICTION PENALTY

• Branch predictors have improved over time
  • Both Intel and AMD
• Modern branch predictors have very good accuracy on typical workloads, 95%+
• Is there room for improvement?
  • Does it matter if we go from 95% to 96%?
• Performance loss is proportional to the branch misprediction rate
  • Going from a 5% to a 4% misprediction rate is a 20% improvement
  • ~1 in 5 instructions is a branch in typical workloads
• Losses due to branch misprediction
  • Branch misprediction rate × pipeline depth
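• Worked out with the numbers above and a 15-cycle penalty (Core 2 class, from the table on page 18): 0.2 branches/instruction × 5% × 15 cycles ≈ 0.15 wasted cycles per instruction; at 4% it is 0.12, a 20% reduction in misprediction losses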


Page 21:

FRONT-END STALLS: ROB FULL

• The ROB maintains the in-order state of not-yet-retired micro-ops
  • μops still in the issue buffer
  • μops in the back-end pipelines (executing)
  • Completed μops whose predecessor at the ROB head has not retired
• On a long data access, other micro-ops continue to issue, but micro-ops dispatched after the stalled load cannot retire
• Dispatch continues until the ROB fills up, then it stalls

[Front-end pipeline diagram repeated from page 13]


Page 22:

PROCESSOR BACK-END MODEL

• Execution units are organized in stacks
  • One μop can be issued to each issue port each cycle
  • Stacks deal with different instruction mixes
• Micro-op scheduler: unified or partitioned
• Register files (not shown)
• Bypass network to forward results between stacks

[Diagrams: back-end execution stacks of the Intel Sandy Bridge and AMD K10]


Page 23:

PROCESSOR BACK-END MODEL

• Micro-ops enter the scheduler's issue buffer in order
• They can be dispatched out of order, but the scheduler stays close to FIFO
  • It picks the oldest ready-to-execute micro-ops
    • Skipping micro-ops that are not ready
  • This increases instruction-level parallelism (ILP)
• Why try to maintain FIFO?
  • Retirement width < issue width
  • Retirement width balances the front-end width
  • The larger issue width absorbs short variations in ILP


Page 24:

HOW TO DEFINE PEAK PERFORMANCE?

• Peak retirement rate (IPC)
  • From the architecture's point of view
• Peak issue rate of “useful” instructions
  • HPC cares about FLOPS, mainly adds and multiplies
    • Peak FLOPS rate; everything else is overhead
  • You need many data-movement instructions (loads, register copies, data shuffling, data conversion) plus address arithmetic and branches to perform the useful work
    • Most workloads cannot get close to peak; dense linear algebra is an exception
• What about SIMD instructions?
  • Peak issue rate of SIMD “useful” instructions
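• Example (clock rate assumed for illustration): a Sandy Bridge core can issue one 4-wide AVX double-precision add and one 4-wide multiply per cycle, so at 2.6 GHz its DP peak is 2.6 × 8 ≈ 20.8 GFLOPS, and every load, shuffle, or branch issued competes with that budget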


Page 25:

BACK-END INEFFICIENCIES

• Mismatch between the application's instruction mix and the available machine resources
  • Contention on a particular execution unit or issue port
  • “Useful” instructions are a fraction of all program instructions
• Instruction dependencies limit the available ILP
  • Machine units sit mostly idle
• Long data accesses
  • Memory accesses miss in the D-cache or D-TLB
  • Caches are non-blocking; other instructions continue to issue
    • Multiple outstanding accesses to memory are possible
  • Retirement stops on a long-latency instruction
    • Eventually the ROB fills up and dispatch stops


Page 26:

LONG DATA ACCESSES

• Typically the main source of performance loss
• Architecture optimizations
  • Multiple levels of cache – take advantage of temporal and spatial reuse
  • Hardware prefetchers – bring data closer to the execution units before it is needed
    • Work best with streaming memory access patterns
• Software optimizations (a tiling sketch follows below)
  • High-level loop nest optimizations: tiling, fusion, loop interchange, loop splitting, data layout transformations
    • Increase temporal or spatial reuse
  • Software prefetching – uses instruction issue bandwidth
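A minimal sketch of tiling applied to the matrix multiply kernel from page 8 (the tile size is an assumption, the accessor macros are those assumed earlier, and N is taken to be divisible by TILE):

    /* Each TILE x TILE block of C is updated while the matching blocks of A
       and B are hot in cache, increasing temporal reuse; the j-innermost
       order also gives unit-stride (spatial) access to C and B. */
    #define TILE 64
    void compute_tiled(void)
    {
        int ii, kk, jj, i, k, j;
        for (ii = 0; ii < N; ii += TILE)
            for (kk = 0; kk < N; kk += TILE)
                for (jj = 0; jj < N; jj += TILE)
                    for (i = ii; i < ii + TILE; i++)
                        for (k = kk; k < kk + TILE; k++)
                            for (j = jj; j < jj + TILE; j++)
                                C(i,j) += A(i,k) * B(k,j);
    }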


Page 27:

PERFORMANCE ANALYSIS

• Performance analysis is part science and part art
  • Many variables affect performance
    • Architecture, algorithm, code implementation, compiler
    • Caches and various micro-architecture optimizations make analysis nondeterministic
• You need a feel for what can go wrong
  • Everyone has a different style
• Knowledge of computer architecture helps
• A compilers background helps
• A math / numerical algorithms background helps


Page 28:

PERFORMANCE OPTIMIZATION CYCLE

• Measure & analyze: have an optimization phase
  • Just like the testing & debugging phase
• It is often skipped
  • Budget or development-time constraints

[Diagram: the optimization cycle – code development produces a functionally complete and correct program; instrumentation, measure, analyze, and modify / tune iterate until the program is complete, correct, and well-performing, and it then moves to usage / production]


Page 29:

PERFORMANCE ANALYSIS TECHNIQUES

• Performance measurement
• Performance modeling
• Simulation

• The line between the different techniques can be blurry
  • Modeling can use measurement or simulation results as input


Page 30:

PERFORMANCE MEASUREMENT

• Further classification
  • Profiling vs. tracing
  • Instrumentation vs. sampling
• Advantages
  • Measures real performance effects
    • Actual code on a real system
  • Reveals hotspots
• Disadvantages
  • Instrumentation overhead can perturb measurements
  • Sampling can have attribution errors
  • Measures only performance effects
    • Performance insight (diagnosis) is not easily apparent


Page 31:

MEASUREMENT: PROFILING VS. TRACING

Profiling
• Aggregate performance metrics
  • No timeline dimension or ordering of events
• Number of times a routine was invoked
• Time spent or cache misses incurred in a loop or a routine
• Memory and message communication sizes

Tracing
• When and where events took place along a global timeline
• Time-stamped events (points of interest)
• Message communication events (sends/receives) are tracked
  • Shows when and from/to where messages were sent
• Event trace: the collection of all events of a process/program, sorted by time


Page 32:

MEASUREMENT: PROFILING

• Recording of summary information during execution
  o inclusive and exclusive time, number of calls, hardware counter statistics, …
• Reflects the performance behavior of program entities
  o functions, loops, basic blocks
  o user-defined “semantic” entities
• Very good for low-cost performance assessment
• Helps to expose hotspots
• Implemented through either
  o sampling: periodic OS interrupts or hardware counter traps
  o instrumentation: direct insertion of measurement code (sketched below)
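A minimal sketch of the instrumentation approach in C (the instrumented routine is a hypothetical stand-in; real profilers insert such probes automatically):

    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)          /* monotonic wall-clock timer */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void solver_step(void)        /* hypothetical routine of interest */
    {
        volatile double s = 0.0;
        for (long i = 0; i < 1000000L; i++) s += i;
    }

    int main(void)
    {
        double total = 0.0;              /* summary info: time in the routine */
        long calls = 0;
        for (int it = 0; it < 100; it++) {
            double t0 = now_sec();       /* probe on entry */
            solver_step();
            total += now_sec() - t0;     /* probe on exit */
            calls++;
        }
        printf("solver_step: %ld calls, %.3f s total\n", calls, total);
        return 0;
    }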


Page 33:

MEASUREMENT: TRACING

• Recording of information about significant points (events) during program execution
• Information is saved in an event record (sketched below)
  o timestamp
  o CPU identifier, thread identifier
  o event type and event-specific information
• Useful to expose interactions between parallel processes or threads

Tracing disadvantages
  o Traces can become very large
  o Instrumentation and tracing add overhead
  o Event buffering, clock synchronization, …
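What an event record might look like in C (field names and sizes are illustrative, not taken from any particular tracing tool):

    #include <stdint.h>

    /* One time-stamped event in a trace buffer. */
    typedef struct {
        uint64_t timestamp_ns;   /* when the event occurred */
        uint32_t cpu_id;         /* where it occurred */
        uint32_t thread_id;
        uint16_t event_type;     /* e.g., routine entry/exit, msg send/recv */
        uint16_t payload_len;    /* bytes of event-specific data that follow */
    } trace_event_t;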


Page 34:

PERFORMANCE MODELING

• Paper-and-pencil or semi-automated
• Characterizes the application and the architecture independently
  • Aims to separately understand the contributions of the application and of the architecture to performance
  • Uses a convolution process to predict performance
• Advantages
  • Enables “what if” analysis – explore changes to the application or architecture characteristics
  • Provides bounds on performance
  • Can provide performance insight into bottlenecks
    • Depends on the model's level of detail
  • Faster than simulation, but less accurate


Page 35:

PENCIL-AND-PAPER MODELING

• Back-of-the-envelope analysis
  • Count the types of operations at a high level
  • Usually assumes ideal conditions and peak machine issue rates
• Advantages
  • Quick performance estimates
  • Compare algorithms / implementations with similar asymptotic complexities before implementing them
• Disadvantages
  • Actual performance can be far from the estimate
    • Does not account for many low-level details
  • Hard to do by hand for large, complex applications
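• Worked example (peak figure assumed for illustration): dense N×N matrix multiply performs 2N³ floating-point operations; for N = 1000 that is 2×10⁹ flops, so on a 10 GFLOPS-peak core the ideal-conditions estimate is 0.2 s, and a measured time far above that points at memory behavior rather than flop count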


Page 36:

ARCHITECTURE SIMULATION

• Micro-architecture/device vs. full-system simulation
• Functional vs. timing simulation
  • Functional: emulates the target system
  • Timing: simulates the timing features of the architecture
• Trace-driven vs. execution-driven
• Advantages
  • Obtain detailed performance metrics
  • Account for the ordering of dynamic events
    • More accurate than analytical modeling
• Disadvantages
  • Can be very slow
    • Depends on the level of detail


Page 37:

PERFORMANCE ANALYSIS TOOLS

Raj Jain (1991): “Contrary to common belief, performance evaluation is an art. ... Like artists, each analyst has a unique style. Given the same problem, two analysts may choose different performance metrics and evaluation methodologies.”

… but even they need tools!
