Performance Monitoring on the Intel® Itanium® 2 Processor

CGO’04 Tutorial, 3/21/04

CK. Luk

Massachusetts Microprocessor Design Center, Intel Corporation

Itanium 2 Performance Monitoring

• The Itanium Architecture defines a generic framework for the Performance Monitoring Unit (PMU):
  – Consistent software APIs across processor models
  – Yet, a processor model can implement its own PMU extension
• Generic PMU support:
  – 4 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
  – 8 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
  – A performance monitor = 1 PMC + N PMDs (where N >= 1)
  – 3 additional status/control registers: PSR, DCR, PMV
• Itanium 2 PMU support:
  – Monitor a rich set (140+) of events
  – 16 PMCs, 18 PMDs (4 for event counting, others for buffering event-specific info.)
    • Can pinpoint exactly where a “miss” event happened in the program
• PMU usages:
  – Workload characterization
  – Profiling

Workload Characterization

• Understand the performance characteristics of the workload
• Two fundamental measures of interest:
  – Event occurrences
    • How often did certain events occur?
  – Cycle accounting
    • How were the cycles spent during execution?

Measuring Event Occurrences

• On Itanium 2, 140+ monitored micro-architectural events:
  – Basic: clock cycles, retired instructions
  – Instruction Dispersal, Instruction Execution, Branch
  – Pipeline Stall
  – Cache, TLB, RSE
  – System, Bus
• Much more information can be derived, e.g.:
  – IPC = IA64_INST_RETIRED / CPU_CYCLES
  – #I-cache refs = L1I_READS + L1I_PREFETCHES
• Multi-occurrence events (> 1 event per cycle)
  – E.g., inst retired, # live entries in the issue queue
  – Thresholding:
    • An event is counted only if it occurs more than the threshold number of times in a cycle
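The derived metrics and the thresholding rule on this slide can be sketched in a few lines of Python. This is only an illustration: the counter values and the per-cycle trace are made up, the helper names are hypothetical, and the thresholding helper assumes the counter advances by one for each cycle in which the event occurs more than the threshold number of times.

```python
def ipc(counts):
    """Instructions per cycle, derived from two raw event counts."""
    return counts["IA64_INST_RETIRED"] / counts["CPU_CYCLES"]


def icache_refs(counts):
    """Total I-cache references: demand reads plus prefetches."""
    return counts["L1I_READS"] + counts["L1I_PREFETCHES"]


def count_with_threshold(per_cycle_occurrences, threshold):
    """Count the cycles in which the event occurred more than `threshold` times.

    Assumption: the counter increments once per qualifying cycle.
    """
    return sum(1 for n in per_cycle_occurrences if n > threshold)


# Made-up counter readings, for illustration only.
counts = {
    "CPU_CYCLES": 1000,
    "IA64_INST_RETIRED": 1500,
    "L1I_READS": 400,
    "L1I_PREFETCHES": 100,
}
print(ipc(counts))          # 1.5
print(icache_refs(counts))  # 500

# Instructions retired in each of six consecutive cycles (made-up trace).
retired_per_cycle = [0, 2, 6, 1, 3, 0]
print(count_with_threshold(retired_per_cycle, 0))  # 4 cycles retired at least 1 instruction
print(count_with_threshold(retired_per_cycle, 2))  # 2 cycles retired at least 3 instructions
```

With a threshold of 0 the multi-occurrence event behaves like a plain "did it happen this cycle" count, which is how a condition such as IA64_INST_RETIRED >= 1 could be expressed.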

Itanium 2 Pipeline with Events

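[Figure: the Itanium 2 pipeline, with the front end (IPG, ROT) feeding the instruction buffer and the back end (EXP, REN, REG, EXE, DET, WRB), annotated with the events observed along it: FE_BUBBLE, FE_LOST_BW, IDEAL_BE_LOST_BW_DUE_TO_FE, BE_LOST_BW_DUE_TO_FE, BACK_END_BUBBLE.FE, DISP_STALLED, INST_DISPERSED, SYLL_NOT_DISPERSED, BE_RSE_BUBBLE, BE_EXE_BUBBLE, BE_L1D_FPU_BUBBLE, BE_FLUSH_BUBBLE.XPN, BE_FLUSH_BUBBLE.BRU. The figure also marks the “Back End not stalled” condition.]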

Cycle Accounting

• Cycle breakdown by stall and flush reasons
• Overlapping stall and flush conditions are prioritized in the reverse pipeline order

[Figure: CPU_CYCLES divided into the components below, labeled in decreasing priority]

  – Branch Mispredict Flush, Exception/Interruption Flush
  – Data Access Stall
  – Scoreboard and Register Dependency Stall
  – RSE Spill/Fill Stall
  – Front End Stall
  – IA64_INST_RETIRED >= 1
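The prioritization rule can be illustrated with a small Python sketch that charges each cycle to the highest-priority condition active in it. The category order follows the figure as listed above (the two flush reasons are merged into one entry because the slide does not separate their relative order); the trace, the "Unstalled" label, and the function names are hypothetical.

```python
from collections import Counter

# Stall/flush categories in decreasing priority, in the order the slide lists them.
PRIORITY = [
    "Branch Mispredict / Exception Flush",
    "Data Access Stall",
    "Scoreboard and Register Dependency Stall",
    "RSE Spill/Fill Stall",
    "Front End Stall",
]
UNSTALLED = "Unstalled"  # hypothetical label for cycles with no stall or flush


def attribute_cycle(active):
    """Charge one cycle to the highest-priority condition in the set `active`."""
    for category in PRIORITY:
        if category in active:
            return category
    return UNSTALLED


def cycle_breakdown(trace):
    """Return each category's share of the cycles in `trace`."""
    totals = Counter(attribute_cycle(active) for active in trace)
    return {category: n / len(trace) for category, n in totals.items()}


# Hypothetical trace: the set of conditions active in each of four cycles.
trace = [
    {"Front End Stall", "Data Access Stall"},  # overlap: the data access stall wins
    {"Front End Stall"},
    set(),
    set(),
]
print(cycle_breakdown(trace))
```

Because every cycle is charged to exactly one category, the shares always sum to 1, which is what makes the breakdown usable as a stacked bar of CPU_CYCLES.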

Profiling

• Common uses of profile feedback:
  – Manual tuning of applications
  – Driving automatic compiler optimizations
• Profiling support on the Itanium 2:
  – Program counter (PC) sampling
  – Miss event address sampling
  – Event qualification (filtering)

Program Counter Sampling

• Two ways to sample the program counter:
  – Time-based:
    • Sample the PC with timer interrupts (when ITC==ITM)
  – Event-based:
    • Sample the PC when an event counter overflows
• PC sampling can be used to derive basic-block execution counts
• But it cannot precisely identify which instructions are causing performance problems
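Event-based sampling is commonly arranged by preloading the counter so that it overflows after a chosen number of events. The sketch below simulates that idea in Python. The 47-bit counter width is an assumption to be checked against the processor manual, and the function names and the event trace are hypothetical.

```python
COUNTER_BITS = 47  # assumed counter width; verify against the processor manual


def preload_value(period, bits=COUNTER_BITS):
    """Initial counter value that makes the counter overflow after `period` events."""
    if not 0 < period < (1 << bits):
        raise ValueError("period must be positive and fit in the counter")
    return (1 << bits) - period


def sample_pcs(event_pcs, period, bits=COUNTER_BITS):
    """Return the PCs seen at each overflow, re-arming the counter after every sample."""
    samples = []
    counter = preload_value(period, bits)
    for pc in event_pcs:
        counter += 1
        if counter >> bits:  # carry out of the top implemented bit: overflow
            samples.append(pc)
            counter = preload_value(period, bits)
    return samples


# Ten hypothetical events, each tagged with the PC that was executing.
event_pcs = [0x4000 + 16 * i for i in range(10)]
print([hex(pc) for pc in sample_pcs(event_pcs, period=4)])  # ['0x4030', '0x4070']
```

A shorter period gives more samples and a finer profile at the cost of more interrupts; the sampled PC is still the interruption point, not necessarily the instruction that caused the event.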

Miss Event Address Sampling

Problems:
  – When a cache miss/branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source
    • The sampled PC is N insts off the actual miss instruction
    • N is non-deterministic due to the nature of the micro-arch.

Solution provided by the Itanium 2:
  – Hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information)

Three Types of EARs

• Instruction EAR (I-EAR)
  – I-cache
  – I-TLB
• Data EAR (D-EAR)
  – D-cache
  – D-TLB
  – ALAT
• Branch Trace Buffer (BTB)

I-EAR, D-EAR, BTB can be activated simultaneously


Instruction EAR

• Instruction Cache
  – Triggers on I-cache misses
  – Records the inst. address and the fetch latency
• Instruction TLB
  – Triggers on I-TLB misses
  – Records the inst. address and who serviced the miss (VHPT or software)


Data EAR

• Data Cache
  – Triggers on D-cache misses
  – Records the load PC, data address, and the load latency
• Data TLB
  – Triggers on D-TLB misses
  – Records the load PC, data address, and who serviced the miss (L2 D-TLB, VHPT, or software)
• ALAT
  – Triggers on ALAT misses
  – Records the PC of the inst (chk.a or ld.c) that missed


Example of a D-EAR Sample

4000000000000450: [MFI] addl r16=4812,r1
                        nop.f 0x0
                        addl r17=4804,r1;;
4000000000000460: [MMI] ld4 r14=[r16];;
                        adds r14=1,r14
                        nop.i 0x0;;
4000000000000470: [MMB] st4 [r17]=r14
                        nop.m 0x0
                        br.ret.sptk.many b0;;

pmd2  = 0x6000000000005c04   data address
pmd3  = 0x7                  (7 cycles)
pmd17 = 0x4000000000000468   load’s pc
          bits 0-1:  slot
          bit 2:     bundle
          bit 3:     valid
          bits 4-63: bundle addr

interrupted pc = 0x4000000000000470
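The pmd17 value in this sample can be unpacked by hand using the bit layout given on the slide. The Python sketch below does exactly that; the function and field names are hypothetical, and the layout is taken from the slide rather than from the reference manual.

```python
def decode_dear_pc(pmd17):
    """Split a D-EAR instruction-address value into the fields listed on the slide."""
    return {
        "slot": pmd17 & 0x3,                # bits 0-1
        "bundle_bit": (pmd17 >> 2) & 0x1,   # bit 2
        "valid": bool((pmd17 >> 3) & 0x1),  # bit 3
        "bundle_addr": pmd17 & ~0xF,        # bits 4-63, low four bits cleared
    }


fields = decode_dear_pc(0x4000000000000468)
print(fields["valid"], fields["slot"], hex(fields["bundle_addr"]))
# True 0 0x4000000000000460
```

The decoded address points at slot 0 of the bundle at 0x4000000000000460, which is the ld4 r14=[r16] instruction, while the interrupted pc (0x4000000000000470) is already one bundle further on.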


Branch Trace Buffer (BTB)

• A circular buffer of 8 entries
  – Captures the last 4 branches
• Can select branches to monitor based on:
  – Taken/not taken, path prediction, target prediction
  – Type (any combo of ip-rel, ret, non-ret indirect)
• Information recorded for each branch:
  – PC of the branch itself
  – Target bundle address of the branch
  – Taken or not
  – Correctly predicted or not
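The circular-buffer behavior can be modeled in a few lines of Python. This sketch assumes that each recorded branch occupies two entries (one for the branch, one for its target), which would be consistent with 8 entries holding the last 4 branches; the class and method names are hypothetical and the model ignores the real register encoding.

```python
from collections import deque

BTB_ENTRIES = 8  # buffer size stated on the slide


class BranchTraceBuffer:
    """Toy model of an 8-entry circular branch trace buffer."""

    def __init__(self):
        # A bounded deque silently drops the oldest entry when full.
        self._entries = deque(maxlen=BTB_ENTRIES)

    def record(self, branch_pc, target_addr, taken, predicted_ok):
        # Assumption: one entry for the branch itself, one for its target.
        self._entries.append(("branch", branch_pc, taken, predicted_ok))
        self._entries.append(("target", target_addr))

    def last_branches(self):
        """Return (branch_pc, target_addr, taken, predicted_ok), oldest first."""
        entries = list(self._entries)
        return [
            (entries[i][1], entries[i + 1][1], entries[i][2], entries[i][3])
            for i in range(0, len(entries), 2)
        ]


btb = BranchTraceBuffer()
for i in range(6):  # six hypothetical branches; only the last four survive
    btb.record(0x1000 + 16 * i, 0x2000 + 16 * i, taken=True, predicted_ok=(i % 2 == 0))

for branch in btb.last_branches():
    print(hex(branch[0]), hex(branch[1]), branch[2], branch[3])
```

Entries are always appended in pairs into an even-sized buffer, so the pairing of branch and target stays aligned after the buffer wraps.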


Event Qualification (Filtering)

• An event is counted only if certain constraints are met
• Constraints supported:
  – Instruction address range check
    • Monitor specific DLLs, functions, loops
  – Instruction opcode match
    • Monitor specific instruction types or register usages
  – Data address range check
    • Focus on particular data structures
  – Event umasks
    • Further specification within an event
      – e.g., bus transactions originated from the core or I/O
  – Instruction set check
    • IA-32 or IA-64
  – Privilege level (2 dimensions):
    • Process vs. system
    • User vs. kernel vs. interrupt
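Conceptually, qualification is a predicate applied to every candidate event before it is counted. The Python sketch below models three of the constraints (instruction address range, data address range, privilege level) in software. All types, field names, and addresses are hypothetical; the real hardware applies these checks through PMU configuration registers, which this sketch does not attempt to model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _in_range(addr, bounds):
    """True if addr lies in the half-open range [start, end)."""
    start, end = bounds
    return start <= addr < end


@dataclass(frozen=True)
class Event:
    name: str
    pc: int
    data_addr: Optional[int]  # None for events with no data access
    privilege: str            # e.g. "user" or "kernel"


@dataclass(frozen=True)
class Qualifier:
    code_range: Optional[Tuple[int, int]] = None
    data_range: Optional[Tuple[int, int]] = None
    privilege: Optional[str] = None

    def accepts(self, event):
        """An event qualifies only if it meets every constraint that is set."""
        if self.code_range is not None and not _in_range(event.pc, self.code_range):
            return False
        if self.data_range is not None:
            if event.data_addr is None or not _in_range(event.data_addr, self.data_range):
                return False
        if self.privilege is not None and event.privilege != self.privilege:
            return False
        return True


def count_qualified(events, qualifier):
    return sum(1 for e in events if qualifier.accepts(e))


# Hypothetical cache-miss events.
events = [
    Event("miss", pc=0x4000, data_addr=0x6000, privilege="user"),
    Event("miss", pc=0x4010, data_addr=0x9000, privilege="user"),
    Event("miss", pc=0x8000, data_addr=0x6008, privilege="kernel"),
]
hot_loop = Qualifier(code_range=(0x4000, 0x4100), privilege="user")
print(count_qualified(events, hot_loop))  # 2
```

Leaving a constraint unset means "do not filter on this dimension", so an empty Qualifier counts every event.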


References

General Itanium PMUs
  – IA-64 Linux Kernel: Design and Implementation, D. Mosberger and S. Eranian (Chapter 9)

Itanium 2-specific PMUs
  – Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (Chapters 10 and 11)
    http://developer.intel.com/design/itanium2/manuals/251110.htm
