performance monitoring on the intel ® itanium ® 2 processor cgo’04 tutorial 3/21/04 ck. luk...
Performance Monitoring on the Intel® Itanium® 2 Processor
CGO’04 Tutorial, 3/21/04
Massachusetts Microprocessor Design Center, Intel Corporation
Itanium 2 Performance Monitoring
• The Itanium Architecture defines a generic framework for the Performance Monitoring Unit (PMU):
  – Consistent software APIs across processor models
  – Yet, a processor model can implement its own PMU extension
• Generic PMU support:
  – 4 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
  – 8 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
  – A performance monitor = 1 PMC + N PMDs (where N >= 1)
  – 3 additional status/control registers: PSR, DCR, PMV
• Itanium 2 PMU support:
  – Monitor a rich set (140+) of events
  – 16 PMCs, 18 PMDs (4 for event counting, others for buffering event-specific information)
    • Can pinpoint exactly where a “miss” event happened in the program
• PMU usages:
  – Workload characterization
  – Profiling
Workload Characterization
• Understand the performance characteristics of the workload
• Two fundamental measures of interest:
  – Event occurrences
    • How often did certain events occur?
  – Cycle accounting
    • How were the cycles spent during execution?
Measuring Event Occurrences
• On Itanium 2, 140+ monitored micro-architectural events:
  – Basic: clock cycles, retired instructions
  – Instruction Dispersal, Instruction Execution, Branch
  – Pipeline Stall
  – Cache, TLB, RSE
  – System, Bus
• Much more information can be derived, e.g.:
  – IPC = IA64_INST_RETIRED / CPU_CYCLES
  – #I-cache refs = L1I_READS + L1I_PREFETCHES
• Multi-occurrence events (> 1 event per cycle)
  – E.g., instructions retired, # live entries in the issue queue
  – Thresholding:
    • An event is counted only if it occurs more than the threshold number of times in a cycle
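The derived metrics and the thresholding rule can be sketched in a few lines of Python. This is an illustration only: the counter values and the per-cycle trace are made-up inputs, the helper names are hypothetical, and the model assumes that a thresholded counter advances by one in each qualifying cycle. Only the event names come from the slide.

```python
def ipc(counts):
    """Instructions per cycle: IA64_INST_RETIRED / CPU_CYCLES."""
    return counts["IA64_INST_RETIRED"] / counts["CPU_CYCLES"]


def icache_refs(counts):
    """I-cache references: L1I_READS + L1I_PREFETCHES."""
    return counts["L1I_READS"] + counts["L1I_PREFETCHES"]


def count_with_threshold(per_cycle_occurrences, threshold):
    """Software model of thresholding (assumption: the counter advances
    by one in each cycle where the event occurs more than `threshold` times)."""
    return sum(1 for n in per_cycle_occurrences if n > threshold)


# Hypothetical raw counter readings, not measured data.
counts = {
    "CPU_CYCLES": 1_000_000,
    "IA64_INST_RETIRED": 1_500_000,
    "L1I_READS": 400_000,
    "L1I_PREFETCHES": 50_000,
}
print(ipc(counts))          # 1.5
print(icache_refs(counts))  # 450000

# Hypothetical trace: number of instructions retired in each of 8 cycles.
retired_per_cycle = [0, 2, 6, 3, 0, 4, 1, 6]
print(count_with_threshold(retired_per_cycle, 3))  # 3
```

With a threshold of 3, only the cycles that retire 4 or more instructions are counted, which is one way to ask "how often did the machine retire more than 3 instructions per cycle?".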
Itanium 2 Pipeline with Events
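[Figure: the Itanium 2 pipeline annotated with stall and bandwidth events. Front End stages: IPG, ROT. Instruction Buffer. Back End stages: EXP, REN, REG, EXE, DET, WRB. Events shown: FE_BUBBLE, FE_LOST_BW, IDEAL_BE_LOST_BW_DUE_TO_FE, BE_LOST_BW_DUE_TO_FE, BACK_END_BUBBLE.FE, DISP_STALLED, INST_DISPERSED, SYLL_NOT_DISPERSED, BE_RSE_BUBBLE, BE_EXE_BUBBLE, BE_L1D_FPU_BUBBLE, BE_FLUSH_BUBBLE.XPN, BE_FLUSH_BUBBLE.BRU. The figure also carries the label "Back End not stalled".]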
Cycle Accounting
• Cycle breakdown by stall and flush reasons:
• Overlapping stall and flush conditions are prioritized in the reverse pipeline order
[Figure: CPU_CYCLES broken down into the following components, with an arrow labeled "decreasing priority":
  – Branch Mispredict Flush
  – Exception/Interruption Flush
  – Data Access Stall
  – Scoreboard and Register Dependency Stall
  – RSE Spill/Fill Stall
  – Front End Stall
  – IA64_INST_RETIRED >= 1]
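A minimal sketch of how such a cycle breakdown could be computed from raw counter values. The mapping from stall categories to event names is an assumption made for illustration (the event names are taken from the pipeline slide, but the slides do not state which event feeds which category), the counter values are hypothetical, and the sketch assumes the remaining cycles are unstalled.

```python
# Assumed mapping from stall/flush categories to pipeline events.
# This pairing is an illustration, not something stated in the slides.
CATEGORY_EVENTS = [
    ("Branch mispredict flush", "BE_FLUSH_BUBBLE.BRU"),
    ("Exception/interruption flush", "BE_FLUSH_BUBBLE.XPN"),
    ("Data access stall", "BE_L1D_FPU_BUBBLE"),
    ("Scoreboard and register dependency stall", "BE_EXE_BUBBLE"),
    ("RSE spill/fill stall", "BE_RSE_BUBBLE"),
    ("Front end stall", "BACK_END_BUBBLE.FE"),
]


def cycle_breakdown(counts):
    """Return (label, cycles, fraction of CPU_CYCLES) rows.

    Because overlapping conditions are prioritized, each stalled cycle is
    charged to exactly one category, so the categories can simply be summed.
    """
    total = counts["CPU_CYCLES"]
    rows = []
    stalled = 0
    for label, event in CATEGORY_EVENTS:
        cycles = counts[event]
        stalled += cycles
        rows.append((label, cycles, cycles / total))
    unstalled = total - stalled  # assumption: the remainder is unstalled
    rows.append(("Unstalled", unstalled, unstalled / total))
    return rows


# Hypothetical counter readings, not measured data.
sample_counts = {
    "CPU_CYCLES": 1000,
    "BE_FLUSH_BUBBLE.BRU": 50,
    "BE_FLUSH_BUBBLE.XPN": 10,
    "BE_L1D_FPU_BUBBLE": 300,
    "BE_EXE_BUBBLE": 150,
    "BE_RSE_BUBBLE": 20,
    "BACK_END_BUBBLE.FE": 70,
}
for label, cycles, fraction in cycle_breakdown(sample_counts):
    print(f"{label:45s} {cycles:6d} {fraction:6.1%}")
```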
Profiling
• Common uses of profile feedback:
  – Manual tuning of applications
  – Driving automatic compiler optimizations
• Profiling support on the Itanium 2:
  – Program counter (PC) sampling
  – Miss event address sampling
  – Event qualification (filtering)
Program Counter Sampling
• Two ways to sample the program counter:
  – Time-based:
    • Sample the PC with timer interrupts (when ITC==ITM)
  – Event-based:
    • Sample the PC when an event counter overflows
• PC sampling can be used to derive basic-block execution counts
• But can’t precisely identify which instructions are causing performance problems
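As a rough sketch of the first step toward basic-block counts, sampled PCs can be attributed to the basic block that contains them. The block start addresses and the samples below are hypothetical, and `attribute_samples` is an illustrative helper, not part of any profiling tool.

```python
from bisect import bisect_right
from collections import Counter


def attribute_samples(samples, block_starts):
    """Count samples per basic block.

    Each sampled PC is charged to the block whose start address is the
    largest one not exceeding the PC. Samples below the first block are dropped.
    """
    starts = sorted(block_starts)
    hits = Counter()
    for pc in samples:
        i = bisect_right(starts, pc) - 1
        if i >= 0:
            hits[starts[i]] += 1
    return hits


# Hypothetical basic-block start addresses and sampled PCs.
block_starts = [0x4000, 0x4040, 0x40A0]
samples = [0x4000, 0x4010, 0x4050, 0x4040, 0x40B0, 0x4060, 0x3FF0]

for start, n in sorted(attribute_samples(samples, block_starts).items()):
    print(f"block {start:#x}: {n} samples")
```

The sample counts are proportional to the time spent in each block; turning them into execution counts additionally requires an estimate of the cost of one execution of each block.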
Miss Event Address Sampling
Problems:
  – When a cache miss/branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source
    • The sampled PC is N instructions off the actual miss instruction
    • N is non-deterministic due to the nature of the micro-architecture
Solution provided by the Itanium 2:
  – Hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information)
Three Types of EARs
• Instruction EAR (I-EAR)
  – I-cache
  – I-TLB
• Data EAR (D-EAR)
  – D-cache
  – D-TLB
  – ALAT
• Branch Trace Buffer (BTB)
I-EAR, D-EAR, and BTB can be activated simultaneously
Instruction EAR
• Instruction Cache
  – Triggers on I-cache misses
  – Records the instruction address and the fetch latency
• Instruction TLB
  – Triggers on I-TLB misses
  – Records the instruction address and who serviced the miss (VHPT or software)
Data EAR
• Data Cache
  – Triggers on D-cache misses
  – Records the load PC, data address, and the load latency
• Data TLB
  – Triggers on D-TLB misses
  – Records the load PC, data address, and who serviced the miss (L2 D-TLB, VHPT, or software)
• ALAT
  – Triggers on ALAT misses
  – Records the PC of the instruction (chk.a or ld.c) that missed
Example of a D-EAR Sample
    4000000000000450: [MFI] addl r16=4812,r1
                            nop.f 0x0
                            addl r17=4804,r1;;
    4000000000000460: [MMI] ld4 r14=[r16];;
                            adds r14=1,r14
                            nop.i 0x0;;
    4000000000000470: [MMB] st4 [r17]=r14
                            nop.m 0x0
                            br.ret.sptk.many b0;;

    pmd2  = 0x6000000000005c04    data address
    pmd3  = 0x7                   latency (7 cycles)
    pmd17 = 0x4000000000000468    load’s pc
              bits 0-1:  slot
              bit 2:     bundle
              bit 3:     valid
              bits 4-63: bundle addr
    interrupted pc = 0x4000000000000470
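The pmd17 value in this sample can be decoded by hand using the bit layout given on the slide. The following is a minimal sketch of that decoding; `decode_dear_pc` is a hypothetical helper, and the bundle bit is returned raw because the slide does not spell out how it should be interpreted.

```python
def decode_dear_pc(pmd17):
    """Split a D-EAR instruction-address value into the fields listed on
    the slide: bits 0-1 slot, bit 2 bundle, bit 3 valid, bits 4-63 bundle addr."""
    return {
        "slot": pmd17 & 0x3,
        "bundle_bit": (pmd17 >> 2) & 0x1,    # returned raw, not interpreted
        "valid": bool((pmd17 >> 3) & 0x1),
        "bundle_addr": pmd17 & ~0xF,         # clear the low 4 bits
    }


fields = decode_dear_pc(0x4000000000000468)
print(fields["valid"])                # True
print(fields["slot"])                 # 0
print(hex(fields["bundle_addr"]))     # 0x4000000000000460
```

The decoded address is slot 0 of the bundle at 0x4000000000000460, which is the `ld4 r14=[r16]` instruction, whereas the interrupted PC (0x4000000000000470) points one bundle later.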
Branch Trace Buffer (BTB)
• A circular buffer of 8 entries
  – Captures the last 4 branches
• Can select branches to monitor based on:
  – Taken/not taken, path prediction, target prediction
  – Type (any combination of ip-rel, ret, non-ret indirect)
• Information recorded for each branch:
  – PC of the branch itself
  – Target bundle address of the branch
  – Taken or not
  – Correctly predicted or not
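A toy software model of such a circular buffer is sketched below. It assumes that each branch occupies two entries (one for the branch PC, one for the target), which would explain why 8 entries hold the last 4 branches; that layout, the class name, and the addresses are assumptions made for illustration and do not describe the actual register format.

```python
from collections import deque


class BranchTraceBuffer:
    """Toy model: a circular buffer where each branch takes two entries."""

    def __init__(self, entries=8):
        # A deque with maxlen silently drops the oldest entries when full.
        self._buf = deque(maxlen=entries)

    def record(self, branch_pc, target, taken, predicted_correctly):
        # Assumed layout: a source entry followed by a target entry.
        self._buf.append(("source", branch_pc, taken, predicted_correctly))
        self._buf.append(("target", target))

    def branches(self):
        """Return (branch_pc, target, taken, predicted_correctly) tuples,
        oldest first, for the branches still held in the buffer."""
        items = list(self._buf)
        return [
            (items[i][1], items[i + 1][1], items[i][2], items[i][3])
            for i in range(0, len(items), 2)
        ]


btb = BranchTraceBuffer()
# Record six hypothetical branches; only the last four should survive.
for n in range(6):
    btb.record(0x4000 + 16 * n, 0x5000 + 16 * n, taken=True,
               predicted_correctly=(n % 2 == 0))

for branch in btb.branches():
    print(hex(branch[0]), "->", hex(branch[1]), branch[2], branch[3])
```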
Event Qualification (Filtering)
• An event is counted only if certain constraints are met
• Constraints supported:
  – Instruction address range check
    • Monitor specific DLLs, functions, loops
  – Instruction opcode match
    • Monitor specific instruction types or register usages
  – Data address range check
    • Focus on particular data structures
  – Event umasks
    • Further specification within an event
      – e.g., bus transactions originated from the core or I/O
  – Instruction set check
    • IA-32 or IA-64
  – Privilege level (2 dimensions):
    • Process vs. system
    • User vs. kernel vs. interrupt
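The idea that an event is counted only when every configured constraint holds can be modeled in software as a conjunction of optional filters. The sketch below covers only a subset of the constraints (address ranges, instruction set, privilege level); the `Event` and `Qualifier` types and all addresses are hypothetical and do not correspond to the hardware's configuration registers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Event:
    pc: int
    data_addr: Optional[int]
    isa: str          # "IA-64" or "IA-32"
    privilege: str    # e.g. "user" or "kernel"


@dataclass
class Qualifier:
    """Each field left as None means 'no constraint'."""
    code_range: Optional[Tuple[int, int]] = None   # [lo, hi)
    data_range: Optional[Tuple[int, int]] = None   # [lo, hi)
    isa: Optional[str] = None
    privilege: Optional[str] = None

    def accepts(self, ev: Event) -> bool:
        if self.code_range is not None:
            lo, hi = self.code_range
            if not lo <= ev.pc < hi:
                return False
        if self.data_range is not None:
            lo, hi = self.data_range
            if ev.data_addr is None or not lo <= ev.data_addr < hi:
                return False
        if self.isa is not None and ev.isa != self.isa:
            return False
        if self.privilege is not None and ev.privilege != self.privilege:
            return False
        return True


def count_qualified(events, qualifier):
    """Count only the events that satisfy every configured constraint."""
    return sum(1 for ev in events if qualifier.accepts(ev))


# Hypothetical event stream.
events = [
    Event(0x4000, 0x6000, "IA-64", "user"),
    Event(0x4010, 0x9000, "IA-64", "user"),    # data address out of range
    Event(0x8000, 0x6008, "IA-64", "user"),    # code address out of range
    Event(0x4020, 0x6010, "IA-64", "kernel"),  # wrong privilege level
    Event(0x4030, 0x6018, "IA-64", "user"),
]
q = Qualifier(code_range=(0x4000, 0x5000), data_range=(0x6000, 0x7000),
              isa="IA-64", privilege="user")
print(count_qualified(events, q))  # 2
```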
References
General Itanium PMUs
  – D. Mosberger and S. Eranian, IA-64 Linux Kernel: Design and Implementation (Chapter 9)
Itanium 2-specific PMUs
  – Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (Chapters 10 and 11)
    http://developer.intel.com/design/itanium2/manuals/251110.htm