Performance Monitoring on the Intel® Itanium® 2 Processor

CGO’04 Tutorial
3/21/04

CK. Luk
chi-keung.luk@intel.com

Massachusetts Microprocessor Design Center
Intel Corporation


Itanium 2 Performance Monitoring

• The Itanium Architecture defines a generic framework for the Performance Monitoring Unit (PMU):
  – Consistent software APIs across processor models
  – Yet, a processor model can implement its own PMU extension
• Generic PMU support:
  – 4 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
  – 8 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
  – A performance monitor = 1 PMC + N PMDs (where N >= 1)
  – 3 additional status/control registers: PSR, DCR, PMV
• Itanium 2 PMU support:
  – Monitor a rich set (140+) of events
  – 16 PMCs, 18 PMDs (4 for event counting, others for buffering event-specific info.)
    • Can pinpoint exactly where a “miss” event happened in the program
• PMU usages:
  – Workload characterization
  – Profiling


Workload Characterization

• Understand the performance characteristics of the workload
• Two fundamental measures of interest:
  – Event occurrences
    • How often did certain events occur?
  – Cycle accounting
    • How were the cycles spent during execution?


Measuring Event Occurrences

• On Itanium 2, 140+ monitored micro-architectural events:
  – Basic: clock cycles, retired instructions
  – Instruction Dispersal, Instruction Execution, Branch
  – Pipeline Stall
  – Cache, TLB, RSE
  – System, Bus
• Much more information can be derived, e.g.:
  – IPC = IA64_INST_RETIRED / CPU_CYCLES
  – #I-cache refs = L1I_READS + L1I_PREFETCHES
• Multi-occurrence events (> 1 event per cycle)
  – E.g., inst retired, # live entries in the issue queue
  – Thresholding:
    • An event is counted only if it occurs more than the threshold in a cycle
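The derived metrics and the thresholding rule above can be sketched in a few lines of Python. This is only an illustration: the counter values are made up, the helper names are hypothetical, and the sketch assumes that a thresholded counter advances by one for each qualifying cycle.

```python
# Hypothetical raw counts, as if read back from the PMU. The numbers are made up.
counts = {
    "CPU_CYCLES": 1_000_000,
    "IA64_INST_RETIRED": 1_800_000,
    "L1I_READS": 240_000,
    "L1I_PREFETCHES": 60_000,
}


def ipc(c):
    """Instructions per cycle: IA64_INST_RETIRED / CPU_CYCLES."""
    return c["IA64_INST_RETIRED"] / c["CPU_CYCLES"]


def icache_refs(c):
    """I-cache references: L1I_READS + L1I_PREFETCHES."""
    return c["L1I_READS"] + c["L1I_PREFETCHES"]


def count_with_threshold(per_cycle_occurrences, threshold):
    """Count the cycles in which the event occurred more than `threshold` times.

    Assumption: the counter advances by one per qualifying cycle.
    """
    return sum(1 for n in per_cycle_occurrences if n > threshold)


# Made-up number of instructions retired in each of seven cycles.
retired_per_cycle = [0, 2, 6, 3, 0, 1, 4]

print(ipc(counts))                                 # 1.8
print(icache_refs(counts))                         # 300000
print(count_with_threshold(retired_per_cycle, 2))  # 3
```

With a threshold of 2, only the cycles that retired 6, 3, and 4 instructions qualify, so the thresholded count is 3.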


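Itanium 2 Pipeline with Events

[Figure: the Itanium 2 pipeline (stages IPG, ROT, EXP, REN, REG, EXE, DET, WRB), split into a Front End and a Back End around the Instruction Buffer and annotated with the events that observe each part. Events shown: FE_BUBBLE, FE_LOST_BW, IDEAL_BE_LOST_BW_DUE_TO_FE, BE_LOST_BW_DUE_TO_FE, BACK_END_BUBBLE.FE, DISP_STALLED, INST_DISPERSED, SYLL_NOT_DISPERSED, BE_RSE_BUBBLE, BE_EXE_BUBBLE, BE_L1D_FPU_BUBBLE, BE_FLUSH_BUBBLE.BRU, BE_FLUSH_BUBBLE.XPN. The figure also carries the label “Back End not stalled”.]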


Cycle Accounting

• Cycle breakdown by stall and flush reasons:

  [Figure: CPU_CYCLES divided into the categories Branch Mispredict Flush, Exception/Interruption Flush, Data Access Stall, Scoreboard and Register Dependency Stall, RSE Spill/Fill Stall, Front End Stall, and IA64_INST_RETIRED >= 1, arranged along an axis labeled “decreasing priority”.]

• Overlapping stall and flush conditions are prioritized in the reverse pipeline order
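A minimal sketch of priority-based cycle accounting follows. It assumes a hypothetical per-cycle trace of active stall and flush conditions, and the priority list simply follows the order in which the categories are listed for the figure; the relative order of entries that share a row in the figure is an assumption. The point is only that each cycle is charged to exactly one category, so the categories sum to the total cycle count.

```python
from collections import Counter

# Assumed priority order, highest first. It follows the order in which the
# categories are listed for the figure and is illustrative only.
PRIORITY = [
    "branch_mispredict_flush",
    "exception_flush",
    "data_access_stall",
    "scoreboard_stall",
    "rse_stall",
    "front_end_stall",
]


def account(cycles):
    """Attribute each cycle to exactly one category.

    `cycles` is a list of sets. Each set holds the stall/flush conditions
    active in that cycle; an empty set means the cycle was not stalled.
    """
    breakdown = Counter()
    for active in cycles:
        # The highest-priority active condition wins the cycle.
        winner = next((c for c in PRIORITY if c in active), "unstalled")
        breakdown[winner] += 1
    return breakdown


# Made-up five-cycle trace with two cycles of overlapping conditions.
trace = [
    set(),
    {"front_end_stall"},
    {"data_access_stall", "front_end_stall"},
    {"rse_stall", "scoreboard_stall"},
    set(),
]

result = account(trace)
print(dict(result))
```

In this trace the cycle with both a data access stall and a front end stall is charged to the data access stall, and the per-category counts add up to the five cycles in the trace.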


Profiling

• Common uses of profile feedback:
  – Manual tuning of applications
  – Driving automatic compiler optimizations
• Profiling support on the Itanium 2:
  – Program counter (PC) sampling
  – Miss event address sampling
  – Event qualification (filtering)


Program Counter Sampling

• Two ways to sample the program counter:
  – Time-based:
    • Sample the PC with timer interrupts (when ITC == ITM)
  – Event-based:
    • Sample the PC when an event counter overflows
• PC sampling can be used to derive basic-block execution counts
• But it can’t precisely identify which instructions are causing performance problems
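As a rough illustration of how sampled PCs can be turned into per-block statistics, the sketch below buckets samples by basic block. The block boundaries and sample addresses are hypothetical. The result is a per-block sample histogram; turning it into execution counts would additionally require normalizing by block size and sampling period, which is not shown.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical basic-block start addresses, sorted in ascending order.
BLOCK_STARTS = [0x1000, 0x1040, 0x10A0, 0x1100]
CODE_END = 0x1140  # first address past the last block (also hypothetical)


def block_of(pc):
    """Return the start address of the block containing `pc`, or None."""
    if not (BLOCK_STARTS[0] <= pc < CODE_END):
        return None
    return BLOCK_STARTS[bisect_right(BLOCK_STARTS, pc) - 1]


def histogram(samples):
    """Count how many sampled PCs fall into each basic block."""
    hist = Counter()
    for pc in samples:
        block = block_of(pc)
        if block is not None:
            hist[block] += 1
    return hist


# Made-up PC samples; 0x2000 lies outside the profiled code and is dropped.
samples = [0x1000, 0x1010, 0x1050, 0x1050, 0x10B0, 0x2000, 0x1040]
hist = histogram(samples)
for start in BLOCK_STARTS:
    print(hex(start), hist[start])
```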


Miss Event Address Sampling

Problems:
  – When a cache miss/branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source
    • The sampled PC is N insts off the actual miss instruction
    • N is non-deterministic due to the nature of the micro-arch.

Solution provided by the Itanium 2:
  – Hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information)


Three Types of EARs

• Instruction EAR (I-EAR)
  – I-cache
  – I-TLB
• Data EAR (D-EAR)
  – D-cache
  – D-TLB
  – ALAT
• Branch Trace Buffer (BTB)

I-EAR, D-EAR, BTB can be activated simultaneously


Instruction EAR

• Instruction Cache
  – Triggers on I-cache misses
  – Records the inst. address and the fetch latency
• Instruction TLB
  – Triggers on I-TLB misses
  – Records the inst. address and who serviced the miss (VHPT or software)


Data EAR

• Data Cache
  – Triggers on D-cache misses
  – Records the load PC, data address, and the load latency
• Data TLB
  – Triggers on D-TLB misses
  – Records the load PC, data address, and who serviced the miss (L2 D-TLB, VHPT, or software)
• ALAT
  – Triggers on ALAT misses
  – Records the PC of the inst (chk.a or ld.c) that missed


Example of a D-EAR Sample

4000000000000450: [MFI] addl r16=4812,r1
                        nop.f 0x0
                        addl r17=4804,r1;;
4000000000000460: [MMI] ld4 r14=[r16];;
                        adds r14=1,r14
                        nop.i 0x0;;
4000000000000470: [MMB] st4 [r17]=r14
                        nop.m 0x0
                        br.ret.sptk.many b0;;

pmd2  = 0x6000000000005c04   (data address)
pmd3  = 0x7                  (7 cycles)
pmd17 = 0x4000000000000468   (load’s pc)
          bits 0-1:  slot
          bit 2:     bundle
          bit 3:     valid
          bits 4-63: bundle addr

interrupted pc = 0x4000000000000470
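The pmd17 value in this sample can be unpacked with a few bit operations, using only the field layout given on the slide. The function name, the field names in the returned dictionary, and the small listing table are hypothetical helpers for illustration.

```python
def decode_dear_pc(pmd17):
    """Split a D-EAR pmd17 value into the fields listed on the slide."""
    return {
        "slot": pmd17 & 0x3,               # bits 0-1: slot within the bundle
        "bundle_bit": (pmd17 >> 2) & 0x1,  # bit 2: labeled "bundle" on the slide
        "valid": bool((pmd17 >> 3) & 0x1), # bit 3: valid
        "bundle_addr": pmd17 & ~0xF,       # bits 4-63: 16-byte-aligned bundle address
    }


# The three bundles from the slide, keyed by bundle address.
LISTING = {
    0x4000000000000450: ["addl r16=4812,r1", "nop.f 0x0", "addl r17=4804,r1"],
    0x4000000000000460: ["ld4 r14=[r16]", "adds r14=1,r14", "nop.i 0x0"],
    0x4000000000000470: ["st4 [r17]=r14", "nop.m 0x0", "br.ret.sptk.many b0"],
}

fields = decode_dear_pc(0x4000000000000468)
instruction = LISTING[fields["bundle_addr"]][fields["slot"]]

print(hex(fields["bundle_addr"]), fields["slot"], fields["valid"])
print(instruction)  # ld4 r14=[r16]
```

The decoded address points at slot 0 of the bundle at 0x4000000000000460, which is the `ld4` that missed, whereas the interrupted PC (0x4000000000000470) already points one bundle further on.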


Branch Trace Buffer (BTB)

• A circular buffer of 8 entries
  – Captures the last 4 branches
• Can select branches to monitor based on:
  – Taken/not taken, path prediction, target prediction
  – Type (any combo of ip-rel, ret, non-ret indirect)
• Information recorded for each branch:
  – PC of the branch itself
  – Target bundle address of the branch
  – Taken or not
  – Correctly predicted or not
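The circular-buffer behavior can be modeled with a bounded deque. The sketch below assumes that each captured branch occupies two of the eight entries (one for the branch PC, one for the target), which is one way to read "8 entries, last 4 branches"; the class and method names are hypothetical, and the taken/prediction flags are left out for brevity.

```python
from collections import deque


class BranchTraceBuffer:
    """Toy model of an 8-entry circular branch trace buffer."""

    def __init__(self, entries=8):
        # A bounded deque silently drops the oldest entry when it is full.
        self._buf = deque(maxlen=entries)

    def record(self, branch_pc, target_addr):
        # Assumption: every branch consumes two entries, source then target.
        self._buf.append(("source", branch_pc))
        self._buf.append(("target", target_addr))

    def branches(self):
        """Return the retained (branch_pc, target_addr) pairs, oldest first."""
        items = list(self._buf)
        return [(items[i][1], items[i + 1][1]) for i in range(0, len(items), 2)]


btb = BranchTraceBuffer()
# Six made-up branches; only the last four fit in the buffer.
for n in range(6):
    btb.record(0x4000 + 0x10 * n, 0x5000 + 0x10 * n)

for pc, target in btb.branches():
    print(hex(pc), "->", hex(target))
```

After six branches have been recorded, the two oldest have been overwritten and the buffer holds the branches at 0x4020 through 0x4050.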


Event Qualification (Filtering)

• An event is counted only if certain constraints are met
• Constraints supported:
  – Instruction address range check
    • Monitor specific DLLs, functions, loops
  – Instruction opcode match
    • Monitor specific instruction types or register usages
  – Data address range check
    • Focus on particular data structures
  – Event umasks
    • Further specification within an event
      – e.g., bus transactions originated from the core or I/O
  – Instruction set check
    • IA-32 or IA-64
  – Privilege level (2 dimensions):
    • Process vs. system
    • User vs. kernel vs. interrupt
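Conceptually, event qualification is a conjunction of predicates applied to each candidate event before it is counted. The sketch below models four of the listed constraints in software; the `Event` and `Qualifier` types, their fields, and the sample addresses are hypothetical and do not correspond to actual PMC register fields.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Event:
    ip: int                   # instruction address that raised the event
    data_addr: Optional[int]  # data address, if the event has one
    isa: str                  # "IA-64" or "IA-32"
    privilege: str            # "user" or "kernel"


@dataclass
class Qualifier:
    """A constraint left as None is not checked."""
    ip_range: Optional[Tuple[int, int]] = None    # half-open [lo, hi)
    data_range: Optional[Tuple[int, int]] = None  # half-open [lo, hi)
    isa: Optional[str] = None
    privilege: Optional[str] = None

    def matches(self, ev):
        if self.ip_range is not None:
            lo, hi = self.ip_range
            if not (lo <= ev.ip < hi):
                return False
        if self.data_range is not None:
            lo, hi = self.data_range
            if ev.data_addr is None or not (lo <= ev.data_addr < hi):
                return False
        if self.isa is not None and ev.isa != self.isa:
            return False
        if self.privilege is not None and ev.privilege != self.privilege:
            return False
        return True


# Made-up events: only the first satisfies every constraint below.
events = [
    Event(0x4000_0450, 0x6000_5C04, "IA-64", "user"),
    Event(0x4000_9000, 0x6000_5C04, "IA-64", "user"),    # outside the IP range
    Event(0x4000_0460, 0x7000_0000, "IA-64", "user"),    # outside the data range
    Event(0x4000_0470, 0x6000_5C08, "IA-64", "kernel"),  # wrong privilege level
]

# Count user-level IA-64 events from one function touching one data structure.
q = Qualifier(
    ip_range=(0x4000_0400, 0x4000_0500),
    data_range=(0x6000_5C00, 0x6000_5D00),
    isa="IA-64",
    privilege="user",
)
counted = sum(1 for ev in events if q.matches(ev))
print(counted)  # 1
```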


References

General Itanium PMUs
  IA-64 Linux Kernel: Design and Implementation, D. Mosberger and S. Eranian. (Chapter 9)

Itanium 2-specific PMUs
  Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (Chapters 10 and 11)
  http://developer.intel.com/design/itanium2/manuals/251110.htm