
A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls

Olaf Lubeck, Ram Srinivasan, Jeanine Cook

Single Processor Efficiency is Critical in Parallel Systems

Deterministic Transport (Kerbyson, Hoisie, Pautz, 2003)

Percent of peak on a single PE? About 5-8% (12-20x less than peak).

Efficiency loss in scaling from 1 to 1000's of PEs? Roughly 2-3x.

Processor Model: A Monte Carlo Approach to Predict CPI

Token Generator: tokens represent instruction classes; maximum rate of 1 token every CPI_I cycles.

Service Centers: delays caused by
• ALU latencies
• Memory latencies
• Branch misprediction

Feedback loop (stalls): latencies interacting with application characteristics stall producer tokens.

Retire: non-producing tokens retire immediately.

Inherent CPI (CPI_I): the best application CPI given no processor stalls, i.e., infinite zero-latency resources.
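To make the token flow concrete, here is a minimal sketch of the Monte Carlo loop, not the authors' ~800-line implementation. It models only load producers; the producer probability, hit rates, and load-to-use pdf below are placeholders (the latencies are Bench's values from later slides), and charging the consumer's stall at producer issue is a simplification of the model's dependence-check accounting.

```c
#include <stdio.h>
#include <stdlib.h>

#define N_TOKENS 1000000
#define CPI_I    1.0   /* inherent CPI: hypothetical value for illustration */

/* Sample a load's service-center latency from hypothetical hit rates. */
static double sample_load_latency(void) {
    double r = (double)rand() / RAND_MAX;
    if (r < 0.90)  return 1.0;    /* L1 hit */
    if (r < 0.97)  return 6.0;    /* L2 hit */
    if (r < 0.995) return 16.0;   /* L3 hit */
    return 260.0;                 /* memory */
}

/* Sample a load-to-use distance (instructions) from a placeholder pdf. */
static int sample_use_distance(void) { return 1 + rand() % 8; }

int main(void) {
    double cycles = 0.0;
    for (long i = 0; i < N_TOKENS; i++) {
        cycles += CPI_I;                         /* token issues */
        if ((double)rand() / RAND_MAX < 0.25) {  /* producer: a load */
            double lat   = sample_load_latency();
            int    dist  = sample_use_distance();
            double stall = lat - dist * CPI_I;   /* consumer waits if > 0 */
            if (stall > 0) cycles += stall;
        }
    }
    printf("predicted CPI = %f\n", cycles / N_TOKENS);
    return 0;
}
```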

Processor Model: A Monte Carlo Approach to Predict CPI

Token Generator: tokens represent instruction classes; maximum rate of 1 token every CPI_I cycles. Producer tokens are routed to the service centers below; non-producing tokens retire immediately.

Service center latencies (cycles): INT: 1, FPU: 4, GSF: 6, BrM: 6; memory hierarchy: L1: 1, L2: 6, L3: 16, TLB: 31, MEM: variable.

Dependence check and stall generator: dependence distance generation, with transition probabilities associated with each path.

Processor Stalls and Characterization of Application Dependence

The major sources of stalls for in-order processors are RAW and WAW dependences.

Token #     1   2               3   4   5   6
Load?       N   Y (hit in L3)   N   N   N   N
Consumer?   N   N               N   N   N   Y (token 2)

Probability distribution: load-to-use distance (instructions).

Based on the path that token 2 has taken, we compute the stall time (and its cause) for token 6: 16 - 4*CPI_I cycles.
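A minimal sketch of that stall computation, assuming the model charges max(0, latency - distance * CPI_I) cycles to the producer's service center, and taking a hypothetical CPI_I of 1:

```c
#include <stdio.h>

/* Stall seen by a consumer: the producer's latency minus the cycles
 * already covered by issuing the intervening tokens. */
static double stall_cycles(double latency, int distance, double cpi_i) {
    double s = latency - distance * cpi_i;
    return s > 0 ? s : 0;
}

int main(void) {
    /* Token 2 hit in L3 (latency 16); token 6 is 4 tokens later. */
    printf("stall for token 6: %.1f cycles\n", stall_cycles(16.0, 4, 1.0));
    return 0;
}
```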

Application pdf’s:

1. Load-to-use

2. FP-to-use

3. INT-to-use

Summary of Model Parameters

• Inherent CPI
– Measured with a binary instrumentation tool

• Instruction classes
– INT, FP, Reg, branch misprediction, loads, non-producing
– Note that stores are retired immediately (treated as non-producers)

• Transition probabilities
– Probabilities of generating each instruction class, computed from binary instrumentation
– Cache hit rates from performance counters; can be predicted from models

• Distribution functions of dependence distances (measured in instructions)
– Load-to-use, FP-to-use, INT-to-use, from binary instrumentation (see the sampling sketch below)

• Processor and memory latencies
– From architecture manuals

1. Parameters are computed in 1-2 hours

2. Model converges in a few secs

3. ~800 lines of C code
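A sketch of how a measured dependence-distance pdf might be sampled by inverse-transform lookup on its CDF; the pdf values below are placeholders, since the real distributions come from binary instrumentation:

```c
#include <stdio.h>
#include <stdlib.h>

/* Placeholder load-to-use pdf: P(distance = i+1) = pdf[i]. */
static const double load_to_use_pdf[] = { 0.30, 0.25, 0.20, 0.15, 0.10 };
#define NBINS (sizeof(load_to_use_pdf) / sizeof(load_to_use_pdf[0]))

/* Inverse-transform sampling: walk the CDF until it exceeds a uniform draw. */
static int sample_load_to_use(void) {
    double r = (double)rand() / RAND_MAX, cum = 0.0;
    for (size_t i = 0; i < NBINS; i++) {
        cum += load_to_use_pdf[i];
        if (r < cum) return (int)i + 1;   /* distance in instructions */
    }
    return (int)NBINS;                    /* guard against rounding */
}

int main(void) {
    for (int i = 0; i < 5; i++)
        printf("sampled load-to-use distance: %d\n", sample_load_to_use());
    return 0;
}
```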

Bench: 1.3 GHz Itanium 2, 3 MB L3 cache, 260-cycle memory latency
Hertz: 900 MHz Itanium 2, 1.5 MB L3 cache, 112-cycle memory latency

Model Accuracy: Constant Memory Latencies

CPI Decomposition on Bench

Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching


1. Linear relationship between prefetch distance and memory latency
2. A late prefetch can increase effective memory latency

This relationship suggests the prefetch-to-load pdf:

Token #       11   12   13   14   15   16
Issue cycle   21   22   25   29   30   31
Load?         N    PF   N    N    Y    N
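A sketch of the late-prefetch effect, under the assumption that a prefetch hides only the cycles between its issue and the dependent load's issue; the 260-cycle memory latency is Bench's value, used here for illustration:

```c
#include <stdio.h>

int main(void) {
    double pf_issue    = 22.0;   /* prefetch issued (token 12) */
    double load_issue  = 30.0;   /* dependent load issued (token 15) */
    double mem_latency = 260.0;  /* full memory latency (Bench's value) */

    double covered   = load_issue - pf_issue;  /* prefetch-to-load distance */
    double effective = mem_latency - covered;  /* what the load still waits */
    if (effective < 0) effective = 0;          /* early prefetch: fully hidden */

    printf("effective load latency = %.0f cycles\n", effective);
    return 0;
}
```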

Model Accuracy: Variable Memory Latencies from Prefetch

Model Extensions: Toward Multicore Chips (CMPs)

Memory latency as a function of outstanding loads: Hertz shows a slope of 27 cycles per outstanding load; Bench, a slope of 101. The slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc. (sketched below).
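A sketch of this linear latency model; the slopes are the empirical values quoted above, while treating the base latency as the single-outstanding-load case is an assumption:

```c
#include <stdio.h>

/* Memory latency grows linearly with the number of outstanding loads. */
static double mem_latency(int outstanding, double base, double slope) {
    return base + slope * (outstanding - 1);
}

int main(void) {
    for (int n = 1; n <= 4; n++)
        printf("%d outstanding: Hertz %3.0f cycles, Bench %3.0f cycles\n",
               n, mem_latency(n, 112.0, 27.0), mem_latency(n, 260.0, 101.0));
    return 0;
}
```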

Application Characteristics: Dependence Distributions

Sixtrack L2 hits: 5.7%; Eon L2 hits: 3.6%. But Eon's L2 stalls are 6x larger than Sixtrack's.

Dependence distances are needed to explain stalls.

What kinds of questions can we explore with the model?

What if the FPU were not pipelined?

What if L2 were removed (a two-level cache)?

What if the processor frequency were changed (power-aware)?

Summary

• Monte Carlo techniques can be effectively applied to micro-architecture performance prediction
– Main advantage: whole-program analysis; predictive and extensible
– Main problems we have seen: small loops are not well predicted, and binary instrumentation for prefetch can take >24 hours

• The model is surprisingly accurate given the architectural and application simplifications

• The distributions used to develop predictive models are significant application characteristics that need to be evaluated

• We are ready to go into "production" mode and apply the model to a number of in-order architectures: Cell and Niagara