behavioral application-dependent superscalar core modeling

Behavioral Application-Dependent Superscalar Core Modeling

Ricardo Andrés Velásquez

Advisor: Pierre Michaud

Co-advisor: André Seznec

Introduction – Simulation

19/4/13 Behavioral Application-dependent superscalar core modeling

Some engineering fields allow us to build prototypes identical to the target design.

Computer engineering in contrast makes extensive use of computer simulation to test the boundaries of a design.

Jack Kilby's original integrated circuit (1958): 1 transistor.

Intel Core i7 (2012): 731E6 transistors.

Introduction – Microarchitecture simulation

Slow simulators: simulation complexity increases faster than computer performance.

Wider design space exploration: complexity makes it impossible to rely on intuition, so designers rely on simulators to compare designs.

Multi/Many-core Processors

Complexity “doubles” with every generation.

Research focuses on the uncore (shared cache, interconnection, main memory, etc.).


Introduction – Microarchitecture simulation

Various models targeting different objectives.

[Figure: simulation speed versus accuracy (low to high) for different model families: RTL models and cycle-accurate models (detailed simulation), core models, 1-IPC models, statistical models and empirical models, positioned for single-core and multi/many-core simulation.]

Contributions I

1. BADCO modeling technique for approximate simulation of modern superscalar cores.

2. Workload stratification methodology for selecting small and representative multiprogram workloads.


Simulation time – Detailed simulator

[Figure: normalized simulation time (0 to 1) of the detailed simulator for each benchmark, split between core and uncore.]

Even worse for multicore architectures!!!

Core models


[Diagram: a benchmark drives a functional model / oracle, which feeds a temporal model of the pipeline (Fetch, Alloc., Decode, Exec., Commit) with its ITLB, IL1, DTLB and DL1. The core connects to the uncore (L2, LLC, MM, interconnection, etc.).]

Core models


[Diagram: several cores, each with its own benchmark, functional model, ITLB, IL1, DTLB and DL1, sharing the LLC, the interconnection and main memory.]

What if our design target is just the Uncore?

[Diagram: Core 0, Core 1, ..., Core N-1 are each replaced by a core model driven by a model simulator and connected to its own L2.]

Core models

Approximate model of a superscalar core that can be connected to a detailed uncore model.

Structural core models: emulate internal behavior. Model first-order parameters (ROB length, width). Examples: Interval Simulation, In-N-Out, etc.

Behavioral core models: emulate external behavior. Derived from detailed simulation. Examples: PDCM, ASPEN, etc.


Behavioral Core Models

[Figure: timelines of uncore requests and responses over time (cycles) for the same instructions running with two different uncores, uncore A and uncore B.]

Two real traces of uncore requests for identical instructions.

Request timing changes in non-obvious ways.

Current practices fail to model these timing changes.

Behavioral core models try to reproduce the external behavior of the core.

Pairwise Dependent Cache Miss model (PDCM)

K. Lee, S. Evans, and S. Cho, ISPASS 2009.

Trace of retired uops with uncore requests, generated with an ideal L2.

3 kinds of requests: IL1 misses, DL1 load misses and DL1 store misses.

Emulates the ROB to limit the number of parallel requests.

Considers data dependencies between trace items.

SimpleScalar + perfect branch prediction + no HW prefetching.


PDCM – Simulation flow


Step 1 (SLOW): benchmark + core config. -> cycle-accurate simulation with zero penalty -> PDCM trace. Performed once for every benchmark and core configuration pair.

Step 2 (FAST): PDCM trace + uncore config. -> trace simulator + uncore simulator -> performance. Performed once for every uncore configuration.

PDCM – Model building


RT = retirement time, S = number of uops, W = number of cycles.

Retired uops (request and non-request uops):

uop   1   2   3   4   5   6   7   8   9
RT    16  17  17  19  20  20  22  23  25

Resulting trace items:

Item   uops    S   W
1      1,2,3   3   17
2      4,5,6   3   3
3      7,8     2   3
4      9       1   2

Data dependencies between trace items: register + memory.
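The grouping in this example can be sketched in a few lines of Python. This is only an illustration, not the PDCM authors' implementation: it assumes that each trace item starts at a request uop and absorbs the non-request uops that follow, and that W is the increase in retirement time across the item. The function name `build_trace_items` and the choice of uops 1, 4, 7 and 9 as request uops are assumptions made for the example.

```python
def build_trace_items(retire_times, request_uops):
    """Group retired uops into trace items (sketch, see assumptions above).

    retire_times: retirement time of uop 1, 2, ... in program order.
    request_uops: set of 1-based uop indices assumed to issue uncore requests.
    """
    items = []
    prev_rt = 0
    for uop, rt in enumerate(retire_times, start=1):
        # A request uop (or the very first uop) opens a new trace item.
        if uop in request_uops or not items:
            items.append({"uops": [], "S": 0, "W": 0})
        item = items[-1]
        item["uops"].append(uop)
        item["S"] += 1              # size in uops
        item["W"] += rt - prev_rt   # cycles between consecutive retirements
        prev_rt = rt
    return items


# Retirement times read from the slide example.
RT = [16, 17, 17, 19, 20, 20, 22, 23, 25]
for item in build_trace_items(RT, request_uops={1, 4, 7, 9}):
    print(item)
```

With these inputs the four items have (S, W) equal to (3, 17), (3, 3), (2, 3) and (1, 2), matching the slide.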

Tuning PDCM to Zesto


[Figure: CPI error (%) of PDCM with respect to Zesto as additional request types are modeled: PDCM, +TLB_misses, +write_backs, +wrong_path, +prefetch, +delayed_hits. Bar labels: 7.8, 7.2, 6.1, 5.5, 4.6, 4.1. The extended model is labeled PDCM++.]

Average CPI error: 4.5% with SimpleScalar vs. 7.8% with Zesto.

Considering additional requests increases accuracy.

Zesto is a highly detailed cycle-level simulator (Loh et al., ISPASS’09).

PDCM limitations

Different sources of dependencies: data dependencies (register and memory) and resource dependencies (queues: LDQ, STQ, etc.).

Resource dependencies impact performance.

Causes: long-latency accesses, contention for resources, more requests on the wrong path.

Tracking all sources of dependencies is complex.


Behavioral application-dependent superscalar core model – BADCO

New core model inspired by PDCM.

Two cycle-accurate traces: a null-latency trace T0 (same as PDCM) and a long-latency trace TL used to infer dependencies.

Emulates the ROB and the level-1 MSHRs to limit the number of parallel requests.

Differentiated processing for instruction requests and store requests.


BADCO Simulation Flow


Step 1 (SLOW): benchmark + core config. -> simulation with zero penalty (trace T0) and simulation with long penalty (trace TL) -> model building -> model graph. Performed once for every benchmark and core configuration pair.

Step 2 (FAST): model graph + uncore config. -> BADCO machine + uncore simulator. Performed once for every uncore configuration.

Trace Generation

Two traces (Zesto) of retired μops.

T0: level-1 cache misses with zero penalty.

μops annotated with retirement time (RT).

Captures the fixed cost (W) of μops.

TL: level-1 cache misses with long penalty (1000 cycles).

μops annotated with issue time (IT), completion time (CT) and uncore requests.

Used to infer and expose dependencies and to capture requests.

[Figure: example annotations. T0: RT=16 and RT=20, giving W=4. TL: a request with IT=9, CT=2009; a dependent μop with IT=2010, CT=2013; an independent μop with IT=14, CT=19.]

Model Building


RT = retirement time, IT = issue time, CT = completion time.

W = weight (cycles), S = size (μops), D = dependency (node that must complete first, 0 if none).

Input traces (request and non-request μops):

μop   RT (T0)   IT (TL)   CT (TL)
1     16        9         2009
2     17        2010      2013
3     17        14        19
4     19        2011      3012
5     20        3014      3016
6     20        2012      2019
7     23        3013      3021

Model graph after processing the seven μops (request and non-request nodes):

Node   W    S   D    μops
N1     16   2   0    1,3
N2     1    1   N1   2
N3     2    2   N1   4,6
N4     4    2   N3   5,7
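The example on this slide can be reproduced with a short Python sketch. It is one possible reading of the figure, not the actual BADCO model builder, and rests on assumptions: a μop is treated as a request when its TL latency (CT - IT) is at least the long penalty; a μop depends on the earlier request with the latest completion time that is not later than its own issue time; a request always opens a new node; a non-request μop joins the most recent node with the same dependency. The names `Node` and `build_graph` are hypothetical.

```python
from dataclasses import dataclass, field

LONG_PENALTY = 1000  # cycles added to level-1 misses in the TL run


@dataclass
class Node:
    name: str
    dep: str                 # node that must complete first, "0" if none
    weight: int = 0          # W: cycles taken from the T0 retirement times
    uops: list = field(default_factory=list)


def build_graph(rt, it, ct):
    """Build model-graph nodes from T0 (rt) and TL (it, ct) annotations."""
    nodes, requests, node_of = [], [], {}
    prev_rt = 0
    for u in range(len(rt)):
        is_request = ct[u] - it[u] >= LONG_PENALTY
        # Earlier requests that had completed when this uop issued in TL.
        done = [r for r in requests if ct[r] <= it[u]]
        dep = node_of[max(done, key=lambda r: ct[r])] if done else "0"
        target = None
        if not is_request:
            # Join the most recent node that waits for the same node.
            target = next((n for n in reversed(nodes) if n.dep == dep), None)
        if target is None:
            target = Node(f"N{len(nodes) + 1}", dep)
            nodes.append(target)
        target.weight += rt[u] - prev_rt
        target.uops.append(u + 1)
        prev_rt = rt[u]
        if is_request:
            requests.append(u)
            node_of[u] = target.name
    return nodes


# Values read from the slide example.
RT = [16, 17, 17, 19, 20, 20, 23]
IT = [9, 2010, 14, 2011, 3014, 2012, 3013]
CT = [2009, 2013, 19, 3012, 3016, 2019, 3021]
for n in build_graph(RT, IT, CT):
    print(n.name, "W =", n.weight, "S =", len(n.uops), "D =", n.dep, n.uops)
```

Under these assumptions the sketch yields the four nodes of the slide: N1 (W=16, μops 1,3), N2 (W=1, μop 2), N3 (W=2, μops 4,6) and N4 (W=4, μops 5,7).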

Model Simulation – BADCO Machine

[Animation: step-by-step execution of a model graph on the BADCO machine, which has Fetch, Exe. and Store stages. Each node (N1 to N10) carries its weight W, size S, dependency D and a list of uncore requests (ITLB, IL1, DTLB, DL1_LD, DL1_ST, DL1_PF, DL1_WB, DL1_HoM) that are sent to the uncore. The frames show the current cycle and the ROB occupancy at each step, including a STALL.]

Evaluation methodology

• Compare single-core accuracy of PDCM and BADCO with respect to Zesto:

• Quantitative accuracy (3 core configurations).

• Relative accuracy (6 uncore configurations).

• Compare simulation speed of PDCM and BADCO for single thread.

• Measure multi-core accuracy and simulation speed of BADCO with respect to Zesto.


Experimental Setup


Uncore parameters that vary:

Parameter            Low (“0”)          High (“1”)
L2 size/latency      256 kB / 6 cyc.    1 MB / 8 cyc.
LLC size/latency     2 MB / 8 cyc.      16 MB / 24 cyc.
FSB width            2 bytes            8 bytes

Fixed uncore parameters:

DL1 write buffer     8 entries
L2                   64-byte line, 8-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + next-line prefetchers
LLC                  64-byte line, 16-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + stream prefetchers
FSB clock            800 MHz
DRAM latency         200 cycles

Core configurations:

Core type            Small         Medium         Big
Decode/issue/commit  3/4/3         3/5/3          4/6/4
RS/LDQ/STQ/ROB       12/12/8/32    18/18/12/64    36/36/24/128

Common core parameters:

Clock                3 GHz
IL1 cache            2 cycles, 32 kB, 4-way, 64-byte line, LRU, next-line prefetcher
ITLB                 2 cycles, 128-entry, 4-way, LRU, 4 kB page
DL1 cache            2 cycles, 32 kB, 8-way, 64-byte line, LRU, write-back, IP-based stride + next-line prefetchers
DTLB                 2 cycles, 512-entry, 4-way, LRU, 4 kB page
Branch predictor     TAGE 4 kB, BTAC 7.5 kB, indirect branch predictor 2 kB, RAS 16 entries

Benchmarks: 22 SPEC2K6 benchmarks + 2 SPEC2K benchmarks (Vortex and Crafty).

Quantitative Accuracy – Big core


[Figure: CPI error (%), from -20 to 30, of PDCM, PDCM++ and BADCO for each SPEC2k6 benchmark on the big core.]

Quantitative Accuracy – Summary

[Figure: average CPI error (%), from 0 to 9, of PDCM, PDCM++ and BADCO for the small, medium and big cores.]


Relative Accuracy

Design Space Exploration.

Speedup more relevant than absolute performance.

The goal is a minimum speedup error.
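To make the two accuracy notions concrete, here is a small Python sketch. The slides do not give the exact formulas, so the definitions below are assumptions: CPI error is the relative difference between the model CPI and the reference (Zesto) CPI, and speedup error is the relative difference between the speedup predicted by the model and the speedup measured with the reference simulator. All function names are hypothetical.

```python
def cpi_error(cpi_model, cpi_ref):
    """Relative CPI error of the model, in percent (assumed definition)."""
    return 100.0 * (cpi_model - cpi_ref) / cpi_ref


def speedup(cpi_base, cpi_new):
    """Speedup of a new uncore configuration over the baseline one."""
    return cpi_base / cpi_new


def speedup_error(model_base, model_new, ref_base, ref_new):
    """Relative error on the predicted speedup, in percent (assumed definition)."""
    s_model = speedup(model_base, model_new)
    s_ref = speedup(ref_base, ref_new)
    return 100.0 * (s_model - s_ref) / s_ref


# A model that overestimates CPI by the same factor on both configurations
# has a non-zero CPI error but still predicts the speedup correctly.
print(cpi_error(2.2, 2.0))
print(speedup_error(2.2, 1.1, 2.0, 1.0))
```

This illustrates why relative accuracy can be good even when absolute accuracy is imperfect: a consistent bias cancels out in the speedup.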


Relative Accuracy

Config: 256KB L2, 16MB LLC and 2-byte bus.

Ref config.: 256KB L2, 2MB LLC and 8-byte bus.

[Figure: speedup error (%), from -10 to 20, of PDCM, PDCM++ and BADCO for each SPEC2k6 benchmark.]

Relative Accuracy - Summary


[Figure: average speedup error (%), from 0 to 4, of PDCM, PDCM++ and BADCO for five uncore configurations given as L2-size/LLC-size/FSB-width: 256KB/2MB/2B, 256KB/16MB/2B, 256KB/16MB/8B, 1MB/16MB/2B and 1MB/16MB/8B.]

Ref config.: 256KB L2, 2MB LLC and 8-byte bus.

Simulation Speed

Simulation speed in MIPS (speedup over Zesto in parentheses):

Simulator   Zesto uncore   Core alone
Zesto       0.17           0.19
PDCM++      2.91 (17x)     13.04 (68x)
BADCO       2.52 (15x)     8.82 (47x)

Multicore simulation

Estimated CPI vs. measured CPI


Simulation speed

Simulation speed in MIPS:

Cores   Zesto   BADCO   Speedup
1       0.17    2.52    14.8x
2       0.096   2.41    25.2x
4       0.049   1.89    38.9x
8       0.017   1.19    68.1x

Behavioral core modeling summary

• Behavioral core models increase simulation speed by one to two orders of magnitude with respect to detailed simulation.

• PDCM has limitations. We introduce BADCO, a new behavioral core model.

• BADCO models are built from two cycle-accurate simulations.

• BADCO is more accurate than PDCM and PDCM++.


Contributions II

1. BADCO modeling technique for approximate simulation of modern superscalar cores.

2. Workload stratification methodology for selecting small and representative multiprogram workloads.


Workload design

Select from the workload space a set of representative workloads.

Single-core: workload = 1 benchmark. Well-established methods (benchmark design).

Multi-core: workload = combination of benchmarks. No standard method for workload selection.


Multiprogram workload selection

The number W of possible multiprogram workloads is

$$W = \binom{B + K - 1}{K}$$

where B is the number of benchmarks and K is the number of cores.

For 29 SPEC-CPU benchmarks:

Cores   Workloads
2       4.35E+02
4       3.60E+04
8       3.03E+07
16      4.17E+11

It is impossible to simulate all possible benchmark combinations.
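The workload counts can be checked with a few lines of Python. The sketch assumes that a workload is a multiset of K benchmarks chosen among B (the same benchmark may run on several cores and core order does not matter), which is the counting rule that reproduces the numbers on the slide. The function name `num_workloads` is hypothetical.

```python
import math


def num_workloads(num_benchmarks, num_cores):
    """Number of multisets of size K drawn from B benchmarks: C(B + K - 1, K)."""
    return math.comb(num_benchmarks + num_cores - 1, num_cores)


for cores in (2, 4, 8, 16):
    print(cores, f"{num_workloads(29, cores):.2E}")
```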

Current practices I

Survey 2007 – 2012 (ISCA, MICRO and HPCA)

75 papers

9/75: random sampling. Arbitrary sample size.

66/75: class-based selection. Benchmark classes selected manually. Workload types are defined. Diverse practices to select workloads. Arbitrary sample size.


Current practices II

“Interesting sample”: high degree of subjectivity.

A sample may be interesting without being representative of the population.

General conclusions must be drawn with caution.


Representative sample?

A sample is representative when a characteristic of the population is preserved in the sample, exactly or within a certain tolerance, with high probability.

Example characteristics: global throughput, global speedup, global ranking of microarchitectures.

Which of two microarchitectures is better?

TARGET: define a question that you want to ask of the sample, then look for a way to answer that question.


Methodology

TARGET: a small representative sample that ranks two microarchitectures correctly.

Case study: 5 shared-cache replacement policies (LRU, RANDOM, FIFO, DIP and DRRIP).

Use approximate simulation (BADCO): all benchmark combinations for 2 and 4 cores, 10000 workloads for 8 cores.

Study random sampling, with an analytical model to compute the sample size.

Study alternative sampling methods.


Random Sampling

All workloads have the same probability to be selected.

Safe way to avoid biases if the sample is big enough.

Lends itself to analytical modeling.


Analytical Model

What we want from the random sample is to know whether or not a microarchitecture Y is better than X.

tY(w) and tX(w) per-workload throughput of Y and X.

TY and TX average throughput

We define the following random variable:

$$d(w) = t_Y(w) - t_X(w), \qquad D = \frac{1}{W}\sum_{w} d(w) = T_Y - T_X$$

d(w) is the per-workload throughput difference.

D is the average throughput difference.

Analytical Model

By the central limit theorem, the sample average D can be approximated by a normal distribution.

The degree of confidence that Y is better than X is equal to the probability that D is greater than zero.

Assuming almost 100% confidence and after some math we have

Where W is the sample size and cv is the coefficient of variation of d(w).
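The formula itself did not survive on this slide, so the following Python sketch only illustrates the reasoning stated above. Under the normal approximation, the probability that the sample average D is positive is 0.5 * (1 + erf(sqrt(W / 2) / cv)). Requiring the erf argument to reach 2, which corresponds to a confidence of about 99.8%, gives W = 8 * cv^2. The threshold of 2 is an assumption about what "almost 100% confidence" means, and the function names are hypothetical.

```python
import math


def confidence(sample_size, cv):
    """P(D > 0) for a random sample of `sample_size` workloads (normal approx.)."""
    return 0.5 * (1.0 + math.erf(math.sqrt(sample_size / 2.0) / cv))


def sample_size_for(cv, erf_arg=2.0):
    """Smallest W such that sqrt(W / 2) / |cv| >= erf_arg (assumed threshold)."""
    return math.ceil(2.0 * (erf_arg * cv) ** 2)


for cv in (1.0, 2.0, 10.0):
    w = sample_size_for(cv)
    print(f"cv={cv}: W={w}, confidence={confidence(w, cv):.4f}")
```

Because the required sample size grows with the square of cv, a pair of microarchitectures with cv = 10 needs 100 times more workloads than a pair with cv = 1.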


Coefficient of variation

The coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution.


[Figure: normal distributions over the range -10 to 10 with the same mean and increasing dispersion: CV=1 (σ=0.5, μ=0.5), CV=2 (σ=1, μ=0.5) and CV=10 (σ=5, μ=0.5).]

Estimating CV amounts to computing the sample size.
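As an illustration, the coefficient of variation of d(w) can be estimated from per-workload throughput differences as σ/μ. The sketch below uses the population standard deviation and a hypothetical function name `estimate_cv`; the input values are made up for the example and are not results from the study.

```python
import statistics


def estimate_cv(differences):
    """Coefficient of variation (sigma / mu) of per-workload differences d(w)."""
    mean = statistics.fmean(differences)
    if mean == 0:
        raise ValueError("mean difference is zero: CV is undefined")
    return statistics.pstdev(differences) / mean


# Hypothetical d(w) values for a handful of workloads.
d = [0.0, 1.0, 0.0, 1.0]
print(estimate_cv(d))  # sigma = 0.5, mu = 0.5
```

A negative value simply means that Y is worse than X on average; the magnitude is what drives the sample size.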

CV Estimation: 4 cores – WSU

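[Figure: 1/CV, from -1.5 to 1.5, for each pair of replacement policies (LRU>RND, LRU>FIFO, LRU>DIP, LRU>DRRIP, RND>FIFO, RND>DIP, RND>DRRIP, FIFO>DIP, FIFO>DRRIP, DIP>DRRIP), estimated from a Zesto sample (Sam.-Zesto), a BADCO sample (Sam.-BADCO) and the BADCO population (Pop.-BADCO).]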


Random sampling model validation


Experimental confidence vs. model confidence that “DRRIP outperforms DIP” using WSU


Can we do better?

Explore alternative sampling techniques:

Balanced random sampling.

Stratified sampling, based on benchmark classes or on per-workload throughput.

[Figure: required sample size (log scale, 1 to 10000) on 4 cores for each pair of replacement policies (LRU>RND to DIP>DRRIP) and each throughput metric (IPCT, WSU, HSU). Some pairs require big samples.]

Balanced Random Sampling

Each benchmark occurs the same number of times in the whole workload population.

Balanced random sampling: each benchmark also occurs the same number of times in the sample.

Probability of selecting a workload depends on the previous workloads selected.

No mathematical model.
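One simple way to obtain a balanced sample is sketched below in Python. It is an assumption about how such a sample could be drawn, not necessarily the procedure used in the study: a pool containing every benchmark the same number of times is shuffled and cut into workloads of K benchmarks. The function name and parameters are hypothetical, and the sketch does not prevent the same workload from appearing twice.

```python
import random


def balanced_random_sample(benchmarks, cores, n_workloads, seed=None):
    """Draw workloads so that every benchmark appears equally often overall."""
    slots = cores * n_workloads
    if slots % len(benchmarks) != 0:
        raise ValueError("cores * n_workloads must be a multiple of the number of benchmarks")
    rng = random.Random(seed)
    # Every benchmark gets the same number of slots in the pool.
    pool = list(benchmarks) * (slots // len(benchmarks))
    rng.shuffle(pool)
    # Cut the shuffled pool into workloads of `cores` benchmarks each.
    return [tuple(sorted(pool[i:i + cores])) for i in range(0, slots, cores)]


sample = balanced_random_sample(["mcf", "gcc", "milc", "namd"], cores=2, n_workloads=4, seed=1)
print(sample)
```

Because the workloads are cut from one shared pool, the choice of each workload depends on the ones drawn before it, which is why the usual independent-sampling analysis does not apply.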


Stratified Random Sampling

Classical sampling method.

Exploit homogeneities.

Divide the population in non-overlapping subsets (strata).

Take random samples in each stratum.

Sample throughput is a weighted average.

We study 2 variants: benchmark stratification and workload stratification.
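The weighted average can be sketched as follows in Python. This is the textbook stratified estimator, shown here under the assumption that each stratum is weighted by its share of the population; the function name `stratified_mean` and the toy values are hypothetical.

```python
import random
import statistics


def stratified_mean(strata, per_stratum, value, seed=None):
    """Estimate the population mean of value(w) from random picks in each stratum.

    strata: list of non-empty lists of workloads (non-overlapping subsets).
    per_stratum: number of workloads drawn at random from each stratum.
    value: function returning the metric of interest for one workload.
    """
    rng = random.Random(seed)
    total = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum in strata:
        picked = rng.sample(stratum, min(per_stratum, len(stratum)))
        # Each stratum contributes in proportion to its size in the population.
        estimate += len(stratum) / total * statistics.fmean(value(w) for w in picked)
    return estimate


# Toy population: the "workloads" are their own throughput differences.
strata = [[1.0, 1.0, 1.0], [5.0]]
print(stratified_mean(strata, per_stratum=1, value=lambda w: w))
```

When strata are homogeneous, as in this toy case, a single pick per stratum already recovers the population mean, which is the property stratification tries to exploit.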


Benchmark stratification

Attempt to formalize common practices (class-based selection).

Divide benchmarks into classes, grouping benchmarks with similar behavior.

Build a stratum for every combination of classes.

For example, three classes on a 4-core machine generate 15 strata (004, 013, 022, 112, …).


Class           Benchmarks
Low misses      povray, gromacs, milc, calculix, namd, dealII, perlbench, gobmk, h264ref, hmmer, sjeng
Medium misses   bzip2, gcc, astar, zeusmp, cactusADM
High misses     libquantum, omnetpp, leslie3d, bwaves, mcf, soplex
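The count of 15 strata can be checked by enumerating how 4 cores can be distributed over 3 benchmark classes. The Python sketch below represents each stratum as a tuple of per-class counts, which is an assumed reading of labels such as 013 (0 low-miss, 1 medium-miss and 3 high-miss benchmarks, or the same idea with another class order). The function name `class_strata` is hypothetical.

```python
from itertools import combinations_with_replacement


def class_strata(n_classes, cores):
    """List strata as tuples giving how many benchmarks come from each class."""
    strata = []
    for combo in combinations_with_replacement(range(n_classes), cores):
        strata.append(tuple(combo.count(c) for c in range(n_classes)))
    return strata


strata = class_strata(n_classes=3, cores=4)
print(len(strata))
print(["".join(map(str, s)) for s in strata])
```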


Workload stratification

Estimate per-workload throughput using an approximate simulator (BADCO) and a large sample (>800 workloads).

Measure per-workload throughput difference d(w).

Sort the workloads according to d(w).

Use a cluster algorithm to group workloads in strata.
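The steps above can be sketched in Python. The slides do not say which clustering algorithm is used, so this sketch swaps in a deliberately simple one-dimensional rule: sort the workloads by d(w) and cut the sorted list at the largest gaps. The function name `stratify_by_gaps`, the number of strata and the d(w) values are all hypothetical.

```python
def stratify_by_gaps(diffs, n_strata):
    """Group workloads into strata by cutting the sorted d(w) list at its largest gaps.

    diffs: dict mapping a workload identifier to its estimated d(w).
    """
    order = sorted(diffs, key=diffs.get)
    # Positions where a cut could be made, ranked by the size of the gap.
    by_gap = sorted(
        range(1, len(order)),
        key=lambda i: diffs[order[i]] - diffs[order[i - 1]],
        reverse=True,
    )
    cuts = sorted(by_gap[: n_strata - 1])
    strata, start = [], 0
    for cut in cuts + [len(order)]:
        strata.append(order[start:cut])
        start = cut
    return strata


# Hypothetical d(w) estimates obtained with an approximate simulator.
d = {"w1": -0.30, "w2": -0.28, "w3": 0.01, "w4": 0.02, "w5": 0.40}
print(stratify_by_gaps(d, n_strata=3))
```

Workloads with similar d(w) end up in the same stratum, so a few detailed simulations per stratum can stand in for the whole group.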


Alternative Sampling Methods

DIP > LRU, IPCT, 4 cores

CV = 10.86

[Figure: confidence (0.5 to 1) versus sample size (10 to 800 workloads) for random, bal-random, bench-strata and workload-strata sampling.]

Alternative Sampling Methods

DRRIP > LRU, IPCT, 4 cores

CV = 2.70

[Figure: confidence (0.6 to 1) versus sample size (10 to 160 workloads) for random, bal-random, bench-strata and workload-strata sampling.]

Practical guidelines

Method intended for incremental modification of a microarchitecture.

Estimate Cv: from the sample (increase the sample if necessary), using an approximate simulator.

If Cv > 10: the two microarchitectures have the same average performance.

If Cv is in [2, 10]: use workload stratification.

If Cv < 2: use random sampling.

If workload stratification is used: take a sample greater than 800 workloads and use an approximate simulator with good relative accuracy.
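These guidelines amount to a small decision rule, sketched below in Python. The thresholds come from the slide; the function name, the returned labels and the use of the absolute value of Cv are assumptions made for the example.

```python
def choose_sampling_method(cv):
    """Pick a workload selection strategy from the estimated coefficient of variation."""
    magnitude = abs(cv)  # the sign only tells which microarchitecture is ahead
    if magnitude > 10:
        return "same average performance"
    if magnitude >= 2:
        return "workload stratification"
    return "random sampling"


for cv in (0.8, 2.70, 10.86):
    print(cv, "->", choose_sampling_method(cv))
```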


Conclusions I

• Improve the simulation speed of multicore processors by sacrificing some core accuracy.

• Behavioral core models target the uncore.

• BADCO = two traces + inferred dependencies.

• One to two orders of magnitude faster than Zesto.

• Average CPI error of less than 5% with respect to Zesto.


Conclusions II

• Current practices yield non-representative samples.

• Interesting workloads ≠ representative workloads.

• First steps towards defining representative samples.

• It is important to define sample representativeness.

• Analytical model for ranking two microarchitectures correctly.

• Alternative sampling method: workload stratification. Requires approximate simulation.


Future work

• BADCO models for multi-thread programs?

• BADCO models for studying energy efficiency and heterogeneous architectures.

• Alternative definitions of representativeness.

• Analytical models to compute sample size.

• Phase behavior of benchmarks.

• Clustering methods for workload stratification.


THANKS FOR YOUR ATTENTION!!!

QUESTIONS?


ALF
