behavioral application-dependent superscalar core modeling

Behavioral Application-Dependent Superscalar Core Modeling. Ricardo Andrés Velásquez. Advisor: Pierre Michaud. Co-advisor: André Seznec.


Page 1: Behavioral  Application-Dependent  Superscalar Core Modeling

Behavioral Application-Dependent Superscalar Core Modeling

Ricardo Andrés Velásquez

Advisor: Pierre Michaud

Co-advisor: André Seznec

Page 2: Behavioral  Application-Dependent  Superscalar Core Modeling

Introduction – Simulation

19/4/13Behavioral Application-dependent superscalar core modeling - 2

Some engineering fields allow us to build prototypes identical to the target design.

Computer engineering, in contrast, makes extensive use of computer simulation to test the boundaries of a design.

[Figure: Jack Kilby's original integrated circuit (1958), 1 transistor; Intel Core i7 (2012), 731E6 transistors.]

Page 3: Behavioral  Application-Dependent  Superscalar Core Modeling

Introduction – Microarchitecture simulation

Multi/many-core processors
• Complexity "doubles" with every generation.
• Research focuses on the uncore (shared cache, interconnection, main memory, etc.).

Slow simulators
• Simulation complexity increases faster than computer performance.

Wider design space exploration
• Complexity does not allow designers to rely on intuition.
• Designers rely on simulators to compare designs.

Page 4: Behavioral  Application-Dependent  Superscalar Core Modeling

Introduction – Microarchitecture simulation

Various models targeting different objectives.

[Figure: models placed on two axes, accuracy (low to high) and simulation speed. Labels: RTL models, cycle-accurate models (detailed simulation), core models, 1-IPC models, statistical models, empirical models; single core, multi/many core.]

Page 5: Behavioral  Application-Dependent  Superscalar Core Modeling

Contributions I

1. BADCO modeling technique for approximate simulation of modern superscalar cores.

2. Workload stratification methodology for selecting small and representative multiprogram workloads.


Page 6: Behavioral  Application-Dependent  Superscalar Core Modeling

Simulation time – Detailed simulator

[Figure: normalized simulation time of the detailed simulator for each benchmark, split into core and uncore; y-axis from 0 to 1.]

Even worse for multicore architectures!!!

Page 7: Behavioral  Application-Dependent  Superscalar Core Modeling

Core models

[Figure: two block diagrams of a simulated core. In each, a benchmark feeds a functional model / oracle, which drives a temporal model made of a pipeline (Fetch, Alloc., Decode, Exec., Commit) with ITLB, IL1, DTLB and DL1. One diagram connects the core to the uncore (L2, LLC, MM, interconnection, etc.); the other connects it to an L2.]

Page 8: Behavioral  Application-Dependent  Superscalar Core Modeling

Core models

[Figure: a multicore with Core 0, Core 1, ..., Core N-1. Each core has a benchmark, a functional model, a pipeline (Fetch, Alloc., Decode, Exec., Commit) with ITLB, IL1, DTLB and DL1, and an L2. The cores share the uncore: LLC, interconnection and main memory.]

What if our design target is just the uncore?

[Figure: each detailed core is replaced by a core model run by a model simulator (Core 0 Model, Core 1 Model, ..., Core N-1 Model), each with its L2.]

Page 9: Behavioral  Application-Dependent  Superscalar Core Modeling

Core models

Approximate model of a superscalar core that can be connected to a detailed uncore model.

Structural core models
• Emulate internal behavior.
• Model first-order parameters (ROB length, width).
• Interval Simulation, In-N-Out, etc.

Behavioral core models
• Emulate external behavior.
• Derived from detailed simulation.
• PDCM, ASPEN, etc.

Page 10: Behavioral  Application-Dependent  Superscalar Core Modeling

Behavioral Core Models

[Figure: two timelines of uncore requests and responses over cycles, one for uncore A and one for uncore B.]

Two real traces of uncore requests for identical instructions.

Request timing changes in non-obvious ways.

Current practices fail to model the timing changes.

Behavioral core models try to reproduce the external behavior of the core.

Page 11: Behavioral  Application-Dependent  Superscalar Core Modeling

Pairwise Dependent Cache Miss model (PDCM)

K. Lee, S. Evans, and S. Cho, ISPASS 2009.

Trace of retired uops with uncore requests, collected with an ideal L2.

Three kinds of requests: IL1 misses, DL1 load misses and DL1 store misses.

Emulate the ROB to limit the number of parallel requests.

Consider data dependencies between trace items.

SimpleScalar + perfect branch prediction + no HW prefetching.

Page 12: Behavioral  Application-Dependent  Superscalar Core Modeling

PDCM – Simulation flow

[Figure: PDCM simulation flow. Benchmark + core config. → cycle-accurate simulator (zero penalty) → PDCM trace → trace simulator coupled with the uncore simulator, run with each uncore config. → performance.]

Trace generation is performed once for every benchmark and core config. pair: SLOW.

Trace simulation is performed once for every uncore configuration: FAST.

Page 13: Behavioral  Application-Dependent  Superscalar Core Modeling

PDCM – Model building

[Figure: nine retired uops annotated with retirement times (uops 1 to 9: RT = 16, 17, 17, 19, 20, 20, 22, 23, 25), marked as request or non-request uops, then grouped into trace items linked by data dependencies (register + memory).]

RT = retirement time, S = number of uops, W = number of cycles.

Trace items:
• uops 1, 2, 3: S=3, W=17
• uops 4, 5, 6: S=3, W=3
• uops 7, 8: S=2, W=3
• uop 9: S=1, W=2
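The S and W values on this slide can be reproduced from the retirement times alone. The sketch below is illustrative only: the function name and the `item_starts` argument are hypothetical, and the item boundaries (uops 1, 4, 7 and 9) are taken as given from the figure rather than derived from a rule stated in the slides. W is computed as the retirement time of the last uop of an item minus that of the previous item.

```python
from dataclasses import dataclass


@dataclass
class TraceItem:
    uops: list[int]  # uop indices grouped in this item
    size: int        # S: number of uops
    weight: int      # W: cycles since the previous item retired


def build_trace_items(retire_time, item_starts):
    """Group consecutive uops into trace items and compute S and W.

    retire_time: dict mapping uop index -> retirement time (RT)
    item_starts: sorted uop indices where a new item begins (assumed given)
    """
    uops = sorted(retire_time)
    items = []
    prev_rt = 0
    for i, start in enumerate(item_starts):
        end = item_starts[i + 1] if i + 1 < len(item_starts) else uops[-1] + 1
        members = [u for u in uops if start <= u < end]
        last_rt = retire_time[members[-1]]
        items.append(TraceItem(members, len(members), last_rt - prev_rt))
        prev_rt = last_rt
    return items


# Retirement times read from the slide.
RT = {1: 16, 2: 17, 3: 17, 4: 19, 5: 20, 6: 20, 7: 22, 8: 23, 9: 25}
items = build_trace_items(RT, item_starts=[1, 4, 7, 9])
for item in items:
    print(item.uops, "S =", item.size, "W =", item.weight)
```

With these inputs the four items carry (S, W) = (3, 17), (3, 3), (2, 3) and (1, 2), matching the figure.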

Page 14: Behavioral  Application-Dependent  Superscalar Core Modeling

Tuning PDCM to Zesto

[Figure: average CPI error (%) when tuning PDCM to Zesto. Successively adding request types (+TLB_misses, +write_backs, +wrong_path, +prefetch, +delayed_hits) lowers the error from 7.8 for the original PDCM to 4.1 for PDCM++; intermediate bars: 7.2, 6.1, 5.5 and 4.6.]

Average CPI error of PDCM: 4.5% with SimpleScalar vs. 7.8% with Zesto.

Considering additional requests increases accuracy.

Zesto is a highly detailed cycle-level simulator (Loh et al., ISPASS'09).

Page 15: Behavioral  Application-Dependent  Superscalar Core Modeling

PDCM limitations

Different sources of dependencies:
• Data dependencies (register and memory).
• Resource dependencies (queues: LDQ, STQ, etc.).

Resource dependencies impact performance:
• Long-latency accesses.
• Contention for resources.
• More requests on the wrong path.

Tracking all sources of dependencies is complex.

Page 16: Behavioral  Application-Dependent  Superscalar Core Modeling

Behavioral application-dependent superscalar core model – BADCO

New core model inspired by PDCM.

Two cycle-accurate traces:
• Null-latency trace T0: same as PDCM.
• Long-latency trace TL: used to infer dependencies.

Emulate the ROB and level-1 MSHRs to limit the number of parallel requests.

Differentiated processing for instruction requests and store requests.

Page 17: Behavioral  Application-Dependent  Superscalar Core Modeling

BADCO Simulation Flow

[Figure: BADCO simulation flow. Benchmark + core config. → two simulations (zero penalty and long penalty) producing traces T0 and TL → model building → model graph → BADCO machine coupled with the uncore simulator, run with each uncore config.]

Trace generation and model building are performed once for every benchmark and core config. pair: SLOW.

The BADCO machine simulation is performed once for every uncore configuration: FAST.

Page 18: Behavioral  Application-Dependent  Superscalar Core Modeling

Trace Generation

Two traces (Zesto) of retired μops.

T0
• Level-1 cache misses: zero penalty.
• μops annotated with retirement time.
• Captures the fixed cost (W) of μops.

TL
• Level-1 cache misses: long penalty (1000 cycles).
• μops annotated with issue time (IT), completion time (CT) and uncore requests.
• Used to infer and expose dependencies and to capture requests.

[Figure: example μops. In T0, retirement times RT=16 and RT=20 give W=4. In TL, a μop with IT=9, CT=2009 is followed by a dependent μop (IT=2010, CT=2013) and an independent μop (IT=14, CT=19).]
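The example values on this slide suggest how the long-penalty trace exposes dependencies: a μop that could not issue until the long-latency request completed was waiting for it. The sketch below is a simplified illustration, not the actual BADCO model-building algorithm. The function names and the rule "issue time at or after the request's completion time means dependent" are assumptions made for this example.

```python
def fixed_cost(rt_previous, rt_current):
    """W of a uop in T0: cycles between two consecutive retirement times."""
    return rt_current - rt_previous


def classify_uops(request, others):
    """Split uops into those that waited for `request` and those that did not.

    request: (issue_time, completion_time) of a request uop in TL
    others:  dict mapping a uop label -> (issue_time, completion_time) in TL
    Assumed rule: a uop issuing at or after the request's completion
    time is treated as dependent on that request.
    """
    _, request_ct = request
    dependent, independent = [], []
    for label, (issue_time, _completion_time) in others.items():
        if issue_time >= request_ct:
            dependent.append(label)
        else:
            independent.append(label)
    return dependent, independent


# Values read from the slide; the labels "a" and "b" are hypothetical.
w = fixed_cost(16, 20)
dependent, independent = classify_uops(
    request=(9, 2009),
    others={"a": (2010, 2013), "b": (14, 19)},
)
print(w, dependent, independent)
```

A 1000-cycle penalty makes the two cases easy to separate: independent μops issue within a few cycles, dependent ones only after the penalty has elapsed.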

Page 19: Behavioral  Application-Dependent  Superscalar Core Modeling

Model Building

[Figure: step-by-step construction of the BADCO model graph from seven μops. T0 gives the retirement times (RT = 16, 17, 17, 19, 20, 20, 23) and TL gives the issue and completion times. μops are merged one by one into request and non-request nodes.]

RT = retirement time, IT = issue time, CT = completion time.

W = weight (cycles), S = size (μops), D = dependency.

Final graph:
• N1: W=16, S=2, D=0 (μops 1, 3)
• N2: W=1, S=1, D=N1 (μop 2)
• N3: W=2, S=2, D=N1 (μops 4, 6)
• N4: W=4, S=2, D=N3 (μops 5, 7)

Page 20: Behavioral  Application-Dependent  Superscalar Core Modeling

Model Simulation – BADCO Machine

[Figure: animation of the BADCO machine, with Fetch, Exe. and Store stages connected to the uncore. Nodes (for example N1: W=17, S=4, D(N1)=0; N2: W=9, S=8, D(N2)=N1) carry requests such as ITLB, IL1, DTLB, DL1_LD, DL1_ST, DL1_PF, DL1_WB and DL1_HoM. Each frame shows the ROB occupancy and the current cycle, and one frame shows a stall.]

Page 21: Behavioral  Application-Dependent  Superscalar Core Modeling

Evaluation methodology

• Compare single-core accuracy of PDCM and BADCO with respect to Zesto:
  • Quantitative accuracy (3 core configurations).
  • Relative accuracy (6 uncore configurations).

• Compare simulation speed of PDCM and BADCO for a single thread.

• Measure multi-core accuracy and simulation speed of BADCO with respect to Zesto.

Page 22: Behavioral  Application-Dependent  Superscalar Core Modeling

Experimental Setup

Core configurations

Core type | Small | Medium | Big
Decode/issue/commit | 3/4/3 | 3/5/3 | 4/6/4
RS/LDQ/STQ/ROB | 12/12/8/32 | 18/18/12/64 | 36/36/24/128

Common to all core types:
Clock: 3 GHz
IL1 cache: 2 cycles, 32 kB, 4-way, 64-byte line, LRU, next-line prefetcher
ITLB: 2 cycles, 128-entry, 4-way, LRU, 4 kB page
DL1 cache: 2 cycles, 32 kB, 8-way, 64-byte line, LRU, write-back, IP-based stride + next-line prefetchers
DTLB: 2 cycles, 512-entry, 4-way, LRU, 4 kB page
Branch predictor: TAGE 4 kB, BTAC 7.5 kB, indirect branch predictor 2 kB, RAS 16 entries

Uncore configurations

Parameter | Low ("0") | High ("1")
L2 size/latency | 256 kB / 6 cycles | 1 MB / 8 cycles
LLC size/latency | 2 MB / 8 cycles | 16 MB / 24 cycles
FSB width | 2 bytes | 8 bytes

Common to all uncore configurations:
DL1 write buffer: 8 entries
L2: 64-byte line, 8-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + next-line prefetchers
LLC: 64-byte line, 16-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + stream prefetchers
FSB clock: 800 MHz
DRAM latency: 200 cycles

Benchmarks: 22 SPEC2K6 benchmarks + 2 SPEC2K benchmarks (Vortex and Crafty).

Page 23: Behavioral  Application-Dependent  Superscalar Core Modeling

Quantitative Accuracy – Big core

[Figure: CPI error (%) of PDCM, PDCM++ and BADCO for each SPEC2k6 benchmark on the big core; y-axis from -20 to 30.]

Page 24: Behavioral  Application-Dependent  Superscalar Core Modeling

Quantitative Accuracy – Summary

[Figure: average CPI error (%) of PDCM, PDCM++ and BADCO for the small, medium and big cores; y-axis from 0 to 9.]

Page 25: Behavioral  Application-Dependent  Superscalar Core Modeling

Relative Accuracy

In design space exploration, speedup is more relevant than absolute performance.

We would like the speedup error to be minimal.
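To make the distinction between quantitative and relative accuracy concrete, the sketch below computes both errors. The definitions are assumptions made for illustration (relative CPI difference against the detailed simulator, and speedup as the CPI ratio against a reference uncore configuration); the slides do not give the exact formulas, and the CPI numbers are invented, not measured.

```python
def cpi_error(cpi_model, cpi_detailed):
    """Relative CPI error of the model, in percent (assumed definition)."""
    return 100.0 * (cpi_model - cpi_detailed) / cpi_detailed


def speedup(cpi_reference_config, cpi_config):
    """Speedup of a configuration over the reference configuration."""
    return cpi_reference_config / cpi_config


def speedup_error(model_ref, model_cfg, detailed_ref, detailed_cfg):
    """Relative error on the speedup predicted by the model, in percent."""
    s_model = speedup(model_ref, model_cfg)
    s_detailed = speedup(detailed_ref, detailed_cfg)
    return 100.0 * (s_model - s_detailed) / s_detailed


# Invented CPI values: the model is 10% too pessimistic on both configurations.
detailed_ref, detailed_cfg = 2.0, 1.6
model_ref, model_cfg = 2.2, 1.76

print(cpi_error(model_ref, detailed_ref))
print(speedup_error(model_ref, model_cfg, detailed_ref, detailed_cfg))
```

In this invented case the CPI error is about 10% on each configuration, yet the predicted speedup is essentially exact, because a consistent bias cancels out in the ratio.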


Page 26: Behavioral  Application-Dependent  Superscalar Core Modeling

Relative Accuracy

Config.: 256 KB L2, 16 MB LLC and 2-byte bus.

Ref. config.: 256 KB L2, 2 MB LLC and 8-byte bus.

[Figure: speedup error (%) of PDCM, PDCM++ and BADCO for each SPEC2k6 benchmark; y-axis from -10 to 20.]

Page 27: Behavioral  Application-Dependent  Superscalar Core Modeling

Relative Accuracy - Summary

[Figure: average speedup error (%) of PDCM, PDCM++ and BADCO for five uncore configurations, given as L2 size/LLC size/FSB width: 256KB/2MB/2B, 256KB/16MB/2B, 256KB/16MB/8B, 1MB/16MB/2B and 1MB/16MB/8B; y-axis from 0 to 4.]

Ref. config.: 256 KB L2, 2 MB LLC and 8-byte bus.

Page 28: Behavioral  Application-Dependent  Superscalar Core Modeling

Simulation Speed

Simulation speed in MIPS (speedup over Zesto in parentheses):

Simulator | Zesto uncore | Core alone
Zesto | 0.17 | 0.19
PDCM++ | 2.91 (17x) | 13.04 (68x)
BADCO | 2.52 (15x) | 8.82 (47x)

Page 29: Behavioral  Application-Dependent  Superscalar Core Modeling

Multicore simulation: estimated CPI vs. measured CPI

Page 30: Behavioral  Application-Dependent  Superscalar Core Modeling

Simulation speed

Simulation speed in MIPS:

Cores | Zesto | BADCO | Speedup
1 | 0.17 | 2.52 | 14.8x
2 | 0.096 | 2.41 | 25.2x
4 | 0.049 | 1.89 | 38.9x
8 | 0.017 | 1.19 | 68.1x

Page 31: Behavioral  Application-Dependent  Superscalar Core Modeling

Behavioral core modeling summary

• Behavioral core models increase simulation speed by one to two orders of magnitude with respect to detailed simulation.

• PDCM has limitations; we introduce BADCO, a new behavioral core model.

• BADCO models are built from two cycle-accurate simulations.

• BADCO is more accurate than PDCM and PDCM++.

Page 32: Behavioral  Application-Dependent  Superscalar Core Modeling

Contributions II

1. BADCO modeling technique for approximate simulation of modern superscalar cores.

2. Workload stratification methodology for selecting small and representative multiprogram workloads.


Page 33: Behavioral  Application-Dependent  Superscalar Core Modeling

Workload design

Select from the workload space a set of representative workloads.

Single-core
• Workload = 1 benchmark.
• Well-established methods (benchmark design).

Multi-core
• Workload = combination of benchmarks.
• No standard method for workload selection.

Page 34: Behavioral  Application-Dependent  Superscalar Core Modeling

Multiprogram workload selection

The number W of possible multiprogram workloads, with B the number of benchmarks and K the number of cores:

W = \binom{B + K - 1}{K}

For 29 SPEC-CPU benchmarks:

Cores (K) | 2 | 4 | 8 | 16
Workloads (W) | 4.35E+02 | 3.60E+04 | 3.03E+07 | 4.17E+11

Impossible to simulate all possible benchmark combinations.
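The four values on this slide are consistent with counting multisets of K benchmarks chosen among B, that is, combinations with repetition. The short check below assumes identical cores, so that the order of benchmarks within a workload does not matter; the function name is hypothetical.

```python
from math import comb


def num_workloads(num_benchmarks, num_cores):
    """Number of multisets of size num_cores drawn from num_benchmarks."""
    return comb(num_benchmarks + num_cores - 1, num_cores)


for cores in (2, 4, 8, 16):
    print(cores, f"{num_workloads(29, cores):.2E}")
```

For 29 benchmarks this gives 435, 35960, 30260340 and 416714805914 workloads for 2, 4, 8 and 16 cores.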

Page 35: Behavioral  Application-Dependent  Superscalar Core Modeling

Current practices I

Survey 2007 – 2012 (ISCA, MICRO and HPCA)

75 papers

9/75: random sampling.
• Arbitrary sample size.

66/75: class-based selection.
• Benchmark classes selected manually.
• Define workload types.
• Diverse practices to select workloads.
• Arbitrary sample size.

Page 36: Behavioral  Application-Dependent  Superscalar Core Modeling

Current practices II

"Interesting sample": high degree of subjectivity.

A sample may be interesting without being representative of the population.

Be cautious when drawing general conclusions.

Page 37: Behavioral  Application-Dependent  Superscalar Core Modeling

Representative sample?

Probability that a characteristic of the population is preserved in the sample, exactly or within a certain tolerance.

Example characteristics:
• Global throughput.
• Global speedup.
• Global ranking of microarchitectures.

Which of two microarchitectures is better?

TARGET: define the question that you want to ask of the sample, and then look for a way to answer that question.

Page 38: Behavioral  Application-Dependent  Superscalar Core Modeling

Methodology

TARGET: a small representative sample that gives the correct ranking of two microarchitectures.

Case study: 5 shared-cache replacement policies (LRU, RANDOM, FIFO, DIP and DRRIP).

Use approximate simulation (BADCO):
• All benchmark combinations (2 and 4 cores).
• 10000 workloads for 8 cores.

Study random sampling:
• Analytical model to compute sample size.

Study alternative sampling methods.

Page 39: Behavioral  Application-Dependent  Superscalar Core Modeling

Random Sampling

All workloads have the same probability of being selected.

Safe way to avoid biases if the sample is big enough.

Lends itself to analytical modeling.

Page 40: Behavioral  Application-Dependent  Superscalar Core Modeling

Analytical Model

What we want from the random sample is to know whether or not a microarchitecture Y is better than X.

t_Y(w) and t_X(w): per-workload throughput of Y and X.

T_Y and T_X: average throughput of Y and X.

We define the following random variable:

d(w) = t_Y(w) - t_X(w)

d(w) is the per-workload throughput difference. D is the average throughput difference.

Page 41: Behavioral  Application-Dependent  Superscalar Core Modeling

Analytical Model

By the central limit theorem, the sample throughput difference D can be approximated by a normal distribution.

The degree of confidence that Y is better than X is equal to the probability that D is greater than zero.

Assuming almost 100% confidence, and after some math, we obtain a relation between W and cv,

where W is the sample size and cv is the coefficient of variation of d(w).
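The reasoning on this slide can be turned into a small calculation. The sketch below is a textbook consequence of the stated assumptions, not necessarily the exact expression shown on the original slide: if D is the mean of W independent values of d(w) with mean μ > 0 and standard deviation σ, then D is approximately normal with mean μ and standard deviation σ/√W, so P(D > 0) = Φ(√W / cv) with cv = σ/μ. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def confidence(sample_size, cv):
    """P(D > 0) under the normal approximation, with cv = sigma / mu."""
    return NormalDist().cdf(sqrt(sample_size) / cv)


def sample_size_for(target_confidence, cv):
    """Smallest sample size whose confidence reaches the target (0.5 < target < 1)."""
    z = NormalDist().inv_cdf(target_confidence)
    return ceil((z * cv) ** 2)


# The sample size grows with the square of cv for a fixed target confidence.
for cv in (1.0, 2.0, 10.0):
    print(cv, sample_size_for(0.99, cv), confidence(100, cv))
```

The quadratic dependence on cv is why pairs of microarchitectures with similar average throughput (large cv) need very large random samples.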


Page 42: Behavioral  Application-Dependent  Superscalar Core Modeling

Coefficient of variation

The coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution.

[Figure: three normal distributions with μ=0.5, plotted for x from -10 to 10: σ=0.5 (CV=1), σ=1 (CV=2) and σ=5 (CV=10).]

Estimating CV makes it possible to compute the sample size.
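The three curves follow CV = σ/μ. A minimal sketch of estimating CV from per-workload throughput differences is given below; the function name and the input values are hypothetical, and the population standard deviation is used for simplicity (a sample estimator could be substituted).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation divided by mean (the mean must be nonzero)."""
    return pstdev(values) / mean(values)


# Invented per-workload differences d(w) with mean 0.5 and sigma 0.5.
d = [0.0, 1.0, 0.0, 1.0]
print(coefficient_of_variation(d))
```

These invented values have μ=0.5 and σ=0.5, so the result is CV=1, the narrowest of the three cases in the figure.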

Page 43: Behavioral  Application-Dependent  Superscalar Core Modeling

CV Estimation: 4 cores – WSU

[Figure: 1/CV for each pair of replacement policies (LRU>RND, LRU>FIFO, LRU>DIP, LRU>DRRIP, RND>FIFO, RND>DIP, RND>DRRIP, FIFO>DIP, FIFO>DRRIP, DIP>DRRIP), for three series: Sam.-Zesto, Sam.-BADCO and Pop.-BADCO; y-axis from -1.5 to 1.5.]

Page 44: Behavioral  Application-Dependent  Superscalar Core Modeling

Random sampling model validation

[Figure: experimental confidence vs. model confidence that "DRRIP outperforms DIP", using WSU.]

Page 45: Behavioral  Application-Dependent  Superscalar Core Modeling

Can we do better?

Explore alternative sampling techniques:
• Balanced random sampling.
• Stratified sampling:
  • Benchmark classes.
  • Per-workload throughput.

[Figure: sample size (log scale, 1 to 10000) for each pair of replacement policies on 4 cores, for the IPCT, WSU and HSU throughput metrics. Annotation: big samples.]

Page 46: Behavioral  Application-Dependent  Superscalar Core Modeling

Balanced Random Sampling

Each benchmark occurs the same number of times in the whole workload population.

Balanced random sampling: each benchmark occurs the same number of times in the sample.

Probability of selecting a workload depends on the previous workloads selected.

No mathematical model.
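One possible way to draw such a sample is to shuffle a pool in which every benchmark appears equally often and then cut it into workloads. This is a sketch under that assumption, not the procedure used in the original study; the function name and benchmark labels are hypothetical.

```python
import random
from collections import Counter


def balanced_random_sample(benchmarks, num_cores, num_workloads, rng=random):
    """Draw workloads so that each benchmark occurs equally often overall."""
    slots = num_cores * num_workloads
    if slots % len(benchmarks) != 0:
        raise ValueError("num_cores * num_workloads must be a multiple "
                         "of the number of benchmarks")
    # Every benchmark appears the same number of times in the pool.
    pool = list(benchmarks) * (slots // len(benchmarks))
    rng.shuffle(pool)
    # Consecutive chunks of the shuffled pool form the workloads.
    return [tuple(sorted(pool[i:i + num_cores]))
            for i in range(0, slots, num_cores)]


sample = balanced_random_sample(["a", "b", "c"], num_cores=2,
                                num_workloads=6, rng=random.Random(0))
occurrences = Counter(b for workload in sample for b in workload)
print(sample, occurrences)
```

Because the pool is consumed without replacement, the workloads still available at each step depend on the ones already drawn, which matches the remark above about dependent selection probabilities. Note that this sketch does not prevent the same workload from being drawn twice.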


Page 47: Behavioral  Application-Dependent  Superscalar Core Modeling

Stratified Random Sampling

Classical sampling method.

Exploit homogeneities.

Divide the population into non-overlapping subsets (strata).

Take random samples in each stratum.

Sample throughput is a weighted average.

We study 2 variants:
• Benchmark stratification.
• Workload stratification.

Page 48: Behavioral  Application-Dependent  Superscalar Core Modeling

Benchmark stratification

Attempt to formalize common practices:
• Class-based selection.

Divide benchmarks into classes:
• Group benchmarks with similar behavior.

Build strata for every combination of classes.

For example, three classes on a 4-core machine generate 15 strata (004, 013, 022, 112, …).

Class | Benchmarks
Low misses | povray, gromacs, milc, calculix, namd, dealII, perlbench, gobmk, h264ref, hmmer, sjeng
Medium misses | bzip2, gcc, astar, zeusmp, cactusADM
High misses | libquantum, omnetpp, leslie3d, bwaves, mcf, soplex
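The count of 15 strata can be checked by enumerating how many cores run a benchmark from each class. The sketch below assumes that a stratum is identified only by these per-class counts, which is what labels such as 004 and 112 suggest; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def enumerate_strata(num_classes, num_cores):
    """Return one tuple of per-class core counts for every stratum."""
    strata = []
    for combo in combinations_with_replacement(range(num_classes), num_cores):
        counts = Counter(combo)
        strata.append(tuple(counts.get(c, 0) for c in range(num_classes)))
    return strata


strata = enumerate_strata(num_classes=3, num_cores=4)
print(len(strata))
print(["".join(map(str, s)) for s in strata])
```

With three classes and four cores the enumeration yields 15 tuples, including (0, 0, 4), (0, 1, 3), (0, 2, 2) and (1, 1, 2).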


Page 49: Behavioral  Application-Dependent  Superscalar Core Modeling

Workload stratification

Estimate per-workload throughput:
• Approximate simulator (BADCO).
• Large sample (>800 workloads).

Measure the per-workload throughput difference d(w).

Sort the workloads according to d(w).

Use a clustering algorithm to group workloads into strata.
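The steps above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual implementation: equal-size contiguous groups of the sorted workloads stand in for the clustering algorithm, the same number of workloads is drawn from every stratum, and `measure` is a hypothetical callback standing for the detailed simulation of one workload.

```python
import random


def stratified_estimate(approx_d, measure, num_strata, per_stratum, rng=random):
    """Estimate the average d(w) with workload stratification.

    approx_d: dict mapping workload -> d(w) from the approximate simulator
    measure:  function returning the accurate d(w) of one workload
    """
    ordered = sorted(approx_d, key=approx_d.get)  # sort workloads by d(w)
    n = len(ordered)
    # Equal-size contiguous strata (stand-in for a clustering algorithm).
    bounds = [round(i * n / num_strata) for i in range(num_strata + 1)]
    estimate = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        stratum = ordered[lo:hi]
        if not stratum:
            continue
        chosen = rng.sample(stratum, min(per_stratum, len(stratum)))
        stratum_mean = sum(measure(w) for w in chosen) / len(chosen)
        # Each stratum contributes in proportion to its share of the population.
        estimate += stratum_mean * len(stratum) / n
    return estimate


# Invented population in which the accurate and approximate d(w) coincide.
approx = {w: float(w) for w in range(100)}
full = stratified_estimate(approx, lambda w: approx[w],
                           num_strata=4, per_stratum=25)
print(full)
```

Because workloads inside a stratum have similar d(w), a few samples per stratum already give a stable stratum mean, which is the reason smaller samples can suffice compared to plain random sampling.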


Page 50: Behavioral  Application-Dependent  Superscalar Core Modeling

Alternative Sampling Methods

DIP > LRU, IPCT, 4 cores. CV = 10.86.

[Figure: confidence (0.5 to 1) vs. sample size (10 to 800) for four sampling methods: random, bal-random, bench-strata and workload-strata.]

Page 51: Behavioral  Application-Dependent  Superscalar Core Modeling

Alternative Sampling Methods

DRRIP > LRU, IPCT, 4 cores. CV = 2.70.

[Figure: confidence (0.6 to 1) vs. sample size (10 to 160) for four sampling methods: random, bal-random, bench-strata and workload-strata.]

Page 52: Behavioral  Application-Dependent  Superscalar Core Modeling

Practical guidelines

Method intended for incremental modification of a microarchitecture.

Estimate Cv:
• From the sample; increase the sample if necessary.
• Use an approximate simulator.

If:
• Cv > 10: same average performance.
• Cv in [2, 10]: workload stratification.
• Cv < 2: random sampling.

If workload stratification:
• Sample greater than 800 workloads.
• Approximate simulator with good relative accuracy.
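The decision rule on this slide is simple enough to write down directly. In the sketch below the function name is hypothetical, and taking the absolute value of Cv is an assumption made so that a negative Cv (X better than Y on average) is handled the same way.

```python
def choose_sampling_method(cv):
    """Map an estimated coefficient of variation to the suggested approach."""
    magnitude = abs(cv)  # assumption: only the magnitude of Cv matters
    if magnitude > 10:
        return "same average performance"
    if magnitude >= 2:
        return "workload stratification"
    return "random sampling"


# Cv values quoted on the two preceding slides, plus a small one.
for cv in (10.86, 2.70, 1.0):
    print(cv, choose_sampling_method(cv))
```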


Page 53: Behavioral  Application-Dependent  Superscalar Core Modeling

Conclusions I

• Improve the simulation speed of multicore processors by sacrificing some core accuracy.

• Behavioral core models target the uncore.

• BADCO = two traces + inferred dependencies.

• One to two orders of magnitude faster than Zesto.

• Average CPI error of less than 5% with respect to Zesto.

Page 54: Behavioral  Application-Dependent  Superscalar Core Modeling

Conclusions II

• Current practices: non-representative samples.

• Interesting workloads ≠ representative workloads.

• First steps towards defining representative samples.

• It is important to define sample representativeness.

• Analytical model for correctly ranking two microarchitectures.

• Alternative sampling method: workload stratification. Requires approximate simulation.

Page 55: Behavioral  Application-Dependent  Superscalar Core Modeling

Future work

• BADCO models for multi-thread programs?

• BADCO models for studying energy efficiency and heterogeneous architectures.

• Alternative ways of defining representativeness.

• Analytical models to compute sample size.

• Phase behavior of benchmarks.

• Clustering methods for workload stratification.

Page 56: Behavioral  Application-Dependent  Superscalar Core Modeling

THANKS FOR YOUR ATTENTION!!!

QUESTIONS?
