
Rigorous Benchmarking in Reasonable Time

Tomas Kalibera, Richard Jones

University of Kent

ISMM, Seattle, June 2013

What do we want to establish?

By comparing an old and a new system rigorously, we want to find:

Is there a performance change?

How large is the change (e.g. the ratio of new execution time to old execution time)?

What variation do we expect?

How confident are we of the result?

How many experiments must we carry out?


Uncertainty

Computer systems are complex.

Many factors influence performance: Some known. Some out of experimenter’s control. Some non-deterministic.

Execution times vary.

We need to design experiments and summarise results in a repeatable and reproducible fashion.

Uncertainty should be reported!

[Bar chart: papers published in 2011 in PLDI, ASPLOS, ISMM, TOPLAS, TACO, and in total. Of 122 papers, 67 reported execution time, 59 reported an execution time ratio, and 47 ignored uncertainty: 70% ignored uncertainty.]

How were the experiments performed?

It is not always obvious whether experiments were repeated.

Very few papers report repetition at more than one level, e.g. repeated executions (invocations of a JVM) and repeated measurements (iterations of an application).

Number of repetitions: arbitrary or heuristic-based?

Good experimental methods take time

One benchmark… a suite… add invocations… and iterations… …and heap sizes.


A lost cause?

Is statistically rigorous experimental methodology simply infeasible?

NO!

With some initial one-off investment, we can cater for variation without excessive repetition (in most cases).

Our contributions:

A sound experimental methodology that makes best use of experiment time.

How to establish how much repetition is needed.

How to estimate error bounds.

The Challenge of Reasonable Repetition

Variation arises at several stages of a benchmark experiment — iteration, execution, compilation…

Controlled variables: platform, heap size or compiler options.

Random variables: described by their statistical properties.

Uncontrolled variables: try to convert these to controlled or randomised variables (e.g. by randomising link order).

The challenge: how to design efficient experiments given the random variables present, and how to summarise the results with a confidence interval.

Our running example

An experiment with 3 "levels" (though our technique is general); a minimal sketch of such a driver follows this list:

1. Repeat compilation to create a binary — e.g. if code performance depends on layout.

2. Repeat executions of the same binary.

3. Repeat iterations of a benchmark.
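Below is a minimal, hedged sketch of a driver for such a 3-level experiment. It is not the authors' tooling: build_binary and invoke_vm are hypothetical placeholders standing in for a real build step (e.g. with randomised code layout) and a real VM invocation that reports per-iteration times.

```python
import random

def build_binary(seed):
    # Placeholder for a real build step, e.g. compiling with a randomised link order.
    return f"benchmark-build-{seed}"

def invoke_vm(binary, iterations):
    # Placeholder for one execution (e.g. one JVM invocation) that returns
    # the per-iteration times reported by the benchmark harness, in seconds.
    return [1.0 + random.gauss(0, 0.05) for _ in range(iterations)]

def experiment(n_binaries, n_executions, n_iterations):
    results = {}                                  # (binary, execution) -> iteration times
    for b in range(n_binaries):                   # level 3: repeat compilation
        binary = build_binary(b)
        for e in range(n_executions):             # level 2: repeat executions of the binary
            results[(b, e)] = invoke_vm(binary, n_iterations)   # level 1: iterations
    return results
```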


Independent state

Researchers are typically interested in steady state performance.

Initialised state: no significant initialisation overhead.

Independent state: iteration times are (statistically) independent and identically distributed (IID).

Don’t repeat measurements before independence. If measurements are not IID, the variance and confidence interval estimates will be biased.

Independent state

Does a benchmark reach an independent state? After how many iterations?

DaCapo on OpenJDK 7, 'large' and 'small' sizes; 3 executions, 300 iterations per execution.

Inspect run-sequence, lag and auto-correlation plots for patterns indicating dependence (a sketch of such an inspection follows).

RECOMMENDATION: Use this manual procedure just once to find how many iterations each benchmark, VM and platform combination requires to reach an independent state.
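A hedged sketch (not the authors' tool) of how such an inspection could be scripted, assuming `times` holds the per-iteration wall-clock times of one execution after discarding warm-up iterations:

```python
import numpy as np
import matplotlib.pyplot as plt

def autocorrelation(x, max_lag=50):
    # Sample autocorrelation r_k for lags 1..max_lag.
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def inspect(times):
    times = np.asarray(times, dtype=float)
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
    ax1.plot(times)                                  # run-sequence: look for drift and trends
    ax1.set_title("run sequence")
    ax2.scatter(times[:-1], times[1:], s=5)          # lag plot: look for lag-1 structure
    ax2.set_title("lag plot")
    r = autocorrelation(times, max_lag=min(50, len(times) // 2))
    ax3.bar(range(1, len(r) + 1), r)                 # should stay near zero for IID data
    for bound in (2 / np.sqrt(len(times)), -2 / np.sqrt(len(times))):
        ax3.axhline(bound, linestyle="--", linewidth=0.5)
    ax3.set_title("autocorrelation")
    plt.tight_layout()
    plt.show()
```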

Reached independent state?

[Per-benchmark run-sequence plots for DaCapo (avrora9, bloat6, chart6, eclipse6, eclipse9, fop6, fop9, h29, hsqldb6, jython6, jython9, luindex6, luindex9, lusearch9, pmd6, pmd9, sunflow9, tomcat9, tradebeans9, tradesoap9, xalan6, xalan9) on Intel Xeon (2 processors x 4 cores x 2-way HT) and AMD Opteron (4 processors x 16 cores), for the DaCapo 'small' and 'large' sizes.]

Some benchmarks don't reach an independent state

Many benchmarks do not reach an independent state in a reasonable time. Most have strong auto-dependencies: gradual drift and trends in iteration times (increases and decreases), abrupt state changes, systematic transitions.

The choice of iteration significantly influences a result. This is problematic for online algorithms that distinguish small differences even though the noise is many times larger.

Fortunately, trends tend to be consistent across runs.

RECOMMENDATION: If a benchmark does not reach an independent state in a reasonable time, take the same iteration from each run.

Heuristics don't do well

Iterations needed to reach the initialised and independent states, compared with the warm-up iterations chosen by heuristics (the DaCapo harness and Georges et al.):

(columns: Initialised, Independent, Harness, Georges)
bloat 2 4 8 ∞
chart 3 4 1
eclipse 5 7 7 4
fop 10 180 7 8
hsqldb 6 6 8 15
jython 3 5 2
luindex 13 4 8
lusearch 10 85 7 8
pmd 7 4 1
xalan 6 13 15 139

The heuristics can waste time or be unusable; the initialised state is reached in reasonable time.

What to repeat?

Run a benchmark to independence and then repeat a number of iterations, collecting each result? Or repeatedly run a benchmark until it is initialised and then collect a single result?

The first method saves experimentation time if variation between iterations > variation between executions, initialisation warm-up + VM initialisation is large, and independence warm-up is small.

Variation (%), AMD Opteron (4 processors x 16 cores):

            bloat6   eclipse9   lusearch9   xalan6   xalan9
Iteration   14.1     0.8        3.3         7.0      3.5
Execution   3.7      0.4        30.3        9.1      1.0

A clear but rigorous account

Goal: we want to quantify a performance optimisation in the form of an effect size confidence interval, e.g. "we are 95% confident that system A is faster than system B by 5.5% ± 2.5%".

We need to repeat executions and take multiple measurements from each.

For a given experimental budget, we want to obtain the tightest possible confidence interval.

Adding repetition at the highest level always increases precision, but it is often cheaper to add repetitions at lower levels.
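As an illustration with made-up numbers only: an effect size statement of that form corresponds directly to a confidence interval on the ratio of mean execution times, e.g.

```latex
\text{speedup} = \frac{\overline{t}_{B}}{\overline{t}_{A}} \in [1.030,\ 1.080]
\;\Longrightarrow\;
\text{``we are 95\% confident that A is faster than B by } 5.5\% \pm 2.5\%\text{''}
```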

Multi-level repetition

How many repetitions to do at which levels?

1. Run an initial, dimensioning experiment. Gather the cost of a repetition at each level: iteration — the time to complete an iteration; execution — more expensive, since we must first reach an independent state. Calculate the optimal repetition counts for the real experiment.

2. Run the real experiment. Use the optimal repetition counts from the dimensioning experiment, and calculate the effect size confidence interval.

Initial (dimensioning) experiment

Choose arbitrary repetition counts r1, …, rn: 20 may be enough, 30 if possible, 10 if you must (e.g. if there are many levels).

Then measure the cost of each level, e.g.
c1 — time to get an iteration (iteration duration);
c2 — time to get an execution (time to reach the independent state);
c3 — time to get a binary (build time).

Also take the measurement times Y_{jn,…,j1}, e.g. Y_{2,1,3} = the time of the 3rd non-warm-up iteration from the 1st execution of the 2nd binary.

Variance estimators (initial experiment)

First calculate the n biased estimators S1², …, Sn², then the unbiased estimators Ti² iteratively.

A mean taken over all values of an index is denoted by a bullet in that index position.
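A hedged LaTeX reconstruction of the form these estimators take for a fully nested design, consistent with the definitions above (consult the paper for the authoritative statement):

```latex
% Biased estimator at level i: the sample variance of the level-i means within each
% enclosing group, averaged over all enclosing groups (a bullet marks an averaged-out index).
S_i^2 = \frac{1}{\prod_{k=i+1}^{n} r_k}
        \sum_{j_n=1}^{r_n} \cdots \sum_{j_{i+1}=1}^{r_{i+1}}
        \frac{1}{r_i - 1} \sum_{j_i=1}^{r_i}
        \left( \overline{Y}_{j_n \ldots j_i \bullet \cdots \bullet}
             - \overline{Y}_{j_n \ldots j_{i+1} \bullet \cdots \bullet} \right)^{2}

% Unbiased estimators, built iteratively from the lowest level upwards:
T_1^2 = S_1^2, \qquad
T_i^2 = S_i^2 - \frac{S_{i-1}^2}{r_{i-1}} \quad (i = 2, \ldots, n)
```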

Optimal repetition counts (real experiment)

The optimal repetition counts to be used in the real experiment are r1, …, rn−1.

We don't calculate rn, the repetition count for the highest level: rn can always be increased for more precision.

Calculate the variance estimator Sn² for the real experiment as before, but using the optimal repetition counts r1, …, rn−1 and the measurements from the real experiment.
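A hedged Python sketch (my reconstruction, not the authors' code) of how the dimensioning data can be reduced to variance estimators and repetition counts, assuming the classic optimal-allocation form r_i = ceil( sqrt( (c_{i+1} T_i²) / (c_i T_{i+1}²) ) ) for the lower levels:

```python
import math
import numpy as np

def variance_estimators(Y):
    # Y: array of shape (r_n, ..., r_1) from the dimensioning experiment;
    # axis 0 is the highest level (e.g. binaries), the last axis the iterations.
    # Assumes every repetition count is at least 2. Returns (S, T), the biased
    # and unbiased per-level estimators ordered from level 1 upwards.
    Y = np.asarray(Y, dtype=float)
    r = Y.shape[::-1]                       # r[0] = r_1, ..., r[-1] = r_n
    S, means = [], Y
    for i in range(Y.ndim):
        gm = means.mean(axis=-1, keepdims=True)                   # means one level up
        per_group = ((means - gm) ** 2).sum(axis=-1) / (r[i] - 1)
        S.append(float(np.mean(per_group)))                       # average over enclosing groups
        means = gm[..., 0]
    T = [S[0]]
    for i in range(1, Y.ndim):
        T.append(S[i] - S[i - 1] / r[i - 1])    # T_i^2 = S_i^2 - S_{i-1}^2 / r_{i-1}
    return S, T

def optimal_counts(T, costs):
    # costs[i] is the cost of one repetition at level i+1 (c_1 ... c_n).
    # Returns r_1 ... r_{n-1}; the top-level count is chosen separately
    # (it can always be increased for more precision).
    eps = 1e-12                              # guard: unbiased estimates can dip below zero
    return [max(1, math.ceil(math.sqrt((costs[i + 1] * max(T[i], eps)) /
                                       (costs[i] * max(T[i + 1], eps)))))
            for i in range(len(T) - 1)]
```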

Confidence intervals (real experiment)

Asymptotic confidence interval with confidence (1 − α), where t_{1−α/2, ν} is the (1 − α/2)-quantile of the t-distribution with ν = rn − 1 degrees of freedom.

See the ISMM'13 paper for details of constructing confidence intervals of execution time ratios.

See our technical report for proofs and gory details.
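A hedged reconstruction of the interval this describes, using the grand mean of the real experiment's measurements and the top-level variance estimator:

```latex
\overline{Y}_{\bullet \cdots \bullet} \;\pm\;
t_{1-\alpha/2,\;\nu} \sqrt{\frac{S_n^2}{r_n}},
\qquad \nu = r_n - 1
```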

Confidence interval for execution time ratios

Confidence interval due to Fieller (1954). The averages Ȳ and Ȳ′ are the mean execution times from the old and new systems; the variance estimators Sn² and S′n² and half-widths h, h′ are as before.
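A hedged sketch of the standard Fieller form for the ratio of two independent means in this notation (my reconstruction; the paper gives the exact variant used):

```latex
% Confidence interval for the ratio \overline{Y} / \overline{Y}' of old to new mean
% execution time, with half-widths h and h' computed as on the previous slide:
\frac{\overline{Y}\,\overline{Y}' \;\pm\;
      \sqrt{\overline{Y}^{2}\,\overline{Y}'^{2}
            - \left(\overline{Y}^{2} - h^{2}\right)\!\left(\overline{Y}'^{2} - h'^{2}\right)}}
     {\overline{Y}'^{2} - h'^{2}}
```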

In practice

For each benchmark/VM/platform…

Conduct a dimensioning experiment to establish the optimal repetition counts for every level but the top level of the real experiment.

Redimension only if the benchmark, VM or platform changes.

DaCapo (revisited)

The confidence half-intervals using the optimal repetition counts correspond closely to those obtained by running large numbers of executions (30) and iterations (40), but the repetition counts are much lower. E.g. lusearch: r1 = 1, so time is better spent repeating executions.

AMD Opteron (4 processors x 16 cores):

                             bloat6   lusearch9   xalan6   xalan9
c1 (s)                       35.5     1.7         10.8     6.7
c2 (s)                       110.0    12.3        3.4      30.2
r1                           10       1           2        15
Half-interval, optimal (%)   14.0     3.4         7.2      3.5
Half-interval, original (%)  14.1     3.3         7.0      3.5

Conclusions

Researchers should provide measures of variation when reporting results.

DaCapo and SPEC CPU benchmarks need very different repetition counts on different platforms before they reach an initialised or independent state.

Iteration execution times are often strongly auto-dependent: for these, automatic detection of a steady state is not applicable, and such heuristics can waste time or mislead.

A one-off (per benchmark/VM/platform) dimensioning experiment can provide the optimal counts for repetition at each level of the real experiments.


RECOMMENDATION: Benchmark developers should include our dimensioning methodology as a one-off, per-system configuration requirement.



Code layout experiments

What's of interest?

Mean execution times.

A minimum threshold for the ratio of execution times: we are only interested in 'significant' performance changes, and improvements in systems research are often small, e.g. 10%.

Many factors influence performance, e.g. memory placement, randomised compilation algorithms, the JIT compiler, symbol names… [Mytkowicz et al., ASPLOS 2009; Gu et al., Component and Middleware Performance Workshop 2004]

Randomisation to avoid measurement bias, e.g. the Stabilizer tool [Curtsinger & Berger, UMass TR, 2012].

Current best practice

Based on 2-level hierarchical experiments: repeat measurements until the standard deviation of the last few measurements is small enough.

Quantify changes using a visual test or a statistical significance test. [Georges et al., OOPSLA 2007; PhD thesis 2008]

Problems: two levels are not always appropriate; null hypothesis significance tests are deprecated in other sciences; visual tests are overly conservative.

Null hypothesis significance tests

Null hypothesis: "the two systems have the same performance".

A test asks whether the null hypothesis can be rejected: "it is unlikely that the systems have the same performance".

Two such tests: Student's t-test and the visual test.

Visual test

Construct confidence intervals. Do they overlap?

If not, it is unlikely that the systems have the same performance.

[If there is only slight overlap — the centre of neither interval is covered by the other CI — fall back to a statistical test. A small sketch of this rule follows.]
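A minimal sketch of the overlap rule just described, assuming the interval bounds come from the confidence-interval construction earlier in the talk:

```python
def visual_test(ci_a, ci_b):
    # ci_a, ci_b: (low, high) confidence intervals for the two systems.
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    if hi_a < lo_b or hi_b < lo_a:
        return "no overlap: the systems likely differ"
    centre_a, centre_b = (lo_a + hi_a) / 2, (lo_b + hi_b) / 2
    if lo_b <= centre_a <= hi_b or lo_a <= centre_b <= hi_a:
        return "a centre is covered by the other interval: no difference shown"
    return "slight overlap only: fall back to a statistical test"
```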

What's wrong with this?

1. It does not tell us what we want to know: only whether there is a performance change. We could also report the ratio of sample means, but we still would not know how much of that change is due to uncertainty.

2. The decision is affected by sample size: the larger the sample, the more likely even a small, meaningless change is to be declared significant. These limitations have been known for 70 years, and the approach is deprecated in many fields: statistics, psychology, medicine, biology, chemistry, sociology, education, ecology…

What's wrong with this (cont.)?

3. Both tests use parametric methods that violate their assumptions: performance measurements are not usually normally distributed (they are often multi-modal, with long tails to the right).

It is good practice to check whether the data is close to normal. Robust methods are used in some fields. We should at least make the assumptions clear — that using Student's t-test is assumed to be OK… and often it is OK.

Two methods

Statistical model of random effects in an n-way classification.

Use this model to construct an effect size confidence interval for the ratio of the means of execution time.

1. A parametric method based on asymptotic normality

2. A non-parametric method based on statistical simulation (‘bootstrap’)

Quantifying the performance (1)

Parametric method

Use the same number of repetitions for the old (OY) and new (NY) systems.

Report a (1 − α) confidence interval (e.g. α = 0.05 for a 95% CI).

t_{α/2, ν} denotes the α/2-quantile of the t-distribution with ν = n_{n+1} − 1 degrees of freedom.

Quantifying performance (2)

Bootstrap method

1. Perform many simulations (1000 or more if there is time), using the real data within each simulated step.

2. Randomly choose the values to use at each level; resampling with replacement at all levels seems safe.

3. Calculate many sample means from these; they are asymptotically normal due to the Central Limit Theorem.

Form a (1 − α) CI by using the α/2 and 1 − α/2 sample quantiles, e.g. order the values and use the 25th and 975th.
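A hedged Python sketch of a multi-level bootstrap for the ratio of mean execution times (my reconstruction, not the authors' implementation); `old` and `new` are assumed to be nested lists where old[e][i] is the time of iteration i in execution e:

```python
import random

def resample_hierarchical(data):
    # Resample executions with replacement, then iterations within each chosen execution.
    chosen_execs = random.choices(data, k=len(data))
    return [t for e in chosen_execs for t in random.choices(e, k=len(e))]

def mean(xs):
    return sum(xs) / len(xs)

def ratio_ci(old, new, sims=1000, alpha=0.05):
    # Simulate many experiments from the real data and take sample quantiles of the ratio.
    ratios = sorted(
        mean(resample_hierarchical(new)) / mean(resample_hierarchical(old))
        for _ in range(sims)
    )
    k = max(1, int(round(sims * alpha / 2)))   # e.g. 25 of 1000 for a 95% CI
    return ratios[k - 1], ratios[-k]           # approx. the alpha/2 and 1-alpha/2 quantiles
```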

Parametric vs. bootstrap

The bootstrap is more robust than the parametric method: it uses fewer assumptions, does not depend on the underlying distribution, needs no check that the data is reasonably close to normal, and can be used with other metrics, e.g. medians.

The parametric method is more confident: it gives narrower confidence intervals and is more likely to find a significant difference.