evaluation, validation and empirical methods
Alan Dix
http://www.alandix.com/

Slides for HCICourse.com

Page 1: Evaluation

evaluation, validation and empirical methods

Alan Dix

http://www.alandix.com/

Page 2: Evaluation

evaluation

you’ve designed it, but is it right?

Page 3: Evaluation

different kinds of evaluation

endless arguments:

quantitative vs. qualitative

in the lab vs. in the wild

experts vs. real users (vs UG students!)

really need to combine methods

quantitative – what is true & qualitative – why

what is appropriate and possible

Page 4: Evaluation

purpose

three types of evaluation

                             purpose               stage
formative                    improve a design      development
summative                    say “this is good”    contractual/sales
investigative/exploratory    gain understanding    research

Page 5: Evaluation

when does it end?

in a world of perpetual beta ...

real use is the ultimate evaluation

logging, bug reporting, etc.

how do people really use the product?

are some features never used?
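Even lightweight logging can answer the “never used?” question. A minimal sketch, assuming a hypothetical one-event-per-line log format ("<timestamp> <feature>") and an invented feature list:

from collections import Counter

ALL_FEATURES = {"save", "export", "undo", "macro_editor"}   # invented feature list

def feature_usage(log_lines):
    # count the feature name at the end of each hypothetical "<timestamp> <feature>" line
    used = Counter(line.split()[-1] for line in log_lines if line.strip())
    never_used = ALL_FEATURES - used.keys()
    return used, never_used

used, never_used = feature_usage([
    "2014-12-13T10:31 save",
    "2014-12-13T10:32 undo",
    "2014-12-13T10:35 save",
])
print("usage counts:", dict(used))        # {'save': 2, 'undo': 1}
print("never used:", sorted(never_used))  # ['export', 'macro_editor']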

Page 6: Evaluation

studies and experiments

Page 7: Evaluation

what varies (and what you choose)

individuals / groups (not only UG students!)

tasks / activities

products / systems

principles / theories

prior knowledge and experience

learning and order effects (see the counterbalancing sketch after this list)

which are you trying to find out about? which are ‘noise’?
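A common way to keep learning and order effects as ‘noise’ rather than bias is to counterbalance the order of conditions across participants. A minimal sketch with hypothetical condition labels, using a simple rotated Latin square:

# Rotated Latin square: each condition appears once in every serial position
# across the participant groups, spreading order effects.
# Note: simple rotation balances position, not immediate carry-over;
# a balanced Latin square or randomisation handles that too.

def latin_square_orders(conditions):
    n = len(conditions)
    return [[conditions[(row + pos) % n] for pos in range(n)] for row in range(n)]

for group, order in enumerate(latin_square_orders(["A", "B", "C"]), start=1):
    print(f"group {group}: {' -> '.join(order)}")
# group 1: A -> B -> C
# group 2: B -> C -> A
# group 3: C -> A -> B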

Page 8: Evaluation

a little story …

BIG ACM sponsored conference

‘good’ empirical paper

looking at collaborative support for a task X

three pieces of software:

A – domain-specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous

[diagram: a 2×2 grid of domain-specific vs. generic against sync vs. async, with A, B and C filling three of the four cells]

Page 9: Evaluation

experiment

reasonable nos. subjects in each condition

quality measures

significant results (p<0.05):
domain specific > generic
asynchronous > synchronous

conclusion: really want async domain specific

[diagram: the 2×2 grid again, with the two comparisons (domain-specific vs. generic, async vs. sync) drawn across A, B and C]

Page 10: Evaluation

what’s wrong with that?

interaction effects:
the gap is interesting to study
but would not necessarily end up best

more important … if you blinked at the wrong moment …

NOT independent variables:
three different pieces of software
like an experiment on 3 people!
say system B was just bad

[diagram: the 2×2 grid with a ‘?’ in the untested domain-specific/async cell; the data only show B < A and B < C]
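To make the confound concrete, here is a tiny worked example with entirely hypothetical scores: both observed differences are consistent with system B simply being a poor program, and nothing in the data constrains the untested cell.

# Hypothetical quality scores for the three systems (not real data).
scores = {
    ("domain-specific", "sync"):  7.2,  # system A
    ("generic",         "sync"):  3.1,  # system B
    ("generic",         "async"): 6.8,  # system C
}
missing_cell = ("domain-specific", "async")   # the extrapolated "winner" was never run

# Both "significant" comparisons really just say "B scored worst":
print("A > B:", scores[("domain-specific", "sync")] > scores[("generic", "sync")])
print("C > B:", scores[("generic", "async")] > scores[("generic", "sync")])

# Attributing the differences to the domain/generic and sync/async factors
# assumes no interaction and no system-specific quirks -- exactly what
# cannot be checked with only three programs.
print("data for", missing_cell, "->", scores.get(missing_cell))   # None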

Page 11: Evaluation

what went wrong?

borrowed a psych method … but the method embodies assumptions:
a single simple cause, a controlled environment

interaction needs ecologically valid experiments:
multiple causes, open situations

what to do? understand assumptions and modify

Page 12: Evaluation

numbers and statistics

Page 13: Evaluation

are five users enough?

one of the myths of usability!
from a study by Nielsen and Landauer (1993)

empirical work, cost–benefit analysis and averages; many assumptions: simplified model, iterative steps, ...

basic idea: decreasing returns – each extra user gives less new information

really ... it depends:
for robust statistics – many, many more
for something interesting – one may be enough
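The decreasing-returns claim comes from a simple model in the Nielsen and Landauer paper: if each user independently reveals a given problem with probability λ, then n users reveal about 1 − (1 − λ)^n of them. A minimal sketch using their reported average λ ≈ 0.31 (real projects vary widely, hence ‘it depends’):

# Diminishing-returns model behind the "five users" rule of thumb:
# proportion of problems found by n users ~= 1 - (1 - lam)**n.
# lam = 0.31 is the average reported by Nielsen & Landauer (1993);
# it varies a lot between projects, so treat the numbers as illustrative.

def proportion_found(n_users, lam=0.31):
    return 1 - (1 - lam) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> ~{proportion_found(n):.0%} of problems found")
# 5 users -> ~84%, which is where the rule of thumb comes from;
# it is about finding problems, not about robust statistical comparison.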

Page 14: Evaluation

points of comparison

measures:
average satisfaction 3.2 on a 5-point scale

time to complete task in range 13.2–27.6 seconds

good or bad?

need a point of comparison – but what?

self, similar system, created or real??

think purpose ...

what constitutes a ‘control’? think!!

Page 15: Evaluation

do I need statistics?

to find some problems to fix – NO

to know how frequently a problem occurs,
whether most users experience it,
or if you’ve found most problems – YES

Page 16: Evaluation

statistics

needs a course in itself!

experimental design

choosing the right test

etc., etc., etc.

a few things ...

Page 17: Evaluation

statistical significance

stat. sig. = likelihood of seeing the effect by chance
5% (p<0.05) = 1 in 20 chance
beware many tests and cherry picking!
10 tests means about a 40% chance (nearly 50:50) of some spurious p<0.05 – see the check below

not necessarily large effect (i.e. ≠ important)

non-significant = not proven (NOT no effect)
may simply not be sensitive enough, e.g. too few users

to show there is no (or only a small) effect needs other methods
find out about confidence intervals!
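The cherry-picking warning is easy to check: with k independent tests at α = 0.05, the chance of at least one spurious ‘significant’ result is 1 − 0.95^k.

# Probability of at least one false positive among k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 14, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> {p_any:.0%} chance of some p < 0.05 by luck alone")
# 10 tests -> ~40%; around 14 tests it passes even odds.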

Page 18: Evaluation

statistical power

how likely effect will show up in experiment

more users means more ‘power’
2× the sensitivity needs 4× the number of users (see the sketch after this list)

manipulate it!

more users (but usually many more)

within subject/group (‘cancels’ individual diffs.)

choice of task (particularly good/bad)

add distracter task
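The 2×/4× rule follows because the standard error only shrinks as 1/√n, so halving the detectable effect size needs four times the participants. A rough sketch using the usual normal-approximation sample-size formula for a two-group comparison (α = 0.05 and 80% power are assumed values; exact numbers depend on the test):

# Approximate per-group sample size for a two-group comparison:
# n ~= 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, where d is the standardised
# effect size to detect. Halving d quadruples n.
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2

for d in (0.8, 0.4, 0.2):          # halving the detectable effect each time
    print(f"effect size {d}: ~{n_per_group(d):.0f} participants per group")
# 0.8 -> ~25, 0.4 -> ~98, 0.2 -> ~392: each halving roughly quadruples n.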

Page 19: Evaluation

from data to knowledge

Page 20: Evaluation

types of knowledge

descriptive – explaining what happened

predictive – saying what will happen
cause ⇒ effect – where science often ends

synthetic – working out what to do to make what you want happen
effect ⇒ cause – design and engineering

Page 21: Evaluation

generalisation?

can we ever generalise?

every situation is unique, but ...

... to use past experience is to generalise

generalisation ≠ abstraction
cases, descriptive frameworks, etc.

data ≠ generalisation
interpolation – maybe

extrapolation??

Page 22: Evaluation

generalisation ...

never comes (solely) from data

always comes from the head

requires understanding

Page 23: Evaluation

mechanism

• reduction / reconstruction
  – formal hypothesis testing
  + may be qualitative too
  – more scientific precision

• wholistic / analytic
  – field studies, ethnographies
  + ‘end to end’ experiments
  – more ecological validity

Page 24: Evaluation

from evaluation to validation

Page 25: Evaluation

validating work

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

[diagram: your work evaluated via experiments, user studies and peer review; the problem of singularity (different people, different situations) is addressed by sampling]

Page 26: Evaluation

generative artefacts

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

generative artefacts – toolkits, devices, interfaces, guidelines, methodologies

[diagram: evaluating the artefact itself faces the singularity problem (people, situations) plus different designers and different briefs – too many to sample]

(pure) evaluation of generative artefacts is methodologically unsound

Page 27: Evaluation

validating work

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

[diagram: your work supported both by evaluation (experiments, user studies, peer review) and by justification (expert opinion, previous research, new experiments)]

Page 28: Evaluation

justification vs. validation

• different disciplines
  – mathematics: proof = justification
  – medicine: drug trials = evaluation

• combine them:
  – look for weakness in justification
  – focus evaluation there


Page 29: Evaluation

example – scroll arrows ...

Xerox STAR – first commercial GUI
precursor of Mac, Windows, ...

principled design decisions

which direction for scroll arrows?
not obvious: moving the document or the handle?

=> do a user study!

gap in justification => evaluation

unfortunately ...

Apple got the wrong design