evaluation, validation and empirical methods
Alan Dix
http://www.alandix.com/

Slides for HCICourse.com

Page 1: Evaluation

evaluation, validation and empirical methods

Alan Dix

http://www.alandix.com/

Page 2: Evaluation

evaluation

you’ve designed it, but is it right?

Page 3: Evaluation

different kinds of evaluation

endless arguments:

quantitative vs. qualitative

in the lab vs. in the wild

experts vs. real users (vs UG students!)

really need to combine methods

quantitative – what is true & qualitative – why

what is appropriate and possible

Page 4: Evaluation

purpose

three types of evaluation

                             purpose               stage
formative                    improve a design      development
summative                    say “this is good”    contractual/sales
investigative/exploratory    gain understanding    research

Page 5: Evaluation

when does it end?

in a world of perpetual beta ...

real use is the ultimate evaluation

logging, bug reporting, etc.

how do people really use the product?

are some features never used?
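Even lightweight logging can answer the “never used?” question. A minimal sketch, assuming a hypothetical one-event-per-line log format ("<timestamp> <feature>") and an invented feature list:

from collections import Counter

ALL_FEATURES = {"save", "export", "undo", "macro_editor"}   # invented feature list

def feature_usage(log_lines):
    # count the feature name at the end of each hypothetical "<timestamp> <feature>" line
    used = Counter(line.split()[-1] for line in log_lines if line.strip())
    never_used = ALL_FEATURES - used.keys()
    return used, never_used

used, never_used = feature_usage([
    "2014-12-13T10:31 save",
    "2014-12-13T10:32 undo",
    "2014-12-13T10:35 save",
])
print("usage counts:", dict(used))        # {'save': 2, 'undo': 1}
print("never used:", sorted(never_used))  # ['export', 'macro_editor']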

Page 6: Evaluation

studies and experiments

Page 7: Evaluation

what varies (and what you choose)

individuals / groups (not only UG students!)

tasks / activities

products / systems

principles / theories

prior knowledge and experience

learning and order effects (see the counterbalancing sketch after this list)

which are you trying to find out about? which are ‘noise’?
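A common way to keep learning and order effects as ‘noise’ rather than bias is to counterbalance the order of conditions across participants. A minimal sketch with hypothetical condition labels, using a simple rotated Latin square:

# Rotated Latin square: each condition appears once in every serial position
# across the participant groups, spreading order effects.
# Note: simple rotation balances position, not immediate carry-over;
# a balanced Latin square or randomisation handles that too.

def latin_square_orders(conditions):
    n = len(conditions)
    return [[conditions[(row + pos) % n] for pos in range(n)] for row in range(n)]

for group, order in enumerate(latin_square_orders(["A", "B", "C"]), start=1):
    print(f"group {group}: {' -> '.join(order)}")
# group 1: A -> B -> C
# group 2: B -> C -> A
# group 3: C -> A -> B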

Page 8: Evaluation

a little story …

BIG ACM sponsored conference

‘good’ empirical paper

looking at collaborative support for a task X

three pieces of software:

A – domain-specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous

[diagram: a 2×2 grid of domain-specific vs. generic against sync vs. async, with A, B and C filling three of the four cells]

Page 9: Evaluation

experiment

reasonable nos. subjects in each condition

quality measures

significant results (p<0.05):
domain specific > generic
asynchronous > synchronous

conclusion: really want async domain specific

[diagram: the 2×2 grid again, with the two comparisons (domain-specific vs. generic, async vs. sync) drawn across A, B and C]

Page 10: Evaluation

what’s wrong with that?

interaction effects:
the gap is interesting to study
but would not necessarily end up best

more important … if you blinked at the wrong moment …

NOT independent variables:
three different pieces of software
like an experiment on 3 people!
say system B was just bad

[diagram: the 2×2 grid with a ‘?’ in the untested domain-specific/async cell; the data only show B < A and B < C]
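To make the confound concrete, here is a tiny worked example with entirely hypothetical scores: both observed differences are consistent with system B simply being a poor program, and nothing in the data constrains the untested cell.

# Hypothetical quality scores for the three systems (not real data).
scores = {
    ("domain-specific", "sync"):  7.2,  # system A
    ("generic",         "sync"):  3.1,  # system B
    ("generic",         "async"): 6.8,  # system C
}
missing_cell = ("domain-specific", "async")   # the extrapolated "winner" was never run

# Both "significant" comparisons really just say "B scored worst":
print("A > B:", scores[("domain-specific", "sync")] > scores[("generic", "sync")])
print("C > B:", scores[("generic", "async")] > scores[("generic", "sync")])

# Attributing the differences to the domain/generic and sync/async factors
# assumes no interaction and no system-specific quirks -- exactly what
# cannot be checked with only three programs.
print("data for", missing_cell, "->", scores.get(missing_cell))   # None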

Page 11: Evaluation

what went wrong?

borrowed a psych method … but the method embodies assumptions:
a single simple cause, a controlled environment

interaction needs ecologically valid experiments:
multiple causes, open situations

what to do? understand assumptions and modify

Page 12: Evaluation

numbers and statistics

Page 13: Evaluation

are five users enough?

one of the myths of usability!
from a study by Nielsen and Landauer (1993)

empirical work, cost–benefit analysis and averages; many assumptions: simplified model, iterative steps, ...

basic idea: decreasing returns – each extra user gives less new information

really ... it depends:
for robust statistics – many, many more
for something interesting – one may be enough
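The decreasing-returns claim comes from a simple model in the Nielsen and Landauer paper: if each user independently reveals a given problem with probability λ, then n users reveal about 1 − (1 − λ)^n of them. A minimal sketch using their reported average λ ≈ 0.31 (real projects vary widely, hence ‘it depends’):

# Diminishing-returns model behind the "five users" rule of thumb:
# proportion of problems found by n users ~= 1 - (1 - lam)**n.
# lam = 0.31 is the average reported by Nielsen & Landauer (1993);
# it varies a lot between projects, so treat the numbers as illustrative.

def proportion_found(n_users, lam=0.31):
    return 1 - (1 - lam) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> ~{proportion_found(n):.0%} of problems found")
# 5 users -> ~84%, which is where the rule of thumb comes from;
# it is about finding problems, not about robust statistical comparison.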

Page 14: Evaluation

points of comparison

measures:
average satisfaction 3.2 on a 5-point scale

time to complete task in range 13.2–27.6 seconds

good or bad?

need a point of comparison – but what?

self, similar system, created or real??

think purpose ...

what constitutes a ‘control’? think!!

Page 15: Evaluation

do I need statistics?

to find some problems to fix – NO

to know how frequently a problem occurs,
whether most users experience it,
or if you’ve found most problems – YES

Page 16: Evaluation

statistics

needs a course in itself!

experimental design

choosing the right test

etc., etc., etc.

a few things ...

Page 17: Evaluation

statistical significance

stat. sig. = likelihood of seeing the effect by chance
5% (p<0.05) = 1 in 20 chance
beware many tests and cherry picking!
10 tests means about a 40% chance (nearly 50:50) of some spurious p<0.05 – see the check below

not necessarily large effect (i.e. ≠ important)

non-significant = not proven (NOT no effect)
may simply not be sensitive enough, e.g. too few users

to show there is no (or only a small) effect needs other methods
find out about confidence intervals!
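The cherry-picking warning is easy to check: with k independent tests at α = 0.05, the chance of at least one spurious ‘significant’ result is 1 − 0.95^k.

# Probability of at least one false positive among k independent tests at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 14, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> {p_any:.0%} chance of some p < 0.05 by luck alone")
# 10 tests -> ~40%; around 14 tests it passes even odds.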

Page 18: Evaluation

statistical power

how likely effect will show up in experiment

more users means more ‘power’
2× the sensitivity needs 4× the number of users (see the sketch after this list)

manipulate it!

more users (but usually many more)

within subject/group (‘cancels’ individual diffs.)

choice of task (particularly good/bad)

add distracter task
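The 2×/4× rule follows because the standard error only shrinks as 1/√n, so halving the detectable effect size needs four times the participants. A rough sketch using the usual normal-approximation sample-size formula for a two-group comparison (α = 0.05 and 80% power are assumed values; exact numbers depend on the test):

# Approximate per-group sample size for a two-group comparison:
# n ~= 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, where d is the standardised
# effect size to detect. Halving d quadruples n.
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2

for d in (0.8, 0.4, 0.2):          # halving the detectable effect each time
    print(f"effect size {d}: ~{n_per_group(d):.0f} participants per group")
# 0.8 -> ~25, 0.4 -> ~98, 0.2 -> ~392: each halving roughly quadruples n.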

Page 19: Evaluation

from data to knowledge

Page 20: Evaluation

types of knowledge

descriptive – explaining what happened

predictive – saying what will happen
cause ⇒ effect – where science often ends

synthetic – working out what to do to make what you want happen
effect ⇒ cause – design and engineering

Page 21: Evaluation

generalisation?

can we ever generalise?

every situation is unique, but ...

... to use past experience is to generalise

generalisation ≠ abstraction
cases, descriptive frameworks, etc.

data ≠ generalisation
interpolation – maybe

extrapolation??

Page 22: Evaluation

generalisation ...

never comes (solely) from data

always comes from the head

requires understanding

Page 23: Evaluation

mechanism

• reduction / reconstruction
  – formal hypothesis testing
  + may be qualitative too
  – more scientific precision

• wholistic / analytic
  – field studies, ethnographies
  + ‘end to end’ experiments
  – more ecological validity

Page 24: Evaluation

from evaluation to validation

Page 25: Evaluation

validating work

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

[diagram: your work evaluated via experiments, user studies and peer review; the problem of singularity (different people, different situations) is addressed by sampling]

Page 26: Evaluation

generative artefacts

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

generative artefacts – toolkits, devices, interfaces, guidelines, methodologies

[diagram: evaluating the artefact itself faces the singularity problem (people, situations) plus different designers and different briefs – too many to sample]

(pure) evaluation of generative artefacts is methodologically unsound

Page 27: Evaluation

validating work

• justification
  – expert opinion
  – previous research
  – new experiments

• evaluation
  – experiments
  – user studies
  – peer review

[diagram: your work supported both by evaluation (experiments, user studies, peer review) and by justification (expert opinion, previous research, new experiments)]

Page 28: Evaluation

justification vs. validation

• different disciplines
  – mathematics: proof = justification
  – medicine: drug trials = evaluation

• combine them:
  – look for weakness in justification
  – focus evaluation there


Page 29: Evaluation

example – scroll arrows ...

Xerox STAR – first commercial GUI
precursor of Mac, Windows, ...

principled design decisions

which direction for scroll arrows?
not obvious: moving the document or the handle?

=> do a user study!

gap in justification => evaluation

unfortunately ...

Apple got the wrong design