Lecture 19
Chapter 9: Evaluation Techniques (Part 2)




Evaluating Implementations

Requires an artefact: a simulation, a prototype, or a full implementation


Experimental evaluation

• controlled evaluation of specific aspects of interactive behaviour

• evaluator chooses hypothesis to be tested

• a number of experimental conditions are considered which differ only in the value of some controlled variable.

• changes in behavioural measure are attributed to different conditions


Experimental factors

• Subjects (i.e., the users, aka ‘participants’)
  – who: a representative, sufficient sample
    • not the programmer’s friend, boss, etc.
    • huge variability in performance of individuals
• Variables
  – things to modify and measure
• Hypothesis
  – what you’d like to show
• Experimental design
  – how you are going to show it
  – includes ‘Protocol’: what the subjects do


Variables

• independent variable (IV)
  – characteristic changed to produce different conditions, e.g. interface style, number of menu items
• dependent variable (DV)
  – characteristic measured in the experiment, e.g. time taken, number of errors
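As a concrete (entirely hypothetical) sketch, here is how the IV conditions and DV measurements of a small experiment might be recorded in Python; the condition names and timing values are invented for illustration:

```python
# Hypothetical records from a small experiment:
# IV = interface style (two conditions: "menu" vs. "toolbar")
# DV = time taken to complete a fixed task, in seconds
results = {
    "menu":    [41.2, 38.5, 45.0, 39.9],   # one measurement per subject
    "toolbar": [33.1, 36.4, 30.8, 35.2],
}

for condition, times in results.items():
    mean_time = sum(times) / len(times)
    print(f"{condition}: mean task time = {mean_time:.1f} s")
```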


Hypothesis

• prediction of outcome
  – framed in terms of IV and DV
  – e.g. “error rate will increase as font size decreases”
• null hypothesis
  – states no difference between conditions
  – aim is to disprove this
  – e.g. null hyp. = “no change with font size”


Experimental design

• “within groups” design (also called “repeated measures”)
  – each subject performs the experiment under each condition
  – transfer of learning possible (practice makes performance better; alternatively, fatigue or boredom makes it worse)
  – less costly and less likely to suffer from user variation (each user is compared to themselves)
• “between groups” design
  – each subject performs under only one condition
  – no transfer of learning
  – more users required


Within v. Between

• Consider a test of the effect of beer v. vodka martinis on reaction time
  – Null hypothesis: no difference in the increase in reaction time between the two beverages
• Design 1:
  – 30 people try beer; 30 other people try vodka
  – D.V. is the change in reaction time pre- v. post-drinking
  – Not bad: be sure to randomize who goes into the beer group v. the vodka group
  – But the ‘power’ of the experiment will be reduced by the great variability among individuals in their reaction to alcohol


Within v. Between (contd.)

• Design 2:
  – All 60 people first try beer, then immediately try vodka
  – Problem of carryover effect
• Better design:
  – All 60 try beer, then a week later try vodka
  – Now each individual is compared with themselves
  – Still a possible problem of ordering effect (e.g., they might get a little better at the reaction time test)
• Best design (sketched below):
  – 30 try beer, then a week later vodka; 30 try vodka, and then a week later beer
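A minimal Python sketch of setting up that counterbalanced design; the subject IDs are hypothetical:

```python
import random

subjects = list(range(1, 61))   # 60 hypothetical subject IDs
random.shuffle(subjects)        # randomize group assignment

# Counterbalanced within-subjects design: half get beer first,
# half get vodka first, with a week between the two sessions.
group_a, group_b = subjects[:30], subjects[30:]
schedule = {s: ("beer", "vodka") for s in group_a}
schedule.update({s: ("vodka", "beer") for s in group_b})

print(schedule[subjects[0]])    # e.g. ('beer', 'vodka')
```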


Analysis of data

• Before you start to do any statistics:
  – look at the data (e.g. average = 5.25, but 4.9 without an outlier; see the sketch below)
  – save the original data
• Choice of statistical technique depends on
  – type of data
  – information required
• Type of data
  – discrete
    • finite number of values
    • may be ordered, or unordered (e.g., colors)
  – continuous
    • any value

[Figure: bar chart of 20 example data values]
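A quick Python sketch of that outlier effect, with invented numbers chosen so that the mean is 5.25 overall but 4.9 once the single outlier is dropped, as in the slide’s example:

```python
# Invented data: seven typical observations plus one outlier (7.7)
data = [4.2, 4.8, 5.1, 4.6, 5.3, 5.0, 5.3, 7.7]

mean_all = sum(data) / len(data)
no_outlier = [x for x in data if x != 7.7]
mean_trimmed = sum(no_outlier) / len(no_outlier)

print(f"mean with outlier:    {mean_all:.2f}")      # 5.25
print(f"mean without outlier: {mean_trimmed:.2f}")  # 4.90
```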


Analysis - types of test

• parametric
  – assume normal distribution
  – robust
  – powerful
• non-parametric
  – do not assume normal distribution
  – less powerful
  – more reliable
• contingency table
  – classify data by discrete attributes
  – count number of data items in each group
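As an illustration of the parametric v. non-parametric choice, here is a sketch using SciPy (the timing data are invented): the t-test assumes roughly normal data, while the Mann–Whitney U test uses only the ranks of the observations:

```python
from scipy import stats

# Invented task times (seconds) for two interface conditions
a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]
b = [13.5, 14.1, 12.9, 14.8, 13.7, 14.0]

# Parametric: assumes (roughly) normally distributed data
t, p_t = stats.ttest_ind(a, b)

# Non-parametric: only uses the ranks of the observations
u, p_u = stats.mannwhitneyu(a, b)

print(f"t-test:       p = {p_t:.4f}")
print(f"Mann-Whitney: p = {p_u:.4f}")
```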


Analysis of data (cont.)

• What information is required?
  – is there a difference?
  – how big is the difference?
  – how accurate is the estimate?
• Parametric and non-parametric tests mainly address the first of these
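The second and third questions are addressed by an effect size and a confidence interval rather than a p-value alone. A rough sketch with invented data, using a normal approximation (a t-based interval would be more appropriate for samples this small):

```python
import statistics as st

a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]   # invented condition A times
b = [13.5, 14.1, 12.9, 14.8, 13.7, 14.0]   # invented condition B times

diff = st.mean(b) - st.mean(a)             # how big is the difference?
# Standard error of the difference; 1.96 gives a rough 95% interval
se = (st.variance(a) / len(a) + st.variance(b) / len(b)) ** 0.5
print(f"difference = {diff:.2f} s, "
      f"95% CI ~ [{diff - 1.96 * se:.2f}, {diff + 1.96 * se:.2f}]")
```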


ANOVA – analysis of variance

• Quite easy to test whether there’s a significant difference between groups in Excel
  – Need to invoke Tools/Add-Ins/Analysis ToolPak to enable it
  – Then just apply Tools/Data Analysis/ANOVA: Single Factor to the data


ANOVA from Excel

Say we have three columns of numbers representing the time to complete a task for 5, 5 and 7 users using three variations of an interface.

Anova: Single Factor

SUMMARY
Groups    Count   Sum   Average    Variance
Group 1   5       82    16.4       9.3
Group 2   5       67    13.4       10.3
Group 3   7       79    11.28571   3.904762

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        76.28908   2    38.14454   5.244339   0.019959   3.738892
Within Groups         101.8286   14   7.273469
Total                 178.1176   16

If P-value < 0.05 then we usually say the result is ‘significant’ (the result is more than expected chance variation).
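The same single-factor ANOVA can be run outside Excel, e.g. with SciPy. In this sketch the three groups are invented to match the summary table above (same counts, sums and variances), so the F and P-values reproduce exactly:

```python
from scipy import stats

# Data invented to reproduce the Excel summary above
g1 = [12, 15, 17, 18, 20]           # 5 users, sum 82, variance 9.3
g2 = [9, 12, 13, 16, 17]            # 5 users, sum 67, variance 10.3
g3 = [8, 10, 11, 11, 12, 13, 14]    # 7 users, sum 79, variance ~3.90

f, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f:.4f}, p = {p:.4f}")  # F = 5.2443, p = 0.0200
```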


When is a difference a difference?

• In the world of parametric stats, we look for a statistic to be large enough to be ‘significant’
  – On the Gaussian (‘normal’) curve, |Z| = 1.96 leaves only 5% of the area of the curve in the two tails (95% lies within ±1.96), so it is a common ‘critical value’ for claiming significance (checked in the sketch below)
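A quick check of that critical value with SciPy’s standard normal functions:

```python
from scipy import stats

# 97.5th percentile of the standard normal: the two tails outside
# +/- this value together hold 5% of the area
z = stats.norm.ppf(0.975)
print(f"z = {z:.3f}")                       # 1.960

# Area between -1.96 and +1.96
area = stats.norm.cdf(1.96) - stats.norm.cdf(-1.96)
print(f"area within +/-1.96 = {area:.3f}")  # 0.950
```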


Parametric assumptions

• Parametric statistics assume that some mathematically elegant assumptions hold true for the data
  – E.g., ANOVA (and standard ‘regression’) assume, among other things, normally distributed random error
  – Trivia: the mathematical form of the probability density function for the normal distribution is remarkably formidable
    • It centres on the mean, μ, and is flattened by the standard deviation, σ

[Figure: a Galton machine simulates the normal distribution (aka ‘bell curve’)]
[Figure: the exponential distribution models the time between events happening with a constant average rate]
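For reference, the ‘remarkably formidable’ density the slide alludes to, alongside the exponential density from the caption (standard textbook forms):

```latex
% Normal (Gaussian) density, centred on \mu, flattened by \sigma:
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

% Exponential density, for events at constant average rate \lambda:
f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
```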


So, what to measure?

• Usually one (or several) of these things:
  – Speed / efficiency
    • How many (or whatever) per unit time can the user process with this interface?
  – Accuracy (errors)
  – Learnability (time to acquire the ability to do something in particular with the interface)
    • And retention (how well they can do it some particular time later)
  – Satisfaction
    • Subjective assessment: how does the user feel about the interface?


Subjective assessment

• Very important
  – If the user seriously doesn’t like it, there’s probably something really wrong with the design (just maybe they can’t articulate what)
• Quantifiable
  – Yes/No, or better, a Likert scale (ratings), through interview or questionnaire
• Need to ask the right questions
  – E.g., don’t use leading questions (BAD: “Is this the absolute worst system you have ever used?”)
• And need to ask the questions well (so the user reliably expresses what they mean)
  – E.g., don’t trip them up with double negatives


Likert Scale

• Can have from 4 to 7 “points”
• Usually about agreement with a phrase
  – E.g., “I found the search function easy to use”
  – Strongly Agree, Agree, Neutral (optional), Disagree, Strongly Disagree
• May also be about importance
  – E.g., “A site search function is…”
  – “not very important” to “extremely important”
• Or a general assessment
  – E.g., “The performance of the search function was…”
  – Poor, Fair, Satisfactory, Good, Excellent
• Great to ask open-ended questions, too
  – E.g., “What was the best aspect of the search function?”
  – But it’s the Likert scale data that you can quantify (see the sketch below)
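A minimal sketch of how Likert data get quantified: map each label to a number, then summarize. The item and responses here are invented:

```python
import statistics

# Hypothetical responses to "I found the search function easy to use"
responses = ["Agree", "Strongly Agree", "Neutral", "Agree",
             "Disagree", "Agree", "Strongly Agree"]

# Conventional 5-point coding
scale = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
         "Agree": 4, "Strongly Agree": 5}

scores = [scale[r] for r in responses]
print(f"n = {len(scores)}")
print(f"mean = {statistics.mean(scores):.2f}, "
      f"median = {statistics.median(scores)}")
```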


Dimensions and validity

• When designing a questionnaire…
  – Have in mind a few underlying issues that you are trying to assess
  – Ask a few different questions that are coordinated around each issue
  – Ask in different ways: vary whether positive or negative favours the issue in question (see the reverse-scoring sketch after this list)
  – Ideally, verify the questionnaire by having people role-play particularly happy, particularly angry, and middle-of-the-road users
    • And see if they answer the questions the way you’d expect!
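One mechanical consequence of varying question polarity: negatively worded items must be reverse-scored before answers around the same issue can be combined. A sketch with invented items on a 5-point scale:

```python
# Two invented items probing the same issue (ease of use),
# one worded positively and one negatively
answers = {
    "The search function was easy to use": 4,   # positive wording
    "The search function was frustrating": 2,   # negative wording
}
negative_items = {"The search function was frustrating"}

def score(item, value, points=5):
    """Reverse-score negatively worded items on a 1..points scale."""
    return (points + 1 - value) if item in negative_items else value

combined = [score(i, v) for i, v in answers.items()]
print(combined)                       # [4, 4] -> consistent responses
print(sum(combined) / len(combined))  # issue score: 4.0
```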


Questionnaire administration

• Face-to-face or telephone interviews
  – Especially efficient (and unbiased) to have a third party give the telephone survey
• Mail-out (including email) questionnaire
• Web-based questionnaire (maybe email out the URL)
• Even a questionnaire is an experiment on humans from the point of view of research ethics
  – Easier to achieve an ethical questionnaire administration if it’s truly anonymous


Response rates and bias

• A low response rate is a problem
  – Below a 50% response rate, one wonders whether the respondents were exceptional (the happiest, the angriest, or a mix; but not the “normal” folks)
  – Better if you have some authority to motivate a response
    • But back to issues of ethics: e.g., it’s not truly anonymous if you know who to nag about non-response; and the “pressure” may be unfair
• Similar bias problems arise when you use volunteers for any experiment
  – Are these volunteers representative of your “normal” users?


Experimental studies on groups

• More difficult than single-user experiments
• Problems with:
  – subject groups
  – choice of task
  – data gathering
  – analysis
• Unfortunately (in terms of experimental requirements), a lot of the things that are interesting in the real world involve computers mediating group behaviour


Subject groups

• larger number of subjects: more expensive
• longer time to ‘settle down’ … even more variation!
• difficult to timetable
• so … often only three or four groups


Groups (contd.): Data gathering

• several video cameras + direct logging of the application
• problems:
  – synchronisation
  – sheer volume!
• one solution:
  – record from each perspective


Groups (contd.): Analysis

• N.B. vast variation between groups
• solutions:
  – ‘within groups’ experiments (each group works under various conditions)
  – micro-analysis (e.g., gaps in speech)
  – anecdotal and qualitative analysis
• controlled experiments may ‘waste’ resources!
  – experiments dominated by dynamics of group formation
  – field studies are apt to be more realistic


About statistics

• It’s an amazingly complex field
  – A lot of hidden complexities in running experiments and in saying that the observed differences really make a difference
• ‘Threats to validity’ are the things that make it possible that your experimental conclusion is in error
  – Threats to internal validity: e.g., carryover effects, or lack of randomization
  – Threats to external validity: e.g., your whole population of subjects was unusual in some way, or the task was not representative of real use of the tool
• When the outcomes are serious (e.g., medical trials), professional statisticians are always used in the design of the experiment as well as in the analysis and reporting of the findings
• Plenty of texts and courses on stats are available (Wikipedia is pretty good on these topics, too, e.g. for ANOVA)