
Intro to Evaluation

See how (un)usable your software really is…

Why is evaluation done?

Summative: assess an existing system; judge whether it meets some criteria

Formative: assess a system being designed; gather input to inform the design

Summative or formative? Depends on:

• maturity of the system
• how evaluation results will be used

The same technique can be used for either

Other distinctions

Form of results obtained: quantitative vs. qualitative

Who is experimenting with the design: end users vs. HCI experts

Approach: experimental, naturalistic, or predictive

Evaluation techniques

Predictive modeling
Questionnaires
Empirical user studies (experiments)
Heuristic evaluation
Cognitive walkthrough
Think-aloud (protocol analysis)
Interviews
Experience sampling
Focus groups

Evaluation techniques

Predictive evaluation: Fitts’ law, Hick’s law, etc. (a worked sketch follows this slide)

Observation: think-aloud, cooperative evaluation

Watch users perform tasks with your interface

We’ll start talking about this today
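To make the predictive flavor concrete, here is a minimal sketch of a Fitts’ law movement-time prediction (using the common Shannon formulation). The constants a and b are illustrative assumptions; in practice they are fitted from measured pointing data for a particular device and user population.

```python
import math

def fitts_movement_time(a: float, b: float, distance: float, width: float) -> float:
    """Predicted time to acquire a target: MT = a + b * log2(D/W + 1).

    a, b     -- empirically fitted device constants (illustrative here)
    distance -- distance to the target (same units as width)
    width    -- target width along the axis of motion
    """
    return a + b * math.log2(distance / width + 1)

# Hypothetical mouse constants: a = 0.1 s, b = 0.15 s/bit
print(f"{fitts_movement_time(0.1, 0.15, distance=300, width=30):.2f} s")  # ~0.62 s
```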

More techniques

Empirical user studies (experiments)
Test hypotheses about your interface
Examine dependent variables against independent variables
More next lecture…

Interviews, questionnaires, focus groups
Get user feedback
More next week…

Still more techniques

Discount usability techniques
Use HCI experts instead of users
Fast and cheap way to get broad feedback

Heuristic evaluation
• Several experts examine the interface using guiding heuristics (like the ones we used in design)

Cognitive walkthrough
• Several experts assess the learnability of the interface for novices

You will do one of each of these

And still more techniques

Diary studies
Users relate experiences on a regular basis
Can write entries down, call them in, etc.

Experience sampling technique
Interrupt users with a very short questionnaire on a random-ish basis (see the sketch after this slide)

Both are good for getting an idea of regular, long-term use in the field (the real world)

We won’t talk more about these…
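For concreteness, a random-ish experience-sampling schedule might be generated like the sketch below; the prompt count and waking window are illustrative assumptions.

```python
import random
from datetime import datetime, timedelta

def esm_schedule(day_start: datetime, n_prompts: int = 6,
                 waking_hours: float = 14.0) -> list[datetime]:
    """Pick n_prompts random-ish times across the waking day at which
    to interrupt the user with a very short questionnaire."""
    offsets = sorted(random.uniform(0, waking_hours * 3600) for _ in range(n_prompts))
    return [day_start + timedelta(seconds=s) for s in offsets]

# Example: six prompts in an assumed 8:00-22:00 waking window
for t in esm_schedule(datetime(2015, 12, 22, 8, 0)):
    print(t.strftime("%H:%M"))
```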

Evaluation is Detective Work

Goal: gather evidence that can help you determine whether your usability goals are being met

Evidence (data) should be: relevant, diagnostic, credible, and corroborated

Data as Evidence

Relevant
Appropriate to address the hypotheses
• e.g., does measuring “number of errors” provide insight into how effectively your new air traffic control system supports the users’ tasks?

Diagnostic
Data unambiguously provide evidence one way or the other
• e.g., does asking the users’ preferences clearly tell you if the system performs better? (Maybe)

Data as Evidence

Credible
Are the data trustworthy?
• Gather data carefully; gather enough data

Corroborated
Does more than one source of evidence support the hypotheses?
• e.g., both accuracy and user opinions indicate that the new system is better than the previous one. But what if completion time is slower?

General Recommendations

Include both objective & subjective data
e.g., “completion time” and “preference”

Use multiple measures within a type
e.g., “reaction time” and “accuracy”

Use quantitative measures where possible
e.g., preference score (on a scale of 1-7)

Note: only gather the data required; do so with minimum interruption, hassle, time, etc.

Making an evaluation plan

What criteria are important?

What resources are available? Evaluators, prototype, subjects, time

Required authenticity of the system

Evaluation planning

Decide on techniques, tasks, materials
How many people, how long
How to record data, how to analyze data
Prepare materials – interfaces, storyboards, questionnaires, etc.

Pilot the entire evaluation
Test all materials, tasks, questionnaires, etc.
Find and fix problems with wording, assumptions
Get a good feel for the length of the study

Recruiting Participants

Various “subject pools”:
Volunteers
Paid participants
Students (e.g., psych undergrads) for course credit
Friends, acquaintances, family, lab members
“Public space” participants – e.g., observing people walking through a museum
Email, newsgroup lists

Must fit the user population (validity)

Note: ethics, IRB, and consent apply to *all* participants, including friends & “pilot subjects”

Consent

Why important?
People can be sensitive about this process and these issues
Errors will likely be made; the participant may feel inadequate
May be mentally or physically strenuous

What are the potential risks (there are always risks)?

Performing the Study

Be well prepared so the participant’s time is not wasted

Explain procedures without compromising results

The session should not be too long; the subject can quit at any time

Never express displeasure or anger

Data should be stored anonymously and securely, and/or destroyed

Expect anything and everything to go wrong!!

(a little story)

Data Inspection

Look at the results
First look at each participant’s data
Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.?

Then look at aggregate results and descriptive statistics (see the sketch below)
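As an illustration of that first per-participant pass, here is a minimal sketch that screens completion times for outliers and then computes aggregate descriptive statistics; the data values and the 2-standard-deviation screening rule are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical task-completion times in seconds, one per participant
times = {"P1": 72, "P2": 64, "P3": 81, "P4": 310, "P5": 69,
         "P6": 75, "P7": 70, "P8": 66, "P9": 78}

m, s = mean(times.values()), stdev(times.values())
# Flag participants far from the group (2 SD is an arbitrary screening rule)
outliers = {p: t for p, t in times.items() if abs(t - m) > 2 * s}
print(f"mean={m:.1f}s sd={s:.1f}s outliers={outliers}")

# Descriptive statistics over the remaining data
clean = [t for p, t in times.items() if p not in outliers]
print(f"clean mean={mean(clean):.1f}s sd={stdev(clean):.1f}s")
```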

Inspecting Your Data

“What happened in this study?”
Keep in mind the goals or hypotheses you had at the beginning

Questions:
Overall, how did people do?
“5 W’s”: where, what, why, when, and for whom were the problems?

Making Conclusions

Where did you meet your criteria? Where didn’t you?

What were the problems? How serious are these problems?

What design changes should be made? But don’t make things worse…

Prioritize and plan changes to the design; iterate on the entire process

Example: Heather’s study

Software: MeetingViewer interface, fully functional

Criteria – learnability, efficiency; see what aspects of the interface get used, what might be missing

Resources – subjects were students in a research group, just me as evaluator, plenty of time

Wanted a completely authentic experience

Heather’s evaluation

Task: answer questions from a recorded meeting, using my software as desired

Think-aloud
Videotaped, plus software logs
Also had a post-questionnaire
Wrote my own code for log analysis
Watched video and matched behavior to software logs

Example materials

Data analysis

Basic data compiled:
Time to answer a question (or give up)
Number of clicks on each type of item
Number of times audio was played
Length of audio played
User’s stated difficulty with the task
User’s suggestions for improvements

More complicated:
Overall patterns of behavior in using the interface
User strategies for finding information

(A sketch of compiling such basic data from event logs follows below.)
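A hypothetical sketch of compiling some of the basic measures from timestamped interface events; the (timestamp_ms, event, detail) tuples are an assumed shape, loosely mirroring the MeetingViewer logs shown later.

```python
from collections import Counter

# Hypothetical parsed events: (timestamp_ms, event, detail)
events = [
    (1000, "TAB", "AGENDA"), (4000, "PLAY", ""), (9000, "STOP", ""),
    (12000, "TAB", "PRESENTATION"), (15000, "SEEK", "TIMELINE"),
]

clicks = Counter(ev for _, ev, _ in events)   # clicks on each type of item
audio_ms, play_start = 0, None
for ts, ev, _ in events:                      # total length of audio played
    if ev == "PLAY":
        play_start = ts
    elif ev == "STOP" and play_start is not None:
        audio_ms += ts - play_start
        play_start = None

print(clicks, f"audio played: {audio_ms / 1000:.0f}s")
```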

Data representation example

Data presentation

                      Time spent answering question     Time playing  Timeline  Artifact
Meeting #  Subject #     Q1      Q2      Q3      Q4        audio        seeks     seeks
    1          1        10:10   12:16    5:33     -        25:40          40        9
    2          2         1:44    3:19    3:00    4:59      13:57          44        0
    2          3         4:20    2:54    5:18    0         10:22          35        3
    3          4         0:45    2:43    2:36    0:59      10:41           2        3
    3          5         2:23    0       0       2:59       3:11           8        2
    4          6         6:13    7:53    2:53   12:16      14:36           9        7
    5          7         4:16    3:14    0:27     -         4:14          21        0
    5          8         8:01    4:41    1:33     -         9:51           2        3
    6          9         4:45    0       0:59    5:59      12:27          53        3
    6         10         3:22    0       1:20    2:00       6:56          15        5
    6         11         0       2:40    1:33    2:12       6:49          19       10
    7          3         3:04    1:35    5:52    2:36      14:13          10       11
    7          6         0       1:03    0       0:29       0:00           0        0
    7         12         NA      NA      NA      NA        23:15          98        5
    8          3         2:00    0:00    1:04    1:13       1:20           8        0
    8         13         2:36    2:13    2:41    2:21       9:03          81        0
    8         14         1:15    4:20    0:00    2:36       3:28          15        3
    9          3         0       5:19    0       2:24       8:26          50        4
    9          6         7:57    0:52    2:13    2:30      12:17          48        1
    9         12         0       5:00    0       0          3:11          33        3
    9         14         7:52    0:49    1:22    0:19       9:32          56        0
    9         15         0       7:42    0       0:53       6:51          36        0
    9         16         1:22    7:04    2:24    1:07      10:31          38        3
    9         17         5:56    6:56    0:56    1:03      11:50         160        4
   10          3        15:07   11:34    4:26     -        31:08          97       14
   10          6         6:57    5:10    2:37     -        19:54          91        4
   10         12         6:06    0       3:38     -        10:14          45       14
   10         13         5:53    5:15    3:06     -        12:25         125        5
   10         14         3:59    5:14    3:50     -         7:27          13       16
   10         15         9:04    5:03    1:54     -        13:17          31       14
   10         17         6:40    7:32    5:06     -        17:46          94       18
   10         18         0       6:57    7:24     -        10:13          15       14

( - = meeting had only three questions)

Average (over all questions):    4:04                      11:05         43.5      5.6
St. dev.:                        2:58                       6:45         39.2      5.4
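The summary row can be computed mechanically from the mm:ss entries; a minimal sketch (using just a few of the values above for brevity):

```python
from statistics import mean, stdev

def to_seconds(mmss: str) -> int:
    """Parse an 'mm:ss' answer time into seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

def to_mmss(seconds: float) -> str:
    return f"{int(seconds // 60)}:{int(seconds % 60):02d}"

# A few of the per-question answer times from the table above
times = [to_seconds(t) for t in ("10:10", "12:16", "5:33", "1:44", "3:19")]
print("mean", to_mmss(mean(times)), "st. dev.", to_mmss(stdev(times)))
```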

[Figures: usage of Agenda, Action Item, Presentation, and Timeline artifacts per session; y-axis: location (min.) in meeting record; x-axis: time (minutes) in session]

Some usability conclusions

Need fast-forward and reverse buttons (minor impact)

Audio too slow to load (minor impact)

Target labels are confusing; need something different that shows dynamics (medium impact)

Need more labeling on the timeline (medium impact)

Need a different place for notes vs. presentations (major impact)

Observing Users

Not as easy as you think

One of the best ways to gather feedback about your interface

Watch, listen and learn as a person interacts with your system

Qualitative & quantitative, end users, experimental or naturalistic

Conducting an Observation

Determine the tasks
Determine what data you will gather
IRB approval if necessary
Recruit participants
Collect the data
Inspect & analyze the data
Draw conclusions to resolve design problems
Redesign and implement the revised interface

Observation

Direct
In same room
Can be intrusive
Users aware of your presence
Only see it one time
May use a 1-way mirror to reduce intrusiveness

Indirect
Video recording
Reduces intrusiveness, but doesn’t eliminate it
Cameras focused on screen, face & keyboard
Gives an archival record, but you can spend a lot of time reviewing it

Location

Observations may be:

In lab – maybe a specially built usability lab
• Easier to control
• Can have user complete a set of tasks

In field
• Watch their everyday actions
• More realistic
• Harder to control other factors

Understanding what you see

In simple observation, you observe actions but don’t know what’s going on in their head

Often utilize some form of verbal protocol where users describe their thoughts

Engaging Users in Evaluation

Qualitative techniques:
Think-aloud – can be very helpful
Post-hoc verbal protocol – review video
Critical incident logging – positive & negative
Structured interviews – good questions
• “What did you like best/least?”
• “How would you change..?”

Identifying errors can be difficult

Verbal Protocol

One technique: think-aloud
The user describes verbally what s/he is thinking and doing
• What they believe is happening
• Why they take an action
• What they are trying to do

Think Aloud

Very widely used, useful technique

Allows you to understand the user’s thought processes better

Potential problems:
Can be awkward for the participant
Thinking aloud can modify the way the user performs the task

Cooperative approach

Another technique: co-discovery learning (constructive interaction)
Join pairs of participants to work together
Use think-aloud
Perhaps have one person be a semi-expert (coach) and one be a novice
More natural (like conversation), so removes some awkwardness of individual think-aloud
Variant: let the coach be from the design team (cooperative evaluation)

Alternative

What if thinking aloud during the session would be too disruptive?

Can use a post-event protocol
The user performs the session, then watches the video afterwards and describes what s/he was thinking

Sometimes difficult to recall
Opens up the door of interpretation

Issues

What if the user gets stuck on a task?
You can ask (in cooperative evaluation):
“What are you trying to do..?”
“What made you think..?”
“How would you like to perform..?”
“What would make this easier to accomplish..?”
Maybe offer hints
This is why cooperative approaches are used

Can provide design ideas

Inputs / Outcomes

Need an operational prototype
Could use a Wizard of Oz simulation

What you get out:
“Process” or “how-to” information
Errors, problems with the interface
Comparison of the user’s (verbalized) mental model to the designer’s intended model

Historical Record

In observing users, how do you capture events in the session for later analysis?

Capturing a Session

1. Paper & pencil
Can be slow
May miss things
Is definitely cheap and easy

Example logging sheet:

Time    Task 1   Task 2   Task 3   …
10:00
10:03
10:08
10:22

Capturing a Session

2. Recording (audio and/or video)
Good for think-aloud
Hard to tie to the interface
Multiple cameras may be needed
Good, rich record of the session
Can be intrusive
Can be painful to transcribe and analyze

Capturing a Session

3. Software logging
Modify the software to log user actions
Can give time-stamped key-press or mouse events

Two problems:
• Too low-level; want higher-level events
• Massive amount of data; need analysis tools

Example logs:

2303761098721869683|hrichter|1098722080134|MV|START|566
2303761098721869683|hrichter|1098722122205|MV|QUESTION|false|false|false|false|false|false|
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098724986904|MV|SEEK|PRESENTATION-A|566|604189|0
2303761098721869683|hrichter|1098724996257|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098724998791|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725003848|MV|SEEK|AGENDA|566|149613|604189
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725007133|MV|SLIDECHANGE|3
2303761098721869683|hrichter|1098725009326|MV|SEEK|PRESENTATION|566|315796|149613
2303761098721869683|hrichter|1098725011569|MV|PLAY|566|315796
2303761098721869683|hrichter|1098725039850|MV|TAB|AV
2303761098721869683|hrichter|1098725054241|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725056053|MV|SLIDECHANGE|2
2303761098721869683|hrichter|1098725057365|MV|SEEK|PRESENTATION|566|271191|315796
2303761098721869683|hrichter|1098725064986|MV|TAB|AV
2303761098721869683|hrichter|1098725083373|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725084534|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725085255|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725088690|MV|TAB|AV
2303761098721869683|hrichter|1098725130500|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725139643|MV|TAB|AV
2303761098721869683|hrichter|1098726430039|MV|STOP|566|271191
2303761098721869683|hrichter|1098726432482|MV|END
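Such logs are exactly where the “need analysis tools” problem bites. Below is a minimal sketch of a tool that rolls the low-level events up into per-session summary counts; the field layout (session|user|timestamp_ms|app|event|args…) is inferred from the example above and may not match the real format exactly.

```python
from collections import Counter

def summarize_log(path: str) -> None:
    """Roll low-level MV events up into per-session summary counts."""
    counts: Counter[str] = Counter()
    first = last = None
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 5:
                continue  # skip malformed records
            _session, _user, ts, _app, event = fields[:5]
            counts[event] += 1
            first = int(ts) if first is None else first
            last = int(ts)
    duration_s = (last - first) / 1000 if first is not None else 0
    print(f"{duration_s:.0f}s of activity; event counts: {dict(counts)}")

summarize_log("mv_session.log")  # hypothetical log file name
```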

Analysis

Many approaches

Task-based:
How do users approach the problem?
What problems do users have?
Need not be exhaustive; look for interesting cases

Performance-based:
Frequency and timing of actions, errors, task completion, etc.

Very time consuming!!

Usability Lab

http://www.surgeworks.com/services/observation_room2.htm

Large viewing area in this one-way mirror, which includes an angled sheet of glass that improves light capture and prevents sound transmission between rooms.

Doors for the participant room and observation rooms are located such that participants are unaware of observers’ movements in and out of the observation room.

Observation Room

State-of-the-art observation room equipped with three monitors to view the participant, the participant’s monitor, and a composite picture-in-picture.

One-way mirror plus angled glass captures light and isolates sound between rooms.

Comfortable and spacious for three people, but room enough for six seated observers.

Digital mixer for unlimited mixing of input images and recording to VHS, SVHS, or MiniDV recorders.