
Page 1: Thesis

ASSESSING INTERACTIVE SYSTEM EFFECTIVENESS WITH USABILITY DESIGN HEURISTICS AND MARKOV MODELS OF USER BEHAVIOR

Presented by: Lashanda Lee

Page 2: Thesis

Motivation

- For HCI to be successful, interfaces must be designed to:
  - Effectively translate the intentions, actions, and inputs of the operator for the computer
  - Effectively translate machine outputs for human comprehension
- Available HCI frameworks can aid in evaluating interface designs and generating subjective data.
- Quantitative objective data is also needed as a basis for cost justification and determination of ROI.
- Modeling human behavior may reduce the need for experimentation, saving time and expense.
- Combining data from OR techniques with subjective data can generate a score for overall system effectiveness, allowing comparison of alternative interface designs.

Page 3: Thesis

Literature Review: HCI frameworks

- Norman's model of HCI
  - Two stages
  - Does not focus on the continuous cycle of communication
- Pipeline model
  - Inputs and outputs of the system operate in parallel
  - Complex model with many states
  - Does not show the cognitive process of the human
  - Explains the computer processing
- Dix et al. model
  - Focuses on the distances between user and system
  - Focuses on the continuous cycle of communication
  - Chosen as the basis for evaluation in the present research

Page 4: Thesis

Literature Review: Usability paradigms and principles

- Paradigms: how humans interact with computers (ubiquitous computing, intelligent systems, virtual reality, and WIMP)
- Principles: how paradigms work (flexibility, consistency, robustness, recoverability, and learnability)
- Each paradigm focuses on different usability principles.
- Specific usability measures can be used to assess certain paradigms.
- Paradigms address the figurative distance of articulation in Dix's model in different ways. Examples:
  - Intelligent interfaces using NLP: greatest reduction in articulation distance for users, but furthest from the system language
  - Command-line interface: farthest from the user in the Dix framework, but easy for the computer to understand
  - WIMP: easy for both the system and the user to interpret, and equal in distance between the user and the system in the Dix interaction framework

Page 5: Thesis

Literature Review: Measures of usability: Qualitative measures

- Subjective data
  - Low cost but low discovery
  - Comparisons of designs based on interface qualities
  - Data hard to analyze
  - May not lead to design changes because management considers the data unreliable
- Inspection methods
  - Low cost and quick discovery of problems using low-skill evaluators
  - Often fail to find many serious problems and do not provide enough evidence to create design recommendations
  - Types include heuristic methods, guidelines, and style and rule inspections
- Verbal reports
  - Hard to find an appropriate way to use the data
  - Provide insight into cognition
- Surveys
  - Inexpensive and help find trouble spots
  - Some information lost due to short-term memory (STM) limitations

Page 6: Thesis

Literature Review: Measures of usability: Quantitative measures

- Used to compare designs based on quantities associated with certain interface features
- Useful in presenting information to management
- Goals may be too ambitious, or there may be too many goals; cannot cover entire systems
- Subjective responses
  - Rankings, ratings, or fuzzy set ratings
  - Considered quantitative because they involve manipulation and analysis of data as a basis for comparing interface alternatives
- Objective responses
  - Measures of effectiveness: binary task completion, number of correct tasks completed, and task performance accuracy
  - Measures of efficiency: task completion time, time in mode, usage patterns, and degree of variation from an optimal solution
  - Fuzzy sets and user modeling: counts of concrete occurrences, not based on the opinions of users

Page 7: Thesis

Literature Review: Quantitative objective measures: Fuzzy sets and user modeling

- Fuzzy sets
  - Used to compare interface alternatives
  - Aggregate score produced based on a count of interface inadequacies
  - Fuzzy set logic used to determine membership for the aggregate score
  - Method uses both subjective and objective measures
  - Requires multiple cycles of user testing to compare scores
  - Does not use variable weights for the dimensions, considering them all equal
- User modeling
  - Used to predict interface action sequences based on prior use data
  - Limited in revealing actual human performance; not exact
  - Can be used to help guide users while performing a task with an interface
  - GOMS: estimates task performance times, produces accurate predictions of user actions, but takes a long time to create
  - Benefits include: models one or more types of users; analysis without additional user testing

Page 8: Thesis

Literature Review: Usability measures: Summary

- Qualitative
  - Used iteratively
  - Low discovery
  - Hard to analyze
  - Usually does not effect change in a display because management considers the data unreliable
- Quantitative
  - Appears better for detailed usability problem analysis and design recommendations
  - User modeling can decrease cost
  - Necessary to gain management support
- Combining an objective quantitative user modeling approach with subjective usability measures may provide:
  - An approach effective in finding problems
  - A basis for interface redesign

Page 9: Thesis

Literature Review: Operations Research methods of usability evaluation

- Use of techniques such as mathematical modeling to analyze complex situations
- Used to optimize systems
- Limited use to date in usability evaluation or interface improvement
- Methods used:
  - Markov models: stochastic processes; used for website customization; predict user behavior (research by Kitajima et al., Thimbleby et al., Jenamani et al.)
  - Probabilistic finite state models: include time distributions and transitional probabilities; generate user behavior predictions (research by Sholl et al.)
  - Critical path models: an algorithm determines the longest time; can also incorporate stochastic process predictions (research by Baber and Mellor, 2001)

Page 10: Thesis

Literature Review: Operations Research methods of usability evaluation: Markov models (Kitajima et al. and Thimbleby et al.)

- Kitajima et al.
  - Markov models used to predict user behavior
  - Determined the number of clicks to find relevant articles
  - After interface improvements, used the model to predict the number of clicks; the number of clicks was reduced
  - Used the equation u(i) = 1 + Σ P_ik u(k)
- Thimbleby et al.
  - Applied Markov chains to several applications: a microwave oven and a cell phone
  - Used Markov chains to predict the number of steps
  - Used a Mathematica simulation of the microwave to gather information
  - Mixed a perfect, error-free matrix with knowledge factors from 0 to 1 (1 being a fully knowledgeable user)
  - Simulated user behavior: the original design took 120 steps for a random user; the improved design took fewer steps for the random user; fewer steps was considered "easier"

Page 11: Thesis

Literature Review: Operations Research methods of usability evaluation: Summary

- Appears to be a viable and useful approach to evaluating interface usability
- Provides objective quantitative data without the need for several iterations of testing
- Used repeatedly to predict behavior, such as number of clicks and task times
- Accurately predicts user behavior

Page 12: Thesis

Summary and Problem Statement

- Need a framework describing communication between humans and computers to guide design improvements (Dix et al. was chosen for its simplicity and cyclic structure)
- Usability paradigms help identify types of technology that can improve systems and provide direction in how to evaluate them; the WIMP paradigm was chosen for its simplicity and its accommodation of both user and system
- Many subjective measures exist, but they are not adequate for assessing performance and supporting design changes
- Objective, quantitative measures often gain the support of management for design changes, but are expensive
- OR methods: Markov models accurately predict human behavior
- Need to define an approach that uses both types of measures to evaluate usability while requiring minimal user testing
- Combined use of Dix et al. model subjective system evaluations and OR modeling techniques to predict user behavior with an interface
- Both methods used to produce an overall system effectiveness score to compare alternative designs

Page 13: Thesis

Method: Overview of system effectiveness score

- Dix et al. framework
  - Survey for designers: captures perceptions of the importance of each link in the HCI framework
  - Survey for users, with a Markov model prediction of the average number of interface actions (clicks): users rated the interfaces with respect to the links in the framework
- Novelty: the measure reflects both the designer's intent for the application and the user's perception of the system
- Designer weights and user ratings are multiplied and summed across links
- The weighted sum is divided by the Markov model prediction of the average number of clicks
- The score represents perceived usability per action

Page 14: Thesis

Method: Weighting factor determination

- Designers were expected to be most concerned with cognitive load
- Four designers surveyed using the Dix et al. framework: based on the paradigm for the application (WIMP), how important is each link to system effectiveness?
- Pair-wise comparisons of links; values ranged between 0 and 0.5
- Weighting factors averaged across designers to determine the weight for each dimension
- Weights were used in calculating the overall subjective system score (designer weights × user ratings)
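The averaging step above can be sketched in a few lines. The weights below are invented for illustration only (the thesis does not list the individual designers' values); each hypothetical designer's weights are assumed already normalized to sum to 1.

```python
# Sketch: averaging designer pairwise-comparison weights across the four
# links of the Dix framework. All numeric values are hypothetical.
links = ["articulation", "performance", "presentation", "observation"]

# One row per designer: normalized weight assigned to each link.
designer_weights = [
    [0.35, 0.15, 0.15, 0.35],
    [0.30, 0.20, 0.20, 0.30],
    [0.40, 0.10, 0.15, 0.35],
    [0.30, 0.20, 0.15, 0.35],
]

# Average across designers to get one weighting factor per dimension.
avg_weights = {
    link: sum(row[i] for row in designer_weights) / len(designer_weights)
    for i, link in enumerate(links)
}
print(avg_weights)
```

With these invented inputs, articulation and observation come out higher than performance and presentation, mirroring the expectation stated above.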

Page 15: Thesis

Method: Experimental task

- Used a version of the Lenovo.com prototype to find and order a ThinkPad R60
- Twenty participants: 11 males, 9 females; age range 17-25
- Half of the participants used the old version of the Lenovo.com website:
  - Required 11 clicks to buy (optimal path)
  - Tabs separated the feature information and the ability to purchase
- Half of the participants used a new prototype:
  - Required 9 clicks to buy (optimal path)
  - All information about the type of computer contained on one page
  - Multi-level navigation structure
  - More salient buttons

Page 16: Thesis

Method: Developing Markov chain models

- JavaScript recorded user actions
- The old online ordering system was used to identify states: links, tabs, and menu options (radio buttons and popups not included)
- Action sequences were used to create transitional probability matrices, based on the actual number of users going from state i to state k
- Assumptions of the Markov model include:
  - The sum of each row must equal 1
  - The probability of the next interface state depends only on the current state
- To determine the average number of clicks to task completion, used Kitajima et al. (2005): u(i) = 1 + Σ P_ik u(k)
- Requires a state probability matrix based on action sequences
- Requires the average number of steps from one state to another (based on designer analysis)
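The Kitajima et al. recursion can be sketched directly from a transition matrix. The 4-state chain below is a toy example, not the thesis data; the state names and values are invented.

```python
# Sketch: average number of clicks to task completion from a transitional
# probability matrix, via u(i) = 1 + sum_k P[i][k] * u(k), with u = 0 at
# the absorbing (task-complete) state.

# States 0-2 are transient; state 3 is absorbing (e.g. "order placed").
P = [
    [0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
ABSORBING = 3

def expected_clicks(P, absorbing, iterations=200):
    """Fixed-point iteration of u = 1 + P u, holding u at 0 for the
    absorbing state."""
    n = len(P)
    u = [0.0] * n
    for _ in range(iterations):
        u = [0.0 if i == absorbing
             else 1.0 + sum(P[i][k] * u[k] for k in range(n))
             for i in range(n)]
    return u

u = expected_clicks(P, ABSORBING)
print(u[0])  # average clicks starting from the entry state
```

For this toy chain the recursion converges quickly because every path reaches the absorbing state; for cyclic matrices more iterations (or a direct linear solve) would be needed.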

Page 17: Thesis

Method: Rating system effectiveness (based on the Dix framework)

- Used the Dix et al. framework
- End users rated the links on a scale from 1 to 10
- The framework was presented at the end of the task
- Average ratings were determined for each link and used in the overall system effectiveness score

Page 18: Thesis

Method: Overall system effectiveness score and Markov model validation

- Overall score
  - Used to compare alternative interface designs
  - Average designer weight for each dimension
  - Average rating by end users
  - The product of the two is the partial score
  - The partial score divided by the predicted average number of clicks is the overall score
  - The highest ratio is considered to indicate higher overall system effectiveness
- Validation
  - T-test used to determine whether the actual observed number of clicks was significantly different from the number of clicks predicted by the Markov model

System Effectiveness:

SE = (a1 × art + a2 × perf + a3 × pres + a4 × obs) / (avg number of clicks)

where

- art = average user rating of the ease of translation of goals into input states
- perf = average user rating of system responsiveness to inputs
- pres = average user rating of system speed and accuracy in presenting output
- obs = average user rating of the ease of translation of outputs
- a1 = average designer weight for the importance of the capability to map psychological intention to input states
- a2 = average designer weight for the importance of the capability to map input options to system functions
- a3 = average designer weight for the importance of the capability to accurately represent system states through output
- a4 = average designer weight for the importance of the capability to map display features to user task requirements
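To make the arithmetic concrete, the sketch below evaluates the score for one interface. The ratings echo the kind of 1-10 link averages gathered from users, but the weights and the click prediction are invented for illustration, so the resulting score is not one of the thesis's reported values.

```python
# Sketch of the system-effectiveness ratio: designer weights times user
# ratings, summed over the four links, divided by the predicted average
# number of clicks. All numbers are illustrative.
weights = {"art": 0.34, "perf": 0.16, "pres": 0.16, "obs": 0.34}
ratings = {"art": 8.4, "perf": 8.1, "pres": 8.0, "obs": 9.1}  # 1-10 scale
predicted_clicks = 9.0

partial = sum(weights[k] * ratings[k] for k in weights)  # weighted sum
se = partial / predicted_clicks  # perceived usability per click
print(round(se, 3))
```

A higher ratio means more perceived usability delivered per interface action, which is what makes scores comparable across alternative designs.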

Page 19: Thesis

Results: Assessment of the Markov model assumption

- The transition from one state must depend only on the current state
- Durbin-Watson test used to assess autocorrelation among user steps in the interaction
- Test statistics were 1.2879 (old) and 2.0815 (new)
- A normalization procedure was applied to the original transitional probability matrices
- Durbin-Watson test conducted on the normalized data: test statistics were 1.3920 (old) and 2.27 (new)
- The test revealed mixed evidence; the model was accepted and applied to predict the average number of clicks
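The Durbin-Watson statistic itself is simple to compute: the sum of squared successive differences of a residual series over its sum of squares, with values near 2 suggesting no first-order autocorrelation. The step sequence below is invented, not the study's click data.

```python
# Sketch: Durbin-Watson statistic for first-order autocorrelation in a
# residual series. Toy data only.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(x ** 2 for x in e)
    return num / den

steps = [3, 1, 4, 1, 5, 9, 2, 6]          # hypothetical per-step values
mean = sum(steps) / len(steps)
residuals = [s - mean for s in steps]      # deviations from the mean
print(round(durbin_watson(residuals), 4))
```

As a sanity check, a perfectly alternating series such as [1, -1, 1, -1] yields the maximum-negative-autocorrelation region (statistic near 4), while strongly positively autocorrelated residuals push the statistic toward 0.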

[Matrix: new P_ik, the 11×11 transitional probability matrix for the new interface]

[Matrix: normalized new P_ik, the 11×11 matrix after the normalization procedure]

[Matrix: existing P_ij, the 14×14 transitional probability matrix for the existing interface]

[Matrix: normalized existing P_ik, the 14×14 matrix after the normalization procedure]

Page 20: Thesis

Results: Computation of average number of steps

[Matrix: new M_ik, the average number of steps between states for the new interface]

- The average number of steps it takes to get from any one state to another
- Represents the individual u(k) terms in the Kitajima et al. equation
- Matrix created by the designers of the interface

[Matrix: existing M_ij, the average number of steps between states for the existing interface]

Page 21: Thesis

Results: Computation of average number of clicks

- Used u(i) = 1 + Σ P_ik u(k)
- Considered paths to the absorbing state to determine the average number of clicks
- The Markov model predicted the number of clicks for each interface: 11.5 for the old (actual 12.9), 9 for the new (actual 9.2)
- T-test used to compare the difference in actual clicks across interfaces: t-value -4.30, p-value 0.0004; the actual number of clicks differed across interfaces, with the new significantly lower
- T-tests used to compare actual click counts to predicted click counts for all subjects: p-value 0.439 for new, 0.0605 for old; no significant difference between actual and predicted on either interface
- T-test used to compare predicted clicks across interfaces: p-value 0.0033; the new interface reduced the number of clicks
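The comparisons above rely on standard two-sample t-tests. A bare-bones pooled-variance version is sketched below; the two click samples are invented stand-ins, not the study's recorded data, and a real analysis would also compute the p-value from the t distribution.

```python
# Sketch: pooled-variance two-sample t statistic for comparing click
# counts across two interfaces. Toy samples only.
import math

def t_statistic(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

old_clicks = [12, 14, 13, 12, 15, 11, 13, 14, 12, 13]
new_clicks = [9, 10, 9, 8, 10, 9, 9, 10, 9, 9]
print(round(t_statistic(old_clicks, new_clicks), 2))
```

A large positive t here would indicate the first sample's mean is well above the second's relative to the pooled spread; the sign flips if the samples are swapped.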

Page 22: Thesis

Results: Partial system effectiveness score

- Each participant rated the interfaces on each dimension using a scale of 1 to 10
- Designers completed pair-wise comparisons
- Designers were expected to rate articulation and observation higher
- T-test used to compare designer ratings of articulation and observation with performance and presentation: articulation and observation were rated higher
- Average designer weights were multiplied by average user ratings
- T-test used to compare the partial score of the new interface against the old for all subjects: t-value 5.08, p-value < .0001; the partial score for the new interface was significantly higher

Average user ratings:

              Articulation  Observation  Presentation  Performance
New           8.4           9.1          8.0           8.1
Existing      4.8           4.9          7.2           6.4

p-values (designer weight comparisons):

              Performance   Presentation
Articulation  p = 0.0004    p = 0.0013
Observation   p = 0.0013    p = 0.0055

Page 23: Thesis

Results: Overall system effectiveness score

- The partial score was divided by the predicted average number of clicks to yield perceived usability per click: 0.939 (new), 0.475 (old)
- T-test used to compare the overall score for the new and old interfaces for all subjects: t-value 5.62, p-value < .0001; the overall system effectiveness score for the new interface was significantly higher than for the old

Page 24: Thesis

Results: Reducing experimentation

- The purpose of the Markov model was to predict the number of clicks and to reduce the need for additional user testing
- Designers can estimate an average number of steps to transition among states in the new interface and multiply by the probabilities determined for the original interface (through user testing)
- The predicted number of clicks for the new interface was 9.35 (actual 9.2)
- T-test used to determine whether the actual number of clicks differed from the predicted number: t-value 1.15, p-value 0.270; the Markov model was accurate in predicting the average number of clicks
- To obtain user ratings, focus groups would still be necessary
- The approach significantly reduces the time and money required for user testing
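The reuse idea above, combining probabilities measured on the original interface with designer-estimated step counts for the redesign, can be sketched as a weighted-steps variant of the Kitajima et al. recursion. The per-transition relation u(i) = Σ_k P[i][k] × (M[i][k] + u(k)) and every number below are illustrative assumptions, not the thesis's exact formulation or data.

```python
# Sketch: predicted effort for a redesigned interface, reusing old-interface
# transition probabilities P with designer-estimated step counts M for the
# new design. All values are hypothetical.
P = [            # probabilities measured on the original interface
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
M = [            # designer-estimated steps per transition in the new design
    [0, 1, 2],
    [0, 0, 1],
    [0, 0, 0],
]
ABSORBING = 2    # task-complete state

u = [0.0] * 3
for _ in range(50):  # fixed-point iteration; converges for this toy chain
    u = [0.0 if i == ABSORBING
         else sum(P[i][k] * (M[i][k] + u[k]) for k in range(3))
         for i in range(3)]
print(u[0])  # predicted average steps from the entry state
```

The point of the exercise is that only M changes between designs, so a redesign can be evaluated without rerunning the user study that produced P.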

Page 25: Thesis

Discussion: Designer ratings

- Hypothesis: average designer weighting factors for articulation and observation will be higher than those for performance and presentation
- Designers were concerned with cognitive load, as represented by articulation and observation
- If a customer cannot find what he or she is looking for, it may lead to frustration, lost customers, and lost revenue
- Designers realize that effectively reducing cognitive load is important

Page 26: Thesis

Discussion: Improved usability

- Hypothesis: the new interface will improve perceived usability
- Multi-level navigation was used to reduce cognitive load:
  - Easier to find and view all options
  - Users could reach many states with one click
  - Identified by users of the new interface as one of its most usable features
- More prominent buttons:
  - Aided in easily identifying next steps
  - In the original interface, users had a difficult time finding the customize button and often scrolled up and down the page or backtracked to determine what to do next
- The partial system effectiveness score was higher for the new interface (8.6) than for the old (5.2)

Page 27: Thesis

Discussion: Higher system effectiveness score

- Hypothesis: the new interface will produce a higher score because of its higher perceived usability
- The old interface degraded performance:
  - From the features tab, some users found it difficult to identify what to do next
  - Once users found the product tab, some scrolled up and down trying to determine what to do next
  - The new interface alleviated both problems by putting all information on one page
- Higher perceived usability and fewer clicks led to a higher ratio

Page 28: Thesis

Discussion: Markov model accurately predicted the average number of clicks

- Hypothesis: the Markov model will accurately predict the average number of clicks, using the equation detailed by Kitajima
- Because Markov models represent stochastic behavior, they proved valid in the present work
- The model revealed the variability among participants but does not show the exact magnitude of the error

[Chart: Existing interface, actual vs. predicted number of clicks for participants 1 through 10]

[Chart: New interface, actual vs. predicted number of clicks for participants 1 through 10]

Page 29: Thesis

Conclusion

- The objective was to create a new measure of usability, based on:
  - Few quantitative objective measures existing
  - Many subjective measures being insufficient to justify design changes
- The research supports a subjective measure using the Dix et al. framework and an objective measure based on Markov models
- The method is:
  - Effective in objectively selecting among alternative designs and reducing the amount of experimentation necessary
  - Easy to implement
  - Usable with several alternatives without the need for additional testing
  - Not applicable to interfaces where the selection of the next state depends on previous states rather than only the current state
- Future research:
  - Use Markov models to predict the next steps a user will take and make the relevant interface options more salient to improve usability
  - Find a way to incorporate time-on-task in the overall effectiveness score: perceived time-on-task will impact customer retention; research a method to accurately predict it