Utility of Human-Computer Interactions: Toward a Science of Preference Measurement

Michael Toomim, Travis Kriplean, Claus Pörtner, and James A. Landay

University of Washington, dub Group. CHI 2011

Discretionary Use of Interfaces

• The CHI research community grew out of the discretionary use of computer interfaces (starting in the 1980s), that is, free choice: people choose which interfaces to use to accomplish their tasks
• Now the task itself (and its goal) is also a choice, e.g., blogs, web browsing, SNS, Wikipedia, and ubiquitous applications (e.g., smartphones, Nike+iPod)
• Widely accepted evaluation metrics in CHI research:
– They give only an indirect prediction of whether an interface will be preferred over the alternatives
– Examples: time-on-task, number of errors, subjective interpretation of think-aloud sessions, survey reports

Evaluating “User Choices”

• Industry: A/B testing (split testing, bucket testing)
– A marketing test method in which multiple versions of one element are tested against a metric to determine which is more successful
– The versions are tested simultaneously to determine which is better
– Conversions are measured across the different sets of users (between-subjects)
• Yet A/B testing is challenging: it requires a large up-front investment and a large existing user base to deploy and test on (say, thousands of people)

[Figure: Control (baseline) vs. Treatment A vs. Treatment B, compared with a statistical significance test (e.g., t-test or chi-square). Sample size matters.]
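To make the "sample size matters" point concrete, here is a minimal sketch of a chi-square test on a 2x2 conversion table, written with the standard library only. The conversion counts are invented for illustration and the function name `chi_square_2x2` is hypothetical; this is not the analysis used in the talk.

```python
import math

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    conv_a / conv_b: number of conversions in each variant (hypothetical data).
    n_a / n_b: number of users shown each variant.
    Assumes at least one conversion and one non-conversion overall,
    otherwise an expected count would be zero.
    """
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Same conversion rates (10% vs. 15%), different sample sizes.
small = chi_square_2x2(10, 100, 15, 100)
large = chi_square_2x2(1000, 10000, 1500, 10000)
print(small, large)
```

With 100 users per variant the difference is not significant at the 0.05 level; with 10,000 users per variant the same rates are clearly separated, which is why A/B testing needs a large user base.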

Measuring User’s Preference

• Proposal: a semi-automated approach
– Post thousands of “interface test tasks” to Amazon Mechanical Turk (M-Turk)
– Observe how workers choose to complete the tasks (and how many times they do so)
– Analyze the data to measure preference
• How?

Example: Fitts’ law test

• Fitts’ law models the time required to click a widget of a given width at a given distance; the technique proposed here can model how much people prefer to use a widget
• Difficulty = f(width, distance)
• For a given job, subjects are asked to click on a blue rectangle 60 times
• Each time they click on the bar, it moves to the opposite side of the screen

[Figure: a bar of a given width at a given distance; on each click the bar moves to the other side.]
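The slide only states that difficulty is some function of width and distance. As an illustration, the sketch below uses the common Shannon formulation of the Fitts index of difficulty, ID = log2(distance / width + 1); choosing this particular form is an assumption, and the coefficients `a` and `b` are hypothetical placeholders that would normally be fitted to data.

```python
import math

def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts' index of difficulty, in bits (assumed form)."""
    return math.log2(distance / width + 1.0)

def predicted_movement_time(distance, width, a, b):
    """Fitts' law: time = a + b * ID. The values of a and b are placeholders."""
    return a + b * index_of_difficulty(distance, width)

# A wide bar is easier to hit than a narrow bar at the same distance.
easy = index_of_difficulty(distance=700, width=100)
hard = index_of_difficulty(distance=700, width=10)
print(easy, hard)
```

Holding distance fixed and shrinking the bar raises the index of difficulty, which is one way to produce easy, medium, and hard conditions.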

Example: Fitts’ law test

• Participants were assigned to one of three index-of-difficulty conditions
• Participants were allowed a maximum of 3,060 clicks each
• Participants preferred big buttons to small buttons (p < 0.10)

[Figure: each point is the number of clicks a participant completed before quitting (points jittered to show spread). The regression line accounts for the 3,060-click maximum using a Tobit analysis.]
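A Tobit analysis treats participants who hit the click limit as censored: we only know they would have clicked at least that many times. The sketch below shows a right-censored Tobit log-likelihood for a single predictor, assuming normally distributed errors. It is an illustration of the idea only; the data values, parameter values, and function names are invented, and the authors' actual model specification is not given in the slides.

```python
import math

def normal_logpdf(z):
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def normal_sf(z):
    """Survival function P(Z > z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def tobit_loglik(xs, ys, intercept, slope, sigma, upper):
    """Log-likelihood of a Tobit model with right-censoring at `upper`."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = intercept + slope * x
        if y >= upper:
            # Censored: the latent value is only known to be >= upper.
            total += math.log(normal_sf((upper - mu) / sigma))
        else:
            # Uncensored: ordinary normal regression term.
            total += normal_logpdf((y - mu) / sigma) - math.log(sigma)
    return total

# Hypothetical data: difficulty level vs. clicks before quitting (cap 3,060).
difficulty = [1, 1, 2, 2, 3, 3]
clicks = [3060, 3060, 1800, 3060, 600, 1200]

ll_decreasing = tobit_loglik(difficulty, clicks, 4500.0, -1200.0, 800.0, 3060)
ll_flat_low = tobit_loglik(difficulty, clicks, 500.0, 0.0, 800.0, 3060)
print(ll_decreasing, ll_flat_low)
```

A fitting routine would maximize this function over the intercept, slope, and sigma; here it is only evaluated at two hand-picked parameter sets to show that a decreasing line explains the invented data better than a flat, low one.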

Utility

• Utility in economics:
– The degree to which a person prefers a particular choice among the options available
• When a user chooses to use system A instead of B, we say that Utility(A) > Utility(B)
• Use economic utility to quantify aggregate user preference
– Example: a user has no preference between (1) being paid $0.25 for using system A and (2) being paid $0.50 for using system B
– Money metric of utility: |Utility(A) - Utility(B)| = $0.25
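The arithmetic behind the money metric can be written out as a small sketch. It assumes that money adds linearly to utility, so indifference means Utility(A) + pay_A = Utility(B) + pay_B; that linearity assumption and the function name `utility_difference` are illustrative, not taken from the slides.

```python
def utility_difference(pay_a, pay_b):
    """Return Utility(A) - Utility(B) in dollars at the indifference point.

    Assumes utility is linear in money, so that
    Utility(A) + pay_a == Utility(B) + pay_b when the user is indifferent.
    """
    return pay_b - pay_a

# The example from the slide: $0.25 for system A vs. $0.50 for system B.
diff = utility_difference(0.25, 0.50)
preferred = "A" if diff > 0 else "B" if diff < 0 else "neither"
print(diff, preferred)  # 0.25 A
```

The system that needs less pay to reach indifference is the preferred one, and the pay gap is the size of that preference in dollars.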

Measuring Utility

• Utility = f(task, interface, context)
– A user finds value in completing a task, but must take some actions with a computer through some interface
– The user’s context also matters (e.g., demographics, social and moral status)
• Preference measurement begins with determining how much you must pay people to convince them to use an interface for a task

Measuring Utility

• Reservation wage: the wage below which a worker will not take a task

• Present a worker with a job at a price and observe their behavior: the worker will either complete a task at a given price or not

• Gather/analyze all the data: (Interface ID, Worker ID, Wage, Number of Completions)
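A minimal sketch of how the (Interface ID, Worker ID, Wage, Number of Completions) records might be stored and summarized. The sample records and the `acceptance_rate` helper are hypothetical; the summary shown (share of workers who completed at least one task at each wage) is one simple way to look at reservation wages, not necessarily the analysis used in the study.

```python
from collections import defaultdict
from typing import NamedTuple

class Observation(NamedTuple):
    interface_id: str
    worker_id: str
    wage: float        # price offered per task, in dollars
    completions: int   # number of tasks the worker completed

def acceptance_rate(observations):
    """Fraction of workers completing at least one task, per (interface, wage)."""
    offered = defaultdict(int)
    accepted = defaultdict(int)
    for obs in observations:
        key = (obs.interface_id, obs.wage)
        offered[key] += 1
        if obs.completions > 0:
            accepted[key] += 1
    return {key: accepted[key] / offered[key] for key in offered}

# Invented records for illustration only.
data = [
    Observation("big_button", "w1", 0.01, 0),
    Observation("big_button", "w2", 0.01, 4),
    Observation("big_button", "w3", 0.06, 12),
    Observation("small_button", "w4", 0.01, 0),
    Observation("small_button", "w5", 0.01, 0),
    Observation("small_button", "w6", 0.06, 7),
]
rates = acceptance_rate(data)
print(rates)
```

A worker who completes nothing at a given price has a reservation wage above that price for that interface, so acceptance rates that differ across interfaces at the same wage point to a difference in utility.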

Measuring Utility

• Post all scenarios/conditions to M-Turk simultaneously
• Handle selection bias via a mystery task with a “??? price”
• Set a limit on the number of sub-tasks that a single worker can complete (e.g., 50)
• Handle market price fluctuations (as people like to take high-paying tasks)

Fitts’ Law Study

• Subjects clicked on a blue rectangle 60 times
• Each time they clicked on the bar, it moved to the opposite side of the screen
• Difficulty = f(width, distance)
• Price range: $0.01-$0.06
• Difficulty: easy, medium, hard
• Each task: 60 clicks
• Upper limit on number of tasks: 51
• 5 hours 15 minutes, $970

Aesthetics: CAPTCHAs

• A survival graph shows how many workers made it through how many tasks, for each of the four experimental conditions
• The pretty and ugly lines are separated at the left but converge toward the right
– This suggests either that the utility effect of aesthetics fades over time, or that the types of users who complete many CAPTCHAs are more concerned with pay than aesthetics
• The shaded regions are 95% confidence intervals
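A survival curve of this kind can be computed directly from per-worker completion counts. The sketch below uses invented counts for two conditions and a hypothetical helper `survival_curve`; it leaves out the confidence intervals.

```python
def survival_curve(tasks_completed, max_tasks):
    """Fraction of workers who completed at least k tasks, for k = 1..max_tasks."""
    n = len(tasks_completed)
    return [
        sum(1 for t in tasks_completed if t >= k) / n
        for k in range(1, max_tasks + 1)
    ]

# Invented completion counts per worker, for illustration only.
pretty = [1, 2, 5, 5, 3]
ugly = [1, 1, 2, 5, 5]

pretty_curve = survival_curve(pretty, max_tasks=5)
ugly_curve = survival_curve(ugly, max_tasks=5)
print(pretty_curve)  # [1.0, 0.8, 0.6, 0.4, 0.4]
print(ugly_curve)    # [1.0, 0.6, 0.4, 0.4, 0.4]
```

In this toy data the two curves differ for small task counts and coincide for large ones, the same pattern of early separation and later convergence described above.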
