Comparison: GWAP vs Mechanical Turk

DESCRIPTION
An experiment to replicate GWAP-centered human computation using microtasks

TRANSCRIPT
Comparing human computation services
Elena Simperl (University of Southampton)
Human computation
• Outsourcing tasks that machines find difficult to solve to humans (accuracy, efficiency, costs)
Dimensions of human computation
• What is outsourced
  – Tasks that require human skills that cannot be easily replicated by machines (visual recognition, language understanding, knowledge acquisition, basic human communication etc.)
  – Sometimes only certain steps of a task are outsourced to humans; the rest is executed automatically
• How is the task being outsourced
  – Tasks broken down into smaller units undertaken in parallel by different people
  – Coordination required to handle cases with more complex workflows
  – Partial or independent answers consolidated and aggregated into a complete solution
See also [Quinn & Bederson, 2012]
Dimensions of human computation (2)
• How are the results validated
  – Solutions space closed (choice of a correct answer) vs open (collection of potential solutions)
  – Performance objectively measured or through ratings/votes
  – Statistical techniques employed to predict accurate solutions
    • May take into account confidence values of algorithmically generated solutions
• How can the overall process be optimized
  – Incentives and motivators (altruism, entertainment, intellectual challenge, social status, competition, financial compensation)
  – Assigning tasks to people based on their skills and performance (as opposed to random assignments)
  – Symbiotic combinations of human- and machine-driven computation, including combinations of different forms of crowdsourcing
See also [Quinn & Bederson, 2012]
Games with a purpose (GWAP)
• Human computation disguised as casual games
• Tasks are divided into parallelizable atomic units (challenges) solved (consensually) by players
• Game models
  – Single vs multi-player
  – Selection agreement vs input agreement vs inversion-problem games
See also [von Ahn & Dabbish, 2008]
Dimensions of GWAP design
• What tasks are amenable to 'GWAP-ification'
  – Work is decomposable into simpler (nested) tasks
  – Performance is measurable according to an obvious rewarding scheme
  – Skills can be arranged in a smooth learning curve
  – Player retention vs repetitive tasks
• Note: Not all domains are equally appealing
  – Application domain needs to attract a large user base
  – Knowledge corpus has to be large enough to avoid repetitions
  – Quality of automatically computed input may hamper the game experience
• Attracting and retaining players
  – You need a critical mass of players to validate the results
  – Advertisement, building upon an existing user base
  – Continuous development
Microtask crowdsourcing
• Similar types of tasks, but a different incentives model (monetary reward)
• Successfully applied to transcription, classification, content generation, data collection, image tagging, website feedback, usability tests…
Our experiment
• Goals
  – Compare the two approaches for a given task (ontology engineering)
  – More generally: a description framework to compare different human computation models and use them in combination
• Set-up
  – Re-build OntoPronto within Amazon's Mechanical Turk, based on existing OntoPronto data
OntoPronto
• Goal: extend the Proton upper-level ontology
• Multi-player (single player using pre-recorded rounds)
  – Step 1: the topic of a Wikipedia article is classified as a class or an instance
  – Step 2: browsing the Proton hierarchy from the root to identify the most specific class that matches the topic of the article
• Consensual answers, additional points for more specific classes
Validation of players' inputs
• A topic is played at least six times
• The number of consensual answers to each question is at least four
• The number of consensual answers, taking player reliability into account, exceeds half of the total number of answers received
  – Reliability measures the relation between the consensual and the correct answers given by a player
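The rule above can be read as a simple filter over all answers collected for a topic. The following Python sketch is illustrative only: the function name, the data layout and the reading of "taking reliability into account" as a per-player weight are assumptions, not the OntoPronto implementation.

from collections import Counter

def is_validated(answers, min_rounds=6, min_consensus=4):
    """answers: list of (player_reliability, answer) pairs collected for one topic."""
    # A topic must have been played at least six times.
    if len(answers) < min_rounds:
        return False, None

    # At least four players must agree on the same answer.
    counts = Counter(answer for _, answer in answers)
    consensus_answer, consensus_count = counts.most_common(1)[0]
    if consensus_count < min_consensus:
        return False, None

    # Weight each consensual answer by the reliability of the player who
    # gave it and require the weighted mass to exceed half of all answers.
    weighted = sum(rel for rel, answer in answers if answer == consensus_answer)
    if weighted <= len(answers) / 2:
        return False, None

    return True, consensus_answer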
Evaluation and collected data
• 270 distinct players, 365 Wikipedia articles, 2905 game rounds
• The approach is effective
  – 77% of challenges solved consensually
  – When there is agreement, most answers are correct (97%)
• …and efficient
  – 122 classes and entities extending Proton (after validation)
Implementation through MTurk
• Server-side component
  – Generates new HITs
  – Evaluates assignments of existing HITs
• Two types of HITs
  – Class or instance (1 cent)
  – Proton class (5 cents)
• HITs generated using the article title, first paragraph and first image (if available)
• Qualification test with five questions; turkers with at least 90% accepted tasks
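As a rough illustration of what such a server-side HIT generation could look like today, the sketch below uses the boto3 MTurk client. The original 2012 experiment predates boto3, so the sandbox endpoint, the external question URL, the assignment count and the other parameter choices here are assumptions for illustration, not the actual implementation; the five-question qualification test would additionally be a custom qualification type, which is omitted here.

import boto3

# Hedged sketch: HIT generation against the MTurk sandbox with boto3.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

def external_question(url, height=600):
    # MTurk expects the task UI wrapped in an ExternalQuestion XML payload.
    return (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{url}</ExternalURL>"
        f"<FrameHeight>{height}</FrameHeight>"
        "</ExternalQuestion>"
    )

def create_hit(article_id, hit_type, max_assignments):
    # Two HIT types as on the slide: class-vs-instance (1 cent) and
    # Proton-class selection (5 cents).
    reward = "0.01" if hit_type == "class_or_instance" else "0.05"
    return mturk.create_hit(
        Title=f"OntoPronto: {hit_type} for Wikipedia article {article_id}",
        Description="Classify a Wikipedia topic against the Proton ontology",
        Reward=reward,
        MaxAssignments=max_assignments,
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=86400,
        Question=external_question(f"https://example.org/hits/{article_id}"),
        QualificationRequirements=[{
            # System qualification: percentage of approved assignments >= 90.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [90],
        }],
    )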
Implementation through MTurk (2)
• Multiple assignments per HIT; four consensual answers needed
  – Assignments per HIT: (number of answers needed for consensus - 1) x (number of available answer options) + 1, which guarantees by the pigeonhole principle that some answer reaches the consensus threshold
• HITs with (four) consensual answers are considered completed
• Assignments matching the consensus are accepted
• A HIT costs at most (number of answers needed for consensus) x (reward per correct assignment)
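To make these bounds concrete, a small worked example with the numbers from the slides (binary class-vs-instance HIT, 1-cent reward, consensus threshold of four); the function names are illustrative.

def assignments_per_hit(n_options, consensus=4):
    # Pigeonhole bound: with this many assignments, at least one answer
    # option must have been chosen `consensus` times.
    return (consensus - 1) * n_options + 1

def max_cost(consensus, reward):
    # Only assignments matching the consensus are paid, and the HIT is
    # closed once the consensus threshold is reached.
    return consensus * reward

print(assignments_per_hit(n_options=2))      # 7 assignments for the binary HIT
print(max_cost(consensus=4, reward=0.01))    # at most $0.04 for the binary HIT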
Development time and costs per contribution
• OntoPronto: five development months
• MTurk: one month
  – Additional effort required because of the setting of the experiment
  – Less effort as the HIT design and validation mechanisms were adopted from OntoPronto
• Average cost for a correct answer on MTurk: $0.74
Quality of contributions
• Both approaches resulted in high-quality data
• Diversity and biases (270 players vs 16 turkers)
  – Additional functionality of MTurk
• Game-based approach is economic in the long run if a player retention strategy is available
• Microtask-based approach uses a 'predictable' motivation framework
• MTurk less diverse (270 players vs 16 turkers)
Challenges and open questions
• Synchronous vs asynchronous modes of interaction
  – Consensual answers, ratings by other turkers?
• Executing inter-dependent tasks in MTurk
  – Mapping game steps into HITs
  – Grouping HITs
• Using game-like interfaces within microtask crowdsourcing platforms
  – Impact on incentives and turkers' behavior?
• Using MTurk to test GWAP design decisions
Challenges and open questions (2)
• Descriptive framework for the classification of human computation systems
  – Types of tasks and their mode of execution
  – Participants and their roles
  – Interaction with the system and among participants
  – Validation of results
  – Consolidation and aggregation of inputs into a complete solution
• Reusable collection of algorithms for quality assurance, task assignment, workflow management, results consolidation etc.
• Schemas recording the provenance of crowdsourced data
S. Thaler, E. Simperl, S. Wölger. An experiment in comparing human computation techniques. IEEE Internet Computing, 16(5): 52-58, 2012.
For more information email: [email protected]
twitter: @esimperl
Theory and practice of social machines
http://sociam.org/www2013/
Deadline: 25.02.2013