Comparison: GWAP vs Mechanical Turk

DESCRIPTION
An experiment to replicate GWAP-centered human computation using microtasks

TRANSCRIPT
Comparing human computation services
Elena Simperl (University of Southampton)
Human computation
• Outsourcing tasks that machines find difficult to solve to humans (accuracy, efficiency, costs)
Dimensions of human computation
• What is outsourced
  – Tasks that require human skills that cannot be easily replicated by machines (visual recognition, language understanding, knowledge acquisition, basic human communication etc.)
  – Sometimes only certain steps of a task are outsourced to humans; the rest is executed automatically
• How is the task being outsourced
  – Tasks broken down into smaller units undertaken in parallel by different people
  – Coordination required to handle cases with more complex workflows
  – Partial or independent answers consolidated and aggregated into a complete solution
See also [Quinn & Bederson, 2012]
Dimensions of human computation (2)
• How are the results validated
  – Solutions space closed (choice of a correct answer) vs open (collection of potential solutions)
  – Performance objectively measured or through ratings/votes
  – Statistical techniques employed to predict accurate solutions
    • May take into account confidence values of algorithmically generated solutions
• How can the overall process be optimized
  – Incentives and motivators (altruism, entertainment, intellectual challenge, social status, competition, financial compensation)
  – Assigning tasks to people based on their skills and performance (as opposed to random assignments)
  – Symbiotic combinations of human- and machine-driven computation, including combinations of different forms of crowdsourcing
See also [Quinn & Bederson, 2012]
Games with a purpose (GWAP)
• Human computation disguised as casual games
• Tasks are divided into parallelizable atomic units (challenges) solved (consensually) by players
• Game models
  – Single vs multi-player
  – Selection agreement vs input agreement vs inversion-problem games
See also [von Ahn & Dabbish, 2008]
Dimensions of GWAP design
• What tasks are amenable to 'GWAP-ification'
  – Work is decomposable into simpler (nested) tasks
  – Performance is measurable according to an obvious rewarding scheme
  – Skills can be arranged in a smooth learning curve
  – Player retention vs repetitive tasks
• Note: Not all domains are equally appealing
  – Application domain needs to attract a large user base
  – Knowledge corpus has to be large enough to avoid repetitions
  – Quality of automatically computed input may hamper the game experience
• Attracting and retaining players
  – You need a critical mass of players to validate the results
  – Advertisement, building upon an existing user base
  – Continuous development
Microtask crowdsourcing
• Similar types of tasks, but a different incentives model (monetary reward)
• Successfully applied to transcription, classification, content generation, data collection, image tagging, website feedback, usability tests…
Our experiment
• Goals
  – Compare the two approaches for a given task (ontology engineering)
  – More generally: a description framework to compare different human computation models and use them in combination
• Set-up
  – Re-build OntoPronto within Amazon's Mechanical Turk, based on existing OntoPronto data
OntoPronto
• Goal: extend the Proton upper-level ontology
• Multi-player (single player using pre-recorded rounds)
  – Step 1: the topic of a Wikipedia article is classified as a class or an instance
  – Step 2: browsing the Proton hierarchy from the root to identify the most specific class that matches the topic of the article
• Consensual answers, additional points for more specific classes
Validation of players' inputs
• A topic is played at least six times
• The number of consensual answers to each question is at least four
• The number of consensual answers, taking player reliability into account, exceeds half of the total number of answers received
  – Reliability measures the relation between the consensual and the correct answers given by a player
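The rule above can be read as a simple filter over all answers collected for a topic. The following Python sketch is illustrative only: the function name, the data layout and the reading of "taking reliability into account" as a per-player weight are assumptions, not the OntoPronto implementation.

from collections import Counter

def is_validated(answers, min_rounds=6, min_consensus=4):
    """answers: list of (player_reliability, answer) pairs collected for one topic."""
    # A topic must have been played at least six times.
    if len(answers) < min_rounds:
        return False, None

    # At least four players must agree on the same answer.
    counts = Counter(answer for _, answer in answers)
    consensus_answer, consensus_count = counts.most_common(1)[0]
    if consensus_count < min_consensus:
        return False, None

    # Weight each consensual answer by the reliability of the player who
    # gave it and require the weighted mass to exceed half of all answers.
    weighted = sum(rel for rel, answer in answers if answer == consensus_answer)
    if weighted <= len(answers) / 2:
        return False, None

    return True, consensus_answer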
Evaluation and collected data
• 270 distinct players, 365 Wikipedia articles, 2905 game rounds
• The approach is effective
  – 77% of challenges solved consensually
  – When there is agreement, most answers are correct (97%)
• …and efficient
  – 122 classes and entities extending Proton (after validation)
Implementation through MTurk
• Server-side component
  – Generates new HITs
  – Evaluates assignments of existing HITs
• Two types of HITs
  – Class or instance (1 cent)
  – Proton class (5 cents)
• HITs generated using the article title, first paragraph and first image (if available)
• Qualification test with five questions; turkers with at least 90% accepted tasks
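As a rough illustration of what such a server-side HIT generation could look like today, the sketch below uses the boto3 MTurk client. The original 2012 experiment predates boto3, so the sandbox endpoint, the external question URL, the assignment count and the other parameter choices here are assumptions for illustration, not the actual implementation; the five-question qualification test would additionally be a custom qualification type, which is omitted here.

import boto3

# Hedged sketch: HIT generation against the MTurk sandbox with boto3.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

def external_question(url, height=600):
    # MTurk expects the task UI wrapped in an ExternalQuestion XML payload.
    return (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{url}</ExternalURL>"
        f"<FrameHeight>{height}</FrameHeight>"
        "</ExternalQuestion>"
    )

def create_hit(article_id, hit_type, max_assignments):
    # Two HIT types as on the slide: class-vs-instance (1 cent) and
    # Proton-class selection (5 cents).
    reward = "0.01" if hit_type == "class_or_instance" else "0.05"
    return mturk.create_hit(
        Title=f"OntoPronto: {hit_type} for Wikipedia article {article_id}",
        Description="Classify a Wikipedia topic against the Proton ontology",
        Reward=reward,
        MaxAssignments=max_assignments,
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=86400,
        Question=external_question(f"https://example.org/hits/{article_id}"),
        QualificationRequirements=[{
            # System qualification: percentage of approved assignments >= 90.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [90],
        }],
    )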
Implementation through MTurk (2)
• Multiple assignments per HIT; four consensual answers needed
  – Assignments per HIT: (number of answers needed for consensus - 1) x (number of available answer options) + 1, which guarantees by the pigeonhole principle that some answer reaches the consensus threshold
• HITs with (four) consensual answers are considered completed
• Assignments matching the consensus are accepted
• A HIT costs at most (number of answers needed for consensus) x (reward per correct assignment)
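To make these bounds concrete, a small worked example with the numbers from the slides (binary class-vs-instance HIT, 1-cent reward, consensus threshold of four); the function names are illustrative.

def assignments_per_hit(n_options, consensus=4):
    # Pigeonhole bound: with this many assignments, at least one answer
    # option must have been chosen `consensus` times.
    return (consensus - 1) * n_options + 1

def max_cost(consensus, reward):
    # Only assignments matching the consensus are paid, and the HIT is
    # closed once the consensus threshold is reached.
    return consensus * reward

print(assignments_per_hit(n_options=2))      # 7 assignments for the binary HIT
print(max_cost(consensus=4, reward=0.01))    # at most $0.04 for the binary HIT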
Development time and costs per contribution
• OntoPronto: five development months
• MTurk: one month
  – Additional effort required because of the setting of the experiment
  – Less effort as the HIT design and validation mechanisms were adopted from OntoPronto
• Average cost for a correct answer on MTurk: $0.74
Quality of contributions
• Both approaches resulted in high-quality data
• Diversity and biases (270 players vs 16 turkers)
  – Additional functionality of MTurk
• Game-based approach is economic in the long run if a player retention strategy is available
• Microtask-based approach uses a 'predictable' motivation framework
• MTurk less diverse (270 players vs 16 turkers)
Challenges and open questions
• Synchronous vs asynchronous modes of interaction
  – Consensual answers, ratings by other turkers?
• Executing inter-dependent tasks in MTurk
  – Mapping game steps into HITs
  – Grouping HITs
• Using game-like interfaces within microtask crowdsourcing platforms
  – Impact on incentives and turkers' behavior?
• Using MTurk to test GWAP design decisions
Challenges and open questions (2)
• Descriptive framework for the classification of human computation systems
  – Types of tasks and their mode of execution
  – Participants and their roles
  – Interaction with the system and among participants
  – Validation of results
  – Consolidation and aggregation of inputs into a complete solution
• Reusable collection of algorithms for quality assurance, task assignment, workflow management, results consolidation etc.
• Schemas recording the provenance of crowdsourced data
S. Thaler, E. Simperl, S. Wölger. An experiment in comparing human computation techniques. IEEE Internet Computing, 16(5): 52-58, 2012.
For more information email: [email protected]
twitter: @esimperl
Theory and practice of social machines
http://sociam.org/www2013/
Deadline: 25.02.2013