Crowdsourcing for NLP Using Amazon Mechanical Turk and CrowdFlower
Matteo Negri and Yashar Mehdad
Crowdsourcing
• Wikipedia: Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to a large group of people or community (a crowd), through an open call.
Crowdsourcing services
• Web and logo design: 99designs (>72000 designers, from $150)
• Brand names: namethis ($99 for the best 3 names after a 48 hour contest/voting session)
• Business innovation: Chaordix (engage the crowd via the web to “submit, discuss, refine and rank ideas…”)
• Advertising: Poptent (“connects video creators with Top Brands…”)
• Software & usability testing: uTest (>18000 professionals to test Web, mobile, gaming and desktop apps)
• Brainstorming / feedback: kluster (“brainstorming ideas from trusted people”)
• Product redesign: redesignme (“…actively seeks out badly-designed products…users are then invited to complete design challenges”)
• …
• Data cleansing & entry / content creation: Amazon’s Mechanical Turk, CrowdFlower
MTurk & CF
• MTurk (www.mturk.com) launched in 2005
– Directly accessible only to US requesters
– > 500.000 workers from >100 countries
• CF (www.crowdflower.com) launched in 2007
– Channel to MTurk accessible to non-US requesters
Ipeirotis, 2010. New demographics of Mechanical Turk.
US (46.8%)                   India (34%)
68% women                    70% male
~40% 20-30 years             ~65% 20-30 years
35% Bachelors degree         53% Bachelors degree
~45% $25-60K/yr              55% <$10K/yr
~35% “to kill time”          ~30% “primary source of income”

• ~25% work 4-8 hours per week
• ~36% complete 20/100 HITs (i.e. work units) per week
• ~60% earn less than $10 per week
MTurk & CF
• Basic unit of work: "Human Intelligence Task" (HIT)
– Simple, repetitive, hard to automate tasks
– Prices from $0.01 to $10 (the end of un-supervised learning?)
• Requester
– Prepay the money
– Publish HITs
– Get results
• Worker (aka “turker”)
– Complete the HITs
– Get paid
[Diagram: Requester → HITs → Workers → Completed HITs]
Sample HITS from MTurk (July 2, 2010)
• Transcribe this audio into text (audio length: 1h3'41’’). $13.37
• Visit the given website and complete the short survey. About 5 minutes to complete. $1.00
• Tweet a specified message on your valid Twitter account with at least 200 followers. $1.00
• Share Your Room Painting Project (photo + description). $1.00
• Sell me your old college/university writing assignments and summaries (400+ words). I am looking for original writing done about university-level topics & readings. $0.50
• Share a 16th birthday party idea. 300 + words. $0.50
• Click a link to a website, enter your zip code, click submit to test (Takes 10 Seconds). $0.50
• Provide on my website quality improvement tip for Singers and aspiring vocalist looking for vocal training tips. $0.40
• How good is your Refrigerator model? Share your experience! $0.25
• Tell us a true, interesting story from your life about acne, pimples, zits. etc., like products you tried, bad dates, embarrassing moments, etc. $0.10
• Download and rate my free Android App. $0.01
• Adult/inappropriate video identification. You will view or scrub this video and decide if it contains adult material. $0.01
199,799 HITS
Sample NLP HITS 1
• Corpus collection
– Given a topic, prepare a brief speech expressing your true opinion on the topic. Next, prepare a second brief speech expressing the opposite of your opinion
• Word Sense Disambiguation
– Given a text passage containing a target word w, select w’s most appropriate sense from a list
• Word similarity
– Assign numeric judgments of word similarity for 30 word pairs on a scale of [0,10]
• Textual Entailment
– Given two sentences, choose whether the second sentence can be inferred from the first.
• Answer quality evaluation
– Given a question-answer pair, rate the following 13 statements on scale of 1 to 5: “This answer provides enough information for the question”, “this is an easy to read answer”, …
• Sentiment/polarity/bias classification
– Given a list of short headlines, assign numeric judgments in the interval [0,100] rating the headline for six emotions (anger, disgust, fear, joy, sadness, surprise) and a single numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline
Sample NLP HITS 2
• Machine Translation evaluation
– Given a source text, rank each of the 5 translations from Best to Worst
• Speech transcription
– Listen to the utterance by using the audio player embedded in the task web page, and transcribe every audible word. You can replay the audio as many times as necessary to produce a satisfactory transcript.
• Temporal ordering of events
– Given a verb event pair, make a binary choice on whether the event described by the first verb occurs before or after the second.
• Relation extraction
– Given a text passage with two highlighted terms, indicate if one of the following relations holds between them: …
Sample NLP HITS 3
• Word alignment
– Link words in the source sentence to one or more target words or the empty word.
JAVASCRIPT API
Popular, simple, fast, cheap,…… BUT tricky!!!
• How to design HITS?
– …to attract turkers
– …to collect reliable data
– …to boost speed
• How to price HITS?
• How to ensure quality control?
– …to weed out untrustworthy workers
– …to weed out spammers/cheaters
– …to avoid wasting money
A bunch of hints
• Keep your HIT simple and concise
– Difficult tasks = low agreement, few reliable results, slow progress
• Try different settings before launching a big job
– Different definitions of your HIT
– Different payment amounts
• Make cheating a hard task
– Make successful completion with random clicks impossible
– Use a gold standard
– Use regional qualifications
– Define your HIT in the appropriate language
– Transform texts into images
The importance of gold data 1
• Using a gold standard is optional, but REMEMBER THAT:
• You are going to pay only for successfully completed HITs!!!
– MTurk: +10% over the price of successfully completed HITs
– CF: +30% (!)
• You need a criterion to discriminate between successfully and unsuccessfully completed HITs
– No criterion = ALL results are good (and paid!)
HIT: Transcribe this audio into text (audio length: 1h3'41’’). $13.37
Result: Agfdagfa ah ah ah!
Valid result without gold standard!!!
The importance of gold data 2
• No criterion = ALL results are good (and paid!)…another example
HIT: given two English words A and B, decide if they can be synonyms or not
Data to be annotated:
A        B          Synonyms?
car      book       -
volume   loudness   -
volume   book       -
volume   mass       -
crab     shrimp     -
The importance of gold data 2
• No criterion = ALL results are good (and paid!)…another example
HIT: given two English words A and B, decide if they can be synonyms or not
A        B          Synonyms?
car      book       YES
volume   loudness   NO
volume   book       NO
volume   mass       NO
crab     shrimp     YES
Valid results without gold standard!!!
Adding gold units 1
• Sometimes it’s easy: gold units can be merged with the required annotations
HIT: given two English words A and B, decide if they can be synonyms or not
A        B            Synonyms?   GOLD
car      automobile   -           YES    (gold unit)
car      book         -           -
volume   loudness     -           -
volume   book         -           -
volume   table        -           NO     (gold unit)
volume   mass         -           -
crab     shrimp       -           -
Rows without a GOLD value are the data to be annotated.
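A minimal sketch of how gold units might be mixed into the data before publishing, so that workers cannot tell them apart from ordinary units. The function name `merge_gold` and the dictionary layout are illustrative assumptions, not part of the MTurk or CrowdFlower interfaces.

```python
import random

def merge_gold(data_units, gold_units, seed=0):
    """Return one shuffled list of units; gold units carry their known answer.

    Hypothetical helper: the unit layout below is an assumption made for
    illustration, not a platform format.
    """
    units = [{"a": a, "b": b, "gold": None} for a, b in data_units]
    units += [{"a": a, "b": b, "gold": answer} for a, b, answer in gold_units]
    random.Random(seed).shuffle(units)  # fixed seed keeps the order reproducible
    return units

# Word pairs from the slide: five units to annotate and two gold units.
data = [("car", "book"), ("volume", "loudness"), ("volume", "book"),
        ("volume", "mass"), ("crab", "shrimp")]
gold = [("car", "automobile", "YES"), ("volume", "table", "NO")]

job_units = merge_gold(data, gold)
```

Workers see only the `a` and `b` fields; the `gold` field stays on the requester side and is used afterwards to check their answers.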
Adding gold units 1
• Sometimes it’s easy: gold units can be merged with the required annotations
HIT: given two English words A and B, decide if they can be synonyms or not
A        B            Synonyms?   GOLD
car      automobile   YES         YES    (gold unit)
car      book         NO          -
volume   loudness     YES         -
volume   book         YES         -
volume   table        YES         NO     (gold unit)
volume   mass         YES         -
crab     shrimp       YES         -
Worker #67911: Judgments made: 7, Gold seen: 2 / Missed: 1, Trust: 50%
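The worker summary above can be reproduced with a short sketch. It assumes that trust is simply the fraction of seen gold units answered correctly; the slides do not give CrowdFlower's exact formula, so treat the function `trust` as a hypothetical stand-in.

```python
def trust(judgments, gold):
    """Fraction of seen gold units the worker answered correctly.

    Assumption: trust = correct gold answers / gold units seen.
    Returns None when the worker has seen no gold unit.
    """
    seen = [unit for unit in gold if unit in judgments]
    if not seen:
        return None
    correct = sum(judgments[unit] == gold[unit] for unit in seen)
    return correct / len(seen)

# Answers of worker #67911 as shown on the slide.
judgments = {
    ("car", "automobile"): "YES",
    ("car", "book"): "NO",
    ("volume", "loudness"): "YES",
    ("volume", "book"): "YES",
    ("volume", "table"): "YES",
    ("volume", "mass"): "YES",
    ("crab", "shrimp"): "YES",
}
gold = {("car", "automobile"): "YES", ("volume", "table"): "NO"}

worker_trust = trust(judgments, gold)  # 1 of 2 gold units correct
```

Under this assumption the worker gets one of the two gold units right, which matches the 50% shown in the summary.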
Adding gold units 2
• Sometimes it’s harder: gold units cannot be directly merged with the required annotations
HIT 1: translate the given English sentence into Spanish
HIT 2: summarize a 300-word story
HIT 3: Given a list of headlines, assign a numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline
• One valid output Vs. multiple valid outputs
• Known output Vs. unknown output
• Data annotation Vs. survey/content creation
Adding gold units 2
• Sometimes it’s harder: gold units cannot be directly merged with the required annotations
HIT 1: translate the given English sentence into Spanish
PROBLEM: Since there is no SINGLE correct translation, we cannot directly check the quality of turkers’ work through comparison with a gold reference translation
Adding gold units 2
• Sometimes it’s harder: gold units cannot be directly merged with the required annotations
• Possible solution: a 2-step HIT (validation over gold units + translation)
HIT 1: translate the given English sentence into Spanish
HIT 1.0: given two sentences, S1 in English and S2 in Spanish, decide if S2 is a correct translation of S1.
HIT 1.1: translate the given English sentence S3 into Spanish.
Gold unit (HIT 1.0):
  S1: 2002 Olympic Winter games took place in Salt Lake.
  S2: 2002 Juegos Olímpicos de Invierno tendrá lugar en Salt Lake.
  Correct?: -
  Gold: NO
Data to be collected (HIT 1.1):
  S3: A variety of mercy killing is when a patient is removed from a life support system with legal approval.
  Translation: -
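The 2-step idea can be sketched as a filter: a translation is kept only when the same worker judged the paired gold unit correctly. The names `accept_translations`, `gold_id`, and `verdict`, as well as the sample submissions, are hypothetical and only illustrate the principle.

```python
def accept_translations(submissions, gold):
    """Keep a translation only if its author passed the paired gold check.

    Hypothetical record layout: each submission holds the worker's verdict
    on a gold pair (HIT 1.0) and the translation produced in HIT 1.1.
    """
    accepted = []
    for sub in submissions:
        if sub["verdict"] == gold[sub["gold_id"]]:
            accepted.append((sub["worker"], sub["translation"]))
    return accepted

# The gold pair from the slide is an incorrect translation, so gold = "NO".
gold = {"g1": "NO"}

# Invented submissions: w1 spots the wrong translation, w2 clicks at random.
submissions = [
    {"worker": "w1", "gold_id": "g1", "verdict": "NO",
     "translation": "<Spanish translation by w1>"},
    {"worker": "w2", "gold_id": "g1", "verdict": "YES",
     "translation": "Agfdagfa ah ah ah!"},
]

accepted = accept_translations(submissions, gold)
```

This does not prove that an accepted translation is good; it only removes output from workers who fail a check with a known answer.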
AMT Vs CF
                                           AMT    CF
Regional qualification                     ✔      ✔
Accessible to international requesters     ✗      ✔
Multiple channels for job distribution     ✗      ✔
Built-in gold standard qualification       ✗      ✔
Trustability qualification                 ✔      ✗
Qualification certificate                  ✔      ✗
Selection of good workers on your job      ✔      ✗
Charge on successfully completed HITs      +10%   +30%
Terminology
• Unit (HIT)
– Basic task given to each worker.
• Assignment
– Number of units each worker will do at a time.
• Judgment
– Completion of an assignment by an individual worker.
• Job
– Your published assignments waiting for judgment.
– Cost = # Assignments * # Judgments per assignment * Pay per assignment
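The cost formula can be worked through with a small sketch. The job size and pay used here are invented for illustration; the only figures taken from the slides are the platform charges (+10% for MTurk, +30% for CF), and the result is an upper bound that assumes every judgment is accepted and paid.

```python
from decimal import Decimal

# Platform charges on successfully completed HITs, as quoted in the slides.
MTURK_FEE = Decimal("0.10")
CF_FEE = Decimal("0.30")

def job_cost(assignments, judgments_per_assignment, pay_per_assignment,
             fee=Decimal("0")):
    """Cost = # Assignments * # Judgments per assignment * Pay per assignment,
    optionally increased by the platform fee. Pay is given as a string so
    that Decimal keeps the cents exact."""
    base = assignments * judgments_per_assignment * Decimal(pay_per_assignment)
    return base * (1 + fee)

# Hypothetical job: 100 assignments, 5 judgments each, $0.02 per assignment.
base = job_cost(100, 5, "0.02")                # $10 before platform charges
on_mturk = job_cost(100, 5, "0.02", MTURK_FEE) # $11 with the +10% charge
on_cf = job_cost(100, 5, "0.02", CF_FEE)       # $13 with the +30% charge
```

Decimal is used instead of float so that sums of cents do not pick up rounding noise.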
Creating a new job: word similarity
• Task: Given a sentence containing a term t, choose among a list of 3 terms t1, t2, t3 the most similar to t.
• Note: One valid output = simple gold standard creation!
– gold units can be easily merged with the required annotations
– 1-step HIT
HIT: select from a list of terms the most similar to the one extracted from the given sentence
Sentence | T | T1 | T2 | T3 | Gold | Most Similar
He was reading a book while waiting for his guests. | book | hat | volume | cat | volume |
they left the harbor during the night | harbor | seaport | airport | mountain | - |
4: ordering
NOTE:
• MTurk: +10% over the price of successfully completed HITs
• CF: +30% (!)
Gambit: payments company for social games! Players are paid with “chips” for taking simple, online jobs…
Workers
NOTE: Only workers having seen at least 4 gold units, with >= 70% Trust are paid (and their work is retained)!
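The payment rule in the note can be sketched as a simple filter over per-worker statistics. The thresholds (at least 4 gold units seen, trust of at least 70%) come from the note; the function `paid_workers`, the record layout, and workers A and B are hypothetical.

```python
def paid_workers(stats, min_gold_seen=4, min_trust=0.70):
    """Return the ids of workers whose work is paid and retained.

    Hypothetical layout: stats maps a worker id to a dict with the number
    of gold units seen and the trust score in [0, 1].
    """
    return [
        worker
        for worker, s in stats.items()
        if s["gold_seen"] >= min_gold_seen and s["trust"] >= min_trust
    ]

stats = {
    "67911": {"gold_seen": 2, "trust": 0.50},  # worker from the earlier slide
    "A": {"gold_seen": 5, "trust": 0.80},      # invented example
    "B": {"gold_seen": 6, "trust": 0.50},      # invented example
}

retained = paid_workers(stats)
```

Worker #67911 fails both conditions, and worker B has seen enough gold units but falls below the trust threshold, so only A is retained.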
Issues
• How to design HITS?
– …to attract turkers
– …to collect reliable data
– …to boost speed
• How to price HITS?
• What can we do with low budget?
• Quality control, cheating/spam detection
• Experts Vs non experts (correlation between the two groups, what to expect from non experts)
Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush

T: Wolfgang Amadeus Mozart was born in Salzburg.
H: Mozart was born in Austria.

T: Wolfgang Amadeus Mozart was born in Salzburg.
H: Mozart nació en Austria.
NAACL 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk
Joint work with Yashar Mehdad
[Pipeline diagram: Monolingual TE Corpus (PASCAL RTE3) → Translation HIT → Translated T-H pairs → Validation HIT → Validated T-H pairs → CLTE Corpus (English-Spanish)]
Naïve methodology
No qualification mechanisms
Very fast and cheap:
• $12 for 800 translations in 1 hour
• $12 for 5*800 validations in ~6 hours
Poor quality of the results (61% rejections)
Need for gold standard units!
DAY 1: $24
Improving validation
Gold units (50 positive/negative examples)
Task definition in Spanish
Better results…still at low cost
• 97% accuracy on 20% of the retained translations
• +25% in the validation costs
Considerable increase in duration
• 4 days for the first iteration (many rejected judgments, automatic pausing mechanism in CF)
Need for qualification mechanisms!
More money to boost speed!
DAYS 2-7: $58
Improving translation
Gold units (validity check)
Regional qualification, as in MTurk (upon request)
Payment increase
Better results…
• Fewer rejections (45%)
• Automatic pausing avoided
Faster procedure
• Doubling the payment halved the completion time
DAYS 8-10: $99.75
Summary
• 800 English pairs (RTE3 Development Set)
• 426 validated English/Spanish pairs in our CLTE Corpus
• $99.75 spent to define a reliable and fast procedure
o translation/validation cycles
o non-redundant acquisitions
o systematic use of gold units
o simple binary decisions
• Cost-effective solution
o $30 to create the full corpus of 800 pairs
• Some limitations found in the CrowdFlower service
o lack of regional qualification (only available upon request)
o lack of other qualification mechanisms
o automatic pausing mechanisms
MTurk & CF
• MTurk (www.mturk.com) launched in 2005
– Directly accessible only to US requesters
– Workers from >100 countries
– > 500.000 workers
• ~47% from US (34% from India)
• ~68% women
• ~52% 22-40 years
• ~70% to spend free time fruitfully (~15% for “primary” income purposes)
• ~25% for 4-8 hours per week
• ~60% earning less than $10 per week
• >50% with college education
• CF (www.crowdflower.com) launched in 2007
– Channel to MTurk accessible to non-US requesters
Ipeirotis, 2010. New demographics of Mechanical Turk.