
Crowdsourcing for NLP Using Amazon Mechanical Turk and CrowdFlower

Matteo Negri and Yashar Mehdad

Crowdsourcing

• Wikipedia: Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to a large group of people or community (a crowd), through an open call.

Crowdsourcing services
• Web and logo design: 99designs (>72000 designers, from $150)
• Brand names: namethis ($99 for the best 3 names after a 48-hour contest/voting session)
• Business innovation: Chaordix (engage the crowd via the web to “submit, discuss, refine and rank ideas…”)
• Advertising: Poptent (“connects video creators with Top Brands…”)
• Software & usability testing: uTest (>18000 professionals to test Web, mobile, gaming and desktop apps)
• Brainstorming / feedback: kluster (“brainstorming ideas from trusted people”)
• Product redesign: redesignme (“…actively seeks out badly-designed products…users are then invited to complete design challenges”)
• …
• Data cleansing & entry / content creation: Amazon’s Mechanical Turk, CrowdFlower

http://99designs.com/

http://namethis.com/

http://www.chaordix.com/

http://www.poptent.net/

http://www.utest.com/

http://www.kluster.com/

http://www.redesignme.com/

https://www.mturk.com/mturk/welcome

MTurk & CF
• MTurk (www.mturk.com) launched in 2005
– Directly accessible only to US requesters
– >500,000 workers from >100 countries
• CF (www.crowdflower.com) launched in 2007
– Channel to MTurk accessible to non-US requesters

Ipeirotis, 2010. New demographics of Mechanical Turk.

Worker profile       US (46.8%)              India (34%)
Gender               68% women               70% male
Age                  ~40% 20-30 years        ~65% 20-30 years
Education            35% Bachelors degree    53% Bachelors degree
Income               ~45% $25-60K/yr         55% <$10K/yr
Motivation           ~35% “to kill time”     ~30% “primary source of income”

• ~25% work 4-8 hours per week
• ~36% complete 20/100 HITs (i.e. work units) per week
• ~60% earn less than $10 per week

http://www.mturk.com/

http://www.crowdflower.com/

MTurk & CF

• Basic unit of work: "Human Intelligence Task" (HIT)
– Simple, repetitive, hard-to-automate tasks
– Prices from $0.01 to $10 (the end of un-supervised learning?)
• Requester
– Prepay the money
– Publish HITs
– Get results
• Worker (aka “turker”)
– Complete the HITs
– Get paid

[Diagram: the Requester publishes HITs to the Workers; the Workers return Completed HITs to the Requester]

Sample HITs from MTurk (July 2, 2010)
• Transcribe this audio into text (audio length: 1h3'41’’). $13.37
• Visit the given website and complete the short survey. About 5 minutes to complete. $1.00
• Tweet a specified message on your valid Twitter account with at least 200 followers. $1.00
• Share Your Room Painting Project (photo + description). $1.00
• Sell me your old college/university writing assignments and summaries (400+ words). I am looking for original writing done about university-level topics & readings. $0.50
• Share a 16th birthday party idea. 300+ words. $0.50
• Click a link to a website, enter your zip code, click submit to test (Takes 10 Seconds). $0.50
• Provide on my website quality improvement tip for Singers and aspiring vocalist looking for vocal training tips. $0.40
• How good is your Refrigerator model? Share your experience! $0.25
• Tell us a true, interesting story from your life about acne, pimples, zits, etc., like products you tried, bad dates, embarrassing moments, etc. $0.10
• Download and rate my free Android App. $0.01
• Adult/inappropriate video identification. You will view or scrub this video and decide if it contains adult material. $0.01

199,799 HITs

Sample NLP HITs 1
• Corpus collection
– Given a topic, prepare a brief speech expressing your true opinion on the topic. Next, prepare a second brief speech expressing the opposite of your opinion
• Word Sense Disambiguation
– Given a text passage containing a target word w, select w’s most appropriate sense from a list
• Word similarity
– Assign numeric judgments of word similarity for 30 word pairs on a scale of [0,10]
• Textual Entailment
– Given two sentences, choose whether the second sentence can be inferred from the first.
• Answer quality evaluation
– Given a question-answer pair, rate the following 13 statements on a scale of 1 to 5: “This answer provides enough information for the question”, “this is an easy to read answer”, …
• Sentiment/polarity/bias classification
– Given a list of short headlines, assign numeric judgments in the interval [0,100] rating the headline for six emotions (anger, disgust, fear, joy, sadness, surprise) and a single numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline

Sample NLP HITs 2
• Machine Translation evaluation
– Given a source text, rank each of the 5 translations from Best to Worst
• Speech transcription
– Listen to the utterance by using the audio player embedded in the task web page, and transcribe every audible word. You can replay the audio as many times as necessary to produce a satisfactory transcript.
• Temporal ordering of events
– Given a verb event pair, take a binary choice on whether the event described by the first verb occurs before or after the second.
• Relation extraction
– Given a text passage with two highlighted terms, indicate if one of the following relations holds between them: …

Sample NLP HITs 3
• Word alignment
– Link words in the source sentence to one or more target words or the empty word.

JAVASCRIPT API

Popular, simple, fast, cheap,…… BUT tricky!!!

• How to design HITs?
– …to attract turkers
– …to collect reliable data
– …to boost speed
• How to price HITs?
• How to ensure quality control?
– …to weed out untrustworthy workers
– …to weed out spammers/cheaters
– …to avoid wasting money

A bunch of hints
• Keep your HIT simple and concise
– Difficult tasks = low agreement, few reliable results, slow progress
• Try different settings before launching a big job
– Different definitions of your HIT
– Different payment amounts
• Make cheating a hard task
– Make successful completion with random clicks impossible
– Use a gold standard
– Use regional qualifications
– Define your HIT in the appropriate language
– Transform texts into images

The importance of gold data 1
• Using a gold standard is optional but REMEMBER THAT:
• You are going to pay only for successfully completed HITs!!!
– MTurk: +10% over the price of successfully completed HITs
– CF: +30% (!)
• You need a criterion to discriminate between successfully and unsuccessfully completed HITs
– No criterion = ALL results are good (and paid!)

HIT: Transcribe this audio into text (audio length: 1h3'41’’). $13.37

Worker’s answer: Agfdagfa ah ah ah!

Valid result without gold standard!!!

The importance of gold data 2
• No criterion = ALL results are good (and paid!) …another example

HIT: given two English words A and B, decide if they can be synonyms or not

Data to be annotated

A        B          Synonyms?
car      book       -
volume   loudness   -
volume   book       -
volume   mass       -
crab     shrimp     -

A        B          Synonyms?
car      book       YES
volume   loudness   NO
volume   book       NO
volume   mass       NO
crab     shrimp     YES

Valid results without gold standard!!!

Adding gold units 1
• Sometimes it’s easy: gold units can be merged with the required annotations

A        B            Synonyms?   GOLD
car      automobile   -           YES
car      book         -           -
volume   loudness     -           -
volume   book         -           -
volume   table        -           NO
volume   mass         -           -
crab     shrimp       -           -

Rows with a GOLD value are gold units; all other rows are data to be annotated.
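Merging gold units with the data to be annotated can be sketched in a few lines of Python. This is only an illustration: the function name `mix_in_gold`, the dictionary keys and the fixed seed are assumptions made for the example, not part of MTurk or CrowdFlower.

```python
import random

def mix_in_gold(units, gold_units, seed=None):
    """Shuffle gold units among the regular units so workers cannot spot them by position."""
    rng = random.Random(seed)
    # Regular units carry no gold answer; gold units keep theirs.
    mixed = [dict(u, gold=None) for u in units] + [dict(g) for g in gold_units]
    rng.shuffle(mixed)
    return mixed

# Data to be annotated (from the synonym example).
units = [
    {"a": "car", "b": "book"},
    {"a": "volume", "b": "loudness"},
    {"a": "volume", "b": "book"},
    {"a": "volume", "b": "mass"},
    {"a": "crab", "b": "shrimp"},
]
# Gold units with known answers.
gold_units = [
    {"a": "car", "b": "automobile", "gold": "YES"},
    {"a": "volume", "b": "table", "gold": "NO"},
]

job = mix_in_gold(units, gold_units, seed=0)
for unit in job:
    print(unit["a"], unit["b"])  # the gold column is never shown to workers
```

The gold answers stay on the requester's side; workers only see the A and B columns.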

Adding gold units 1
• Sometimes it’s easy: gold units can be merged with the required annotations

A        B            Synonyms?   GOLD
car      automobile   YES         YES
car      book         NO          -
volume   loudness     YES         -
volume   book         YES         -
volume   table        YES         NO
volume   mass         YES         -
crab     shrimp       YES         -

Worker #67911: Judgments made: 7, Gold seen: 2 / Missed: 1, Trust: 50%
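The trust figure for worker #67911 is consistent with a simple ratio: gold units answered correctly over gold units seen. The sketch below assumes that definition; the slides do not give CrowdFlower's actual formula, and the function name `trust` is hypothetical.

```python
def trust(answers, gold):
    """Fraction of seen gold units answered correctly, or None if no gold unit was seen."""
    seen = [unit for unit in answers if unit in gold]
    if not seen:
        return None
    correct = sum(1 for unit in seen if answers[unit] == gold[unit])
    return correct / len(seen)

# Gold answers known to the requester.
gold = {("car", "automobile"): "YES", ("volume", "table"): "NO"}

# The seven judgments of worker #67911 from the table above.
answers = {
    ("car", "automobile"): "YES",
    ("car", "book"): "NO",
    ("volume", "loudness"): "YES",
    ("volume", "book"): "YES",
    ("volume", "table"): "YES",
    ("volume", "mass"): "YES",
    ("crab", "shrimp"): "YES",
}

score = trust(answers, gold)
print(f"Trust: {score:.0%}")  # one of two gold units missed
```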

Adding gold units 2
• Sometimes it’s harder: gold units cannot be directly merged with the required annotations

HIT 1: translate the given English sentence into Spanish

HIT 2: summarize a 300-word story

HIT 3: given a list of headlines, assign a numeric rating in the interval [-100,100] to denote the overall positive or negative valence of the emotional content of the headline

• One valid output vs. multiple valid outputs
• Known output vs. unknown output
• Data annotation vs. survey/content creation

PROBLEM: Since there is no single good translation, we cannot directly check the quality of turkers’ work by comparing it with a gold reference translation.

• Possible solution: a 2-step HIT (validation over gold units + translation)

HIT 1.0: given two sentences, S1 in English and S2 in Spanish, decide if S2 is a correct translation of S1.
HIT 1.1: translate the given English sentence S3 into Spanish.

Gold units
S1: 2002 Olympic Winter games took place in Salt Lake.
S2: 2002 Juegos Olímpicos de Invierno tendrá lugar en Salt Lake.
Correct?: -
Gold: NO

Data to be collected
S3: A variety of mercy killing is when a patient is removed from a life support system with legal approval.
Translation: -

AMT vs CF

                                           AMT     CF
Regional qualification                     ✔       ✔
Accessible to international requesters     ✗       ✔
Multiple channels for job distribution     ✗       ✔
Built-in gold standard qualification       ✗       ✔
Trustability qualification                 ✔       ✗
Qualification certificate                  ✔       ✗
Selection of good workers on your job      ✔       ✗
Charge on successfully completed HITs      +10%    +30%

Next steps
1. Creation/publication of a job
• A simple task: word similarity
2. Monitoring your job

Terminology
• Unit (HIT)
– Basic task given to each worker.
• Assignment
– Number of units each worker will do at a time.
• Judgment
– Completion of an assignment by an individual worker.
• Job
– Your published assignments waiting for judgment.
– Cost = # Assignments * # Judgments per assignment * Pay per assignment
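The cost formula, combined with the platform charge quoted earlier (+10% for MTurk, +30% for CF), can be worked through as below. The job size and pay are made-up numbers for illustration, the function name `job_cost` is hypothetical, and the sketch assumes that every judgment is accepted and paid.

```python
from decimal import Decimal

def job_cost(n_assignments, judgments_per_assignment, pay_per_assignment, fee_rate="0"):
    """Cost = # assignments * # judgments per assignment * pay per assignment, plus platform fee."""
    base = (Decimal(n_assignments)
            * Decimal(judgments_per_assignment)
            * Decimal(str(pay_per_assignment)))
    return base * (1 + Decimal(str(fee_rate)))

# Hypothetical job: 100 assignments, 5 judgments each, $0.02 per assignment.
base = job_cost(100, 5, "0.02")                 # no platform fee
with_mturk = job_cost(100, 5, "0.02", "0.10")   # MTurk charge: +10%
with_cf = job_cost(100, 5, "0.02", "0.30")      # CF charge: +30%

print(f"base ${base:.2f}, MTurk ${with_mturk:.2f}, CF ${with_cf:.2f}")
```

Decimal arithmetic is used so that cent amounts are not distorted by floating-point rounding.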

Creating a new job: word similarity
• Task: Given a sentence containing a term t, choose among a list of 3 terms t1, t2, t3 the most similar to t.
• Note: One valid output → simple gold standard creation!
– gold units can be easily merged with the required annotations
– 1-step HIT

HIT: select from a list of terms the most similar to the one extracted from the given sentence

Sentence                                              T        T1        T2        T3         Gold     Most Similar
He was reading a book while waiting for his guests.   book     hat       volume    cat        volume
they left the harbor during the night                 harbor   seaport   airport   mountain   -
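The word similarity table above could be prepared for upload as a CSV file along these lines. The column names are taken from the slide's table header; the names a real platform expects are not given in the slides, so treat them as placeholders.

```python
import csv
import io

# Column names mirror the slide's table header (an assumption, not a platform requirement).
FIELDS = ["sentence", "t", "t1", "t2", "t3", "gold"]

rows = [
    {"sentence": "He was reading a book while waiting for his guests.",
     "t": "book", "t1": "hat", "t2": "volume", "t3": "cat", "gold": "volume"},
    {"sentence": "they left the harbor during the night",
     "t": "harbor", "t1": "seaport", "t2": "airport", "t3": "mountain", "gold": ""},
]

def to_csv(rows):
    """Serialise units to CSV text; a gold answer, when present, must be one of the candidates."""
    for row in rows:
        if row["gold"] and row["gold"] not in (row["t1"], row["t2"], row["t3"]):
            raise ValueError(f"gold answer {row['gold']!r} is not among the candidates")
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(rows)
print(csv_text)
```

An empty gold field marks a unit to be annotated; a filled one marks a gold unit.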

A closer look at

Creating a job

1: upload data

2: define your HIT

3: calibration (optional)

4: ordering

NOTE:
• MTurk: +10% over the price of successfully completed HITs
• CF: +30% (!)

Gambit: payments company for social games! Players are paid with “chips” for taking simple, online jobs…

Checking a job (progress and results)

Summary page

Preview

Workers

NOTE: Only workers having seen at least 4 gold units, with >= 70% Trust are paid (and their work is retained)!
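The retention rule in the note can be expressed as a simple filter. The worker records below are invented for illustration, and the `WorkerStats` class is a hypothetical container, not a CrowdFlower data structure.

```python
from dataclasses import dataclass

MIN_GOLD_SEEN = 4   # at least 4 gold units seen
MIN_TRUST = 0.70    # at least 70% trust

@dataclass
class WorkerStats:
    worker_id: str
    gold_seen: int
    gold_missed: int

    @property
    def trust(self):
        """Share of seen gold units answered correctly (0.0 if none seen)."""
        if self.gold_seen == 0:
            return 0.0
        return (self.gold_seen - self.gold_missed) / self.gold_seen

def is_retained(worker):
    """True if the worker's judgments are kept (and paid) under the rule above."""
    return worker.gold_seen >= MIN_GOLD_SEEN and worker.trust >= MIN_TRUST

# Hypothetical workers.
workers = [
    WorkerStats("w1", gold_seen=10, gold_missed=1),  # 90% trust: retained
    WorkerStats("w2", gold_seen=2, gold_missed=0),   # too few gold units seen
    WorkerStats("w3", gold_seen=8, gold_missed=4),   # 50% trust: rejected
]
retained = [w.worker_id for w in workers if is_retained(w)]
print(retained)
```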

A trustable worker

Issues

• How to design HITs?
– …to attract turkers
– …to collect reliable data
– …to boost speed
• How to price HITs?
• What can we do with a low budget?
• Quality control, cheating/spam detection
• Experts vs. non-experts (correlation between the two groups, what to expect from non-experts)

A recent experience…

Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush

T: Wolfgang Amadeus Mozart was born in Salzburg.
H: Mozart was born in Austria.

T: Wolfgang Amadeus Mozart was born in Salzburg.
H: Mozart nació en Austria.

NAACL 2010 Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk

Joint work with Yashar Mehdad

[Pipeline diagram: Monolingual TE Corpus (PASCAL RTE3) → Translation HIT → Translated T-H pairs → Validation HIT → Validated T-H pairs → CLTE Corpus (English-Spanish)]

Naïve methodology
• No qualification mechanisms
• Very fast and cheap:
– $12 for 800 translations in 1 hour
– $12 for 5*800 validations in ~6 hours
• Poor quality of the results (61% rejections)
• Need of gold standard units!

DAY 1: $24
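Each translation received five validation judgments (5*800). The slides do not say how the five judgments were combined; a common choice, assumed here purely for illustration, is a majority vote over the binary answers. The pair identifiers and judgments below are invented.

```python
from collections import Counter

def majority_vote(judgments):
    """Return the label given by more than half of the judgments, or None if there is none."""
    if not judgments:
        return None
    label, count = Counter(judgments).most_common(1)[0]
    return label if count > len(judgments) / 2 else None

# Hypothetical validation results: is the Spanish translation correct?
validations = {
    "pair-1": ["YES", "YES", "NO", "YES", "YES"],
    "pair-2": ["NO", "NO", "YES", "NO", "NO"],
    "pair-3": ["YES", "NO", "YES", "NO", "NO"],
}

# Keep only translations that a majority of validators judged correct.
accepted = [pair for pair, js in validations.items() if majority_vote(js) == "YES"]
print(accepted)
```

With an odd number of binary judgments a strict majority always exists, which is one reason to collect five rather than four.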

Improving validation
• Gold units (50 positive/negative examples)
• Task definition in Spanish

DAYS 2-7: $58

Better results…still at low cost
• 97% accuracy on 20% of the retained translations
• +25% in the validation costs

Considerable increase in duration
• 4 days for the first iteration (many rejected judgments, automatic pausing mechanism in CF)

Need of qualification mechanisms!
More money to boost speed!

Improving translation
• Gold units (validity check)
• Regional qualification, as in MTurk (upon request)
• Payment increase

DAYS 8-10: $99.75

Better results…
• Fewer rejections (45%)
• Automatic pausing avoided

Faster procedure
• Doubling the payment halved the completion time

Summary
• 800 English pairs (RTE3 Development Set)
• 426 validated English/Spanish pairs in our CLTE Corpus
• $99.75 spent to define a reliable and fast procedure
– translation/validation cycles
– non-redundant acquisitions
– systematic use of gold units
– simple binary decisions
• Cost-effective solution
– $30 to create the full corpus of 800 pairs
• Some limitations found in the CrowdFlower service
– lack of regional qualification (only available upon request)
– lack of other qualification mechanisms
– automatic pausing mechanisms

MTurk & CF
• MTurk (www.mturk.com) launched in 2005
– Directly accessible only to US requesters
– Workers from >100 countries
– >500,000 workers
• ~47% from US (34% from India)
• ~68% women
• ~52% 22-40 years
• ~70% to spend free time fruitfully (~15% for “primary” income purposes)
• ~25% for 4-8 hours per week
• ~60% earning less than $10 per week
• >50% with college education
• CF (www.crowdflower.com) launched in 2007
– Channel to MTurk accessible to non-US requesters

Ipeirotis, 2010. New demographics of Mechanical Turk.

http://www.mturk.com/

http://www.crowdflower.com/
