Crowdsourcing using Mechanical Turk Quality Management and Scalability Panos Ipeirotis – New York University

Post on 13-Dec-2015


TRANSCRIPT

Page 1: Crowdsourcing using Mechanical Turk Quality Management and Scalability Panos Ipeirotis – New York University

Crowdsourcing using Mechanical Turk

Quality Management and Scalability

Panos Ipeirotis – New York University

Page 2

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]

Panos Ipeirotis - Introduction

New York University, Stern School of Business

Page 3

Page 4

Page 5

Page 6

Page 7

Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn).

Cost/Speed Statistics
Undergrad intern: 200 websites/hr, cost: $15/hr

Page 8

Amazon Mechanical Turk: Paid Crowdsourcing

Page 9

Example: Build an “Adult Web Site” Classifier

Need a large number of hand-labeled sites. Get people to look at sites and classify them as: G (general audience), PG (parental guidance), R (restricted), X (porn).

Cost/Speed Statistics
Undergrad intern: 200 websites/hr, cost: $15/hr
MTurk: 2500 websites/hr, cost: $12/hr

Page 10

Bad news: Spammers!

Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).

Page 11

Improve Data Quality through Repeated Labeling

Get multiple, redundant labels from different workers and pick the correct label by majority vote.

The probability of correctness increases with the number of workers and with the quality of the workers.

1 worker: 70% correct
11 workers: 93% correct
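The majority-vote arithmetic behind these figures can be checked with a short binomial computation. This is a sketch that assumes workers answer independently, each with the same per-worker accuracy:

```python
from math import comb

def p_majority_correct(p: float, n: int) -> float:
    """Probability that the majority of n independent workers, each correct
    with probability p, picks the right label (n odd, binary decision)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(round(p_majority_correct(0.7, 1), 2))   # 0.7: one 70%-accurate worker
print(round(p_majority_correct(0.7, 11), 2))  # ~0.92, close to the 93% above
```

The same formula reproduces the later "without spam" numbers: five 80%-accurate workers reach roughly 94% by majority vote.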

Page 12

11-Vote Statistics
MTurk: 227 websites/hr, cost: $12/hr
Undergrad: 200 websites/hr, cost: $15/hr

Single-Vote Statistics
MTurk: 2500 websites/hr, cost: $12/hr
Undergrad: 200 websites/hr, cost: $15/hr

But Majority Voting is Expensive

Page 13

Using redundant votes, we can infer worker quality

Look at our spammer friend ATAMRO447HWJQ together with 9 other workers.

Our “friend” ATAMRO447HWJQ mainly marked sites as G. Obviously a spammer…

We can compute error rates for each worker.

Error rates for ATAMRO447HWJQ:
P[X → X] = 9.847%    P[X → G] = 90.153%
P[G → X] = 0.053%    P[G → G] = 99.947%
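One way to compute such error rates from redundant labels alone is to use the majority vote as a stand-in for the true label and count each worker's confusions against it. This is a simplified sketch (the full method iterates this estimate, in the spirit of Dawid and Skene's EM approach), and the data layout below is an assumption, not the talk's actual format:

```python
from collections import Counter, defaultdict

def estimate_error_rates(labels):
    """labels: {site: [(worker_id, label), ...]} with redundant votes per site.
    Returns {worker_id: {(true, assigned): P[true -> assigned]}}, where the
    majority vote per site serves as the proxy for the true label."""
    confusions = defaultdict(Counter)   # worker -> Counter[(true, assigned)]
    for site, votes in labels.items():
        majority = Counter(lab for _, lab in votes).most_common(1)[0][0]
        for worker, lab in votes:
            confusions[worker][(majority, lab)] += 1
    rates = {}
    for worker, counts in confusions.items():
        per_true = Counter()
        for (true, _), n in counts.items():
            per_true[true] += n         # votes this worker cast on true label t
        rates[worker] = {pair: n / per_true[pair[0]] for pair, n in counts.items()}
    return rates

# Three honest workers and one worker who marks everything G:
votes = {
    "site1": [("w1", "X"), ("w2", "X"), ("w3", "X"), ("spammer", "G")],
    "site2": [("w1", "X"), ("w2", "X"), ("w3", "X"), ("spammer", "G")],
    "site3": [("w1", "G"), ("w2", "G"), ("w3", "G"), ("spammer", "G")],
}
rates = estimate_error_rates(votes)
print(rates["spammer"])  # {('X', 'G'): 1.0, ('G', 'G'): 1.0}
```

The spammer's estimated P[X → G] is 100%, exactly the pattern flagged for ATAMRO447HWJQ above.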

Page 14

Rejecting Spammers, and the Benefits

Random answers would give an error rate of 50%. Average error rate for ATAMRO447HWJQ: 45.2%.
P[X → X] = 9.847%    P[X → G] = 90.153%
P[G → X] = 0.053%    P[G → G] = 99.947%

Action: REJECT and BLOCK

Results: Over time you block all spammers, and spammers learn to avoid your HITs. You can then decrease redundancy, as the quality of the remaining workers is higher.

Page 15

After rejecting spammers, quality goes up. Spam keeps quality down; without spam, workers are of higher quality, so we need less redundancy for the same quality and get the same results at lower cost.

With spam:    1 worker → 70% correct;  11 workers → 93% correct
Without spam: 1 worker → 80% correct;   5 workers → 94% correct

Page 16

Correcting biases

When classifying sites as G, PG, R, X, workers are sometimes careful but biased: classifying G → P and P → R. The average error rate for ATLJIK76YH1TF is too high. Is she a spammer?

Error rates for the CEO of AdSafe:
P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%

Page 17

Correcting biases

For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error from bias.

True error rate: ~9%

Error rates for worker ATLJIK76YH1TF:
P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%
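The “reverse the errors” idea can be sketched as a Bayesian inversion of the worker's confusion matrix: given the observed label, compute the posterior over true labels. This is an illustrative reconstruction, not the talk's exact algorithm (which also folds in priors and misclassification costs more carefully):

```python
def posterior_true_label(observed, confusion, prior):
    """P(true | observed) is proportional to prior(true) * P[true -> observed],
    where confusion[true][observed] is the worker's error rate."""
    scores = {t: prior[t] * confusion[t].get(observed, 0.0) for t in prior}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

# The biased worker's matrix from above (nonzero entries only):
confusion = {"G": {"G": 0.2, "P": 0.8},
             "P": {"R": 1.0},
             "R": {"R": 1.0},
             "X": {"X": 1.0}}
prior = {t: 0.25 for t in ("G", "P", "R", "X")}

print(posterior_true_label("P", confusion, prior))
# {'G': 1.0, 'P': 0.0, 'R': 0.0, 'X': 0.0}
```

An observed P from this worker almost certainly means a true G; an observed R splits evenly between true P and true R, which is exactly the ambiguity that more redundant votes resolve.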

Page 18

Too much theory?

Demo and open-source implementation available at http://qmturk.appspot.com

Input:
– Labels from Mechanical Turk
– Cost of incorrect labelings (e.g., X → G costlier than G → X)

Output:
– Corrected labels
– Worker error rates
– Ranking of workers according to their quality

Beta version, more improvements to come! Suggestions and collaborations welcomed!
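The “cost of incorrect labelings” input matters because the cheapest-to-assign label is not always the most probable one. A minimal sketch of cost-sensitive label selection, with made-up cost numbers (the tool's real input format is not shown here):

```python
def min_cost_label(posterior, cost):
    """posterior: {true_label: probability}; cost[true][assigned]: penalty for
    assigning `assigned` when the truth is `true`. Returns the label with the
    lowest expected misclassification cost."""
    labels = list(posterior)
    return min(labels,
               key=lambda assigned: sum(posterior[t] * cost[t][assigned]
                                        for t in labels))

posterior = {"G": 0.6, "X": 0.4}
cost = {"G": {"G": 0, "X": 1},    # calling a clean site porn: mildly costly
        "X": {"G": 10, "X": 0}}   # calling porn "general audience": very costly
print(min_cost_label(posterior, cost))  # X, even though G is more probable
```

Expected cost of assigning G is 0.4 × 10 = 4.0, versus 0.6 × 1 = 0.6 for X, so the cost-sensitive choice is X.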

Page 19

Scaling Crowdsourcing: Use Machine Learning

Human labor is expensive, even when paying cents, so we need to scale crowdsourcing.

Basic idea: Build a machine learning model from the existing crowdsourced answers and use it instead of humans.

[Flow diagram: data from existing crowdsourced answers → automatic model (through machine learning); new case → automatic model → automatic answer]

Page 20

Tradeoffs for Automatic Models: Effect of Noise

Get more data → improve model accuracy. Improve data quality → improve classification.

Example case: Porn or not?

[Figure: learning curves on the Mushroom dataset, accuracy (40–100%) vs. number of examples (1–300), one curve for each data quality level: 50%, 60%, 80%, 100%]

Page 21

Scaling Crowdsourcing: Iterative Training

Use the machine when it is confident, humans otherwise. Retrain with the new human input → improve the model → reduce the need for humans.

[Flow diagram: new case → automatic model (through machine learning); if confident → automatic answer; if not confident → get human(s) to answer, and feed the human answers back into the data from existing crowdsourced answers to retrain the model]
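The loop above can be sketched as follows; the names `classify` and `ask_humans` and the 0.9 confidence threshold are illustrative assumptions, not details from the talk:

```python
CONFIDENCE_THRESHOLD = 0.9   # assumed cutoff; the talk gives no specific number

def label_stream(cases, classify, ask_humans, training_data):
    """For each new case, use the machine's answer when it is confident;
    otherwise route the case to human workers and keep their label so the
    model can be retrained later."""
    results = {}
    for case in cases:
        label, confidence = classify(case)
        if confidence >= CONFIDENCE_THRESHOLD:
            results[case] = label                        # automatic answer
        else:
            results[case] = ask_humans(case)             # human answer
            training_data.append((case, results[case]))  # fuel for retraining
    return results

# Toy run: this stand-in "model" is confident on even-numbered cases only.
classify = lambda c: ("G", 0.95) if c % 2 == 0 else ("X", 0.5)
training = []
out = label_stream(range(4), classify, lambda c: "X", training)
print(out, len(training))  # {0: 'G', 1: 'X', 2: 'G', 3: 'X'} 2
```

As the accumulated `training_data` grows and the model is retrained, more cases clear the confidence bar and the fraction routed to humans shrinks.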

Page 22

Tradeoffs for Automatic Models: Effect of Noise

Get more data → improve model accuracy. Improve data quality → improve classification.

Example case: Porn or not?

[Figure: learning curves on the Mushroom dataset, accuracy (40–100%) vs. number of examples (1–300), one curve for each data quality level: 50%, 60%, 80%, 100%]

Page 23

Scaling Crowdsourcing: Iterative Training, with Noise

Use the machine when it is confident, humans otherwise, and ask as many humans as necessary to ensure quality.

[Flow diagram: new case → automatic model (through machine learning); if confident for quality → automatic answer; if not confident for quality → get human(s) to answer, feeding their answers back into the data from existing crowdsourced answers]

Page 24

Thank you!

Questions?

“A Computer Scientist in a Business School”

http://behind-the-enemy-lines.blogspot.com/

Email: [email protected]