Generative Models for Crowdsourced Data

Upload: yetty

Post on 22-Feb-2016


Page 1: Generative Models for  Crowdsourced  Data

Generative Models for Crowdsourced Data

Page 2: Generative Models for  Crowdsourced  Data

Outline

• What is Crowdsourcing?
• Modeling the labeling process
• Example with real data
• Extensions
• Future Directions

Page 3: Generative Models for  Crowdsourced  Data

What is Crowdsourcing?

• Human-based computation.
• Outsourcing certain steps of a computation to humans.
• “Artificial artificial intelligence.”
• Data science:
  – Making an immediate decision.
  – Creating a labeled data set for learning.

Page 4: Generative Models for  Crowdsourced  Data

Immediate Decision Workflow

Page 5: Generative Models for  Crowdsourced  Data

Labeled Data Set Workflow

Page 6: Generative Models for  Crowdsourced  Data

An Example HIT

Page 7: Generative Models for  Crowdsourced  Data

An Example HIT

Page 8: Generative Models for  Crowdsourced  Data

Funny enough …

• Not everybody agrees on the gender of a Twitter profile.

• Difficult Instances
• Worker Ability / Motivation
• Worker Bias
• Adversarial Behaviour

Page 9: Generative Models for  Crowdsourced  Data

Difficult Instance

Page 10: Generative Models for  Crowdsourced  Data

Difficult Instance

Page 11: Generative Models for  Crowdsourced  Data

Difficult Instance

Page 12: Generative Models for  Crowdsourced  Data

Worker Ability

Page 13: Generative Models for  Crowdsourced  Data

Worker Ability

Page 14: Generative Models for  Crowdsourced  Data

Worker Ability

Page 15: Generative Models for  Crowdsourced  Data

Worker Motivation

Page 16: Generative Models for  Crowdsourced  Data

Worker Motivation

Page 17: Generative Models for  Crowdsourced  Data

Worker Bias

Page 18: Generative Models for  Crowdsourced  Data

Worker Bias

Page 19: Generative Models for  Crowdsourced  Data

Worker Bias

Page 20: Generative Models for  Crowdsourced  Data

Disagreements

• When some workers say “male” and some workers say “female”, what to do?

Page 23: Generative Models for  Crowdsourced  Data

Majority Rules Heuristic

• Assign label l to item x if a majority of workers agree.
• Otherwise item x remains unlabeled.
• Ignores prior worker data.
• Introduces bias into the labeled data.
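The majority rules heuristic can be sketched in a few lines. This is a minimal illustration, assuming "majority" means a strict majority (more than half of the votes); the function name `majority_label` is hypothetical and not from the slides.

```python
from collections import Counter


def majority_label(worker_labels):
    """Return the label given by a strict majority of workers, else None."""
    if not worker_labels:
        return None
    label, count = Counter(worker_labels).most_common(1)[0]
    # Assumption: "majority" means more than half of all votes.
    return label if 2 * count > len(worker_labels) else None


print(majority_label(["male", "female", "male"]))
print(majority_label(["male", "female"]))
```

Items that come back as `None` stay unlabeled, which is one way the heuristic can skew the resulting data set toward easy items.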

Page 26: Generative Models for  Crowdsourced  Data

Train on all labels

• For labeled data set workflow.
• Add all item-label pairs to the data set.
• Equivalent to cost vector of:
  – $P(l \mid \{l_w\}) = \frac{1}{n_w} \sum_w 1\{l = l_w\}$
• Ignores prior worker data.
• Models the crowd, not the “ground truth.”
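As a rough illustration of the formula above, the empirical label distribution for one item is just the fraction of workers who gave each label. The function name and the explicit `label_set` argument are assumptions made for this sketch.

```python
from collections import Counter


def empirical_label_distribution(worker_labels, label_set):
    """P(l | {l_w}) = (1 / n_w) * sum_w 1{l == l_w} for every l in label_set."""
    n_w = len(worker_labels)
    if n_w == 0:
        raise ValueError("need at least one worker label")
    counts = Counter(worker_labels)
    return {label: counts[label] / n_w for label in label_set}


dist = empirical_label_distribution(["male", "male", "female"], ["male", "female"])
print(dist)
```

Every worker counts equally here, which is why this approach models the crowd rather than the ground truth.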

Page 27: Generative Models for  Crowdsourced  Data

What is ground truth?

• Different theoretical approaches.
  – PAC learning with noisy labels.
  – Fully-adversarial active learning.
• Bayesians have been very active.
  – “Easy” to posit a functional form and quickly develop inference algorithms.
  – Issue of model correctness is ultimately empirical.

Page 28: Generative Models for  Crowdsourced  Data

Bayesian Literature

• (2009) Whitehill et al. GLAD framework.
  – (1979) Dawid and Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.
• (2010) Welinder et al. The Multidimensional Wisdom of Crowds.
• (2010) Raykar et al. Learning from Crowds.

Page 29: Generative Models for  Crowdsourced  Data

Bayesian Approach

• Define ground truth via a generative model which describes how “ground truth” is related to the observed output of crowdsource workers.
• Fit to observed data.
• Extract posterior over ground truth.
• Make decision or train classifier.

Page 30: Generative Models for  Crowdsourced  Data

Generative Model

Page 31: Generative Models for  Crowdsourced  Data

Example: Binary Classification

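• Each worker has a matrix:

$$\alpha = \begin{pmatrix} -1 & \alpha_{01} \\ \alpha_{10} & -1 \end{pmatrix}$$

• Each item has a scalar difficulty $\beta > 0$.
• $P(l_w = j \mid z = i) = \dfrac{e^{-\beta \alpha_{ij}}}{\sum_k e^{-\beta \alpha_{ik}}}$
• $\alpha_{ij} \sim N(\mu_{ij}, 1)$; $\mu_{ij} \sim N(0, 1)$
• $\log \beta \sim N(\rho, 1)$; $\rho \sim N(0, 1)$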
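The label likelihood on this slide is a softmax over one row of the worker matrix, scaled by the item parameter. The sketch below evaluates it for a single worker and item; the function name `label_probs` and the example values in `alpha` are illustrative assumptions, and the priors on α and β are left out.

```python
import math


def label_probs(alpha, beta, z):
    """Return [P(l_w = j | z) for each j] with P proportional to exp(-beta * alpha[z][j])."""
    scores = [-beta * a for a in alpha[z]]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical worker matrix: diagonal fixed at -1, off-diagonals free.
alpha = [[-1.0, 0.5],
         [2.0, -1.0]]
print(label_probs(alpha, beta=1.0, z=0))
```

With the diagonal fixed at -1, a larger β pushes more probability onto the correct label, and β near 0 makes every label about equally likely.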

Page 32: Generative Models for  Crowdsourced  Data

Other Problems

• Multiclass classification:
  – Same as binary with larger confusion matrix.
• Ordinal classification: (“Hot or not”)
  – Confusion matrix has special form.
  – $O(L)$ parameters instead of $O(L^2)$.
• Multilabel classification:
  – Reduce to multiclass on power set.
  – Assume low-rank confusion matrix.

Page 33: Generative Models for  Crowdsourced  Data

EM

Page 34: Generative Models for  Crowdsourced  Data

EM

• Initially all workers are assumed moderately accurate and without bias.
  – Implies initial estimate of ground truth distribution favors consensus.
  – Disagreeing with the majority is a likely error.

Page 36: Generative Models for  Crowdsourced  Data

EM

• Initially all workers are assumed moderately accurate.

• Workers consistently in the minority have their confusion probabilities increase.

• Workers with higher confusion probabilities contribute less to the distribution of ground truth.
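The dynamics described above can be seen in a small batch EM loop. This sketch is a simplification, not the model from the earlier slide: it uses plain Dawid and Skene style per-worker confusion probabilities, with no item difficulty and a uniform prior over the true label. The function name, the `init_accuracy` and `smoothing` parameters, and the toy data are all assumptions.

```python
import math


def dawid_skene_em(labels, n_classes, n_iters=20, init_accuracy=0.7, smoothing=0.01):
    """labels: dict item -> list of (worker, label); requires n_classes >= 2.

    Returns (posteriors, confusion), where confusion[w][z][l] = P(l_w = l | z).
    """
    workers = {w for pairs in labels.values() for w, _ in pairs}
    off = (1.0 - init_accuracy) / (n_classes - 1)
    # Start every worker as moderately accurate and unbiased.
    confusion = {w: [[init_accuracy if i == j else off for j in range(n_classes)]
                     for i in range(n_classes)] for w in workers}
    posteriors = {}
    for _ in range(n_iters):
        # E-step: posterior over the true label of each item.
        for item, pairs in labels.items():
            log_q = [sum(math.log(confusion[w][z][l]) for w, l in pairs)
                     for z in range(n_classes)]
            top = max(log_q)
            q = [math.exp(v - top) for v in log_q]
            total = sum(q)
            posteriors[item] = [v / total for v in q]
        # M-step: re-estimate confusion rows from posterior-weighted counts.
        counts = {w: [[smoothing] * n_classes for _ in range(n_classes)]
                  for w in workers}
        for item, pairs in labels.items():
            for w, l in pairs:
                for z in range(n_classes):
                    counts[w][z][l] += posteriors[item][z]
        for w in workers:
            for z in range(n_classes):
                row_total = sum(counts[w][z])
                confusion[w][z] = [c / row_total for c in counts[w][z]]
    return posteriors, confusion


# Toy data: workers "a" and "b" always agree, worker "c" always disagrees.
toy = {i: [("a", i % 2), ("b", i % 2), ("c", 1 - i % 2)] for i in range(6)}
post, conf = dawid_skene_em(toy, n_classes=2)
print(post[0], conf["c"][0])
```

In the toy data, worker "c" is always in the minority, so its estimated confusion probabilities grow while the posterior follows "a" and "b".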

Page 38: Generative Models for  Crowdsourced  Data

“Different” workers are marginalized

• Workers that are consistently in the minority will not contribute strongly to the posterior distribution over ground truth.
  – Even if they are actually more accurate.
• Can correct when one or more accurate workers are paired with some inaccurate workers.
• Good for breaking ties.
• Raykar et al.

Page 39: Generative Models for  Crowdsourced  Data

Example with real data

Page 40: Generative Models for  Crowdsourced  Data

Online EM

• Given a set of worker-label pairs for a single item:

• (Inference) Using current α, find most likely β* and distribution q* over ground truth.

• (Training) Do SGD update of α with respect to EM auxiliary function evaluated at β* and q*.
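One online step for a single item might look like the sketch below. It rests on several assumptions not stated on the slide: β* is chosen from a small grid by maximum likelihood, the prior over the true label is uniform, the priors on α and β are ignored, and the diagonal of each worker matrix stays fixed at -1. All function and parameter names are hypothetical.

```python
import math


def label_probs(alpha_w, beta, z):
    """P(l_w = j | z) for each j, proportional to exp(-beta * alpha_w[z][j])."""
    scores = [-beta * a for a in alpha_w[z]]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer(alpha, pairs, n_classes, beta_grid):
    """Inference: pick beta* from the grid by likelihood, return (beta*, q*)."""
    best = None
    for beta in beta_grid:
        joint = []
        for z in range(n_classes):
            p = 1.0 / n_classes  # assumed uniform prior over the true label
            for w, l in pairs:
                p *= label_probs(alpha[w], beta, z)[l]
            joint.append(p)
        evidence = sum(joint)
        if best is None or evidence > best[0]:
            best = (evidence, beta, [p / evidence for p in joint])
    return best[1], best[2]


def sgd_step(alpha, pairs, n_classes, beta, q, lr=0.1):
    """Training: ascend Q(alpha) = sum_z q[z] * sum_w log P(l_w | z) in place."""
    for w, l in pairs:
        for z in range(n_classes):
            probs = label_probs(alpha[w], beta, z)
            for j in range(n_classes):
                if j == z:
                    continue  # diagonal entries stay fixed at -1
                indicator = 1.0 if j == l else 0.0
                alpha[w][z][j] += lr * (-beta * q[z] * (indicator - probs[j]))


def fresh_matrix():
    return [[-1.0, 0.0], [0.0, -1.0]]


alpha = {"a": fresh_matrix(), "b": fresh_matrix(), "c": fresh_matrix()}
pairs = [("a", 0), ("b", 0), ("c", 1)]
beta_star, q_star = infer(alpha, pairs, 2, [0.5, 1.0, 2.0])
sgd_step(alpha, pairs, 2, beta_star, q_star)
print(beta_star, q_star, alpha["c"])
```

After the update, the dissenting worker "c" has a lower off-diagonal entry in row 0, meaning a higher probability of reporting label 1 when the truth is 0.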

Page 42: Generative Models for  Crowdsourced  Data

Things to do with q*

• Take an immediate cost-sensitive decision
  – $d^* = \arg\min_d E_{z \sim q^*}[f(z, d)]$
• Train an (importance-weighted) classifier
  – cost vector $c_d = E_{z \sim q^*}[f(z, d)]$
  – e.g. 0/1 loss: $c_d = 1 - q^*_d$
  – e.g. binary 0/1 loss: $|c_1 - c_0| = |1 - 2 q^*_1|$
  – No need to decide what the true label is!
• Raykar et al.: why not jointly estimate classifier and worker confusion?
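The expected-cost computations above are short enough to spell out. This sketch assumes decisions and labels share the same index set and uses 0/1 loss as the example; the names `expected_costs` and `zero_one` and the value of `q_star` are illustrative.

```python
def expected_costs(q, loss):
    """c_d = E_{z ~ q}[loss(z, d)] for each decision d."""
    n = len(q)
    return [sum(q[z] * loss(z, d) for z in range(n)) for d in range(n)]


def zero_one(z, d):
    """0/1 loss: no cost when the decision matches the true label."""
    return 0.0 if z == d else 1.0


q_star = [0.2, 0.8]  # hypothetical posterior over a binary ground truth
costs = expected_costs(q_star, zero_one)
decision = min(range(len(costs)), key=costs.__getitem__)
importance = abs(costs[1] - costs[0])
print(costs, decision, importance)
```

For a binary problem the importance weight shrinks to zero as the posterior approaches 0.5, so ambiguous items barely affect the trained classifier.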

Page 43: Generative Models for  Crowdsourced  Data

Raykar et al. insight

• Cost vector is constructed by estimating worker confusion matrices.

• Subsequently, classifier is trained; it will sometimes disagree with workers.

• Would be nice to use that disagreement to inform the worker confusion matrices.

• Circular dependency suggests joint estimation.

Page 44: Generative Models for  Crowdsourced  Data

Generative Model

Page 45: Generative Models for  Crowdsourced  Data

Generative Model

Page 46: Generative Models for  Crowdsourced  Data

Online Joint Estimation

Page 47: Generative Models for  Crowdsourced  Data

Online Joint Estimation

• Initially the classifier will output an uninformative prior and therefore will be trained to follow consensus of workers.

• Eventually workers who disagree with the classifier will have their confusion probabilities increase.

• Workers consistently in the minority can contribute strongly to the posterior if they tend to agree with the classifier.

Page 48: Generative Models for  Crowdsourced  Data

Additional Resources

• Software
  – http://code.google.com/p/nincompoop
• Blog
  – http://machinedlearnings.com/