aggregating information from the crowd · 2019-07-15 · aggregating information from the crowd...

58
Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi, Vibhor Rastogi, Ravi Kumar, Silvio Lattanzi January 07, 2015

Upload: others

Post on 09-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Aggregating information from the crowd

Anirban DasguptaIIT Gandhinagar

Joint work with Flavio Chiericetti, Nilesh Dalvi, Vibhor Rastogi, Ravi Kumar, Silvio Lattanzi

January 07, 2015

Page 2: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Crowdsourcing

Many different modes of crowdsourcing

Page 3: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Aggregating information using the Crowd: the expertise issue

Is IISc more than 100 years old?

Does IISc have more than UG than PG?

Yes Yes Yes No No

No!

Yes!

Yes YesNo

Typically, the answers to the crowdsourced tasks are unknown!

Page 4: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Aggregating information using the Crowd: the effort issue

Does this article have appropriate references at all places?

Yes Yes Yes No No

Even expert users need to spend effort to give

meaningful answers

Page 5: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

• How to ensure that information collected is “useful”?

– Assume users are strategic

– effort put in when making judgments, truthful opinions

– design the right payment mechanism

• How to aggregate opinions from different agents?

– user behavior stochastic

– varying levels of expertise, unknown

– might not stick around to develop reputation

Elicitation & Aggregation

Page 6: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

This talk: only aggregation

• Formalizing a simple crowdsourcing task

– Tasks with hidden labels, varying user expertise

• Aggregation for binary tasks

– stochastic model of user behaviour

– algorithms to estimate task labels + expertise

• Continuous feedback

• Ranking

Page 7: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Binary Task model

• Tasks have hidden labels:

– {-1, +1}

– E.g. labeling whether good quality article

• Each task is evaluated by a number of users

– not too many

• Each user outputs {-1, +1} per task

• Users and tasks fixedn users

m tasks

Page 8: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Simple User model

• Each user performs set of tasks assigned to her

• Users have proficiency

– Indicates probability that the true signal is seen

– This is not observable

-1

+1

-1+1

+1

+1

[Dawid, Skene, '79]

Note: This does not model bias

Page 9: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Stochastic model

G = user-item graph

q = vector of actual qualities

= rating on by user j on item i

Given n-by-m matrix U, estimate vectors q and p

+1

-1

+1

-1

Page 10: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

From users to items

• If all users are same, then simple majority/average will do

• Else, some notion of weighted majority e.g.

• We will try to estimate user reliabilities first

-1

-1

+1

??

Page 11: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Intuition: if G is complete

• Consider the user x user matrix UUt

UUt = (#agreements - #disagreements) between j and k

is a rank one matrix

If we approximate, UUt ≈ E(UUt), w is rank-1 approximation of UUt

noise

Page 12: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arbitrary assignment graphs

Then

Hadamard product:

E[agree – disagree] on eachNumber of shared items

Page 13: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arbitrary assignment graphs

Then

Hadamard product:

E[agree – disagree] on eachNumber of shared items

Similar spectral intuitions hold, only slightly more

work is needed

Page 14: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Algorithms● Core idea is to recover the “expected” matrix using spectral

techniques

● Ghosh, Kale, McAfee'11

– compute topmost eigenvector of item x item matrix

– proves small error for G dense random graph

● Karger, Oh, Shah'11

– using belief propagation on U

– proof of convergence for G sparse random

● Dalvi, D., Kumar, Rastogi'13

– for G an “expander”, use eigenvectors of both GG' and UU'

● EM based recovery Dawid & Skene'79

Page 15: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Empirical: user proficiency can be more or less estimated

Correlation of predicted and actual proficiency on the Y-axis

[ Aggregating crowdsourced binary ratings, WWW'13

Dalvi, D., Kumar, Rastogi ]

Page 16: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Aggregation

Formalizing a simple crowdsourcing task

– Tasks with hidden labels, varying user expertise

Aggregation for binary tasks – stochastic model of user behaviour

– algorithms to estimate task labels + expertise

Continuous feedback

Ranking

Page 17: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Continuous feedback model

• Tasks are continuous:

– Quality

• Each user has a reliability

• Each user outputs a score per task

n users

m tasks

Page 18: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Continuous feedback model

• Tasks are continuous:

– Quality

• Each user has a reliability

• Each user outputs a score per task

Minimize max

n users

m tasks

Page 19: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Some simpler settings & obstacles

Page 20: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Suppose that we know the

Single item, known variances

We want to minimize

Page 21: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Suppose that we know the

Single item, known variances

We want to minimize

it is known that an asymptotically optimal estimate is

Loss =

Page 22: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Single item, unknown variances

We want to minimize

Suppose that we do not know the

Only one sample, so cannot estimate Cannot compute weighted average

Page 23: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arithmetic Mean

In binary case for single item we can obtain the optimum by using a majority rule.

In a continuous case using the same approach we would compute the arithmetic mean.

Page 24: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arithmetic Mean

In binary case for single item we can obtain the optimum by using a majority rule.

In a continuous case using the same approach we would compute the arithmetic mean

and hence

Page 25: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arithmetic Mean

In binary case for single item we can obtain the optimum by using a majority rule.

In a continuous case using the same approach we would compute the arithmetic mean

and hence

Thus the loss

Page 26: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Arithmetic Mean

In binary case for single item we can obtain the optimum by using a majority rule.

In a continuous case using the same approach we would compute the arithmetic mean

and hence

Thus the loss Is this optimal?

Page 27: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Problem with Arithmetic mean

The AM would have error

Page 28: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Problem with Arithmetic mean

The AM would have error

Same problem with the median algorithm

Page 29: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Problem with Arithmetic mean

The AM would have error

By choosing the nearest pair of points, we have a much better

estimate

Same problem with the median algorithm

Page 30: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Shortest gap algorithm

Maybe the optimal algo is to select one of two nearest samples?

In this setting, w.h.p., the two closest points are at distance

But arithmetic mean gives loss

Page 31: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Last obstacle

More is not always better

In this setting, w.h.p., the first two closest points are at distance

Adding bad raters could

actually worsen the shortest

gap algorithm

Mean is not good here either

But so will be some other pair

Page 32: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Single Item case

Page 33: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Results

Theorem 1: There is an algo with expected loss

Theorem 2: There is an example where the gap between any algo and the known variance setting is

[Chiericetti, D., Kumar, Lattanzi' 14]

Page 34: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Algorithm

Combination of two simple algorithms k-median algorithm return the rate of one of the k central raters

Page 35: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Algorithm

Combination of two simple algorithms k-median algorithm return the rate of one of the k central raters

Page 36: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Algorithm

Combination of two simple algorithms k-median algorithm return the rate of one of the k central raters

k-shortest gap Return one of the k closest points

Page 37: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Algorithm

Combination of two simple algorithms k-median algorithm return the rate of one of the k central raters

k-shortest gap Return one of the k closest points

Page 38: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Let be the length of the k-shortest gap

Compute the medianFind the shortest gap and return a point in it

Algorithm

Page 39: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Proof Sketch

w.h.p. contains

WHP , length of the k-shortest gap is at most

Select the median points

Page 40: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Proof Sketch

w.h.p. contains

WHP , length of the k-shortest gap is at most

Select the median points

If we consider points, then WHP there will be no

ratings with variance than

that are within distance

Page 41: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Proof Sketch

Thus the distance of the shortest gap points to the

truth is bounded

Page 42: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Lower bound

Instance: μ selected in

variance of j-th user =

Optimal algorithm (known variance) has loss

Page 43: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Lower bound

Instance: μ selected at random in

variance of j-th user =

Optimal algorithm (known variance) has loss

We will show that maximum likelihood estimation cannot

distinguish between - L and + L → loss

Page 44: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Lower Bound

Consider the two log-likelihoods

Claim: Irrespective of value of μ, can be positive or

negative with const prob.

Page 45: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Lower Bound

Consider the two log-likelihoods

Claim: Irrespective of value of μ, can be positive or

negative with const prob.

Page 46: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

The idea is to use the same algorithm of constant number of items, but to use a smarter version of the k shortest gap that looks for k points at distance at most in all the items

Multiple items

Page 47: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

The idea is to use the same algorithm of constant number of items, but to use a smarter version of the k shortest gap that looks for k points at distance at most in all the items

Multiple items

Page 48: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Multiple items

Theorem: For m=o(log n) , complete graph,

can get an expected loss of

Theorem: For m=Ω(log n), complete or dense random,

expected loss almost identical to the known variance case

Page 49: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Aggregation

Formalizing a simple crowdsourcing task

– Tasks with hidden labels, varying user expertise

Aggregation for binary task

– stochastic model of user behaviour

– algorithms to estimate task labels + expertise

Continuous feedback

Ranking

Page 50: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Crowdsourced rankings

Page 51: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Crowdsourced rankings

How can we

aggregate noisy

rankings

Page 52: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Crowdsourced rankings

How can we

aggregate noisy

rankings

Page 53: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Mallows Model [Mallows 1957]

There is a hidden permutation σ

and a scale parameter β

A permutation π is generated as

κ(σ,π) = Kendall-Tau distance

Braverman, Mossel'09: Finding the MLE for single parameter Mallows

Page 54: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Mallows Model

There is a hidden permutation σ

and a user specific scale parameter βi

Page 55: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Single item with known parameters

Theorem: For m samples, if then can recover σ WHP.

Theorem: If then cannot recover σ

Approximate reconstruction versions of these theorems also hold

[Chiericetti, D, Kumar, Lattanzi, RANDOM'14]

Algo: Weighted Borda count, weights = thresholded β values

Page 56: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Summary

● Host of interesting problems in crowdsourcing aggregation

– Specially for structured outputs

● For binary tasks

– Spectral techniques provide a powerful tool

● For gaussians

– new aggregation problems even for single item

– Combination of k-median & k-shortest gap

● For ranking

– Main technical contribution is calculating the swapping probs

– aggregation with known parameters is nontrivial

Page 57: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Open questions• Continuous feedback

– More natural algorithms for aggregation?

– Better algorithms for multiple items

– Instance optimal algorithms?

– Non-gaussian distributions?

– Mixture learning with lots of components and single/constant samples per component?

• Ranking

– Better estimation of Mallows parameters

– Multiple items, under partial ranking/pairwise preferences?

• More realistic complex model of user?

– Incorporating user bias?

– different kind of expertise, not just reliability

Page 58: Aggregating information from the crowd · 2019-07-15 · Aggregating information from the crowd Anirban Dasgupta IIT Gandhinagar Joint work with Flavio Chiericetti, Nilesh Dalvi,

Thanks!