Wrapper Generation Supervised by a Noisy Crowd

Valter Crescenzi, Paolo Merialdo, Disheng Qiu

Dipartimento di Ingegneria, Università degli Studi Roma Tre, Via della Vasca Navale, 79, Rome

[email protected]

Upload: disheng-qiu

Post on 17-Jul-2015



Extracting Data

2M pages from IMDB, and we want to extract titles, directors, etc.

[Diagram: inference algorithm, wrapper, DB]


Wrapper as XPath

To generate wrappers:

• From a single annotated page, it generates a pool of XPath rules
• All XPath rules are correct solutions for the annotated page
• Some of the rules do not work correctly in all the target pages

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

      page0 (single page)   page1 (other pages)   page2                  ..
r1    Spirited Away         City of God           Howl's Moving Castle   ..
r2    Spirited Away         -                     9.3                    ..
r3    Spirited Away         City of God           null                   ..

Which one is correct?
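The rule/page table on this slide can be reproduced with a small sketch. Everything here is an assumption made for illustration: the three pages are hypothetical, well-formed toy documents (not real IMDB markup), and the rules are rewritten in the limited XPath subset of Python's standard `xml.etree.ElementTree` (Python 3.7+ for the `[.='text']` predicate), which lacks `contains()` and `text()`. Only approximations of r1 and r3 are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy pages; page2 has no "Director:" row.
PAGES = {
    "page0": "<html><table><tr><td>Spirited Away</td></tr>"
             "<tr><td>Director:</td><td>(name)</td></tr></table></html>",
    "page1": "<html><table><tr><td>City of God</td></tr>"
             "<tr><td>Director:</td><td>(name)</td></tr></table></html>",
    "page2": "<html><table><tr><td>Howl's Moving Castle</td></tr></table></html>",
}

# ElementTree-compatible approximations of r1 and r3 from the slide.
RULES = {
    "r1": "./table/tr[1]/td",
    "r3": ".//td[.='Director:']/../../tr[1]/td",
}


def apply_rule(xpath, html):
    """Return the text of the first node matched by xpath, or None."""
    root = ET.fromstring(html)
    node = root.find(xpath)
    return node.text if node is not None else None


def extraction_table(rules, pages):
    """Map rule name -> page name -> extracted value (None when no match)."""
    return {
        name: {page: apply_rule(xpath, html) for page, html in pages.items()}
        for name, xpath in rules.items()
    }


TABLE = extraction_table(RULES, PAGES)
for name, row in TABLE.items():
    print(name, row)
```

Both rules agree on page0 and page1; only page2 reveals that the label-anchored rule fails when the label is missing, which is why a single annotated page cannot tell the rules apart.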

Extracting Data

[Diagram: inference algorithm, wrapper, DB]

               Scalability   Accuracy   Coverage
Supervised     NO            OK         High
Unsupervised   OK            NO         High
Sup.+Annot.    OK            OK         Low

Crowdsourcing

An opportunity to scale supervised approaches

[Diagram: inference algorithm, wrapper, DB]

Scaling Wrapper Inference

Scaling out with crowdsourcing platforms opens new challenges:

Issue: Non-expert workers
Contributions:
• Simple interactions
• Membership Query (yes/no answer)
• Redundant tasks and worker error rate estimation

Issue: Costs
Contributions:
• Active Learning*
• Dynamically engaging workers

Issue: Quality
Contributions:
• Quality Model
• Sampling algorithm*

*[Crescenzi WWW2013]


Inference Algorithm

[Diagram labels: First annotation, Sample, Yes/No, Worker's answers, Quality Model: P(r1)]

r1 = /html/table/tr[1]/td/text()
r2 = //*[contains(.,"Ratings:")]/../../tr[1]/td/text()
r3 = //*[contains(.,"Director:")]/../../tr[1]/td/text()
....

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

For each new answer:

• Rules compatible with the answer are more likely to be correct
• If no rule is good enough, a new query is selected (Active Learning)*

*[Crescenzi WWW2013]
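One way to read "rules compatible with the answer are more likely to be correct" is as a Bayesian update of P(r) after each yes/no answer. The sketch below is an assumption, not the authors' exact quality model: it supposes that exactly one rule in the pool is correct, that the prior is uniform, and that a worker gives the wrong answer with a known probability `eta` strictly between 0 and 1. The function name and the query format are hypothetical.

```python
# Values extracted by each rule, copied from the slide's table.
EXTRACTED = {
    "r1": {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"},
    "r2": {"page0": "Spirited Away", "page1": "-", "page2": "9.3"},
    "r3": {"page0": "Spirited Away", "page1": "City of God", "page2": None},
}


def update_rule_probabilities(prior, extracted, page, value, answer, eta):
    """Posterior over rules after a worker answers "is `value` correct on `page`?".

    If a rule were the correct one, the true answer would be "yes" exactly
    when that rule extracts `value` on `page`. The worker reports the true
    answer with probability 1 - eta and the opposite with probability eta.
    """
    posterior = {}
    for rule, p in prior.items():
        expected_answer = extracted[rule][page] == value
        likelihood = (1.0 - eta) if answer == expected_answer else eta
        posterior[rule] = p * likelihood
    total = sum(posterior.values())
    return {rule: p / total for rule, p in posterior.items()}


prior = {rule: 1.0 / len(EXTRACTED) for rule in EXTRACTED}
after_first = update_rule_probabilities(
    prior, EXTRACTED, "page1", "City of God", True, eta=0.1)
after_second = update_rule_probabilities(
    after_first, EXTRACTED, "page2", "Howl's Moving Castle", True, eta=0.1)
print(after_second)
```

The first answer cannot separate r1 from r3 because both extract "City of God" on page1; the second query is informative precisely because the two rules disagree on page2.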

Termination Strategies

Different termination strategies, trading quality against costs:

• HALTr: expected quality of the wrapper (probability of correctness)
• HALTMQ: number of used MQ
• HALTH: uncertainty of the questioned value (trade-off quality/costs)
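The three stopping conditions could be sketched as simple predicates over the current rule probabilities. The thresholds, the function names, and the use of Shannon entropy as the "uncertainty of the questioned value" are all assumptions made for illustration; the slides do not give the exact definitions.

```python
import math

# Values extracted by each rule, copied from the slides' running example.
EXTRACTED = {
    "r1": {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"},
    "r2": {"page0": "Spirited Away", "page1": "-", "page2": "9.3"},
    "r3": {"page0": "Spirited Away", "page1": "City of God", "page2": None},
}


def halt_r(posterior, threshold=0.95):
    """HALTr-style check: stop when the best rule is probable enough."""
    return max(posterior.values()) >= threshold


def halt_mq(num_queries, budget=10):
    """HALTMQ-style check: stop when the query budget is spent."""
    return num_queries >= budget


def value_entropy(posterior, extracted, page):
    """Entropy (in bits) of the value extracted on `page` under the posterior."""
    mass = {}
    for rule, p in posterior.items():
        value = extracted[rule][page]
        mass[value] = mass.get(value, 0.0) + p
    return -sum(p * math.log2(p) for p in mass.values() if p > 0)


def halt_h(posterior, extracted, pages, threshold=0.2):
    """HALTH-style check: stop when no page is uncertain enough to ask about."""
    return max(value_entropy(posterior, extracted, page) for page in pages) <= threshold


posterior = {"r1": 0.5, "r2": 0.0, "r3": 0.5}
print(halt_r(posterior), halt_mq(3), halt_h(posterior, EXTRACTED, ["page1", "page2"]))
```

With r1 and r3 equally likely, page1 carries no uncertainty (both extract the same value) while page2 carries one full bit, so an entropy-based rule would keep asking about page2.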


Multiple Workers

Workers can make mistakes

We engage multiple workers on the same task, but how many?

• Too many workers: waste of money
• Not enough workers: quality loss

We apply our quality model at runtime to:

• Estimate the workers' error rates
• Select the right number of redundant tasks


Dynamically Engaging Workers

Algorithm main steps:

• Starts with minimal amount of redundancy
• Collects workers' answers
• Estimates rule quality and workers' error rate, using:
  • workers' error rate to estimate rule quality
  • rule quality to estimate workers' error rate
• If no rule is good enough, a new worker is engaged

[Diagram labels: Workers' answers, Most Likely Rule, Error rate estimation, Is it good enough?]
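The mutual estimation in the steps above (error rates used to score rules, rule scores used to re-estimate error rates) can be sketched as an EM-style alternation. This is an assumed formulation, not the authors' published update rules: it uses a uniform prior over rules, treats each worker's errors as independent, and clamps error rates to a hypothetical range [0.01, 0.49] to avoid degenerate estimates. The answer tuples in the example are invented for illustration.

```python
EXTRACTED = {
    "r1": {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"},
    "r2": {"page0": "Spirited Away", "page1": "-", "page2": "9.3"},
    "r3": {"page0": "Spirited Away", "page1": "City of God", "page2": None},
}


def rule_posterior(extracted, answers, eta):
    """P(rule | answers) given the current per-worker error rates."""
    weights = {}
    for rule in extracted:
        w = 1.0
        for worker, page, value, said_yes in answers:
            truth = extracted[rule][page] == value
            w *= (1.0 - eta[worker]) if said_yes == truth else eta[worker]
        weights[rule] = w
    total = sum(weights.values())
    return {rule: w / total for rule, w in weights.items()}


def estimate_error_rates(posterior, extracted, answers, floor=0.01, ceil=0.49):
    """Expected fraction of wrong answers per worker under the rule posterior."""
    wrong, count = {}, {}
    for worker, page, value, said_yes in answers:
        p_wrong = sum(p for rule, p in posterior.items()
                      if (extracted[rule][page] == value) != said_yes)
        wrong[worker] = wrong.get(worker, 0.0) + p_wrong
        count[worker] = count.get(worker, 0) + 1
    return {w: min(ceil, max(floor, wrong[w] / count[w])) for w in count}


def alternate(extracted, answers, initial_eta=0.1, max_iter=50, tol=1e-6):
    """Alternate the two estimates until the error rates stop changing."""
    eta = {worker: initial_eta for worker, _, _, _ in answers}
    posterior = rule_posterior(extracted, answers, eta)
    for _ in range(max_iter):
        new_eta = estimate_error_rates(posterior, extracted, answers)
        converged = max(abs(new_eta[w] - eta[w]) for w in eta) < tol
        eta = new_eta
        posterior = rule_posterior(extracted, answers, eta)
        if converged:
            break
    return posterior, eta


# Hypothetical answers: (worker, page, queried value, answered yes?).
ANSWERS = [
    ("w1", "page1", "City of God", True),
    ("w1", "page2", "Howl's Moving Castle", True),
    ("w1", "page2", "9.3", False),
    ("w2", "page1", "City of God", False),
    ("w2", "page2", "Howl's Moving Castle", True),
    ("w2", "page0", "Spirited Away", True),
]
POSTERIOR, ETA = alternate(EXTRACTED, ANSWERS)
print(POSTERIOR, ETA)
```

A driver loop around `alternate` would then check a termination condition on the posterior and, if it is not met, add the answers of one more worker and rerun; that outer loop is omitted here.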

Dynamically Engaging Workers

• Two real workers are engaged
• A new sequence is defined considering the union of all the answers

η = expected error rate

Answers   "Spirited Away"   "-"   "9.3"
η         0.1               0.1   0.1
          Yes               No    No

Answers   "Spirited Away"   "City of God"   "9.3"
η         0.1               0.1             0.1
          Yes               No              No

Union of all the answers:

Answers   "Spirited Away"   "City of God"   "9.3"   "Spirited Away"   "-"   "9.3"
η         0.1               0.1             0.1     0.1               0.1   0.1
          Yes               No              No      Yes               No    No



Dynamically Engaging Workers

• The most likely rule and its values are returned
• The most likely rule and its probability are used to estimate η

η = expected error rate

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

e.g.: P(r1) = 0.9, then P(r1) = 0.93

Answers   "Spirited Away"   "-"   "9.3"
η         0.1               0.1   0.1
          Yes               No    No

Answers   "Spirited Away"   "City of God"   "9.3"
η         0.37              0.37            0.37
          Yes               No              No


Dynamically Engaging Workers

• When the computation converges, the system checks the termination condition
• If it is not met, a new worker is considered and the computation starts again

η = expected error rate

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

P(r1) = 0.95

Answers   "Spirited Away"   "-"    "9.3"
η         0.05              0.05   0.05
          Yes               No     No

Answers   "Spirited Away"   "City of God"   "9.3"
η         0.35              0.35            0.35
          Yes               No              No

Experiments - Dataset

Site               Entity         |Pages|
www.imdb.com       Actor          500k
www.imdb.com       Movies         500k
www.allmusic.com   Band           500k
www.allmusic.com   Albums         500k
www.nasdaq.com     Stock Quotes   7k

40 attributes

manually crafted golden rules

Measures:

• Costs: #MQ
• Quality: Precision, Recall and F-measure
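The quality measures can be computed by comparing the values a rule extracts with those of the golden rule, page by page. The sketch below assumes one value per page, treats `None` as "nothing extracted", and uses the balanced F-measure; the function name and these conventions are assumptions, since the slides do not spell out how the measures were computed.

```python
def precision_recall_f(extracted, golden):
    """Precision, recall and F-measure of one rule against the golden values.

    Both arguments map page -> value, with None meaning no value.
    """
    correct = sum(1 for page, value in golden.items()
                  if value is not None and extracted.get(page) == value)
    n_extracted = sum(1 for value in extracted.values() if value is not None)
    n_golden = sum(1 for value in golden.values() if value is not None)
    precision = correct / n_extracted if n_extracted else 0.0
    recall = correct / n_golden if n_golden else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Running example from the slides, taking r1 as the golden rule.
GOLDEN = {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"}
R2 = {"page0": "Spirited Away", "page1": "-", "page2": "9.3"}
R3 = {"page0": "Spirited Away", "page1": "City of God", "page2": None}

print(precision_recall_f(R3, GOLDEN))
print(precision_recall_f(R2, GOLDEN))
```

Under these conventions r3 is precise but misses page2, while r2 extracts a value on every page and is wrong on two of them.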

Simulating Real Workers

100 real (and noisy) AMT workers

Real workers: 1/3 perfect, average η* = 10%, ση* = 11%

We simulated the error rate distribution with an exponential function

[Plot: distribution of the workers' error rates with the fitted exponential curve λe^(-λx); x-axis: error rate (0.00 to 0.50), y-axis: 0% to 40%]
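A synthetic crowd along these lines could be generated as follows. The 10% mean comes from the slide; the rate λ = 1/0.10, the truncation at 0.5, the fixed seed, and the function names are assumptions of this sketch, and the observation that one third of real workers were perfect is not modeled.

```python
import random


def sample_error_rates(n_workers, mean_error=0.10, max_error=0.5, seed=0):
    """Draw per-worker error rates from an exponential distribution.

    Rates are truncated at max_error so that no simulated worker is
    worse than a coin flip.
    """
    rng = random.Random(seed)
    return [min(max_error, rng.expovariate(1.0 / mean_error))
            for _ in range(n_workers)]


def noisy_answer(truth, eta, rng):
    """Return the true yes/no answer, flipped with probability eta."""
    return (not truth) if rng.random() < eta else truth


RATES = sample_error_rates(10000)
rng = random.Random(1)
answers = [noisy_answer(True, RATES[0], rng) for _ in range(5)]
print(sum(RATES) / len(RATES), answers)
```

An exponential distribution has a standard deviation equal to its mean, which is roughly in line with the 10% average and 11% deviation reported for the real workers.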


Wrong Estimation

Noisy single worker:
- η: expected error rate
- η*: observed error rate

      η* > η (optimistic)   η* = η (correct)   η* < η (pessimistic)
MQ    ~10                   ~10                ~30
F     ~0.65                 ~1                 ~1

• η close to η* (good estimation): few MQ, good F
• η* > η (too optimistic): too few MQ, low F
• η > η* (too pessimistic): too many MQ, same F

Need to estimate the workers' error rate


Dynamically Engaging Workers

Synthetic (and noisy) workers; |W| = # workers

Algorithm           F      σF    #MQ     max-MQ   max-|W|   |η-η*|
ALFη (one worker)   0.92   17%   7.58    11       1         -
ALFREDno            1      1%    18.6    83       9         -
ALFRED              1      1%    16.1    44       4         0.8%
ALFRED*             1      1%    16.07   40       4         0%

Annotations on the table:

• lower quality, fewer MQ
• correct estimation required
• accurate estimation, but achieved only at the end
• Almost perfect wrapper

[Chart: percentage of cases by |W| = 2, 3, 4 (axis 0% to 100%); values shown: 2%, 6%, 92%]

Conclusions

We proposed a framework for wrapper generation:

• simple tasks can be completed by non-expert workers
• cost-effective wrapper generation
• highly predictable quality of the output wrapper

Solid background in machine learning and computational learning theory*

The proposed framework can be applied to other learning tasks:

• Crawling
• NLP

*[Angluin-Laird1988, Angluin2001]

Thank you for your attention!

Future development

Learning framework applied to other problems (NLP, Entity Linkage)

ALFRED used to learn structure-driven crawling algorithms

Hybrid approaches combining human annotations and automatic annotations

Alternative models of truth/error rate

Optimizing the initial number of workers

Wrong Estimation

Noisy single worker:
- η = 0.1
- η* = from 0.05 to 0.4

[Plot: F (axis 0.4 to 1.1) against η* (0.05 to 0.4) for HALTr, HALTH and HALTMQ]

[Plot: MQ (axis 4 to 14) against η* (0.05 to 0.4) for HALTr and HALTH]

Wrong Estimation

Noisy single worker:
- η = from 0 to 0.4
- η* = 0.1

[Plot: F (axis 0.5 to 1) against η (0 to 0.4) for HALTr, HALTH and HALTMQ]

[Plot: MQ (axis ticks 3, 10, 100) against η (0 to 0.4) for HALTr and HALTH]


Sampling & Quality

      page0
r1    Spirited Away
r2    Spirited Away
r3    Spirited Away

r1 = r2 = r3

      page0           page1
r1    Spirited Away   City of God
r2    Spirited Away   -
r3    Spirited Away   City of God

r1 = r3 ≠ r2

      page0           page1         page2
r1    Spirited Away   City of God   Howl's Moving Castle
r2    Spirited Away   -             9.3
r3    Spirited Away   City of God   null

r1 ≠ r3 ≠ r2

Pages make apparent the differences among the rules

Find a small set that makes apparent the same differences observed in the whole set of pages

Sampling & Quality

The problem: find the smallest set of pages that makes apparent the differences among the rules (e.g., 100 pages that make apparent the same differences that we would observe in 2M pages).

It is an NP-hard problem (reduction to the SET-COVER problem): find the smallest set of pages that covers all the groups of rules (group = equivalent rules).

The smallest set is not needed: a greedy algorithm, O(|Pages|) in time and O(1) in space, works very well in practice.

Sampling & Quality

XPath rules

For every page p:
    if (p makes apparent new differences)
        representative pages += p

An offline algorithm that can be easily parallelized
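The greedy pass above can be sketched by keeping the rules partitioned into groups that have behaved identically on the pages selected so far; a page "makes apparent new differences" when it splits at least one group. The function name and the `extract` callback are hypothetical, and the sketch is a single-machine reading of the pseudocode rather than the parallel implementation mentioned on the slide.

```python
def representative_pages(pages, rules, extract):
    """Greedy single pass: keep a page only if it separates some rules.

    `extract(rule, page)` returns the value the rule extracts on the page.
    Returns the selected pages and the final groups of equivalent rules.
    """
    groups = [list(rules)]  # initially all rules are indistinguishable
    selected = []
    for page in pages:
        refined = []
        for group in groups:
            by_value = {}
            for rule in group:
                by_value.setdefault(extract(rule, page), []).append(rule)
            refined.extend(by_value.values())
        if len(refined) > len(groups):  # the page splits at least one group
            groups = refined
            selected.append(page)
    return selected, groups


# Running example from the slides.
EXTRACTED = {
    "r1": {"page0": "Spirited Away", "page1": "City of God", "page2": "Howl's Moving Castle"},
    "r2": {"page0": "Spirited Away", "page1": "-", "page2": "9.3"},
    "r3": {"page0": "Spirited Away", "page1": "City of God", "page2": None},
}
SELECTED, GROUPS = representative_pages(
    ["page0", "page1", "page2"],
    ["r1", "r2", "r3"],
    lambda rule, page: EXTRACTED[rule][page],
)
print(SELECTED, GROUPS)
```

On the running example page0 is skipped because all rules agree on it, page1 separates r2 from the others, and page2 separates r1 from r3. The memory used depends on the number of rules, not on the number of pages scanned.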

Sampling

Entity   Sampling         |Pages|   P      R
Movies   Biased           250       0.98   0.71
Movies   Random           250       0.99   0.99
Movies   Representative   42        1.00   1.00
Actors   Biased           250       1.00   1.00
Actors   Random           250       1.00   0.96
Actors   Representative   30        1.00   1.00
Stocks   Biased           86        1.00   0.98
Stocks   Random           86        1.00   0.99
Stocks   Representative   15        1.00   1.00
Albums   Biased           258       1.00   0.99
Albums   Random           258       1.00   1.00
Albums   Representative   59        1.00   1.00
Bands    Biased           289       1.00   0.68
Bands    Random           289       1.00   1.00
Bands    Representative   36        1.00   1.00

• Representative: perfect
• Biased: recall loss
• Random: better than biased but not perfect

Related Wrapper Generation

• Automatic Wrappers for Large Scale Web Extraction, Nilesh Dalvi et al., VLDB 2011
• DIADEM, T. Furche, G. Gottlob et al., WWW 2012
• Web Data Extraction Based on Partial Tree Alignment, Yanhong Zhai, WWW 2005
• Extracting Structured Data from Web Pages, Arvind Arasu, Hector Garcia-Molina, SIGMOD 2003
• RoadRunner, Crescenzi, VLDB 2001
• Wrapper Induction for Information Extraction, Kushmerick, IJCAI 97
• Active Learning with Multiple Views, Ion Muslea, JAIR 2006
• Interactive Wrapper Generation with Minimal User Effort, Utku Irmak, WWW 2006