Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang


Zero-shot Entity Extraction from Web Pages

ACL

June 23, 2014

Panupong Pasupat and Percy Liang

Focus: Entity Extraction

hiking trails

hiking trails near Baltimore

Avalon Super Loop

Patapsco Valley State Park

Gunpowder Falls State Park

Union Mills Hike

Greenbury Point

...

What are the longest near Baltimore?

Data Source

Applications: question answering / semantic parsing / taxonomy construction / ontology expansion / knowledge base population / ...

Semi-Structured Data on the Web

Challenge: Long Tail of Categories

person location organization

airport battleship acid pitcher

settlement headgear metaphor haircut

poker hand biome enzyme superstition

tutorials at ACL 2014

dishes at Pu Pu Hot Pot

Stanford computer science professors

We want to generalize to unseen categories

Relevant Approaches

Bootstrapping from Seed Examples:

seeds (Avalon Super Loop, Hilton Area) + web pages → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

Use seed examples to specify the entity category

... but we might not have seeds (e.g. in question answering)

[Wang and Cohen, 2009; Google Sets; Sarmento et al. 2007; ...]

Our Work

query (hiking trails near Baltimore) + web page → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

Use a natural language query to specify the entity category

Outline

1. Setup

• Problem Setup

• Dataset

2. Approach

3. Results

Problem Setup

Input:

• query x

hiking trails near Baltimore

• web page w

Output:

• list of entities y

[Avalon Super Loop, Patapsco Valley State Park, ...]

Dataset

We created the OpenWeb dataset with diverse queries and webpages.

airlines of italy

natural causes of global warming

lsu football coaches

bf3 submachine guns

badminton tournaments

foods high in dha

technical colleges in south carolina

songs on glee season 5

singers who use auto tune

san francisco radio stations

Query Generation

Breadth-first search on Google Suggest

"list of" → Google Suggest → "list of Indian movies", ...

"list of Indian movies" → Template Extraction → "list of movies", "list of Indian", ...

[Berant et al., 2013]
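The breadth-first search over suggestions can be sketched as follows. This is only an illustration, not the authors' crawler: `suggest` is a hypothetical stand-in for a suggestion service, `TOY_SUGGESTIONS` is invented data, and the template extraction step is not modeled.

```python
from collections import deque

# Invented toy data standing in for a suggestion service (assumption).
TOY_SUGGESTIONS = {
    "list of": ["list of Indian movies", "list of airlines"],
    "list of Indian movies": ["list of Indian movies 2013"],
}


def suggest(prefix):
    """Hypothetical suggestion lookup; a real system would query a service."""
    return TOY_SUGGESTIONS.get(prefix, [])


def bfs_queries(seed, max_queries=100):
    """Collect queries reachable from `seed` in breadth-first order."""
    seen = {seed}
    order = []
    frontier = deque([seed])
    while frontier and len(order) < max_queries:
        prefix = frontier.popleft()
        for query in suggest(prefix):
            if query not in seen:
                seen.add(query)
                order.append(query)
                frontier.append(query)  # expand this query later
    return order


queries = bfs_queries("list of")
```

Using a queue rather than a stack means that suggestions closest to the seed are collected first, which is what makes the search breadth-first.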

Dataset Annotation

Annotate the first, second, and last entities matching the query using Amazon Mechanical Turk.

airlines of italy

Annotation

First: Air Dolomiti

Second: Air Europe

Last: Wind Jet

Dataset Statistics

2773 examples

2269 unique queries

894 unique headwords ← long tail!

1483 unique web domains ← long tail!

(≠ wrapper induction)

Outline

1. Setup

2. Approach

• Extraction Predicate

• Framework

• Modeling

• Features

3. Results

Extraction Predicate

How can we choose what to extract from a web page w?

html
  head
  body
    table
      tr
        td td td td
    h1
    table
      tr
        th th
      tr
        td td
      ...
      tr
        td td

number of possible entity lists ≈ 2^(number of nodes)

Idea: Entities usually share the same tag and tree level

z = /html[1]/body[1]/table[2]/tr/td[1]

Captures structures such as table columns, list entries, headers of the same level, ...

Each web page has ≈ 8500 extraction predicates z

[Sahuguet and Azavant, 1999; Liu et al., 2000; Crescenzi et al., 2001]
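To make the predicate concrete, here is a small sketch that executes z on a toy page. The page content is invented for illustration, and the standard-library ElementTree path syntax is used as a stand-in for whatever XPath engine the authors used. Because ElementTree paths are relative to the root element, the leading `/html[1]` step is stripped first.

```python
import xml.etree.ElementTree as ET

# Toy page invented for illustration; it is well-formed XML so that
# ElementTree can parse it (real web pages need an HTML parser).
PAGE = """
<html>
  <head></head>
  <body>
    <table><tr><td>Home</td><td>About</td><td>Pricing</td><td>Contact</td></tr></table>
    <h1>Hiking trails near Baltimore</h1>
    <table>
      <tr><th>Trail</th><th>Notes</th></tr>
      <tr><td>Avalon Super Loop</td><td>note 1</td></tr>
      <tr><td>Patapsco Valley State Park</td><td>note 2</td></tr>
    </table>
  </body>
</html>
"""


def execute(predicate, page):
    """Run an extraction predicate on a page and return the entity list."""
    root = ET.fromstring(page.strip())
    prefix = "/html[1]"
    if not predicate.startswith(prefix):
        raise ValueError("predicate must start with /html[1]")
    # The root element is <html>, so continue from it with a relative path.
    relative = "." + predicate[len(prefix):]
    return [(node.text or "").strip() for node in root.findall(relative)]


entities = execute("/html[1]/body[1]/table[2]/tr/td[1]", PAGE)
```

The step `tr` has no index, so it matches every row of the second table, while `td[1]` keeps only the first cell of each row. That is how a single predicate selects a whole table column.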

Framework

Inputs: query x = hiking trails near Baltimore; web page w = html (head ..., body ...)

Generation → Z, the set of extraction predicates (|Z| ≈ 8500)

Model → z = /html[1]/body[1]/table[2]/tr/td[1]

Execution → y = [Avalon Super Loop, Patapsco Valley State Park, ...]

A graphical model with latent extraction predicate z

Modeling

Let x be a query and w be a web page.

Define a log-linear distribution over the extraction predicates z ∈ Z:

$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}$

• θ is a parameter vector

• φ(x, w, z) is a feature vector

• Find θ that maximizes the log-likelihood of the training data using AdaGrad [Duchi et al., 2010]
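A minimal sketch of the log-linear distribution and one AdaGrad update is given below. It assumes, as a simplification, that each training example has a single known-correct predicate; the slides only state that the log-likelihood is maximized with AdaGrad. The function names, learning rate, and toy feature vectors are all illustrative.

```python
import math


def predicate_distribution(theta, features):
    """p(z | x, w) proportional to exp(theta . phi(x, w, z)).

    `features` holds one feature vector phi(x, w, z) per candidate z.
    """
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [wt / total for wt in weights]


def adagrad_step(theta, grad_sq_sum, features, correct, rate=0.1, eps=1e-8):
    """One AdaGrad ascent step on log p(correct z | x, w) for one example."""
    probs = predicate_distribution(theta, features)
    dim = len(theta)
    # Gradient of the log-likelihood: phi(correct) - E_p[phi].
    expected = [sum(p * phi[j] for p, phi in zip(probs, features)) for j in range(dim)]
    grad = [features[correct][j] - expected[j] for j in range(dim)]
    # AdaGrad scales each coordinate by the root of its accumulated squared gradient.
    new_sq = [s + g * g for s, g in zip(grad_sq_sum, grad)]
    new_theta = [t + rate * g / (math.sqrt(s) + eps)
                 for t, g, s in zip(theta, grad, new_sq)]
    return new_theta, new_sq


theta = [0.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]  # toy feature vectors for two predicates
before = predicate_distribution(theta, candidates)
theta, squares = adagrad_step(theta, [0.0, 0.0], candidates, correct=0)
after = predicate_distribution(theta, candidates)
```

After the step, the probability of the correct predicate rises above the uniform starting value.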

Features

$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}$

Structural Features: context

Denotation Features: content

For the query "hiking trails near Baltimore", the list

Avalon Super Loop, Patapsco Valley State Park, Gunpowder Falls State Park, Rachel Carson Conservation Park, Union Mills Hike, ...

should score higher (>) than

Home, About Baltimore Tour, Pricing, Contact, Online Support, ...

Defining Features on Lists

List 1 (good): George Washington, John Adams, Thomas Jefferson, James Madison, ... (39 more) ..., Barack Obama

List 2 (bad): John Adams, John Adams, John Adams, John Adams, John Adams, John Adams, ... (100 more) ..., John Adams

List 3 (bad): Blog, Photos and Video, Briefing Room, In the White House, Mobile Apps, Contact Us

Identity of elements: List 1 diverse, List 2 identical, List 3 diverse

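The same three lists, with each element replaced by its POS tag sequence:

List 1 (good): NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (39 more) ..., NNP NNP

List 2 (bad): NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (100 more) ..., NNP NNP

List 3 (bad): NN, NNS CC NNP, NN NN, IN DT NNP NNP, NNP NNPS, NN PRP

Identity of elements: List 1 diverse, List 2 identical, List 3 diverse

POS tags of elements: List 1 identical, List 2 identical, List 3 diverse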

1. Abstraction

Map list elements into abstract tokens (here, the number of words):

Avalon Super Loop → 3
Patapsco Valley State Park → 4
Gunpowder Falls State Park → 4
Union Mills Hike → 3
Greenbury Point → 2

2. Aggregation

Define features using the histogram of the abstract tokens (here over the values 2, 3, 4): Entropy, Majority, MajorityRatio, Single, Mean, Variance

Use this method for both structural and denotation features
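The two steps can be sketched in a few lines. The abstraction used here is the word count from the example; the exact definitions of the aggregate features are assumptions made for illustration (natural-log entropy, population variance, Single meaning that the histogram has only one distinct token), since the slides give only their names.

```python
import math
from collections import Counter


def word_count(element):
    """Abstraction: map a list element to an abstract token (its word count)."""
    return len(element.split())


def aggregate(tokens):
    """Aggregation: summarize the histogram of numeric abstract tokens."""
    histogram = Counter(tokens)
    n = len(tokens)
    majority, majority_count = histogram.most_common(1)[0]
    mean = sum(tokens) / n
    return {
        "entropy": -sum((c / n) * math.log(c / n) for c in histogram.values()),
        "majority": majority,
        "majority_ratio": majority_count / n,
        "single": len(histogram) == 1,
        "mean": mean,
        "variance": sum((t - mean) ** 2 for t in tokens) / n,
    }


trails = [
    "Avalon Super Loop",
    "Patapsco Valley State Park",
    "Gunpowder Falls State Park",
    "Union Mills Hike",
    "Greenbury Point",
]
tokens = [word_count(t) for t in trails]  # [3, 4, 4, 3, 2]
features = aggregate(tokens)
```

Because the features depend only on the histogram, the feature vector has the same size whether the list has five elements or five hundred. Mean and Variance only make sense for numeric tokens such as word counts; for symbolic tokens such as POS tag sequences, only the histogram-shape features apply.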

Outline

1. Setup

2. Approach

3. Results

• Main Results

• Error Analysis

• Feature Analysis

Main Results

[Bar chart; y-axis: Accuracy (%)] Baseline (most frequent extraction predicates): 10.3; Accuracy: 40.5; Accuracy @ 5: 55.8

Error Analysis

Correct: 40.5%

Coverage Errors: 33.4%

Ranking Errors: 26.1%

Examples of Correct Predictions

Query: disney channel movies

/html[1]/body/div[2]/div/div/div[3]/div[1]/div/div/div/div/b

Query: universities in canada

/html[1]/body/div/div/div/div/div/div/div/a/text

Query: nobel prize winners

/html[1]/body/div/div[2]/div/div/div/h6/a/text

Coverage Errors

No extraction predicate z produces an entity list y matching the annotation

Examples of Coverage Errors

Query: companies named after a person

/html/body/div[3]/div[3]/div[4]/ul/li/a

Need richer extraction predicates!

Query: hedge funds in new york

/html/body/div[3]/div[3]/div[4]/.../table/tbody/tr/td[2]/a

Need compositionality!

Ranking Errors

The system finds a list y matching the annotation, but it does not have the highest model score.
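The three outcomes can be told apart with a short check. This sketch assumes that a list matches the annotation when its first, second, and last elements equal the annotated ones exactly; the function names and the navigation-menu list are illustrative.

```python
def matches_annotation(entities, first, second, last):
    """Assumed rule: the list starts with `first`, `second` and ends with `last`."""
    return (
        len(entities) >= 2
        and entities[0] == first
        and entities[1] == second
        and entities[-1] == last
    )


def classify(ranked_lists, annotation):
    """Classify one example; `ranked_lists` is ordered by model score, best first."""
    hits = [matches_annotation(y, *annotation) for y in ranked_lists]
    if not any(hits):
        return "coverage error"  # no candidate list matches the annotation
    return "correct" if hits[0] else "ranking error"


annotation = ("Air Dolomiti", "Air Europe", "Wind Jet")
airlines = ["Air Dolomiti", "Air Europe", "Wind Jet"]
menu = ["Home", "About", "Contact"]  # invented navigation list

outcome_top = classify([airlines, menu], annotation)
outcome_second = classify([menu, airlines], annotation)
outcome_missing = classify([menu], annotation)
```

A coverage error cannot be fixed by a better model, because no candidate is right; a ranking error can.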

Examples of Ranking Errors

Query: doctors at emory

/html/body/div[3]/div[4]/table/tbody/tr/td[2]

Augmenting Denotation Features

Observation: Entities of different categories have different linguistic properties.

mayors of Chicago: Rahm Emanuel, Richard M. Daley, Eugene Sawyer, ...

universities in Chicago: Aurora University, DePaul University, Illinois Institute of Technology, ...

Experiment: Augment denotation features with the query category.

POS majority = NNP NNP → (POS majority = NNP NNP, query category = people)
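The augmentation amounts to conjoining each denotation feature with the query category. In this sketch the feature names are plain strings and the category is passed in directly; how the category is obtained from the query is not stated on the slides, so `query_category` is a hypothetical input.

```python
def augment(denotation_features, query_category):
    """Conjoin each denotation feature with the query category."""
    return {
        f"{name} = {value}, query category = {query_category}": 1.0
        for name, value in denotation_features.items()
    }


plain = {"POS majority": "NNP NNP"}
for_people = augment(plain, "people")
for_schools = augment(plain, "universities")
```

The same POS pattern now produces different features for different categories, so the model can learn a separate weight for each combination.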

[Bar chart; y-axis: Accuracy (dev)] Denotation: 19.8; Augmented Denotation: 25

[Bar chart; y-axis: Accuracy (dev)] Structural + Denotation (default): 41.1; Structural + Augmented Denotation: 41.7

Hypothesis: Structural features have high influence when the web page comes from a Web search result.

hiking trails near Baltimore

Verify the hypothesis: Concatenate a random web page

• Creates noise: entity lists with high structural feature scores might not be the correct list

[Bar chart; y-axis: Accuracy (stitched)] Structural + Denotation (default): 19.3; Structural + Augmented Denotation: 29.2

Summary

query (hiking trails near Baltimore) + web page → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

A framework for extracting entities from a natural language query and a single web page

• Focus on the long tail of entity categories (e.g. tutorials at ACL)

• Consider both structural and denotation features

• Handle lists of different sizes with abstraction and aggregation

Future Work

• Model relationship between entities and category strings

• Compositionality in natural language

Download code and dataset:

http://nlp.stanford.edu/software/web-entity-extractor-ACL2014

Thank you!
