Self-supervised Probabilistic Methods for Extracting Facts from Text

Doug Downey


Page 1

Self-supervised Probabilistic Methods for Extracting Facts from Text
Doug Downey

Page 2

Q: Who did IBM acquire in 2002?

A: “IBM acquired * in 2002”

Q: Who has won a best actor Oscar for playing a villain?

A: “won best actor for playing a villain” – 0 hits!

The answer isn’t on just one Web page

Web Search: Answering Questions

Page 3

Q: Who has won a best actor Oscar for playing a villain?

A: Find all $X where the following appear:
“$X won best actor for $Y”
“$X, who played $Z in $Y”
“the villain, $Z”

“Forest Whitaker won best actor for The Last King of Scotland” – 210 hits

“Forest Whitaker, who played Idi Amin in The Last King of Scotland” – 4 hits

“the villain, Idi Amin” – 1 hit

Answer: Forest Whitaker

Solution: Synthesizing Across Pages

Page 4

Given: One or more contexts indicating a semantic class C, e.g., “$X starred in $Y” => StarredIn($X, $Y)
– User-specified (TextRunner [Banko et al., IJCAI 2007])
– Automatically generated (KnowItAll [Etzioni et al., AIJ 2005])
– Bootstrapped from resources [Snow et al., NIPS 2004]

Output: instances of C
But extraction from contexts is highly imperfect!

=> Output P(x ∈ C) for each term x

Self-supervised – no hand-tagged examples

Self-supervised Information Extraction

Page 5

Given: One or more contexts suggestive of a semantic class C, and a corpus of text

Output: P(x ∈ C) for each term x

KnowItAll Hypothesis – Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C.

Distributional Hypothesis – Terms in the same class tend to appear in similar contexts.

My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.

Self-supervised Information Extraction

Page 6

Who cares about Probabilities?

Why not use rankings (e.g., the precision/recall metric)?

P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) )

And P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) )

=> Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain.

Not possible with rankings!
In fact, combining even perfect rankings can yield accuracy < .

Page 7

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 8

Term-Context Matrix

         Contexts
Terms  . . .  98   0    2   25   1  513 . . .
       . . .   2   0  930    0   0    1 . . .
       . . .   1   0   10    0   0    1 . . .

E.g., terms (rows) are potential elements of C: Miami, (Robert De Niro, Raging Bull), …

Page 9

         Contexts
Terms  . . .  98   0    2   25   1  513 . . .
       . . .   2   0  930    0   0    1 . . .
       . . .   1   0   10    0   0    1 . . .

E.g., contexts (columns): “cities such as $X”, “$X said $Y offered to”; also: parse trees, bag of words, containing Web domain, etc.

Term-Context Matrix

Page 10

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 11

Two Research Questions

M – term-context matrix
M_C – columns of M for contexts suggesting C
P(x ∈ C) – prior estimate that x ∈ C

Formalizing the KnowItAll hypothesis: What is an expression for P(x ∈ C | M_C)?

Formalizing the distributional hypothesis: What is an expression for P(x ∈ C | M)?

Page 12

Key Requirements for Models

1) Produce probabilities

2) Execute at “interactive” speed

3) No hand-tagged data

Page 13

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 14

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 15

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 16

1. Modeling Redundancy – The Problem

Consider a single context, e.g.: “cities such as x”

If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?

Page 17

Modeling with k

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

Country(x)

extractions, n = 10

Page 18

Modeling with k

Country(x) extractions, n = 10

                 k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

Noisy-Or Model:

P(x ∈ C | x appears k times) = 1 − (1 − p)^k

where p is the probability that a single sentence is true, i.e. p = 0.9.

Important (but noisy-or ignores these):
– Sample size (n)
– Distribution of C
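The noisy-or computation above is simple enough to sketch in a couple of lines (a minimal illustration; the function name is mine, with p = 0.9 as on the slide):

```python
def noisy_or(k, p=0.9):
    """P(x in C | x appears k times): x is in C unless all k
    supporting sentences are wrong, where each sentence is true
    independently with probability p."""
    return 1.0 - (1.0 - p) ** k

# Matches the table above: k = 2 gives 0.99, k = 1 gives 0.9, and
# the estimate only grows with k -- regardless of sample size n.
```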

Page 19

Needed in Model: Sample Size

Country(x) extractions, n ≈ 50,000

                     k     P_noisy-or
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.9
OilWatch Africa       1    0.9
Religion Paraguay     1    0.9
Chicken Mole          1    0.9
Republics of Kenya    1    0.9
Atlantic Ocean        1    0.9
New Zeland            1    0.9

Country(x) extractions, n = 10

                 k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

As sample size increases, noisy-or becomes inaccurate.

Page 20

Needed in Model: Distribution of C

[Formula garbled in transcript: a frequency-adjusted noisy-or, P_freq(x ∈ C | x appears k times), in terms of p, k, and n]

Country(x) extractions, n ≈ 50,000

                     k     P_noisy-or
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.9
OilWatch Africa       1    0.9
Religion Paraguay     1    0.9
Chicken Mole          1    0.9
Republics of Kenya    1    0.9
Atlantic Ocean        1    0.9
New Zeland            1    0.9

Page 21

Needed in Model: Distribution of C

[Formula garbled in transcript: the frequency-adjusted P_freq(x ∈ C | x appears k times), in terms of p, k, and n]

Country(x) extractions, n ≈ 50,000

                     k     P_freq
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.05
OilWatch Africa       1    0.05
Religion Paraguay     1    0.05
Chicken Mole          1    0.05
Republics of Kenya    1    0.05
Atlantic Ocean        1    0.05
New Zeland            1    0.05

Page 22

Needed in Model: Distribution of C

City(x) extractions, n ≈ 50,000

                  k    P_freq
Toronto         274    0.9999…
Belgrade         81    0.98
Lacombe           1    0.05
Kent County       1    0.05
Nikki             1    0.05
Ragaz             1    0.05
Villegas          1    0.05
Cres              1    0.05
Northeastwards    1    0.05

Country(x) extractions, n ≈ 50,000

                     k     P_freq
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.05
OilWatch Africa       1    0.05
Religion Paraguay     1    0.05
Chicken Mole          1    0.05
Republics of Kenya    1    0.05
Atlantic Ocean        1    0.05
New Zeland            1    0.05

Probability that x ∈ C depends on the distribution of C.

Page 23

The URNS Model – Single Urn

Page 24

The URNS Model – Single Urn

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

Page 25

Tokyo

The URNS Model – Single Urn

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

…cities such as Tokyo…

Page 26

Single Urn – Formal Definition

C – set of unique target labels

E – set of unique error labels

num(b) – number of balls labeled by b ∈ C ∪ E

num(B) – distribution giving the number of balls for each label b ∈ B

Page 27

Single Urn Example

num(“Atlanta”) = 2

num(C) = {2, 2, 1, 1, 1}

num(E) = {2, 1}

Estimated from data

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

Page 28

Single Urn: Computing Probabilities

If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?

Page 29

Single Urn: Computing Probabilities

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

P(x ∈ C | x appears k times in n draws) =

    Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k)
    ───────────────────────────────────────────────
    Σ_{r′ ∈ num(C ∪ E)} (r′/s)^k (1 − r′/s)^(n−k)

where s is the total number of balls in the urn
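A direct transcription of the single-urn computation (a sketch; function and argument names are mine, and the binomial coefficient C(n, k) is omitted because it appears in both numerator and denominator and cancels):

```python
def urns_prob(k, n, num_C, num_E):
    """Single-urn model: P(x in C | x appears k times in n draws
    with replacement). num_C and num_E are the multisets num(C),
    num(E) of repetition counts for target and error labels."""
    s = sum(num_C) + sum(num_E)  # total balls in the urn
    def weight(r):
        # Likelihood (up to the shared binomial coefficient) that a
        # label with r of the s balls is drawn exactly k times in n.
        return (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    error = sum(weight(r) for r in num_E)
    return target / (target + error)

# With the toy urn from the slides, num(C) = {2, 2, 1, 1, 1} and
# num(E) = {2, 1}:
p = urns_prob(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1])  # ≈ 0.71
```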

Page 30

Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E.

Then:

P(x ∈ C | x appears k times in n draws) =

    |C| (R_C/s)^k (1 − R_C/s)^(n−k)
    ──────────────────────────────────────────────────────────────
    |C| (R_C/s)^k (1 − R_C/s)^(n−k) + |E| (R_E/s)^k (1 − R_E/s)^(n−k)

Then using a Poisson Approximation:

Odds(x ∈ C) ≈ (|C| / |E|) · (R_C / R_E)^k · e^(−n(R_C − R_E)/s)

Odds increase exponentially with k, but decrease exponentially with n.

Uniform Special Case
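The claim about the odds can be checked numerically (a sketch; the parameter values below are illustrative, not figures from the talk):

```python
import math

def odds_exact(k, n, R_C, R_E, n_C, n_E, s):
    """Odds that x is in C in the uniform special case, from the
    exact binomial likelihoods (binomial coefficients cancel)."""
    def lik(R):
        return (R / s) ** k * (1 - R / s) ** (n - k)
    return (n_C * lik(R_C)) / (n_E * lik(R_E))

def odds_poisson(k, n, R_C, R_E, n_C, n_E, s):
    """Poisson approximation (|C|/|E|) (R_C/R_E)^k e^(-n(R_C-R_E)/s):
    odds grow exponentially in k and, for R_C > R_E, shrink
    exponentially in n."""
    return (n_C / n_E) * (R_C / R_E) ** k * math.exp(-n * (R_C - R_E) / s)

# Illustrative urn: 100 target labels with 20 balls each, 1000 error
# labels with 2 balls each, so s = 100*20 + 1000*2 = 4000 balls.
```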

Page 31

The URNS Model – Multiple Urns

Correlation across contexts is higher for elements of C than for elements of E.

Page 32

Unsupervised Performance

[Chart: deviation from ideal log likelihood (scale 0–5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi]

Page 33

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 34

Redundancy fails on “sparse” facts

[Chart: number of times extraction appears in pattern (0–500) vs. frequency rank of extraction (0–100,000)]

High-frequency extractions tend to be correct, e.g., (Michael Bloomberg, New York City).

Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland).

Page 35

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 36

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 37

Assessing Sparse Extractions

Task: Identify which sparse extractions are correct.

Strategy:
1. Build a model of how common extractions occur in text
2. Rank sparse extractions by fit to model

• The distributional hypothesis!

Our contribution: Unsupervised language models.
– Methods for mitigating sparsity
– Precomputed, so greatly improved scalability

Page 38

The REALM Architecture

RElation Assessment using Language Models

Input: Set of extractions for relation R

E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)}

1) Seeds S_R = s most frequent pairs in E_R
(assume these are correct)

2) Output ranking of (arg1, arg2) ∈ E_R by distributional similarity to each (seed1, seed2) in S_R
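The two steps above can be sketched as an outer loop (names are mine; `similarity` stands in for the distributional-similarity measure developed on the following slides):

```python
from collections import Counter

def realm_rank(extractions, similarity, s=5):
    """Sketch of REALM's outer loop: take the s most frequent
    (arg1, arg2) pairs in E_R as seeds, assume they are correct,
    then rank every distinct pair by mean similarity to the seeds."""
    counts = Counter(extractions)
    seeds = [pair for pair, _ in counts.most_common(s)]
    def score(pair):
        return sum(similarity(pair, seed) for seed in seeds) / len(seeds)
    return sorted(counts, key=score, reverse=True)
```

With a toy similarity that just counts shared arguments, a pair unrelated to the frequent seeds ranks last.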

Page 39

Distributional Similarity (1)

N-gram Language Model:

Estimate P(w_i | w_{i−1}, …, w_{i−k})

Number of parameters scales with (Vocab. Size)^(k+1)

w_{i−k} … w_{i−1} w_i
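The parameter growth is easy to check numerically (a small illustrative helper; the function name is mine):

```python
def ngram_params(vocab_size, k):
    """An order-k model conditions each word on its k predecessors,
    so it stores one probability per (k+1)-gram:
    vocab_size ** (k + 1) entries."""
    return vocab_size ** (k + 1)

# E.g., a bigram model (k = 1) over a 10-word vocabulary has 100
# parameters; a trigram model (k = 2) already has 1000.
```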

Page 40

Distributional Similarity (2)

Naïve Approach:

Compare context distributions:

P(w_g, …, w_j | seed1, seed2)
P(w_g, …, w_j | arg1, arg2)

But j − g can be large.
Many parameters, sparse data => inaccuracy

w_g … w_h seed1 w_{h+2} … w_i seed2 w_{i+2} … w_j
w_g … w_h arg1 w_{h+2} … w_i arg2 w_{i+2} … w_j

Page 41

The REALM Architecture

Two steps for assessing R(arg1, arg2):

• Typechecking
– e.g., AuthorOf(arg1, arg2): arg1 must be an author, arg2 a written work
– Valuable, but allows errors like: AuthorOf(Danielle Steele, Hamlet)

• Relation Assessment
– Ensure R actually holds between arg1 and arg2

Both steps use small, pre-computed language models => Scalable

Page 42

Task: For each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R.

Solution: Assume the seed_j ∈ S_R are of the proper type, and rank arg_j by distributional similarity to each seed_j.

Computing Distributional Similarity:

1) Offline, train Hidden Markov Model (HMM) of corpus

2) Measure distance between arg_j, seed_j in HMM’s N-dimensional latent state space.

Typechecking and HMM-T

Page 43

HMM Language Model

k = 1 case:

t_i → t_{i+1} → t_{i+2} → t_{i+3}
w_i    w_{i+1}   w_{i+2}   w_{i+3}
cities  such      as        Seattle

t_i ∈ {1, …, N}; w_i ranges over words

Offline Training: Learn P(w | t), P(t_i | t_{i−1}, …, t_{i−k}) to maximize probability of corpus (using EM).

Page 44

HMM-T

Trained HMM gives “distributional summary” of each w: N-dimensional state distribution P(t | w)

Typecheck each arg by comparing state distributions:

f(arg) = (1/|seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )

Rank extractions in ascending order of f(arg) summed over arguments.
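A minimal sketch of the score for one argument (function names and the toy distributions are mine; the `eps` smoothing is an implementation convenience, not part of the formula):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between discrete state distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def hmm_t_score(arg_dist, seed_dists):
    """f(arg): mean KL from each seed's P(t | seed) to P(t | arg).
    Lower scores mean the argument typechecks better."""
    return sum(kl(s, arg_dist) for s in seed_dists) / len(seed_dists)

# Hypothetical 3-state distributions: an argument whose P(t | w)
# resembles the seeds' scores much lower than one that does not.
seeds = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
city = [0.65, 0.25, 0.10]   # resembles the seeds
film = [0.05, 0.15, 0.80]   # does not
```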

Page 45

Why not use context vectors?

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . .  0  0 0   1 . . . >

Contexts include: “when he visited $X”, “he visited $X and”, “visited $X and other”, “$X and other cities”

Problems:
– Vectors are large
– Intersections are sparse

Page 46

HMM-T Advantages (1)

Miami: < . . . 71 25 1 513 . . . >

P(t | Miami): 0.14 0.01 … 0.06   (t = 1, 2, …, N)

Latent state distribution P(t | w):
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate)

Page 47

HMM-T Advantages (2)

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Context counts for “<x> , Illinois” and “<x> , Ohio”:

Chicago:       291   0  …
Pickerington:    0   1  …

=> N-grams says no; the dot product is 0!
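The zero dot product is easy to see with hypothetical sparse counts (numbers illustrative, matching the slide):

```python
# Hypothetical sparse context-count vectors, keyed by context string.
chicago = {"<x> , Illinois": 291}
pickerington = {"<x> , Ohio": 1}

def dot(u, v):
    """Sparse dot product over the contexts the two terms share."""
    return sum(count * v.get(ctx, 0) for ctx, count in u.items())

# The two vectors share no contexts, so their n-gram similarity is
# zero even though both terms are cities. An HMM that assigns
# ", Illinois" and ", Ohio" to the same latent state gives both
# terms overlapping P(t | w) distributions, recovering the match.
```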

Page 48

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

HMM-T Advantages (3)

Page 49

HMM-T Limitations

Learning time is proportional to (corpus size × T^(k+1))

T = number of latent states
k = HMM order

We use limited values T = 20, k = 3
– Sufficient for typechecking (Santa Clara is a city)
– Too coarse for relation assessment (Santa Clara is where Intel is headquartered)

Page 50

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for Formalizing the DH

5) Chez KnowItAll

Outline

Page 51

Formalizing the Distributional Hypothesis

How is this not just semi-supervised or transductive learning?
– Starts with prior P(x ∈ C), not hand-labeled examples.
– Features are counts.

Two alternative formalizations:
– Context Counts
– Distance Function

Don’t yet have an expression for P(x ∈ C | M)
– Instead: basic formalizations, preliminary results

Page 52

Context Counts

Terms  . . . 920 600 293  20   2   1 . . .
       . . .  20 110 930   3   0   1 . . .
       . . .  43  30   0   1   0   2 . . .

Contexts (columns) range from reliable (high counts) to unreliable (low counts).

As the corpus increases in size, the number of reliable contexts increases.

Page 53

Context Counts

Terms  . . . 920 600 293  20   2   1 . . .
       . . .  20 110 930   3   0   1 . . .
       . . .  43  30   0   1   0   2 . . .

Contexts (columns) range from reliable to unreliable.

Basic idea: model each reliable context as a “single urn.”

Page 54

Context Counts – Assumptions

1) Only a term’s reliable contexts are useful.
• Occur at least r times with the term.

2) Contexts conditionally independent given C.

3) Terms and contexts are Zipf distributed.

Key question: how many reliable contexts co-occur with a given term in a corpus of n total tokens?

Can be computed in closed form given the above assumptions.
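Short of the closed form, a quick Monte-Carlo sketch illustrates the quantity in question (all parameter values are illustrative, and assumption 3's Zipf shape is hard-coded; the function name is mine):

```python
import bisect
import random
from collections import Counter

def reliable_context_count(n_tokens, n_contexts, r, zipf_s=1.1, seed=0):
    """Draw a term's n_tokens co-occurrences from Zipf-distributed
    contexts and count how many contexts occur at least r times,
    i.e., are 'reliable' for that term."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** zipf_s for i in range(n_contexts)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    draws = Counter(
        min(bisect.bisect(cdf, rng.random()), n_contexts - 1)
        for _ in range(n_tokens)
    )
    return sum(1 for c in draws.values() if c >= r)
```

As the slide states, the count of reliable contexts grows with the corpus (here, with `n_tokens`).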

Page 55

Preliminary Result (1)

Assume that the Bayes Risk for a classifier using just one context is at least β. Then for a corpus of n tokens over a vocabulary V and context set Π:

[Bound not recoverable from transcript]

Page 56

Preliminary Result (2)

Provides non-trivial bounds:

Google n-grams data set (roughly):
n = 1,000,000,000,000
|V| = 15,000,000
|Π| = 1,000,000,000

Setting β = 0.45, we get E[accuracy] ≤ 0.85.

Page 57

Alternate Formalization: Distance Functions

[Chart: P(x, y same class | distance(x, y)), on a 0–1 scale, as a function of distance(x, y)]

Page 58

Distance Functions

Key Formal Problem:

Given a distance function d(x, y) and a prior P(x ∈ C), what is
P(x ∈ C | prior, d(x_i, y_j) for i, j ∈ V)?

Straightforward to compute, but:

Requires (naively) summing over the power set of V.

Page 59

Empirical Investigation

Either formalization is governed by parameters, some specific to C, others more global.

Proposed Experiments – with a variety of classes, measure empirically:

Context Counts
– Urn parameters for contexts
– Dependence between contexts

Distance Functions
– Observed distance functions, as a function of: term frequency, corpus size, class prevalence.

Page 60

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 61

Theoretical Questions:

Entrée: DH Formalisms
(Distance Functions, Context Counts, something else?)

Sides: Relationship between KH and DH; generative textual models yielding the hypotheses.

Empirical Questions:

Improving REALM’s language modeling techniques

Modeling polysemy

Language modeling accuracy vs. IE accuracy

Applying HMM-T to NER

Page 62

Context Counts Advantages:

Explicitly models counts

Leverages Urns model

Likely tractable

Distance Function Advantages

Applicable to semi-supervised learning

More “pure” instantiation of DH

Entrée: DH Formalisms

Page 63

Relationship between KH and DH

Theoretical Sides (1)

Terms  . . . 920 400 293 …   2   1 . . .
       . . . 200 170  30 …   0   1 . . .
       . . .  43  30  50 …   0   2 . . .

Contexts range from broad, e.g. “(in $X)”, relevant to the DH, to class-suggestive, e.g. “(cities such as $X)”, used by the KH.

Page 64

Theoretical Sides (2)

Is there a generative model of text that leads to the KH and DH?

E.g., if text is generated by an HMM…

Page 65

Empirical Questions (1)

Improving REALM with language modeling enhancements
– Character-level models, syntax, PCFGs, etc.

Modeling Polysemy
– P(t | Chicago) is the same for Chicago the city and Chicago the musical.
– Idea: an HMM that selectively bifurcates words into senses when this improves LM accuracy.

Page 66

Empirical Questions (2)

Language Modeling Accuracy vs. Information Extraction Accuracy
– Is it monotonic?

Applying HMM-T to Named Entity Recognition

Page 67

Thanks!