Self-supervised Probabilistic Methods for Extracting Facts from Text

Doug Downey


Page 1

Self-supervised Probabilistic Methods for Extracting Facts from Text
Doug Downey

Page 2

Q: Who did IBM acquire in 2002?

A: “IBM acquired * in 2002”

Q: Who has won a best actor Oscar for playing a villain?

A: “won best actor for playing a villain” – 0 hits!

The answer isn’t on just one Web page

Web Search: Answering Questions

Page 3

Q: Who has won a best actor Oscar for playing a villain?

A: Find all $X where the following appear:
“$X won best actor for $Y”
“$X, who played $Z in $Y”
“the villain, $Z”

“Forest Whitaker won best actor for The Last King of Scotland” – 210 hits

“Forest Whitaker, who played Idi Amin in The Last King of Scotland” – 4 hits

“the villain, Idi Amin” – 1 hit

Answer: Forest Whitaker

Solution: Synthesizing Across Pages

Page 4

Given: One or more contexts indicating a semantic class C, e.g., “$X starred in $Y” => StarredIn($X, $Y)
– User-specified (TextRunner [Banko et al., IJCAI 2007])
– Automatically generated (KnowItAll [Etzioni et al., AIJ 2005])
– Bootstrapped from resources [Snow et al., NIPS 2004]

Output: instances of C
But extraction from contexts is highly imperfect!

=> Output P(x ∈ C) for each term x

Self-supervised – no hand-tagged examples

Self-supervised Information Extraction

Page 5

Given: One or more contexts suggestive of a semantic class C, and a corpus of text

Output: P(x ∈ C) for each term x

KnowItAll Hypothesis – Terms x which occur in the suggestive contexts more frequently are more likely to be instances of C.

Distributional Hypothesis – Terms in the same class tend to appear in similar contexts.

My task: formalizing these heuristics into statements about P(x ∈ C) given a corpus.

Self-supervised Information Extraction

Page 6

Who cares about Probabilities?

Why not use rankings (e.g., the precision/recall metric)?

P( WonBestActorFor(Forest Whitaker, The Last King of Scotland) )

And P( PlayedVillainIn(Forest Whitaker, The Last King of Scotland) )

=> Our goal: an estimate of the probability that Forest Whitaker won best actor for playing a villain.

Not possible with rankings!
In fact, combining even perfect rankings can yield accuracy < .

Page 7

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 8

Term-Context Matrix

         Contexts
Terms  . . .  98   0    2   25   1  513 . . .
       . . .   2   0  930    0   0    1 . . .
       . . .   1   0   10    0   0    1 . . .

E.g., terms (rows) are potential elements of C: Miami, (Robert De Niro, Raging Bull), …

Page 9

         Contexts
Terms  . . .  98   0    2   25   1  513 . . .
       . . .   2   0  930    0   0    1 . . .
       . . .   1   0   10    0   0    1 . . .

E.g., contexts (columns): “cities such as $X”, “$X said $Y offered to”; also: parse trees, bag of words, containing Web domain, etc.

Term-Context Matrix

Page 10

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 11

Two Research Questions

M – term-context matrix
M_C – columns of M for contexts suggesting C
P(x ∈ C) – prior estimate that x ∈ C

Formalizing the KnowItAll hypothesis: What is an expression for P(x ∈ C | M_C)?

Formalizing the distributional hypothesis: What is an expression for P(x ∈ C | M)?

Page 12

Key Requirements for Models

1) Produce probabilities

2) Execute at “interactive” speed

3) No hand-tagged data

Page 13

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 14

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 15

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 16

1. Modeling Redundancy – The Problem

Consider a single context, e.g.: “cities such as x”

If an extraction x appears k times in a set of n sentences containing this pattern, what is the probability that x ∈ C?

Page 17

Modeling with k

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

Country(x)

extractions, n = 10

Page 18

Modeling with k

Country(x) extractions, n = 10

                 k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

Noisy-Or Model:

P(x ∈ C | x appears k times) = 1 − (1 − p)^k

where p is the probability that a single sentence is true, i.e. p = 0.9.

Important (but noisy-or ignores these):
– Sample size (n)
– Distribution of C
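The noisy-or computation above is simple enough to sketch in a couple of lines (a minimal illustration; the function name is mine, with p = 0.9 as on the slide):

```python
def noisy_or(k, p=0.9):
    """P(x in C | x appears k times): x is in C unless all k
    supporting sentences are wrong, where each sentence is true
    independently with probability p."""
    return 1.0 - (1.0 - p) ** k

# Matches the table above: k = 2 gives 0.99, k = 1 gives 0.9, and
# the estimate only grows with k -- regardless of sample size n.
```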

Page 19

Needed in Model: Sample Size

Country(x) extractions, n ≈ 50,000

                     k     P_noisy-or
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.9
OilWatch Africa       1    0.9
Religion Paraguay     1    0.9
Chicken Mole          1    0.9
Republics of Kenya    1    0.9
Atlantic Ocean        1    0.9
New Zeland            1    0.9

Country(x) extractions, n = 10

                 k   P_noisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

As sample size increases, noisy-or becomes inaccurate.

Page 20

Needed in Model: Distribution of C

[Formula garbled in transcript: a frequency-adjusted noisy-or, P_freq(x ∈ C | x appears k times), in terms of p, k, and n]

Country(x) extractions, n ≈ 50,000

                     k     P_noisy-or
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.9
OilWatch Africa       1    0.9
Religion Paraguay     1    0.9
Chicken Mole          1    0.9
Republics of Kenya    1    0.9
Atlantic Ocean        1    0.9
New Zeland            1    0.9

Page 21

Needed in Model: Distribution of C

[Formula garbled in transcript: the frequency-adjusted P_freq(x ∈ C | x appears k times), in terms of p, k, and n]

Country(x) extractions, n ≈ 50,000

                     k     P_freq
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.05
OilWatch Africa       1    0.05
Religion Paraguay     1    0.05
Chicken Mole          1    0.05
Republics of Kenya    1    0.05
Atlantic Ocean        1    0.05
New Zeland            1    0.05

Page 22

Needed in Model: Distribution of C

City(x) extractions, n ≈ 50,000

                  k    P_freq
Toronto         274    0.9999…
Belgrade         81    0.98
Lacombe           1    0.05
Kent County       1    0.05
Nikki             1    0.05
Ragaz             1    0.05
Villegas          1    0.05
Cres              1    0.05
Northeastwards    1    0.05

Country(x) extractions, n ≈ 50,000

                     k     P_freq
Japan              1723    0.9999…
Norway              295    0.9999…
Israil                1    0.05
OilWatch Africa       1    0.05
Religion Paraguay     1    0.05
Chicken Mole          1    0.05
Republics of Kenya    1    0.05
Atlantic Ocean        1    0.05
New Zeland            1    0.05

Probability that x ∈ C depends on the distribution of C.

Page 23

The URNS Model – Single Urn

Page 24

The URNS Model – Single Urn

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

Page 25

Tokyo

The URNS Model – Single Urn

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

…cities such as Tokyo…

Page 26

Single Urn – Formal Definition

C – set of unique target labels

E – set of unique error labels

num(b) – number of balls labeled by b ∈ C ∪ E

num(B) – distribution giving the number of balls for each label b ∈ B

Page 27

Single Urn Example

num(“Atlanta”) = 2

num(C) = {2, 2, 1, 1, 1}

num(E) = {2, 1}

Estimated from data

U.K.

Sydney

Urn for City(x)

Cairo

Tokyo

Tokyo

Atlanta

Atlanta

Yakima

Utah

U.K.

Page 28

Single Urn: Computing Probabilities

If an extraction x appears k times in a set of n sentences containing a pattern, what is the probability that x ∈ C?

Page 29

Single Urn: Computing Probabilities

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

P(x ∈ C | x appears k times in n draws) =

    Σ_{r ∈ num(C)} (r/s)^k (1 − r/s)^(n−k)
    ───────────────────────────────────────────────
    Σ_{r′ ∈ num(C ∪ E)} (r′/s)^k (1 − r′/s)^(n−k)

where s is the total number of balls in the urn
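A direct transcription of the single-urn computation (a sketch; function and argument names are mine, and the binomial coefficient C(n, k) is omitted because it appears in both numerator and denominator and cancels):

```python
def urns_prob(k, n, num_C, num_E):
    """Single-urn model: P(x in C | x appears k times in n draws
    with replacement). num_C and num_E are the multisets num(C),
    num(E) of repetition counts for target and error labels."""
    s = sum(num_C) + sum(num_E)  # total balls in the urn
    def weight(r):
        # Likelihood (up to the shared binomial coefficient) that a
        # label with r of the s balls is drawn exactly k times in n.
        return (r / s) ** k * (1 - r / s) ** (n - k)
    target = sum(weight(r) for r in num_C)
    error = sum(weight(r) for r in num_E)
    return target / (target + error)

# With the toy urn from the slides, num(C) = {2, 2, 1, 1, 1} and
# num(E) = {2, 1}:
p = urns_prob(k=2, n=10, num_C=[2, 2, 1, 1, 1], num_E=[2, 1])  # ≈ 0.71
```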

Page 30

Consider the case where num(c_i) = R_C and num(e_j) = R_E for all c_i ∈ C, e_j ∈ E.

Then:

P(x ∈ C | x appears k times in n draws) =

    |C| (R_C/s)^k (1 − R_C/s)^(n−k)
    ──────────────────────────────────────────────────────────────
    |C| (R_C/s)^k (1 − R_C/s)^(n−k) + |E| (R_E/s)^k (1 − R_E/s)^(n−k)

Then using a Poisson Approximation:

Odds(x ∈ C) ≈ (|C| / |E|) · (R_C / R_E)^k · e^(−n(R_C − R_E)/s)

Odds increase exponentially with k, but decrease exponentially with n.

Uniform Special Case
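The claim about the odds can be checked numerically (a sketch; the parameter values below are illustrative, not figures from the talk):

```python
import math

def odds_exact(k, n, R_C, R_E, n_C, n_E, s):
    """Odds that x is in C in the uniform special case, from the
    exact binomial likelihoods (binomial coefficients cancel)."""
    def lik(R):
        return (R / s) ** k * (1 - R / s) ** (n - k)
    return (n_C * lik(R_C)) / (n_E * lik(R_E))

def odds_poisson(k, n, R_C, R_E, n_C, n_E, s):
    """Poisson approximation (|C|/|E|) (R_C/R_E)^k e^(-n(R_C-R_E)/s):
    odds grow exponentially in k and, for R_C > R_E, shrink
    exponentially in n."""
    return (n_C / n_E) * (R_C / R_E) ** k * math.exp(-n * (R_C - R_E) / s)

# Illustrative urn: 100 target labels with 20 balls each, 1000 error
# labels with 2 balls each, so s = 100*20 + 1000*2 = 4000 balls.
```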

Page 31

The URNS Model – Multiple Urns

Correlation across contexts is higher for elements of C than for elements of E.

Page 32

Unsupervised Performance

[Chart: deviation from ideal log likelihood (scale 0–5) for the classes City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi]

Page 33

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 34

Redundancy fails on “sparse” facts

[Chart: number of times extraction appears in pattern (0–500) vs. frequency rank of extraction (0–100,000)]

High-frequency extractions tend to be correct, e.g., (Michael Bloomberg, New York City).

Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland).

Page 35

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 36

Miami      . . .  98    0   20  250  30  513 . . .
Twisp      . . .   5    0    1    2   1    1 . . .
Star Wars  . . .   1 1000    0    2   1    1 . . .

Contexts (columns) include: “$X soundtrack”, “he visited $X and”, “cities such as $X”, “$X and other cities”, “$X lodging”

KnowItAll Hypothesis

Distributional Hypothesis

Page 37

Assessing Sparse Extractions

Task: Identify which sparse extractions are correct.

Strategy:
1. Build a model of how common extractions occur in text
2. Rank sparse extractions by fit to model

• The distributional hypothesis!

Our contribution: Unsupervised language models.
– Methods for mitigating sparsity
– Precomputed, so greatly improved scalability

Page 38

The REALM Architecture

RElation Assessment using Language Models

Input: Set of extractions for relation R

E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)}

1) Seeds S_R = s most frequent pairs in E_R
(assume these are correct)

2) Output ranking of (arg1, arg2) ∈ E_R by distributional similarity to each (seed1, seed2) in S_R
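The two steps above can be sketched as an outer loop (names are mine; `similarity` stands in for the distributional-similarity measure developed on the following slides):

```python
from collections import Counter

def realm_rank(extractions, similarity, s=5):
    """Sketch of REALM's outer loop: take the s most frequent
    (arg1, arg2) pairs in E_R as seeds, assume they are correct,
    then rank every distinct pair by mean similarity to the seeds."""
    counts = Counter(extractions)
    seeds = [pair for pair, _ in counts.most_common(s)]
    def score(pair):
        return sum(similarity(pair, seed) for seed in seeds) / len(seeds)
    return sorted(counts, key=score, reverse=True)
```

With a toy similarity that just counts shared arguments, a pair unrelated to the frequent seeds ranks last.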

Page 39

Distributional Similarity (1)

N-gram Language Model:

Estimate P(w_i | w_{i−1}, …, w_{i−k})

Number of parameters scales with (Vocab. Size)^(k+1)

w_{i−k} … w_{i−1} w_i
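The parameter growth is easy to check numerically (a small illustrative helper; the function name is mine):

```python
def ngram_params(vocab_size, k):
    """An order-k model conditions each word on its k predecessors,
    so it stores one probability per (k+1)-gram:
    vocab_size ** (k + 1) entries."""
    return vocab_size ** (k + 1)

# E.g., a bigram model (k = 1) over a 10-word vocabulary has 100
# parameters; a trigram model (k = 2) already has 1000.
```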

Page 40

Distributional Similarity (2)

Naïve Approach:

Compare context distributions:

P(w_g, …, w_j | seed1, seed2)
P(w_g, …, w_j | arg1, arg2)

But j − g can be large.
Many parameters, sparse data => inaccuracy

w_g … w_h seed1 w_{h+2} … w_i seed2 w_{i+2} … w_j
w_g … w_h arg1 w_{h+2} … w_i arg2 w_{i+2} … w_j

Page 41

The REALM Architecture

Two steps for assessing R(arg1, arg2):

• Typechecking
– e.g., AuthorOf(arg1, arg2): arg1 must be an author, arg2 a written work
– Valuable, but allows errors like: AuthorOf(Danielle Steele, Hamlet)

• Relation Assessment
– Ensure R actually holds between arg1 and arg2

Both steps use small, pre-computed language models => Scalable

Page 42

Task: For each extraction (arg1, arg2) ∈ E_R, determine if arg1 and arg2 are the proper type for R.

Solution: Assume the seed_j ∈ S_R are of the proper type, and rank arg_j by distributional similarity to each seed_j.

Computing Distributional Similarity:

1) Offline, train Hidden Markov Model (HMM) of corpus

2) Measure distance between arg_j, seed_j in HMM’s N-dimensional latent state space.

Typechecking and HMM-T

Page 43

HMM Language Model

k = 1 case:

t_i → t_{i+1} → t_{i+2} → t_{i+3}
w_i    w_{i+1}   w_{i+2}   w_{i+3}
cities  such      as        Seattle

t_i ∈ {1, …, N}; w_i ranges over words

Offline Training: Learn P(w | t), P(t_i | t_{i−1}, …, t_{i−k}) to maximize probability of corpus (using EM).

Page 44

HMM-T

Trained HMM gives “distributional summary” of each w: N-dimensional state distribution P(t | w)

Typecheck each arg by comparing state distributions:

f(arg) = (1/|seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )

Rank extractions in ascending order of f(arg) summed over arguments.
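A minimal sketch of the score for one argument (function names and the toy distributions are mine; the `eps` smoothing is an implementation convenience, not part of the formula):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between discrete state distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def hmm_t_score(arg_dist, seed_dists):
    """f(arg): mean KL from each seed's P(t | seed) to P(t | arg).
    Lower scores mean the argument typechecks better."""
    return sum(kl(s, arg_dist) for s in seed_dists) / len(seed_dists)

# Hypothetical 3-state distributions: an argument whose P(t | w)
# resembles the seeds' scores much lower than one that does not.
seeds = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
city = [0.65, 0.25, 0.10]   # resembles the seeds
film = [0.05, 0.15, 0.80]   # does not
```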

Page 45

Why not use context vectors?

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . .  0  0 0   1 . . . >

Contexts include: “when he visited $X”, “he visited $X and”, “visited $X and other”, “$X and other cities”

Problems:
– Vectors are large
– Intersections are sparse

Page 46

HMM-T Advantages (1)

Miami: < . . . 71 25 1 513 . . . >

P(t | Miami): 0.14 0.01 … 0.06   (t = 1, 2, …, N)

Latent state distribution P(t | w):
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate)

Page 47

HMM-T Advantages (2)

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Context counts for “<x> , Illinois” and “<x> , Ohio”:

Chicago:       291   0  …
Pickerington:    0   1  …

=> N-grams says no; the dot product is 0!
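The zero dot product is easy to see with hypothetical sparse counts (numbers illustrative, matching the slide):

```python
# Hypothetical sparse context-count vectors, keyed by context string.
chicago = {"<x> , Illinois": 291}
pickerington = {"<x> , Ohio": 1}

def dot(u, v):
    """Sparse dot product over the contexts the two terms share."""
    return sum(count * v.get(ctx, 0) for ctx, count in u.items())

# The two vectors share no contexts, so their n-gram similarity is
# zero even though both terms are cities. An HMM that assigns
# ", Illinois" and ", Ohio" to the same latent state gives both
# terms overlapping P(t | w) distributions, recovering the match.
```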

Page 48

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

HMM-T Advantages (3)

Page 49

HMM-T Limitations

Learning time is proportional to (corpus size × T^(k+1))

T = number of latent states
k = HMM order

We use limited values T = 20, k = 3
– Sufficient for typechecking (Santa Clara is a city)
– Too coarse for relation assessment (Santa Clara is where Intel is headquartered)

Page 50

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for Formalizing the DH

5) Chez KnowItAll

Outline

Page 51

Formalizing the Distributional Hypothesis

How is this not just semi-supervised or transductive learning?
– Starts with prior P(x ∈ C), not hand-labeled examples.
– Features are counts.

Two alternative formalizations:
– Context Counts
– Distance Function

Don’t yet have an expression for P(x ∈ C | M)
– Instead: basic formalizations, preliminary results

Page 52

Context Counts

Terms  . . . 920 600 293  20   2   1 . . .
       . . .  20 110 930   3   0   1 . . .
       . . .  43  30   0   1   0   2 . . .

Contexts (columns) range from reliable (high counts) to unreliable (low counts).

As the corpus increases in size, the number of reliable contexts increases.

Page 53

Context Counts

Terms  . . . 920 600 293  20   2   1 . . .
       . . .  20 110 930   3   0   1 . . .
       . . .  43  30   0   1   0   2 . . .

Contexts (columns) range from reliable to unreliable.

Basic idea: model each reliable context as a “single urn.”

Page 54

Context Counts – Assumptions

1) Only a term’s reliable contexts are useful.
• Occur at least r times with the term.

2) Contexts conditionally independent given C.

3) Terms and contexts are Zipf distributed.

Key question: how many reliable contexts co-occur with a given term in a corpus of n total tokens?

Can be computed in closed form given the above assumptions.
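Short of the closed form, a quick Monte-Carlo sketch illustrates the quantity in question (all parameter values are illustrative, and assumption 3's Zipf shape is hard-coded; the function name is mine):

```python
import bisect
import random
from collections import Counter

def reliable_context_count(n_tokens, n_contexts, r, zipf_s=1.1, seed=0):
    """Draw a term's n_tokens co-occurrences from Zipf-distributed
    contexts and count how many contexts occur at least r times,
    i.e., are 'reliable' for that term."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** zipf_s for i in range(n_contexts)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    draws = Counter(
        min(bisect.bisect(cdf, rng.random()), n_contexts - 1)
        for _ in range(n_tokens)
    )
    return sum(1 for c in draws.values() if c >= r)
```

As the slide states, the count of reliable contexts grows with the corpus (here, with `n_tokens`).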

Page 55

Preliminary Result (1)

Assume that the Bayes Risk for a classifier using just one context is at least β. Then for a corpus of n tokens over a vocabulary V and context set Π:

[Bound not recoverable from transcript]

Page 56

Preliminary Result (2)

Provides non-trivial bounds:

Google n-grams data set (roughly):
n = 1,000,000,000,000
|V| = 15,000,000
|Π| = 1,000,000,000

Setting β = 0.45, we get E[accuracy] ≤ 0.85.

Page 57

Alternate Formalization: Distance Functions

[Chart: P(x, y same class | distance(x, y)), on a 0–1 scale, as a function of distance(x, y)]

Page 58

Distance Functions

Key Formal Problem:

Given a distance function d(x, y) and a prior P(x ∈ C), what is
P(x ∈ C | prior, d(x_i, y_j) for i, j ∈ V)?

Straightforward to compute, but:

Requires (naively) summing over the power set of V.

Page 59

Empirical Investigation

Either formalization is governed by parameters, some specific to C, others more global.

Proposed Experiments – with a variety of classes, measure empirically:

Context Counts
– Urn parameters for contexts
– Dependence between contexts

Distance Functions
– Observed distance functions, as a function of: term frequency, corpus size, class prevalence.

Page 60

1) Two Research Questions

2) URNS model
3) REALM

4) Proposal for DH

5) Chez KnowItAll

Outline

Page 61

Theoretical Questions:

Entrée: DH Formalisms
(Distance Functions, Context Counts, something else?)

Sides: Relationship between KH and DH; generative textual models yielding the hypotheses.

Empirical Questions:

Improving REALM’s language modeling techniques

Modeling polysemy

Language modeling accuracy vs. IE accuracy

Applying HMM-T to NER

Page 62

Context Counts Advantages:

Explicitly models counts

Leverages Urns model

Likely tractable

Distance Function Advantages

Applicable to semi-supervised learning

More “pure” instantiation of DH

Entrée: DH Formalisms

Page 63

Relationship between KH and DH

Theoretical Sides (1)

Terms  . . . 920 400 293 …   2   1 . . .
       . . . 200 170  30 …   0   1 . . .
       . . .  43  30  50 …   0   2 . . .

Contexts range from broad, e.g. “(in $X)”, relevant to the DH, to class-suggestive, e.g. “(cities such as $X)”, used by the KH.

Page 64

Theoretical Sides (2)

Is there a generative model of text that leads to the KH and DH?

E.g., if text is generated by an HMM…

Page 65

Empirical Questions (1)

Improving REALM with language modeling enhancements
– Character-level models, syntax, PCFGs, etc.

Modeling Polysemy
– P(t | Chicago) is the same for Chicago the city and Chicago the musical.
– Idea: an HMM that selectively bifurcates words into senses when this improves LM accuracy.

Page 66

Empirical Questions (2)

Language Modeling Accuracy vs. Information Extraction Accuracy
– Is it monotonic?

Applying HMM-T to Named Entity Recognition

Page 67

Thanks!