
Toward Unified Models of Information Extraction and Data Mining Andrew McCallum Information Extraction and Synthesis Laboratory Computer Science Department University of Massachusetts Amherst Joint work with Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner


TRANSCRIPT

Page 1:

Toward Unified Models of Information Extraction and Data Mining

Andrew McCallum

Information Extraction and Synthesis Laboratory

Computer Science Department

University of Massachusetts Amherst

Joint work with

Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner

Page 2:

Goal:

Improving our ability to mine actionable knowledge from unstructured text.

Page 3:

Larger Context

Spider → Document collection → IE (Segment, Classify, Associate, Cluster; Filter) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Page 4:

Problem:

Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.

1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.

2) IE is unaware of emerging patterns and regularities in the DB.

The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge

Page 5:

Solution:

Spider → Document collection → IE (Segment, Classify, Associate, Cluster; Filter) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

With feedback links: IE passes Uncertainty Info forward to Data Mining, and Data Mining passes Emerging Patterns back to IE.

Page 6:

Solution: a Unified Model.

Spider → Document collection → IE (Segment, Classify, Associate, Cluster; Filter) → Probabilistic Model → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Discriminatively-trained undirected graphical models:
- Conditional Random Fields [Lafferty, McCallum, Pereira]
- Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Complex inference and learning: just what we researchers like to sink our teeth into!

Page 7:

Outline

• The need for unified IE and DM.

• Review of Conditional Random Fields for IE.

• Preliminary steps toward unification:

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Segmentation and Co-ref (Iterated Conditional Sampling)

• Conclusions

Page 8:

Hidden Markov Models

(Finite state model / graphical model: states S_{t-1}, S_t, S_{t+1}, … emitting observations O_{t-1}, O_t, O_{t+1}, …)

Parameters, for all states S = {s_1, s_2, …}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t)

Training: maximize probability of training observations (with prior).

P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

Generates: a state sequence (transitions) and an observation sequence o_1 o_2 … o_8 (observations). Emissions are usually a multinomial over an atomic, fixed alphabet.
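The generative story above can be sketched in a few lines of Python; the states, observations, and all probabilities below are invented purely for illustration:

```python
# Minimal sketch of the HMM joint probability: P(s, o) is a product of
# transition and emission probabilities. All numbers are invented.

start = {"Title": 0.6, "Author": 0.4}                              # P(s_1)
trans = {("Title", "Title"): 0.3, ("Title", "Author"): 0.7,
         ("Author", "Author"): 0.5, ("Author", "Title"): 0.5}      # P(s_t | s_{t-1})
emit = {("Title", "unified"): 0.8, ("Title", "toward"): 0.2,
        ("Author", "mccallum"): 0.9, ("Author", "unified"): 0.1}   # P(o_t | s_t)

def joint_prob(states, obs):
    """P(s, o) = P(s_1) P(o_1|s_1) * prod_t P(s_t|s_{t-1}) P(o_t|s_t)."""
    p = start[states[0]] * emit[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
    return p

p = joint_prob(["Title", "Author"], ["unified", "mccallum"])
# 0.6 * 0.8 * 0.7 * 0.9 = 0.3024
```

Training an HMM amounts to setting these tables to maximize the probability of the observed training sequences.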

Page 9:

From HMMs to Conditional Random Fields  [Lafferty, McCallum, Pereira 2001]

\vec{s} = s_1, s_2, \ldots, s_n \qquad \vec{o} = o_1, o_2, \ldots, o_n

Joint:

P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

Conditional:

P(\vec{s} \mid \vec{o}) = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)
                        = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t)

where \Phi_o(o_t, s_t) = \exp\left( \sum_k \lambda_k f_k(s_t, o_t) \right)

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using an optimization method on L.
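A brute-force sketch of the conditional form above: potentials are exponentiated weighted feature sums, and Z(o) sums over every state sequence. The feature names and weights are hypothetical, and a real CRF computes Z(o) by dynamic programming rather than enumeration:

```python
import math
from itertools import product

# Tiny linear-chain CRF: P(s|o) = score(s,o) / Z(o), with
# score = exp(sum of weighted features). Weights are invented.

STATES = ["B", "I"]

def feats(prev_s, s, o):
    """Binary features f_k on (state, observation) and (state, state)."""
    return {f"obs:{s}:{o}": 1.0, f"trans:{prev_s}:{s}": 1.0}

weights = {"obs:B:john": 2.0, "obs:I:smith": 1.5, "trans:B:I": 1.0}

def score(states, obs):
    """Unnormalized exp of the summed weighted features over the chain."""
    total, prev = 0.0, "START"
    for s, o in zip(states, obs):
        for k, v in feats(prev, s, o).items():
            total += weights.get(k, 0.0) * v
        prev = s
    return math.exp(total)

def conditional(states, obs):
    z = sum(score(cand, obs) for cand in product(STATES, repeat=len(obs)))
    return score(states, obs) / z

p = conditional(["B", "I"], ["john", "smith"])
# the labeling rewarded by all three features gets most of the mass
```

Because the model is conditional, the features f_k may look at arbitrary, overlapping properties of the whole observation sequence without any independence assumptions among them.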

Page 10:

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
     :            :        Production of Milk and Milkfat 2/
     :   Number   :-------------------------------------------------------
Year :     of     :  Per Milk Cow   :  Percentage   :        Total
     :Milk Cows 1/:-----------------: of Fat in All :------------------
     :            :  Milk : Milkfat : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
     : 1,000 Head   --- Pounds ---      Percent       Million Pounds
1993 :   9,589     15,704     575        3.66        150,582   5,514.4
1994 :   9,500     16,175     592        3.66        153,664   5,623.7
1995 :   9,461     16,451     602        3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Page 11:

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was

slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers

as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the

remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --- Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.

CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• … (12 in all)

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• …
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.

100+ documents from www.fedstats.gov
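The layout features listed above are cheap to compute per line; a sketch (the function and feature names are illustrative, not taken from the paper):

```python
def line_features(line, prev_line=""):
    """Layout features of the kind listed above (names are illustrative)."""
    n = max(len(line), 1)
    digits = sum(c.isdigit() for c in line)
    alphas = sum(c.isalpha() for c in line)
    f = {
        "pct_digit": digits / n,          # percentage of digit chars
        "pct_alpha": alphas / n,          # percentage of alpha chars
        "indented": line.startswith(" "),
        "five_spaces": "     " in line,   # 5+ consecutive spaces
    }
    # crude check: does whitespace in this line align with the previous?
    f["ws_aligns_prev"] = any(a == b == " " for a, b in zip(line, prev_line))
    return f

row = "1995 :      9,461      16,451        602"
f = line_features(row)
# a data row like this one scores high on pct_digit and five_spaces
```

Conjunctions of such features at nearby time offsets then give the CRF a view of local table structure.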

Page 12:

Table Extraction Experimental Results

                         Line labels,       Table segments,
                         percent correct    F1
HMM                      65 %               64 %
Stateless MaxEnt         85 %               -
CRF w/out conjunctions   52 %               68 %
CRF                      95 %               92 %

Error reduction (CRF vs. HMM): 85 % on line labels, 77 % on segment F1.

[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Page 13:

IE from Research Papers [McCallum et al ‘99]

Page 14:

IE from Research Papers

Field-level F1:

Hidden Markov Models (HMMs): 75.6  [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs): 89.7  [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs): 93.9  [Peng, McCallum, 2004]

(40% error reduction, CRFs vs. SVMs)
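The error-reduction figures quoted throughout these slides compare error rates e = 1 − score before and after an improvement; a quick check of the arithmetic:

```python
# Relative error reduction on error rates e = 1 - score:

def error_reduction(before, after):
    """Fractional reduction in error when a score (F1, accuracy) improves."""
    return ((1 - before) - (1 - after)) / (1 - before)

r = error_reduction(0.897, 0.939)   # SVMs 89.7 -> CRFs 93.9, field-level F1
# r is about 0.41, i.e. the ~40% error reduction quoted above
```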

Page 15:

Main Point #2

Conditional Random Fields were more accurate in practice than a generative model

... on a research paper extraction task,

... and others, including:
- a table extraction task
- noun phrase segmentation
- named entity extraction
- …

Page 16:

Outline

• The need for unified IE and DM.

• Review of Conditional Random Fields for IE.

• Preliminary steps toward unification:

1. Joint Labeling of Cascaded Sequences (Belief Propagation) [Charles Sutton]

2. Joint Co-reference Resolution (Graph Partitioning) [Aron Culotta]

3. Joint Labeling for Semi-Supervision (Graph Partitioning) [Wei Li]

4. Joint Segmentation and Co-ref (Iterated Conditional Sampling) [Andrew McCallum]

Page 17:

1. Jointly labeling cascaded sequences: Factorial CRFs

Part-of-speech

Noun-phrase boundaries

Named-entity tag

English words

[Sutton, Rohanimanesh, McCallum, ICML 2004]

Joint prediction of part-of-speech and noun-phrase boundaries in newswire: equivalent accuracy with only 50% of the training data.

Inference: Tree reparameterization

[Wainwright et al, 2002]

Page 18:

1b. Jointly labeling distant mentions: Skip-chain CRFs

Mr. Ted Green said today … … Mary saw Green at …

[Sutton, McCallum, 2004]

14% reduction in error on most repeated field in email seminar announcements.

Inference: Tree reparameterization

[Wainwright et al, 2002]

Page 19:

2. Joint co-reference among all pairs: Affinity Matrix CRF

. . . Mr Powell . . .    . . . Powell . . .    . . . she . . .

(Pairwise Y/N co-reference decisions among the three mentions, with affinity scores 45, 99, and 11.)

[McCallum, Wellner, IJCAI WS 2003]

25% reduction in error on co-reference of proper nouns in newswire.

Inference: Correlational clustering graph partitioning

[Bansal, Blum, Chawla, 2002]
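A greedy sketch of the partitioning step: repeatedly merge the two clusters with the largest positive summed cross-affinity. This is a simplification of correlational clustering, and the mention names and signed scores below are illustrative, not the slide's actual edge weights:

```python
# Greedy agglomerative partitioning of an affinity graph: merge while
# some pair of clusters has positive total cross-affinity. Scores are
# invented; negative scores discourage merging.

scores = {("Mr Powell", "Powell"): 99,
          ("Mr Powell", "she"): 11,
          ("Powell", "she"): -45}

def cluster(mentions, scores):
    clusters = [{m} for m in mentions]
    while True:
        best, pair = 0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sum(scores.get((a, b), scores.get((b, a), 0))
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:          # no positive merge left: stop
            return clusters
        i, j = pair
        clusters[i] |= clusters.pop(j)

out = cluster(["Mr Powell", "Powell", "she"], scores)
# "Mr Powell" and "Powell" merge; "she" stays apart because the summed
# cross-affinity 11 + (-45) is negative
```

Making all pairwise decisions jointly through the partition avoids the inconsistencies of independent pairwise Y/N classification.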

Page 20:

3. Joint Labeling for Semi-Supervision: Affinity Matrix CRF with prototypes

(Pairwise Y/N decisions with affinity scores 45, 99, and 11.)

[Li, McCallum, 2003]

50% reduction in error on document classification with labeled and unlabeled data.

Inference: Correlational clustering graph partitioning

[Bansal, Blum, Chawla, 2002]

(Prototype label nodes y1, y2 connected to instance nodes x1, x2, x3.)

Page 21:

4. Joint segmentation and co-reference

(Model: observed citations o, segmentations s, citation attributes c, pairwise co-reference decisions y, and database field values p.)

Inference: variant of Iterated Conditional Modes.  [Wellner, McCallum, Peng, Hay, UAI 2004]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

[Besag, 1986]

World Knowledge

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.

Extraction from and matching of research paper citations.

see also [Marthi, Milch, Russell, 2003]

Page 22:

To Charles

Page 23:

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .

Page 24:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .

• Segment citation fields

Citation Segmentation and Coreference

Page 25:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

• Segment citation fields

• Resolve coreferent papers

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .

Citation Segmentation and Coreference

Y/N

Page 26:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .

?

Segmentation Quality     Citation Co-reference (F1)
No Segmentation          .787
CRF Segmentation         .913
True Segmentation        .932

Incorrect Segmentation Hurts Coreference

Page 27:

Laurel, B. Interface Agents: Metaphors with Character , in

The Art of Human-Computer Interface Design , B. Laurel (ed) ,

Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in

Laurel , The Art of Human-Computer Interface Design , 355-366 ,

1990 .

?

Incorrect Segmentation Hurts Coreference

Solution: Perform segmentation and coreference jointly.

Use segmentation uncertainty to improve coreference

and use coreference to improve segmentation.

Page 28:

Segmentation + Coreference Model

(Observed citation o; CRF segmentation s.)

Page 29:

Segmentation + Coreference Model

(Observed citation o; CRF segmentation s; citation attributes c.)

Page 30:

Segmentation + Coreference Model

(Three citations, each with observed citation o, CRF segmentation s, and citation attributes c.)

Page 31:

Segmentation + Coreference Model

(Three citations, each with observed citation o, CRF segmentation s, and citation attributes c; pairwise co-reference variables y connect the citations.)

Page 32:

Such a highly connected graph makes exact inference intractable, so…

Page 33:

Approximate Inference 1

• Loopy Belief Propagation: messages m_i(v_j) passed between nodes.

(Example graph: nodes v1…v6; messages m1(v2), m2(v3), m3(v2), m2(v1).)

Page 34:

Approximate Inference 1

• Loopy Belief Propagation: messages passed between nodes.

• Generalized Belief Propagation: messages passed between regions (e.g. nodes v1…v9 grouped into regions).

Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
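The node-to-node message passing can be sketched with sum-product updates on a toy pairwise model; all potentials below are invented, updates are synchronous, and loopy graphs carry no convergence guarantee:

```python
import math

# Toy loopy belief propagation: directed messages m[(i, j)][x_j]
# between nodes of a 3-cycle, repeatedly recomputed from the others.

nodes = ["v1", "v2", "v3"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1")]   # a loop
domain = [0, 1]

def unary(v, x):
    return 2.0 if (v == "v1" and x == 1) else 1.0    # evidence at v1

def pairwise(xa, xb):
    return 2.0 if xa == xb else 1.0                  # neighbors prefer agreement

def neighbors(v):
    return [b for a, b in edges if a == v] + [a for a, b in edges if b == v]

m = {(i, j): {x: 1.0 for x in domain} for i in nodes for j in neighbors(i)}

for _ in range(20):
    new = {}
    for (i, j) in m:
        raw = {xj: sum(unary(i, xi) * pairwise(xi, xj)
                       * math.prod(m[(k, i)][xi]
                                   for k in neighbors(i) if k != j)
                       for xi in domain)
               for xj in domain}
        z = sum(raw.values())
        new[(i, j)] = {x: v / z for x, v in raw.items()}
    m = new

def belief(v):
    b = {x: unary(v, x) * math.prod(m[(k, v)][x] for k in neighbors(v))
         for x in domain}
    z = sum(b.values())
    return {x: val / z for x, val in b.items()}
# the evidence at v1 propagates around the loop: every belief leans to 1
```

In the generalized variant, the same update runs between regions of nodes, which is exactly where the exponential message-size problem appears.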

Page 35:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]:

v_6^{(i+1)} = \arg\max_{v_6} P\big(v_6 \mid v^{(i)} \setminus v_6\big)

(all other variables held constant)

Page 36:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]:

v_5^{(j+1)} = \arg\max_{v_5} P\big(v_5 \mid v^{(j)} \setminus v_5\big)

(all other variables held constant)

Page 37:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]:

v_4^{(k+1)} = \arg\max_{v_4} P\big(v_4 \mid v^{(k)} \setminus v_4\big)

(all other variables held constant)

But greedy, and easily falls into local minima.
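The ICM loop above can be sketched in a few lines; the variables, domain, and scoring function below are invented for illustration:

```python
# Sketch of Iterated Conditional Modes: visit each variable in turn and
# set it to the value maximizing its conditional given all the others,
# until a full sweep changes nothing. All potentials are invented.

def icm(variables, domain, local_score, max_sweeps=10):
    """local_score(v, x, assign): score of v = x given the others."""
    assign = {v: domain[0] for v in variables}
    for _ in range(max_sweeps):
        changed = False
        for v in variables:
            best = max(domain, key=lambda x: local_score(v, x, assign))
            if best != assign[v]:
                assign[v], changed = best, True
        if not changed:
            break
    return assign

# toy chain v1 - v2 - v3: neighbors prefer to agree, strong evidence
# that v1 = 1, and a weak tie-breaking preference for 1 everywhere
edges = [("v1", "v2"), ("v2", "v3")]

def local_score(v, x, assign):
    s = 0.1 if x == 1 else 0.0              # weak unary preference
    if v == "v1" and x == 1:
        s += 2.0                            # strong evidence at v1
    for a, b in edges:
        if v == a and x == assign[b]:
            s += 1.0                        # agreement with neighbor
        if v == b and x == assign[a]:
            s += 1.0
    return s

result = icm(["v1", "v2", "v3"], [0, 1], local_score)
# the evidence at v1 propagates down the chain: all variables end up 1
```

Each update is a hard argmax, which is exactly why the method is greedy and sensitive to local minima.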

Page 38:

Approximate Inference 2

• Iterated Conditional Modes (ICM) [Besag 1986]:

v_4^{(k+1)} = \arg\max_{v_4} P\big(v_4 \mid v^{(k)} \setminus v_4\big), all other variables held constant.

• Iterated Conditional Sampling (ICS) (our proposal; related work?): instead of passing only the argmax, pass a sample of high-scoring values of P(v_4 \mid v \setminus v_4), i.e. an N-best list (the top N values).

Can use a “generalized” version of this, doing exact inference on a region of several nodes at once. Here, a “message” grows only linearly with region size and N!
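The N-best idea can be sketched as exact enumeration inside one region; the region and scoring function below are hypothetical:

```python
# Sketch of the N-best "message" behind Iterated Conditional Sampling:
# enumerate one region's joint configurations exactly and keep the top
# N instead of only the argmax.
from itertools import product

def n_best(region_vars, domain, score, n=3):
    configs = [dict(zip(region_vars, vals))
               for vals in product(domain, repeat=len(region_vars))]
    configs.sort(key=score, reverse=True)
    return configs[:n]

# toy region of two binary variables that prefer to agree,
# with a slight pull toward s1 = 1
msg = n_best(["s1", "s2"], [0, 1],
             score=lambda c: (c["s1"] == c["s2"]) + 0.5 * c["s1"])
# msg[0] is the argmax; the remaining entries give downstream factors
# (e.g. coreference) alternatives to fall back on
```

Because only N configurations per region are passed on, the message stays linear in region size and N rather than exponential.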

Page 39:

(Model: three citations with observed citations o, segmentations s, attributes c, pairwise co-reference variables y, and prototype pairwise vars p.)

Sample = N-best list from CRF segmentation: do exact inference over these linear-chain regions, then pass the N-best list to coreference.

Page 40:

(Two citations o, s, c with pairwise co-reference variable y, parameterized by N-best lists.)

Sample = N-best list from Viterbi.

Page 41:

Sample = N-best List from Viterbi

Name        Title …
Laurel, B   Interface Agents: Metaphors with Character The
Laurel, B.  Interface Agents: Metaphors with Character
Laurel, B.  Interface Agents
            Metaphors with Character

When calculating similarity with another citation, we have more opportunity to find correct, matching fields.

Name        Title                                              Book Title                                   Year
Laurel, B.  Interface Agents: Metaphors with Character         The Art of Human Computer Interface Design   1990
Laurel, B.  Interface Agents: Metaphors with Character The Art of Human Computer Interface Design           1990
Laurel, B.  Interface Agents: Metaphors with Character         The Art of Human Computer Interface Design   1990

Page 42:

Results on 4 Sections of CiteSeer Citations

Coreference F1 performance:

N         Reinforce   Face    Reason   Constraint
1         0.946       0.967   0.945    0.961
3         0.950       0.979   0.961    0.960
7         0.948       0.979   0.951    0.971
9         0.982       0.967   0.960    0.971
Optimal   0.995       0.992   0.994    0.988

• Average error reduction is 35%.
• “Optimal” makes best use of the N-best list by using true labels.
• Indicates that even more improvement can be obtained.

Page 43:

Conclusions

• Conditional Random Fields combine the benefits of
  – Conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)

• Success in
  – Factorial finite state models
  – Coreference analysis
  – Semi-supervised learning
  – Segmentation uncertainty aiding coreference

• Future work:
  – Structure learning
  – Further tight integration of IE and Data Mining
  – Application to Social Network Analysis

Page 44:

End of Talk

Page 45:

Application Project:

Page 46:

Application Project:

Research Paper

Cites

Page 47:

Application Project:

(Entities: Research Paper, Cites, Person, University, Conference, Grant, Groups, Expertise.)

Page 48:

Software Infrastructure

MALLET: Machine Learning for Language Toolkit

• ~60k lines of Java
• Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
• Many ML basics in a common, convenient framework:
  – naïve Bayes, MaxEnt, Boosting, SVMs, Dirichlets, Conjugate Gradient
• Advanced ML algorithms:
  – Conditional Random Fields, Maximum Margin Markov Networks, BFGS, Expectation Propagation, Tree-Reparameterization, …
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100k's of training examples, as needed for NLP.

Released as Open Source Software. http://mallet.cs.umass.edu

In use at UMass, MIT, CMU, UPenn,

Page 49:

End of Talk