TRANSCRIPT
Information Extraction, Data Mining & Joint Inference
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Khashayar Rohanemanesh, Ben Wellner, Karl Schultz, Michael Hay, Michael Wick, David Mimno.
My Research
Building models that mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
From Text to Actionable Knowledge
Spider -> Document collection -> IE (Segment, Classify, Associate, Cluster, Filter) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events; Prediction, Outlier detection, Decision support) -> Actionable knowledge
A Natural Language Processing Pipeline
POS tagging -> Chunking -> Parsing -> Entity Recognition -> Semantic Role Labeling -> Anaphora Resolution -> Pragmatics
Errors cascade & accumulate.
Unified Natural Language Processing
POS tagging, Chunking, Parsing, Entity Recognition, Semantic Role Labeling, Anaphora Resolution, Pragmatics: unified, joint inference.
Spider -> Document collection -> IE (Segment, Classify, Associate, Cluster, Filter) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events; Prediction, Outlier detection, Decision support) -> Actionable knowledge
IE passes Uncertainty Info forward to Data Mining; Data Mining passes Emerging Patterns back to IE.
Solution:
Spider -> Document collection -> Probabilistic Model combining IE (Segment, Classify, Associate, Cluster, Filter) and Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)
Solution: Unified Model
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Discriminatively-trained undirected graphical models
Complex Inference and Learning: just what we researchers like to sink our teeth into!
Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models,which are built from multiple large components?
• How to do structure discovery in these models?
Methods of Inference
• Exact – Exhaustively explore all interpretations; graphical model has low “tree-width”
• Variational – Represent the distribution in a simpler model that is close
• Monte-Carlo – Randomly (but cleverly) sample to explore interpretations
Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
• Demo: Rexa, a Web portal for researchers
Hidden Markov Models
Finite state model / graphical model: hidden states S_{t-1}, S_t, S_{t+1}, … each emitting an observation O_t.

P(s, o) = ∏_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Generates: a state sequence (via transitions) and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8 (via emissions).
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Ron Parr spoke this example sentence.
and a trained HMM (with states such as person name, location name, background),
find the most likely state sequence (Viterbi):
Yesterday Ron Parr spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Ron Parr
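The Viterbi decoding step can be sketched as follows; the states mirror the slide (person name, location name, background), but the transition and emission probabilities below are illustrative toy numbers, not trained parameters.

```python
import math

def viterbi(obs, states, start, trans, emit):
    # Most likely state sequence under the HMM, computed in log space.
    V = [{s: math.log(start[s]) + math.log(emit[s].get(obs[0], 1e-12))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans[prev][s])
                       + math.log(emit[s].get(obs[t], 1e-12)))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy model with background / person-name / location-name states
# (illustrative numbers only).
states = ["background", "person", "location"]
start = {"background": 0.8, "person": 0.1, "location": 0.1}
trans = {s: {"background": 0.6, "person": 0.2, "location": 0.2} for s in states}
emit = {
    "background": {"yesterday": 0.3, "spoke": 0.3, "this": 0.4},
    "person": {"ron": 0.5, "parr": 0.5},
    "location": {"amherst": 1.0},
}
tags = viterbi(["yesterday", "ron", "parr", "spoke"], states, start, trans, emit)
```

Words tagged with the person state ("ron", "parr") would be extracted as the person name.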
We want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
– identity of word
– ends in “-ski”
– is capitalized
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet
– is in bold font
– is indented
– is in hyperlink anchor
– last person name was female
– next two words are “and Associates”
– …
(e.g. the word “Wisniewski”: is “Wisniewski”, ends in “-ski”, part of a noun phrase)
Problems with Richer Representation and a Joint Model
These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
Ignore the dependencies. This causes “over-counting” of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
(A super-special case of Conditional Random Fields.)
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

s = s1, s2, … sn ;  o = o1, o2, … on

P(s, o) = ∏_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)

P(s | o) = (1 / P(o)) ∏_{t=1..|o|} P(s_t | s_{t-1}) P(o_t | s_t)
         = (1 / Z(o)) ∏_{t=1..|o|} Φ_s(s_t, s_{t-1}) Φ_o(o_t, s_t)

where  Φ_o(t) = exp( Σ_k λ_k f_k(s_t, o_t) )

Set parameters by maximum likelihood, using an optimization method on L.
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of the output (sequence) given the input (sequence). Finite state model / graphical model: FSM states y_{t-1}, y_t, y_{t+1}, … over observations x_{t-1}, x_t, x_{t+1}, …

p(y | x) = (1 / Z_x) ∏_t Φ(y_t, y_{t-1}, x, t)   where   Φ(y_t, y_{t-1}, x, t) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x, t) )

input seq:    said    Jones    a      Microsoft   VP     …
output seq:   OTHER   PERSON   OTHER  ORG         TITLE  …

Wide-spread interest, positive experimental results in many applications:
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HTL’04]
Object classification in images [CVPR ‘04]
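As a minimal sketch of the linear-chain CRF formula, the following brute-force implementation computes p(y|x) for a tiny label set; the feature functions and weights are hypothetical, and the partition function Z_x is enumerated exhaustively, which is only feasible for toy sequences.

```python
import math
from itertools import product

LABELS = ["OTHER", "PERSON", "ORG"]

def features(y_t, y_prev, x, t):
    # f_k(y_t, y_{t-1}, x, t): arbitrary, overlapping features of the input.
    return {
        ("cap", y_t): x[t][0].isupper(),
        ("word", x[t].lower(), y_t): True,
        ("trans", y_prev, y_t): y_prev is not None,
    }

def log_potential(y_t, y_prev, x, t, w):
    # log Φ(y_t, y_{t-1}, x, t) = Σ_k λ_k f_k(...)
    return sum(w.get(k, 0.0) for k, v in features(y_t, y_prev, x, t).items() if v)

def seq_score(y, x, w):
    return sum(log_potential(y[t], y[t - 1] if t else None, x, t, w)
               for t in range(len(x)))

def p_y_given_x(y, x, w):
    # Brute-force Z_x over all label sequences (toy sequences only).
    Z = sum(math.exp(seq_score(list(yy), x, w))
            for yy in product(LABELS, repeat=len(x)))
    return math.exp(seq_score(y, x, w)) / Z

x = ["said", "Jones"]
w = {("cap", "PERSON"): 2.0, ("word", "said", "OTHER"): 1.5}
probs = {yy: p_y_given_x(list(yy), x, w) for yy in product(LABELS, repeat=2)}
best = max(probs, key=probs.get)
```

With these toy weights, the capitalized "Jones" is preferred as PERSON and "said" as OTHER, matching the slide's example labeling.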
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
--------------------------------------------------------------------------------
: : Production of Milk and Milkfat 2/
: Number :-------------------------------------------------------
Year : of : Per Milk Cow : Percentage : Total
:Milk Cows 1/:-------------------: of Fat in All :------------------
: : Milk : Milkfat : Milk Produced : Milk : Milkfat
--------------------------------------------------------------------------------
: 1,000 Head --- Pounds --- Percent Million Pounds
:
1993 : 9,589 15,704 575 3.66 150,582 5,514.4
1994 : 9,500 16,175 592 3.66 153,664 5,623.7
1995 : 9,461 16,451 602 3.66 155,644 5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
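A sketch of per-line layout features in the spirit of the list above; this is a hypothetical re-implementation for illustration, not the original system's feature code.

```python
def line_features(line, prev):
    # Per-line layout features like those listed on the slide.
    n = max(len(line), 1)
    f = {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "gap5+": "     " in line.strip(),   # 5+ consecutive internal spaces
        # Whitespace in this line aligns with whitespace in the previous line.
        "aligns_prev": any(i < len(prev) and line[i] == " " == prev[i]
                           for i in range(len(line))),
    }
    return f

row = "1995 :      9,461       16,451      602"
hdr = "Year :       of      : Per Milk Cow"
f = line_features(row, hdr)
```

A data row like the one above is digit-heavy with wide internal gaps, which is exactly the signal the CRF uses to separate table rows from prose.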
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                    Line labels,      Table segments,
                    percent correct   F1
HMM                 65 %              64 %
Stateless MaxEnt    85 %              -
CRF                 95 %              92 %
IE from Research Papers
[McCallum et al ‘99]
Field-level F1:
Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
(≈40% error reduction over SVMs)
Outline
• The Need for IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
– Joint Information Integration (MCMC + Sample Rank)
• Demo: Rexa, a Web portal for researchers
1. Jointly labeling cascaded sequences: Factorial CRFs
Part-of-speech
Noun-phrase boundaries
Named-entity tag
English words
[Sutton, Khashayar, McCallum, ICML 2004]
But errors cascade--must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire,matching accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation
2. Jointly labeling distant mentions: Skip-chain CRFs
Senator Joe Green said today … . Green ran for …
…
[Sutton, McCallum, SRL 2004]
Dependency among similar, distant mentions ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterized BP [Wainwright et al, 2002]
See also [Finkel, et al, 2005]
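The skip-chain construction can be sketched as adding an extra edge between distant identical capitalized tokens, so that belief propagation couples their label decisions; the exact matching rule below is a simplified assumption, not the paper's precise heuristic.

```python
def skip_edges(tokens):
    # Connect identical capitalized tokens that recur in the sequence,
    # so their labels are decided jointly rather than independently.
    edges = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if tokens[i] == tokens[j] and tokens[i][:1].isupper():
                edges.append((i, j))
    return edges

toks = "Senator Joe Green said today that Green ran".split()
edges = skip_edges(toks)
```

Here the two occurrences of "Green" get a skip edge, so labeling the first one PERSON pulls the second toward PERSON as well.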
3. Joint co-reference among all pairs: Affinity Matrix CRF
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]
“Entity resolution” / “Object correspondence”
Mentions: Dana Hill, Mr. Hill, Amy Hall, she, Dana
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"
Input: News article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with … he … Condoleezza Rice … Mr Powell … she … Powell … President Bush … Rice … Bush …
Output: Number of entities, N = 3
#1 Secretary of State Colin Powell; he; Mr. Powell; Powell
#2 Condoleezza Rice; she; Rice
#3 President Bush; Bush
Inside the Traditional Solution
Pair-wise Affinity Metric: coreferent (Y/N)?
Mention (3) … Mr Powell … vs. Mention (4) … Powell …

N  Two words in common                              29
Y  One word in common                               13
Y  "Normalized" mentions are string identical       39
Y  Capitalized word in common                       17
Y  > 50% character tri-gram overlap                 19
N  < 25% character tri-gram overlap                -34
Y  In same sentence                                  9
Y  Within two sentences                              8
N  Further than 3 sentences apart                   -1
Y  "Hobbs Distance" < 3                             11
N  Number of entities in between two mentions = 0   12
N  Number of entities in between two mentions > 4   -3
Y  Font matches                                      1
Y  Default                                         -19

OVERALL SCORE = 98 > threshold = 0
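A sketch of this weighted-feature affinity score; the weights echo a few of the slide's example values, and the feature tests are simplified guesses at their definitions (the full system uses many more features, e.g. Hobbs distance and sentence position, which need document context).

```python
# Weights loosely copied from the slide's example table.
WEIGHTS = {
    "one_word_in_common": 13,
    "normalized_identical": 39,
    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50": 19,
    "in_same_sentence": 9,   # would need sentence info to fire
    "default": -19,
}

def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def affinity(m1, m2):
    # Fire a subset of the slide's features and sum their weights.
    f = {"default": True}
    w1, w2 = set(m1.lower().split()), set(m2.lower().split())
    f["one_word_in_common"] = len(w1 & w2) == 1
    f["normalized_identical"] = m1.lower() == m2.lower()
    f["capitalized_word_in_common"] = bool(
        {w for w in m1.split() if w[:1].isupper()} &
        {w for w in m2.split() if w[:1].isupper()})
    t1, t2 = trigrams(m1), trigrams(m2)
    f["trigram_overlap_gt_50"] = len(t1 & t2) > 0.5 * min(len(t1), len(t2))
    return sum(WEIGHTS[k] for k, v in f.items() if v)

score = affinity("Mr Powell", "Powell")
coref = score > 0   # threshold at 0, as on the slide
```

The traditional pipeline makes this Y/N decision independently for every pair, which is exactly the weakness the joint model addresses.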
Entity Resolution
Mentions: Dana Hill, Mr. Hill, Amy Hall, she, Dana
Clustering the coreferent mentions yields “entities”; the same mentions can be partitioned into two or three candidate entities.
The Problem
Mentions: Dana Hill, Mr. Hill, Amy Hall, she, Dana
Independent pairwise affinity with connected components: pair-wise merging decisions are being made independently from each other. They should be made jointly.
Affinity measures are noisy and imperfect.
CRF for Co-reference
[McCallum & Wellner, 2003, ICML]
Mentions: Dana Hill, Mr. Hill, Amy Hall, she, Dana

P(y | x) = (1 / Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ' f'(y_ij, y_jk, y_ik) )

Make pair-wise merging decisions jointly by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.
CRF for Co-reference
Example edge decisions over the same mentions, with per-edge scores in parentheses:
- A consistent configuration +(23), +(10), +(4), +(17), -(-55), -(-44), -(-23), +(11), -(-9), -(-22) totals 218.
- Flipping one decision to -(4) would total 210, but the resulting triangles are inconsistent, so the configuration gets probability 0 (score −∞).
- A different consistent configuration -(23), +(10), +(4), -(17), +(-55), -(-44), -(-23), -(11), +(-9), -(-22) totals −12.
Inference in these MRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
Correlational Clustering [Bansal & Blum], [Demaine]

log( P(y | x) ) ∝ Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) = Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij
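This partitioning objective (within-cluster weights minus cross-cluster weights) can be approximated greedily, since exact correlational clustering is NP-hard; the edge weights below are the illustrative numbers from the earlier slides, and the greedy search is a sketch, not the partitioning algorithms cited above.

```python
from itertools import combinations

def objective(partition, w):
    # Σ w_ij within clusters − Σ w_ij across clusters.
    total = 0
    for i, j in w:
        same = any(i in c and j in c for c in partition)
        total += w[(i, j)] if same else -w[(i, j)]
    return total

def greedy_partition(items, w):
    # Greedy agglomerative search over partitions.
    clusters = [{i} for i in items]
    improved = True
    while improved:
        improved = False
        for a, b in combinations(range(len(clusters)), 2):
            merged = (clusters[:a] + clusters[a + 1:b] + clusters[b + 1:]
                      + [clusters[a] | clusters[b]])
            if objective(merged, w) > objective(clusters, w):
                clusters, improved = merged, True
                break
    return clusters

# Edge weights echoing the slides' example (positive = pull together).
w = {("MrHill", "DanaHill"): 23, ("DanaHill", "she"): 10,
     ("MrHill", "she"): -55, ("MrHill", "AmyHall"): -44,
     ("DanaHill", "AmyHall"): -23, ("she", "AmyHall"): 11}
parts = greedy_partition(["MrHill", "DanaHill", "she", "AmyHall"], w)
```

With these weights the search merges the strongly attracted pairs and refuses merges that would flip large negative edges inside a cluster.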
Co-reference Experimental Results
[McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories
                           Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       88 %           92 %
                           error=30%      error=28%

DARPA MUC-6 newswire article corpus, 30 stories
                           Partition F1   Pair F1
Single-link threshold      11 %            7 %
Best prev match [Morton]   70 %           76 %
MRFs                       74 %           80 %
                           error=13%      error=17%
Pairwise Affinity is not Enough
Example cluster: Amy Hall, she, she, she, she. Pairwise edge scores alone cannot express properties of the whole set of mentions.
Pairwise Comparisons Not Enough
Examples:
• What if all mentions are pronouns?
• Entities have multiple attributes (name, email, institution, location); need to measure “compatibility” among them.
• Having 2 “given names” is common, but not 4.
  – e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions, e.g. a pair of last-name strings that differ in more than 5 characters.
We need to ask questions about a set of mentions.
We want first-order logic!
Partition Affinity CRF
Mentions: Amy Hall, she, she, she, she
Ask arbitrary questions about all entities in a partition with first-order logic ... bringing together LOGIC and PROBABILITY.
“Markov Logic”: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005]; see also [Paskin & Russell 2002], [Taskar et al 2003]
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity. This space complexity is common in probabilistic first-order logic models.
How can we perform inference and learning in models that cannot be grounded?
Inference in Weighted First-Order Logic: SAT Solvers
• Weighted SAT solvers [Kautz et al 1997]
  – Requires complete grounding of network
• LazySAT [Singla & Domingos 2006]
  – Saves memory by only storing clauses that may become unsatisfied
  – Initialization still requires time O(n^r) to visit all ground clauses
• Gibbs Sampling
  – Difficult to move between high-probability configurations by changing single variables
  – Although, consider MC-SAT [Poon & Domingos ‘06]
• An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  – 2 parts: proposal distribution, acceptance distribution
  – Can be extended to partial configurations: only instantiate relevant variables
  – Successfully used in BLOG models [Milch et al 2005]
  – Key advantage: can design arbitrary “smart” jumps
Inference in Weighted First-Order Logic: MCMC
Don’t represent all alternatives... just one at a time.
Mentions: Amy Hall, she, she, she, she
A stochastic jump drawn from the proposal distribution moves the sampler from one configuration to the next.
Metropolis-Hastings “Jump acceptance probability”
• p(y’)/p(y): likelihood ratio
  – Ratio of P(Y|X)
  – Z_X cancels!
• q(y’|y): proposal distribution
  – probability of proposing the move y -> y’
  – the ratio makes up for any biases in the proposal distribution
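A minimal Metropolis-Hastings sketch over coreference configurations, assuming a hypothetical log-score in place of a trained model. The move (reassign one mention to a random cluster id) is symmetric, so the proposal ratio q(y|y')/q(y'|y) is 1, and because the acceptance test uses a ratio of probabilities, the partition function Z cancels.

```python
import math, random

def mh_map(mentions, log_score, steps=2000, seed=0):
    # Metropolis-Hastings over cluster assignments; track the best state seen.
    rng = random.Random(seed)
    assign = {m: i for i, m in enumerate(mentions)}   # start: all singletons
    best, best_s = dict(assign), log_score(assign)
    for _ in range(steps):
        prop = dict(assign)
        prop[rng.choice(mentions)] = rng.randrange(len(mentions))  # jump
        # Symmetric proposal: accept with probability min(1, p(y')/p(y)).
        if math.log(rng.random()) < log_score(prop) - log_score(assign):
            assign = prop
            if log_score(assign) > best_s:
                best, best_s = dict(assign), log_score(assign)
    return best

# Hypothetical scorer: reward clustering mentions that share a last token.
def log_score(assign):
    ms, s = list(assign), 0.0
    for i in range(len(ms)):
        for j in range(i + 1, len(ms)):
            if assign[ms[i]] == assign[ms[j]]:
                s += 5.0 if ms[i].split()[-1] == ms[j].split()[-1] else -5.0
    return s

best = mh_map(["Colin Powell", "Mr Powell", "Powell", "Rice"], log_score)
powells_together = best["Colin Powell"] == best["Mr Powell"] == best["Powell"]
rice_alone = best["Rice"] != best["Colin Powell"]
```

Only one configuration is held in memory at a time, which is what lets this style of inference scale where full grounding cannot.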
Learning the Likelihood Ratio: Parameter Estimation in Large State Spaces
Given a pair of configurations, learn to rank the “better” configuration higher.
• Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3,... | x1, x2, x3,...)...
• ...which in turn requires “expectations of marginals,” P(y1 | x1, x2, x3,...).
• But getting marginal distributions by sampling can be inefficient due to the large sample space.
• Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model’s predicted best output.
• But even finding the model’s predicted best output is expensive.
• We propose: “Sample Rank” [Culotta, Wick, Hall, McCallum, HLT 2007]
  Learn to rank intermediate solutions: P(y1=1, y2=0, y3=1,... | ...) > P(y1=0, y2=0, y3=1,... | ...)
Ranking vs Classification Training
• Instead of training
[Powell, Mr. Powell, he] --> YES[Powell, Mr. Powell, she] --> NO
• ...Rather...
[Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, higher-ranked example may contain errors
[Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Intermediate Solutions: Example
A proposer generates a chain of configurations; at each step compare the model’s score change (∆ Model) with the truth metric’s change (∆ Truth):
2. ∆ Model = -23, ∆ Truth = -0.2
3. ∆ Model = 10, ∆ Truth = -0.1 (UPDATE: the model prefers a move the truth disfavors)
4. ∆ Model = -10, ∆ Truth = -0.1
5. ∆ Model = 3, ∆ Truth = 0.3
• Like Perceptron: proof of convergence under Marginal Separability
• More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!
Sample Rank Algorithm
1. Proposer: y_{t+1} = Propose(x, y_t)
2. Performance Metric: F(y)
3. Inputs: input sequence x and an initial (random) configuration y_0
4. Initialization: set the parameter vector α = 0
5. Output: parameters α
6. Score function: Score_α(y) = α · φ(x, y)
7. For t = 1, …, T and i = 0, …, n-1 do:
   Generate a training instance y_{t+1} = Propose(x, y_t).
   Let y+ and y− be the best and worst configurations among y_t and y_{t+1} according to the performance metric.
   If Score_α(y+) < Score_α(y−):
     α = α + φ(x, y+) − φ(x, y−)
   end if
8. end for
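A toy re-implementation sketch of the algorithm above on a hypothetical task (label each position by the sign of its input); the features, metric, and proposer are invented for illustration.

```python
import random

def sample_rank(x, y0, propose, metric, phi, steps=300, seed=0):
    # Learn alpha so Score(y) = alpha · phi(x, y) ranks configurations
    # the way the performance metric does.
    rng = random.Random(seed)
    alpha = {}

    def score(y):
        return sum(alpha.get(k, 0.0) * v for k, v in phi(x, y).items())

    y = y0
    for _ in range(steps):
        y2 = propose(x, y, rng)
        good, bad = (y2, y) if metric(y2) > metric(y) else (y, y2)
        if metric(good) != metric(bad) and score(good) <= score(bad):
            # Perceptron-style update: alpha += phi(x, y+) − phi(x, y−)
            for k, v in phi(x, good).items():
                alpha[k] = alpha.get(k, 0.0) + v
            for k, v in phi(x, bad).items():
                alpha[k] = alpha.get(k, 0.0) - v
        y = y2 if score(y2) >= score(y) else y   # walk where the model points
    return alpha

x = [2.0, -1.0, 3.0, -2.0]
truth = [1, 0, 1, 0]

def phi(x, y):
    return {"agree_sign": sum((xi > 0) == (yi == 1) for xi, yi in zip(x, y))}

def metric(y):   # performance metric F(y): accuracy against the truth
    return sum(a == b for a, b in zip(y, truth))

def propose(x, y, rng):   # proposer: flip one randomly chosen label
    i = rng.randrange(len(y))
    return [b ^ (j == i) for j, b in enumerate(y)]

alpha = sample_rank(x, [0, 0, 0, 0], propose, metric, phi)
```

After the first ranking violation the "agree_sign" weight turns positive and the model's ranking of intermediate configurations matches the metric's, so no further updates fire.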
Weighted Logics Techniques: Overview
• Metropolis-Hastings (MH) for inference
  – Freely bake in domain knowledge about fruitful jumps; MH safely takes care of its biases.
  – Avoid the memory and time consumption of massive deterministic constraint factors: build jump functions that simply avoid illegal states.
• “Sample Rank”
  – Don’t train by likelihood of the completely correct solution...
  – ...train to properly rank intermediate configurations: the partition function (normalizer) cancels! ...plus other efficiencies
Partition Affinity CRF Experiments
B-Cubed F1 Score on ACE 2004 Noun Coreference

                      Likelihood-based   Rank-based
                      Training           Training
Pairwise Affinity     62.4               72.5
Partition Affinity    69.2               79.3

Better representation (partition vs. pairwise affinity) and better training (rank-based vs. likelihood-based) each improve results.
To our knowledge, the best previously reported results: 65% (1997), 67% (2002), 68% (2005). New state-of-the-art.
[Culotta, Wick, Hall, McCallum, 2007]
Information Integration

Database A (Schema A)
First Name   Last Name   Contact
J.           Smith       222-444-1337
J.           Smith       444 1337
John         Smith       (1) 4321115555

Database B (Schema B)
Name            Phone
John Smith      U.S. 222-444-1337
John D. Smith   444 1337
J Smiht         432-111-5555

Schema Matching:
Schema A                  Schema B
First Name + Last Name    Name
Contact                   Phone

Coreference:
John #1: J. Smith, J. Smith, John Smith
John #2: John Smith, J Smiht, John D. Smith

Normalized DB:
Entity#   Name            Phone
523       John Smith      222-444-1337
524       John D. Smith   432-111-5555
…         …               …
Information Integration Steps
1. Schema Matching: First Name + Last Name <-> Name; Contact <-> Phone
2. Coreference: J. Smith, John Smith, J. Smith; Amanda, A. Jones
3. Canonicalization: John Smith …, Amanda Jones …, …
Problems with a Pipeline
1. Data integration tasks are highly correlated
2. Errors can propagate
Schema Matching First
1. Schema Matching: First Name + Last Name <-> Name; Contact <-> Phone
2. Coreference: J. Smith, John Smith, J. Smith; Amanda, A. Jones
Schema matching provides evidence (NEW FEATURES) for coreference:
1. String Identical: F.Name+L.Name == Name
2. Same Area Code: 3-gram in Phone/Contact
3. …
Coreference First
1. Coreference: J. Smith, John Smith, J. Smith; Amanda, A. Jones
2. Schema Matching: First Name + Last Name <-> Name; Contact <-> Phone
Coreference provides evidence (NEW FEATURES) for schema matching:
1. Field values similar across coref’d records
2. Phone+Contact has the same value for the J. Smith mentions
3. …
Hazards of a Pipeline
1. Schema Matching: Full Name / Company Name vs. Name / Corporation (plus Phone, Contact)

Table A                           Table B
Name           Corporation        Full Name      Company Name
Amanda Jones   J. Smith & Sons    Amanda Jones   Smith & Sons
J. Smith       IBM                John Smith     IBM

2. Coreferent? ERRORS PROPAGATE.
Canonicalization
Typically occurs AFTER coreference.
Coref: Entity 87 = {John Smith, J. Smith, J. Smith, J. Smiht, J.S Mith, Jonh smith, John} -> canonical mention “John Smith”
Desiderata:
• Complete: Contains all information (e.g. first + last)
• Error-free: No typos (e.g. avoid “Smiht”)
• Central: Represents all mentions (not “Mith”)
Access to such features would be very helpful to Coref.
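The "central" desideratum can be sketched as picking the mention with the smallest summed edit distance to the others, breaking ties toward longer (more complete) strings; this is a simplification, and the mention list below is a subset of the slide's example.

```python
def edit_distance(a, b):
    # Single-row dynamic-programming Levenshtein distance.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def canonical(mentions):
    # Most "central" mention: minimal summed edit distance to all others,
    # preferring longer (more complete) strings on ties.
    return min(mentions,
               key=lambda m: (sum(edit_distance(m, o) for o in mentions),
                              -len(m)))

name = canonical(["John Smith", "John Smith", "J. Smith", "J. Smiht", "Jonh Smith"])
```

Centrality alone does not guarantee completeness or freedom from typos, which is why the slide lists all three desiderata separately.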
[Factor graph (figure): schema-matching attribute clusters x4–x8 with factors f5, f7, f8 and match variables y5, y7, y8, y54, y67; coreference/canonicalization mention clusters x1–x3 with factors f1, f2 and match variables y1, y2, y3, y12, y13, y23]

P(Y | X) = (1/Z_X) ∏_{y_i ∈ Y} ψ_w(y_i, x_i) ∏_{y_i, y_j ∈ Y} ψ_b(y_ij, x_ij)

ψ(y_i, x_i) = exp( ∑_k λ_k f_k(y_i, x_i) )
• x6 is a set of attributes {phone, contact, telephone}
• x7 is a set of attributes {last name, last name}
• f67 is a factor between x6/x7
• y67 is a binary variable indicating a match (no)
• f7 is a factor over cluster x7
• y7 is a binary variable indicating match (yes)
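The model above is log-linear in its factors. A minimal sketch of ψ and the unnormalized product over within- and between-cluster factors (the feature dictionaries and weights here are made up for illustration):

```python
import math

def psi(feats, lam):
    # psi(y, x) = exp( sum_k lam_k * f_k(y, x) )
    return math.exp(sum(lam.get(k, 0.0) * v for k, v in feats.items()))

def unnormalized_score(within, between, lam):
    # product of psi_w over clusters and psi_b over cluster pairs;
    # dividing by the partition function Z_X would give P(Y | X)
    score = 1.0
    for feats in within:     # psi_w terms, one per cluster y_i
        score *= psi(feats, lam)
    for feats in between:    # psi_b terms, one per cluster pair y_ij
        score *= psi(feats, lam)
    return score
```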
[Factor graph and model equations repeated from the previous slide; this build highlights the coreference clusters x1 and x2]
• x1 is a set of mentions {J. Smith, John, John Smith}
• x2 is a set of mentions {Amanda, A. Jones}
• f12 is a factor between x1/x2
• y12 is a binary variable indicating a match (no)
• f1 is a factor over cluster x1
• y1 is a binary variable indicating match (yes)
• Entity/attribute factors omitted for clarity
[Factor graph and model equations repeated from the previous slide; this build adds factor f43, linking coreference cluster x3 with schema-matching cluster x4]
Dataset
• Faculty and alumni listings from university websites, plus an IE system
• 9 different schemas
• ~1400 mentions, 294 coreferent
Example Schemas
DEX IE | Northwestern Fac | UPenn Fac
First Name | Name | Name
Middle Name | Title | First Name
Last Name | PhD Alma Mater | Last Name
Title | Research Interests | Job+Department
Department | Office Address |
Company Name | E-mail |
Home Phone | |
Office Phone | |
Fax Number | |
Schema Matching Features
• String identical
• Substring matches
• TFIDF-weighted cosine distance
• All of the above, restricted to coreferent mentions only
• First-order quantifications/aggregations over the above
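The TFIDF-weighted cosine feature can be sketched as follows (a toy token-level version with the IDF table assumed given; the real feature functions may differ):

```python
import math
from collections import Counter

def tfidf_cosine(tokens_a, tokens_b, idf):
    # build TFIDF vectors (unknown tokens default to idf 1.0)
    va = {t: c * idf.get(t, 1.0) for t, c in Counter(tokens_a).items()}
    vb = {t: c * idf.get(t, 1.0) for t, c in Counter(tokens_b).items()}
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```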
Systems
• ISO: Each task in isolation
• CASC: Coref -> Schema matching
• CASC: Schema matching -> Coref
• JOINT: Coref + Schema matching

Each system is evaluated with and without joint canonicalization.
Our new work
Coreference Results
          Pair                    MUC
          F1    Prec   Recall     F1    Prec   Recall
No Canon
ISO       72.7  88.9   61.5       75.0  88.9   64.9
CASC      64.0  66.7   61.5       65.7  66.7   64.9
JOINT     76.5  89.7   66.7       78.8  89.7   70.3
Canon
ISO       78.3  90.0   69.2       80.6  90.0   73.0
CASC      65.8  67.6   64.1       67.6  67.6   67.6
JOINT     81.7  90.6   74.4       84.1  90.6   74.4
Note: cascade does worse than ISO
Schema Matching Results
          Pair                    MUC
          F1    Prec   Recall     F1    Prec   Recall
No Canon
ISO       50.9  40.9   67.5       69.2  81.8   60.0
CASC      50.9  40.9   67.5       69.2  81.8   60.0
JOINT     68.9  100    52.5       69.6  100    53.3
Canon
ISO       50.9  40.9   67.5       69.2  81.8   60.0
CASC      52.3  41.8   70.0       74.1  83.3   66.7
JOINT     71.0  100    55.0       75.0  100    60.0
Note: cascade not as harmful here
Outline
• The Need for IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Probability + First-order Logic, Co-ref on Entities (MCMC)
– Joint Information Integration (MCMC + Sample Rank)
• Demo: Rexa, a Web portal for researchers
Data Mining Research Literature
• Better understand structure of our own research area.
• Structure helps us learn a new field.
• Aid collaboration.
• Map how ideas travel through social networks of researchers.
• Aids for hiring and finding reviewers!
• Measure impact of papers or people.
Our Data
• Over 1.6 million research papers, gathered as part of Rexa.info portal.
• Cross linked references / citations.
Previous Systems
ResearchPaper
Cites
Previous Systems
ResearchPaper
Cites
Person
University, Venue
Grant
Groups
Expertise
More Entities and Relations
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
Topical Bibliometric Impact Measures
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
[Mann, Mimno, McCallum, 2006]
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cit’s Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being ‘Undigital’ with digital cameras: extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase: a repository of Web pages
Topical Diversity
Papers that had the most influence across many other fields...
Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
High Diversity
Low Diversity
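Topical diversity as defined above is just the entropy of the citing-paper topic distribution; a minimal sketch (topic counts are made-up inputs):

```python
import math

def topical_diversity(topic_counts):
    # entropy (in nats) of the topic distribution among citing papers:
    # H = -sum_t p(t) log p(t), with p(t) = count(t) / total
    total = sum(topic_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in topic_counts.values() if c > 0)
```

A paper cited only within its own topic scores 0; citations spread evenly over many topics maximize the score.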
Summary
• Joint inference needed for avoiding cascading errors in information extraction and data mining.
– Most fundamental problem in NLP, data mining, ...
• Can be performed in CRFs
– Cascaded sequences (Factorial CRFs)
– Distant correlations (Skip-chain CRFs)
– Co-reference (Affinity-matrix CRFs)
– Logic + Probability (efficient by MCMC + Sample Rank)
– Information Integration
• Rexa: New research paper search engine, mining the interactions in our community.
Outline
• Model / Feature Engineering
– Brief review of IE w/ Conditional Random Fields
– Flexibility to use non-independent features
• Inference
– Entity Resolution with Probability + First-order Logic
– Resolution + Canonicalization + Schema Mapping
– Inference by Metropolis-Hastings
• Parameter Estimation
– Semi-supervised Learning with Label Regularization
– ...with Feature Labeling
– Generalized Expectation criteria