TRANSCRIPT
Unified Models of Information Extraction and Data Mining
with Application to Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Joint work with David Jensen
Knowledge Discovery and Dissemination (KDD) Conference
September 2004
Intelligence Technology Innovation Center (ITIC)
Goal:
Improve the state of the art in our ability to mine actionable knowledge from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
Employer: foodscience.com
JobTitle: Ice Cream Guru
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
Traditional Pipeline
Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Problem:
Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Missing connections: Uncertainty Info (IE → Data Mining) and Emerging Patterns (Data Mining → IE).
Solution:
Spider → Document collection → Probabilistic Model unifying IE (Segment, Classify, Associate, Cluster) and Data Mining (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Research & Approach:
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], …
Conditionally-trained undirected graphical models
Complex Inference and Learning: just what we researchers like to sink our teeth into!
Unified Model
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
Types of Uncertainty in Knowledge Discovery from Text
• Confidence that extractor correctly obtained statements the author intended.
• Confidence that what was written is truthful.
  – Author could have had misconceptions…
  – …or have been purposefully trying to mislead.
• Confidence that the emerging, discovered pattern is a reliable fact or generalization.
1. Labeling Sequence Data: Linear-chain CRFs
[Figure: finite state model / graphical model. FSM states y_{t-1}, y_t, y_{t+1}, … over observations x_{t-1}, x_t, x_{t+1}, …; input sequence "said Arden Bement NSF Director …", output sequence "OTHER PERSON PERSON ORG TITLE …"]

Undirected graphical model, trained to maximize the conditional probability of outputs given inputs:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)

where \Phi(\cdot) = \exp\big( \sum_k \lambda_k f_k(\cdot) \big)
Asian word segmentation [COLING '04], [ACL '04]
IE from research papers [HLT '04]
Object classification in images [CVPR '04]
Segmenting tables in textual gov't reports: 85% reduction in error over HMMs.
Noun phrase, named entity [HLT '03], [CoNLL '03]
Protein structure prediction [ICML '04]
IE from bioinformatics text [Bioinformatics '04], …
[Lafferty, McCallum, Pereira 2001]
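The linear-chain CRF distribution above can be checked numerically. The following is a minimal sketch, with a toy chain and hypothetical potential tables (not learned weights), computing Z(x) by brute force rather than the forward algorithm:

```python
import itertools

# Toy linear-chain CRF: 2 labels, length-3 chain. Potential tables are
# hypothetical numbers; Z(x) is computed by brute-force enumeration.
LABELS = (0, 1)
T = 3
phi_y = [[2.0, 0.5], [0.5, 2.0]]               # transition Phi_y(y_t, y_{t-1})
phi_xy = [[1.0, 3.0], [3.0, 1.0], [1.0, 3.0]]  # observation Phi_xy(x_t, y_t), indexed [t][y_t]

def score(y):
    """Unnormalized product of potentials along the chain."""
    s = phi_xy[0][y[0]]
    for t in range(1, T):
        s *= phi_y[y[t]][y[t - 1]] * phi_xy[t][y[t]]
    return s

Z = sum(score(y) for y in itertools.product(LABELS, repeat=T))

def p(y):
    """p(y | x) = score(y) / Z(x)."""
    return score(y) / Z

# Probabilities over all 2^3 label sequences sum to 1.
total = sum(p(y) for y in itertools.product(LABELS, repeat=T))
print(total)
```

In a real implementation Z(x) is computed in O(T·|Y|²) by forward-backward, not enumeration.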
Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]
[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON) over the observations; input sequence "said Arden Bement NSF Director …" with the output sequence above it.]

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)
Constrained Forward-Backward:

p(\text{Arden Bement} = \text{PERSON} \mid x) = \frac{1}{Z(x)} \sum_{y \in C} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)
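A toy illustration of the constrained marginal above: sum the unnormalized scores only over label sequences in the constraint set C, then divide by Z(x). Potentials and the length-4 input are hypothetical; a real implementation uses the constrained forward-backward lattice in O(T·|Y|²) rather than enumeration:

```python
import itertools

# Length-4 toy chain standing in for "said Arden Bement NSF ...".
# Label 1 = PERSON, label 0 = OTHER; potential numbers are hypothetical.
LABELS = (0, 1)
T = 4
phi_y = [[2.0, 0.5], [0.5, 2.0]]
phi_xy = [[1.0, 2.0], [1.0, 4.0], [1.0, 4.0], [2.0, 1.0]]

def score(y):
    s = phi_xy[0][y[0]]
    for t in range(1, T):
        s *= phi_y[y[t]][y[t - 1]] * phi_xy[t][y[t]]
    return s

Z = sum(score(y) for y in itertools.product(LABELS, repeat=T))

# Constraint set C: positions 1 and 2 ("Arden Bement") labeled PERSON.
constrained = sum(score(y) for y in itertools.product(LABELS, repeat=T)
                  if y[1] == 1 and y[2] == 1)

confidence = constrained / Z  # p(Arden Bement = PERSON | x)
print(confidence)
```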
Forward-Backward Confidence Estimation improves accuracy/coverage.
[Plot: accuracy vs. coverage for four curves: optimal; our forward-backward confidence; traditional token-wise confidence; no use of confidence.]
Confidence Estimation Applied
• New word discovery in Chinese word segmentation
  – Improves segmentation accuracy by ~25%. [Peng, Feng, McCallum, COLING 2004]
• Highlighting fields for interactive information extraction
  – After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%. [Kristjansson, Culotta, Viola, McCallum, AAAI 2004] (Honorable Mention Award)
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
1. Jointly labeling cascaded sequences: Factorial CRFs
[Figure: stacked chains over English words: part-of-speech, noun-phrase boundaries, named-entity tag.]
[Sutton, Rohanimanesh, McCallum, ICML 2004]
But errors cascade: you must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire,matching accuracy with only 50% of the training data.
Inference: tree reparameterization BP [Wainwright et al., 2002]
2. Jointly labeling distant mentions: Skip-chain CRFs
Senator Joe Green said today … . Green ran for …
…
[Sutton, McCallum, SRL 2004]
Dependency among similar, distant mentions ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: tree reparameterization BP [Wainwright et al., 2002]
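One way to picture where skip edges come from. This sketch uses the slide's "Green … Green" example and an identical-capitalized-word heuristic; the tokenization and the helper name are illustrative, not the paper's exact rule:

```python
# Skip-chain CRFs add edges between distant mentions of the same word so
# their labels are decided jointly. Here: find the pairs to connect.
def skip_edges(tokens):
    """Pairs (i, j), i < j, of identical capitalized tokens."""
    edges = []
    for i, ti in enumerate(tokens):
        for j in range(i + 1, len(tokens)):
            if ti == tokens[j] and ti[0].isupper():
                edges.append((i, j))
    return edges

tokens = "Senator Joe Green said today . Green ran for office".split()
print(skip_edges(tokens))  # [(2, 6)] -- the two "Green" mentions
```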
3. Joint co-reference among all pairs: Affinity Matrix CRF
[Figure: three mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .") connected by pairwise Y/N coreference variables, with edge affinities 99, 45, 11.]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]
Also known as "entity resolution" or "object correspondence".
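A greedy sketch of clustering by an affinity matrix. This simple merge heuristic is a stand-in for the correlational-clustering partitioning cited above, and the weights are hypothetical (only their signs echo the slide's Mr Powell / Powell / she example):

```python
# Merge clusters whenever their total pairwise affinity is positive.
mentions = ["Mr Powell", "Powell", "she"]
affinity = {("Mr Powell", "Powell"): 99,
            ("Mr Powell", "she"): -45,
            ("Powell", "she"): 11}

def get(a, b):
    return affinity.get((a, b), affinity.get((b, a), 0))

def cluster(mentions):
    clusters = [[m] for m in mentions]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # total affinity between the two candidate clusters
                w = sum(get(a, b) for a in clusters[i] for b in clusters[j])
                if w > 0:
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

print(cluster(mentions))  # [['Mr Powell', 'Powell'], ['she']]
```

Note that "she" stays out even though its affinity to "Powell" alone is positive: once "Mr Powell" and "Powell" are merged, the joint decision over all pairs wins. That is the point of deciding co-reference among all pairs at once.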
Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates) → Paper database with fields, clean, duplicates collapsed:

AUTHORS             TITLE      VENUE
Cowell, Dawid…      Probab…    Springer
Montemerlo, Thrun…  FastSLAM…  AAAI…
Kjaerulff           Approxi…   Technic…
4. Joint segmentation and co-reference
Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y?N)
3) Form canonical database record, resolving conflicts:

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform jointly.
IE + Coreference Model
x: observed citation, e.g. "J Besag 1986 On the…"
s: CRF segmentation, e.g. AUT AUT YR TITL TITL
Citation mention attributes, e.g. AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…"
c: structure for each citation mention ("J Besag 1986 On the…", "Smyth . 2001 Data Mining…", "Smyth , P Data mining…")
Binary coreference variables (y/n) for each pair of mentions
Research paper entity attribute nodes, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…"

Such a highly connected graph makes exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation: messages passed between nodes.
• Generalized Belief Propagation: messages passed between regions.
[Figure: nodes v1 through v9 exchanging messages such as m1(v2), m2(v3), m2(v1), m3(v2).]
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with the size of the overlap between regions!
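A minimal runnable sketch of loopy BP on a single three-node cycle with binary variables. Potentials are hypothetical; each message is a normalized table, matching the description above:

```python
# Loopy belief propagation on a pairwise MRF with one cycle (v1-v2-v3-v1).
# Messages m_{i->j}(x_j) are normalized tables, updated in sweeps.
edges = [(1, 2), (2, 3), (1, 3)]
psi = {e: [[2.0, 1.0], [1.0, 2.0]] for e in edges}  # pairwise potentials

def neighbors(i):
    return [b if a == i else a for (a, b) in edges if i in (a, b)]

msgs = {(i, j): [1.0, 1.0] for (a, b) in edges for (i, j) in ((a, b), (b, a))}

for _ in range(50):  # fixed sweeps; convergence is not guaranteed in general
    new = {}
    for (i, j) in msgs:
        e = (i, j) if (i, j) in psi else (j, i)
        table = psi[e]
        m = [0.0, 0.0]
        for xj in (0, 1):
            for xi in (0, 1):
                p = table[xi][xj] if e == (i, j) else table[xj][xi]
                for k in neighbors(i):   # incoming messages to i, excluding j
                    if k != j:
                        p *= msgs[(k, i)][xi]
                m[xj] += p
        s = m[0] + m[1]
        new[(i, j)] = [m[0] / s, m[1] / s]
    msgs = new

# Approximate marginal (belief) at v1: product of incoming messages, normalized.
b = [1.0, 1.0]
for k in neighbors(1):
    b[0] *= msgs[(k, 1)][0]
    b[1] *= msgs[(k, 1)][1]
s = b[0] + b[1]
belief_v1 = [b[0] / s, b[1] / s]
print(belief_v1)  # symmetric potentials -> [0.5, 0.5]
```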
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]
Cycle through the variables, resetting one at a time to its most probable value with all the others held constant:

v_6^{(i+1)} = \arg\max_{v_6^{(i)}} P(v_6^{(i)} \mid v \setminus v_6^{(i)})

and likewise for v_5, v_4, …
[Figure: nodes v1 through v6; the variable being updated changes while the rest are held constant.]
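ICM in a few lines, on a hypothetical three-variable pairwise model: each step maximizes one variable's conditional with the rest held constant, exactly the update above:

```python
# Iterated Conditional Modes [Besag 1986] on a toy pairwise model.
# Potential tables are hypothetical.
edges = {(0, 1): [[3.0, 1.0], [1.0, 3.0]],
         (1, 2): [[3.0, 1.0], [1.0, 3.0]],
         (0, 2): [[1.0, 2.0], [2.0, 1.0]]}

def local_score(v, assign, value):
    """Unnormalized score of setting variable v = value, others fixed."""
    s = 1.0
    for (a, b), table in edges.items():
        if a == v:
            s *= table[value][assign[b]]
        elif b == v:
            s *= table[assign[a]][value]
    return s

def icm(assign, max_sweeps=10):
    assign = list(assign)
    for _ in range(max_sweeps):
        changed = False
        for v in range(len(assign)):
            best = max((0, 1), key=lambda x: local_score(v, assign, x))
            if best != assign[v]:
                assign[v] = best
                changed = True
        if not changed:  # local optimum: greedy, may not be the global MAP
            break
    return assign

print(icm([0, 1, 0]))
```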
Structured inference scales well here, but it is greedy and easily falls into local minima.
• Iterated Conditional Modes (ICM) [Besag 1986]
• Iterated Conditional Sampling (ICS) (our name): instead of selecting only the argmax, keep a sample of the top values of P(v_4 \mid v \setminus v_4), e.g. an N-best list (the top N values).
Can use a "generalized version" of this, doing exact inference on a region of several nodes at once.
Here, a "message" grows only linearly with overlap region size and N!
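The N-best idea can be sketched independently of the model: keep the top N scoring assignments instead of only the argmax, and pass the whole list onward as the "message". The candidate segmentations and scores below are made up for illustration:

```python
# N-best "message": top N assignments by score, not just the argmax.
def n_best(candidates, score, n=2):
    """Top-n candidates by score: the list passed into the next stage."""
    return sorted(candidates, key=score, reverse=True)[:n]

candidates = ["AUT AUT YR TITL", "AUT TITL YR TITL", "YR AUT AUT TITL"]
scores = {"AUT AUT YR TITL": 0.7, "AUT TITL YR TITL": 0.2, "YR AUT AUT TITL": 0.1}

top2 = n_best(candidates, lambda c: scores[c], n=2)
print(top2)  # ['AUT AUT YR TITL', 'AUT TITL YR TITL']
```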
IE + Coreference Model
Exact inference on the linear-chain regions ("J Besag 1986 On the…", "Smyth . 2001 Data Mining…", "Smyth , P Data mining…").
From each chain, pass an N-best list into coreference.
Approximate inference by graph partitioning, integrating out the uncertainty in samples of extraction.
Make it scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
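A sketch of the canopies idea from [McCallum, Nigam, Ungar 2000]: a cheap similarity (token overlap) builds overlapping groups, and the expensive pairwise coreference runs only within a group. The Jaccard measure and both thresholds here are illustrative choices, not necessarily the paper's exact method:

```python
# Canopy clustering sketch: cheap distance first, expensive comparisons later.
def tokens(s):
    return set(s.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def canopies(items, tight=0.6, loose=0.2):
    remaining = list(items)
    result = []
    while remaining:
        center = remaining[0]
        # every item loosely similar to the center joins this canopy
        result.append([x for x in items
                       if jaccard(tokens(center), tokens(x)) >= loose])
        # items tightly similar to the center never seed another canopy
        remaining = [x for x in remaining
                     if jaccard(tokens(center), tokens(x)) < tight]
    return result

cites = ["Smyth P Data mining",
         "Smyth . 2001 Data Mining",
         "J Besag 1986 On the"]
print(canopies(cites))  # the two Smyth citations share a canopy
```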
Exact (exhaustive) inference over entity attributes.
Then revisit exact inference on the IE linear chain, now conditioned on entity attributes.
Parameter Estimation
Coref graph edge weights: MAP on individual edges.
IE linear chain: exact MAP.
Entity attribute potentials: MAP, pseudo-likelihood.
Estimated separately for the different regions; in all cases, climb the MAP gradient with a quasi-Newton method.
4. Joint segmentation and co-reference
[Figure: observations o, segmentations s, citation attributes c, co-reference decisions y, database field values p; world knowledge.]
[Wellner, McCallum, Peng, Hay, UAI 2004]
Inference: variant of Iterated Conditional Modes [Besag, 1986]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Extraction from and matching of research paper citations:
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
Workplace effectiveness ~ ability to leverage one's network of acquaintances: "the power of your little black book".
But filling Contacts DB by hand is tedious, and incomplete.
One Application Project:
Email Inbox + WWW → Contacts DB, filled automatically.
System Overview
Email → Contact Info and Person Name Extraction (CRF) → names → WWW → Homepage Retrieval → Person Name Extraction → Name Coreference → Social Network Analysis and Keyword Extraction
An Example
To: "Andrew McCallum" [email protected]
Subject ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor's Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network, …
Search for new people
Summary of Results
Contact info and name extraction performance (25 fields):

        Token Acc   Field Prec   Field Recall   Field F1
CRF     94.50       85.73        76.33          80.76

Example keywords extracted:

Person               Keywords
William Cohen        Logic programming; text categorization; data integration; rule learning
Daphne Koller        Bayesian networks; relational models; probabilistic models; hidden variables
Deborah McGuinness   Semantic web; description logics; knowledge representation; ontologies
Tom Mitchell         Machine learning; cognitive states; learning apprentice; artificial intelligence
1. Expert Finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: understand the social structure of your organization. Suggest structural changes for improved efficiency.
Main Application Project:
[Entity-relation diagram: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise.]
Status:
• Spider running. Over 1.5M PDFs in hand.
• Best-in-world published results in IE from research paper headers and references.
• First version of multi-entity co-reference running.
• First version of Web servlet interface up.
• Well-engineered: Java, servlets, SQL, Lucene, SOAP, etc.
• Public launch this Fall.
MALLET: Machine Learning for Language Toolkit
• ~80k lines of Java
• Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
• New package: graphical models and modern inference methods (variational, tree-reparameterization, stochastic sampling, contrastive divergence, …)
• New documentation and interfaces.
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and hundreds of thousands of training examples, as needed for NLP.
Released as Open Source Software. http://mallet.cs.umass.edu
Software Infrastructure: in use at UMass, MIT, CMU, Stanford, Berkeley, UPenn, UT Austin, Purdue, …
• Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.
• An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
• Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004.
• Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004.
• Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004.
• Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI 2004. (Winner of Honorable Mention Award.)
• Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. HLT-NAACL, 2004.
• Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics (COLING 2004), 2004.
• Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.
Publications and Contact Info
http://www.cs.umass.edu/~mccallum
End of Talk