TRANSCRIPT
Unified Models of Information Extraction and Data Mining
with Application to Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Joint work with David Jensen
Knowledge Discovery and Dissemination (KDD) Conference
September 2004
Intelligence Technology Innovation Center (ITIC)
Goal:
Improve the state of the art in our ability to mine actionable knowledge from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
Employer: foodscience.com
JobTitle: Ice Cream Guru
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
Traditional Pipeline
Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Problem:
Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Missing connections: Uncertainty Info (IE → Data Mining) and Emerging Patterns (Data Mining → IE).
Solution:
Spider → Document collection → Probabilistic Model unifying IE (Segment, Classify, Associate, Cluster) and Data Mining (discover patterns: entity types, links/relations, events) → Filter → Prediction, Outlier detection, Decision support → Actionable knowledge
Research & Approach:
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], …
Conditionally-trained undirected graphical models
Complex Inference and Learning: just what we researchers like to sink our teeth into!
Unified Model
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
Types of Uncertainty in Knowledge Discovery from Text
• Confidence that extractor correctly obtained statements the author intended.
• Confidence that what was written is truthful.
  – Author could have had misconceptions…
  – …or have been purposefully trying to mislead.
• Confidence that the emerging, discovered pattern is a reliable fact or generalization.
1. Labeling Sequence Data: Linear-chain CRFs
[Figure: finite state model / graphical model. FSM states y_{t-1}, y_t, y_{t+1}, … over observations x_{t-1}, x_t, x_{t+1}, …; input sequence "said Arden Bement NSF Director …", output sequence "OTHER PERSON PERSON ORG TITLE …"]

Undirected graphical model, trained to maximize the conditional probability of outputs given inputs:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)

where \Phi(\cdot) = \exp\big( \sum_k \lambda_k f_k(\cdot) \big)
Asian word segmentation [COLING '04], [ACL '04]
IE from research papers [HLT '04]
Object classification in images [CVPR '04]
Segmenting tables in textual gov't reports: 85% reduction in error over HMMs.
Noun phrase, named entity [HLT '03], [CoNLL '03]
Protein structure prediction [ICML '04]
IE from bioinformatics text [Bioinformatics '04], …
[Lafferty, McCallum, Pereira 2001]
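The linear-chain CRF distribution above can be checked numerically. The following is a minimal sketch, with a toy chain and hypothetical potential tables (not learned weights), computing Z(x) by brute force rather than the forward algorithm:

```python
import itertools

# Toy linear-chain CRF: 2 labels, length-3 chain. Potential tables are
# hypothetical numbers; Z(x) is computed by brute-force enumeration.
LABELS = (0, 1)
T = 3
phi_y = [[2.0, 0.5], [0.5, 2.0]]               # transition Phi_y(y_t, y_{t-1})
phi_xy = [[1.0, 3.0], [3.0, 1.0], [1.0, 3.0]]  # observation Phi_xy(x_t, y_t), indexed [t][y_t]

def score(y):
    """Unnormalized product of potentials along the chain."""
    s = phi_xy[0][y[0]]
    for t in range(1, T):
        s *= phi_y[y[t]][y[t - 1]] * phi_xy[t][y[t]]
    return s

Z = sum(score(y) for y in itertools.product(LABELS, repeat=T))

def p(y):
    """p(y | x) = score(y) / Z(x)."""
    return score(y) / Z

# Probabilities over all 2^3 label sequences sum to 1.
total = sum(p(y) for y in itertools.product(LABELS, repeat=T))
print(total)
```

In a real implementation Z(x) is computed in O(T·|Y|²) by forward-backward, not enumeration.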
Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]
[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON) over the observations; input sequence "said Arden Bement NSF Director …" with the output sequence above it.]

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)
Constrained Forward-Backward:

p(\text{Arden Bement} = \text{PERSON} \mid x) = \frac{1}{Z(x)} \sum_{y \in C} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)
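A toy illustration of the constrained marginal above: sum the unnormalized scores only over label sequences in the constraint set C, then divide by Z(x). Potentials and the length-4 input are hypothetical; a real implementation uses the constrained forward-backward lattice in O(T·|Y|²) rather than enumeration:

```python
import itertools

# Length-4 toy chain standing in for "said Arden Bement NSF ...".
# Label 1 = PERSON, label 0 = OTHER; potential numbers are hypothetical.
LABELS = (0, 1)
T = 4
phi_y = [[2.0, 0.5], [0.5, 2.0]]
phi_xy = [[1.0, 2.0], [1.0, 4.0], [1.0, 4.0], [2.0, 1.0]]

def score(y):
    s = phi_xy[0][y[0]]
    for t in range(1, T):
        s *= phi_y[y[t]][y[t - 1]] * phi_xy[t][y[t]]
    return s

Z = sum(score(y) for y in itertools.product(LABELS, repeat=T))

# Constraint set C: positions 1 and 2 ("Arden Bement") labeled PERSON.
constrained = sum(score(y) for y in itertools.product(LABELS, repeat=T)
                  if y[1] == 1 and y[2] == 1)

confidence = constrained / Z  # p(Arden Bement = PERSON | x)
print(confidence)
```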
Forward-Backward Confidence Estimation improves accuracy/coverage.
[Plot: accuracy vs. coverage for four curves: optimal; our forward-backward confidence; traditional token-wise confidence; no use of confidence.]
Confidence Estimation Applied
• New word discovery in Chinese word segmentation
  – Improves segmentation accuracy by ~25%. [Peng, Feng, McCallum, COLING 2004]
• Highlighting fields for interactive information extraction
  – After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%. [Kristjansson, Culotta, Viola, McCallum, AAAI 2004] (Honorable Mention Award)
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
1. Jointly labeling cascaded sequences: Factorial CRFs
[Figure: stacked chains over English words: part-of-speech, noun-phrase boundaries, named-entity tag.]
[Sutton, Rohanimanesh, McCallum, ICML 2004]
But errors cascade: you must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire,matching accuracy with only 50% of the training data.
Inference: tree reparameterization BP [Wainwright et al., 2002]
2. Jointly labeling distant mentions: Skip-chain CRFs
Senator Joe Green said today … . Green ran for …
…
[Sutton, McCallum, SRL 2004]
Dependency among similar, distant mentions ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: tree reparameterization BP [Wainwright et al., 2002]
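One way to picture where skip edges come from. This sketch uses the slide's "Green … Green" example and an identical-capitalized-word heuristic; the tokenization and the helper name are illustrative, not the paper's exact rule:

```python
# Skip-chain CRFs add edges between distant mentions of the same word so
# their labels are decided jointly. Here: find the pairs to connect.
def skip_edges(tokens):
    """Pairs (i, j), i < j, of identical capitalized tokens."""
    edges = []
    for i, ti in enumerate(tokens):
        for j in range(i + 1, len(tokens)):
            if ti == tokens[j] and ti[0].isupper():
                edges.append((i, j))
    return edges

tokens = "Senator Joe Green said today . Green ran for office".split()
print(skip_edges(tokens))  # [(2, 6)] -- the two "Green" mentions
```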
3. Joint co-reference among all pairs: Affinity Matrix CRF
[Figure: three mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .") connected by pairwise Y/N coreference variables, with edge affinities 99, 45, 11.]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]
Also known as "entity resolution" or "object correspondence".
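A greedy sketch of clustering by an affinity matrix. This simple merge heuristic is a stand-in for the correlational-clustering partitioning cited above, and the weights are hypothetical (only their signs echo the slide's Mr Powell / Powell / she example):

```python
# Merge clusters whenever their total pairwise affinity is positive.
mentions = ["Mr Powell", "Powell", "she"]
affinity = {("Mr Powell", "Powell"): 99,
            ("Mr Powell", "she"): -45,
            ("Powell", "she"): 11}

def get(a, b):
    return affinity.get((a, b), affinity.get((b, a), 0))

def cluster(mentions):
    clusters = [[m] for m in mentions]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # total affinity between the two candidate clusters
                w = sum(get(a, b) for a in clusters[i] for b in clusters[j])
                if w > 0:
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

print(cluster(mentions))  # [['Mr Powell', 'Powell'], ['she']]
```

Note that "she" stays out even though its affinity to "Powell" alone is positive: once "Mr Powell" and "Powell" are merged, the joint decision over all pairs wins. That is the point of deciding co-reference among all pairs at once.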
Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates) → Paper database with fields, clean, duplicates collapsed:

AUTHORS             TITLE      VENUE
Cowell, Dawid…      Probab…    Springer
Montemerlo, Thrun…  FastSLAM…  AAAI…
Kjaerulff           Approxi…   Technic…
4. Joint segmentation and co-reference
Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y?N)
3) Form canonical database record, resolving conflicts:

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform jointly.
IE + Coreference Model
x: observed citation, e.g. "J Besag 1986 On the…"
s: CRF segmentation, e.g. AUT AUT YR TITL TITL
Citation mention attributes, e.g. AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…"
c: structure for each citation mention ("J Besag 1986 On the…", "Smyth . 2001 Data Mining…", "Smyth , P Data mining…")
Binary coreference variables (y/n) for each pair of mentions
Research paper entity attribute nodes, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…"

Such a highly connected graph makes exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation: messages passed between nodes.
• Generalized Belief Propagation: messages passed between regions.
[Figure: nodes v1 through v9 exchanging messages such as m1(v2), m2(v3), m2(v1), m3(v2).]
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with the size of the overlap between regions!
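A minimal runnable sketch of loopy BP on a single three-node cycle with binary variables. Potentials are hypothetical; each message is a normalized table, matching the description above:

```python
# Loopy belief propagation on a pairwise MRF with one cycle (v1-v2-v3-v1).
# Messages m_{i->j}(x_j) are normalized tables, updated in sweeps.
edges = [(1, 2), (2, 3), (1, 3)]
psi = {e: [[2.0, 1.0], [1.0, 2.0]] for e in edges}  # pairwise potentials

def neighbors(i):
    return [b if a == i else a for (a, b) in edges if i in (a, b)]

msgs = {(i, j): [1.0, 1.0] for (a, b) in edges for (i, j) in ((a, b), (b, a))}

for _ in range(50):  # fixed sweeps; convergence is not guaranteed in general
    new = {}
    for (i, j) in msgs:
        e = (i, j) if (i, j) in psi else (j, i)
        table = psi[e]
        m = [0.0, 0.0]
        for xj in (0, 1):
            for xi in (0, 1):
                p = table[xi][xj] if e == (i, j) else table[xj][xi]
                for k in neighbors(i):   # incoming messages to i, excluding j
                    if k != j:
                        p *= msgs[(k, i)][xi]
                m[xj] += p
        s = m[0] + m[1]
        new[(i, j)] = [m[0] / s, m[1] / s]
    msgs = new

# Approximate marginal (belief) at v1: product of incoming messages, normalized.
b = [1.0, 1.0]
for k in neighbors(1):
    b[0] *= msgs[(k, 1)][0]
    b[1] *= msgs[(k, 1)][1]
s = b[0] + b[1]
belief_v1 = [b[0] / s, b[1] / s]
print(belief_v1)  # symmetric potentials -> [0.5, 0.5]
```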
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]
Cycle through the variables, resetting one at a time to its most probable value with all the others held constant:

v_6^{(i+1)} = \arg\max_{v_6^{(i)}} P(v_6^{(i)} \mid v \setminus v_6^{(i)})

and likewise for v_5, v_4, …
[Figure: nodes v1 through v6; the variable being updated changes while the rest are held constant.]
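ICM in a few lines, on a hypothetical three-variable pairwise model: each step maximizes one variable's conditional with the rest held constant, exactly the update above:

```python
# Iterated Conditional Modes [Besag 1986] on a toy pairwise model.
# Potential tables are hypothetical.
edges = {(0, 1): [[3.0, 1.0], [1.0, 3.0]],
         (1, 2): [[3.0, 1.0], [1.0, 3.0]],
         (0, 2): [[1.0, 2.0], [2.0, 1.0]]}

def local_score(v, assign, value):
    """Unnormalized score of setting variable v = value, others fixed."""
    s = 1.0
    for (a, b), table in edges.items():
        if a == v:
            s *= table[value][assign[b]]
        elif b == v:
            s *= table[assign[a]][value]
    return s

def icm(assign, max_sweeps=10):
    assign = list(assign)
    for _ in range(max_sweeps):
        changed = False
        for v in range(len(assign)):
            best = max((0, 1), key=lambda x: local_score(v, assign, x))
            if best != assign[v]:
                assign[v] = best
                changed = True
        if not changed:  # local optimum: greedy, may not be the global MAP
            break
    return assign

print(icm([0, 1, 0]))
```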
Structured inference scales well here, but it is greedy and easily falls into local minima.
• Iterated Conditional Modes (ICM) [Besag 1986]
• Iterated Conditional Sampling (ICS) (our name): instead of selecting only the argmax, keep a sample of the top values of P(v_4 \mid v \setminus v_4), e.g. an N-best list (the top N values).
Can use a "generalized version" of this, doing exact inference on a region of several nodes at once.
Here, a "message" grows only linearly with overlap region size and N!
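The N-best idea can be sketched independently of the model: keep the top N scoring assignments instead of only the argmax, and pass the whole list onward as the "message". The candidate segmentations and scores below are made up for illustration:

```python
# N-best "message": top N assignments by score, not just the argmax.
def n_best(candidates, score, n=2):
    """Top-n candidates by score: the list passed into the next stage."""
    return sorted(candidates, key=score, reverse=True)[:n]

candidates = ["AUT AUT YR TITL", "AUT TITL YR TITL", "YR AUT AUT TITL"]
scores = {"AUT AUT YR TITL": 0.7, "AUT TITL YR TITL": 0.2, "YR AUT AUT TITL": 0.1}

top2 = n_best(candidates, lambda c: scores[c], n=2)
print(top2)  # ['AUT AUT YR TITL', 'AUT TITL YR TITL']
```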
IE + Coreference Model
Exact inference on the linear-chain regions ("J Besag 1986 On the…", "Smyth . 2001 Data Mining…", "Smyth , P Data mining…").
From each chain, pass an N-best list into coreference.
Approximate inference by graph partitioning, integrating out the uncertainty in samples of extraction.
Make it scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
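A sketch of the canopies idea from [McCallum, Nigam, Ungar 2000]: a cheap similarity (token overlap) builds overlapping groups, and the expensive pairwise coreference runs only within a group. The Jaccard measure and both thresholds here are illustrative choices, not necessarily the paper's exact method:

```python
# Canopy clustering sketch: cheap distance first, expensive comparisons later.
def tokens(s):
    return set(s.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def canopies(items, tight=0.6, loose=0.2):
    remaining = list(items)
    result = []
    while remaining:
        center = remaining[0]
        # every item loosely similar to the center joins this canopy
        result.append([x for x in items
                       if jaccard(tokens(center), tokens(x)) >= loose])
        # items tightly similar to the center never seed another canopy
        remaining = [x for x in remaining
                     if jaccard(tokens(center), tokens(x)) < tight]
    return result

cites = ["Smyth P Data mining",
         "Smyth . 2001 Data Mining",
         "J Besag 1986 On the"]
print(canopies(cites))  # the two Smyth citations share a canopy
```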
Exact (exhaustive) inference over entity attributes.
Then revisit exact inference on the IE linear chain, now conditioned on entity attributes.
Parameter Estimation
Coref graph edge weights: MAP on individual edges.
IE linear chain: exact MAP.
Entity attribute potentials: MAP, pseudo-likelihood.
Estimated separately for the different regions; in all cases, climb the MAP gradient with a quasi-Newton method.
4. Joint segmentation and co-reference
[Figure: observations o, segmentations s, citation attributes c, co-reference decisions y, database field values p; world knowledge.]
[Wellner, McCallum, Peng, Hay, UAI 2004]
Inference: variant of Iterated Conditional Modes [Besag, 1986]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Extraction from and matching of research paper citations:
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
  – Formally justified as marginalization in graphical models
  – Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE
• Joint inference, with efficient methods
  – Multiple, cascaded label sequences (Factorial CRFs)
  – Multiple distant, but related mentions (Skip-chain CRFs)
  – Multiple co-reference decisions (Affinity Matrix CRF)
  – Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
  – Social network analysis from Email and the Web
  – A new portal: research, people, connections.
Workplace effectiveness ~ ability to leverage one's network of acquaintances: "the power of your little black book".
But filling Contacts DB by hand is tedious, and incomplete.
One Application Project:
Email Inbox + WWW → Contacts DB, filled automatically.
System Overview
Email → Contact Info and Person Name Extraction (CRF) → names → WWW → Homepage Retrieval → Person Name Extraction → Name Coreference → Social Network Analysis and Keyword Extraction
An Example
To: "Andrew McCallum" [email protected]
Subject ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor's Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, …
Key Words: Information extraction, social network, …
Search for new people
Summary of Results
Contact info and name extraction performance (25 fields):

        Token Acc   Field Prec   Field Recall   Field F1
CRF     94.50       85.73        76.33          80.76

Example keywords extracted:

Person               Keywords
William Cohen        Logic programming; text categorization; data integration; rule learning
Daphne Koller        Bayesian networks; relational models; probabilistic models; hidden variables
Deborah McGuinness   Semantic web; description logics; knowledge representation; ontologies
Tom Mitchell         Machine learning; cognitive states; learning apprentice; artificial intelligence
1. Expert Finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: understand the social structure of your organization. Suggest structural changes for improved efficiency.
Main Application Project:
[Entity-relation diagram: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise.]
Status:
• Spider running. Over 1.5M PDFs in hand.
• Best-in-world published results in IE from research paper headers and references.
• First version of multi-entity co-reference running.
• First version of Web servlet interface up.
• Well-engineered: Java, servlets, SQL, Lucene, SOAP, etc.
• Public launch this Fall.
MALLET: Machine Learning for Language Toolkit
• ~80k lines of Java
• Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
• New package: graphical models and modern inference methods (variational, tree-reparameterization, stochastic sampling, contrastive divergence, …)
• New documentation and interfaces.
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and hundreds of thousands of training examples, as needed for NLP.
Released as Open Source Software. http://mallet.cs.umass.edu
Software Infrastructure: in use at UMass, MIT, CMU, Stanford, Berkeley, UPenn, UT Austin, Purdue, …
• Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.
• An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
• Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004.
• Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004.
• Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004.
• Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI 2004. (Winner of Honorable Mention Award.)
• Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. HLT-NAACL, 2004.
• Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics (COLING 2004), 2004.
• Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.
Publications and Contact Info
http://www.cs.umass.edu/~mccallum
End of Talk