Mining Text and Web Data (Simon Fraser University, 2006-02-28)
CMPT 843, SFU, Martin Ester, 1-06 119
Text Mining
Outline [Feldman 2003]
Introduction
Information Extraction
The ML Approach to Information Extraction
The KE Approach to Information Extraction
Mining the extracted information
The Web and Web Search

[Figure: web search architecture. A crawler fetches documents from web servers into a repository on a storage server. An indexer builds an inverted index over the repository, and clustering / classification modules build a topic hierarchy (Root, with subtopics such as Business, News, Science, Computers, Automobiles, Plants, Animals). Search queries are answered from the inverted index; e.g., the ambiguous term "jaguar" matches both "The jaguar, a cat, can run at speeds reaching 50 mph" and "The jaguar has a 4 liter engine".]
Search Engines

Keyword Search: query “data mining” returns 519,000 results
Directory Services

Browsing: category “data mining” contains ~ 200 results
Mining Text and Web Data: Text Representation
Preprocessing
• remove HTML tags, punctuation etc.
• define terms single-word / multi-word terms
• remove stopwords
• perform stemming
• count term frequencies
• some words are more important than others
  ⇒ smooth the frequencies, e.g. weight by inverse document frequency
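The preprocessing steps above can be sketched in Python. The stopword list, the crude suffix stemmer, and the exact IDF weighting formula below are illustrative assumptions, not methods prescribed by the slides (a real system would use a full stopword list and, e.g., the Porter stemmer):

```python
import math
import re
from collections import Counter

# tiny illustrative stopword list (assumption; real lists are much larger)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "at", "is", "can"}

def preprocess(text):
    """Strip punctuation, lowercase, drop stopwords, apply a crude suffix stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())        # remove punctuation / tags
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    stems = []
    for t in tokens:                                    # toy stemming step
        for suf in ("ing", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

def tf_idf(docs):
    """Count term frequencies, then weight them by inverse document frequency."""
    term_counts = [Counter(preprocess(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()                                      # document frequency per term
    for tc in term_counts:
        df.update(tc.keys())
    return [
        {t: n * math.log(1 + n_docs / df[t]) for t, n in tc.items()}
        for tc in term_counts
    ]
```

Terms occurring in every document (like "jaguar" in the example above) receive a lower weight than terms that discriminate between documents.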
Mining Text and Web Data: Text Representation
Transformation
• Different definitions of the normalized term frequency TF(d, t)
n(d,t): number of occurrences of term t in document d
• select “significant” subset of all occurring terms
• Vocabulary V, term t_i: document d represented as (Bag of Words Model)

  rep(d) = { n(d, t_i) },  t_i ∈ V

  ⇒ most n's are zeroes for a single document

• Normalized term frequency, e.g.

  TF(d, t) = n(d, t) / Σ_τ n(d, τ)   or   TF(d, t) = n(d, t) / max_τ n(d, τ)
Mining Text and Web Data: Text Representation
Similarity Function

Cosine Similarity:

  similarity(d1, d2) = rep(d1) · rep(d2) / ( |rep(d1)| · |rep(d2)| ),   where · denotes the inner product

[Figure: two documents as vectors in the term space spanned by "data" and "mining"; a small angle between the vectors means similar, a large angle dissimilar.]
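A minimal sketch of cosine similarity over sparse term-weight dictionaries (the dictionary representation is an assumption; the slide does not fix a data structure):

```python
import math

def cosine_similarity(rep1, rep2):
    """similarity(d1, d2) = rep(d1) . rep(d2) / (|rep(d1)| |rep(d2)|),
    with documents represented as sparse {term: weight} dicts."""
    dot = sum(w * rep2.get(t, 0.0) for t, w in rep1.items())   # inner product
    norm1 = math.sqrt(sum(w * w for w in rep1.values()))
    norm2 = math.sqrt(sum(w * w for w in rep2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# identical vectors are maximally similar, disjoint ones dissimilar
d1 = {"data": 2.0, "mining": 1.0}
d2 = {"data": 2.0, "mining": 1.0}
d3 = {"document": 3.0}
```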
Mining Text and Web Data: Introduction

Shortcomings of the Current Methods

Low Precision
• Thousands of irrelevant documents returned in response to a search query
• 99% of the information is of no interest to 99% of the people

Low Recall
• In particular for directory services (due to manual acquisition)
• Even the largest crawlers cover less than 50% of all web pages

No information / knowledge
• Results are documents
• The user still has to read the documents to obtain information / knowledge
Mining Text and Web Data: Text Mining

Overview

Step 1: Information extraction
• Automatically extract information from individual documents
• Entities, relationships, events, . . .
⇒ Natural Language Processing

Step 2: Mine the extracted information
• Aggregate over the information of an entire document collection
• To find patterns, trends, regularities
⇒ Link Analysis
⇒ Multi-relational data mining
Mining Text and Web Data: Text Representation
Overview
• Words
• Linguistic phrases
• Role annotation
• Parse trees
Mining Text and Web Data: Text Representation

Example: multi-layer annotation of the sentence
"By making translational fusions of LcnC to the reporter proteins galactosidase (LacZ) and alkaline phosphatase (PhoA*), it was shown that both the N- and C-terminal parts of LcnC are located in the cytoplasm."

Annotation layers (one column per word):
• WORD: word in the text
• BASE: base form of the word
• NAMED_ENTITY: entity type (e.g., LacZ ⇒ GENE, PhoA* ⇒ PROTEIN)
• POS: part-of-speech tag
• WORD_SENSE: ID of the word sense in WordNet
• CONCEPT_IDs / SUPER_CONCEPT_IDs: concept IDs from lexical resources
• FULL_PARSE: full syntactic parsing results
• PREDICATE_ARGUMENTS: semantic role labels

Sample rows (WORD / BASE / NAMED_ENTITY / POS):
  making / make / * / VBG
  fusions / fusion / * / NNS
  LcnC / lcnc / BACTERIA / NN
  LacZ / lacz / GENE / NN
  cytoplasm / cytoplasm / ORGANISM / NN
Mining Text and Web Data: Information Extraction

Introduction

• Entity: an object of interest such as a person, city, company, protein, . . . ⇒ accuracy 90 – 98 %
• Attribute: a property of an entity such as age or population ⇒ accuracy ~ 80 %
• Relationship (fact): a relationship (association) between two or more entities, such as a company headquartered in a city ⇒ accuracy 60 – 70 %
• Event: an activity involving several entities, such as a management change or an earthquake ⇒ accuracy 50 – 60 %
Mining Text and Web Data: Information Extraction
Approaches
• Knowledge Engineering (KE) approach
Rules formulated by linguists together with domain experts
Based on a set of relevant documents with example information
• Machine Learning (ML) approach
Statistical learning with little / no linguistic knowledge
Learn automatically from a corpus of annotated documents
• Hybrid approach
utilize user input in the development loop
Mining Text and Web Data: The ML Approach to Information Extraction

Overview

• Entity extraction as a classification problem
• Classes: one per named entity class that we want to extract, plus one „no-name“ class
• Hidden Markov Models (HMMs): one of the most popular classifiers for sequence data
• An HMM is a finite state automaton with probabilistic state transitions and symbol emissions
⇒ probabilistic generative process
Mining Text and Web Data: The ML Approach to Information Extraction
Markov Models
• Markov chains
symbol in a sequence depends only on its preceding symbol(s)
can be used for classification
[Deshpande & Karypis 2002]
• Hidden Markov Models
symbol in a sequence depends on a hidden state
state depends on preceding state
Mining Text and Web Data: The ML Approach to Information Extraction
1-order Markov Chains
• For each class, determine the conditional probabilities P(si|sj)
for each pair of symbols si and sj
• For each class ci, calculate the probability P(s| ci)
of observing the given sequence
• Choose the class with the highest likelihood
• Decision function for two classes (+ and -)
  s = s_1 s_2 . . . s_L

  P(s | c_i) = P(s_1 | c_i) · P(s_2 | s_1, c_i) · . . . · P(s_L | s_{L-1}, c_i)

  f(s) = Σ_{i=1}^{L} log [ P(s_i | s_{i-1}, +) / P(s_i | s_{i-1}, -) ]
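The decision function above can be sketched as follows. The add-one smoothing of the estimated transition probabilities is an assumption to avoid zero probabilities, not part of the slide:

```python
import math
from collections import defaultdict

def transition_probs(sequences, alphabet):
    """Estimate P(s_i | s_{i-1}) for one class from its training sequences,
    with add-one smoothing (an assumption; the slide leaves estimation open)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        probs[prev] = {c: (counts[prev][c] + 1) / total for c in alphabet}
    return probs

def decision(seq, probs_pos, probs_neg):
    """f(s) = sum_i log[ P(s_i|s_{i-1}, +) / P(s_i|s_{i-1}, -) ];
    f(s) > 0 predicts class +, f(s) < 0 class -."""
    return sum(
        math.log(probs_pos[p][c] / probs_neg[p][c])
        for p, c in zip(seq, seq[1:])
    )
```

Trained on alternating sequences for class + and runs of identical symbols for class -, the sign of f(s) separates the two kinds of sequences.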
Mining Text and Web Data: The ML Approach to Information Extraction
Higher-order Markov Chains
Idea
• k-order Markov chain:
symbol in a sequence depends only on its k preceding symbols
Discussion
• in general: higher classification accuracy than 1-order Markov chains
• but
exponential number of transition probabilities
hard to estimate probabilities
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Symbol in a sequence depends on a hidden (unobserved) state
state depends on preceding state
all dependencies are probabilistic
• Special states: initial state and final state
• Pattern is “detected” if it transforms initial state into final state
• HMMs are more compact than Markov chains
single state can represent many subsequences that lead to this state
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Goal: distinguish patterns from background in a sequence
motifs in molecular biology
• Hidden Markov Model (HMM)
generative process for patterns of length L with
consensus pattern (motif)
noise level ε
frequency F
• Hidden states: one for each position of the pattern, one for the background
determines the next symbol to be generated (multinomial distribution)
determines the next state (transition probabilities)
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Background state: probability of each symbol = its frequency in the background
• Pattern states P_i, 1 ≤ i ≤ L
  symbol at position i of the consensus pattern: probability 1 − (|Σ| − 1) · ε (so the emission probabilities sum to one over the alphabet Σ)
  other symbols: probability ε
• Example (consensus pattern ABBD, uniform background)

[Figure: background state B and pattern states P1 – P4. B emits A, B, C, D with probability 0.25 each, loops to itself with probability 0.99, and enters P1 with probability 0.01; each P_i emits its consensus symbol (A, B, B, D) with probability 0.9 and moves to the next state with probability 1.0.]
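The emission model of the pattern states can be sketched as follows; the alphabet {A, B, C, D} and the value of ε follow the example above, while the function names are illustrative:

```python
import math

ALPHABET = "ABCD"

def emission(consensus_symbol, observed, eps):
    """A pattern state emits its consensus symbol with probability
    1 - (k - 1) * eps and every other symbol with probability eps,
    where k is the alphabet size."""
    k = len(ALPHABET)
    return 1 - (k - 1) * eps if observed == consensus_symbol else eps

def pattern_path_prob(consensus, seq, eps):
    """Probability that the pattern states P1..PL emit seq
    (their transitions all have probability 1.0)."""
    return math.prod(emission(c, s, eps) for c, s in zip(consensus, seq))
```

With ε = 0.1/3, each pattern state emits its consensus symbol with probability 0.9, as in the figure, so the exact motif ABBD is generated with probability 0.9^4.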
Mining Text and Web Data: The ML Approach to Information Extraction
Discussion
+ HMMs do not require a linguistic expert to formulate rules
+ Model is language independent
- Needs a large annotated (labeled) corpus
- In general, not as accurate as the KE approach
Mining Text and Web Data: The KE Approach to Information Extraction
Overview
• Rules formulated by linguists together with domain experts
• Rules are sequential patterns consisting of
o string constants and
o variables representing instances from certain entity types
Ex.: Possible_Company FOLLOWED_BY „fired“
FOLLOWED_BY Possible_Person
• Possibly, additional constraints that the matching pattern
in some document must satisfy
Ex.: matches of Company and Person not more than X characters apart
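A rule of this kind might be approximated with regular expressions. The capitalization-based Possible_Company / Possible_Person patterns and the MAX_GAP value below are hypothetical stand-ins for real gazetteer-based matchers, not the slide's actual rule language:

```python
import re

MAX_GAP = 40  # hypothetical distance constraint (characters)

# crude stand-ins: a company is one capitalized token (optionally "Inc."/"Corp."),
# a person is two capitalized tokens
COMPANY = r"(?P<company>[A-Z][A-Za-z]+(?: Inc\.| Corp\.)?)"
PERSON = r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
RULE = re.compile(COMPANY + r"\s+fired\s+" + PERSON)

def extract_fired_events(text):
    """Apply the pattern Possible_Company FOLLOWED_BY 'fired'
    FOLLOWED_BY Possible_Person, with a distance constraint."""
    events = []
    for m in RULE.finditer(text):
        # additional constraint: matches not more than MAX_GAP characters apart
        if m.start("person") - m.end("company") <= MAX_GAP:
            events.append((m.group("company"), m.group("person")))
    return events
```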
Mining Text and Web Data: The KE Approach to Information Extraction
Challenges
• Need to consider semantic similarity between words
Ex.: „fired“ or a similar word / word class
Microsoft laid off Steve Thompson . . .
• Need to consider the part of speech of a word
Ex.: the function of „fired“ in the sentence must be a verb
Alabama Power‘s new gas-fired electric generating
facility at Plant Barry.
Mining Text and Web Data: The KE Approach to Information Extraction
Challenges
• Need to deal with co-references
i.e. referential relations between expressions
• Simple version of co-references: noun phrases referencing
the same entity
Ex.: „The Giant Computer Manufacturer“, „The Company“,
„The owner of over 600,000 patents“
• Even simpler version: proper names referencing the same entity
Ex.: „The President“, „George Bush“, „George W. Bush“
CMPT 843, SFU, Martin Ester, 1-06 143
Mining Text and Web Data: The KE Approach to Information Extraction

Approach for Co-references
• Mark each noun phrase with entity type, singular / plural, gender, . . .
• Distinguish scopes of different types of noun phrases
proper names: whole document
definite clause: preceding paragraph
pronoun: previous sentence
• Use this information to filter out wrong matches
Ex.: „George Bush“ does not match „she“, „they“,
„The Company“
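The agreement filter can be sketched as follows; the hand-written feature dictionary is a hypothetical stand-in for the marking step described above:

```python
# Hypothetical features; a real system would derive entity type,
# number and gender from the noun-phrase marking step.
FEATURES = {
    "George Bush": {"type": "PERSON", "number": "sg", "gender": "m"},
    "she":         {"type": "PERSON", "number": "sg", "gender": "f"},
    "they":        {"type": "any",    "number": "pl", "gender": "any"},
    "The Company": {"type": "ORG",    "number": "sg", "gender": "n"},
}

def may_corefer(a, b):
    """Two expressions may co-refer only if entity type, number and
    gender are compatible ('any' matches everything)."""
    fa, fb = FEATURES[a], FEATURES[b]
    return all(
        fa[f] == fb[f] or "any" in (fa[f], fb[f])
        for f in ("type", "number", "gender")
    )
```

The filter rejects exactly the wrong matches from the example: „George Bush“ cannot co-refer with „she“ (gender), „they“ (number), or „The Company“ (entity type).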
Mining Text and Web Data: The KE Approach to Information Extraction
Discussion
+ Does not require a large labeled corpus
+ In general, more accurate than the ML approach
- Requires substantial linguistic and domain expertise to formulate the extraction rules
- Method is application-specific: need a new rule set for every new application
Mining Text and Web Data: The KE Approach to Information Extraction
Bootstrapping Methods
• Provide some labeled / annotated documents
Very time consuming
e.g., 8 hours for 160 documents
• Alternatively, provide examples of entity / relationship types
In general, easier to provide
e.g., 100 example cities or proteins
• Learn extraction rules automatically
• Apply these extraction rules to detect further instances
of the specified entity / relationship type
Mining Text and Web Data: The KE Approach to Information Extraction
Bootstrapping Methods
• Snowball [Agichtein et al, 2000]
  User provides instances of the entity type (training data)
  System retrieves webpages containing these instances and determines textual patterns in their proximity
  Evaluates the precision of these patterns using the training data
  Applies the patterns to retrieve further instances
• Know-It-All [Etzioni O. et al, 2005]
  System applies generic patterns to retrieve some instances from webpages
  Uses occurrences of these instances to discover patterns specific to the given entity / relationship type
  ⇒ fully unsupervised
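A much-simplified sketch of such a bootstrapping loop; the pattern-precision evaluation used by Snowball is omitted, and the corpus and seed values are illustrative:

```python
import re

def bootstrap(corpus, seed_pairs, rounds=2):
    """Seed instances -> textual patterns around their occurrences
    -> new instances, repeated. Heavily simplified: no pattern scoring."""
    known = set(seed_pairs)
    for _ in range(rounds):
        # 1. collect the text between the two entities of each known pair
        patterns = set()
        for e1, e2 in known:
            for sent in corpus:
                if e1 in sent and e2 in sent:
                    between = sent.split(e1, 1)[1].split(e2, 1)[0]
                    patterns.add(between)
        # 2. apply the patterns to retrieve further instance pairs
        for sent in corpus:
            for p in patterns:
                m = re.search(
                    r"([A-Z][A-Za-z]+)" + re.escape(p) + r"([A-Z][A-Za-z]+)",
                    sent,
                )
                if m:
                    known.add((m.group(1), m.group(2)))
    return known

sentences = [
    "Microsoft is headquartered in Redmond.",
    "Google is headquartered in MountainView.",
]
```

Starting from the single seed pair (Microsoft, Redmond), the extracted pattern " is headquartered in " retrieves the new pair (Google, MountainView).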
Mining Text and Web Data: Mining the Extracted Information
Link Analysis
• Objective: detection of relationships between entities that would otherwise be hidden by the mass of data
• Data sources: criminal networks, customer networks, scientific networks, biological graph structures, document graphs, . . .
• Detection of central nodes in a network
  e.g., identification of network vulnerabilities or targeting for sales campaigns
• Graph-based data mining
  e.g., detection of communication patterns that discriminate between threat and non-threat groups
• Summarization of graph structures
  e.g., document summarization
Mining Text and Web Data: Mining the Extracted Information
Centrality of Nodes
• Degree: number of nodes to which it is directly linked
• Betweenness: number of shortest paths between two other nodes which pass through it
• Radius: maximum of the minimum path lengths to other nodes
• Point strength: increase in the number of maximal connected subcomponents after removal of the node
• Busyness: amount of information transmitted via the node
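The two simplest measures above, degree and radius, can be computed directly on an adjacency-list graph (the dict-of-lists representation is an assumption):

```python
from collections import deque

def degree(graph, v):
    """Degree: number of nodes to which v is directly linked."""
    return len(graph[v])

def radius(graph, v):
    """Radius of v: the maximum over the minimum path lengths
    (BFS distances) from v to every other reachable node."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # breadth-first search from v
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# star graph: "c" is the central node
star = {"c": ["a", "b", "d"], "a": ["c"], "b": ["c"], "d": ["c"]}
```

In the star graph the center has the highest degree and the smallest radius, matching the intuition that central nodes reach everyone quickly.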
Mining Text and Web Data: Mining the Extracted Information
Link Analysis Queries
• Who is central in the organization?
• What role(s) does this individual play in the organization?
• Which three individuals’ removal would harm this drug-supply
network the most?
• What communication channels within a terrorist network are
worth monitoring?
• What significant changes in the operation of an organization have
taken place over a given period of time?
Mining Text and Web Data: Mining the Extracted Information
Graph-Based Data Mining
• Task: detection of frequent subgraphs
• Challenge:
graph-isomorphisms need to be considered,
test for graph-isomorphisms is NP-hard,
huge number of candidate patterns
• Modified task: detection of subgraphs that compress the input
graph well (Subdue) [Cook and Holder 2000]
takes a labeled graph,
uses Minimum Description Length to measure the degree
of compression
Mining Text and Web Data: Mining the Extracted Information
Graph-Based Data Mining [Mukherjee and Holder 2004]
• Task: detection of communication patterns that discriminate
between threat and non-threat groups
• Extension of Subdue to incorporate negative examples
• Discovery of advanced pattern types
cliques, K-plexes, K-cores
Mining Text and Web Data: Mining the Extracted Information
Document Summarization [Leskovec et al, 2004]
• Task: summarization of documents,
based on a training set of documents and their summaries
• Approach
extract entities and their relationships from documents
(graph structure)
use training dataset to identify relevant subgraphs
apply classifier to summarize test documents
References
Agichtein E., Gravano L.: "Snowball: Extracting Relations from Large Plain-Text Collections", Proceedings of the 5th ACM International Conference on
Digital Libraries (DL), 2000.
Cook D.J., Holder L.B.: “Graph-Based Data Mining”, IEEE Intelligent Systems, Vol. 15, No. 2, 2000.
Deshpande M., Karypis G.: “Evaluation of Techniques for Classifying Biological Sequences”, PAKDD 2002.
Etzioni O. et al: "Unsupervised Named-Entity Extraction from the Web: An
Experimental Study", Artificial Intelligence, 2005.
Feldman R.: “Information Extraction: Theory and Practice”, Tutorial ICDM 2003.
Leskovec J., Grobelnik M., Milic-Frayling N.: "Learning Sub-structures of Document Semantic Graphs for Document Summarization", Proc. Workshop LinkKDD, 2004.
Mukherjee M., Holder L.B.: “Graph-Based Data Mining on Social Networks”, Proc. Workshop LinkKDD, 2004.