Page 1

Text Mining

Outline [Feldman 2003]

Introduction

Information Extraction

The ML Approach to Information Extraction

The KE Approach to Information Extraction

Mining the extracted information

Page 2

The Web and Web Search

[Figure: architecture of a Web search engine. A crawler fetches documents from web servers into a repository on a storage server. An indexer builds an inverted index over the documents, and clustering / classification organizes them into a topic hierarchy (Root → Business, News, Science, … → Computers, Automobiles, Plants, Animals). Search queries are answered from the inverted index. Example of ambiguity: "The jaguar, a cat, can run at speeds reaching 50 mph" vs. "The jaguar has a 4 liter engine" — both documents are indexed under the term "jaguar".]

Page 3

Search Engines

Keyword Search

“data mining”

519,000 results

Page 4

Directory Services

Browsing

“data mining”

~ 200 results

Page 5

Mining Text and Web Data
Text Representation

Preprocessing

• remove HTML tags, punctuation, etc.

• define terms: single-word / multi-word terms

• remove stopwords

• perform stemming

• count term frequencies

• some words are more important than others: smooth the frequencies, e.g., weight by inverse document frequency (a preprocessing sketch follows below)
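
A minimal preprocessing sketch in Python (not part of the original slides). The tiny stopword list and the crude suffix-stripping stemmer are illustrative placeholders; a real pipeline would use a full stopword list and a proper stemmer such as Porter's.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}   # tiny illustrative list

def simple_stem(term):
    # crude suffix stripping, a placeholder for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)                 # remove HTML tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # lowercase, strip punctuation
    terms = [simple_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)                                # term frequencies n(d, t)

print(preprocess("<p>The jaguar has a 4 liter engine.</p>"))
# Counter({'jaguar': 1, 'has': 1, '4': 1, 'liter': 1, 'engine': 1})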

Page 6

Mining Text and Web Data
Text Representation

Transformation

• Different definitions of term frequency and inverse document frequency

  n(d,t): number of occurrences of term t in document d

  common normalizations of the term frequency, e.g.
  n(d,t),   n(d,t) / Σ_τ n(d,τ),   n(d,t) / max_τ n(d,τ)

• select a "significant" subset of all occurring terms

• Vocabulary V, terms t_i; document d represented as

  rep(d) = { n(d, t_i) }, t_i ∈ V      (Bag of Words Model)

  → most of the n(d, t_i) are zero for a single document
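
Continuing the sketch from above (again not part of the original slides): the term counts are turned into a max-normalized term-frequency vector over a fixed vocabulary V and optionally weighted by an inverse document frequency of the form log(|D| / df(t)); the vocabulary and the toy documents are made up for illustration.

import math
from collections import Counter

def bow_vector(counts, vocabulary):
    # max-normalized term frequencies: n(d,t) / max_tau n(d,tau)
    max_n = max(counts.values()) if counts else 1
    return [counts.get(t, 0) / max_n for t in vocabulary]

def idf(term, documents):
    # inverse document frequency: log(|D| / |{d : n(d, term) > 0}|)
    df = sum(1 for d in documents if d.get(term, 0) > 0)
    return math.log(len(documents) / df) if df else 0.0

docs = [Counter({"jaguar": 2, "cat": 1}), Counter({"jaguar": 1, "engine": 3})]   # toy collection
vocab = ["jaguar", "cat", "engine"]
vectors = [[tf * idf(t, docs) for tf, t in zip(bow_vector(d, vocab), vocab)] for d in docs]
print(vectors)   # "jaguar" occurs in every document, so its weight drops to 0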

Page 7

Mining Text and Web Data
Text Representation

Similarity Function

Cosine Similarity

  similarity(d1, d2) = innerproduct( rep(d1), rep(d2) ) / ( |rep(d1)| · |rep(d2)| )

  i.e. the cosine of the angle between the two document vectors

[Figure: document vectors in the term space spanned by "data", "mining" and "document"; similar documents enclose a small angle, dissimilar documents a large one.]
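
A minimal sketch of the cosine similarity between two such term-weight vectors (not part of the original slides; the two example vectors are made up):

import math

def cosine_similarity(v1, v2):
    # inner product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity([1.0, 0.5, 0.0], [0.33, 0.0, 1.0]))   # ~0.28; near-parallel vectors would give a value close to 1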

Page 8

Mining Text and Web Data
Introduction

Shortcomings of the Current Methods

Low Precision
• Thousands of irrelevant documents returned in response to a search query
• 99% of information of no interest to 99% of people

Low Recall
• In particular, for directory services (due to manual acquisition)
• Even the largest crawlers cover less than 50% of all web pages

No information / knowledge
• Results are documents
• User still has to read the documents to obtain information / knowledge

Page 9

Mining Text and Web Data
Text Mining

Overview

Step 1: Information extraction
• Automatically extract information from individual documents
• Entities, relationships, events, ...
→ Natural Language Processing

Step 2: Mine the extracted information
• Aggregate over the information of an entire document collection
• To find patterns, trends, regularities
→ Link Analysis
→ Multi-relational data mining

Page 10

Mining Text and Web Data
Text Representation

Overview

• Words

• Linguistic phrases

• Role annotation

• Parse trees

Page 11

Mining Text and Web Data
Text Representation

Example

[Table: token-by-token linguistic annotation of the sentence
"By making translational fusions of LcnC to the reporter proteins galactosidase (LacZ) and alkaline phosphatase (PhoA*), it was shown that both the N- and C-terminal parts of LcnC are located in the cytoplasm."
with one row per word and the columns WORD, BASE, NAMED_ENTITY, POS, WORD_SENSE, CONCEPT_IDs, SUPER_CONCEPT_IDs, FULL_PARSE, PREDICATE_ARGUMENTS.]

WORD: word in the text
BASE: base form of the word
NAMED_ENTITY: entity type
POS: part-of-speech tag
WORD_SENSE: ID of the word sense in WordNet
CONCEPT_IDs: concept ID from lexical resources
FULL_PARSE: full syntactic parsing results
PREDICATE_ARGUMENTS: semantic role labels
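
As an illustration of how such token-level annotations can be produced, here is a minimal sketch (not part of the original slides) using the spaCy library; it assumes the en_core_web_sm model has been installed and covers only the WORD, BASE, POS and NAMED_ENTITY columns, not word senses, concept IDs or semantic roles.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumed to be installed separately
doc = nlp("The jaguar, a cat, can run at speeds reaching 50 mph.")

for token in doc:
    # WORD, BASE (lemma), POS tag, entity type (empty string if the token is not part of an entity)
    print(token.text, token.lemma_, token.tag_, token.ent_type_)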

Page 12

Mining Text and Web Data
Information Extraction

Introduction

• Entity: an object of interest such as a person, city, company, protein, ...
  → accuracy 90 – 98 %

• Attribute: a property of an entity such as age or population
  → accuracy ~ 80 %

• Relationship (fact): a relationship (association) between two or more entities, such as a company headquartered in a city
  → accuracy 60 – 70 %

• Event: an activity involving several entities, such as a management change or an earthquake
  → accuracy 50 – 60 %

Page 13

Mining Text and Web Data
Information Extraction

Approaches

• Knowledge Engineering (KE) approach

Rules formulated by linguists together with domain experts

Based on a set of relevant documents with example information

• Machine Learning (ML) approach

Statistical learning with little / no linguistic knowledge

Learn automatically from a corpus of annotated documents

• Hybrid approach

utilize user input in the development loop

Page 14

Mining Text and Web Data
The ML Approach to Information Extraction

Overview

• Entity extraction as a classification problem

• Classes: one per named entity class that we want to extract, plus one "no-name" class

• Hidden Markov Models (HMMs): one of the most popular classifiers for sequence data

• An HMM is a finite state automaton with probabilistic state transitions and symbol emissions
  → probabilistic generative process

Page 15

Mining Text and Web Data
The ML Approach to Information Extraction

Markov Models

• Markov chains

symbol in a sequence depends only on its preceding symbol(s)

can be used for classification

[Deshpande & Karypis 2002]

• Hidden Markov Models

symbol in a sequence depends on a hidden state

state depends on preceding state

Page 16

Mining Text and Web Data
The ML Approach to Information Extraction

1-order Markov Chains

• For each class, determine the conditional probabilities P(s_i | s_j) for each pair of symbols s_i and s_j

• For each class c_i, calculate the probability P(s | c_i) of observing the given sequence

• Choose the class with the highest likelihood

• Decision function for two classes (+ and −), sketched in code below:

  s = s_1 s_2 ... s_L

  P(s | c_i) = P(s_1 | c_i) · P(s_2 | s_1, c_i) · ... · P(s_L | s_{L-1}, c_i)

  f(s) = Σ_{i=1}^{L} log [ P(s_i | s_{i-1}, +) / P(s_i | s_{i-1}, −) ]
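
A minimal sketch (not part of the original slides) of this two-class decision function. The transition probabilities are estimated from labelled training sequences with add-one smoothing, which is an assumption on my part rather than something specified on the slide, and the first-symbol prior P(s_1 | c) is ignored.

import math
from collections import defaultdict

def train_transitions(sequences, alphabet):
    # estimate P(s_i | s_j) from training sequences, with add-one smoothing
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        probs[prev] = {cur: (counts[prev][cur] + 1) / total for cur in alphabet}
    return probs

def decision_function(seq, pos_model, neg_model):
    # f(s) = sum_i log P(s_i | s_{i-1}, +) / P(s_i | s_{i-1}, -); positive value => class +
    return sum(math.log(pos_model[p][c] / neg_model[p][c]) for p, c in zip(seq, seq[1:]))

alphabet = "ABCD"
pos = train_transitions(["ABBD", "ABBDA", "AABBD"], alphabet)
neg = train_transitions(["CDCA", "DDCB", "BACD"], alphabet)
print(decision_function("ABBD", pos, neg) > 0)   # True: the sequence looks like the + class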

Page 17

Mining Text and Web Data
The ML Approach to Information Extraction

Higher-order Markov Chains

Idea

• k-order Markov chain:

symbol in a sequence depends only on its k preceding symbols

Discussion

• in general: higher classification accuracy than 1-order Markov chains

• but

exponential number of transition probabilities

hard to estimate probabilities
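
To make the parameter blow-up concrete (this calculation is not on the original slide): a k-order Markov chain over an alphabet Σ needs one probability P(s_i | s_{i-k} ... s_{i-1}) for every symbol and every length-k history, i.e. |Σ|^(k+1) transition probabilities per class. For |Σ| = 26, a 1-order chain has 26^2 = 676 parameters, but a 3-order chain already has 26^4 = 456,976, most of which cannot be estimated reliably from a limited training corpus.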

Page 18

Mining Text and Web Data
The ML Approach to Information Extraction

Hidden Markov Models

• Symbol in a sequence depends on a hidden (unobserved) state

state depends on preceding state

all dependencies are probabilistic

• Special states: initial state and final state

• Pattern is “detected” if it transforms initial state into final state

• HMMs are more compact than Markov chains

single state can represent many subsequences that lead to this state

Page 19

Mining Text and Web Data
The ML Approach to Information Extraction

Hidden Markov Models

• Goal: distinguish patterns from background in a sequence

motifs in molecular biology

• Hidden Markov Model (HMM)

generative process for patterns of length L with

consensus pattern (motif)

noise level ε

frequency F

• Hidden states: one for each position of the pattern, one for the background
  the current state determines the next symbol to be generated (multinomial distribution)
  and the next state (transition probabilities)

Page 20

Mining Text and Web Data
The ML Approach to Information Extraction

Hidden Markov Models

• background state: probability of a symbol = its frequency in the background

• pattern states P_i, 1 ≤ i ≤ L
  symbol at position i of the consensus pattern: probability 1 − (L − 1)·ε
  other symbols: probability ε

• Example (consensus pattern ABBD, uniform background); a simulation sketch follows below

  [Diagram: background state B emits A, B, C, D with probability 0.25 each and loops to itself with probability 0.99; with probability 0.01 it moves to pattern state P1; the transitions P1 → P2 → P3 → P4 → B each have probability 1.0; P1 emits A, P2 emits B, P3 emits B and P4 emits D, each with probability 0.9.]
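
A minimal sketch (not part of the original slides) that simulates this generative process for the ABBD example; spreading the remaining 0.1 emission mass in each pattern state uniformly over the other three symbols is an assumption on my part.

import random

ALPHABET = "ABCD"
CONSENSUS = "ABBD"
P_ENTER = 0.01        # probability of leaving the background and starting the motif
P_CONSENSUS = 0.9     # probability of emitting the consensus symbol in a pattern state

def emit_pattern_symbol(consensus_symbol):
    # consensus symbol with probability 0.9, any other symbol otherwise (assumed uniform)
    others = [s for s in ALPHABET if s != consensus_symbol]
    return consensus_symbol if random.random() < P_CONSENSUS else random.choice(others)

def generate(length):
    # background state "B" plus pattern states 0..3 (P1..P4)
    sequence, state = [], "B"
    while len(sequence) < length:
        if state == "B":
            sequence.append(random.choice(ALPHABET))              # uniform background emission
            state = 0 if random.random() < P_ENTER else "B"       # B -> P1 with 0.01, else stay
        else:
            sequence.append(emit_pattern_symbol(CONSENSUS[state]))
            state = state + 1 if state + 1 < len(CONSENSUS) else "B"   # P1 -> P2 -> P3 -> P4 -> B
    return "".join(sequence)

random.seed(0)
print(generate(60))   # background noise with occasional ABBD-like motifs embedded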

Page 21

Mining Text and Web Data
The ML Approach to Information Extraction

Discussion

+ HMMs do not require a linguistic expert to formulate rules

+ Model is language independent

- Needs a large annotated (labeled) corpus

- In general, not as accurate as the KE approach

Page 22

Mining Text and Web Data
The KE Approach to Information Extraction

Overview

• Rules formulated by linguists together with domain experts

• Rules are sequential patterns consisting of

o string constants and

o variables representing instances from certain entity types

  Ex.: Possible_Company FOLLOWED_BY "fired" FOLLOWED_BY Possible_Person

• Possibly, additional constraints that the matching pattern in some document must satisfy

  Ex.: matches of Company and Person not more than X characters apart
  (a regular-expression sketch of such a rule follows below)
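
One way to realize such a rule, sketched here with regular expressions (my illustration, not the formalism used on the slides): the capitalized-word patterns for Possible_Company and Possible_Person are crude stand-ins for real entity recognizers, and the 60-character gap plays the role of the distance constraint X.

import re

# Crude stand-ins for entity recognizers: one or more capitalized words
POSSIBLE_COMPANY = r"(?P<company>(?:[A-Z][a-zA-Z]+ ?)+)"
POSSIBLE_PERSON = r"(?P<person>(?:[A-Z][a-zA-Z]+ ?)+)"

# Possible_Company FOLLOWED_BY "fired" FOLLOWED_BY Possible_Person,
# with at most 60 characters allowed between the company and the keyword
RULE = re.compile(POSSIBLE_COMPANY + r".{0,60}?\bfired\b\s+" + POSSIBLE_PERSON)

match = RULE.search("Microsoft fired Steve Thompson, the report said.")
if match:
    print(match.group("company").strip(), "--fired-->", match.group("person").strip())
# Microsoft --fired--> Steve Thompson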

Page 23

Mining Text and Web Data
The KE Approach to Information Extraction

Challenges

• Need to consider semantic similarity between words
  Ex.: "fired" or a similar word / word class
  "Microsoft laid off Steve Thompson . . ."

• Need to consider the part of speech of a word
  Ex.: the function of "fired" in the sentence must be a verb
  "Alabama Power's new gas-fired electric generating facility at Plant Barry."

Page 24

Mining Text and Web Data
The KE Approach to Information Extraction

Challenges

• Need to deal with co-references

i.e. referential relations between expressions

• Simple version of co-references: noun phrases referencing

the same entity

Ex.: „The Giant Computer Manufacturer“, „The Company“,

„The owner of over 600,000 patents“

• Even simpler version: proper names referencing the same entity

Ex.: „The President“, „George Bush“, „George W. Bush“

Page 25

Mining Text and Web Data
The KE Approach to Information Extraction

Approach for Co-references

• Mark each noun phrase with entity type, singular / plural, gender, ...

• Distinguish the scopes of different types of noun phrases
  proper names: whole document
  definite clause: preceding paragraph
  pronoun: previous sentence

• Use this information to filter out wrong matches
  Ex.: "George Bush" does not match "she", "they", "The Company"

Page 26

Mining Text and Web Data
The KE Approach to Information Extraction

Discussion

+ Does not require large labeled corpus
+ In general, more accurate than the ML approach

- Requires substantial linguistic and domain expertise to formulate the extraction rules
- Method is application-specific: need new rule set for every new application

Page 27

Mining Text and Web Data
The KE Approach to Information Extraction

Bootstrapping Methods

• Provide some labeled / annotated documents

Very time consuming

e.g., 8 hours for 160 documents

• Alternatively, provide examples of entity / relationship types

In general, easier to provide

e.g., 100 example cities or proteins

• Learn extraction rules automatically

• Apply these extraction rules to detect further instances

of the specified entity / relationship type

Page 28

Mining Text and Web Data
The KE Approach to Information Extraction

Bootstrapping Methods

• Snowball [Agichtein et al, 2000]
  User provides instances of the entity type (training data)
  System retrieves webpages containing these instances and determines textual patterns in their proximity
  Evaluates the precision of these patterns using the training data
  Applies the patterns to retrieve further instances

• Know-It-All [Etzioni O. et al, 2005]
  System applies generic patterns to retrieve some instances from webpages
  Uses occurrences of these instances to discover patterns specific to the given entity / relationship type
  → fully unsupervised

(a simplified bootstrapping sketch follows below)
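
A minimal sketch of the basic bootstrapping loop (my illustration, not the actual Snowball algorithm): seed instances are located in a small corpus, the surrounding text is turned into patterns, and the patterns are applied to harvest new instances. Pattern scoring, tuple confidence and Snowball's vector-space pattern representation are all omitted; the 15-character context window and the single-capitalized-word slot are arbitrary simplifications.

import re

def contexts(corpus, instance, window=15):
    # textual patterns: fixed-size character windows around each seed occurrence (simplified)
    return {(m.string[max(0, m.start() - window):m.start()],
             m.string[m.end():m.end() + window])
            for text in corpus for m in re.finditer(re.escape(instance), text)}

def bootstrap(corpus, seeds, rounds=2):
    instances = set(seeds)
    for _ in range(rounds):
        patterns = {p for inst in instances for p in contexts(corpus, inst)}
        for left, right in patterns:
            for text in corpus:
                # apply pattern: accept any capitalized word appearing between the two contexts
                for m in re.finditer(re.escape(left) + r"([A-Z][a-z]+)" + re.escape(right), text):
                    instances.add(m.group(1))
    return instances

corpus = ["The headquarters of Microsoft is located in Redmond.",
          "The headquarters of Boeing is located in Chicago.",
          "The headquarters of Exxon is located in Irving."]
print(bootstrap(corpus, {"Microsoft"}))   # picks up Boeing and Exxon via the shared pattern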

Page 29

Mining Text and Web Data
Mining the Extracted Information

Link Analysis

• Objective: detection of relationships between entities that would otherwise be hidden by the mass of data

• Data sources: criminal networks, customer networks, scientific networks, biological graph structures, document graphs, ...

• Detection of central nodes in a network
  e.g., identification of network vulnerabilities or targeting for sales campaigns

• Graph-based data mining
  e.g., detection of communication patterns that discriminate between threat and non-threat groups

• Summarization of graph structures
  e.g., document summarization

Page 30

Mining Text and Web Data
Mining the Extracted Information

Centrality of Nodes

• Degree: number of nodes to which the node is directly linked

• Betweenness: number of shortest paths between two other nodes which pass through the node

• Radius: maximum of the minimum path lengths to the other nodes

• Point strength: increase in the number of maximal connected subcomponents after removal of the node

• Business: amount of information transmitted via the node

(a centrality sketch follows below)
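
The first measures are directly available in the networkx library; a minimal sketch (not part of the original slides) on a small hypothetical communication graph:

import networkx as nx

# Small hypothetical communication network
G = nx.Graph()
G.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")])

print(nx.degree_centrality(G))       # degree of each node, normalized by the number of possible neighbours
print(nx.betweenness_centrality(G))  # fraction of shortest paths passing through each node
print(nx.eccentricity(G))            # maximum shortest-path distance from each node ("radius" of a node above)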

Page 31

Mining Text and Web Data
Mining the Extracted Information

Link Analysis Queries

• Who is central in the organization?

• What role(s) does this individual play in the organization?

• Which three individuals’ removal would harm this drug-supply

network the most?

• What communication channels within a terrorist network are

worth monitoring?

• What significant changes in the operation of an organization have

taken place over a given period of time?

Page 32

Mining Text and Web Data
Mining the Extracted Information

Graph-Based Data Mining

• Task: detection of frequent subgraphs

• Challenge:

subgraph isomorphisms need to be considered,

the subgraph-isomorphism test is NP-complete,

huge number of candidate patterns

• Modified task: detection of subgraphs that compress the input

graph well (Subdue) [Cook and Holder 2000]

takes a labeled graph,

uses Minimum Description Length to measure the degree

of compression

Page 33

Mining Text and Web Data
Mining the Extracted Information

Graph-Based Data Mining [Mukherjee and Holder 2004]

• Task: detection of communication patterns that discriminate

between threat and non-threat groups

• Extension of Subdue to incorporate negative examples

• Discovery of advanced pattern types

cliques, K-plexes, K-cores

Page 34

Mining Text and Web Data
Mining the Extracted Information

Document Summarization [Leskovec et al, 2004]

• Task: summarization of documents, based on a training set of documents and their summaries

• Approach

extract entities and their relationships from documents

(graph structure)

use training dataset to identify relevant subgraphs

apply classifier to summarize test documents

Page 35

References

Agichtein E., Gravano L.: "Snowball: Extracting Relations from Large Plain-Text Collections", Proceedings of the 5th ACM International Conference on Digital Libraries (DL), 2000.

Cook D.J., Holder L.B.: "Graph-Based Data Mining", IEEE Intelligent Systems, Vol. 15, No. 2, 2000.

Deshpande M., Karypis G.: "Evaluation of Techniques for Classifying Biological Sequences", PAKDD 2002.

Etzioni O. et al.: "Unsupervised Named-Entity Extraction from the Web: An Experimental Study", Artificial Intelligence, 2005.

Feldman R.: "Information Extraction: Theory and Practice", Tutorial, ICDM 2003.

Leskovec J., Grobelnik M., Milic-Frayling N.: "Learning Sub-structures of Document Semantic Graphs for Document Summarization", Proc. Workshop LinkKDD, 2004.

Mukherjee M., Holder L.B.: "Graph-Based Data Mining on Social Networks", Proc. Workshop LinkKDD, 2004.