Mining Text and Web Data (Simon Fraser University, 2006-02-28)
CMPT 843, SFU, Martin Ester, 1-06 119
Text Mining
Outline [Feldman 2003]
Introduction
Information Extraction
The ML Approach to Information Extraction
The KE Approach to Information Extraction
Mining the extracted information
The Web and Web Search

[Figure: web search architecture. A crawler fetches documents from web servers into a repository on a storage server. An indexer builds an inverted index over the repository, and clustering / classification modules build a topic hierarchy (Root, with subtopics such as Business, News, Science, Computers, Automobiles, Plants, Animals). Search queries are answered from the inverted index; e.g., the ambiguous term "jaguar" matches both "The jaguar, a cat, can run at speeds reaching 50 mph" and "The jaguar has a 4 liter engine".]
Search Engines

Keyword Search: query “data mining” returns 519,000 results
Directory Services

Browsing: category “data mining” contains ~ 200 results
Mining Text and Web Data: Text Representation
Preprocessing
• remove HTML tags, punctuation etc.
• define terms single-word / multi-word terms
• remove stopwords
• perform stemming
• count term frequencies
• some words are more important than others
  ⇒ smooth the frequencies, e.g. weight by inverse document frequency
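The preprocessing steps above can be sketched in Python. The stopword list, the crude suffix stemmer, and the exact IDF weighting formula below are illustrative assumptions, not methods prescribed by the slides (a real system would use a full stopword list and, e.g., the Porter stemmer):

```python
import math
import re
from collections import Counter

# tiny illustrative stopword list (assumption; real lists are much larger)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "at", "is", "can"}

def preprocess(text):
    """Strip punctuation, lowercase, drop stopwords, apply a crude suffix stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())        # remove punctuation / tags
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    stems = []
    for t in tokens:                                    # toy stemming step
        for suf in ("ing", "es", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

def tf_idf(docs):
    """Count term frequencies, then weight them by inverse document frequency."""
    term_counts = [Counter(preprocess(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()                                      # document frequency per term
    for tc in term_counts:
        df.update(tc.keys())
    return [
        {t: n * math.log(1 + n_docs / df[t]) for t, n in tc.items()}
        for tc in term_counts
    ]
```

Terms occurring in every document (like "jaguar" in the example above) receive a lower weight than terms that discriminate between documents.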
Mining Text and Web Data: Text Representation
Transformation
• Different definitions of the normalized term frequency TF(d, t)
n(d,t): number of occurrences of term t in document d
• select “significant” subset of all occurring terms
• Vocabulary V, term t_i: document d represented as (Bag of Words Model)

  rep(d) = { n(d, t_i) },  t_i ∈ V

  ⇒ most n's are zeroes for a single document

• Normalized term frequency, e.g.

  TF(d, t) = n(d, t) / Σ_τ n(d, τ)   or   TF(d, t) = n(d, t) / max_τ n(d, τ)
Mining Text and Web Data: Text Representation
Similarity Function

Cosine Similarity:

  similarity(d1, d2) = rep(d1) · rep(d2) / ( |rep(d1)| · |rep(d2)| ),   where · denotes the inner product

[Figure: two documents as vectors in the term space spanned by "data" and "mining"; a small angle between the vectors means similar, a large angle dissimilar.]
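A minimal sketch of cosine similarity over sparse term-weight dictionaries (the dictionary representation is an assumption; the slide does not fix a data structure):

```python
import math

def cosine_similarity(rep1, rep2):
    """similarity(d1, d2) = rep(d1) . rep(d2) / (|rep(d1)| |rep(d2)|),
    with documents represented as sparse {term: weight} dicts."""
    dot = sum(w * rep2.get(t, 0.0) for t, w in rep1.items())   # inner product
    norm1 = math.sqrt(sum(w * w for w in rep1.values()))
    norm2 = math.sqrt(sum(w * w for w in rep2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# identical vectors are maximally similar, disjoint ones dissimilar
d1 = {"data": 2.0, "mining": 1.0}
d2 = {"data": 2.0, "mining": 1.0}
d3 = {"document": 3.0}
```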
Mining Text and Web Data: Introduction

Shortcomings of the Current Methods

Low Precision
• Thousands of irrelevant documents returned in response to a search query
• 99% of the information is of no interest to 99% of the people

Low Recall
• In particular for directory services (due to manual acquisition)
• Even the largest crawlers cover less than 50% of all web pages

No information / knowledge
• Results are documents
• The user still has to read the documents to obtain information / knowledge
Mining Text and Web Data: Text Mining

Overview

Step 1: Information extraction
• Automatically extract information from individual documents
• Entities, relationships, events, . . .
⇒ Natural Language Processing

Step 2: Mine the extracted information
• Aggregate over the information of an entire document collection
• To find patterns, trends, regularities
⇒ Link Analysis
⇒ Multi-relational data mining
Mining Text and Web Data: Text Representation
Overview
• Words
• Linguistic phrases
• Role annotation
• Parse trees
Mining Text and Web Data: Text Representation

Example: multi-layer annotation of the sentence
"By making translational fusions of LcnC to the reporter proteins galactosidase (LacZ) and alkaline phosphatase (PhoA*), it was shown that both the N- and C-terminal parts of LcnC are located in the cytoplasm."

Annotation layers (one column per word):
• WORD: word in the text
• BASE: base form of the word
• NAMED_ENTITY: entity type (e.g., LacZ ⇒ GENE, PhoA* ⇒ PROTEIN)
• POS: part-of-speech tag
• WORD_SENSE: ID of the word sense in WordNet
• CONCEPT_IDs / SUPER_CONCEPT_IDs: concept IDs from lexical resources
• FULL_PARSE: full syntactic parsing results
• PREDICATE_ARGUMENTS: semantic role labels

Sample rows (WORD / BASE / NAMED_ENTITY / POS):
  making / make / * / VBG
  fusions / fusion / * / NNS
  LcnC / lcnc / BACTERIA / NN
  LacZ / lacz / GENE / NN
  cytoplasm / cytoplasm / ORGANISM / NN
Mining Text and Web Data: Information Extraction

Introduction

• Entity: an object of interest such as a person, city, company, protein, . . . ⇒ accuracy 90 – 98 %
• Attribute: a property of an entity such as age or population ⇒ accuracy ~ 80 %
• Relationship (fact): a relationship (association) between two or more entities, such as a company headquartered in a city ⇒ accuracy 60 – 70 %
• Event: an activity involving several entities, such as a management change or an earthquake ⇒ accuracy 50 – 60 %
Mining Text and Web Data: Information Extraction
Approaches
• Knowledge Engineering (KE) approach
Rules formulated by linguists together with domain experts
Based on a set of relevant documents with example information
• Machine Learning (ML) approach
Statistical learning with little / no linguistic knowledge
Learn automatically from a corpus of annotated documents
• Hybrid approach
utilize user input in the development loop
Mining Text and Web Data: The ML Approach to Information Extraction

Overview

• Entity extraction as a classification problem
• Classes: one per named entity class that we want to extract, plus one „no-name“ class
• Hidden Markov Models (HMMs): one of the most popular classifiers for sequence data
• An HMM is a finite state automaton with probabilistic state transitions and symbol emissions
⇒ probabilistic generative process
Mining Text and Web Data: The ML Approach to Information Extraction
Markov Models
• Markov chains
symbol in a sequence depends only on its preceding symbol(s)
can be used for classification
[Deshpande & Karypis 2002]
• Hidden Markov Models
symbol in a sequence depends on a hidden state
state depends on preceding state
Mining Text and Web Data: The ML Approach to Information Extraction
1-order Markov Chains
• For each class, determine the conditional probabilities P(si|sj)
for each pair of symbols si and sj
• For each class ci, calculate the probability P(s| ci)
of observing the given sequence
• Choose the class with the highest likelihood
• Decision function for two classes (+ and -)
  s = s_1 s_2 . . . s_L

  P(s | c_i) = P(s_1 | c_i) · P(s_2 | s_1, c_i) · . . . · P(s_L | s_{L-1}, c_i)

  f(s) = Σ_{i=1}^{L} log [ P(s_i | s_{i-1}, +) / P(s_i | s_{i-1}, -) ]
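The decision function above can be sketched as follows. The add-one smoothing of the estimated transition probabilities is an assumption to avoid zero probabilities, not part of the slide:

```python
import math
from collections import defaultdict

def transition_probs(sequences, alphabet):
    """Estimate P(s_i | s_{i-1}) for one class from its training sequences,
    with add-one smoothing (an assumption; the slide leaves estimation open)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)
        probs[prev] = {c: (counts[prev][c] + 1) / total for c in alphabet}
    return probs

def decision(seq, probs_pos, probs_neg):
    """f(s) = sum_i log[ P(s_i|s_{i-1}, +) / P(s_i|s_{i-1}, -) ];
    f(s) > 0 predicts class +, f(s) < 0 class -."""
    return sum(
        math.log(probs_pos[p][c] / probs_neg[p][c])
        for p, c in zip(seq, seq[1:])
    )
```

Trained on alternating sequences for class + and runs of identical symbols for class -, the sign of f(s) separates the two kinds of sequences.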
Mining Text and Web Data: The ML Approach to Information Extraction
Higher-order Markov Chains
Idea
• k-order Markov chain:
symbol in a sequence depends only on its k preceding symbols
Discussion
• in general: higher classification accuracy than 1-order Markov chains
• but
exponential number of transition probabilities
hard to estimate probabilities
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Symbol in a sequence depends on a hidden (unobserved) state
state depends on preceding state
all dependencies are probabilistic
• Special states: initial state and final state
• Pattern is “detected” if it transforms initial state into final state
• HMMs are more compact than Markov chains
single state can represent many subsequences that lead to this state
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Goal: distinguish patterns from background in a sequence
motifs in molecular biology
• Hidden Markov Model (HMM)
generative process for patterns of length L with
consensus pattern (motif)
noise level ε
frequency F
• Hidden states: one for each position of the pattern, one for the background
determines the next symbol to be generated (multinomial distribution)
determines the next state (transition probabilities)
Mining Text and Web Data: The ML Approach to Information Extraction
Hidden Markov Models
• Background state: probability of each symbol = its frequency in the background
• Pattern states P_i, 1 ≤ i ≤ L
  symbol at position i of the consensus pattern: probability 1 − (|Σ| − 1) · ε (so the emission probabilities sum to one over the alphabet Σ)
  other symbols: probability ε
• Example (consensus pattern ABBD, uniform background)

[Figure: background state B and pattern states P1 – P4. B emits A, B, C, D with probability 0.25 each, loops to itself with probability 0.99, and enters P1 with probability 0.01; each P_i emits its consensus symbol (A, B, B, D) with probability 0.9 and moves to the next state with probability 1.0.]
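The emission model of the pattern states can be sketched as follows; the alphabet {A, B, C, D} and the value of ε follow the example above, while the function names are illustrative:

```python
import math

ALPHABET = "ABCD"

def emission(consensus_symbol, observed, eps):
    """A pattern state emits its consensus symbol with probability
    1 - (k - 1) * eps and every other symbol with probability eps,
    where k is the alphabet size."""
    k = len(ALPHABET)
    return 1 - (k - 1) * eps if observed == consensus_symbol else eps

def pattern_path_prob(consensus, seq, eps):
    """Probability that the pattern states P1..PL emit seq
    (their transitions all have probability 1.0)."""
    return math.prod(emission(c, s, eps) for c, s in zip(consensus, seq))
```

With ε = 0.1/3, each pattern state emits its consensus symbol with probability 0.9, as in the figure, so the exact motif ABBD is generated with probability 0.9^4.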
Mining Text and Web Data: The ML Approach to Information Extraction
Discussion
+ HMMs do not require a linguistic expert to formulate rules
+ Model is language independent
- Needs a large annotated (labeled) corpus
- In general, not as accurate as the KE approach
Mining Text and Web Data: The KE Approach to Information Extraction
Overview
• Rules formulated by linguists together with domain experts
• Rules are sequential patterns consisting of
o string constants and
o variables representing instances from certain entity types
Ex.: Possible_Company FOLLOWED_BY „fired“
FOLLOWED_BY Possible_Person
• Possibly, additional constraints that the matching pattern
in some document must satisfy
Ex.: matches of Company and Person not more than X characters apart
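A rule of this kind might be approximated with regular expressions. The capitalization-based Possible_Company / Possible_Person patterns and the MAX_GAP value below are hypothetical stand-ins for real gazetteer-based matchers, not the slide's actual rule language:

```python
import re

MAX_GAP = 40  # hypothetical distance constraint (characters)

# crude stand-ins: a company is one capitalized token (optionally "Inc."/"Corp."),
# a person is two capitalized tokens
COMPANY = r"(?P<company>[A-Z][A-Za-z]+(?: Inc\.| Corp\.)?)"
PERSON = r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
RULE = re.compile(COMPANY + r"\s+fired\s+" + PERSON)

def extract_fired_events(text):
    """Apply the pattern Possible_Company FOLLOWED_BY 'fired'
    FOLLOWED_BY Possible_Person, with a distance constraint."""
    events = []
    for m in RULE.finditer(text):
        # additional constraint: matches not more than MAX_GAP characters apart
        if m.start("person") - m.end("company") <= MAX_GAP:
            events.append((m.group("company"), m.group("person")))
    return events
```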
Mining Text and Web Data: The KE Approach to Information Extraction
Challenges
• Need to consider semantic similarity between words
Ex.: „fired“ or a similar word / word class
Microsoft laid off Steve Thompson . . .
• Need to consider the part of speech of a word
Ex.: the function of „fired“ in the sentence must be a verb
Alabama Power‘s new gas-fired electric generating
facility at Plant Barry.
Mining Text and Web Data: The KE Approach to Information Extraction
Challenges
• Need to deal with co-references
i.e. referential relations between expressions
• Simple version of co-references: noun phrases referencing
the same entity
Ex.: „The Giant Computer Manufacturer“, „The Company“,
„The owner of over 600,000 patents“
• Even simpler version: proper names referencing the same entity
Ex.: „The President“, „George Bush“, „George W. Bush“
CMPT 843, SFU, Martin Ester, 1-06 143
Mining Text and Web Data: The KE Approach to Information Extraction

Approach for Co-references
• Mark each noun phrase with entity type, singular / plural, gender, . . .
• Distinguish scopes of different types of noun phrases
proper names: whole document
definite clause: preceding paragraph
pronoun: previous sentence
• Use this information to filter out wrong matches
Ex.: „George Bush“ does not match „she“, „they“,
„The Company“
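The agreement filter can be sketched as follows; the hand-written feature dictionary is a hypothetical stand-in for the marking step described above:

```python
# Hypothetical features; a real system would derive entity type,
# number and gender from the noun-phrase marking step.
FEATURES = {
    "George Bush": {"type": "PERSON", "number": "sg", "gender": "m"},
    "she":         {"type": "PERSON", "number": "sg", "gender": "f"},
    "they":        {"type": "any",    "number": "pl", "gender": "any"},
    "The Company": {"type": "ORG",    "number": "sg", "gender": "n"},
}

def may_corefer(a, b):
    """Two expressions may co-refer only if entity type, number and
    gender are compatible ('any' matches everything)."""
    fa, fb = FEATURES[a], FEATURES[b]
    return all(
        fa[f] == fb[f] or "any" in (fa[f], fb[f])
        for f in ("type", "number", "gender")
    )
```

The filter rejects exactly the wrong matches from the example: „George Bush“ cannot co-refer with „she“ (gender), „they“ (number), or „The Company“ (entity type).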
Mining Text and Web Data: The KE Approach to Information Extraction
Discussion
+ Does not require a large labeled corpus
+ In general, more accurate than the ML approach
- Requires substantial linguistic and domain expertise to formulate the extraction rules
- Method is application-specific: need a new rule set for every new application
Mining Text and Web Data: The KE Approach to Information Extraction
Bootstrapping Methods
• Provide some labeled / annotated documents
Very time consuming
e.g., 8 hours for 160 documents
• Alternatively, provide examples of entity / relationship types
In general, easier to provide
e.g., 100 example cities or proteins
• Learn extraction rules automatically
• Apply these extraction rules to detect further instances
of the specified entity / relationship type
Mining Text and Web Data: The KE Approach to Information Extraction
Bootstrapping Methods
• Snowball [Agichtein et al, 2000]
  User provides instances of the entity type (training data)
  System retrieves webpages containing these instances and determines textual patterns in their proximity
  Evaluates the precision of these patterns using the training data
  Applies the patterns to retrieve further instances
• Know-It-All [Etzioni O. et al, 2005]
  System applies generic patterns to retrieve some instances from webpages
  Uses occurrences of these instances to discover patterns specific to the given entity / relationship type
  ⇒ fully unsupervised
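A much-simplified sketch of such a bootstrapping loop; the pattern-precision evaluation used by Snowball is omitted, and the corpus and seed values are illustrative:

```python
import re

def bootstrap(corpus, seed_pairs, rounds=2):
    """Seed instances -> textual patterns around their occurrences
    -> new instances, repeated. Heavily simplified: no pattern scoring."""
    known = set(seed_pairs)
    for _ in range(rounds):
        # 1. collect the text between the two entities of each known pair
        patterns = set()
        for e1, e2 in known:
            for sent in corpus:
                if e1 in sent and e2 in sent:
                    between = sent.split(e1, 1)[1].split(e2, 1)[0]
                    patterns.add(between)
        # 2. apply the patterns to retrieve further instance pairs
        for sent in corpus:
            for p in patterns:
                m = re.search(
                    r"([A-Z][A-Za-z]+)" + re.escape(p) + r"([A-Z][A-Za-z]+)",
                    sent,
                )
                if m:
                    known.add((m.group(1), m.group(2)))
    return known

sentences = [
    "Microsoft is headquartered in Redmond.",
    "Google is headquartered in MountainView.",
]
```

Starting from the single seed pair (Microsoft, Redmond), the extracted pattern " is headquartered in " retrieves the new pair (Google, MountainView).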
Mining Text and Web Data: Mining the Extracted Information
Link Analysis
• Objective: detection of relationships between entities that would otherwise be hidden by the mass of data
• Data sources: criminal networks, customer networks, scientific networks, biological graph structures, document graphs, . . .
• Detection of central nodes in a network
  e.g., identification of network vulnerabilities or targeting for sales campaigns
• Graph-based data mining
  e.g., detection of communication patterns that discriminate between threat and non-threat groups
• Summarization of graph structures
  e.g., document summarization
Mining Text and Web Data: Mining the Extracted Information
Centrality of Nodes
• Degree: number of nodes to which it is directly linked
• Betweenness: number of shortest paths between two other nodes which pass through it
• Radius: maximum of the minimum path lengths to other nodes
• Point strength: increase in the number of maximal connected subcomponents after removal of the node
• Busyness: amount of information transmitted via the node
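The two simplest measures above, degree and radius, can be computed directly on an adjacency-list graph (the dict-of-lists representation is an assumption):

```python
from collections import deque

def degree(graph, v):
    """Degree: number of nodes to which v is directly linked."""
    return len(graph[v])

def radius(graph, v):
    """Radius of v: the maximum over the minimum path lengths
    (BFS distances) from v to every other reachable node."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # breadth-first search from v
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# star graph: "c" is the central node
star = {"c": ["a", "b", "d"], "a": ["c"], "b": ["c"], "d": ["c"]}
```

In the star graph the center has the highest degree and the smallest radius, matching the intuition that central nodes reach everyone quickly.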
Mining Text and Web Data: Mining the Extracted Information
Link Analysis Queries
• Who is central in the organization?
• What role(s) does this individual play in the organization?
• Which three individuals’ removal would harm this drug-supply
network the most?
• What communication channels within a terrorist network are
worth monitoring?
• What significant changes in the operation of an organization have
taken place over a given period of time?
Mining Text and Web Data: Mining the Extracted Information
Graph-Based Data Mining
• Task: detection of frequent subgraphs
• Challenge:
graph-isomorphisms need to be considered,
test for graph-isomorphisms is NP-hard,
huge number of candidate patterns
• Modified task: detection of subgraphs that compress the input
graph well (Subdue) [Cook and Holder 2000]
takes a labeled graph,
uses Minimum Description Length to measure the degree
of compression
Mining Text and Web Data: Mining the Extracted Information
Graph-Based Data Mining [Mukherjee and Holder 2004]
• Task: detection of communication patterns that discriminate
between threat and non-threat groups
• Extension of Subdue to incorporate negative examples
• Discovery of advanced pattern types
cliques, K-plexes, K-cores
Mining Text and Web Data: Mining the Extracted Information
Document Summarization [Leskovec et al, 2004]
• Task: summarization of documents,
based on a training set of documents and their summaries
• Approach
extract entities and their relationships from documents
(graph structure)
use training dataset to identify relevant subgraphs
apply classifier to summarize test documents
References
Agichtein E., Gravano L.: "Snowball: Extracting Relations from Large Plain-Text Collections", Proceedings of the 5th ACM International Conference on
Digital Libraries (DL), 2000.
Cook D.J., Holder L.B.: “Graph-Based Data Mining”, IEEE Intelligent Systems, Vol. 15, No. 2, 2000.
Deshpande M., Karypis G.: “Evaluation of Techniques for Classifying Biological Sequences”, PAKDD 2002.
Etzioni O. et al: "Unsupervised Named-Entity Extraction from the Web: An
Experimental Study", Artificial Intelligence, 2005.
Feldman R.: “Information Extraction: Theory and Practice”, Tutorial ICDM 2003.
Leskovec J., Grobelnik M., Milic-Frayling N.: "Learning Sub-structures of Document Semantic Graphs for Document Summarization", Proc. Workshop LinkKDD, 2004.
Mukherjee M., Holder L.B.: “Graph-Based Data Mining on Social Networks”, Proc. Workshop LinkKDD, 2004.