Text Mining Overview
Piotr Gawrysiak ([email protected])
Warsaw University of Technology
Data Mining Group
22 November 2001
Topics
1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
   • Language processing methods
   • Single document processing
   • Document corpora processing
4. Document categorization – a closer look
5. Applications
   • Classic
   • Profiled document delivery
   • Related areas
     • Web Content Mining & Web Farming
WUT DMG NOV 2001
Natural Language Processing
• Natural language – a test for Artificial Intelligence (Alan Turing)
• NLP and NLU
• Linguistics – exploring the mysteries of language
  • William Jones
  • Comparative linguistics – Jakob Grimm, Rasmus Rask
  • Noam Chomsky
    • I-Language and E-Language
    • poverty of stimulus
• Statistical approaches – Markov and Shannon

Natural language processing (NLP) – anything that deals with text content
Natural language understanding (NLU) – semantics and logic
Information explosion
[Chart: growth from 1970 to 2000 of the number of books published weekly and the number of articles published monthly, on a log scale from 1 to 100,000]
• Increasing popularity of the Internet as a publishing medium
• Electronic media's minimal duplication costs
Primitive information retrieval and data management tools
Data Mining
Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization
Knowledge pyramid
[Diagram: knowledge pyramid – Signals, Data, Information, Knowledge, Wisdom (bottom to top); resources occupied shrink and the semantic level rises toward the top; the Data Mining area is marked on the lower layers]
Text Mining – a definition
Text Mining = Data Mining (applied to text data) + basic linguistics
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining tools

Language tools
• Linguistic analysis
• Thesauri, dictionaries, grammar analysers etc.
• Machine translation

Single document tools
• Automatic feature extraction
• Automatic summarization

Multiple document tools
• Document categorization
• Document clustering
• Information retrieval
• Visualization methods
Language analysis
• Syntactic analyser construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation

This is not that simple – consider for example:
"This is a delicious butter" – noun
"You should butter your toast" – verb

Rule-based systems or self-learning classification systems (using VMM and HMM)
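A disambiguator of this kind can be sketched with a couple of hand-written context rules (illustrative rules only; the VMM/HMM systems mentioned above learn such regularities from tagged corpora):

```python
# Toy context-rule disambiguator for the noun/verb ambiguity of "butter"
# (illustrative only; a real tagger would learn these rules, e.g. with an HMM).
VERB_LEFT = {"should", "can", "must", "will", "to"}  # modals and infinitive "to"

def tag_butter(words, target="butter"):
    """Tag each occurrence of `target` as 'verb' after a modal, else 'noun'."""
    tags = []
    for i, w in enumerate(words):
        if w == target:
            prev = words[i - 1] if i > 0 else ""
            tags.append("verb" if prev in VERB_LEFT else "noun")
    return tags

print(tag_butter("this is a delicious butter".split()))   # ['noun']
print(tag_butter("you should butter your toast".split())) # ['verb']
```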
Thesaurus construction
[Diagram: thesaurus fragment from the "Post and telecom" domain, linking Telephone, Cell phone, Fax machine, Data transmission network and Electronic mail under the broader term Telecommunications]

Thesaurus (semantic network) stores information about relationships between terms:
• Ascriptor – Descriptor relations
• "Broader term" – "Narrower term" relations
• "Related term" relations
"The U.S.S. Nashville arrived in Colon harbour with 42 marines"
"With the warship in Colon harbour, the Colombian troops withdrew"
Construction can be manual (but this is a laborious process) or automatic.
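As a toy illustration of the semantic-network structure (terms and links invented for the example), a thesaurus can be stored as a dictionary of BT/NT/RT relations:

```python
# Toy thesaurus as a semantic network: each term lists its broader (BT),
# narrower (NT) and related (RT) terms. Entries are illustrative only.
thesaurus = {
    "telecommunications": {"BT": [], "NT": ["telephone", "fax machine", "electronic mail"], "RT": []},
    "telephone": {"BT": ["telecommunications"], "NT": ["cell phone"], "RT": ["fax machine"]},
    "cell phone": {"BT": ["telephone"], "NT": [], "RT": []},
    "fax machine": {"BT": ["telecommunications"], "NT": [], "RT": ["telephone"]},
    "electronic mail": {"BT": ["telecommunications"], "NT": [], "RT": []},
}

def broader_chain(term):
    """Follow 'broader term' links up to the top of the hierarchy."""
    chain = []
    while True:
        broader = thesaurus.get(term, {}).get("BT", [])
        if not broader:
            return chain
        term = broader[0]
        chain.append(term)

print(broader_chain("cell phone"))  # ['telephone', 'telecommunications']
```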
Machine translation
Problems (source language: Polish, target language: English)

• Different vocabularies
• Different grammars and flexion rules
• Even different character sets

"Książka okazała się [adjective]" – "The book turned out to be [adjective]"

The same Polish sentence "W łóżku jest szybka" ("szybka" means both "window-pane" and "quick", feminine) translated at different levels:
• Word level: "In bed is window-pane"
• Syntactic level: "She is a window-pane in bed"
• Semantic level: "She is quick in bed"
• Knowledge representation level: via a formal knowledge representation language – "She is quick in bed"
Fully automatic approach

Based on learning word usage patterns from large corpora of translated documents (bitext)

Problems:
• Still quite few bitexts exist
• Sentences must be aligned prior to learning
  • Keyword matching
  • Sentence-length-based alignment
• Parameterisation is necessary
Feature extraction
Not all words are equally important:
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers

Discovering important terms:
• Finding lexical affinities (gap variance measurement)
• Dictionary-based methods
• Grammar-based heuristics
Examples of variant forms to unify:
• "Data bases" vs. "Databases"
• "Microsoft" vs. "Micro$oft"
• "MineIT"
• "Knowledge discovery in databases" vs. "Knowledge discovery in large databases" vs. "Knowledge discovery in big databases"
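The gap-variance idea can be sketched as follows: a word pair whose occurrence gap is nearly constant behaves like a multiword term (a minimal sketch; the window size and forward-only scan are assumptions):

```python
# Gap-variance sketch for discovering lexical affinities (illustrative):
# pairs that always occur at the same distance look like multiword terms.
def gap_variance(tokens, w1, w2, window=5):
    """Variance of the forward distance from w1 to nearby w2 occurrences."""
    gaps = []
    for i, t in enumerate(tokens):
        if t != w1:
            continue
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            if tokens[j] == w2:
                gaps.append(j - i)
    if not gaps:
        return None
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)

text = ("knowledge discovery in databases is knowledge discovery in large "
        "databases and knowledge discovery in big databases").split()
# "knowledge" -> "discovery" always at gap 1: variance 0, a stable affinity
print(gap_variance(text, "knowledge", "discovery"))  # 0.0
# "discovery" -> "databases" occurs at gaps 2, 3, 3: nonzero variance
print(gap_variance(text, "discovery", "databases"))
```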
Document summarization
Summary types: abstracts vs. extracts; indicative vs. informative summaries.

Summary creation methods (applied to an unknown document):
• statistical analysis of sentence and word frequency + dictionary analysis (i.e. "abstract", "conclusion" words etc.)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question-answer patterns, formatting, vocabulary shifts etc.)
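The statistical method can be sketched as a tiny extract generator that scores sentences by the corpus frequency of their non-stopword terms (toy stopword list and toy data, not the lecture's system):

```python
# Minimal extract-type summariser: score each sentence by the summed
# document frequency of its non-stopword terms, keep the top sentences.
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "in", "to", "it"}

def summarize(sentences, n=1):
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in s.lower().split() if w not in STOPWORDS)
    return sorted(sentences, key=score, reverse=True)[:n]

doc = [
    "Text mining extracts useful information from documents",
    "The weather was nice",
    "Mining documents requires processing",
]
print(summarize(doc, n=1))
```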
Document categorization & clustering
Clustering – dividing a set of documents into groups
Categorization – grouping based on a predefined category scheme
Typical categorization scenario:
Step 1: Create training hierarchy (e.g. Class 1, Class 2)
Step 2: Perform training – compute class fingerprints
Step 3: Actual classification of repository documents
Categorization/clustering system pipeline:
Documents → Representation conversion → Representation processing (deriving metrics) → Classic DM algorithm
• Clustering – k-means, agglomerative, ...
• Categorization – kNN, DT, Bayes, ...
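A minimal end-to-end sketch of this pipeline, with a bag-of-words representation, a cosine metric and a 1-nearest-neighbour classifier (toy training data; all details assumed):

```python
# Bag-of-words + cosine + 1-NN categorization sketch (toy data).
from collections import Counter
import math

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

training = [
    ("stocks and markets fell sharply", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the football match", "sport"),
    ("players scored in the final game", "sport"),
]

def classify(text):
    vec = bow(text)
    return max(training, key=lambda t: cosine(vec, bow(t[0])))[1]

print(classify("interest rates and markets"))  # finance
print(classify("the football players won"))    # sport
```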
Information retrieval
Two types of search methods
• exact match – in most cases uses some simple Boolean query specification language
• fuzzy – uses statistical methods to estimate relevance of the document
1999 data - Scooter (AltaVista) : 1.5GB RAM, 30GB disk, 4x533 MHz Alpha, 1GB/s I/O (crawler) - about 1 month needed to recrawl
Modern IR tools seem to be very effective... yet as of 2000 only 40–50% of the Web was indexed at all.
IR – exact match
Most popular method – inverted files
[Diagram: inverted file – for each index term (a ... z) a posting list of the documents containing it]

• Very fast
• Boolean queries very easy to process
• Very simple
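An inverted file reduces a Boolean AND query to a set intersection over posting lists; a minimal sketch (toy documents):

```python
# Inverted-file sketch: each term maps to the set of document ids containing
# it; a Boolean AND query is an intersection of the posting lists.
docs = {
    1: "data mining in large databases",
    2: "text mining of documents",
    3: "large text repositories",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    postings = [index.get(t, set()) for t in terms]
    result = postings[0].copy()
    for p in postings[1:]:
        result &= p
    return result

print(boolean_and("text", "mining"))  # {2}
print(boolean_and("large"))           # {1, 3}
```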
IR – fuzzy search
sim(d, q) = cos(d, q) = (Σ_i d_i · q_i) / (‖d‖ · ‖q‖)
Query can be a set of keywords, a document, or even a set of documents – also represented as a vector
Documents are represented as vectors over word (feature) space
[Diagram: iterative retrieval loop – initial query → IR → output selection → refined query against the repository → output]

It's possible to perform this iteratively – relevance feedback
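A vector-space sketch of fuzzy search, with a crude relevance-feedback step that simply adds a relevant document's vector to the query (a Rocchio-like simplification assumed here, not the exact method from the slides):

```python
# Vector-space retrieval sketch: documents and queries are term-count
# vectors compared with cosine similarity; feedback boosts query terms.
from collections import Counter
import math

def vec(text):
    return Counter(text.lower().split())

def cos_sim(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

docs = ["text mining overview", "mining coal in mines", "text categorization methods"]

def rank(query_vec):
    return sorted(docs, key=lambda d: cos_sim(query_vec, vec(d)), reverse=True)

q = vec("text mining")
print(rank(q))
# relevance feedback: add the vector of a document the user marked relevant
q_fb = q + vec(docs[0])
print(rank(q_fb)[0])  # 'text mining overview'
```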
Document visualization
Peak represents many strongly related documents
Water represents assorted documents, creating semantic noise
Island represents several documents sharing a similar subject, separated from the others – hence creating a group of interest
Document categorization
A closer look
Measuring quality
Binary categorization scenario is analogous to document retrieval:

dr – relevant documents
ds – documents labelled as relevant
DB – document database

PR = |dr ∩ ds| / |ds|                        (precision)
R  = |dr ∩ ds| / |dr|                        (recall)
A  = (|dr ∩ ds| + |DB − (dr ∪ ds)|) / |DB|   (accuracy)
FO = |ds − dr| / |DB − dr|                   (fallout)
WUTDMGNOV 2001
Metrics

With contingency counts a (relevant and labelled relevant), b (labelled relevant only), c (relevant only), d (neither):

0 ≤ PR(f,g) ≤ 1;  PR(f,g) = a / (a + b)
0 ≤ R(f,g) ≤ 1;   R(f,g) = a / (a + c)
A(f,g) = (a + d) / (a + b + c + d)
0 ≤ FO(f,g) ≤ 1;  FO(f,g) = b / (b + d)

Combined measure: F_α = 1 / (α · (1/PR) + (1 − α) · (1/R)); for α = 1/2 this is F1 = 2 · PR · R / (PR + R)
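The four measures and F1 computed from the contingency counts (assuming the usual mapping a = relevant and selected, b = selected only, c = relevant only, d = neither):

```python
# Categorization quality metrics from contingency counts (a, b, c, d).
def metrics(a, b, c, d):
    pr = a / (a + b)                  # precision
    r = a / (a + c)                   # recall
    acc = (a + d) / (a + b + c + d)   # accuracy
    fo = b / (b + d)                  # fallout
    f1 = 2 * pr * r / (pr + r)        # F-measure for alpha = 1/2
    return pr, r, acc, fo, f1

pr, r, acc, fo, f1 = metrics(a=40, b=10, c=20, d=30)
print(pr, r, acc, fo)  # 0.8 0.666... 0.7 0.25
```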
Multiple class scenario
Given classes M = {M1, M2, ..., Ml} with per-class precisions PR = {PR1, PR2, ..., PRl}:

Macro-averaging:  PR_macro(f,g) = (1/l) · Σ_{i=1..l} PR_i
Micro-averaging:  pool the contingency counts of all classes first, then compute a single PR
Categorization example
Document representations
• unigram representations (bag-of-words)
  • binary
  • multivariate
• n-gram representations
• -gram representation
• positional representation
Bigram example
Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
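Extracting the bigram representation of these two lines is a one-liner over adjacent word pairs (word-level bigrams assumed):

```python
# Word-bigram representation of the Jabberwocky lines above.
from collections import Counter

text = ("Twas brillig and the slithy toves "
        "Did gyre and gimble in the wabe").lower().split()
bigrams = Counter(zip(text, text[1:]))
print(bigrams.most_common(3))
```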
Probabilistic interpretation
R(G(R(D))) = R(D)

Operations:
• R(D) – creating representation R from document D
• G(R) – generating document D based on representation R

unigram-generated text: "said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no"

bigram-generated text: "Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter."
Positional representation

[Plot: occurrence positions of the words "any" and "dumpty" within the text]
Creating positional representation

For a word v ∈ V the density at position k is estimated with a sliding window of width 2r: an occurrence at position p contributes 1 when it falls within the window around k and 0 otherwise, and the window counts are normalised over the n occurrences of the word.

[Figure: word occurrences along the document; a window of width 2r around position k covers two occurrences, so f(k) = 2 before normalisation]
Examples

[Plots: positional density f for the words "any" and "dumpty", computed with window radius r = 500 and r = 5000]
Processing representations
Zipf's law

[Plot: word frequency vs. word ID (frequency rank), log scale]

Word – Frequency
The – 1664
And – 940
To – 789
A – 788
It – 683
You – 666
I – 658
She – 543
Of – 538
said – 473
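Zipf's law predicts rank × frequency ≈ constant; a quick check on the table above (the familiar deviation at the very top ranks shows up):

```python
# Zipf's-law check: rank * frequency should stay roughly constant.
freqs = {"the": 1664, "and": 940, "to": 789, "a": 788, "it": 683,
         "you": 666, "i": 658, "she": 543, "of": 538, "said": 473}

ranked = sorted(freqs.values(), reverse=True)
products = [rank * f for rank, f in enumerate(ranked, start=1)]
print(products)
```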
Stopwords?
"There is no information about penguins in this document" → "information penguins document"
• Expanding
• Trimming
• Scaling functions
• Attribute selection
• Remapping attribute space
Expanding and trimming
Expanding – smoothing n-gram probability estimates:

Laplace:   P_lap(v_i | w_{k−n+1}, ..., w_k) = (n_{x,y} + 1) / (n_x + M_s)
Lidstone:  P_lid(v_i | w_{k−n+1}, ..., w_k) = (n_{x,y} + λ) / (n_x + λ · M_s)

where n_{x,y} is the number of times context x is followed by word y, n_x = Σ_j n_{x,j}, and M_s is the vocabulary size.

Representation processing
Scaling

TF/IDF (the lln weighting):

w_lln(d_j, i) = (1 + log tf_ij) · log(N / df_i)   for tf_ij > 0, otherwise 0

term frequency tf_i, document frequency df_i, N – all documents in the system
• Attribute present in one document – maximal IDF factor, log N
• Attribute present in all documents – IDF factor log(N/N) = 0
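A minimal TF/IDF computation using the (1 + log tf) · log(N/df) weighting (one of several variants of the scheme; toy documents):

```python
# TF/IDF sketch: w = (1 + log tf) * log(N / df), 0 when the term is absent.
import math

docs = [
    "data mining mining methods",
    "text mining overview",
    "database systems overview",
]
N = len(docs)
tokenized = [d.split() for d in docs]
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tfidf(term, doc_idx):
    tf = tokenized[doc_idx].count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df[term])

print(tfidf("mining", 0))    # boosted: frequent here, present in 2 of 3 docs
print(tfidf("overview", 2))  # present once here, in 2 of 3 docs
print(tfidf("data", 1))      # 0.0 – not in this document
```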
Attribute selection

Example – Information Gain:

IG(w_i) = −Σ_{j=1..l} P(k_j) log P(k_j)
          + P(w_i) Σ_{j=1..l} P(k_j|w_i) log P(k_j|w_i)
          + P(w̄_i) Σ_{j=1..l} P(k_j|w̄_i) log P(k_j|w̄_i)
P(w_i) – probability of encountering attribute w_i in a randomly selected document
P(k_j) – probability that a randomly selected document belongs to class k_j
P(k_j|w_i) – probability that a document selected from those containing w_i belongs to class k_j

Statistical tests can also be applied to check whether a feature–class correlation exists
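The Information Gain formula in code, for a binary attribute and toy labelled documents (log base 2 assumed):

```python
# Information gain of a binary attribute over toy labelled documents:
# IG = H(class) - sum over attribute values of P(value) * H(class | value).
import math

def entropy_term(p):
    return -p * math.log2(p) if p > 0 else 0.0

def info_gain(docs):
    """docs: list of (has_attribute: bool, class_label) pairs."""
    n = len(docs)
    classes = {c for _, c in docs}
    h = sum(entropy_term(sum(1 for _, c in docs if c == k) / n) for k in classes)
    gain = h
    for present in (True, False):
        subset = [c for a, c in docs if a == present]
        if not subset:
            continue
        p_w = len(subset) / n
        h_cond = sum(entropy_term(subset.count(k) / len(subset)) for k in classes)
        gain -= p_w * h_cond
    return gain

# attribute perfectly predicts the class -> gain equals class entropy (1 bit)
docs = [(True, "spam"), (True, "spam"), (False, "ok"), (False, "ok")]
print(info_gain(docs))  # 1.0
```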
Attribute space remapping

• Attribute clustering
  • Attribute–class clustering (according to density function similarity)
  • Semantic clustering
• Representation matrix processing (example – SVD)
Applications
• Classic
  • Mail analysis and mail routing
  • Event tracking
• Internet related
  • Web Content Mining and Web Farming
  • Focused crawling and assisted browsing
Thank you