Text Mining Overview Piotr Gawrysiak [email protected] Warsaw University of Technology Data Mining Group 22 November 2001


Page 1: Text Mining Overview

Text Mining Overview

Piotr Gawrysiak [email protected]

Warsaw University of Technology

Data Mining Group

22 November 2001

Page 2: Text Mining Overview

Topics

1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
   • Language processing methods
   • Single document processing
   • Document corpora processing
4. Document categorization – a closer look
5. Applications
   • Classic
   • Profiled document delivery
   • Related areas
     • Web Content Mining & Web Farming

WUTDMGNOV 2001

Page 3: Text Mining Overview

Natural Language Processing

• Natural language – test for Artificial Intelligence
  • Alan Turing
• NLP and NLU
• Linguistics – exploring mysteries of a language
  • William Jones
  • Comparative linguistics - Jakob Grimm, Rasmus Rask
  • Noam Chomsky
    • I-Language and E-Language
    • poverty of stimulus
• Statistical approaches – Markov and Shannon

Natural language processing (NLP): anything that deals with text content

Natural language understanding (NLU): semantics and logic

Page 4: Text Mining Overview

Information explosion

[Figure: number of books published weekly and number of articles published monthly, 1970 to 2000, on a logarithmic scale from 1 to 100000]

• Increasing popularity of the Internet as a publishing medium
• Electronic media’s minimal duplication costs

Primitive information retrieval and data management tools

Page 5: Text Mining Overview

Data Mining

Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro

• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization

Page 6: Text Mining Overview

Knowledge pyramid

[Figure: knowledge pyramid with the levels Signals, Data, Information, Knowledge, Wisdom; axes: Resources occupied, Semantic level; one region is labelled "Data Mining area"]

Page 7: Text Mining Overview

Text Mining – a definition

Text Mining = Data Mining (applied to text data) + basic linguistics

Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.

Page 8: Text Mining Overview

Text Mining tools

Language tools
• Linguistic analysis
  • Thesauri, dictionaries, grammar analysers etc.
• Machine translation

Single document tools
• Automatic feature extraction
• Automatic summarization

Multiple document tools
• Document categorization
• Document clustering
• Information retrieval
• Visualization methods

Page 9: Text Mining Overview

Language analysis

• Construction of syntactic analysers
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation

This is not that simple – consider for example:

This is a delicious butter (butter is a noun)

You should butter your toast (butter is a verb)

Rule-based systems or self-learning classification systems (using variable memory Markov models, VMM, and hidden Markov models, HMM)
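The "butter" example can be illustrated with a toy rule-based disambiguator. This is only a sketch: the word lists and the two context rules below are hypothetical and hand-written for this example, and a real tagger (rule-based or HMM-based) would need far richer knowledge.

```python
# Toy rule-based part-of-speech disambiguation for one ambiguous word.
# The word lists and rules are illustrative assumptions, not a real tagger.
DETERMINERS = {"a", "an", "the", "this", "that"}
ADJECTIVES = {"delicious", "fresh", "salted"}
MODALS = {"should", "can", "must", "will"}


def tag_ambiguous(tokens, word="butter"):
    """Return one tag per occurrence of `word`, decided by the previous token."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in MODALS or prev == "to":
            tags.append("verb")       # "should butter", "to butter"
        elif prev in DETERMINERS or prev in ADJECTIVES:
            tags.append("noun")       # "a butter", "delicious butter"
        else:
            tags.append("unknown")    # no rule applies
    return tags


noun_case = tag_ambiguous("this is a delicious butter".split())
verb_case = tag_ambiguous("you should butter your toast".split())
print(noun_case, verb_case)
```

A self-learning system would replace the hand-written rules with probabilities estimated from a tagged corpus.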

Page 10: Text Mining Overview

Thesaurus construction

[Figure: example semantic network linking the terms Telephone, Cell phone, Telecommunications, Fax machine, Data transmission network, Electronic mail, and Post and telecom with AD, BT and RT relations]

Thesaurus (semantic network) stores information about relationships between terms

• Ascriptor - Descriptor (AD) relations
• „Broader term” – „Narrower term” (BT) relations
• „Related term” (RT) relations

Construction can be manual (but this is a laborious process) or automatic.

The U.S.S. Nashville arrived in Colon harbour with 42 marines

With the warship in Colon harbour, the Colombian troops withdrew

Page 11: Text Mining Overview

Machine translation

Source: Polish, target: English. Example sentence: W łóżku jest szybka

• Word level: In bed is window-pane
• Syntactic level: She is a window-pane in bed
• Semantic level: She is quick in bed
• Knowledge representation level (formal knowledge representation language): She is quick in bed

Problems

• Different vocabularies
• Different grammars and inflection rules
• Even different character sets

Page 12: Text Mining Overview

Książka okazała się [adjective] → The book turned out to be [adjective]

Fully automatic approach

Based on learning word usage patterns from large corpora of translated documents (bitext)

Problems

• Still quite few bitexts exist
• Sentences must be aligned prior to learning
  • Keyword matching
  • Sentence length based alignment
• Parameterisation is necessary

Page 13: Text Mining Overview

Feature extraction

Not all words are equally important

• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers

Discovering important terms

• Finding lexical affinities
• Gap variance measurement
• Dictionary-based methods
• Grammar based heuristics

[Figure: example terms and their variants: "Data bases" / "Databases"; "Microsoft" / "Micro$oft"; "MineIT"; "Knowledge discovery in databases" / "Knowledge discovery in large databases" / "Knowledge discovery in big databases"]

Page 14: Text Mining Overview

Document summarization

Summaries

• Abstracts
• Extracts
• Indicative summaries
• Informative summaries

Summary creation methods:

• statistical analysis of sentence and word frequency + dictionary analysis (e.g. „abstract”, „conclusion” words etc.)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question-answer patterns, formatting, vocabulary shifts etc.)

Page 15: Text Mining Overview

Document categorization & clustering

Clustering – dividing a set of documents into groups

Categorization – grouping based on a predefined category scheme

Typical categorization scenario

Step 1: Create training hierarchy

Step 2: Perform training

Step 3: Actual classification

[Figure: categorization scenario with the labels Repository, Class 1, Class 2, Class fingerprints, Unknown document, categorization]

Page 16: Text Mining Overview

Categorization/clustering system

[Figure: processing pipeline with the stages Documents, Representation conversion, Representation processing, Deriving metrics, Classic DM algorithm]

Classic DM algorithms

• Clustering – k-means, agglomerative, ...
• Categorization – kNN, DT, Bayes, ...

Page 17: Text Mining Overview

Information retrieval

Two types of search methods

• exact match – in most cases uses some simple Boolean query specification language
• fuzzy – uses statistical methods to estimate relevance of the document

Modern IR tools seem to be very effective...

• 1999 data - Scooter (AltaVista): 1.5GB RAM, 30GB disk, 4x533 MHz Alpha, 1GB/s I/O (crawler) - about 1 month needed to recrawl
• 2000 data - 40-50% of the Web indexed at all

Page 18: Text Mining Overview

IR – exact match

Most popular method – inverted files

[Figure: inverted file with index entries a, b, c, d, ..., z]

• Very fast
• Boolean queries very easy to process
• Very simple
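A minimal sketch of an inverted file and Boolean query processing follows. The three sample documents, the whitespace tokenization and the function names are assumptions made for this illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Ids of documents that contain at least one query term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Hypothetical mini-repository.
docs = {
    1: "text mining overview",
    2: "data mining tools",
    3: "text retrieval tools",
}
index = build_inverted_index(docs)
print(boolean_and(index, "text", "mining"))   # documents with both words
print(boolean_or(index, "text", "data"))      # documents with either word
```

Each Boolean operator maps directly onto a set operation over posting sets, which is why such queries are easy to process.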

Page 19: Text Mining Overview

IR – fuzzy search

Documents are represented as vectors over word (feature) space

Query can be a set of keywords, a document, or even a set of documents – also represented as a vector

$$\mathrm{sim}(Q, D_i) = \cos(Q, D_i) = \frac{\sum_{l=1}^{k} q_l\, d_{il}}{\sqrt{\sum_{l=1}^{k} q_l^2}\;\sqrt{\sum_{l=1}^{k} d_{il}^2}}$$

[Figure: retrieval loop with the elements Repository, Initial query, IR, Output, Selection, Output]

It’s possible to perform it iteratively – relevance feedback
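The vector-space idea can be sketched as follows: documents and the query become word-count vectors, and documents are ranked by the cosine of the angle between the vectors. Raw word counts as vector components, the sample documents and the function names are assumptions of this sketch.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(q, d):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(q[t] * d[t] for t in q)  # a Counter yields 0 for missing words
    norm = math.sqrt(sum(v * v for v in q.values())) * \
        math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (similarity, doc_id) pairs, most similar first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)


# Hypothetical repository and query.
docs = {"d1": "text mining overview", "d2": "data mining tools", "d3": "web farming"}
ranking = rank("text mining", docs)
for score, doc_id in ranking:
    print(f"{doc_id}: {score:.3f}")
```

For relevance feedback, the documents a user selects from the output could themselves be turned into a vector and used as the next query.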

Page 20: Text Mining Overview

Document visualization

Peak represents many strongly related documents

Water represents assorted documents, creating semantic noise

Island represents several documents sharing a similar subject and separated from others, hence creating a group of interest

Page 21: Text Mining Overview

Document visualization


Page 22: Text Mining Overview

Document categorization

A closer look

Page 23: Text Mining Overview

Measuring quality

Binary categorization scenario is analogous to document retrieval

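DB – document database

dr – relevant documents

ds – documents labelled as relevant

Precision (PR), recall (R), accuracy (A) and fallout (FO):

$$PR = \frac{|dr \cap ds|}{|ds|} \qquad R = \frac{|dr \cap ds|}{|dr|}$$

$$A = \frac{|dr \cap ds| + |DB \setminus (dr \cup ds)|}{|DB|} \qquad FO = \frac{|ds \setminus dr|}{|DB \setminus dr|}$$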

Page 24: Text Mining Overview

Metrics

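With the contingency counts $a = |dr \cap ds|$, $b = |ds \setminus dr|$, $c = |dr \setminus ds|$, $d = |DB \setminus (dr \cup ds)|$:

$$0 \le PR(f,g) \le 1; \quad PR(f,g) = \frac{a}{a+b}$$

$$0 \le R(f,g) \le 1; \quad R(f,g) = \frac{a}{a+c}$$

$$A(f,g) = \frac{a+d}{a+b+c+d}$$

$$0 \le FO(f,g) \le 1; \quad FO(f,g) = \frac{b}{b+d}$$

$$F = \frac{1}{\alpha \frac{1}{PR} + (1-\alpha)\frac{1}{R}}$$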
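The metrics on this and the previous slide can be computed directly from the three document sets. The sketch below assumes the usual contingency counts (a: relevant and labelled relevant, b: labelled relevant but not relevant, c: relevant but not labelled, d: neither) and an F measure with weight alpha, set to 0.5 by default; the sample sets are made up for the illustration.

```python
def contingency(db, dr, ds):
    """Counts a, b, c, d from the database DB, relevant set dr and labelled set ds."""
    a = len(dr & ds)            # relevant and labelled as relevant
    b = len(ds - dr)            # labelled as relevant but not relevant
    c = len(dr - ds)            # relevant but not labelled
    d = len(db - (dr | ds))     # neither relevant nor labelled
    return a, b, c, d


def metrics(a, b, c, d, alpha=0.5):
    """Precision, recall, accuracy, fallout and the alpha-weighted F measure."""
    pr = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    acc = (a + d) / (a + b + c + d)
    fo = b / (b + d) if b + d else 0.0
    f = 1 / (alpha / pr + (1 - alpha) / r) if pr and r else 0.0
    return {"PR": pr, "R": r, "A": acc, "FO": fo, "F": f}


# Hypothetical example: 10 documents, 4 relevant, 3 labelled as relevant.
db = set(range(10))
dr = {0, 1, 2, 3}
ds = {2, 3, 4}
counts = contingency(db, dr, ds)
result = metrics(*counts)
print(counts, result)
```

With alpha = 0.5 the F measure reduces to the harmonic mean of precision and recall.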

Page 25: Text Mining Overview

Multiple class scenario

M={M1, M2,...,Ml}

PR={PR1, PR2, ..., PRl}

Macro-averaging:

$$PR^{ma}(f,g) = \frac{1}{l}\sum_{i=1}^{l} PR_i$$

Micro-averaging: metrics are computed from the summed matrix $\sum_{k=1}^{l} M_k$
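The difference between the two averaging schemes can be sketched for precision. The code assumes each class is described by a tuple of contingency counts (a, b, c, d), with a the correctly assigned documents and b the wrongly assigned ones; the two sample tuples are invented for the illustration.

```python
def precision(a, b):
    """Precision from correctly (a) and wrongly (b) assigned documents."""
    return a / (a + b) if a + b else 0.0


def macro_precision(tables):
    """Average of the per-class precision values."""
    return sum(precision(a, b) for a, b, _, _ in tables) / len(tables)


def micro_precision(tables):
    """Precision computed from the element-wise sum of all class tables."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    return precision(a, b)


# Hypothetical (a, b, c, d) tables: one large, easy class and one small, hard class.
tables = [(8, 2, 1, 89), (1, 3, 2, 94)]
macro = macro_precision(tables)
micro = micro_precision(tables)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

Macro-averaging weights every class equally, while micro-averaging weights every decision equally, so large classes dominate the micro-averaged value.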

Page 26: Text Mining Overview

Categorization example


Page 27: Text Mining Overview

Document representations

• unigram representations (bag-of-words)
  • binary
  • multivariate
• n-gram representations
• -gram representation
• positional representation

Page 28: Text Mining Overview

Bigram example

Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
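As a sketch of how unigram and bigram representations could be built from the two lines above: the tokenizer (lowercasing, letters and apostrophes only) and the function names are assumptions of this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """All sequences of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


text = "Twas brillig, and the slithy toves Did gyre and gimble in the wabe"
tokens = tokenize(text)
unigrams = Counter(tokens)            # bag-of-words counts
bigrams = Counter(ngrams(tokens, 2))  # counts of adjacent word pairs

print(unigrams.most_common(2))
print(list(bigrams)[:3])
```

The unigram representation forgets word order entirely, while the bigram representation keeps the order of each adjacent pair.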

Page 29: Text Mining Overview

Probabilistic interpretation

$$R(G(R(D))) \approx R(D)$$

Operations:

• R(D) – creating representation R from document D
• G(R) – generating document D based on representation R

Text generated from a unigram representation:

said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no

Text generated from a bigram representation:

Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter.
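The two operations can be sketched for the unigram and bigram cases: R(D) counts words or word pairs, and G(R) samples a new text from those counts. The sample sentence, the fixed random seed and the function names are assumptions of this sketch, not the procedure used to produce the samples above.

```python
import random
from collections import Counter, defaultdict


def unigram_representation(tokens):
    """R(D) for unigrams: word -> count."""
    return Counter(tokens)


def bigram_representation(tokens):
    """R(D) for bigrams: previous word -> Counter of following words."""
    rep = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        rep[prev][nxt] += 1
    return rep


def generate_unigram(rep, length, rng):
    """G(R): draw words independently, proportionally to their counts."""
    return rng.choices(list(rep), weights=list(rep.values()), k=length)


def generate_bigram(rep, start, length, rng):
    """G(R): draw each word conditioned on the previous one."""
    out = [start]
    while len(out) < length:
        followers = rep.get(out[-1])
        if not followers:   # the last word was never followed by anything
            break
        out.append(rng.choices(list(followers), weights=list(followers.values()), k=1)[0])
    return out


tokens = "the cat sat on the mat and the cat ran".split()
rng = random.Random(0)  # fixed seed so that runs are repeatable
uni_text = generate_unigram(unigram_representation(tokens), 8, rng)
bi_text = generate_bigram(bigram_representation(tokens), "the", 8, rng)
print(" ".join(uni_text))
print(" ".join(bi_text))
```

Every adjacent pair in the bigram output occurs somewhere in the source text, which is why bigram samples look locally more fluent than unigram samples.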

Page 30: Text Mining Overview

Positional representation

[Figure: position in the text (0 to 35000) plotted against occurrence number (0 to 60) for the words "any" and "dumpty"]

Page 31: Text Mining Overview

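Creating positional representation

$$f_{v_i}(k) = \frac{1}{n}\sum_{j=k-r}^{k+r} \begin{cases} 1, & \text{when } w_j = v_i \\ 0, & \text{otherwise} \end{cases} \qquad v_i \in V$$

[Figure: word occurrences within a window of width 2r around position k; in the example f(k)=2 before normalization]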

Page 32: Text Mining Overview

Examples

[Figure: density functions f_any (vertical axis 0 to 0.00025) and f_dumpty (vertical axis 0 to 0.0004) for the words "any" and "dumpty", each plotted for r=500 and r=5000]

Page 33: Text Mining Overview

Processing representations

Zipf’s law

[Figure: word frequency (logarithmic scale, 1 to 10000) plotted against word ID (0 to 3500)]

Word    Frequency
The     1664
And     940
To      789
A       788
It      683
You     666
I       658
She     543
Of      538
said    473

Stopwords?

There is no information about penguins in this document

information penguins document
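Stopword removal, as in the penguin sentence above, can be sketched in a few lines. The stopword list here is a small hand-picked assumption (the most frequent words from the table plus the function words of the example sentence); practical lists are much longer.

```python
# Hypothetical stopword list, chosen only to cover this example.
STOPWORDS = {
    "the", "and", "to", "a", "it", "you", "i", "she", "of",
    "there", "is", "no", "about", "in", "this",
}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Keep only the words that are not on the stopword list."""
    return [w for w in text.lower().split() if w not in stopwords]


sentence = "There is no information about penguins in this document"
content_words = remove_stopwords(sentence)
print(" ".join(content_words))
```

The example also shows the risk the slide hints at: dropping "no" discards the negation, so the reduced text suggests the opposite of the original sentence.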

Page 34: Text Mining Overview

• Expanding

• Trimming

• Scaling functions

• Attribute selection

• Remapping attribute space

Expanding and trimming


Page 35: Text Mining Overview

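Representation processing

Expanding

Laplace:

$$P_{lap}(v_i \mid w_{k-n+1}, \ldots, w_{k-1}) = \frac{M_{x,y} + 1}{\sum_{j=1}^{s} M_{x,j} + s}$$

Lidstone:

$$P_{lid}(v_i \mid w_{k-n+1}, \ldots, w_{k-1}) = \frac{M_{x,y} + \lambda}{\sum_{j=1}^{s} M_{x,j} + \lambda s}$$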
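Laplace and Lidstone expansion (smoothing) can be sketched on a single row of counts. The sketch assumes that the row holds the counts of each of the s possible next words after one fixed context, and that Laplace smoothing is the Lidstone case with an additive constant of 1; the counts and names are invented for the illustration.

```python
def lidstone(count, row_total, s, lam):
    """Smoothed probability: (count + lam) / (row_total + lam * s)."""
    return (count + lam) / (row_total + lam * s)


def laplace(count, row_total, s):
    """Laplace smoothing is Lidstone smoothing with lam = 1."""
    return lidstone(count, row_total, s, 1.0)


# Hypothetical counts of words seen after one context; "text" was never seen.
row = {"mining": 3, "data": 1, "text": 0}
s = len(row)
total = sum(row.values())

p_laplace = {w: laplace(c, total, s) for w, c in row.items()}
p_lidstone = {w: lidstone(c, total, s, 0.5) for w, c in row.items()}
print(p_laplace)
print(p_lidstone)
```

Both variants give the unseen word a non-zero probability while keeping the row a proper distribution; a smaller additive constant moves less probability mass away from the observed words.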

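Scaling

TF/IDF

$$w_{lln}(i, d_j) = \left(1 + \log(tf_{ij})\right) \cdot \log\left(\frac{N}{df_i}\right)$$

term frequency $tf_{ij}$, document frequency $df_i$, N – all documents in system

Attribute present in one document ($df_i = 1$):

$$w_{lln}(i, d_j) = \left(1 + \log(tf_{ij})\right) \cdot \log(N)$$

Attribute present in all documents ($df_i = N$):

$$w_{lln}(i, d_j) = \left(1 + \log(tf_{ij})\right) \cdot \log\left(\frac{N}{N}\right) = 0$$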
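The TF/IDF scaling and its two boundary cases can be checked numerically. The sketch assumes the weight (1 + log tf) * log(N / df) with natural logarithms and a weight of 0 for attributes that do not occur in the document; the function name and sample numbers are illustrative.

```python
import math


def lln_weight(tf, df, n_docs):
    """(1 + log tf) * log(N / df); 0 when the attribute is absent from the document."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)


n_docs = 100
in_one_doc = lln_weight(tf=1, df=1, n_docs=n_docs)      # rare attribute
in_all_docs = lln_weight(tf=3, df=100, n_docs=n_docs)   # attribute present everywhere
typical = lln_weight(tf=3, df=10, n_docs=n_docs)
print(in_one_doc, in_all_docs, typical)
```

An attribute that occurs in every document gets weight 0 regardless of its term frequency, so it cannot help to tell documents apart.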

Page 36: Text Mining Overview

Attribute selection

Example – Information Gain

$$IG(w_i) = -\sum_{j=1}^{l} P(k_j)\log P(k_j) + P(w_i)\sum_{j=1}^{l} P(k_j|w_i)\log P(k_j|w_i) + P(\bar{w}_i)\sum_{j=1}^{l} P(k_j|\bar{w}_i)\log P(k_j|\bar{w}_i)$$

P(wi) – probability of encountering attribute wi in a randomly selected document

P(kj) – probability that a randomly selected document belongs to class kj

P(kj|wi) – probability that a document selected from those containing wi belongs to class kj

$\bar{w}_i$ – attribute wi is absent from the document

Statistical tests can also be applied to check if a feature – class correlation exists
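Information gain for a single attribute can be sketched by estimating the probabilities above as relative frequencies in a labelled document set. The four toy documents, the base-2 logarithm and the function names are assumptions of this example.

```python
import math


def plogp_sum(labels, classes):
    """Sum over classes of P(k) * log2 P(k), with P estimated from `labels`."""
    total = 0.0
    for k in classes:
        p = labels.count(k) / len(labels) if labels else 0.0
        if p > 0:
            total += p * math.log2(p)
    return total


def information_gain(docs, word):
    """docs is a list of (set_of_words, class_label) pairs."""
    n = len(docs)
    classes = {label for _, label in docs}
    all_labels = [label for _, label in docs]
    with_w = [label for words, label in docs if word in words]
    without_w = [label for words, label in docs if word not in words]
    ig = -plogp_sum(all_labels, classes)                        # class entropy
    ig += len(with_w) / n * plogp_sum(with_w, classes)          # attribute present
    ig += len(without_w) / n * plogp_sum(without_w, classes)    # attribute absent
    return ig


# Hypothetical labelled documents.
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "match"}, "politics"),
]
ig_goal = information_gain(docs, "goal")
ig_match = information_gain(docs, "match")
print(ig_goal, ig_match)
```

Here "goal" separates the two classes perfectly and gets the maximum gain of 1 bit, while "match" occurs equally often in both classes and gets a gain of 0, so attribute selection would keep the former and drop the latter.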

Page 37: Text Mining Overview

Attribute space remapping

• Attribute clustering
  • Attribute – class clustering
  • Semantic clustering
  • Clustering according to density function similarity
• Representation matrix processing (example - SVD)

Page 38: Text Mining Overview

Applications

• Classic
  • Mail analysis and mail routing
  • Event tracking
• Internet related
  • Web Content Mining and Web Farming
  • Focused crawling and assisted browsing

Page 39: Text Mining Overview

Thank you