Text Mining Overview
Piotr Gawrysiak ([email protected])
Warsaw University of Technology
Data Mining Group
22 November 2001
Topics
1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
   • Language processing methods
   • Single document processing
   • Document corpora processing
4. Document categorization – a closer look
5. Applications
   • Classic
   • Profiled document delivery
   • Related areas
     • Web Content Mining & Web Farming
WUT DMG NOV 2001
Natural Language Processing
• Natural language – a test for Artificial Intelligence (Alan Turing)
• NLP and NLU
• Linguistics – exploring the mysteries of language
  • William Jones
  • Comparative linguistics – Jakob Grimm, Rasmus Rask
  • Noam Chomsky
    • I-Language and E-Language
    • poverty of stimulus
• Statistical approaches – Markov and Shannon

Natural language processing (NLP) – anything that deals with text content
Natural language understanding (NLU) – semantics and logic
Information explosion
[Chart: growth from 1970 to 2000 of the number of books published weekly and the number of articles published monthly, on a log scale from 1 to 100,000]
• Increasing popularity of the Internet as a publishing medium
• Electronic media's minimal duplication costs
Primitive information retrieval and data management tools
Data Mining
Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization
Knowledge pyramid
[Diagram: knowledge pyramid – Signals, Data, Information, Knowledge, Wisdom (bottom to top); resources occupied shrink and the semantic level rises toward the top; the Data Mining area is marked on the lower layers]
Text Mining – a definition
Text Mining = Data Mining (applied to text data) + basic linguistics
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining tools

Language tools
• Linguistic analysis
• Thesauri, dictionaries, grammar analysers etc.
• Machine translation

Single document tools
• Automatic feature extraction
• Automatic summarization

Multiple document tools
• Document categorization
• Document clustering
• Information retrieval
• Visualization methods
Language analysis
• Syntactic analyser construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation

This is not that simple – consider for example:
"This is a delicious butter" – noun
"You should butter your toast" – verb

Rule-based systems or self-learning classification systems (using VMM and HMM)
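A disambiguator of this kind can be sketched with a couple of hand-written context rules (illustrative rules only; the VMM/HMM systems mentioned above learn such regularities from tagged corpora):

```python
# Toy context-rule disambiguator for the noun/verb ambiguity of "butter"
# (illustrative only; a real tagger would learn these rules, e.g. with an HMM).
VERB_LEFT = {"should", "can", "must", "will", "to"}  # modals and infinitive "to"

def tag_butter(words, target="butter"):
    """Tag each occurrence of `target` as 'verb' after a modal, else 'noun'."""
    tags = []
    for i, w in enumerate(words):
        if w == target:
            prev = words[i - 1] if i > 0 else ""
            tags.append("verb" if prev in VERB_LEFT else "noun")
    return tags

print(tag_butter("this is a delicious butter".split()))   # ['noun']
print(tag_butter("you should butter your toast".split())) # ['verb']
```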
Thesaurus construction
[Diagram: thesaurus fragment from the "Post and telecom" domain, linking Telephone, Cell phone, Fax machine, Data transmission network and Electronic mail under the broader term Telecommunications]

Thesaurus (semantic network) stores information about relationships between terms:
• Ascriptor – Descriptor relations
• "Broader term" – "Narrower term" relations
• "Related term" relations
"The U.S.S. Nashville arrived in Colon harbour with 42 marines"
"With the warship in Colon harbour, the Colombian troops withdrew"
Construction can be manual (but this is a laborious process) or automatic.
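As a toy illustration of the semantic-network structure (terms and links invented for the example), a thesaurus can be stored as a dictionary of BT/NT/RT relations:

```python
# Toy thesaurus as a semantic network: each term lists its broader (BT),
# narrower (NT) and related (RT) terms. Entries are illustrative only.
thesaurus = {
    "telecommunications": {"BT": [], "NT": ["telephone", "fax machine", "electronic mail"], "RT": []},
    "telephone": {"BT": ["telecommunications"], "NT": ["cell phone"], "RT": ["fax machine"]},
    "cell phone": {"BT": ["telephone"], "NT": [], "RT": []},
    "fax machine": {"BT": ["telecommunications"], "NT": [], "RT": ["telephone"]},
    "electronic mail": {"BT": ["telecommunications"], "NT": [], "RT": []},
}

def broader_chain(term):
    """Follow 'broader term' links up to the top of the hierarchy."""
    chain = []
    while True:
        broader = thesaurus.get(term, {}).get("BT", [])
        if not broader:
            return chain
        term = broader[0]
        chain.append(term)

print(broader_chain("cell phone"))  # ['telephone', 'telecommunications']
```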
Machine translation
Problems (source language: Polish, target language: English)

• Different vocabularies
• Different grammars and flexion rules
• Even different character sets

"Książka okazała się [adjective]" – "The book turned out to be [adjective]"

The same Polish sentence "W łóżku jest szybka" ("szybka" means both "window-pane" and "quick", feminine) translated at different levels:
• Word level: "In bed is window-pane"
• Syntactic level: "She is a window-pane in bed"
• Semantic level: "She is quick in bed"
• Knowledge representation level: via a formal knowledge representation language – "She is quick in bed"
Fully automatic approach

Based on learning word usage patterns from large corpora of translated documents (bitext)

Problems:
• Still quite few bitexts exist
• Sentences must be aligned prior to learning
  • Keyword matching
  • Sentence-length-based alignment
• Parameterisation is necessary
Feature extraction
Not all words are equally important:
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers

Discovering important terms:
• Finding lexical affinities (gap variance measurement)
• Dictionary-based methods
• Grammar-based heuristics
Examples of variant forms to unify:
• "Data bases" vs. "Databases"
• "Microsoft" vs. "Micro$oft"
• "MineIT"
• "Knowledge discovery in databases" vs. "Knowledge discovery in large databases" vs. "Knowledge discovery in big databases"
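The gap-variance idea can be sketched as follows: a word pair whose occurrence gap is nearly constant behaves like a multiword term (a minimal sketch; the window size and forward-only scan are assumptions):

```python
# Gap-variance sketch for discovering lexical affinities (illustrative):
# pairs that always occur at the same distance look like multiword terms.
def gap_variance(tokens, w1, w2, window=5):
    """Variance of the forward distance from w1 to nearby w2 occurrences."""
    gaps = []
    for i, t in enumerate(tokens):
        if t != w1:
            continue
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            if tokens[j] == w2:
                gaps.append(j - i)
    if not gaps:
        return None
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)

text = ("knowledge discovery in databases is knowledge discovery in large "
        "databases and knowledge discovery in big databases").split()
# "knowledge" -> "discovery" always at gap 1: variance 0, a stable affinity
print(gap_variance(text, "knowledge", "discovery"))  # 0.0
# "discovery" -> "databases" occurs at gaps 2, 3, 3: nonzero variance
print(gap_variance(text, "discovery", "databases"))
```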
Document summarization
Summary types: abstracts vs. extracts; indicative vs. informative summaries.

Summary creation methods (applied to an unknown document):
• statistical analysis of sentence and word frequency + dictionary analysis (i.e. "abstract", "conclusion" words etc.)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question-answer patterns, formatting, vocabulary shifts etc.)
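The statistical method can be sketched as a tiny extract generator that scores sentences by the corpus frequency of their non-stopword terms (toy stopword list and toy data, not the lecture's system):

```python
# Minimal extract-type summariser: score each sentence by the summed
# document frequency of its non-stopword terms, keep the top sentences.
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "in", "to", "it"}

def summarize(sentences, n=1):
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in s.lower().split() if w not in STOPWORDS)
    return sorted(sentences, key=score, reverse=True)[:n]

doc = [
    "Text mining extracts useful information from documents",
    "The weather was nice",
    "Mining documents requires processing",
]
print(summarize(doc, n=1))
```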
Document categorization & clustering
Clustering – dividing a set of documents into groups
Categorization – grouping based on a predefined category scheme
Typical categorization scenario:
Step 1: Create training hierarchy (e.g. Class 1, Class 2)
Step 2: Perform training – compute class fingerprints
Step 3: Actual classification of repository documents
Categorization/clustering system pipeline:
Documents → Representation conversion → Representation processing (deriving metrics) → Classic DM algorithm
• Clustering – k-means, agglomerative, ...
• Categorization – kNN, DT, Bayes, ...
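A minimal end-to-end sketch of this pipeline, with a bag-of-words representation, a cosine metric and a 1-nearest-neighbour classifier (toy training data; all details assumed):

```python
# Bag-of-words + cosine + 1-NN categorization sketch (toy data).
from collections import Counter
import math

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

training = [
    ("stocks and markets fell sharply", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the football match", "sport"),
    ("players scored in the final game", "sport"),
]

def classify(text):
    vec = bow(text)
    return max(training, key=lambda t: cosine(vec, bow(t[0])))[1]

print(classify("interest rates and markets"))  # finance
print(classify("the football players won"))    # sport
```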
Information retrieval
Two types of search methods
• exact match – in most cases uses some simple Boolean query specification language
• fuzzy – uses statistical methods to estimate relevance of the document
1999 data - Scooter (AltaVista) : 1.5GB RAM, 30GB disk, 4x533 MHz Alpha, 1GB/s I/O (crawler) - about 1 month needed to recrawl
Modern IR tools seem to be very effective... yet as of 2000 only 40–50% of the Web was indexed at all.
IR – exact match
Most popular method – inverted files
[Diagram: inverted file – for each index term (a ... z) a posting list of the documents containing it]

• Very fast
• Boolean queries very easy to process
• Very simple
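An inverted file reduces a Boolean AND query to a set intersection over posting lists; a minimal sketch (toy documents):

```python
# Inverted-file sketch: each term maps to the set of document ids containing
# it; a Boolean AND query is an intersection of the posting lists.
docs = {
    1: "data mining in large databases",
    2: "text mining of documents",
    3: "large text repositories",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    postings = [index.get(t, set()) for t in terms]
    result = postings[0].copy()
    for p in postings[1:]:
        result &= p
    return result

print(boolean_and("text", "mining"))  # {2}
print(boolean_and("large"))           # {1, 3}
```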
IR – fuzzy search
sim(d, q) = cos(d, q) = (Σ_i d_i · q_i) / (‖d‖ · ‖q‖)
Query can be a set of keywords, a document, or even a set of documents – also represented as a vector
Documents are represented as vectors over word (feature) space
[Diagram: iterative retrieval loop – initial query → IR → output selection → refined query against the repository → output]

It's possible to perform this iteratively – relevance feedback
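A vector-space sketch of fuzzy search, with a crude relevance-feedback step that simply adds a relevant document's vector to the query (a Rocchio-like simplification assumed here, not the exact method from the slides):

```python
# Vector-space retrieval sketch: documents and queries are term-count
# vectors compared with cosine similarity; feedback boosts query terms.
from collections import Counter
import math

def vec(text):
    return Counter(text.lower().split())

def cos_sim(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

docs = ["text mining overview", "mining coal in mines", "text categorization methods"]

def rank(query_vec):
    return sorted(docs, key=lambda d: cos_sim(query_vec, vec(d)), reverse=True)

q = vec("text mining")
print(rank(q))
# relevance feedback: add the vector of a document the user marked relevant
q_fb = q + vec(docs[0])
print(rank(q_fb)[0])  # 'text mining overview'
```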
Document visualization
Peak represents many strongly related documents
Water represents assorted documents, creating semantic noise
Island represents several documents sharing a similar subject, separated from the others – hence creating a group of interest
Document categorization
A closer look
Measuring quality
Binary categorization scenario is analogous to document retrieval:

dr – relevant documents
ds – documents labelled as relevant
DB – document database

PR = |dr ∩ ds| / |ds|                        (precision)
R  = |dr ∩ ds| / |dr|                        (recall)
A  = (|dr ∩ ds| + |DB − (dr ∪ ds)|) / |DB|   (accuracy)
FO = |ds − dr| / |DB − dr|                   (fallout)
WUTDMGNOV 2001
Metrics

With contingency counts a (relevant and labelled relevant), b (labelled relevant only), c (relevant only), d (neither):

0 ≤ PR(f,g) ≤ 1;  PR(f,g) = a / (a + b)
0 ≤ R(f,g) ≤ 1;   R(f,g) = a / (a + c)
A(f,g) = (a + d) / (a + b + c + d)
0 ≤ FO(f,g) ≤ 1;  FO(f,g) = b / (b + d)

Combined measure: F_α = 1 / (α · (1/PR) + (1 − α) · (1/R)); for α = 1/2 this is F1 = 2 · PR · R / (PR + R)
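The four measures and F1 computed from the contingency counts (assuming the usual mapping a = relevant and selected, b = selected only, c = relevant only, d = neither):

```python
# Categorization quality metrics from contingency counts (a, b, c, d).
def metrics(a, b, c, d):
    pr = a / (a + b)                  # precision
    r = a / (a + c)                   # recall
    acc = (a + d) / (a + b + c + d)   # accuracy
    fo = b / (b + d)                  # fallout
    f1 = 2 * pr * r / (pr + r)        # F-measure for alpha = 1/2
    return pr, r, acc, fo, f1

pr, r, acc, fo, f1 = metrics(a=40, b=10, c=20, d=30)
print(pr, r, acc, fo)  # 0.8 0.666... 0.7 0.25
```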
Multiple class scenario
Given classes M = {M1, M2, ..., Ml} with per-class precisions PR = {PR1, PR2, ..., PRl}:

Macro-averaging:  PR_macro(f,g) = (1/l) · Σ_{i=1..l} PR_i
Micro-averaging:  pool the contingency counts of all classes first, then compute a single PR
Categorization example
Document representations
• unigram representations (bag-of-words)
  • binary
  • multivariate
• n-gram representations
• -gram representation
• positional representation
Bigram example
Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
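Extracting the bigram representation of these two lines is a one-liner over adjacent word pairs (word-level bigrams assumed):

```python
# Word-bigram representation of the Jabberwocky lines above.
from collections import Counter

text = ("Twas brillig and the slithy toves "
        "Did gyre and gimble in the wabe").lower().split()
bigrams = Counter(zip(text, text[1:]))
print(bigrams.most_common(3))
```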
Probabilistic interpretation
R(G(R(D))) = R(D)

Operations:
• R(D) – creating representation R from document D
• G(R) – generating document D based on representation R

unigram-generated text: "said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no"

bigram-generated text: "Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter."
Positional representation

[Plot: occurrence positions of the words "any" and "dumpty" within the text]
Creating positional representation

For a word v ∈ V the density at position k is estimated with a sliding window of width 2r: an occurrence at position p contributes 1 when it falls within the window around k and 0 otherwise, and the window counts are normalised over the n occurrences of the word.

[Figure: word occurrences along the document; a window of width 2r around position k covers two occurrences, so f(k) = 2 before normalisation]
Examples

[Plots: positional density f for the words "any" and "dumpty", computed with window radius r = 500 and r = 5000]
Processing representations
Zipf's law

[Plot: word frequency vs. word ID (frequency rank), log scale]

Word – Frequency
The – 1664
And – 940
To – 789
A – 788
It – 683
You – 666
I – 658
She – 543
Of – 538
said – 473
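Zipf's law predicts rank × frequency ≈ constant; a quick check on the table above (the familiar deviation at the very top ranks shows up):

```python
# Zipf's-law check: rank * frequency should stay roughly constant.
freqs = {"the": 1664, "and": 940, "to": 789, "a": 788, "it": 683,
         "you": 666, "i": 658, "she": 543, "of": 538, "said": 473}

ranked = sorted(freqs.values(), reverse=True)
products = [rank * f for rank, f in enumerate(ranked, start=1)]
print(products)
```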
Stopwords?
"There is no information about penguins in this document" → "information penguins document"
• Expanding
• Trimming
• Scaling functions
• Attribute selection
• Remapping attribute space
Expanding and trimming
Expanding – smoothing n-gram probability estimates:

Laplace:   P_lap(v_i | w_{k−n+1}, ..., w_k) = (n_{x,y} + 1) / (n_x + M_s)
Lidstone:  P_lid(v_i | w_{k−n+1}, ..., w_k) = (n_{x,y} + λ) / (n_x + λ · M_s)

where n_{x,y} is the number of times context x is followed by word y, n_x = Σ_j n_{x,j}, and M_s is the vocabulary size.

Representation processing
Scaling

TF/IDF (the lln weighting):

w_lln(d_j, i) = (1 + log tf_ij) · log(N / df_i)   for tf_ij > 0, otherwise 0

term frequency tf_i, document frequency df_i, N – all documents in the system
• Attribute present in one document – maximal IDF factor, log N
• Attribute present in all documents – IDF factor log(N/N) = 0
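A minimal TF/IDF computation using the (1 + log tf) · log(N/df) weighting (one of several variants of the scheme; toy documents):

```python
# TF/IDF sketch: w = (1 + log tf) * log(N / df), 0 when the term is absent.
import math

docs = [
    "data mining mining methods",
    "text mining overview",
    "database systems overview",
]
N = len(docs)
tokenized = [d.split() for d in docs]
df = {}
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tfidf(term, doc_idx):
    tf = tokenized[doc_idx].count(term)
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df[term])

print(tfidf("mining", 0))    # boosted: frequent here, present in 2 of 3 docs
print(tfidf("overview", 2))  # present once here, in 2 of 3 docs
print(tfidf("data", 1))      # 0.0 – not in this document
```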
Attribute selection

Example – Information Gain:

IG(w_i) = −Σ_{j=1..l} P(k_j) log P(k_j)
          + P(w_i) Σ_{j=1..l} P(k_j|w_i) log P(k_j|w_i)
          + P(w̄_i) Σ_{j=1..l} P(k_j|w̄_i) log P(k_j|w̄_i)
P(w_i) – probability of encountering attribute w_i in a randomly selected document
P(k_j) – probability that a randomly selected document belongs to class k_j
P(k_j|w_i) – probability that a document selected from those containing w_i belongs to class k_j

Statistical tests can also be applied to check whether a feature–class correlation exists
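The Information Gain formula in code, for a binary attribute and toy labelled documents (log base 2 assumed):

```python
# Information gain of a binary attribute over toy labelled documents:
# IG = H(class) - sum over attribute values of P(value) * H(class | value).
import math

def entropy_term(p):
    return -p * math.log2(p) if p > 0 else 0.0

def info_gain(docs):
    """docs: list of (has_attribute: bool, class_label) pairs."""
    n = len(docs)
    classes = {c for _, c in docs}
    h = sum(entropy_term(sum(1 for _, c in docs if c == k) / n) for k in classes)
    gain = h
    for present in (True, False):
        subset = [c for a, c in docs if a == present]
        if not subset:
            continue
        p_w = len(subset) / n
        h_cond = sum(entropy_term(subset.count(k) / len(subset)) for k in classes)
        gain -= p_w * h_cond
    return gain

# attribute perfectly predicts the class -> gain equals class entropy (1 bit)
docs = [(True, "spam"), (True, "spam"), (False, "ok"), (False, "ok")]
print(info_gain(docs))  # 1.0
```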
Attribute space remapping

• Attribute clustering
  • Attribute–class clustering (according to density function similarity)
  • Semantic clustering
• Representation matrix processing (example – SVD)
Applications
• Classic
  • Mail analysis and mail routing
  • Event tracking
• Internet related
  • Web Content Mining and Web Farming
  • Focused crawling and assisted browsing
Thank you