
Page 1: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data
Page 2: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What is Text Mining?
What are the application areas?
What are the challenges?
What are the tools?

Page 3: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Why Text Mining?

[Bar chart: percentage of information held as collections of text versus structured data]

Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Page 4: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Prelude

• Amount of information is growing exponentially
• Majority of information is stored in text documents:
  – journals, web pages, emails, reports, memos, social networks, ...
• Extracting useful knowledge from all this information is vital

Page 5: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data
Page 6: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Need to automatically analyze, summarize text

Page 7: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining : Briefly

• Text Mining: the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data – semi-structured or unstructured text.

• Techniques used:
  – data mining
  – machine learning
  – information retrieval
  – statistics
  – natural-language understanding
  – case-based reasoning

Ref:[1]-[4]

Page 8: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What is Text Mining?

The discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured or semi-structured textual resources.

The emphasis is on genuinely new information.

Page 9: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What are the important implications of this definition?

• Genuinely new information – not merely finding patterns
• Free, naturally occurring text – not DB records, HTML, XML, ...

Page 10: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What is Text Mining?

• seeks to extract useful information from a collection of documents
• similar to data mining, but the data sources are unstructured or semi-structured documents
  – data mining, by contrast, deals with structured numeric data, typically in a data warehouse

Page 11: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What Is Text Mining?

“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.”

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.”

Page 12: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

What is Text Mining?

• “…finding interesting regularities in large textual datasets…”

• “…finding semantic and abstract information from the surface form of textual data…”

• The process of deriving high quality information from text

Is this easy?

Page 13: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Typical steps involved in Text Mining

Page 14: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining is difficult

• Abstract concepts are difficult to represent
• "Countless" combinations of subtle, abstract relationships among concepts
• Many ways to represent similar concepts
  – e.g. space ship, flying saucer, UFO
• Concepts are difficult to visualize
• High dimensionality
  – tens or hundreds of thousands of features

Page 15: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining is difficult

• Automatic learning methods are typically supervised
  – annotating training data is a time-consuming and expensive task
• Can we develop better unsupervised algorithms?
• Can we use a small set of labeled examples more efficiently?

Page 16: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining is Difficult

• Data collection is "free text"
  – Data is not well-organized
    • Semi-structured or unstructured
  – Natural language text contains ambiguities on many levels
    • Lexical, syntactic, semantic, and pragmatic
  – Learning techniques for processing text typically need annotated training examples
    • Consider bootstrapping techniques

Page 17: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining is Easy

• Highly redundant data
  – most of the methods count on this property
• Just about any simple algorithm can get "good" results for simple tasks:
  – Pull out "important" phrases
  – Find "meaningfully" related words
  – Create some sort of summary from documents

Page 18: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

How is Text Mining done?

• structure the input text – pre-processing & representation
• identification / extraction of representative features
• derive patterns within the structured data – analysis
• evaluation and interpretation of the output – evaluation (see the sketch below)
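To make these four steps concrete, here is a minimal, hedged sketch in Python, assuming scikit-learn is installed. The tiny corpus, the choice of TF-IDF features, k-means as the analysis step, and the silhouette score as the evaluation are all illustrative choices, not part of the original slides.

```python
# Minimal sketch of the four steps above (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "the casino company offered to acquire all common shares",
    "shareholders voted on the acquisition of the resort chain",
    "the ceramics collection includes pottery and collectibles",
    "the museum displays a large collection of ceramic art",
]

# 1) Structure the input text: pre-processing & representation
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")

# 2) Identification / extraction of representative features
X = vectorizer.fit_transform(docs)            # sparse document-term matrix

# 3) Derive patterns within the structured data: analysis
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 4) Evaluation and interpretation of the output
print(model.labels_)                          # cluster assignment per document
print(silhouette_score(X, model.labels_))     # one possible quality measure
```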

Page 19: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Typical tasks

– Categorization
– Clustering
– Concept/entity extraction
– Production of taxonomies
– Sentiment analysis
– Document summarization
– Entity relation modeling

Page 20: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text summarization
• the procedure of extracting partial content that reflects the document as a whole

Text categorization
• the procedure of assigning a category to a text from among predefined categories (a small sketch follows below)

Text clustering
• the procedure of dividing texts into several clusters, where each cluster contains texts relevant to one another and each cluster differs from the others
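As an illustration of the categorization task, here is a hedged sketch using scikit-learn (assumed installed); the toy training documents and the two categories "sport" and "finance" are invented for the example.

```python
# Toy text categorization with a Naive Bayes classifier (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "the team won the match after extra time",
    "the striker scored twice in the final",
    "shares fell sharply after the earnings report",
    "the bank raised interest rates again",
]
train_labels = ["sport", "sport", "finance", "finance"]

# Bag-of-words features + multinomial Naive Bayes in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["the striker scored a goal",
                   "the bank cut interest rates"]))
# on this toy data: ['sport' 'finance']
```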

Page 21: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Text Mining is Multidisciplinary

• TM exploits techniques / methodologies from:
  – data mining
  – machine learning
  – information retrieval
  – corpus-based computational linguistics & natural language processing
  – statistics
  – ...

Page 22: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data


Applications of Text Mining

• Direct applications
  – Discovery-driven (bioinformatics, business intelligence, etc.): we have specific questions; how can we exploit text mining to answer them?
  – Data-driven (WWW, literature, email, customer reviews, etc.): we have a lot of textual data; what can we do with it?
• Indirect applications
  – Assist information access (e.g., discover latent topics to better summarize search results)
  – Assist information organization (e.g., discover hidden structures)

Adapted from Presentation by ChengXiang (“Cheng”) Zhai

Page 23: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Sample Application Areas of Text Mining

Search engine technology: find relevant documents among a vast repository of documents based on a word or phrase, which may include misspellings.

Analysis of survey data: analyze information in open-ended survey questions.

Spam identification: analyze the title line and contents of e-mails to identify which are spam and which are legitimate.

Surveillance: monitor telephone, internet and other communications for evidence of terrorism.

Call center routing: route calls to help desks and technical support lines based on verbal answers to questions.

Public health early warning: the Global Public Health Intelligence Network (GPHIN) monitors global newspaper articles and other media to provide an early warning of potential public health threats, including disease epidemics such as SARS, and chemical or radioactive threats.

Alias identification: the aliases of health care and other providers are analyzed to detect over-billing and fraud. For instance, a bill may have been submitted by John Smith, J. Smith and Smith, John. The same approaches may be used to identify abuse by claimants, where given claimants submit numerous insurance claims under different aliases.

Page 24: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

• End of Hour 1 (lecture 1)

Page 25: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Levels of text representations

Lexical
  – Character (character n-grams and sequences)
  – Words (stop-words, stemming, lemmatization)
  – Phrases (word n-grams, proximity features)
  – Part-of-speech tags
  – Taxonomies / thesauri

Syntactic
  – Vector-space model
  – Language models
  – Full parsing
  – Cross-modality

Semantic
  – Collaborative tagging / Web 2.0
  – Templates / Frames
  – Ontologies / First-order theories

Page 26: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Character level

• The character-level representation of a text consists of sequences of characters
  – a document is represented by a frequency distribution of sequences
  – usually we deal with contiguous strings
  – each character sequence of length 1, 2, 3, ... represents a feature with its frequency

Lexical
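A character-level representation of this kind can be collected in a few lines of Python; the following is a rough sketch (the choice of n = 3 and the sample sentence are just for illustration).

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequency distribution of contiguous character n-grams of a document."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

doc = "Text mining finds nuggets in mountains of textual data"
print(char_ngrams(doc, n=3).most_common(5))   # the five most frequent trigrams
```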

Page 27: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Good and bad sides

• The representation has several important strengths:
  – it is very robust, since it avoids language morphology
    • (useful for e.g. language identification)
  – it captures simple patterns on the character level
    • (useful for e.g. spam detection, copy detection)
  – because of the redundancy in text data it can be used for many analytic tasks
    • (learning, clustering, search)
• It is used as the basis for "string kernels" in combination with SVMs for capturing complex character sequence patterns
• For deeper semantic tasks, however, the representation is too weak

Lexical

Page 28: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Word level

• The most common representation of text, used by many techniques
  – there are many tokenization software packages which split text into words
• Important to know:
  – a word is a well-defined unit in western languages
  – e.g. Chinese has a different notion of the basic semantic unit

Lexical
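As a hedged illustration, a very crude word-level tokenizer can be written with a regular expression; real tokenization packages handle punctuation, numbers, abbreviations and non-western scripts much more carefully.

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Resorts also has about 6,432,000 Class A common shares."))
# -> ['resorts', 'also', 'has', 'about', 'class', 'a', 'common', 'shares']
```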

Page 29: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Word Properties

• Relations among word surface forms and their senses:
  – Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  – Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  – Synonymy: different form, same meaning (e.g. singer, vocalist)
  – Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
• Word frequencies in texts follow a power-law distribution:
  – a small number of very frequent words
  – a large number of low-frequency words

Lexical
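The power-law claim is easy to check on any sizeable text; the sketch below assumes a plain-text file (the filename is hypothetical) and simply ranks words by frequency.

```python
# Illustrative check of the power-law (Zipf-like) shape of word frequencies.
from collections import Counter
import re

with open("some_corpus.txt", encoding="utf-8") as f:    # hypothetical file
    words = re.findall(r"[a-z]+", f.read().lower())

freq = Counter(words)
for rank, (word, count) in enumerate(freq.most_common(10), start=1):
    print(rank, word, count)
# Plotting log(rank) against log(count) gives a roughly straight line:
# a few very frequent words and a long tail of rare ones.
```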

Page 30: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Stop-words

• Stop-words are words that, from a non-linguistic view, do not carry information
  – they have a mainly functional role
  – usually we remove them to help the methods perform better
• Stop-words are language dependent – examples:
  – English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ...
  – Dutch: de, en, van, ik, te, dat, die, in, een, hij, het, niet, zijn, is, was, op, aan, met, als, voor, had, er, maar, om, hem, dan, zou, of, wat, mijn, men, dit, zo, ...
  – Slovenian: A, AH, AHA, ALI, AMPAK, BAJE, BODISI, BOJDA, BRŽKONE, BRŽČAS, BREZ, CELO, DA, DO, ...

Lexical
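A hedged sketch of stop-word removal; the stop list below is a tiny illustrative sample, not a real per-language list (production systems typically use curated lists of a few hundred words, or the lists bundled with NLP toolkits).

```python
# Toy stop-word filtering (illustrative stop list only).
STOP_WORDS = {"a", "about", "above", "across", "after", "again", "against",
              "all", "almost", "alone", "along", "already", "the", "of", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "discovery", "of", "patterns", "in", "text"]))
# -> ['discovery', 'patterns', 'text']
```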

Page 31: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Word character level normalization

• A hassle which we usually try to avoid:
  – since there are plenty of character encodings in use, it is often nontrivial to identify a word and write it in a unique form
  – e.g. in Unicode the same word can be written in many ways – canonicalization of words is needed

Lexical
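One concrete form of canonicalization is Unicode normalization, available in Python's standard library; the example word is arbitrary.

```python
# The same visible word can be encoded differently in Unicode;
# normalizing to NFC makes the two encodings compare equal.
import unicodedata

w1 = "na\u00efve"     # 'naïve' with a precomposed ï
w2 = "nai\u0308ve"    # 'naïve' written as 'i' + combining diaeresis

print(w1 == w2)                                         # False
print(unicodedata.normalize("NFC", w1) ==
      unicodedata.normalize("NFC", w2))                 # True
```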

Page 32: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Stemming (1/2)

• Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meaning (e.g. learns, learned, learning, ...)
• Stemming is the process of transforming a word into its stem (normalized form)
  – stemming provides an inexpensive mechanism to merge the different forms of a word into a single feature

Lexical

Page 33: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Stemming (2/2)

• For English, the Porter stemmer is most commonly used:
  http://www.tartarus.org/~martin/PorterStemmer/
• Example cascade rules used in the English Porter stemmer:
  – ATIONAL -> ATE   relational -> relate
  – TIONAL -> TION   conditional -> condition
  – ENCI -> ENCE     valenci -> valence
  – ANCI -> ANCE     hesitanci -> hesitance
  – IZER -> IZE      digitizer -> digitize
  – ABLI -> ABLE     conformabli -> conformable
  – ALLI -> AL       radicalli -> radical
  – ENTLI -> ENT     differentli -> different
  – ELI -> E         vileli -> vile
  – OUSLI -> OUS     analogousli -> analogous

Lexical
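A hedged example using the Porter stemmer implementation shipped with NLTK (any equivalent stemmer would do; `pip install nltk` is assumed).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["learns", "learned", "learning", "relational",
             "conditional", "digitizer", "analogously"]:
    print(word, "->", stemmer.stem(word))
# learns, learned and learning all reduce to the same stem 'learn';
# the stems of the other words follow the cascade rules listed above.
```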

Page 34: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Phrase level

• Instead of having just single words, we can deal with phrases
• We use two types of phrases:
  – phrases as frequent contiguous word sequences
  – phrases as frequent non-contiguous word sequences
  – both types of phrases can be identified by a simple dynamic programming algorithm
• The main effect of using phrases is to identify sense more precisely

Lexical
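For the contiguous case, frequent word bigrams can be counted directly; the sketch below (toy documents, with an arbitrary `min_count` threshold) is a simplification of the dynamic-programming approach mentioned above.

```python
from collections import Counter
import re

def frequent_bigrams(texts, min_count=2):
    """Count contiguous word pairs and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [(bg, c) for bg, c in counts.most_common() if c >= min_count]

docs = ["the ceramics company started a ceramics collection",
        "the ceramics collection of the ceramics company"]
print(frequent_bigrams(docs))
# e.g. [(('the', 'ceramics'), 3), (('ceramics', 'company'), 2), (('ceramics', 'collection'), 2)]
```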

Page 35: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Google n-gram corpus

• In September 2006 Google announced the availability of an n-gram corpus:
  – http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#links
  – Some statistics of the corpus:
    • File sizes: approx. 24 GB compressed (gzip'ed) text files
    • Number of tokens: 1,024,908,267,229
    • Number of sentences: 95,119,665,584
    • Number of unigrams: 13,588,391
    • Number of bigrams: 314,843,401
    • Number of trigrams: 977,069,902
    • Number of fourgrams: 1,313,818,354
    • Number of fivegrams: 1,176,470,663

Lexical

Page 36: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Example: Google n-grams (3-grams with their counts)

• ceramics collectables collectibles 55; ceramics collectables fine 130; ceramics collected by 52; ceramics collectible pottery 50;
  ceramics collectibles cooking 45; ceramics collection , 144; ceramics collection . 247; ceramics collection </S> 120;
  ceramics collection and 43; ceramics collection at 52; ceramics collection is 68; ceramics collection of 76;
  ceramics collection | 59; ceramics collections , 66; ceramics collections . 60; ceramics combined with 46;
  ceramics come from 69; ceramics comes from 660; ceramics community , 109; ceramics community . 212;
  ceramics community for 61; ceramics companies . 53; ceramics companies consultants 173; ceramics company ! 4432;
  ceramics company , 133; ceramics company . 92; ceramics company </S> 41; ceramics company facing 145;
  ceramics company in 181; ceramics company started 137; ceramics company that 87; ceramics component ( 76;
  ceramics composed of 85

• serve as the incoming 92; serve as the incubator 99; serve as the independent 794; serve as the index 223;
  serve as the indication 72; serve as the indicator 120; serve as the indicators 45; serve as the indispensable 111;
  serve as the indispensible 40; serve as the individual 234; serve as the industrial 52; serve as the industry 607;
  serve as the info 42; serve as the informal 102; serve as the information 838; serve as the informational 41;
  serve as the infrastructure 500; serve as the initial 5331; serve as the initiating 125; serve as the initiation 63;
  serve as the initiator 81; serve as the injector 56; serve as the inlet 41; serve as the inner 87;
  serve as the input 1323; serve as the inputs 189; serve as the insertion 49; serve as the insourced 67;
  serve as the inspection 43; serve as the inspector 66; serve as the inspiration 1390; serve as the installation 136;
  serve as the institute 187

Lexical

Here, "n-gram" refers to word n-grams; in other contexts n-grams range over characters.

Page 37: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Part-of-Speech level

• Introducing part-of-speech tags gives us word types, enabling us to differentiate the functions of words
  – For text analysis, part-of-speech information is used mainly for "information extraction", where we are interested in e.g. named entities, which are "noun phrases"
  – Another possible use is reduction of the vocabulary (features)
    • it is known that nouns carry most of the information in text documents
• Part-of-speech taggers are usually learned with HMM algorithms on manually tagged data

Lexical
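A hedged example with NLTK's off-the-shelf tagger (`pip install nltk`; the tokenizer and tagger models must be downloaded once via `nltk.download()`, and the exact resource names vary slightly between NLTK versions).

```python
import nltk

tokens = nltk.word_tokenize("Donald Trump has offered to acquire all Class B shares.")
tagged = nltk.pos_tag(tokens)
print(tagged)   # e.g. [('Donald', 'NNP'), ('Trump', 'NNP'), ('has', 'VBZ'), ...]

# Keeping only nouns (tags starting with 'NN') is one way to shrink the vocabulary.
print([word for word, tag in tagged if tag.startswith("NN")])
```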

Page 38: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Part-of-Speech Table

http://www.englishclub.com/grammar/parts-of-speech_1.htm

Lexical

Page 39: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Part-of-Speech examples

http://www.englishclub.com/grammar/parts-of-speech_2.htm

Lexical

Page 40: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Taxonomies/thesaurus level

• The main function of a thesaurus is to connect different surface word forms with the same meaning into one sense (synonyms)
  – additionally, we often use the hypernym relation to relate general to specific word senses
  – by using synonyms and the hypernym relation we compact the feature vectors
• The most commonly used general thesaurus is WordNet, which also exists for many other languages (e.g. EuroWordNet)
  – http://www.illc.uva.nl/EuroWordNet/

Lexical
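A hedged sketch of how WordNet (here accessed through NLTK; the `wordnet` corpus must be downloaded once) can map different surface forms onto a shared sense and relate specific senses to general ones.

```python
from nltk.corpus import wordnet as wn

# 'singer' and 'vocalist' share the synset singer.n.01, so both surface
# forms can be mapped to a single feature.
for word in ["singer", "vocalist"]:
    print(word, "->", [s.name() for s in wn.synsets(word, pos=wn.NOUN)])

# The hypernym relation links specific senses to more general ones.
print(wn.synset("breakfast.n.01").hypernyms())   # [Synset('meal.n.01')]
```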

Page 41: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

WordNet database of lexical relations

• WordNet is the most well developed and widely used lexical database for English
  – it consists of 4 databases (nouns, verbs, adjectives, and adverbs)
• Each database consists of sense entries – each sense consists of a set of synonyms, e.g.:
  – musician, instrumentalist, player
  – person, individual, someone
  – life form, organism, being

  Category     Unique Forms    Number of Senses
  Noun         94474           116317
  Verb         10319           22066
  Adjective    20170           29881
  Adverb       4546            5677

Lexical

Page 42: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

WordNet – excerpt from the graph

[Figure: a fragment of the WordNet relation graph. Senses such as bird, duck, goose, hen, chicken, hawk, turtle, beak, bill, wing, claw, leg, feather, meat, egg, poultry and actions such as fly, quack, chatter, peck, preen, strike, catch, clean are linked by relation types including Is_a, Part, Typ_subj, Typ_obj, Means, Purpose, Caused_by, Not_is_a, Quesp and Classifier. Overall the database contains 26 relation types over roughly 116k senses.]

Lexical

Page 43: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

WordNet relations

• Each WordNet entry is connected with other entries in the graph through relations
• Relations in the database of nouns:

  Relation     Definition                       Example
  Hypernym     From lower to higher concepts    breakfast -> meal
  Hyponym      From concepts to subordinates    meal -> lunch
  Has-Member   From groups to their members     faculty -> professor
  Member-Of    From members to their groups     copilot -> crew
  Has-Part     From wholes to parts             table -> leg
  Part-Of      From parts to wholes             course -> meal
  Antonym      Opposites                        leader -> follower
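The relations in the table can be navigated programmatically; below is a hedged sketch with NLTK's WordNet interface (the `wordnet` corpus download is assumed, and the chosen synsets are just examples).

```python
from nltk.corpus import wordnet as wn

meal = wn.synset("meal.n.01")
print(meal.hyponyms()[:3])        # Hyponym: more specific meals (breakfast, lunch, ...)
print(meal.hypernyms())           # Hypernym: the more general concept

table = wn.synset("table.n.02")   # the furniture sense of 'table'
print(table.part_meronyms())      # Has-Part: its parts, e.g. a leg

leader = wn.lemma("leader.n.01.leader")
print(leader.antonyms())          # Antonym: e.g. the lemma 'follower'
```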

Page 44: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Vector-space model level

• The most common way to deal with documents is first to transform them into sparse numeric vectors and then handle them with linear algebra operations
  – by doing this, we forget everything about the linguistic structure within the text
  – this is sometimes called the "structural curse", although this way of forgetting about the structure doesn't harm the efficiency of solving many relevant problems
  – this representation is also referred to as the "Bag-of-Words" or "Vector-Space Model"
  – typical tasks on the vector-space model are classification, clustering, visualization, etc.

Syntactic
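A hedged sketch of the bag-of-words / vector-space representation with scikit-learn; the three short documents are trimmed from the Resorts example shown on a later slide, and TF-IDF weighting is just one common choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Trump makes bid for control of Resorts",
    "Resorts also has Class A common shares outstanding",
    "Each Class B share has 100 times the voting power of a Class A share",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse matrix: one row per document

print(X.shape)                            # (3, number of distinct terms)
print(vectorizer.get_feature_names_out()) # the vocabulary, i.e. the feature names
print(X.toarray().round(2))               # TF-IDF weight of each term per document
```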

Page 45: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Bag-of-words document representation

Syntactic

Page 46: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Example document and its vector representation

• TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate developer Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares. Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.

• [RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171] [ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119] [DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097] [COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075] [SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065] [ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062] [OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049] [INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022] [MARCH:0.011]

Original text

Bag-of-Words representation (high-dimensional sparse vector)

Syntactic

Page 47: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

Ontologies level

• Ontologies are the most general formalism for describing data objects
  – in recent years ontologies became popular through the Semantic Web and the OWL standard
  – ontologies can be of various complexity – from relatively simple, light-weight ones to heavy-weight ontologies described with first-order theories
  – ontologies can also be understood as very generic data models in which we can store information extracted from text

Semantic

Page 48: Why Text Mining? Adapted from Text Mining: Finding Nuggets in Mountains of Textual Data

END OF LECTURE