Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Dörre, Peter Gerstl, and Roland Seiffert

Post on 19-Dec-2015

Overview

- Introduction to Mining Text
- How Text Mining Differs from Data Mining
- Mining Within a Document: Feature Extraction
- Mining in Collections of Documents: Clustering and Categorization
- Text Mining Applications
- Exam Questions/Answers

Introduction to Mining Text

Reasons for Text Mining

[Figure: bar chart. Axis: Percentage (0 to 90). Bars: Collections of Text, Structured Data.]

Corporate Knowledge “Ore”

- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- Customer complaint letters
- Contracts
- Transcripts of phone calls with customers
- Technical documents

Challenges in Text Mining

- Information is in unstructured textual form.
- Not readily accessible to be used by computers.
- Dealing with huge collections of documents.

Two Mining Phases

- Knowledge Discovery: Extraction of codified information (features)
- Information Distillation: Analysis of the feature distribution

How Text Mining Differs from Data Mining

Comparison of Procedures

Data Mining
- Identify data sets
- Select features
- Prepare data
- Analyze distribution

Text Mining
- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution
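The text mining procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: features are plain lowercase word tokens, the selection algorithm is a document-frequency threshold, and the function names (`extract_features`, `select_features`, `prepare_data`) are hypothetical, not part of any IBM tool.

```python
import re
from collections import Counter


def extract_features(text):
    """Extract features: here simply lowercase word tokens (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def select_features(docs_tokens, min_df=2):
    """Select features by algorithm: keep terms found in at least min_df documents."""
    document_frequency = Counter()
    for tokens in docs_tokens:
        document_frequency.update(set(tokens))
    return sorted(t for t, n in document_frequency.items() if n >= min_df)


def prepare_data(docs_tokens, vocabulary):
    """Prepare data: one sparse feature vector (term -> count) per document."""
    vocab = set(vocabulary)
    return [Counter(t for t in tokens if t in vocab) for tokens in docs_tokens]


# Identify documents (a toy collection).
documents = [
    "The bank raised its credit line.",
    "A commercial bank issued a convertible bond.",
    "The credit facility was extended by the bank.",
]
tokens = [extract_features(d) for d in documents]
vocabulary = select_features(tokens, min_df=2)
vectors = prepare_data(tokens, vocabulary)

# Analyze distribution: total count of each selected feature in the collection.
distribution = sum(vectors, Counter())
```

The point of the sketch is the extra two steps compared with data mining: features have to be extracted from the text first, and then selected automatically rather than by hand.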

IBM Intelligent Miner for Text

- SDK: Software Development Kit
- Contains necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - drop-in Intranet search solutions

Mining Within a Document: Feature Extraction

Feature Extraction

- To recognize and classify significant vocabulary items in unrestricted natural language texts.
- Let’s see an example…

Example of Vocabulary Found

- Certificate of deposit
- CMOs
- Commercial bank
- Commercial paper
- Commercial Union Assurance
- Commodity Futures Trading Commission
- Consul Restaurant
- Convertible bond
- Credit facility
- Credit line
- Debt security
- Debtor country
- Detroit Edison
- Digital Equipment
- Dollars of debt
- End-March
- Enserch
- Equity warrant
- Eurodollar
- …

Implementation of Feature Extraction relies on

- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information, such as part-of-speech information
- Not used: huge amounts of lexicalized information
- Not used: in-depth syntactic and semantic analyses of texts
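To make the idea of heuristic pattern matching concrete, here is a minimal Python sketch. It assumes one very simple heuristic (a run of two or more capitalized words is a candidate name or multiword term); the pattern and the function `candidate_names` are illustrative assumptions, not the rules used by the IBM feature extractor.

```python
import re

# Assumed heuristic: two or more consecutive capitalized words.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Return capitalized word runs as candidate names, in order of appearance."""
    return CAPITALIZED_RUN.findall(text)


sample = (
    "Shares of Detroit Edison and Digital Equipment rose, while the "
    "Commodity Futures Trading Commission met."
)
names = candidate_names(sample)
```

A pattern like this needs no lexicon and no parsing, which is what makes it fast. It also shows the cost of that choice: a capitalized sentence-initial word directly before a name would be swallowed into the match, so real heuristics need additional rules.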

Goals of Feature Extraction

- Very fast processing to be able to deal with mass data
- Domain-independence for general applicability

Extracted information categories

- Names of persons, organizations and places
- Multiword terms
- Abbreviations
- Relations
- Other useful stuff

Canonical Forms

- Normalized forms of dates, numbers, …
- Allows applications to use information very easily
- Abstracts from different morphological variants of a single term
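As a rough illustration of canonical forms, the following Python sketch maps a few surface forms of dates and numbers to one normalized value each. The accepted date styles, the choice of ISO 8601 as the canonical date form, and the helper names are all assumptions made for this example.

```python
from datetime import datetime

# Assumed input styles; month names are parsed in the default (English) locale.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d")


def canonical_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def canonical_number(text):
    """Return a float for numerals with optional thousands commas, else None."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None
```

For example, `canonical_date("March 5, 1999")` and `canonical_date("5 March 1999")` both give `"1999-03-05"`, so an application can compare or sort the values without knowing how they were written in the text.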

Canonical Names

Variants found in the document:
- President Bush
- Mr. Bush
- George Bush

Canonical Name: George Bush

- The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
- Reduces ambiguity of variants.
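The example above can be mimicked with a small string heuristic. The sketch below is an assumption-laden illustration, not the Nominator algorithm: it strips a short, hypothetical list of titles, groups variants by their last word, and picks the variant with the most words as the canonical name.

```python
TITLES = {"president", "mr.", "mrs.", "ms.", "dr."}  # assumed, not exhaustive


def strip_title(name):
    """Drop leading title words such as 'Mr.' or 'President'."""
    words = name.split()
    while words and words[0].lower() in TITLES:
        words = words[1:]
    return " ".join(words)


def canonical_names(variants):
    """Map each variant to the longest title-free variant sharing its last word."""
    groups = {}
    for variant in variants:
        core = strip_title(variant)
        if core:
            groups.setdefault(core.split()[-1], []).append(core)
    best = {
        last: max(cores, key=lambda c: len(c.split()))
        for last, cores in groups.items()
    }
    return {
        v: best[strip_title(v).split()[-1]] for v in variants if strip_title(v)
    }


mapping = canonical_names(["President Bush", "Mr. Bush", "George Bush"])
```

Here all three variants map to "George Bush". Grouping by last word alone would wrongly merge different people who share a surname, which is one reason the context is limited to a single document.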

Disambiguating Proper Names: Nominator Program

Principles of Nominator Design

- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The unit of context for aggregation is a corpus.
- The heuristics represent English naming conventions.

Mining in Collections of Documents: Clustering and Categorization

1. Clustering

- Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors.
- Two clustering engines:
  - Hierarchical Clustering tool
  - Binary Relational Clustering tool
- Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group.
- Thus, clustering provides an overview of the contents of a collection of documents.
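A minimal Python sketch of the clustering idea follows. It is not the Hierarchical or Binary Relational Clustering tool: it assumes bag-of-words feature vectors, cosine similarity, and a greedy agglomerative merge with a hypothetical similarity threshold, and it lists the terms shared by all documents in a group as a stand-in for the topic description.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Feature vector: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.3):
    """Repeatedly merge the two most similar clusters until none reach threshold."""
    clusters = [[i] for i in range(len(vectors))]
    centroids = [Counter(v) for v in vectors]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroids[i], centroids[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters[j]
        centroids[i] = centroids[i] + centroids[j]
        del clusters[j], centroids[j]
    return clusters


def common_terms(doc_ids, vectors):
    """Terms that occur in every document of the group."""
    shared = set(vectors[doc_ids[0]])
    for i in doc_ids[1:]:
        shared &= set(vectors[i])
    return sorted(shared)


docs = [
    "credit line extended by commercial bank",
    "commercial bank credit facility",
    "restaurant menu dinner review",
    "dinner at the restaurant",
]
vectors = [vectorize(d) for d in docs]
groups = cluster(vectors)
topics = [common_terms(g, vectors) for g in groups]
```

On this toy collection the two banking documents end up in one group and the two restaurant documents in another, and the shared terms give a quick label for each group.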

Groups documents similar in their feature vectors

2. Categorization

- Topic Categorization Tool
- Assigns documents to preexisting categories (“topics” or “themes”)
- Categories are chosen to match the intended use of the collection
- Categories are defined by providing a set of sample documents for each category

2. Categorization (cont.)

- This “training” phase produces a special index, called the categorization schema
- The categorization tool returns a list of category names and confidence levels for each document
- If the confidence level is low, the document is put aside for a human categorizer
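The train-then-categorize flow can be sketched as follows. This is a simplified stand-in, not the Topic Categorization tool: the "schema" here is just one summed word-count vector per category, the confidence level is assumed to be cosine similarity, and `min_confidence` is a hypothetical cut-off for routing a document to a human.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Feature vector: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(samples):
    """Training phase: build one summed term vector per category."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        schema[category] = total
    return schema


def categorize(text, schema, min_confidence=0.2):
    """Return (ranked list of (category, confidence), needs_human_review)."""
    vec = vectorize(text)
    ranked = sorted(
        ((cat, cosine(vec, centroid)) for cat, centroid in schema.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    needs_review = not ranked or ranked[0][1] < min_confidence
    return ranked, needs_review


schema = train({
    "finance": ["commercial bank credit line", "convertible bond and debt security"],
    "dining": ["restaurant dinner menu", "restaurant review and wine list"],
})
ranked, needs_review = categorize("the bank extended a credit facility", schema)
unknown_ranked, unknown_review = categorize("weather forecast for tomorrow", schema)
```

In this sketch the banking sentence is ranked under "finance" with enough confidence to be accepted, while the weather sentence matches no category and is flagged for a human categorizer.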

2. Categorization (cont.)

Effectiveness:
- Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.

[Figure: categorization flow. Set of sample documents -> Training phase -> Special index used to categorize new documents -> Returns list of category names and confidence levels for each document]

Text Mining Applications

Main Advantages of mining technology over traditional ‘information broker’ business

- Ability to quickly process large amounts of textual data
- “Objectivity” and customizability
- Automation

Applications used to:

- Gain insights about trends, relations between people/places/organizations
- Classify and organize documents according to their content
- Organize repositories of document-related meta-information for search and retrieval
- Retrieve documents

Main Applications

- Knowledge Discovery
- Information Distillation

CRI: Customer Relationship Intelligence

- Appropriate documents selected
- Converted to common format
- Feature extraction and clustering tools are used to create a database
- User may select parameters for preprocessing and clustering step
- Clustering produces groups of feedback that share important linguistic elements
- Categorization tool used to assign new incoming feedback to identified categories

CRI (continued)

Knowledge Discovery
- Clustering used to create a structure that can be interpreted

Information Distillation
- Refinement and extension of the clustering results:
  - Interpreting the results
  - Tuning of the clustering process
  - Selecting meaningful clusters

Exam Question #1

Name an example of each of the two main classes of applications of text mining.

- Knowledge Discovery: Discovering a common customer complaint among much feedback.
- Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2

How does the procedure for text mining differ from the procedure for data mining?

- Text mining adds a feature extraction function.
- It is not feasible to have humans select features.
- The feature vectors are highly dimensional and sparsely populated.

Exam Question #3

In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?

- The program does not perform in-depth syntactic or semantic analyses of texts.

THE END

http://www-3.ibm.com/software/data/iminer/fortext/