Text Mining: Finding Nuggets in Mountains of Textual Data

Jochen Dörre, Peter Gerstl, and Roland Seiffert

Overview

- Introduction to Mining Text
- How Text Mining Differs from Data Mining
- Mining Within a Document: Feature Extraction
- Mining in Collections of Documents: Clustering and Categorization
- Text Mining Applications
- Exam Questions/Answers

Introduction to Mining Text

Reasons for Text Mining

[Figure: bar chart with a Percentage axis (0 to 90) comparing Collections of Text with Structured Data.]

Corporate Knowledge “Ore”

- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- Customer complaint letters
- Contracts
- Transcripts of phone calls with customers
- Technical documents

Challenges in Text Mining

- Information is in unstructured textual form.
- Not readily accessible to be used by computers.
- Dealing with huge collections of documents.

Two Mining Phases

- Knowledge Discovery: Extraction of codified information (features)
- Information Distillation: Analysis of the feature distribution

How Text Mining Differs from Data Mining

Comparison of Procedures

Data Mining:

- Identify data sets
- Select features
- Prepare data
- Analyze distribution

Text Mining:

- Identify documents
- Extract features
- Select features by algorithm
- Prepare data
- Analyze distribution
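
The text mining procedure listed above can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the word-token notion of a "feature", and the document-frequency threshold `min_docs` are assumptions made for the example, not part of Intelligent Miner for Text.

```python
import re
from collections import Counter

def extract_features(document):
    """Extract features: here simply lowercase word tokens (an assumption)."""
    return re.findall(r"[a-z]+", document.lower())

def select_features(tokenized_docs, min_docs=2):
    """Select features by algorithm: keep terms found in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return sorted(term for term, df in doc_freq.items() if df >= min_docs)

def prepare_data(tokenized_docs, vocabulary):
    """Prepare data: one term-count vector per document."""
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vectors

def analyze_distribution(vectors, vocabulary):
    """Analyze distribution: total count of each selected feature."""
    return {term: sum(v[i] for v in vectors) for i, term in enumerate(vocabulary)}

# Identify documents (hypothetical sample texts).
documents = [
    "The credit line was extended.",
    "A new credit facility and a credit line.",
    "Weather was fine.",
]
tokenized = [extract_features(d) for d in documents]
vocabulary = select_features(tokenized)
vectors = prepare_data(tokenized, vocabulary)
totals = analyze_distribution(vectors, vocabulary)
```

The extra steps compared with data mining are `extract_features` and the algorithmic `select_features`; the remaining steps mirror the data mining procedure.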

IBM Intelligent Miner for Text

- SDK: Software Development Kit
- Contains necessary components for “real text mining”
- Also contains more traditional components:
  - IBM Text Search Engine
  - IBM Web Crawler
  - drop-in Intranet search solutions

Mining Within a Document: Feature Extraction

Feature Extraction

- To recognize and classify significant vocabulary items in unrestricted natural language texts.
- Let’s see an example…

Example of Vocabulary Found

- Certificate of deposit
- CMOs
- Commercial bank
- Commercial paper
- Commercial Union Assurance
- Commodity Futures Trading Commission
- Consul Restaurant
- Convertible bond
- Credit facility
- Credit line
- Debt security
- Debtor country
- Detroit Edison
- Digital Equipment
- Dollars of debt
- End-March
- Enserch
- Equity warrant
- Eurodollar
- …

Implementation of Feature Extraction relies on

- Linguistically motivated heuristics
- Pattern matching
- Limited amounts of lexical information, such as part-of-speech information.
- Not used: huge amounts of lexicalized information
- Not used: in-depth syntactic and semantic analyses of texts
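
To make the idea of heuristic pattern matching concrete, here is a minimal sketch. The rule used (two or more consecutive capitalized words form a candidate name) is an assumption chosen for illustration; the slides do not specify the actual heuristics.

```python
import re

# Assumed heuristic: two or more consecutive capitalized words form a candidate name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def candidate_names(text):
    """Return multiword capitalized sequences found by pattern matching."""
    return NAME_PATTERN.findall(text)

sample = "Shares of Detroit Edison rose after Digital Equipment reported results."
names = candidate_names(sample)
```

A rule this simple needs no dictionary and no parsing, which is the point of the design, but it would miss items such as "CMOs" or lowercase terms like "credit line", so a practical extractor would combine several such heuristics.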

Goals of Feature Extraction

- Very fast processing to be able to deal with mass data
- Domain-independence for general applicability

Extracted information categories

- Names of persons, organizations and places
- Multiword terms
- Abbreviations
- Relations
- Other useful stuff

Canonical Forms

- Normalized forms of dates, numbers, …
- Allows applications to use information very easily
- Abstracts from different morphological variants of a single term
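
As a rough illustration of canonical forms, the sketch below normalizes a few date and number spellings. The accepted input formats, the choice of ISO 8601 dates and floats as canonical forms, and the function names are all assumptions for this example; month names are parsed according to the current locale.

```python
import re
from datetime import datetime

# Assumed input styles, e.g. "19-Dec-2015", "December 19, 2015", "12/19/2015".
DATE_FORMATS = ("%d-%b-%Y", "%B %d, %Y", "%m/%d/%Y")

def canonical_date(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
NUMBER_PATTERN = re.compile(r"([\d,]*\d(?:\.\d+)?)\s*(thousand|million|billion)?")

def canonical_number(text):
    """Return a float for strings such as '1,500' or '2.5 million', else None."""
    match = NUMBER_PATTERN.fullmatch(text.strip().lower())
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * SCALES.get(match.group(2), 1)
```

Once every variant maps to the same canonical value, an application can compare or aggregate the extracted items without knowing how they were written in the text.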

Canonical Names

- President Bush
- Mr. Bush
- George Bush

Canonical Name: George Bush

- The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
- Reduces ambiguity of variants.
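
One way to picture the selection of a canonical name is the sketch below. It assumes that variants sharing the same last token refer to the same person and that the longest title-free variant is the most explicit one; both rules, the `TITLES` list, and the function names are hypothetical and not taken from the Nominator program.

```python
TITLES = {"president", "mr.", "mrs.", "ms.", "dr."}  # assumed, incomplete list

def strip_titles(name):
    """Split a name into tokens and drop leading titles such as 'Mr.'."""
    return [token for token in name.split() if token.lower() not in TITLES]

def canonical_names(variants):
    """Map each variant to the longest title-free variant sharing its last token."""
    best = {}  # last token (surname) -> most explicit token list seen so far
    for variant in variants:
        tokens = strip_titles(variant)
        if not tokens:
            continue
        surname = tokens[-1]
        if len(tokens) > len(best.get(surname, [])):
            best[surname] = tokens
    return {
        variant: " ".join(best[strip_titles(variant)[-1]])
        for variant in variants
        if strip_titles(variant)
    }

mapping = canonical_names(["President Bush", "Mr. Bush", "George Bush"])
```

With the three variants from the slide, every variant maps to "George Bush". Note that this toy rule would wrongly merge two different people who share a surname within one document.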

Disambiguating Proper Names: Nominator Program

Principles of Nominator Design

- Apply heuristics to strings, instead of interpreting semantics.
- The unit of context for extraction is a document.
- The unit of context for aggregation is a corpus.
- The heuristics represent English naming conventions.

Mining in Collections of Documents: Clustering and Categorization

1. Clustering

- Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors.
- Two clustering engines:
  - Hierarchical Clustering tool
  - Binary Relational Clustering tool
- Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group.
- Thus, clustering provides an overview of the contents of a collection of documents.
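
The following sketch illustrates the general idea of grouping documents by the similarity of their feature vectors and then listing the terms a group has in common. It uses single-link agglomerative clustering with cosine similarity and a similarity `threshold`; these choices, like the function names, are assumptions for illustration and are not a description of either IBM clustering tool.

```python
import math
from collections import Counter

def vectorize(text):
    """Feature vector: term counts of whitespace-separated lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Repeatedly merge the two most similar clusters while similarity exceeds threshold."""
    vectors = [vectorize(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_pair is None:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def common_terms(docs, members):
    """Terms that occur in every document of a cluster."""
    term_sets = [set(vectorize(docs[i])) for i in members]
    return sorted(set.intersection(*term_sets))

docs = [
    "credit line extended",
    "credit line renewed",
    "bond market rally",
    "bond market slump",
]
groups = cluster(docs)
labels = [common_terms(docs, g) for g in groups]
```

Here the shared terms of each group serve as a rough topic label, which is how the slide describes the tools giving an overview of a collection.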

Clustering groups documents that are similar in their feature vectors.

2. Categorization

- Topic Categorization Tool
- Assigns documents to preexisting categories (“topics” or “themes”)
- Categories are chosen to match the intended use of the collection
- Categories are defined by providing a set of sample documents for each category

2. Categorization (cont.)

- This “training” phase produces a special index, called the categorization schema
- The categorization tool returns a list of category names and confidence levels for each document
- If the confidence level is low, the document is put aside for a human categorizer
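
The training and categorization steps can be sketched as follows. This is a toy stand-in, not the Topic Categorization tool: the "schema" is assumed to be one summed term-count vector per category, confidence is assumed to be cosine similarity, and `min_confidence` is an arbitrary cut-off.

```python
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: build a toy schema with one term-count vector per category."""
    return {
        category: sum((Counter(tokens(doc)) for doc in docs), Counter())
        for category, docs in samples.items()
    }

def categorize(schema, document, min_confidence=0.2):
    """Return (category, confidence) pairs, best first, and a flag for human review."""
    vector = Counter(tokens(document))
    ranked = sorted(
        ((cosine(vector, centroid), category) for category, centroid in schema.items()),
        reverse=True,
    )
    results = [(category, score) for score, category in ranked]
    needs_human = results[0][1] < min_confidence
    return results, needs_human

# Hypothetical sample documents for two categories.
samples = {
    "billing": ["invoice amount wrong", "charged twice on invoice"],
    "delivery": ["package arrived late", "late delivery of package"],
}
schema = train(samples)
results, needs_human = categorize(schema, "invoice charged twice")
unknown_results, unknown_needs_human = categorize(schema, "hello there")
```

A document that shares no vocabulary with any category gets zero confidence everywhere and is flagged for a human categorizer, mirroring the last step on the slide.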

2. Categorization (cont.)

Effectiveness:

- Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.

[Figure: Set of sample documents → Training phase → Special index used to categorize new documents → Returns list of category names and confidence levels for each document]

Text Mining Applications

Main Advantages of mining technology over traditional ‘information broker’ business

- Ability to quickly process large amounts of textual data
- “Objectivity” and customizability
- Automation

Applications used to:

- Gain insights about trends, relations between people/places/organizations
- Classify and organize documents according to their content
- Organize repositories of document-related meta-information for search and retrieval
- Retrieve documents

Main Applications

- Knowledge Discovery
- Information Distillation

CRI: Customer Relationship Intelligence

- Appropriate documents are selected
- Documents are converted to a common format
- Feature extraction and clustering tools are used to create a database
- User may select parameters for the preprocessing and clustering steps
- Clustering produces groups of feedback that share important linguistic elements
- Categorization tool is used to assign new incoming feedback to the identified categories

CRI (continued)

- Knowledge Discovery
  - Clustering used to create a structure that can be interpreted
- Information Distillation
  - Refinement and extension of the clustering results
    - Interpreting the results
    - Tuning of the clustering process
    - Selecting meaningful clusters

Exam Question #1

Name an example of each of the two main classes of applications of text mining.

- Knowledge Discovery: Discovering a common customer complaint in a large body of feedback.
- Information Distillation: Filtering future comments into pre-defined categories.

Exam Question #2

How does the procedure for text mining differ from the procedure for data mining?

- Adds feature extraction function
- Not feasible to have humans select features
- Highly dimensional, sparsely populated feature vectors
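
The last point can be illustrated with a tiny example. The sketch assumes one dimension per distinct word and stores only the non-zero entries of each document vector; the sample texts and names are hypothetical.

```python
def sparse_vector(tokens):
    """Store only the non-zero term counts of a document."""
    vector = {}
    for token in tokens:
        vector[token] = vector.get(token, 0) + 1
    return vector

docs = [
    "credit line extended",
    "bond market rally",
    "customer complaint letter",
]
# One dimension per distinct term in the collection.
vocabulary = sorted({token for doc in docs for token in doc.split()})
sparse = [sparse_vector(doc.split()) for doc in docs]

# Fraction of vector entries that are non-zero across the collection.
density = sum(len(v) for v in sparse) / (len(docs) * len(vocabulary))
```

Even with three short documents only a third of the entries are populated. As a collection grows, the vocabulary grows with it while each document still uses only a few terms, so the vectors become higher-dimensional and sparser.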

Exam Question #3

In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?

- Does not perform in-depth syntactic or semantic analyses of texts

THE END
http://www-3.ibm.com/software/data/iminer/fortext/