practical text analytics
-
8/13/2019 Practical Text Analytics
1/32
Reading Between the Lines:
Practical Text Analytics
John B. Rollins, Ph.D., P.E., BI Solution Architect, IBM Corporation, [email protected]
Venkatesh, Ph.D., Data Mining Specialist, IBM Corporation, [email protected]
Alexander Lang, Ph.D., Software Engineer, IBM Corporation, [email protected]
Stefan Abraham, Ph.D., Software Engineer, IBM Corporation, [email protected]
Session Number 2336
-
Agenda
Business motivation for analyzing unstructured data
Text analysis: from text to structure
Text analysis in InfoSphere Warehouse
Practical examples
InfoSphere Warehouse and IBM Content Analyzer
Trends in unstructured analytics
-
Information Irony
Water, water, everywhere,
Nor any drop to drink.
Samuel Taylor Coleridge
The Rime of the Ancient Mariner
Data is being generated at an unprecedented rate
IDC estimates data will grow from 161 exabytes in 2006 to 988 exabytes in 2010 (1 exabyte = 1 billion gigabytes!)
Much of it (80% by some estimates) is unstructured.
Information rich
Extracting value is difficult
Our focus is unstructured TEXTUAL INFORMATION!
-
Information Warehouse Growth Trend
TDWI Survey, 2007
Respondents expect huge increase in unstructured data as warehouse sources
Collaborative content(Email, IM, Wikis)
Content management
Voice transcriptions
Claims records
Chart & Survey: P. Russom, BI Search and Text Analytics, TDWI Best Practices Report, 2007
Rapid Growth in Unstructured Data
-
Business Scenarios for Unstructured Data
Example: improve product innovation and quality
Use information from customer service records, repair notes, online reviews, and other unstructured sources
Reduce reliance on static, predefined problem codes by extracting detailed problem descriptions
Understand faster why a product has problems
Identify gaps in product portfolio: new functionality to drive product innovation
-
Business Scenarios for Unstructured Data
Product Innovation and Quality
Static Problem Codes
(which problems occurred)
Text Analysis
(why problems occurred)
-
Business Scenarios for Unstructured Data
Example: reduce customer churn
Identify unhappy customers as early as possible
Analyze text to identify emerging problems, e.g., call center complaints about dropped calls
May be too late to take action by the time a problem appears in structured data, e.g., declining number of calls over time for a given cell phone service provider
Analyze customer email and call center logs to detect negative sentiment
Is the customer angry?
Is the customer mentioning competitors' pricing offers?
Is the customer complaining about a particular problem with service or product?
-
Text Analysis:
From Text to Structure
-
Text to Structure
Create new structured variables from text
Example: analysis of vehicle complaints report
Extract accident attributes from text field
Create new variables to represent accident attributes
CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.
CMPLID  INJURY  ACCIDENT  ABS     AIRBAG_DRV  AIRBAG_PASS
17869   Severe  Yes       Failed  --          No Deploy
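The mapping from the complaint text to the structured row can be sketched roughly as follows. This is a minimal illustration in Python, not the InfoSphere Warehouse implementation: the rule table RULES, the function extract_attributes, and the individual regular expressions are assumptions made for this example, and only the column names come from the slide.

```python
import re

# Hypothetical rule table: column name -> (pattern, value to store on a match).
# The patterns are illustrative assumptions, not rules shipped with any product.
RULES = {
    "INJURY": (r"\bSEVE?RE?LY\s+INJURED\b", "Severe"),
    "ACCIDENT": (r"\b(?:ACCIDENT|CRASH|COLLISION)\b", "Yes"),
    "ABS": (r"\b(?:ABS|ANTI-LOCK BRAKES?)\b[^.]*?\bFAILED\b", "Failed"),
    "AIRBAG_DRV": (r"\bDRIVER'?S?\s+AIR\s?BAG\s+(?:DIDN'T|DID NOT)\s+DEPLOY\b", "No Deploy"),
    "AIRBAG_PASS": (r"\bPASSENGER'?S?\s+AIR\s?BAG\s+(?:DIDN'T|DID NOT)\s+DEPLOY\b", "No Deploy"),
}


def extract_attributes(text):
    """Return one structured row (dict) for a free-text complaint."""
    upper = text.upper()
    row = {}
    for column, (pattern, value) in RULES.items():
        # "--" marks an attribute that is not mentioned in the text.
        row[column] = value if re.search(pattern, upper) else "--"
    return row


complaint = ("CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE "
             "FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.")
row = extract_attributes(complaint)
```

The INJURY pattern deliberately tolerates the misspelling "SEVERLY", since complaint text of this kind is often noisy.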
-
Types of Information Extraction
Named entity recognition
Extract person or place names, monetary expressions, etc.
Co-reference resolution
Identify expressions that refer to the same entity
Alex Lang is our co-author. He is not at IOD this year.
Relationship detection
Extract entities (e.g., products and problems) and use data mining to find relationships among them
More robust than elaborate, hand-crafted rules
Associations, clustering, predictive modeling
-
Which Extraction Technique to Use?
Depends on the concepts to be extracted
Concept is a fixed list of instances: use dictionaries
Product names from database
List of employee names from LDAP
Concept follows a simple pattern: use regular expressions
Phone numbers, product codes, etc.
Concept follows complex pattern: use advanced analysis components
Relationships among concepts/entities
Additional tools/capabilities
Text exploration (e.g., OmniFind Analytics / Content Analyzer)
Customized annotators (e.g., sentiments)
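The first two cases (fixed list and simple pattern) can be illustrated with a short Python sketch. The product names, the phone number format, and the function extract_concepts are hypothetical and chosen only for this example; in practice the dictionary would be loaded from a database table or LDAP, as the slide suggests.

```python
import re

# Hypothetical dictionary of product names (assumption for this example).
PRODUCT_NAMES = ["Alpha 100", "Beta Zoom", "Gamma Pro"]

# Dictionary lookup: one alternation over the escaped names, longest first.
PRODUCT_RE = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in sorted(PRODUCT_NAMES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

# Simple pattern: phone numbers in the assumed form 555-123-4567.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_concepts(text):
    """Extract dictionary-based and pattern-based concepts from one text."""
    canonical = {name.lower(): name for name in PRODUCT_NAMES}
    products = [canonical[m.lower()] for m in PRODUCT_RE.findall(text)]
    phones = PHONE_RE.findall(text)
    return {"products": products, "phones": phones}


result = extract_concepts("My beta zoom broke; call me at 555-123-4567 about the Alpha 100.")
```

Mapping each match back to its canonical dictionary entry keeps spelling variants such as "beta zoom" and "Beta Zoom" in a single category.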
-
Text Analysis in InfoSphere Warehouse
-
Text Analytics in InfoSphere Warehouse
Data understanding: View text columns in database, text statistics, frequent terms analysis
Run advanced UIMA engines (IBM OmniFind, Partners, IBM Research)
- Extract frequent term patterns
- Create entity dictionary from terms
List of terms to extract
Regular expression-based extraction
Enable hierarchical grouping of terms
-
Focused vs. Explorative Approaches
Focused
Create dictionaries, rules, or annotators to extract precisely the relevant concepts
More upstream effort for creation and testing
Allows analysis focused on a specific question
Example: Show the top 10 car parts that occur in repair reports for automobile make X
Explorative
Create dictionaries that contain terms with certain parts-of-speech patterns (e.g., adjective-noun)
Use downstream analysis (e.g., association rule mining) to weed out irrelevant terms
Allows detection of unknown events or relationships
Example: Part X + Failure Attribute Y -> Vehicle Crash
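A rule of the form "Part X + Failure Attribute Y -> Vehicle Crash" is typically judged by its support, confidence, and lift. The Python sketch below computes these three measures for one candidate rule over a toy set of complaints; the data and the function rule_stats are made up for illustration and do not reflect how InfoSphere Warehouse mines associations internally.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    transactions: list of sets of extracted terms, one set per complaint.
    antecedent:   set of terms on the left-hand side of the rule.
    consequent:   single term on the right-hand side.
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_ac = sum(1 for t in transactions if antecedent <= t and consequent in t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Toy data (assumption): terms extracted from five complaints.
complaints = [
    {"brake", "failed", "crash"},
    {"brake", "failed", "crash"},
    {"brake", "noise"},
    {"airbag", "failed"},
    {"brake", "failed"},
]
support, confidence, lift = rule_stats(complaints, {"brake", "failed"}, "crash")
```

A lift above 1 suggests that the part and failure attribute occur together with a crash more often than chance would predict, which is how irrelevant terms get weeded out downstream.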
-
Explorative Approach: Example
Frequent terms + Associations analysis -> Cognos report
Frequent Terms Analysis: count the most frequently occurring terms for Auto Make X (relative occurrence may or may not be relevant to understanding the problem)
Associations Mining: discover correlated terms that are relevant to resolving the problem
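Frequent terms analysis amounts to tokenizing each text, dropping uninformative words, and counting what remains. A minimal Python sketch follows; the stop-word list, the sample texts, and the function frequent_terms are assumptions for this example.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "on", "it", "when"}


def frequent_terms(texts, top_n=10):
    """Return the top_n most frequent non-stop-word terms with their counts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


reports = [
    "The brake failed",
    "Brake pedal went to the floor",
    "Airbag failed and brake failed",
]
top_two = frequent_terms(reports, top_n=2)
```

As the slide notes, raw frequency alone may not point at the problem, which is why the counts are usually followed by an associations step.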
-
Using Information Extracted from Text
Enrich database with extracted terms/variables
Identify Frequent Terms
Create Dictionary
Use Dictionary Lookup to extract terms
Text Field
Structured Data
Regular Expression Extraction
Other Extraction Techniques
Enriched Data
Information Extraction
Derived Variables
Aggregate by Row
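The "aggregate by row" step can be pictured as turning many (record, extracted term) pairs into one set of derived indicator variables per record, which are then attached to the structured data. The Python sketch below assumes a simple record layout with an "id" key and a "HAS_" prefix for the derived columns; both are hypothetical choices for this illustration.

```python
from collections import defaultdict


def enrich(records, extracted, concepts):
    """Attach one 0/1 indicator column per concept to each structured record.

    records:   list of dicts with at least an "id" key (structured data).
    extracted: list of (record_id, concept) pairs produced by text extraction.
    concepts:  concepts to turn into derived variables.
    """
    by_row = defaultdict(set)
    for record_id, concept in extracted:
        by_row[record_id].add(concept)  # duplicates collapse per row

    enriched = []
    for record in records:
        row = dict(record)  # copy so the input records stay unchanged
        for concept in concepts:
            row["HAS_" + concept.upper()] = int(concept in by_row[record["id"]])
        enriched.append(row)
    return enriched


records = [{"id": 1, "make": "X"}, {"id": 2, "make": "Y"}]
extracted = [(1, "brake"), (1, "airbag"), (1, "brake"), (2, "airbag")]
enriched = enrich(records, extracted, ["brake", "airbag"])
```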
-
Using Information Extracted from Text
Use new structured variables for reporting and analytics
Create additional dimensions for OLAP, e.g.:
Extract skills mentioned in job postings
Add skills to OLAP cube to enable reports like "What are the top 10 skills sought by my competitors?"
Improve predictive power and interpretability of data mining models, e.g.:
Extract information on parts and failure attributes from safety complaints
Associations: find correlations between part failures and crashes
Decision tree: add information on parts and failure attributes to gain additional insights and improve predictive power
-
Practical Example:
Camera Product Review
-
Scenario
A company has gathered customer comments and
product ratings on their digital cameras from an
external forum.
Goal: Improve customer satisfaction by understanding the key drivers of customer sentiment
Identify camera features that are correlated with positive/negative reviews
Identify areas needing improvement and/or product differentiation
-
Live Demo
InfoSphere Warehouse
Camera product review
-
Practical Example:
NHTSA Vehicle Safety Complaints
-
Scenario
The National Highway Traffic Safety Administration (NHTSA) COMPLAINTS dataset contains all safety-related defect complaints received by NHTSA since January 1, 1995.
Dataset contains structured variables (Make, Model, Year, etc.) and an unstructured text field (consumer complaints description).
Goal: Enrich predictive mining models of vehicle safety by incorporating variables extracted from text
Extract key variables related to vehicle safety
Combine extracted variables with existing ones to develop insights and improve predictive power of mining models of vehicle safety
-
Live Demo
InfoSphere Warehouse
NHTSA vehicle safety complaints
-
InfoSphere Warehouse and IBM Content Analyzer
-
Scenarios Using Advanced Text Analysis
Customer sentiment requires detection of
Negation: "Product X is not a good choice" -> negative
Non-facts: "Product X is a good choice" vs. "Product X is perhaps a good choice"
Identify parts that failed without a list of parts
Want to extract "gasket failure", "wiring harness has failed", but not "severe failure", "when it has failed"
Requires rules like:
One or two nouns, followed by one or two arbitrary words, followed by "fail"
Can be addressed by analysis components from other IBM products, e.g., IBM Content Analyzer
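To see why negation and non-facts need more than a word list, consider the following Python sketch. It is a deliberately naive window-based check and not the method used by IBM Content Analyzer; the word sets, the window size of three tokens, and the function classify are all assumptions for this example.

```python
import re

# Tiny illustrative lexicons (assumptions); real annotators use far richer resources.
POSITIVE = {"good", "great", "excellent"}
NEGATORS = {"not", "never", "no"}
HEDGES = {"perhaps", "maybe", "possibly"}


def classify(sentence, window=3):
    """Label a sentence by the first positive word and the words just before it."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            context = tokens[max(0, i - window):i]
            if any(word in NEGATORS for word in context):
                return "negative"   # negated positive word
            if any(word in HEDGES for word in context):
                return "uncertain"  # hedged statement, not a fact
            return "positive"
    return "neutral"


labels = [
    classify("Product X is not a good choice"),
    classify("Product X is a good choice"),
    classify("Product X is perhaps a good choice"),
]
```

A fixed window breaks down quickly on longer sentences, which is one reason grammar-aware analysis components are attractive for this task.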
-
InfoSphere Warehouse and
IBM Content Analyzer
InfoSphere Warehouse 9.5
Extract predefined concepts, based on lists and regular expressions
Text analysis is a key component within ETL flow
Results can be used in data mining and reporting
IBM Content Analyzer
Extract concepts based on grammar and parts-of-speech
Combine search and text mining to discover insights
Combined approach
Use IBM Content Analyzer to explore documents and identify relevant concepts
Operationalize insights by putting Content Analyzer text analysis into InfoSphere Warehouse ETL flows
-
Using Content Analyzer In InfoSphere Flow
Descriptions of vehicle failures
Use ICA Text Analysis to extract nouns that are followed by "fail" or "leak"
Result: car parts that failed
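The idea of extracting nouns that are followed by "fail" or "leak" can be approximated in a few lines of Python. This sketch stands in for a real part-of-speech tagger with a tiny hand-made noun list, and it allows zero to two words between the noun and the trigger word so that "gasket failure" is also caught; the noun list, the gap size, and the function failed_parts are assumptions, not ICA behavior.

```python
import re

# Hypothetical mini noun lexicon; a real system would use a part-of-speech tagger.
NOUNS = {"gasket", "wiring", "harness", "pump", "hose"}
TRIGGER = re.compile(r"^(?:fail|leak)")


def failed_parts(text, max_gap=2):
    """Return runs of one or two nouns that precede a fail*/leak* word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    parts = []
    for i, token in enumerate(tokens):
        if not TRIGGER.match(token):
            continue
        # Skip back over 0..max_gap arbitrary words to look for a noun.
        for gap in range(max_gap + 1):
            end = i - gap  # index just past the candidate noun run
            if end <= 0:
                break
            if tokens[end - 1] in NOUNS:
                start = end - 1
                if start >= 1 and tokens[start - 1] in NOUNS:
                    start -= 1  # extend to a two-noun part name
                parts.append(" ".join(tokens[start:end]))
                break
    return parts


found = [
    failed_parts("gasket failure"),
    failed_parts("wiring harness has failed"),
    failed_parts("severe failure"),
    failed_parts("when it has failed"),
    failed_parts("the hose is leaking"),
]
```

Because the rule keys on word class rather than a fixed parts list, it can surface part names that nobody thought to put in a dictionary.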
-
Trends in Unstructured Analytics
-
Trends in Unstructured Analytics
Speech analytics
Combines traditional text analytics with speech as the source of the text
Gives access to "voice of the customer" for a wide range of interesting insights into customer behavior, e.g.:
Identifying cross-sell and up-sell opportunities
Identifying indicators of high risk of lapsing (e.g., expressing dissatisfaction, mentioning competitors)
Multi-modal image analysis
Improved detection of patterns and anomalies in images
Example: medical imaging to look for evidence of disease or injury
-
Trends in Unstructured Analytics (cont'd)
Sentiment detection in web sources
Insights on products and companies (blogs, chats)
Sentiments influence product/service directions
Noisy unstructured data analysis
Extract information from highly noisy unstructured text sources such as:
Online chats, text messages, emails, message boards, newsgroups, blogs, wikis, web pages, printed/handwritten text
Text produced by processing speech
Noise includes:
Spelling errors, abbreviations, non-standard words, missing punctuation, missing case information, pauses, verbal fillers
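A few of these noise types (verbal fillers, abbreviations, spelling errors) can be reduced with a simple normalization pass before extraction. The Python sketch below uses the standard-library difflib module for fuzzy spelling correction; the filler list, abbreviation table, vocabulary, and the 0.8 similarity cutoff are assumptions chosen for this example.

```python
import difflib
import re

# Illustrative resources (assumptions); real systems need domain-specific lists.
FILLERS = {"um", "uh", "er"}
ABBREVIATIONS = {"brks": "brakes", "pls": "please"}
VOCABULARY = sorted({"the", "brakes", "were", "severely", "worn", "please", "advise"})


def normalize(text):
    """Lowercase, drop fillers, expand abbreviations, and fix near-miss spellings."""
    cleaned = []
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in FILLERS:
            continue
        token = ABBREVIATIONS.get(token, token)
        if token not in VOCABULARY:
            # Replace with the closest known word if it is similar enough.
            matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            if matches:
                token = matches[0]
        cleaned.append(token)
    return " ".join(cleaned)


normalized = normalize("Um the brks were severly worn uh pls advise")
```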
-
Summary
Text analysis is becoming increasingly important with the rapid growth in unstructured data.
InfoSphere Warehouse provides many capabilities for practical text analysis.
ISW provides an integrated platform for data mining, text analysis, and reporting.
Practical examples illustrate how to perform text analytics and combine it with data mining.
IBM Content Analyzer and UIMA-compliant annotators can extend text analysis capabilities.
Emerging unstructured analytics technologies are extending the value and applications of text analytics in many important fields.
IBM Research is active in many of these areas.
-
Disclaimer
Copyright IBM Corporation 2008. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM, the IBM logo, ibm.com, InfoSphere Warehouse, Content Analyzer, and OmniFind Analytics Edition are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.