practical text analytics


Reading Between the Lines: Practical Text Analytics

John B. Rollins, Ph.D., P.E., BI Solution Architect, IBM Corporation, [email protected]

Venkatesh, Ph.D., Data Mining Specialist, IBM Corporation, [email protected]

Alexander Lang, Ph.D., Software Engineer, IBM Corporation, [email protected]

Stefan Abraham, Ph.D., Software Engineer, IBM Corporation, [email protected]

Session Number 2336



Agenda

Business motivation for analyzing unstructured data

Text analysis: from text to structure

Text analysis in InfoSphere Warehouse

Practical examples

InfoSphere Warehouse and IBM Content Analyzer

Trends in unstructured analytics



Information Irony

Water, water, everywhere,

Nor any drop to drink.

Samuel Taylor Coleridge

The Rime of the Ancient Mariner

Data is being generated at an unprecedented rate

IDC estimates data will grow from 161 exabytes in 2006 to 988 exabytes in 2010 (1 exabyte = 1 billion gigabytes!)

Much of it (80% by some estimates) is unstructured

Information rich

Extracting value is difficult

Our focus is unstructured TEXTUAL INFORMATION!



Information Warehouse Growth Trend

TDWI Survey, 2007

Respondents expect huge increase in unstructured data as warehouse sources

Collaborative content (Email, IM, Wikis)

Content management

Voice transcriptions

Claims records

Chart & Survey: P. Russom, BI Search and Text Analytics, TDWI Best Practices Report, 2007

Rapid Growth in Unstructured Data



Business Scenarios for Unstructured Data

Example: improve product innovation and quality

Use information from customer service records, repair notes, online reviews, and other unstructured sources

Reduce reliance on static, predefined problem codes by extracting detailed problem descriptions

Understand faster why a product has problems

Identify gaps in product portfolio and new functionality to drive product innovation




Product Innovation and Quality

Static Problem Codes

(which problems occurred)

Text Analysis

(why problems occurred)



Example: reduce customer churn

Identify unhappy customers as early as possible

Analyze text to identify emerging problems, e.g., call center complaints about dropped calls

May be too late to take action by the time a problem appears in structured data, e.g., declining number of calls over time for a given cell phone service provider

Analyze customer email and call center logs to detect negative sentiment

Is the customer angry?

Is the customer mentioning competitors' pricing offers?

Is the customer complaining about a particular problem with service or product?




Text Analysis:

From Text to Structure



Text to Structure

Create new structured variables from text

Example: analysis of vehicle complaints report

Extract accident attributes from text field

Create new variables to represent accident attributes

CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.

CMPLID  INJURY  ACCIDENT  ABS     AIRBAG_DRV  AIRBAG_PASS
17869   Severe  Yes       Failed  --          No Deploy
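
The mapping from the complaint text to a row of new variables can be sketched with a few hand-written rules. This is a minimal illustration in Python, not the InfoSphere Warehouse implementation: the column names come from the slide, while the rule patterns and the `extract_row` helper are assumptions made for this example.

```python
import re

# Hypothetical rules: output column -> (pattern, value written when it matches).
# The patterns tolerate the misspelling "SEVERLY" and missing apostrophes,
# both of which occur in the raw complaint text.
RULES = {
    "INJURY": (re.compile(r"\bSEVERE?LY\s+INJURED\b"), "Severe"),
    "ACCIDENT": (re.compile(r"\bACCIDENT\b"), "Yes"),
    # Stay inside one sentence: no period between the part and FAILED.
    "ABS": (re.compile(r"\b(?:ABS|ANTI-LOCK)\b[^.]*?\bFAILED\b"), "Failed"),
    "AIRBAG_DRV": (
        re.compile(r"\bDRIVER'?S?\s+AIR\s?BAG\s+DIDN'?T\s+DEPLOY\b"),
        "No Deploy",
    ),
    "AIRBAG_PASS": (
        re.compile(r"\bPASSENGER'?S?\s+AIR\s?BAG\s+DIDN'?T\s+DEPLOY\b"),
        "No Deploy",
    ),
}


def extract_row(cmplid, text):
    """Return one structured row; "--" marks attributes not found in the text."""
    upper = text.upper()
    row = {"CMPLID": cmplid}
    for column, (pattern, value) in RULES.items():
        row[column] = value if pattern.search(upper) else "--"
    return row


COMPLAINT = (
    "CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE "
    "FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY."
)
print(extract_row(17869, COMPLAINT))
```

Each rule yields one derived variable, so adding an attribute means adding one entry to `RULES`.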



Types of Information Extraction

Named entity recognition

Extract person or place names, monetary expressions, etc.

Co-reference resolution

Identify expressions that refer to the same entity

Alex Lang is our co-author. He is not at IOD this year.

Relationship detection

Extract entities (e.g., products and problems) and use data mining to find relationships among them

More robust than elaborate, hand-crafted rules

Associations, clustering, predictive modeling



Which Extraction Technique to Use?

Depends on the concepts to be extracted

Concept is a fixed list of instances: use dictionaries

Product names from database

List of employee names from LDAP

Concept follows a simple pattern: use regular expressions

Phone numbers, product codes, etc.

Concept follows a complex pattern: use advanced analysis components

Relationships among concepts/entities

Additional tools/capabilities

Text exploration (e.g., OmniFind Analytics / Content Analyzer)

Customized annotators (e.g., sentiments)
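
The first two cases, dictionaries and regular expressions, can be sketched in a few lines of Python. This is an illustration only: the product names, the phone number format, and the `annotate` helper are hypothetical and do not come from any IBM product.

```python
import re

# Hypothetical dictionary; in practice it would be loaded from a product
# table or an LDAP directory.
PRODUCTS = ["Zoom 200", "Zoom 200 Pro", "PocketCam"]


def build_dictionary_matcher(terms):
    """Compile a fixed list of instances into one case-insensitive pattern."""
    # Longest terms first so "Zoom 200 Pro" wins over "Zoom 200".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PRODUCT_RE = build_dictionary_matcher(PRODUCTS)

# Simple pattern (assumed format): numbers written like 555-123-4567.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def annotate(text):
    """Return dictionary hits and pattern hits found in the text."""
    return {
        "products": [match.group(0) for match in PRODUCT_RE.finditer(text)],
        "phones": PHONE_RE.findall(text),
    }


print(annotate("My pocketcam broke; the Zoom 200 Pro is fine. Call 555-123-4567."))
```

The dictionary has to be maintained as products change, while the pattern covers any value of the right shape, which is why the choice depends on the concept being extracted.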



Text Analysis in InfoSphere Warehouse



Text Analytics in InfoSphere Warehouse

Data understanding: view text columns in database, text statistics, frequent terms analysis

Run advanced UIMA engines (IBM OmniFind, Partners, IBM Research)

- Extract frequent term patterns

- Create entity dictionary from terms

List of terms to extract

Regular expression-based extraction

Enable hierarchical grouping of terms



Focused vs. Explorative Approaches

Focused

Create dictionaries, rules, or annotators to extract precisely the relevant concepts

More upstream effort for creation and testing

Allows analysis focused on a specific question

Example: Show the top 10 car parts that occur in repair reports for automobile make X

Explorative

Create dictionaries that contain terms with certain parts-of-speech patterns (e.g., adjective-noun)

Use downstream analysis (e.g., association rule mining) to weed out irrelevant terms

Allows detection of unknown events or relationships

Example: Part X + Failure Attribute Y -> Vehicle Crash



Explorative Approach: Example

Frequent terms + Associations analysis -> Cognos report

Frequent Terms Analysis: count the most frequently occurring terms for Auto Make X (relative occurrence may or may not be relevant to understanding the problem)

Associations Mining: discover correlated terms that are relevant to resolving the problem
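
The two steps can be sketched as follows, assuming each complaint has already been reduced to a list of extracted terms. This is a plain-Python illustration of document-frequency counting and pairwise association rules (support, confidence, lift), not the mining algorithm used by InfoSphere Warehouse; the sample data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical extracted terms, one list per complaint.
DOCS = [
    ["brake", "failure", "crash"],
    ["brake", "failure", "crash"],
    ["brake", "noise"],
    ["tire", "noise"],
    ["tire", "failure", "crash"],
]


def frequent_terms(docs, top_n=10):
    """Count in how many documents each term occurs."""
    counts = Counter(term for doc in docs for term in set(doc))
    return counts.most_common(top_n)


def association_rules(docs, min_support=0.4):
    """Return {(antecedent, consequent): (support, confidence, lift)}."""
    n = len(docs)
    term_counts = Counter(term for doc in docs for term in set(doc))
    pair_counts = Counter(
        pair for doc in docs for pair in combinations(sorted(set(doc)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to report
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / term_counts[antecedent]
            lift = confidence / (term_counts[consequent] / n)
            rules[(antecedent, consequent)] = (support, confidence, lift)
    return rules


print(frequent_terms(DOCS, top_n=3))
for rule, stats in sorted(association_rules(DOCS).items()):
    print(rule, stats)
```

In this toy data "brake" is as frequent as "failure", but only the rule failure -> crash has confidence 1.0, which illustrates why frequency alone may not be relevant to understanding the problem.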



Using Information Extracted from Text

Enrich database with extracted terms/variables

[Diagram: the Text Field passes through Information Extraction (Identify Frequent Terms, Create Dictionary, Use Dictionary Lookup to extract terms; Regular Expression Extraction; Other Extraction Techniques) to produce Derived Variables, which are aggregated by row and combined with Structured Data into Enriched Data.]



Using Information Extracted from Text

Use new structured variables for reporting and analytics

Create additional dimensions for OLAP, e.g.:

Extract skills mentioned in job postings

Add skills to OLAP cube to enable reports like "What are the top 10 skills sought by my competitors?"

Improve predictive power and interpretability of data mining models, e.g.:

Extract information on parts and failure attributes from safety complaints

Associations: find correlations between part failures and crashes

Decision tree: add information on parts and failure attributes to gain additional insights and improve predictive power
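
The job postings example can be sketched end to end with an in-memory SQLite database standing in for the warehouse. Everything here is an assumption made for illustration: the `posting` and `posting_skill` tables, the skill dictionary, and the sample rows. A GROUP BY query plays the role of the OLAP report.

```python
import re
import sqlite3

SKILLS = ["sql", "java", "data mining"]  # hypothetical skill dictionary


def enrich(conn):
    """Extract skills from the text column into a new structured table."""
    conn.execute("CREATE TABLE posting_skill (posting_id INTEGER, skill TEXT)")
    postings = conn.execute("SELECT id, description FROM posting").fetchall()
    for posting_id, description in postings:
        for skill in SKILLS:
            # Whole-word match so "java" does not fire on "javascript".
            if re.search(rf"\b{re.escape(skill)}\b", description, re.IGNORECASE):
                conn.execute(
                    "INSERT INTO posting_skill VALUES (?, ?)", (posting_id, skill)
                )


def top_skills(conn, limit=10):
    """Report the most frequently requested skills per company."""
    query = """
        SELECT p.company, s.skill, COUNT(*) AS n
        FROM posting p JOIN posting_skill s ON s.posting_id = p.id
        GROUP BY p.company, s.skill
        ORDER BY n DESC, p.company, s.skill
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posting (id INTEGER PRIMARY KEY, company TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO posting VALUES (?, ?, ?)",
        [
            (1, "Acme", "Seeking SQL and Java developer"),
            (2, "Acme", "Data mining analyst with SQL"),
            (3, "Globex", "Java engineer"),
        ],
    )
    enrich(conn)
    return top_skills(conn)


print(demo())
```

Once the skills sit in their own table, they behave like any other dimension and can be joined, grouped, and filtered alongside the existing structured columns.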



Practical Example:

Camera Product Review



Scenario

A company has gathered customer comments and product ratings on their digital cameras from an external forum.

Goal: Improve customer satisfaction by understanding the key drivers of customer sentiment

Identify camera features that are correlated with positive/negative reviews

Identify areas needing improvement and/or product differentiation



Live Demo

InfoSphere Warehouse

Camera product review



Practical Example:

NHTSA Vehicle Safety Complaints



Scenario

The National Highway Traffic Safety Administration (NHTSA) COMPLAINTS dataset contains all safety-related defect complaints received by NHTSA since January 1, 1995.

Dataset contains structured variables (Make, Model, Year, etc.) and an unstructured text field (consumer complaints description).

Goal: Enrich predictive mining models of vehicle safety by incorporating variables extracted from text

Extract key variables related to vehicle safety

Combine extracted variables with existing ones to develop insights and improve predictive power of mining models of vehicle safety



Live Demo

InfoSphere Warehouse

NHTSA vehicle safety complaints



InfoSphere Warehouse and IBM Content Analyzer



Scenarios Using Advanced Text Analysis

Customer sentiment requires detection of

Negation: "Product X is not a good choice" -> negative

Non-facts: "Product X is a good choice" vs. "Product X is perhaps a good choice"

Identify parts that failed without a list of parts

Want to extract "gasket failure", "wiring harness has failed", but not "severe failure", "when it has failed"

Requires rules like:

One or two nouns, followed by one or two arbitrary words, followed by "fail"

Can be addressed by analysis components from other IBM products, e.g., IBM Content Analyzer
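
A rule of this shape can be sketched in Python under two loudly stated assumptions. First, a tiny hand-made noun lexicon stands in for a real part-of-speech tagger, which is exactly the component the slide says is needed. Second, the sketch allows zero to two intervening words so that "gasket failure" also matches. It is not how IBM Content Analyzer implements its rules.

```python
import re

# Stand-in for a part-of-speech tagger: a real system would tag nouns
# without needing this list.
NOUNS = {"gasket", "wiring", "harness", "pump", "hose"}


def failed_parts(text, max_gap=2):
    """Find one or two nouns followed, within max_gap words, by a 'fail...' word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    parts = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in NOUNS:
            i += 1
            continue
        # Collect one noun, or two consecutive nouns.
        j = i + 1
        if j < len(tokens) and tokens[j] in NOUNS:
            j += 1
        # Up to max_gap arbitrary words may precede the "fail..." token.
        window = tokens[j:j + max_gap + 1]
        if any(token.startswith("fail") for token in window):
            parts.append(" ".join(tokens[i:j]))
        i = j
    return parts


TEXT = "Gasket failure noted. The wiring harness has failed. Severe failure when it has failed."
print(failed_parts(TEXT))
```

Because "severe" and "it" are not nouns, "severe failure" and "when it has failed" are skipped. The tokenizer drops punctuation, so a match can cross a sentence boundary; a production rule would need sentence detection as well.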


InfoSphere Warehouse and IBM Content Analyzer

InfoSphere Warehouse 9.5

Extract predefined concepts, based on lists and regular expressions

Text analysis is a key component within ETL flow

Results can be used in data mining and reporting

IBM Content Analyzer

Extract concepts based on grammar and parts-of-speech

Combine search and text mining to discover insights

Combined approach

Use IBM Content Analyzer to explore documents and identify relevant concepts

Operationalize insights by putting Content Analyzer text analysis into InfoSphere Warehouse ETL flows


Using Content Analyzer in InfoSphere Flow

Descriptions of vehicle failures

Use ICA Text Analysis to extract nouns that are followed by "fail" or "leak"

Result: car parts that failed



Trends in Unstructured Analytics


Trends in Unstructured Analytics

Speech analytics

Combines traditional text analytics with speech as the source of the text

Gives access to "voice of the customer" for a wide range of interesting insights into customer behavior, e.g.:

Identifying cross-sell and up-sell opportunities

Identifying indicators of high risk of lapsing (e.g., expressing dissatisfaction, mentioning competitors)

Multi-modal image analysis

Improved detection of patterns and anomalies in images

Example: medical imaging to look for evidence of disease or injury


Trends in Unstructured Analytics (cont'd)

Sentiment detection in web sources

Insights on products and companies (blogs, chats)

Sentiments influence product/service directions

Noisy unstructured data analysis

Extract information from highly noisy unstructured text sources such as:

Online chats, text messages, emails, message boards, newsgroups, blogs, wikis, web pages, printed/handwritten text

Text produced by processing speech

Noise includes:

Spelling errors, abbreviations, non-standard words, missing punctuation, missing case information, pauses, verbal fillers
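
A few of these noise types can be reduced with a simple normalization pass before extraction. The sketch below is illustrative only: the abbreviation table, the filler list, and the `normalize` helper are hypothetical, and real noisy-text handling (spelling correction in particular) needs far more than this.

```python
import re

# Hypothetical lookup tables; a real deployment would curate these per source.
ABBREVIATIONS = {"pls": "please", "u": "you", "thx": "thanks", "acct": "account"}
FILLERS = {"um", "uh", "er", "hmm"}


def normalize(text):
    """Lowercase, drop verbal fillers, expand abbreviations, squeeze letter runs."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue
        token = ABBREVIATIONS.get(token, token)
        # Collapse a character repeated three or more times to two: "sooooo" -> "soo".
        token = re.sub(r"(.)\1{2,}", r"\1\1", token)
        cleaned.append(token)
    return " ".join(cleaned)


print(normalize("um PLS fix my acct, it is sooooo slow... thx"))
```

Normalizing first means that downstream dictionaries and patterns only have to cover the standard form of each term.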


Summary

Text analysis is becoming increasingly important with the rapid growth in unstructured data.

InfoSphere Warehouse provides many capabilities for practical text analysis.

ISW provides an integrated platform for data mining, text analysis, and reporting.

Practical examples illustrate how to perform text analytics and combine it with data mining.

IBM Content Analyzer and UIMA-compliant annotators can extend text analysis capabilities.

Emerging unstructured analytics technologies are extending the value and applications of text analytics in many important fields.

IBM Research is active in many of these areas.

Disclaimer



Copyright IBM Corporation 2008. All rights reserved.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

IBM, the IBM logo, ibm.com, InfoSphere Warehouse, Content Analyzer, and OmniFind Analytics Edition are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.
