practical text analytics

Upload: eddy-chan

Post on 04-Jun-2018


  • 8/13/2019 Practical Text Analytics

    1/32

    Reading Between the Lines: Practical Text Analytics

    John B. Rollins, Ph.D., P.E., BI Solution Architect, IBM Corporation, [email protected]

    Venkatesh, Ph.D., Data Mining Specialist, IBM Corporation, [email protected]

    Alexander Lang, Ph.D., Software Engineer, IBM Corporation, [email protected]

    Stefan Abraham, Ph.D., Software Engineer, IBM Corporation, [email protected]

    Session Number 2336


    Agenda

    Business motivation for analyzing unstructured data

    Text analysis: from text to structure

    Text analysis in InfoSphere Warehouse

    Practical examples

    InfoSphere Warehouse and IBM Content Analyzer

    Trends in unstructured analytics


    Information Irony

    "Water, water, everywhere, / Nor any drop to drink."

    Samuel Taylor Coleridge, The Rime of the Ancient Mariner

    Data is being generated at an unprecedented rate

    IDC estimates data will grow from 161 exabytes in 2006 to 988 exabytes in 2010 (1 exabyte = 1 billion gigabytes!)

    Much of it (80% by some estimates) is unstructured

    Information rich

    Extracting value is difficult

    Our focus is unstructured TEXTUAL INFORMATION!


    Information Warehouse Growth Trend

    TDWI Survey, 2007

    Respondents expect huge increase in unstructured data as warehouse sources

    Collaborative content(Email, IM, Wikis)

    Content management

    Voice transcriptions

    Claims records

    Chart & Survey: P. Russom, BI Search and Text Analytics, TDWI Best Practices Report, 2007

    Rapid Growth in Unstructured Data


    Business Scenarios for Unstructured Data

    Example: improve product innovation and quality

    Use information from customer service records, repair notes, online reviews, and other unstructured sources

    Reduce reliance on static, predefined problem codes by extracting detailed problem descriptions

    Understand faster why a product has problems

    Identify gaps in the product portfolio and new functionality to drive product innovation


    Business Scenarios for Unstructured Data

    Product Innovation and Quality

    Static Problem Codes

    (which problems occurred)

    Text Analysis

    (why problems occurred)


    Business Scenarios for Unstructured Data

    Example: reduce customer churn

    Identify unhappy customers as early as possible

    Analyze text to identify emerging problems, e.g., call center complaints about dropped calls

    May be too late to take action by the time a problem appears in structured data, e.g., declining number of calls over time for a given cell phone service provider

    Analyze customer email and call center logs to detect negative sentiment

    Is the customer angry?

    Is the customer mentioning competitors' pricing offers?

    Is the customer complaining about a particular problem with service or product?


    Text Analysis:

    From Text to Structure


    Text to Structure

    Create new structured variables from text

    Example: analysis of vehicle complaints report

    Extract accident attributes from text field

    Create new variables to represent accident attributes

    Complaint text: "CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY."

    CMPLID  INJURY  ACCIDENT  ABS     AIRBAG_DRV  AIRBAG_PASS

    17869   Severe  Yes       Failed  --          No Deploy
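
    A minimal sketch of this text-to-structure step, assuming simple regular expressions are enough for this one complaint. The column names come from the slide; the patterns, the RULES table, and the extract_attributes helper are illustrative assumptions, not the extraction rules used in InfoSphere Warehouse.

```python
import re

# Hypothetical rules: column name -> (pattern, value to store on a match).
RULES = {
    "INJURY": (re.compile(r"\bSEVER\w*\s+INJUR", re.I), "Severe"),
    "ACCIDENT": (re.compile(r"\bACCIDENT\b", re.I), "Yes"),
    # Stay inside one sentence ([^.]) when looking for the failure word.
    "ABS": (re.compile(r"\b(?:ABS|ANTI-LOCK BRAKE)\b[^.]*?\bFAIL", re.I), "Failed"),
    "AIRBAG_DRV": (re.compile(r"\bDRIVER'?S?\s+AIR\s?BAG\s+DIDN'?T\s+DEPLOY", re.I), "No Deploy"),
    "AIRBAG_PASS": (re.compile(r"\bPASSENGER'?S?\s+AIR\s?BAG\s+DIDN'?T\s+DEPLOY", re.I), "No Deploy"),
}


def extract_attributes(cmplid, text):
    """Return one structured row; None marks an attribute that was not found."""
    row = {"CMPLID": cmplid}
    for column, (pattern, value) in RULES.items():
        row[column] = value if pattern.search(text) else None
    return row


complaint = ("CONSUMER WAS SEVERLY INJURED IN AN ACCIDENT. THE ABS ANTI-LOCK "
             "BRAKE FAILED AND PASSENGERS AIR BAG DIDN'T DEPLOY.")
row = extract_attributes(17869, complaint)
print(row)
```

    The pattern for INJURY deliberately tolerates the misspelling in the raw complaint, which is typical of the noisy text such rules have to handle.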


    Types of Information Extraction

    Named entity recognition

    Extract person or place names, monetary expressions, etc.

    Co-reference resolution

    Identify expressions that refer to the same entity

    "Alex Lang is our co-author. He is not at IOD this year."

    Relationship detection

    Extract entities (e.g., products and problems) and use data mining to find relationships among them

    More robust than elaborate, hand-crafted rules

    Associations, clustering, predictive modeling


    Which Extraction Technique to Use?

    Depends on the concepts to be extracted

    Concept is a fixed list of instances: use dictionaries

    Product names from database

    List of employee names from LDAP

    Concept follows a simple pattern: use regular expressions

    Phone numbers, product codes, etc.

    Concept follows complex pattern: use advanced analysis components

    Relationships among concepts/entities

    Additional tools/capabilities

    Text exploration (e.g., OmniFind Analytics / Content Analyzer)

    Customized annotators (e.g., sentiments)
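
    The first two techniques can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term list, the North American phone number format, and the helper names build_dictionary_matcher and extract are hypothetical and are not part of any IBM product API.

```python
import re


def build_dictionary_matcher(terms):
    """Compile a fixed list of terms into one case-insensitive matcher."""
    # Longest terms first so "wiring harness" wins over "harness".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def extract(text, matcher):
    """Return every match of the compiled matcher, lower-cased, in text order."""
    return [match.group(0).lower() for match in matcher.finditer(text)]


# Fixed list of instances: use a dictionary (hypothetical terms).
PARTS = build_dictionary_matcher({"harness", "wiring harness", "gasket"})

# Simple pattern: use a regular expression (assumed NNN-NNN-NNNN format).
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

note = "The wiring harness failed; call 555-123-4567 about the gasket."
parts_found = extract(note, PARTS)
phones_found = extract(note, PHONE)
print(parts_found, phones_found)
```

    In practice the dictionary would be loaded from a product table or an LDAP directory, as the slide suggests, rather than written inline.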


    Text Analysis in InfoSphere Warehouse


    Text Analytics in InfoSphere Warehouse

    Data understanding: view text columns in database, text statistics, frequent terms analysis

    Run advanced UIMA engines (IBM OmniFind, Partners, IBM Research)

    Extract frequent term patterns

    Create entity dictionary from terms

    List of terms to extract

    Regular expression-based extraction

    Enable hierarchical grouping of terms


    Focused vs. Explorative Approaches

    Focused

    Create dictionaries, rules, or annotators to extract precisely the relevant concepts

    More upstream effort for creation and testing

    Allows analysis focused on a specific question

    Example: Show the top 10 car parts that occur in repair reports for automobile make X

    Explorative

    Create dictionaries that contain terms with certain parts-of-speech patterns (e.g., adjective-noun)

    Use downstream analysis (e.g., association rule mining) to weed out irrelevant terms

    Allows detection of unknown events or relationships

    Example: Part X + Failure Attribute Y → Vehicle Crash


    Explorative Approach: Example

    Frequent terms + Associations analysis → Cognos report

    Frequent Terms Analysis: count the most frequently occurring terms for Auto Make X (relative occurrence may or may not be relevant to understanding the problem)

    Associations Mining: discover correlated terms that are relevant to resolving the problem
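
    The two explorative steps can be illustrated with a small Python sketch: count in how many documents each term occurs, then score term pairs by support, confidence, and lift. The toy documents and the helper names frequent_terms and pair_rules are assumptions for illustration. Real associations mining also handles rules with several antecedents (such as Part X + Failure Attribute Y), which this pairwise version does not.

```python
from collections import Counter
from itertools import combinations


def frequent_terms(docs, min_count):
    """Count the number of documents each term occurs in; keep frequent ones."""
    counts = Counter(term for doc in docs for term in set(doc))
    return {term: n for term, n in counts.items() if n >= min_count}


def pair_rules(docs, min_count):
    """Return (antecedent, consequent, support, confidence, lift) for term pairs."""
    total = len(docs)
    term_counts = frequent_terms(docs, min_count)
    pair_counts = Counter()
    for doc in docs:
        kept = sorted(set(doc) & term_counts.keys())
        pair_counts.update(combinations(kept, 2))
    rules = []
    for (a, b), both in pair_counts.items():
        if both < min_count:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / term_counts[x]       # P(y | x)
            lift = confidence / (term_counts[y] / total)
            rules.append((x, y, both / total, confidence, lift))
    return rules


# Toy complaints, already reduced to extracted terms (hypothetical data).
docs = [
    ["brake", "fail", "crash"],
    ["brake", "fail", "crash"],
    ["brake", "noise"],
    ["tire", "noise"],
]
rules = {(x, y): (s, c, l) for x, y, s, c, l in pair_rules(docs, min_count=2)}
print(rules[("fail", "crash")])
```

    Here "brake" is the most frequent term, yet the rule from "fail" to "crash" has the higher lift, which matches the slide's point that raw frequency may not be what is relevant to the problem.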


    Using Information Extracted from Text

    Enrich database with extracted terms/variables

    [Flow diagram] Text Field → Information Extraction → Derived Variables → Aggregate by Row → combined with Structured Data → Enriched Data

    Information Extraction options: Identify Frequent Terms, Create Dictionary, and Use Dictionary Lookup to extract terms; Regular Expression Extraction; Other Extraction Techniques
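
    One way to picture the "aggregate by row" step is to pivot the extracted (row, term) pairs into one indicator column per term and attach them to the structured columns. The sketch below is a plain-Python illustration under that assumption; the aggregate_by_row helper, the HAS_ column prefix, and the sample records are hypothetical.

```python
def aggregate_by_row(records, extractions, terms):
    """Join structured records with 0/1 indicator columns for extracted terms.

    records:     dict mapping row id -> dict of structured columns
    extractions: iterable of (row id, extracted term) pairs
    terms:       terms that should become derived variables
    """
    found = {}
    for row_id, term in extractions:
        found.setdefault(row_id, set()).add(term)

    enriched = []
    for row_id, columns in records.items():
        row = {"ID": row_id, **columns}
        for term in terms:
            column = "HAS_" + term.upper().replace(" ", "_")
            row[column] = int(term in found.get(row_id, ()))
        enriched.append(row)
    return enriched


records = {1: {"MAKE": "X"}, 2: {"MAKE": "Y"}}
extractions = [(1, "brake"), (1, "air bag"), (1, "brake")]
enriched = aggregate_by_row(records, extractions, ["brake", "air bag"])
print(enriched)
```

    Repeated mentions of the same term in one row collapse to a single indicator; a count column would be an equally reasonable design if frequency matters for the downstream model.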


    Using Information Extracted from Text

    Use new structured variables for reporting and analytics

    Create additional dimensions for OLAP, e.g.:

    Extract skills mentioned in job postings

    Add skills to OLAP cube to enable reports like "What are the top 10 skills sought by my competitors?"

    Improve predictive power and interpretability of data mining models, e.g.:

    Extract information on parts and failure attributes from safety complaints

    Associations: find correlations between part failures and crashes

    Decision tree: add information on parts and failure attributes to gain additional insights and improve predictive power


    Practical Example:

    Camera Product Review


    Scenario

    A company has gathered customer comments and product ratings on their digital cameras from an external forum.

    Goal: Improve customer satisfaction by understanding the key drivers of customer sentiment

    Identify camera features that are correlated with positive/negative reviews

    Identify areas needing improvement and/or product differentiation


    Live Demo

    InfoSphere Warehouse

    Camera product review


    Practical Example:

    NHTSA Vehicle Safety Complaints


    Scenario

    The National Highway Traffic Safety Administration (NHTSA) COMPLAINTS dataset contains all safety-related defect complaints received by NHTSA since January 1, 1995.

    Dataset contains structured variables (Make, Model, Year, etc.) and an unstructured text field (consumer complaints description).

    Goal: Enrich predictive mining models of vehicle safety by incorporating variables extracted from text

    Extract key variables related to vehicle safety

    Combine extracted variables with existing ones to develop insights and improve predictive power of mining models of vehicle safety


    Live Demo

    InfoSphere Warehouse

    NHTSA vehicle safety complaints


    InfoSphere Warehouse and IBM Content Analyzer


    Scenarios Using Advanced Text Analysis

    Customer sentiment requires detection of

    Negation: "Product X is not a good choice" → negative

    Non-facts: "Product X is a good choice" vs. "Product X is perhaps a good choice"

    Identify parts that failed without a list of parts

    Want to extract "gasket failure", "wiring harness has failed", but not "severe failure", "when it has failed"

    Requires rules like:

    One or two nouns, followed by one or two arbitrary words, followed by "fail"

    Can be addressed by analysis components from other IBM products, e.g., IBM Content Analyzer
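
    A rough Python sketch of such a rule follows. It is not how IBM Content Analyzer implements it: a tiny hand-written noun set (NOUNS) stands in for a real part-of-speech tagger, failed_parts is a hypothetical helper, and the gap is relaxed to zero to two intervening words (an assumption) so that "gasket failure" is also caught.

```python
import re

# Hypothetical noun lexicon standing in for a real part-of-speech tagger.
NOUNS = {"gasket", "wiring", "harness", "pump", "hose"}


def failed_parts(text, max_gap=2):
    """Find runs of one or two nouns followed, within max_gap words, by fail*."""
    tokens = re.findall(r"[a-z]+", text.lower())
    parts = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in NOUNS:
            i += 1
            continue
        # Collect a run of one or two nouns starting at position i.
        j = i + 1
        if j < len(tokens) and tokens[j] in NOUNS:
            j += 1
        # Accept the run if a word starting with "fail" follows closely.
        for k in range(j, min(j + max_gap + 1, len(tokens))):
            if tokens[k].startswith("fail"):
                parts.append(" ".join(tokens[i:j]))
                break
        i = j
    return parts


print(failed_parts("The wiring harness has failed"))
print(failed_parts("It was a severe failure when it has failed"))
```

    Because "severe", "when", and "it" are not nouns, the phrases the slide wants to exclude produce no match, without needing a list of car parts beyond the noun test itself.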


    InfoSphere Warehouse and

    IBM Content Analyzer

    InfoSphere Warehouse 9.5

    Extract predefined concepts, based on lists and regular expressions

    Text analysis is a key component within ETL flow

    Results can be used in data mining and reporting

    IBM Content Analyzer

    Extract concepts based on grammar and parts-of-speech

    Combine search and text mining to discover insights

    Combined approach

    Use IBM Content Analyzer to explore documents and identify relevant concepts

    Operationalize insights by putting Content Analyzer text analysis into InfoSphere Warehouse ETL flows


    Using Content Analyzer In InfoSphere Flow

    Descriptions of vehicle failures

    Use ICA Text Analysis to extract nouns that are followed by "fail" or "leak"

    Result: car parts that failed


    Trends in Unstructured Analytics


    Trends in Unstructured Analytics

    Speech analytics

    Combines traditional text analytics with speech as the source of the text

    Gives access to "voice of the customer" for a wide range of interesting insights into customer behavior, e.g.:

    Identifying cross-sell and up-sell opportunities

    Identifying indicators of high risk of lapsing (e.g., expressing dissatisfaction, mentioning competitors)

    Multi-modal image analysis

    Improved detection of patterns and anomalies in images

    Example: medical imaging to look for evidence of disease or injury


    Trends in Unstructured Analytics (cont'd)

    Sentiment detection in web sources

    Insights on products and companies (blogs, chats)

    Sentiments influence product/service directions

    Noisy unstructured data analysis

    Extract information from highly noisy unstructured text sources such as:

    Online chats, text messages, emails, message boards, newsgroups, blogs, wikis, web pages, printed/handwritten text

    Text produced by processing speech

    Noise includes:

    Spelling errors, abbreviations, non-standard words, missing punctuation, missing case information, pauses, verbal fillers


    Summary

    Text analysis is becoming increasingly important with the rapid growth in unstructured data.

    InfoSphere Warehouse provides many capabilities for practical text analysis.

    ISW provides an integrated platform for data mining, text analysis, and reporting.

    Practical examples illustrate how to perform text analytics and combine it with data mining.

    IBM Content Analyzer and UIMA-compliant annotators can extend text analysis capabilities.

    Emerging unstructured analytics technologies are extending the value and applications of text analytics (TA) in many important fields.

    IBM Research is active in many of these areas.

    Disclaimer


    Copyright IBM Corporation 2008. All rights reserved.

    U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

    IBM, the IBM logo, ibm.com, InfoSphere Warehouse, Content Analyzer, and OmniFind Analytics Edition are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

    Other company, product, or service names may be trademarks or service marks of others.
