Text Analytics World Future Directions of Text Analytics: Smarter, Bigger, and Better Tom Reamy Chief Knowledge Architect KAPS Group Program Chair Text Analytics World Knowledge Architecture Professional Services http://www.kapsgroup.com

Upload: lekiet

Post on 10-Mar-2018

Text Analytics World Future Directions of Text Analytics:

Smarter, Bigger, and Better

Tom Reamy

Chief Knowledge Architect

KAPS Group

Program Chair – Text Analytics World

Knowledge Architecture Professional Services

http://www.kapsgroup.com

Text Analytics World Highlights

Keynote – Peter Morville, Information Architecture+

Keynote – Future of Text Analytics – Bigger, Better, Smarter

Social Media and Enterprise Text Analytics – new techniques, new applications, new directions – Integration

Two Panels – leading TA experts: Interactive: What you always wanted to know about TA, but were afraid to ask.

Great Companies: Visit Sponsors & hear great case studies

Text Analytics Workshop – Thursday

Logistics

Agenda

Introduction:

– Current State of Text Analytics

– Survey / Report

Enterprise Text Analytics - Search – still fundamental

– Shift from information to business

Social Media – Next Generation

– Different World: Content, Structures, Applications

Future of Text Analytics

– Roadblocks, Deep Vision

Questions

Introduction: KAPS Group

Knowledge Architecture Professional Services – Network of Consultants

Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies

Services:

– Strategy – IM & KM - Text Analytics, Social Media, Integration

– Taxonomy/Text Analytics development, consulting, customization

– Text Analytics Quick Start – Audit, Evaluation, Pilot

– Social Media: Text based applications – design & development

Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics

Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A.

Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.

Presentations, Articles, White Papers – www.kapsgroup.com

Introduction: Coming Soon

New Book: Text Analytics: How to Conquer Information Overload and Get Real Value from Social Media

Due end of May

Free Copy to Workshop Attendees

One randomly selected person at the conference will receive a free copy – stay tuned!

Text Analytics World

Current State of Text Analytics

History – academic research, focus on NLP

Inxight – out of Xerox PARC

– Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data

Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends

– Half from 2008 are gone - Lucky ones got bought

Early applications – News aggregation and Enterprise Search

Second Wave = shift to sentiment analysis

Third Wave = Multiple Enterprise & Social Applications

– Watson = New Levels of Excitement

– Need practical version

Text Analytics World

Current State of Text Analytics: Vendor Space

Taxonomy Management – SchemaLogic, Pool Party

Taxonomy & Semantic Networks - Text Analytics Solutions

– Access Innovation, Luminoso

Extraction and Analytics

– Linguamatics (Pharma), Temis, whole range of companies

Business Intelligence – Clear Forest, Inxight

Sentiment Analysis – Attensity, Lexalytics, Clarabridge

Open Source – GATE

Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching

Embedded in Content Management, Search

– Autonomy, FAST, Endeca, Exalead, etc.

Market Mindshare – IBM, SAS, Clarabridge, Lexalytics

Current Market: Text Analytics

Surveys, Seth Grimes Report

Market – 2014 - $2Bil

Enterprise search – 30-50% of market ($1Bil)

Text Analytics is growing 20% a year, 10% of analytics

Fragmented market – no clear leader

Social and Voice of Customer is huge

Money (investor) is still mostly social

Cloud-based Software as Service continues to grow

Growth as a market – slowed, as a technique – expanding

– (Me – time for new direction, characterization of field, etc.)

US market different than Europe/Asia – project oriented

Seth Grimes Report + Interviews Leading Analysts:

Current Trends

From Mundane to Advanced – reducing manual labor to “Cognitive Computing”

Enterprise – Shift from Information to Business – cost cutting rather than productivity gains

Embedded solutions – not called TA (but should be because they suffer from weak TA)

Graph databases (saying since 2010 – he’ll be right one of these years): Open Knowledge Graphs

Human-Machine – still need human hybrid

Rules – hard to maintain and new text (wrong kind of rules)

Seth Grimes Report

Current and Future Trends

Top four in Grimes survey:

– Ability to generate taxonomies (64%)

– Ability to use specialized taxonomies, ontologies, etc. (54%)

– Broad information extraction (53%)

– Document Classification (53%)

Top business applications

– Brand/product/reputation management (38%)

– Voice of the Customer (39%)

– Competitive Intelligence (33%)

– Search, Info Access, etc. (29%)

– (Research 38% - not listed as a choice)

Seth Grimes Report

Current and Future Trends

Current – extracting more, and more diverse, types of info, applying insights in new ways and for new purposes – yet user satisfaction still lagging – accuracy and ease of use

74% satisfied with TA – only 4% disappointed

Most dissatisfaction – ease of use (29%) and availability of professional services/support (50%)

48% likely to recommend their provider – 36% would recommend against

Enterprise Text Analytics

Search is still #1 = 30-50% of applications

New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering

Trend = Text Analytics/Search as Semantic Infrastructure

– Platform for Info Apps (Search-based applications)

SharePoint – Major focus of TA companies – fix problems with taxonomy/folksonomy

– Hybrid workflow – Publish document -> TA analysis -> suggestions for categorization, entities, metadata -> present to author

External information = more automation, extraction – precision more important

Enterprise Text Analytics

Adding Structure to Unstructured Content

Beyond Documents – categorization by corpus, by page, sections or even sentence or phrase

Documents are not unstructured – variety of structures

– Sections – Specific - “Abstract” to Function “Evidence”

– Corpus – document types/purpose

– Textual complexity, level of generality

Need to develop flexible categorization and taxonomy – tweets to 200 page PDF

Applications require sophisticated rules, not just categorization by similarity

Enterprise Text Analytics

Document Type Rules

(START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]"),
(OR, _/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),

If the article has sections like Abstract or Methods AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score

Primary issue – major mentions, not every mention

– Combination of noun phrase extraction and categorization

– Results – virtually 100%
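
The document type rule above can be approximated in a few lines of Python. This is only a rough sketch of the logic as the slide describes it, not the vendor rule language: the function name trial_relevance, the tokenizer, and the reading of START_2000 as "look at the first 2000 words" and DIST_5 as "within 5 words" are assumptions, and the rule as shown on the slide is cut off, so the exact nesting is guessed.

```python
import re

# Section tags and exclusion words are taken from the slide's rule;
# everything else here is an illustrative assumption.
SECTION_MARKERS = ("[abstract]", "[methods]")
EXCLUDERS = {"approved", "safe", "use", "animals"}


def trial_relevance(article, start=2000, dist=5):
    """Count 'clinical trial*' / 'humans' mentions that have no exclusion
    word within `dist` tokens, looking only at the first `start` tokens."""
    lowered = article.lower()
    # The article must contain at least one of the expected section tags.
    if not any(marker in lowered for marker in SECTION_MARKERS):
        return 0
    tokens = re.findall(r"[a-z]+", lowered)[:start]
    score = 0
    for i, tok in enumerate(tokens):
        is_trial = tok == "humans" or (
            tok == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        )
        if not is_trial:
            continue
        window = tokens[max(0, i - dist): i + dist + 1]
        # Skip mentions that sit close to an exclusion word.
        if not EXCLUDERS.intersection(window):
            score += 1
    return score


human_study = ("[Abstract] We report a randomized clinical trial in adult "
               "patients. [Methods] Sixty humans were enrolled.")
animal_study = "[Abstract] A clinical trial in animals was completed."
print(trial_relevance(human_study))   # 2
print(trial_relevance(animal_study))  # 0
```

Summing hits rather than returning a yes/no answer mirrors the "add up a relevancy score" step, so a threshold can separate major mentions from passing ones.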

Enterprise Text Analytics

Building on the Foundation: Applications

Focus on business value, cost cutting

Enhancing information access is a means, not an end

– Governance, Records Management, Doc duplication, Compliance

– Applications – Business Intelligence, CI, Behavior Prediction

– eDiscovery, litigation support

– Risk Management

– Productivity / Portals – spider and categorize, extract – KM communities & knowledge bases

• New sources – field notes into expertise, knowledge base – capture real time, own language-concepts

Enterprise Text Analytics: Applications

Pronoun Analysis: Fraud Detection; Enron Emails

Function words = pronouns, articles, prepositions, conjunctions, etc.

– Used at a high rate, short and hard to detect, very social, processed in the brain differently than content words

Patterns of “Function” words reveal wide range of insights

Areas: sex, age, power-status, personality – individuals and groups

Lying / Fraud detection: Documents with lies have:

– Fewer, shorter words, fewer conjunctions, more positive emotion words

– More use of “if, any, those, he, she, they, you”, less “I”

Current research – 76% accuracy in some contexts

Text Analytics can improve accuracy and utilize new sources

Combining with data analytics can improve accuracy
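
The function-word signals listed above are easy to turn into numeric features. The sketch below is illustrative only: function_word_features is a made-up name, the conjunction list is an assumption, and positive emotion words are left out because they would need a lexicon the slides do not supply. It computes features; it does not detect lies on its own.

```python
import re

# Cue words are the ones quoted on the slide; the conjunction list is an
# assumption added for illustration.
CUE_WORDS = {"if", "any", "those", "he", "she", "they", "you"}
CONJUNCTIONS = {"and", "but", "or", "because", "so", "although"}


def function_word_features(text):
    """Return simple per-document rates for the signals named on the slide."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid dividing by zero on empty text
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / n,
        "first_person_rate": words.count("i") / n,
        "cue_word_rate": sum(w in CUE_WORDS for w in words) / n,
        "conjunction_rate": sum(w in CONJUNCTIONS for w in words) / n,
    }


features = function_word_features("I think they said you would pay if they could")
print(features["word_count"])     # 10
print(features["cue_word_rate"])  # 0.4
```

Features like these would normally be fed to a classifier alongside structured data, which is the combination the slide points to.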

Social Media: Next Generation

Beyond Simple Sentiment

Beyond Good and Evil (positive and negative)

– Degrees of intensity, complexity of emotions and documents

Importance of Context – around positive and negative words

– Rhetorical reversals – “I was expecting to love it”

– Issues of sarcasm, (“Really Great Product”), slanguage

Essential – need full categorization and concept extraction

Voice of the Customer: Must Have

– Need full Text Analytics to do well

New conceptual models, models of users, communities

New Content Characteristics

It’s a Very Different World

Scale – orders of magnitude – 100’s of millions, Billions

Speed – 20-100 million a day

Size – Twitter, Blogs, forums, email

– 140 characters to a few sentences

Quality – misspellings, lack of structure, incoherence

Conversations – not stand alone docs

– Can’t tell what a “document” is about without reference to previous threads

Purpose – communicate - social grooming, rant

– Not exchange of ideas, policies, etc.

Simple Content Complexity – single thoughts, simplicity of emotion

New Content Characteristics

It’s a Very Different World – Search and Taxonomy

i tried very slow, NO GOOGLE search, some apps not working.. This is not a "with GOOGLE" My friend has incredible, that is much batter.. Anyways i returned samsung, replace incredible. What's great about it: 4" LCD What's not so great: NOT A GOOGLE PHONE

(nt 2.0)willie John ci to/for: wanted to know about charges for pic mail for ;bill date 4/5/2010 | repeat: no | auth: pin | ptns affected: 7777777777 | information/instructions given: sup gave pic mail for free and gave adj for $ 2.40 new bal is $ 147.53 | any mobile, anytime: n | ir: yes | ir-email: n |

New Content Characteristics

It’s a Very Different World – Topical Current Content

Content not archived (for users)

No real need for search (or just very simple search)

Very Poor (if any) metadata – not faceted search

Focus on phrases, sentences – not documents

Little need of a complex subject taxonomy

About emotions, things, products, people

Emotion – simple structures, infinite kinds of expression

It’s a Very Different World

Companies are mining this resource and they need to add structure to get deeper understanding

Varieties of structure:

– Simple topical taxonomies 2-3 levels

– Emotion taxonomies, Ontologies and Semantic Networks

– Dynamic taxonomies – built on public taxonomies, enterprise taxonomy – exposed in hierarchical triples.

Need more automatic / semi-automatic solutions

– Advanced text analytics

New Kinds of Social Taxonomies

New Taxonomies – Appraisal

– Appraisal Groups – Adjective and modifiers – “not very good”

– Four types – Attitude, Orientation, Graduation, Polarity

– Supports more subtle distinctions than positive or negative

Emotion taxonomies

– Joy, Sadness, Fear, Anger, Surprise, Disgust

– New Complex – pride, shame, embarrassment, love, awe

– New situational/transient – confusion, concentration, skepticism

Beyond Keywords – Need Text Analytics

– Analysis of phrases, multiple contexts – conditionals, oblique

– Analysis of conversations – dynamic of exchange, private language

– Enterprise taxonomy rolled into a categorization taxonomy
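
As a rough illustration of appraisal groups, the sketch below scores a short phrase such as “not very good” by combining a head adjective with its modifiers. The tiny lexicons, the weights, and the function name score_appraisal_group are all invented for the example, and multiplying by a graduation weight and flipping the sign on negation is a deliberately crude model. The mapping of the four types onto the returned fields is likewise an assumption.

```python
# Toy lexicons; the words and weights are assumptions for illustration only.
ATTITUDE = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
GRADUATION = {"very": 1.5, "really": 1.5, "somewhat": 0.75, "slightly": 0.5}
NEGATORS = {"not", "never", "hardly"}


def score_appraisal_group(phrase):
    """Score an adjective-plus-modifiers group; None if the head is unknown."""
    words = phrase.lower().split()
    if not words or words[-1] not in ATTITUDE:
        return None
    head = words[-1]
    force = 1.0
    negated = False
    for word in words[:-1]:
        if word in GRADUATION:
            force *= GRADUATION[word]   # intensify or soften the head
        elif word in NEGATORS:
            negated = not negated       # each negator flips the sign
    score = ATTITUDE[head] * force
    return {
        "attitude": head,
        "orientation": "positive" if ATTITUDE[head] > 0 else "negative",
        "graduation": force,
        "polarity": "marked" if negated else "unmarked",
        "score": -score if negated else score,
    }


print(score_appraisal_group("very good")["score"])      # 1.5
print(score_appraisal_group("not very good")["score"])  # -1.5
```

Even this toy version separates “good”, “very good” and “not very good”, which a plain positive/negative keyword count cannot do.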

Social Media: Next Generation

Variety of New Applications

Crowd Sourcing Technical Support

– User Forums – find problem area, nearby text for solution

– Automatic or Human mediated

Legal Review

– Significant trend – computer-assisted review (manual = too many)

– TA – categorize and filter to smaller, more relevant set

– Payoff is big – One firm with 1.6 M docs – saved $2M

Financial Services

– Trend – using text analytics with predictive analytics – risk and fraud

– Combine unstructured text (why) and transaction data (what)

– Customer Relationship Management, Fraud Detection

– Stock Market Prediction – Twitter, impact articles

Social Media: Next Generation

Variety of New Applications

Voice of the Customer (Employee, Voter)

– Early discovery of issues with product, service, customer issues

– Identify opportunities for new products and services, sales or new feature improvements

– Enable companies to find and understand correlations between promotional campaigns and customer reactions

– It can lead to business or competitor intelligence

Current – better at gathering information than analyzing

Possibilities are (almost) endless

And a little bit scary – deep psychology, conservative-liberal brains

Social Media: Next Generation

Behavior Prediction – Telecom Customer Service

Problem – distinguish customers likely to cancel from mere threats

Basic Rule

– (START_20, (AND, (DIST_7, "[cancel]", "[cancel-what-cust]"), (NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))

Examples:

– customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.

– cci and is upset that he has the asl charge and wants it off or her is going to cancel his act

More sophisticated analysis of text and context in text

Combine text analytics with Predictive Analytics and traditional behavior monitoring for new applications
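
The basic rule above can be sketched in Python to show how proximity and negation work together. The concept word lists below stand in for the bracketed concepts ("[cancel]", "[cancel-what-cust]" and so on) and are hypothetical, as is the function name likely_to_cancel. The sketch also assumes START_20 means the cancel term must fall within the first 20 words and DIST_n means "within n words"; the real rule language may define these differently.

```python
import re

# Hypothetical word lists standing in for the bracketed concepts in the rule.
CONCEPTS = {
    "cancel": {"cancel", "cancell", "cancels", "canceling", "cancelling"},
    "cancel-what-cust": {"account", "acct", "act", "service", "contract"},
    "one-line": {"line"},
    "restore": {"restore", "reinstate", "reconnect"},
    "if": {"if", "unless"},
}


def positions(tokens, concept):
    """Token indexes at which any word of the named concept occurs."""
    return [i for i, tok in enumerate(tokens) if tok in CONCEPTS[concept]]


def likely_to_cancel(note, start=20, near=7, far=10):
    """True if a cancel term in the first `start` tokens has a customer
    object within `near` tokens and no blocking concept within `far`."""
    tokens = re.findall(r"[a-z']+", note.lower())
    blockers = [p for name in ("one-line", "restore", "if")
                for p in positions(tokens, name)]
    objects = positions(tokens, "cancel-what-cust")
    for c in positions(tokens, "cancel"):
        if c >= start:
            continue
        has_object = any(abs(c - p) <= near for p in objects)
        blocked = any(abs(c - p) <= far for p in blockers)
        if has_object and not blocked:
            return True
    return False


threat = ("customer called to say he will cancell his account if the does "
          "not stop receiving a call from the ad agency.")
intent = ("cci and is upset that he has the asl charge and wants it off or "
          "her is going to cancel his act")
print(likely_to_cancel(threat))  # False: "if" appears close to the cancel term
print(likely_to_cancel(intent))  # True
```

Note that the word lists have to include misspellings and abbreviations such as “cancell” and “act”, which is typical of call-center notes and part of why these rules need customization.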

Future of Text Analytics

Obstacles - Survey Results

What factors are holding back adoption of TA?

– Lack of clarity about TA and business value - 47%

– Lack of senior management buy-in - 8.5%

Need articulated strategic vision and immediate practical win

Issue – TA is strategic, US wants short term projects

– Sneak Project in, then build infrastructure – difficulty of speaking enterprise

Integration Issue – who owns infrastructure? IT, Library, ?

– IT understands infrastructure, but not text

– Need interdisciplinary collaboration – Stanford is offering English-Computer Science Degree – close, but really need a library-computer science degree

Future of Text Analytics

Primary Obstacle: Complexity

Usability of software is one element

More important is difficulty of conceptual-document models

– Language is easy to learn, hard to understand and model

Need to add more intelligence (semantic networks) and ways for the system to learn – social feedback

Customization – Text Analytics – heavily context dependent

– Content, Questions, Taxonomy-Ontology

– Level of specificity – Telecommunications

– Specialized vocabularies, acronyms

New Directions in Text Analytics

Conclusions

Text Analytics still growing: more mature applications and technique

Find the right balance of infrastructure and application focus

Essential theme – integration – text and data, enterprise and social

Big obstacles remain

– Strategic Vision of text analytics in the enterprise

– Concrete and quick application to drive acceptance

Future – Women, Fire, and Dangerous Things

– Text Analytics and Cognitive Science = Metaphor Analysis, deep language understanding, common sense?

New Directions in Text Analytics

Conclusions

Bigger:

– Big Data gets the press, but Big Text is bigger – and potentially more valuable – Needs more systemic solutions

– Number and variety of TA Applications still growing

Better:

– Libraries of Modules – Ensemble Methods

– Cognitive Computing – TA Foundation

Smarter:

– Not AI, but smarts without waiting for 50 years

Great Time to get into Text Analytics

Questions?

Tom Reamy

Program Chair – Text Analytics World

[email protected]

KAPS Group

http://www.kapsgroup.com