big data meets metadata – analyzing large data sets

Smartlogic™

Lucene Revolution 2012

Jeremy Bentley, CEO

1st degree of order

Filing management

•  80% of enterprise information is unstructured

•  Doubling every 19 months and accelerating [Gartner]

•  Increasing burden of compliance

•  Enterprise 2.0 additions

•  Big Data connotations

2nd degree of order

Index management

•  File plans and metadata schema

•  Manually applied classification

•  Low level of consistency and quality

3rd degree of order

Digital Asset Management

Publishing Systems

SharePoint

eDiscovery

Document Management

Content Management

Enterprise Search

Records Management

Portal Infrastructure

Process Management & Workflow

Automation of 1st & 2nd Degrees

A 10 year Flatline

•  2001, IDC, “Quantifying Enterprise Search”: searchers are successful in finding what they seek 50% of the time or less

•  2011, MindMetre/Smartlogic: more than half (52%) cannot find the information they need using their enterprise search system


User Search Satisfaction: 50% (2001), 48% (2011)

The explosion of information

[Chart: terabytes of data by period. 1993-2001: 4Tb; 2001-2009: 80Tb; beyond: ?]

20 times increase in information volume

Source: the National Archives

Volume + other disruptive factors

Copyright © 2011 Smartlogic Semaphore Limited

Velocity Variety Complexity

Cross-organizational and cross-platform information needs

Changing requirements for information over time

New 4th degree of order

Digital Asset Management

Publishing Systems

SharePoint

eDiscovery

Document Management

Content Management

Enterprise Search

Records Management

Portal Infrastructure

Process Management & Workflow

Content Intelligence

Content Intelligence

Information Manufacturing

Knowledge Recovery

Content Analytics

Data Loss Prevention

Risk & Compliance

Monetisation

Metadata

Knowing what you have

Metadata

Creation Date

Modified Date

Author

Format (PDF,DOC,XLS)

Subject

Location

Project

Function (IT,HR,Finance)

Expert

Protective Marker

Retention Expiry

Publisher

Site

Structural Process

Information

4th degree of order Content Intelligence

Content Intelligence Platform

FAST

SharePoint

What is Content Intelligence

Content Intelligence is the process of identifying, classifying, analyzing, surfacing and extracting information based on its meaning and context to make timely and informed business decisions.

Content Intelligence Solutions

MICROTARGETING & DISTRIBUTION

GOVERNANCE, COMPLIANCE & RISK

KNOWLEDGE ACQUISITION & REUSE

WEB-BASED SELF SERVICE

Big Data + Content Intelligence

From Gartner, 2011

Semaphore – Three Core Capabilities


Users

Semantic Model

Apply

Inform

Expose

Content

Build, Manage and Deploy Vocabularies/Libraries

Explore data to find insights

Automate the Metadata Enrichment

Ontology Manager

Classification Server

Semantic Enhancement Server

SEMAPHORE

Enterprise Classification

Important requirements for Velocity/Volume:

•  Scalability for large volumes of content, users, metadata and systems

•  Easy integration with processing systems - search, content, records and document management systems as well as file shares and content migration tools

•  Support for all the organization's languages and data formats

From Many Different Sources

Metadata Generation

Creation Date

Modified Date

Author

Format (PDF,DOC,XLS)

Brand

Service

Geography

Products

Expert

Protective Marker

Retention Expiry

Publisher

Site

Structural Process

Information

Different Vocabulary and Ambiguity

You Say / I Say

Perpetrator Burglar Thief

Swine Flu Swine Influenza Virus H1N1

Touchscreen Touch screen Multi-touch

You Say / What do you mean?

Apple: A fruit? Fiona - A singer / songwriter? An electronics company?

Rights: Employment rights? Equal rights? Right of way?

Ford: Ford Motor? Forward Industrials (ticker=FORD)? A shallow river crossing?

Missing results

Too many results
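
The two failure modes on this slide (missing results when people use different words for the same thing, too many results when one word has several meanings) can be sketched in a few lines of Python. This is only an illustration, not Semaphore's implementation: the variants and senses come from the slide, while the preferred labels, the context cue words and the function names `normalize` and `disambiguate` are assumptions made for the example.

```python
import re
from typing import Optional

# Variant -> preferred term. The variants come from the slide; choosing the
# first term of each row as the preferred label is an assumption.
PREFERRED = {
    "perpetrator": "perpetrator", "burglar": "perpetrator", "thief": "perpetrator",
    "swine flu": "swine flu", "swine influenza virus": "swine flu", "h1n1": "swine flu",
    "touchscreen": "touchscreen", "touch screen": "touchscreen", "multi-touch": "touchscreen",
}

# Ambiguous term -> candidate senses, each with hypothetical context cue words.
SENSES = {
    "apple": {
        "fruit": {"fruit", "orchard", "pie"},
        "singer / songwriter": {"fiona", "album", "song"},
        "electronics company": {"iphone", "mac", "electronics"},
    },
    "ford": {
        "Ford Motor": {"car", "truck", "motor"},
        "Forward Industrials": {"ticker", "shares", "industrials"},
        "shallow river crossing": {"river", "stream", "crossing"},
    },
}


def normalize(term: str) -> str:
    """Map a query term to its preferred label so variants retrieve the same content."""
    key = " ".join(term.lower().split())  # lower-case and collapse whitespace
    return PREFERRED.get(key, key)


def disambiguate(term: str, context: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the surrounding text.

    Returns None when the term is not listed as ambiguous or no cue matches,
    which is the point at which a system would ask "what do you mean?".
    """
    senses = SENSES.get(term.lower())
    if senses is None:
        return None
    words = set(re.findall(r"[a-z0-9]+", context.lower()))
    best_sense, best_overlap = None, 0
    for sense, cues in senses.items():
        overlap = len(words & cues)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(normalize("Swine  Influenza Virus"))
print(disambiguate("Ford", "We waded across the river at the old crossing"))
```

In a real deployment the two dictionaries would be replaced by a managed taxonomy or ontology; the hard-coded tables here only make the mechanics visible.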


Without Accurate Metadata

Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data is that “many bits of straw look like needles.”

- Trevor Hastie, Statistics Professor at Stanford University

What Classification Must Handle

Capability Included

Look for all the vocabulary associated with topic/entity

Determine aboutness / avoid passing mentions

Address term ambiguity

Handle stemming errors

Determine if topics are in the same context

Split documents into components

Generate scores (so most relevant content bubbles to top)

Show dynamic summaries to users
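
Several of the capabilities listed above (using all the vocabulary for a topic, separating aboutness from passing mentions, and generating scores) can be illustrated with a small rule-based scorer. The sketch below is an assumption-laden toy, not the Classification Server's actual algorithm: the topic names, phrase weights, the `MIN_SCORE` threshold and the `classify` function are all hypothetical.

```python
import re

# Hypothetical topic vocabularies: phrase -> weight of evidence for the topic.
TOPIC_VOCABULARY = {
    "Influenza": {"swine flu": 1.0, "h1n1": 1.0, "influenza": 1.0, "vaccine": 0.5},
    "Touch Devices": {"touchscreen": 1.0, "touch screen": 1.0, "multi-touch": 1.0},
}

# Assumed threshold: below this a topic counts as a passing mention only.
MIN_SCORE = 2.0


def classify(text, min_score=MIN_SCORE):
    """Return (topic, score) pairs, highest score first.

    Each occurrence of a vocabulary phrase adds its weight to the topic score;
    topics scoring under min_score are dropped as passing mentions.
    """
    lowered = text.lower()
    scores = {}
    for topic, vocabulary in TOPIC_VOCABULARY.items():
        score = 0.0
        for phrase, weight in vocabulary.items():
            # Whole-phrase matches only, so "flu" inside another word is ignored.
            hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            score += hits * weight
        if score >= min_score:
            scores[topic] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


document = (
    "H1N1, also known as swine flu, spread quickly. "
    "An influenza vaccine followed. Staff logged cases on a touchscreen."
)
print(classify(document))
```

Here the document is tagged with the influenza topic, while the single touchscreen mention falls under the threshold and is left out. A production classifier would also need the other items in the table, such as stemming, context checks and document splitting, which this sketch does not attempt.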

Enhancing Metadata

•  Accurately classify content into subject areas defined in a taxonomy/ontology

•  Entity extraction (Text Mining)

•  Sentiment Analysis

•  Fact Extraction

Physical Architecture Ontology Management Services

Ontology Manager Server

Ontology Manager Standalone Desktop

Classification Server

Search Enhancement Server

Google Classification Handler

Win 7, Vista 2Gb RAM 2GHz Dual CPU

Ontology Manager Desktop


Ontology Instance 1

Win 7, Vista, 2003, 2008 +R2 Linux 2Gb RAM 2GHz CPU

Port 8001

Ontology Instance 2

Port 8002

Optional RDBMS data store

Oracle MySQL

SQL Server 2005 + 2008 + 2008 R2

Rule and Template Editor


Classification Instance

Port 5058

Windows Server 2003, 2008 (32bit/64bit) + R2, Linux

CPU and RAM intensive. Scale to volume of content and number of publishing users

Classification Test Interface

Internet Explorer Firefox

Search Enhancement Instance

Windows Server 2003, 2008 (32bit/64bit) + R2, Linux

IIS/Apache HTTP Server

RAM and disk access intensive. Scale to expected peak search throughput

Google Search Appliance

Dispatcher Proxy

Microsoft FAST ESP Server Farm

Microsoft Office SharePoint Server 2007 / 2010 Server Farm

Document Library Components

Search Web Parts

Search Application Framework

Semaphore Document Processor

Search Application Framework

Semantic Enhancement Server

Content Classification Server

Integration Components

Windows Server 2003, 2008 (32bit/64bit) + R2

Scale for throughput of GSA Indexing Crawler

GSA Extensions FAST Extensions

SharePoint Extensions

SOLR

Search Application Framework

Semaphore Document Processor

Leveraging Metadata Schemes

Examples – Customer Service

Examples – Following Trends

Examples – Fact Extraction

How Else Does Semaphore Help

Perfectly formed filters organised by facet

Disambiguate queries

Supporting documents

Explore relationships

Graphical drill down

Happy, Successful Customers


Technology