big data meets metadata – analyzing large data sets

Smartlogic TM Lucene Revolution 2012 Jeremy Bentley, CEO

Upload: lucenerevolution

Post on 07-Jul-2015


DESCRIPTION

Presented by Jeremy Bentley | Smartlogic. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 As Big Data becomes more pervasive, increased metadata management becomes critical to understanding and mining that content. Metadata is what unlocks the value of information assets. When metadata is well managed, the information assets are more useful and valuable. Badly managed metadata can make information assets less useful and less valuable, creating increased costs and risks related to those assets. During this presentation, we'll discuss the different types of metadata, the role of search and analytics in Big Data, and the integration of Apache Solr with Content Intelligence to enable better metadata management of Big Data.

TRANSCRIPT

Page 1: Big Data Meets Metadata – Analyzing Large Data Sets

Smartlogic TM

Lucene Revolution 2012

Jeremy Bentley, CEO

Page 2: Big Data Meets Metadata – Analyzing Large Data Sets

1st degree of order

Filing management
• 80% of enterprise information is unstructured
• Doubling every 19 months and accelerating [Gartner]
• Increasing burden of compliance
• Enterprise 2.0 additions
• Big Data connotations

Page 3: Big Data Meets Metadata – Analyzing Large Data Sets

2nd degree of order

Index management
• File plans and metadata schema
• Manually applied classification
• Low level of consistency and quality

Page 4: Big Data Meets Metadata – Analyzing Large Data Sets

3rd degree of order

• Digital Asset Management
• Publishing Systems
• SharePoint
• eDiscovery
• Document Management
• Content Management
• Enterprise Search
• Records Management
• Portal Infrastructure
• Process Management & Workflow

Automation of 1st & 2nd Degrees

Page 5: Big Data Meets Metadata – Analyzing Large Data Sets

A 10 year Flatline

• 2001, IDC, "Quantifying Enterprise Search": Searchers are successful in finding what they seek 50% of the time or less
• 2011, MindMetre/Smartlogic: More than half (52%) cannot find the information they need using their enterprise search system

User search satisfaction: 50% in 2001, 48% in 2011

Page 6: Big Data Meets Metadata – Analyzing Large Data Sets

The explosion of information

Terabytes of data (source: the National Archives)
• 1993-2001: 4Tb
• 2001-2009: 80Tb
• ?

20 times increase in information volume

Page 7: Big Data Meets Metadata – Analyzing Large Data Sets

Volume + other disruptive factors

Copyright © 2011 Smartlogic Semaphore Limited

Velocity, Variety, Complexity
• Cross-organizational and cross-platform information needs
• Changing requirements for information over time

Page 8: Big Data Meets Metadata – Analyzing Large Data Sets

New 4th degree of order

• Digital Asset Management
• Publishing Systems
• SharePoint
• eDiscovery
• Document Management
• Content Management
• Enterprise Search
• Records Management
• Portal Infrastructure
• Process Management & Workflow

Content Intelligence

Page 9: Big Data Meets Metadata – Analyzing Large Data Sets

Content Intelligence

• Information Manufacturing
• Knowledge Recovery
• Content Analytics
• Data Loss Prevention
• Risk & Compliance
• Monetisation
• Metadata

Page 10: Big Data Meets Metadata – Analyzing Large Data Sets

Knowing what you have

Page 11: Big Data Meets Metadata – Analyzing Large Data Sets

Metadata

• Creation Date
• Modified Date
• Author
• Format (PDF, DOC, XLS)
• Subject
• Location
• Project
• Function (IT, HR, Finance)
• Expert
• Protective Marker
• Retention Expiry
• Publisher
• Site

Structural Process

Information

Page 12: Big Data Meets Metadata – Analyzing Large Data Sets

4th degree of order Content Intelligence

Content Intelligence Platform
• FAST
• SharePoint

Page 13: Big Data Meets Metadata – Analyzing Large Data Sets

What is Content Intelligence

           

Content Intelligence is the process of IDENTIFYING, CLASSIFYING, ANALYZING, SURFACING and EXTRACTING information based on its meaning and context to make timely and informed business decisions.

Page 14: Big Data Meets Metadata – Analyzing Large Data Sets

Content Intelligence Solutions

• MICROTARGETING & DISTRIBUTION
• GOVERNANCE, COMPLIANCE & RISK
• KNOWLEDGE ACQUISITION & REUSE
• WEB-BASED SELF SERVICE

Page 15: Big Data Meets Metadata – Analyzing Large Data Sets

Big Data + Content Intelligence

From Gartner, 2011

Page 16: Big Data Meets Metadata – Analyzing Large Data Sets

Semaphore – Three Core Capabilities

SEMAPHORE
• Ontology Manager
• Classification Server
• Semantic Enhancement Server

Capabilities
• Build, manage and deploy vocabularies/libraries
• Explore data to find insights
• Automate the metadata enrichment

Diagram labels: Users, Semantic Model, Content; Apply, Inform, Expose

Page 17: Big Data Meets Metadata – Analyzing Large Data Sets

Enterprise Classification

Important requirements for Velocity/Volume:
• Scalability for large volumes of content, users, metadata and systems
• Easy integration with processing systems: search, content, records and document management systems as well as file shares and content migration tools
• Support for all the organization's languages and data formats

Page 18: Big Data Meets Metadata – Analyzing Large Data Sets

From Many Different Sources

Page 19: Big Data Meets Metadata – Analyzing Large Data Sets

Metadata Generation

Creation Date

Modified Date

Author

Format (PDF,DOC,XLS)

Brand

Service

Geography

Products

Expert

Protective Marker

Retention Expiry

Publisher

Site

Structural Process

Information

Page 20: Big Data Meets Metadata – Analyzing Large Data Sets

Different Vocabulary and Ambiguity

You Say | I Say
Perpetrator | Burglar, Thief
Swine Flu | Swine Influenza Virus, H1N1
Touchscreen | Touch screen, Multi-touch

You Say | What do you mean?
Apple | A fruit? Fiona, a singer/songwriter? An electronics company?
Rights | Employment rights? Equal rights? Right of way?
Ford | Ford Motor; Forward Industrials (ticker=FORD); a shallow river crossing

• Missing results
• Too many results

© 2010
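The two problems on this slide can be sketched in a few lines of Python. This is only an illustration built from the slide's own examples, not Semaphore's implementation: the `SYNONYMS` and `SENSES` tables and the function names `expand_query` and `candidate_senses` are hypothetical.

```python
from typing import Dict, List

# Hypothetical mini-vocabulary taken from the slide: each preferred label
# lists the variants that should be treated as the same concept.
SYNONYMS: Dict[str, List[str]] = {
    "Perpetrator": ["burglar", "thief"],
    "Swine Flu": ["swine influenza virus", "h1n1"],
    "Touchscreen": ["touch screen", "multi-touch"],
}

# Hypothetical sense inventory: an ambiguous term maps to several meanings.
SENSES: Dict[str, List[str]] = {
    "apple": ["fruit", "Fiona Apple (singer/songwriter)", "electronics company"],
    "ford": ["Ford Motor", "Forward Industrials (ticker=FORD)", "shallow river crossing"],
}


def expand_query(term: str) -> List[str]:
    """Return the term plus every variant sharing its preferred label."""
    needle = term.strip().lower()
    for preferred, variants in SYNONYMS.items():
        group = [preferred.lower()] + variants
        if needle in group:
            return group
    # Unknown terms are passed through unchanged.
    return [needle]


def candidate_senses(term: str) -> List[str]:
    """Return the known senses of an ambiguous term, or an empty list."""
    return SENSES.get(term.strip().lower(), [])
```

Expanding a query with its variants addresses the "missing results" case, while listing candidate senses gives the user a way to narrow the "too many results" case.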

Page 21: Big Data Meets Metadata – Analyzing Large Data Sets

Without Accurate Metadata      

Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of "false discoveries." The trouble with seeking a meaningful needle in massive haystacks of data is that "many bits of straw look like needles."

- Trevor Hastie, Statistics Professor at Stanford University

Page 22: Big Data Meets Metadata – Analyzing Large Data Sets

What Classification Must Handle

Capability | Included
• Look for all the vocabulary associated with a topic/entity
• Determine aboutness / avoid passing mentions
• Address term ambiguity
• Handle stemming errors
• Determine if topics are in the same context
• Split documents into components
• Generate scores (so the most relevant content bubbles to the top)
• Show dynamic summaries to users
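Two of these capabilities, avoiding passing mentions and generating scores, can be illustrated with a small rule-based scorer. This is a minimal sketch under stated assumptions, not how the Classification Server works: the `RULES` table, its weights, the `classify` function and the default threshold are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical topic rules: each topic lists evidence terms with weights.
RULES: Dict[str, Dict[str, float]] = {
    "Swine Flu": {"swine flu": 1.0, "h1n1": 1.0, "swine influenza virus": 1.0},
    "Touchscreen": {"touchscreen": 1.0, "touch screen": 1.0, "multi-touch": 0.8},
}


def classify(text: str, threshold: float = 2.0) -> Dict[str, float]:
    """Score each topic by weighted term hits; keep topics at or above threshold."""
    lowered = text.lower()
    scores: Dict[str, float] = {}
    for topic, terms in RULES.items():
        score = 0.0
        for term, weight in terms.items():
            # Word-boundary match so a term is not found inside a longer token.
            hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            score += weight * hits
        # A single passing mention stays below the threshold and is dropped.
        if score >= threshold:
            scores[topic] = score
    # Highest score first, so the most relevant topic comes out on top.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

With the default threshold, a document that mentions a touchscreen once in passing is not tagged with that topic, whereas repeated evidence for a topic accumulates into a score that can be used for ranking.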

Page 23: Big Data Meets Metadata – Analyzing Large Data Sets

Enhancing Metadata
• Accurately classify content into subject areas defined in a taxonomy/ontology
• Entity extraction (Text Mining)
• Sentiment Analysis
• Fact Extraction

Page 24: Big Data Meets Metadata – Analyzing Large Data Sets

Physical Architecture

Ontology Management Services
Ontology Manager Server
Ontology Manager Standalone Desktop
Classification Server
Search Enhancement Server
Google Classification Handler
Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
Ontology Manager Desktop
Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
Ontology Manager Desktop
Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
Ontology Instance 1
Win 7, Vista, 2003, 2008 + R2, Linux; 2Gb RAM; 2GHz CPU
Port 8001
Ontology Instance 2
Port 8002
Optional RDBMS data store
Oracle, MySQL
SQL Server 2005 + 2008 + 2008 R2
Rule and Template Editor
Win 7, Vista; 2Gb RAM; 2GHz Dual CPU
Classification Instance
Port 5058
Windows Server 2003, 2008 (32bit/64bit) + R2, Linux. CPU and RAM intensive. Scale to volume of content and number of publishing users
Classification Test Interface
Internet Explorer, Firefox
Search Enhancement Instance
Windows Server 2003, 2008 (32bit/64bit) + R2, Linux; IIS/Apache HTTP Server. RAM and disk access intensive. Scale to expected peak search throughput
Google Search Appliance
Dispatcher Proxy
Microsoft FAST ESP Server Farm
Microsoft Office SharePoint Server 2007 / 2010 Server Farm
Document Library Components
Search Web Parts
Search Application Framework
Semaphore Document Processor
Search Application Framework
Semantic Enhancement Server
Content Classification Server
Integration Components
Windows Server 2003, 2008 (32bit/64bit) + R2. Scale for throughput of GSA Indexing Crawler
GSA Extensions
FAST Extensions
SharePoint Extensions
SOLR
Search Application Framework
Semaphore Document Processor
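The slide does not show what passes between the document processor and Solr. As a rough illustration only, classification output could be attached to a document as extra fields before indexing. The sketch below builds a JSON list of documents, which is the shape Solr's JSON update handler accepts; the function `build_solr_update` and the field names `text`, `topic_ss` and `topic_score_fs` are assumptions, not part of Semaphore or of any particular Solr schema.

```python
import json
from typing import Dict


def build_solr_update(doc_id: str, text: str, topics: Dict[str, float]) -> str:
    """Serialize one enriched document as a JSON list of Solr documents."""
    labels = sorted(topics)  # stable order keeps labels and scores aligned
    doc = {
        "id": doc_id,
        "text": text,                                   # hypothetical full-text field
        "topic_ss": labels,                             # hypothetical multi-valued facet field
        "topic_score_fs": [topics[t] for t in labels],  # hypothetical score field
    }
    return json.dumps([doc])
```

The resulting string would be sent as the body of a POST to the core's update endpoint with a JSON content type. Keeping topics in a multi-valued field is what would let the search front end offer them as facet filters.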

Page 25: Big Data Meets Metadata – Analyzing Large Data Sets

Leveraging Metadata Schemes

Page 26: Big Data Meets Metadata – Analyzing Large Data Sets

Examples – Customer Service

Page 27: Big Data Meets Metadata – Analyzing Large Data Sets

Examples – Following Trends

Page 28: Big Data Meets Metadata – Analyzing Large Data Sets

Examples – Fact Extraction

Page 29: Big Data Meets Metadata – Analyzing Large Data Sets

How Else Does Semaphore Help    

Perfectly formed filters organised by facet

Disambiguate queries

Supporting documents

Explore relationships

Graphical drill down

Page 30: Big Data Meets Metadata – Analyzing Large Data Sets

Happy, Successful Customers