big data meets metadata – analyzing large data sets
DESCRIPTION
Presented by Jeremy Bentley | Smartlogic. See conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
As Big Data becomes more pervasive, the need for increased metadata management becomes critical to the understanding and mining of that content. Metadata is what unlocks the value of information assets. When metadata is well managed, the information assets are more useful and valuable. Badly managed metadata can make information assets less useful and less valuable, creating increased costs and risks related to those assets. During this presentation, we'll discuss the different types of metadata, the role of search and analytics in Big Data and the integration of Apache Solr with Content Intelligence to enable better metadata management of Big Data.
TRANSCRIPT
Smartlogic TM
Lucene Revolution 2012
Jeremy Bentley, CEO
1st degree of order
Filing management
• 80% of enterprise information is unstructured
• Doubling every 19 months and accelerating [Gartner]
• Increasing burden of compliance
• Enterprise 2.0 additions
• Big Data connotations
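To make the "doubling every 19 months" figure concrete, here is a minimal sketch of the implied growth factor. It assumes a constant doubling period; the slide says growth is accelerating, so this is a lower bound. The function name is illustrative, not from the talk.

```python
def growth_factor(months: float, doubling_months: float = 19.0) -> float:
    """Return how many times larger the volume is after `months`,
    assuming it doubles every `doubling_months` (constant rate)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Print the implied growth over a few planning horizons.
    for years in (1, 5, 10):
        print(f"{years:>2} years: x{growth_factor(12 * years):.1f}")
```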
2nd degree of order
Index management
• File plans and metadata schema
• Manually applied classification
• Low level of consistency and quality
3rd degree of order
Digital Asset Management
Publishing Systems
SharePoint
eDiscovery
Document Management
Content Management
Enterprise Search
Records Management
Portal Infrastructure
Process Management & Workflow
Automation of 1st & 2nd Degrees
A 10 year Flatline
• 2001, IDC, “Quantifying Enterprise Search”: Searchers are successful in finding what they seek 50% of the time or less
• 2011, MindMetre/Smartlogic: More than half (52%) cannot find the information they need using their Enterprise search system
User Search Satisfaction: 50% (2001), 48% (2011)
The explosion of information
Terabytes of data (Source: the National Archives)
1993-2001: 4Tb
2001-2009: 80Tb
20 times increase in information volume
Volume + other disruptive factors
Copyright © 2011 Smartlogic Semaphore Limited
Velocity Variety Complexity
Cross-organizational and cross-platform information needs
Changing requirements for information over time
New 4th degree of order
Digital Asset Management
Publishing Systems
SharePoint
eDiscovery
Document Management
Content Management
Enterprise Search
Records Management
Portal Infrastructure
Process Management & Workflow
Content Intelligence
Content Intelligence
Information Manufacturing
Knowledge Recovery
Content Analytics
Data Loss Prevention Risk & Compliance
Monetisation
Metadata
Knowing what you have
Metadata
Creation Date
Modified Date
Author
Format (PDF, DOC, XLS)
Subject
Location
Project
Function (IT, HR, Finance)
Expert
Protective Marker
Retention Expiry
Publisher
Site
Structural Process
Information
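As an illustration of "knowing what you have", the metadata fields listed on this slide can be sketched as a simple record type. The field names follow the slide; the types, the optional/required split, and the `missing_fields` helper are assumptions made for this example, not part of any Smartlogic product.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List, Optional


@dataclass
class DocumentMetadata:
    # Field names follow the slide; types are assumptions for illustration.
    creation_date: date
    modified_date: date
    author: str
    format: str  # e.g. "PDF", "DOC", "XLS"
    subject: List[str] = field(default_factory=list)
    location: Optional[str] = None
    project: Optional[str] = None
    function: Optional[str] = None  # e.g. "IT", "HR", "Finance"
    expert: Optional[str] = None
    protective_marker: Optional[str] = None
    retention_expiry: Optional[date] = None
    publisher: Optional[str] = None
    site: Optional[str] = None


def missing_fields(meta: DocumentMetadata) -> List[str]:
    """List the fields that have not been filled in yet."""
    return [name for name, value in asdict(meta).items()
            if value in (None, [], "")]
```

A record that only carries system-generated fields (dates, author, format) makes the gap visible: everything `missing_fields` returns is metadata that would have to be applied manually or by automated classification.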
4th degree of order Content Intelligence
Content Intelligence Platform
FAST
SharePoint
What is Content Intelligence
Content Intelligence is the process of IDENTIFYING, CLASSIFYING, ANALYZING, SURFACING and EXTRACTING information based on its meaning and context to make timely and informed business decisions.
Content Intelligence Solutions
MICROTARGETING & DISTRIBUTION
GOVERNANCE, COMPLIANCE & RISK
KNOWLEDGE ACQUISITION & REUSE
WEB-BASED SELF SERVICE
Big Data + Content Intelligence
From Gartner, 2011
Semaphore – Three Core Capabilities
Users
Semantic Model
Apply
Inform
Expose
Content
Build, Manage and Deploy Vocabularies/Libraries
Explore data to find insights
Automate the Metadata Enrichment
Ontology Manager
Classification Server
Semantic Enhancement Server
SEMAPHORE
Enterprise Classification
Important requirements for Velocity/Volume:
• Scalability for large volumes of content, users, metadata and systems
• Easy integration with processing systems: search, content, records and document management systems, as well as file shares and content migration tools
• Support for all the organization's languages and data formats
From Many Different Sources
Metadata Generation
Creation Date
Modified Date
Author
Format (PDF,DOC,XLS)
Brand
Service
Geography
Products
Expert
Protective Marker
Retention Expiry
Publisher
Site
Structural Process
Information
Different Vocabulary and Ambiguity
You Say | I Say
Perpetrator | Burglar, Thief
Swine Flu | Swine Influenza Virus, H1N1
Touchscreen | Touch screen, Multi-touch
You Say | What do you mean?
Apple | A fruit? Fiona - a singer/songwriter? An electronics company?
Rights | Employment rights? Equal rights? Right of way?
Ford | Ford Motor? Forward Industrials (ticker=FORD)? A shallow river crossing?
Missing results
Too many results
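The two problems on this slide can be sketched in a few lines: synonyms that are not linked cause missing results, and ambiguous terms cause too many. The term lists below are taken from the slide; the data structures and function names are hypothetical and only illustrate the idea of a vocabulary-driven lookup.

```python
from typing import Dict, List, Set

# Synonym rings from the slide: any term in a ring should match the others.
SYNONYM_RINGS: List[Set[str]] = [
    {"perpetrator", "burglar", "thief"},
    {"swine flu", "swine influenza virus", "h1n1"},
    {"touchscreen", "touch screen", "multi-touch"},
]

# Ambiguous terms from the slide, each mapped to its candidate meanings.
SENSES: Dict[str, List[str]] = {
    "apple": ["a fruit", "Fiona, a singer/songwriter", "an electronics company"],
    "rights": ["employment rights", "equal rights", "right of way"],
    "ford": ["Ford Motor", "Forward Industrials (ticker=FORD)",
             "a shallow river crossing"],
}


def expand_query(term: str) -> Set[str]:
    """Return the term plus its synonyms, to reduce missing results."""
    key = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if key in ring:
            return set(ring)
    return {key}


def candidate_senses(term: str) -> List[str]:
    """Return possible meanings to offer the user, to reduce too many results."""
    return SENSES.get(term.strip().lower(), [])
```

For example, `expand_query("H1N1")` returns all three flu terms, while `candidate_senses("Ford")` returns the three meanings a search interface could ask the user to choose between.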
Without Accurate Metadata
Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data is that “many bits of straw look like needles.”
- Trevor Hastie, Statistics Professor at Stanford University
What Classification Must Handle
Capability Included
Look for all the vocabulary associated with a topic/entity
Determine aboutness / avoid passing mentions
Address term ambiguity
Handle stemming errors
Determine if topics are in the same context
Split documents into components
Generate scores (so most relevant content bubbles to top)
Show dynamic summaries to users
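A toy sketch of three of the capabilities above: matching all the vocabulary for a topic, dropping passing mentions with a score threshold, and ranking by score. This is not how Semaphore's Classification Server works; the topic vocabularies, weights, and threshold are assumptions chosen for illustration, and the sketch does not handle stemming, ambiguity, or document splitting.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical topic vocabularies; a real model would come from a taxonomy/ontology.
TOPICS: Dict[str, List[str]] = {
    "Influenza": ["swine flu", "swine influenza virus", "h1n1"],
    "Burglary": ["perpetrator", "burglar", "thief"],
}


def count_hits(text: str, terms: List[str]) -> int:
    """Count whole-word, case-insensitive occurrences of any term in text."""
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def classify(title: str, body: str, min_score: float = 3.0,
             title_weight: float = 3.0) -> List[Tuple[str, float]]:
    """Score each topic; title hits count more than body hits.
    Topics below min_score are treated as passing mentions and dropped."""
    scored = []
    for topic, terms in TOPICS.items():
        score = title_weight * count_hits(title, terms) + count_hits(body, terms)
        if score >= min_score:
            scored.append((topic, score))
    # Highest score first, so the most relevant topic bubbles to the top.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a title of "H1N1 outbreak update" and a body that mentions swine flu once and a thief once, only the Influenza topic passes the threshold; the single "thief" mention is treated as a passing mention.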
Enhancing Metadata
• Accurately classify content into subject areas defined in a taxonomy/ontology
• Entity extraction (Text Mining)
• Sentiment Analysis
• Fact Extraction
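Entity extraction, in its simplest form, can be sketched as a lookup against a list of known names (a gazetteer). The entries and entity types below are hypothetical, and real text mining tools use far richer models; sentiment analysis and fact extraction are not covered by this sketch.

```python
import re
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int
    end: int


# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER: Dict[str, str] = {
    "Ford Motor": "Organization",
    "Stanford University": "Organization",
    "H1N1": "Disease",
}


def extract_entities(text: str) -> List[Entity]:
    """Find gazetteer entries in text and return them in reading order."""
    found = []
    for surface, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text):
            found.append(Entity(match.group(0), label, match.start(), match.end()))
    return sorted(found, key=lambda entity: entity.start)
```

Each extracted entity carries its character offsets, so it can be stored as metadata on the document or highlighted in a result summary.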
Physical Architecture
Ontology Management Services
Ontology Manager Server
Ontology Manager Standalone Desktop
Classification Server
Search Enhancement Server
Google Classification Handler
Win 7, Vista 2Gb RAM 2GHz Dual CPU
Ontology Manager Desktop
Win 7, Vista 2Gb RAM 2GHz Dual CPU
Ontology Manager Desktop
Win 7, Vista 2Gb RAM 2GHz Dual CPU
Ontology Instance 1
Win 7, Vista, 2003, 2008 + R2 Linux 2Gb RAM 2GHz CPU
Port 8001
Ontology Instance 2
Port 8002
Optional RDBMS data store
Oracle MySQL
SQL Server 2005 + 2008 + 2008 R2
Rule and Template Editor
Win 7, Vista 2Gb RAM 2GHz Dual CPU
Classification Instance
Port 5058
Windows Server 2003, 2008 (32bit/64bit) + R2 Linux
CPU and RAM intensive. Scale to volume of content and number of publishing users
Classification Test Interface
Internet Explorer Firefox
Search Enhancement Instance
Windows Server 2003, 2008 (32bit/64bit) + R2 Linux IIS/Apache HTTP Server
RAM and disk access intensive. Scale to expected peak search throughput
Google Search Appliance
Dispatcher Proxy
Microsoft FAST ESP Server Farm
Microsoft Office SharePoint Server 2007 / 2010 Server Farm
Document Library Components
Search Web Parts
Search Application Framework
Semaphore Document Processor
Search Application Framework
Semantic Enhancement Server
Content Classification Server
Integration Components
Windows Server 2003, 2008 (32bit/64bit) + R2
Scale for throughput of GSA Indexing Crawler
GSA Extensions
FAST Extensions
SharePoint Extensions
SOLR
Search Application Framework
Semaphore Document Processor
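The slides do not show how the Semaphore Document Processor hands metadata to Solr, so the following is only a sketch of the general pattern: enrich each document with classifier-assigned topics before indexing, then facet and filter on that field at query time. The Solr URL, core name, and field names (`text`, `topic`) are assumptions; `q`, `fq`, `wt`, `facet` and `facet.field` are standard Solr query parameters.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr location and field names; adjust to the actual schema.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"


def to_solr_doc(doc_id, text, topics):
    """Combine the original text with classifier-assigned topics.
    `topics` is a list of (label, score) pairs from a classification step."""
    return {
        "id": doc_id,
        "text": text,
        "topic": [label for label, _ in topics],  # multi-valued field
    }


def facet_query_url(query, topic_filter=None):
    """Build a select URL that facets on the topic field."""
    params = [("q", query), ("wt", "json"),
              ("facet", "true"), ("facet.field", "topic")]
    if topic_filter:
        # Filter query restricting results to one topic value.
        params.append(("fq", f'topic:"{topic_filter}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    doc = to_solr_doc("doc-1", "H1N1 outbreak update", [("Influenza", 4.0)])
    print(json.dumps([doc]))  # JSON body that could be posted to an update handler
    print(facet_query_url("outbreak", "Influenza"))
```

Keeping the topic in its own field, rather than appending it to the body text, is what lets the search front end offer filters organised by facet.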
Leveraging Metadata Schemes
Examples – Customer Service
Examples – Following Trends
Examples – Fact Extraction
How Else Does Semaphore Help
Perfectly formed filters organised by facet
Disambiguate queries
Supporting documents
Explore relationships
Graphical drill down
Happy, Successful Customers