gilbane 2009 -- how can content management software keep pace?
DESCRIPTION
The amount of data stored is growing at a phenomenal rate. This paper documents the growth and suggests that a new standard, CMIS, may be useful in getting better control over data and data repositories.
TRANSCRIPT
How Can Content Management Software Keep Pace?
San Francisco Gilbane Conference 2009
Content Integration Strategies
Dick Weisinger
June 4, 2009
Dick Weisinger
Vice President and Chief Technologist, Formtek, Inc.
20+ years of experience in Content, Document and Image Management
Regular blogger at http://www.formtek.com/blog
Formtek
An ECM software and services company
– 25-year history
Experts in general ECM and CM space
Depth of experience in engineering data management
Formtek Orion ECM Software
Alfresco Gold Integration Partner
Drowning in Digital Data
Hand-held devices
High-resolution video
E-Discovery / Records Management
High-End Video Games
High-Resolution Graphics and Images
Digitized Business Data
Financial and Health Records
Scientific Data
Business Continuity Backups
Analysts at Gartner Group, Forrester Research, IDC and The 451 Group all predict massive growth in digital data.
Size of the Digital Universe
2003 – 20 exabytes
2006 – 161 exabytes
2007 – 281 exabytes
2008 – 486 exabytes
2010 – 988 exabytes
2011 – 1800 exabytes
2012 – 2500 exabytes
(30% of data is created by enterprises)
Source: IDC
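The IDC figures above imply a compound annual growth rate, which can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `annual_growth_rate` is made up for illustration, and the inputs are the slide's own numbers.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two measurements."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Exabytes per year, copied from the slide (source: IDC).
digital_universe = {2003: 20, 2006: 161, 2007: 281, 2008: 486,
                    2010: 988, 2011: 1800, 2012: 2500}

first, last = min(digital_universe), max(digital_universe)
rate = annual_growth_rate(digital_universe[first], digital_universe[last], last - first)
print(f"{first}-{last}: about {rate:.0%} growth per year")
```

With the slide's numbers this works out to roughly 71% growth per year between 2003 and 2012.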
One Exabyte == 1 billion gigabytes or 1000 petabytes (about 250 million DVDs)
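The unit arithmetic can be sketched as follows. The DVD capacity is an assumption, not something stated on the slide: a single-layer disc holds about 4.7 GB, while the "250 million" figure corresponds to rounding capacity down to 4 GB per disc.

```python
GB_PER_EXABYTE = 10**9     # decimal (SI) units, as on the slide
PB_PER_EXABYTE = 1000
DVD_CAPACITY_GB = 4.7      # assumption: single-layer DVD


def exabytes_to_dvds(exabytes, dvd_gb=DVD_CAPACITY_GB):
    """Number of DVDs needed to hold the given number of exabytes."""
    return exabytes * GB_PER_EXABYTE / dvd_gb


print(f"{exabytes_to_dvds(1) / 1e6:.0f} million DVDs at 4.7 GB each")
print(f"{exabytes_to_dvds(1, dvd_gb=4.0) / 1e6:.0f} million DVDs at 4 GB each")
```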
161 exabytes is the equivalent of 12 stacks of books each extending 93 million miles from the earth to the Sun.
Data in Business and Science
Walmart adds a billion rows of data to its 600 terabyte database every hour
Chevron’s gas and oil exploration collects 2 terabytes of data daily
Large Hadron Collider in Switzerland to collect 300 exabytes per year
Department of Energy has increased their data by a factor of 10 every four years since 1990
Hardware’s Shrinking Cost
Year    Cost/MB
1986    $51.30
1991    $13.00
1994    $1.00
1997    $0.09
2000    $0.07
2003    $0.02
2009    $0.0002
Storage costs are plummeting, but not as fast as the amount of data is growing.
Cheap storage costs also encourage applications to store ever more data.
Can Software Keep Pace?
How Can We Find Anything?
Search Algorithms have evolved and improved, but…
Internet Search is only Fair to Good
– Google Page-Rank
    8+ billion web pages, hundreds of thousands of servers
Enterprise Search is Poor
– Usage patterns are hard to model
The Problem of Search
49 percent of business users say that finding data is difficult and time consuming.
-- AIIM 2008 Market Study
Users have a 50 percent success rate at search
-- Recommind Survey, March 2009
Scattered Data Repositories
Corporate Applications
– ERP
– PLM/PDM
– Business Intelligence / Knowledge Management
– Content and Document Management
Relational Databases
Local and Shared File Systems
Internet/Intranet HTTP servers
Email Servers
Disk Appliances (digital cameras, cell phone…)
Multiple Repository Challenge
Problem
How to access and search data to achieve:
– Compliance
– eDiscovery
– Business Intelligence
Challenge
Many organizations have multiple repositories from multiple vendors
Lack of standards around API and query language
Each system is different and has very little common reuse
Unstructured Data Search is Hard
80 percent of enterprise data is unstructured
– E.g., emails, PDF, Word and Office docs
No underlying data model or schema
– emails and IM often lack context and use shorthand and abbreviations that increase the search challenge
Huge Data Sets Bring Huge Problems
Search gets harder as data sets grow
– Longer to index and search
– Harder to determine context
The more systems, the harder to secure
The more systems, the harder to consolidate search
Conflicting or Inconsistent Data
– Which is the system of reference?
Getting Data Under Control
Ultimate goal: Content Intelligence
– Knowledge extraction
– Ability to distill, condense and summarize data
How?
Apply more Structure and Reuse
– XML Tags
Allow greater access across data sources
– Consolidation of Systems
– Integration of Systems
Creating Structure
Semi-Structured Data
Use a structured native data format
– XML Authoring/Publishing applications
    DITA publishing XML
– Microsoft Office 2007 docx, etc. (Office Open XML)
    Complex: 29 namespaces and 89 schema models
Add Structure
– Append Headers and Embedded Properties
    E.g., TIFF, JPEG images
    PDF and embedded Microsoft Office files
Associate tags and metadata with unstructured data
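As a rough illustration of associating tags and metadata with unstructured content, the sketch below wraps a file reference in a small XML record using Python's standard library. The element names (`item`, `property`, `tag`), the file path and the sample values are all invented for this example and do not come from any standard mentioned in the talk.

```python
import xml.etree.ElementTree as ET


def describe(path, properties, tags):
    """Build an XML metadata record for one unstructured file."""
    item = ET.Element("item", {"path": path})
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(item, "property", {"name": name})
        prop.text = str(value)
    for tag in tags:
        ET.SubElement(item, "tag").text = tag
    return item


record = describe(
    "scans/invoice-0042.tiff",  # hypothetical file
    {"author": "accounts", "created": "2009-06-04"},
    ["invoice", "finance"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Once the metadata is structured, it can be queried like any other XML.
tags = [t.text for t in record.findall("tag")]
author = record.find("property[@name='author']").text
```

Keeping the record separate from the file means the same approach works for formats that cannot carry embedded properties.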
Centralized Repository Efficiency
Management efficiencies of scale
More efficient search
– No need to consolidate search results
Available to users via a single interface
Integration of Repositories
Content-Intelligence Platforms can integrate/unite multiple repositories
XML is the pipeline for integration
Integration via APIs or XML Web services
– REST Web Services have momentum
– Integration with SOA
CMIS -- ECM Integration
ECM vendors have united to create a new interoperability standard: Content Management Interoperability Services (CMIS)
– Web services for sharing information between different content repositories
– “SQL for Document Management”
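The "SQL for Document Management" idea can be illustrated with a query string in the style of the CMIS query language. This is a hedged sketch: the `cmis:` property names and the `CONTAINS()` and `IN_FOLDER()` predicates follow the CMIS 1.0 specification as later published by OASIS, and may differ from the draft that was current at the time of this talk. The helper `build_query` is hypothetical, performs no escaping, and only assembles text; it does not talk to any repository.

```python
def build_query(columns, type_name="cmis:document", full_text=None, folder_id=None):
    """Assemble a CMIS-style SELECT statement as plain text (illustration only)."""
    clauses = []
    if full_text is not None:
        clauses.append(f"CONTAINS('{full_text}')")
    if folder_id is not None:
        clauses.append(f"IN_FOLDER('{folder_id}')")
    query = f"SELECT {', '.join(columns)} FROM {type_name}"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query


query = build_query(["cmis:name", "cmis:objectId"], full_text="retention")
print(query)
```

The point of the analogy is that the same statement could, in principle, be sent unchanged to repositories from different vendors.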
What is CMIS?
Content Management Interoperability Services
– Defines a lowest-common-denominator CM capability set
– CM content is accessed as SOAP or AtomPub (REST) web services
– A single application works identically with content from any CMIS vendor
CMIS Timeline
1993 – ODMA (Open Document Management API)
1996 – DMA (AIIM Document Management Alliance)
1996 – WebDAV (Web-based Distributed Authoring and Versioning)
2002 – JSR-170 / Java Content Repository (Day Software)
2005 – iECM (AIIM Interoperable ECM)
October 2006 – CMIS started
August 2008 – Contributing members invited
September 2008 – Draft Specification submitted to OASIS
Possible completion and acceptance in late 2009 or early 2010
JCR versus CMIS
JCR                                   CMIS
Session-based API                     Services Based
Java Only                             Language Agnostic
“Complete” ECM Infrastructure         Core ECM functions Interoperability
Targets DM, RM, DAM, WCM…             Intended specifically for DM
Complex                               Simple
Prescriptive                          Little or No Change
Connectors by Day                     Vendor Connectors
Version 2.0                           Version .61
Design spearheaded by Day Software    Design Led by Top Tier ECM Vendors
CMIS: Creators and Participants
Founding Companies for the Original Standard
– EMC/Documentum
– IBM/Filenet
– Microsoft
Contributing Members (after August 7, 2008)
– Alfresco
– Open Text
– Oracle
– SAP
– More …
CMIS – The Model
Documents
– E.g., Office document or image
– Content, Metadata and Version History
Folders
– Defines Organization and Hierarchy
– Container, Metadata and Hierarchy/Organization
Object Links and Relations
– Reference between two folders or documents
– Requires a source and target
Policies
– Set of rules that can be applied to control other objects, e.g. ACLs or retention policy
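The four object kinds in the model can be sketched as simple Python data classes. This is an illustrative reading of the slide, not the schema defined by the CMIS specification: every class and field name below is an assumption chosen to mirror the bullet points.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    object_id: str
    content: bytes = b""
    metadata: Dict[str, str] = field(default_factory=dict)
    version_history: List[str] = field(default_factory=list)


@dataclass
class Folder:
    object_id: str
    parent_id: Optional[str] = None          # hierarchy: None marks the root
    metadata: Dict[str, str] = field(default_factory=dict)
    child_ids: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    source_id: str                           # a relationship requires both ends
    target_id: str


@dataclass
class Policy:
    name: str
    rule: str                                # e.g. an ACL or a retention rule
    applies_to: List[str] = field(default_factory=list)


root = Folder("folder-root")
report = Document("doc-1", b"...", {"title": "Q2 report"}, ["1.0"])
root.child_ids.append(report.object_id)
link = Relationship(source_id=report.object_id, target_id=root.object_id)
retention = Policy("keep-7-years", "retain 7 years", [report.object_id])
```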
Benefits of CMIS
Standardized Core ECM functions
Enables Interoperability between repositories
Encourages Flexible Application Development
Encourages ‘mash-up’ composite applications
A single application can consolidate and aggregate content from multiple CMIS repositories
Business Processes/Workflow can span and touch all enterprise content
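The benefit of a shared interface can be sketched with a small aggregation example. It is a toy model under stated assumptions: `Repository`, `InMemoryRepository` and `search_all` are hypothetical names, and the in-memory class merely stands in for a vendor repository that would really be reached through CMIS web services.

```python
from typing import Dict, List, Protocol, Tuple


class Repository(Protocol):
    """The common contract every repository is assumed to expose."""
    name: str

    def search(self, term: str) -> List[str]: ...


class InMemoryRepository:
    """Stand-in for one vendor's repository (title -> text)."""

    def __init__(self, name: str, documents: Dict[str, str]):
        self.name = name
        self._documents = documents

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return sorted(title for title, text in self._documents.items()
                      if term in text.lower())


def search_all(repositories: List[Repository], term: str) -> List[Tuple[str, str]]:
    """Run one search against every repository and merge the hits."""
    hits = []
    for repo in repositories:
        hits.extend((repo.name, title) for title in repo.search(term))
    return hits


repos = [
    InMemoryRepository("vendor-a", {"Retention policy": "Records are kept seven years",
                                    "Lunch menu": "Soup"}),
    InMemoryRepository("vendor-b", {"Audit notes": "Check the records archive"}),
]
results = search_all(repos, "records")
print(results)
```

Because the calling code depends only on the shared contract, adding a third repository requires no change to `search_all`.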
CMIS Weak Points
Only Basic Content Functions Available
Does not cover Admin/Management
Does not cover User Authentication
Does not handle Security/Authorization
Applications
Workflow/Business Processes
– Connect work packages from any repository
Portals and Mash-ups
– Aggregated Content from multiple sources
E-Discovery and Compliance
Summary
Massive Growth in Content Creation
Advances in hardware technology are fueling content creation and storage
Search and Retrieval of content grows in complexity with its volume
Content Intelligence is needed to bring understanding to data
Standards like XML and CMIS provide consistent classification and handling of data