Zen, and the art of Datanauting
DESCRIPTION
Exploration of large and complex data estates to gain an accurate understanding of the data structures and data quality. Presentation given by Ontology Systems and BSkyB at SemTechBiz - The Semantic Technology & Business Conference on October 2nd 2013.
TRANSCRIPT
Zen, and the art of Datanauting
Carl Bray, Product Manager, Ontology Systems
Matt Clark, Design Authority, BSkyB
Datanauting
Boldly going where no data integrator has gone before…
15 years of transaction data
10 million+ customers
900 engineers making changes
30 TB of data
20+ Applications
Q) How do you start to understand this data estate?
The company
• UK subsidiary of a global media organisation
• Provides fixed line telephone, Internet and television entertainment services to UK residents
• 10 million+ customers, trading for 15 years
Business drivers:
• Driven by marketing innovation
• Extend and upsell to customer base
• React to competitive threats
• Technical infrastructure impacting commercial agility
Background and Business Drivers
The motivation behind the project
Objective
• Significantly reduce the time to capture new business strategies in IT systems
Significant change in IT delivery
• Embrace Agile delivery of new functionality
• Develop new payment and sales systems
• Access and extend existing data
• Multiple SCRUM teams using test-driven development
• Phased delivery
Short-term technical drivers
• Quickly understand the structure, nature and consistency of the existing data
Longer term technical drivers
• Introduce a service-based semantic agent to access software services
A new IT Strategy
Fundamentally changing the way IT functionality is delivered
Subject matter experts (SMEs)
• Understanding the data means interfacing with SMEs
• Multiple SCRUM teams need access to SMEs
• Knowledge is in Silos and not co-located with SCRUM teams
• SMEs may not know the answers
Bottleneck / Choke point
• SCRUM teams need quick answers to data / process questions
• SME bandwidth stifles SCRUM agility
• Introduces a single project bottleneck/choke point
Overwhelming the SMEs
• Free and unfettered access to the SMEs would create chaos
• Need to filter questions to the SMEs
Challenges
Many technical challenges stood in their way
[Diagram labels: CRM, Billing, Ref Data, Debt, Orders, Ticketing, Content, Product; SME (×8); SCRUM (×4)]
Many systems with complex interdependencies
• CRM
• Billing
• Reference Data
• Debt processing
• Order handling
• Trouble ticketing systems
• Subscriber card management systems
• Content access entitlements
• Product catalogue
Fragmentation
• Business entities fragmented
• “Customer” properties in many systems
The Scope and Scale of the Problem
Payments and sales system involving 20+ systems and legacy data
Data estate problems
• Data quality isn’t consistent
• Data fragmentation is high
• Understanding the data is complex
• How are business entities stored in different applications and data sources?
• What impact should processes have on the data – flags, statuses, etc.?
• When data is duplicated, which data sources should take preference?
• Scale of data
• 30+ TB of historic trading data
• 3 Vs - The Variety and Volume of data are very high
The Data
30 TB of transactional data over 15 years of system changes
Non-semantic alternatives
• Train more SMEs
• Work around SMEs’ other priorities
• Educational workshops
• Take time to document systems
Data-profiling alternatives
• Reverse engineering schemas
• ETL Tooling
• Didn’t want to create yet another data warehouse
Chose a datanauting approach
• Supports their commitment to Agile development
• Allows SCRUM teams to explore and ask questions of the data without overloading SMEs
Alternatives
Alternative approaches to solving the problem were considered
What we do, and why we’re different
• Ontology leverages graph and semantic search technologies to address enterprise data issues
• We address complex data integration problems
• Data Acquisition
• Data Correlation
• Data Migration
• We produce fully fledged operational applications that use semantic search in:
• Telecommunications
• Media
• Financial services
• The Ontology Difference
• Inherently agile – no schema
• Datanauting: data-first, structure later
• Just enough modelling
• Structured and unstructured data
The Ontology Approach
How we approached the problem
The Ontology Approach - Datanauting
Exploration of data sources…
Identify sources
Connect to sources
• Index source
Search for entities
• Refactor entities
• Create URI pattern matching
• Map entities to RDF
Search for linked entities
• Add references
Search for equivalent entities
• Create matching URIs
• Map entities to RDF
• DBs
• SPARQL Endpoints
• Structured files
• MS Excel, CSV, XML, RDF
• CISCO and other device configurations
• Proprietary formats
• Unstructured files
• MS Word, PDFs, etc.
The Ontology Approach - Datanauting
Identify sources
• Set up the connection
• Index sources
• Add search facets
• Tokenise compound values, e.g.:
• Service names are concatenated “Service-LON/01”
• Product names use “CamelCase”
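The tokenising step above can be illustrated with a minimal Python sketch. The slides do not describe how the Ontology product tokenises values, so the function name `tokenise` and the regular expressions are assumptions chosen only to show the idea: split on separators such as "-" and "/", then split CamelCase words so each part becomes searchable on its own.

```python
import re

def tokenise(value: str) -> list[str]:
    """Split a compound value into lower-case search tokens (illustrative only)."""
    # Break on any run of non-alphanumeric characters, e.g. "-" and "/".
    parts = re.split(r"[^A-Za-z0-9]+", value)
    tokens = []
    for part in parts:
        # Within each part, separate acronyms, capitalised words and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [token.lower() for token in tokens]

service_tokens = tokenise("Service-LON/01")
product_tokens = tokenise("CamelCase")
```

With tokens like these in the index, a search for "LON" can find the concatenated service name without the user knowing the full naming convention.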
The Ontology Approach - Datanauting
Connect to sources
• Search for business entities
• Refactor “denormalised” data
• Choose a URI pattern to represent instances
• Set a type for the entity
• Map properties to owl:DatatypeProperty
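As a rough illustration of the steps above, the following Python sketch turns one source row into N-Triples: it builds an instance URI from a pattern, sets the entity type, and declares each remaining column as an owl:DatatypeProperty. The namespace `http://example.org/`, the function names and the sample row are all hypothetical; the slides do not show the actual mapping tooling.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the source
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_DATATYPE_PROPERTY = "http://www.w3.org/2002/07/owl#DatatypeProperty"

def entity_uri(entity_type, key):
    """URI pattern: <base>/<type in lower case>/<percent-encoded key>."""
    return f"{BASE}{entity_type.lower()}/{quote(str(key), safe='')}"

def escape_literal(value):
    """Escape the characters that N-Triples string literals require."""
    text = str(value)
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, escaped)
    return text

def row_to_ntriples(row, entity_type, key_column):
    """Map one row to N-Triples; column names are assumed to be URI-safe."""
    subject = entity_uri(entity_type, row[key_column])
    triples = [f"<{subject}> <{RDF_TYPE}> <{BASE}ontology#{entity_type}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue
        prop = f"{BASE}ontology#{column}"
        # Declare the column as a datatype property, then state its value.
        triples.append(f"<{prop}> <{RDF_TYPE}> <{OWL_DATATYPE_PROPERTY}> .")
        triples.append(f'<{subject}> <{prop}> "{escape_literal(value)}" .')
    return triples

triples = row_to_ntriples({"customer_id": "C 42", "surname": "Smith"}, "Customer", "customer_id")
```

Keeping the URI pattern in one function matters later: the same pattern can be reused when equivalent entities are found in other sources.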
The Ontology Approach - Datanauting
Search for entities
• Search for entities that should be linked
• Add references (owl:ObjectProperty) between entities that are to be linked
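A reference between two entities can be sketched as one extra triple per row, using a property declared as an owl:ObjectProperty. In this Python sketch the namespace, the `placedBy` property and the order/customer rows are hypothetical examples, and keys are assumed to be URI-safe.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the source
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"

def link_entities(rows, source, source_key, target, foreign_key, prop):
    """Emit N-Triples linking each source row to the target it references."""
    prop_uri = f"{BASE}ontology#{prop}"
    # Declare the linking property once as an owl:ObjectProperty.
    triples = [f"<{prop_uri}> <{RDF_TYPE}> <{OWL_OBJECT_PROPERTY}> ."]
    for row in rows:
        if row.get(foreign_key) is None:
            continue  # nothing to link for this row
        subject = f"{BASE}{source}/{row[source_key]}"
        obj = f"{BASE}{target}/{row[foreign_key]}"
        triples.append(f"<{subject}> <{prop_uri}> <{obj}> .")
    return triples

orders = [
    {"order_id": "O1", "customer_id": "C42"},
    {"order_id": "O2", "customer_id": None},  # no customer recorded
]
links = link_entities(orders, "order", "order_id", "customer", "customer_id", "placedBy")
```

Rows with no value in the referencing column are skipped rather than linked to a placeholder, so gaps in the data stay visible.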
The Ontology Approach - Datanauting
Search for linked entities
• Search for semantically equivalent entities in other data sources
• Search based on property names
• Search based on strict value matching/weighting
• Search based on sub-string matching/weighting
• Reuse the URI pattern
• Create references
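The matching criteria above can be combined into a simple weighted score. This Python sketch is an assumption about how such a score might look, not the product's algorithm: property names are normalised so that "Post_Code" and "postcode" line up, a strict value match earns the full weight, and a sub-string match earns a smaller one. The weights and sample records are hypothetical.

```python
import re

def normalise_name(name: str) -> str:
    """Reduce a property name to lower-case letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def match_score(a: dict, b: dict, strict_weight=1.0, substring_weight=0.5) -> float:
    """Score two records from different sources on their shared properties."""
    left = {normalise_name(k): str(v).strip().lower() for k, v in a.items()}
    right = {normalise_name(k): str(v).strip().lower() for k, v in b.items()}
    score = 0.0
    for name in left.keys() & right.keys():
        x, y = left[name], right[name]
        if not x or not y:
            continue  # empty values carry no evidence
        if x == y:
            score += strict_weight       # strict value match
        elif x in y or y in x:
            score += substring_weight    # sub-string match
    return score

crm_record = {"Surname": "Smith", "Post_Code": "AB1 2CD"}
billing_record = {"surname": "smith", "postcode": "AB1 2CD, London"}
score = match_score(crm_record, billing_record)
```

Candidate pairs whose score clears a chosen threshold could then be given URIs from the same pattern and linked by references, as the last two bullets describe.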
The Ontology Approach - Datanauting
Search for equivalent entities
High-level solution to the problems the organisation faced
• Removes the SME bottleneck - a key enabler for the Agile / SCRUM approach
• Creates a searchable domain model, breaking the data into discrete “chunks”
• Ontology allows the SCRUM teams to understand the legacy data through ad-hoc queries
• Can understand how business concepts are mapped across multiple contradictory data repositories
• The quality and suitability of data can more easily be assessed
• Provides a definitive view of the commercial position for a given subscriber or set of subscribers
• Backlog and sprint priorities are based on a complete understanding of the complexity of the task
• Provides data to facilitate mock-ups and test harnesses
Project Results
Ontology provides SCRUM members with insight into the data
Project Results
SCRUM teams gain insight into data
[Diagram labels: CRM, Billing, Ref Data, Debt, Orders, Ticketing, Content, Product; SME (×8); SCRUM (×4)]
Project Results
Product Architecture
[Architecture diagram labels:
• Ontology 4 Modeller: Modeller
• Ontology 4 Runtime: Web UI, Ontology Intelligent 360, Ontology Integrity Manager, Query API, Semantic Graph Store, Semantic Processing Core, Universal Search Core, Authentication and Notification
• External Event Sources
• LDAP Server (optional), Mail Server (optional)
• Fully Modelled Data Sources: CSV, RDBMS, XML, JDBC, XLS
• Other Data Sources: DOC, PDF, XLS, MAIL, XML
• End Users (Browser Access) over HTTPS
• RTIA]
Variety
• Ability to access data in a variety of formats
• Avoid integration to live systems
• Possible to work from database dumps, which avoids politics
• Embracing change – inherently agile
Volume
• Ontology techniques for managing data scale
• Partial index of data
• Partial modelling
• Semantic search with SQL query to live systems
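One way to read the Volume bullets above is that only a few columns are indexed, and the full records are fetched from the source by SQL when a search hits. The Python sketch below illustrates that reading using an in-memory SQLite database as a stand-in for a live system; the table, columns and function names are hypothetical, and the slides do not describe the product's actual indexing.

```python
import sqlite3
from collections import defaultdict

def build_partial_index(conn, table, key_column, indexed_columns):
    """Index only the chosen columns; everything else stays in the source."""
    index = defaultdict(set)
    # Table and column names come from trusted configuration, not user input.
    columns = ", ".join([key_column, *indexed_columns])
    for key, *values in conn.execute(f"SELECT {columns} FROM {table}"):
        for value in values:
            for token in str(value).lower().split():
                index[token].add(key)
    return index

def search(conn, index, table, key_column, term):
    """Find keys via the index, then fetch the full rows from the source by SQL."""
    keys = sorted(index.get(term.lower(), ()))
    if not keys:
        return []
    placeholders = ", ".join("?" for _ in keys)
    query = f"SELECT * FROM {table} WHERE {key_column} IN ({placeholders}) ORDER BY {key_column}"
    return conn.execute(query, keys).fetchall()

conn = sqlite3.connect(":memory:")  # stand-in for a live system
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, surname TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Smith", "prefers email"), (2, "Jones", "smith and co referral")],
)
index = build_partial_index(conn, "customer", "id", ["surname"])
hits = search(conn, index, "customer", "id", "smith")
```

Because the `notes` column is not indexed, only the first customer is found, yet the full row, including `notes`, comes back from the source. The index stays small while the detail remains where it already lives.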
[Diagram labels: Velocity, Variety, Volume]
Project Results
Dealing with two large Vees
Why Ontology?
• Agile response through inherently agile technology
• Datanauting provides agile response to SCRUM teams
• SME time can now be used for valuable queries
Technical advantages
• No Schema, No Integration, No Big Bang, No Search Restrictions, No Upfront Risk
Benefits delivered
• Speed – Greatly accelerated the analysis phase of the project
• Risk – Project is not viable without an understanding of the data
Zen, and the art of Datanauting
Advantages of the Ontology approach to Data Integration
Learn More
To learn more about Ontology Systems,
or to access more detailed information
about our products and services, please
either:
Call +44 20 7239 4949
Visit ontology.com
Subject to change. All rights reserved. © 2013
No part of this document may be reproduced in any
form or by any means for any purpose without our
written permission. All other trademarks appearing
in this document are acknowledged as the trademarks
of their respective owners.
Ontology-Partners Limited trading as Ontology Systems
Ontology Systems
Phoenix Yard,
65 Kings Cross Road,
London WC1X 9LW,
UNITED KINGDOM
Registered in England No. 5794201.
Registered Office.
Dalton House,
60 Windsor Avenue,
London SW19 2RR
UNITED KINGDOM