bi masterclass slides (reference architecture v3)
DESCRIPTION
This presentation is covered in the Oracle BI Masterclass.
TRANSCRIPT
Information Management Reference Architecture
EMEA Enterprise Architecture
Contents Introduction
Conceptual view
Design Patterns
IM Logical view and component outline
Discovery Lab
R/T Event Engine logical view
Mapping to previous Reference Architecture release
Introduction
Introduction
This PPT documents the main architectural components of Oracle’s
Information Management Reference Architecture.
The architecture is intended to be practical and pragmatic, with many of the
ideas and experiences that inform the approach dating back almost 20 years
in Oracle.
These ideas and concepts have been continually refined through the
engagement of our Enterprise Architecture team on real world customer
engagements.
3rd Evolution of Oracle’s Information Management Reference Architecture
What is Information Management
“Information Management is the means by which an
organisation maximises the efficiency with which it plans,
collects, organises, uses, controls, stores, disseminates,
and disposes of its Information, and through which it
ensures that the value of that information is identified and
exploited to the maximum extent possible”
We define Information Management to mean…
Aligning analytical requirements and IM architecture
Enabling Analytics 3.0 with a pragmatic architecture
Analytics 2.0
Analytics 3.0
Analytics 1.0
• Reporting with limited use of
descriptive analytics
• Limited range of tabular data
• Batch oriented analysis
• Analysis bolted onto limited
set of business processes
• Firms “Competing on Analytics”
• Extended analytics to larger
and less structured datasets
• Emergence of Big Data into the
commercial world
• Recognition of Data Science
role in commercial orgs.
• Platform for monetisation
• Deeper analysis & more data
• Faster test-do-learn iterations
• Different types of data & wider
business process coverage
• Analysts focus on discovery and
driving business value
• “Agile” with operational elements
incorporated into design patterns
Adapted from Tom Davenport material
Conceptual View
Actionable
Events
Event Engine Data Reservoir
Data Factory Enterprise Information Store
Reporting
Discovery Lab
Actionable
Information
Actionable
Insights
Data
Streams
Execution
Innovation
Discovery Output
Events & Data
Conceptual View
Structured
Enterprise
Data
Other
Data
Component Outline
Event Engine Respond to R/T events in an appropriate and/or optimised fashion
Data Reservoir Raw data Reservoir – typically event data at lowest grain
Data Factory Managed ETL onto, within and between platforms
Enterprise Data Data stores for Information Management
Reporting BI tools and infrastructure components
Discovery Lab Platform, data and tools to support discovery process
Execution – things you do every day
Innovation – innovation to drive tomorrow's business
Line of Governance!
Discovery Output
– Possible outputs include new knowledge, mining models / parameters, scored data…
Design Patterns
Design Pattern: Discovery Lab
Specific focus on identifying commercial value for exploitation
Small group of highly skilled individuals (aka Data Scientists)
Iterative development approach – data oriented NOT development oriented
Wide range of tools and techniques applied
Data provisioned through
Data Factory or own ETL
Typically separate infrastructure
but could also be unified Reservoir
if resource managed effectively
Design Pattern : Information Platform
Build the next generation Information Management platform
Either Business Strategy driven or IT cost / capability driven initiative
Initial project may be specifically linked to lower data grain or retention
BUT it is the platform as a whole that forms the solution required
Platform for consolidating other IM assets onto
Key issues related to differences in
procurement, development process,
governance and skills differences
Discovery Lab may be implemented
as a pragmatic initial POV.
Design Pattern : Data Application
Big Data technologies applied to a specific business problem
e.g. Genome sequence analysis using BLAST or log data from
pharmaceutical production plant and machinery required for traceability
Limited or no integration to broader Information Management estate
Specific solution so Non-functional requirements have less impact
on solution quality or long term costs
Platform costs and scalability are
important considerations
Design Pattern: Information Solution
Specific solution based on Big Data technologies requiring broader
integration to the wider Information Management estate, e.g. an ETL pre-processor for the DW, or to affordably store a lower level of grain
Non-functional requirements more critical in this solution
Scalable integration to IM estate
an important factor for success
Analysis may take place in Reservoir
or Reservoir only used as an aggregator
Design Pattern: Real-Time Events
May take place at multiple locations between place of data origination and the
Data Centre – requiring careful design and implementation
May include Next-Best-Activity, declarative rules and Data Mining technologies
to optimise decisions. i.e. optimise across declarative, data mining, customer
preference & business-defined rules
May include considerations for
personal preferences and privacy
(e.g. opt-out) for customer related
events
Common component seen across
many industries & markets
e.g. connected vehicle
Real-Time optimisation of events
Design Pattern against component usage map
Design patterns compared: Discovery Lab | Information Platform | Data Application | Information Solution | R/T Events
Outline
– Discovery Lab: Data science lab to assess the value of the data
– Information Platform: Next generation information platform to align IM capability with business strategy
– Data Application: Addressing a specific data problem in Hadoop with no broader integration required
– Information Solution: Addressing a specific data problem that requires broader enterprise-wide integration, e.g. ETL pre-processing, or an Event Store at lower grain than the existing DW
– R/T Events: Execution platform to respond to R/T events
Examples
– Discovery Lab: Gov. Healthcare; Mobile operator
– Information Platform: Spanish Bank (business led); UK Gov. Dept. (tech. led)
– Data Application: Pharma Genome project; Pharma production archive
– Information Solution: Investment Bank – trade risk; Mobile Operator – ETL processing
– R/T Events: Mobile operator – location-based offers
Component usage
– Data Engine: Possible / Yes
– Data Reservoir: Yes / Yes / Yes
– Data Factory: Yes / Yes / Yes
– Enterprise Data: Yes
– Reporting: Yes
– Discovery Lab: Yes / Implied / Alternative approach to Reservoir + Factory above
IM Logical View and Components
Information Management – Logical View: Data Sources
Data Ingestion
Methods and process
to load data into our
managed data store
and manage data
quality
• Contemporary Information Management solutions must be able to ingest any type of data, from any source, in any format, via any
mechanism, and at any frequency, e.g. flat file loads, streaming…
• The data may be highly unstructured, mono-structured or highly poly-structured.
• Data will vary in volume and in Data Quality.
• Operational isolation should be considered to ensure operational applications will continue in the event of the loss of the
Information Management system
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View: Information Ingestion
Data Ingestion
Information Interpretation
Methods and process
to load data and
manage Data
Quality
Methods and
process needed to
access information
Managed Data
Load
All data under management
Query
• Data structure and processing required to load data into managed data stores
• Shape represents the work done on the data to load data and/or process between layers
• Layer may include file mechanism where required to facilitate loading
(e.g. Fuse fs or ZFS for operational isolation and file concat)
• The normal rules of micro-batch, taking all the data, and KISS principles are recommended
• DQ and loading stats presented through BI dashboards as a non-judgemental mechanism to improve DQ.
• Data may be landed in the Ingestion layer to facilitate loading but not typically stored for any length of time. e.g. Raw data loaded from web
logs but sessionised data then loaded to Raw. Another example is data used to manage CDC may be stored in this layer.
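The web-log example above (raw events landed in Ingestion, sessionised data then loaded to Raw) can be sketched minimally in Python. This is a toy illustration with hypothetical field names; at scale this step would run in the Hadoop/ETL tooling, not a single process:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed session timeout

def sessionise(events):
    """Group raw click events into sessions per user.

    events: iterable of (user_id, timestamp) tuples, e.g. parsed web-log lines.
    Returns {user_id: [[timestamps of session 1], [session 2], ...]}.
    """
    sessions = {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        # Start a new session if none exists or the gap exceeds the timeout
        if not user_sessions or ts - user_sessions[-1][-1] > SESSION_GAP:
            user_sessions.append([ts])
        else:
            user_sessions[-1].append(ts)
    return sessions

raw = [
    ("u1", datetime(2014, 5, 1, 9, 0)),
    ("u1", datetime(2014, 5, 1, 9, 10)),
    ("u1", datetime(2014, 5, 1, 11, 0)),   # > 30 min gap: new session
    ("u2", datetime(2014, 5, 1, 9, 5)),
]
print({u: len(s) for u, s in sessionise(raw).items()})  # {'u1': 2, 'u2': 1}
```

The point of the sketch is the layering: the raw lines need only exist transiently in Ingestion, while the sessionised output is what gets persisted to the Raw layer.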
Information Management – Logical View: Data Interpretation
Data Ingestion
Information Interpretation
Methods and process
to load data and
manage Data
Quality
Methods and
process needed to
access information
Managed Data
Load
All data under management
Query
• Methods and processes required to access information in each of the stores
• Shape represents the cost of interpreting the data under management
• For schema-on-read the cost may include the Avro schema, SerDe or reader class as well as the associated processing code to
select, filter and process the data.
• For schema-on-write the cost is represented by the complexity of the SQL required to access the data only – more complex
typically for 3NF than for a dimensional query.
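The interpretation-cost trade-off above can be illustrated with a toy Python sketch, using the csv module as a stand-in for an Avro/SerDe reader. The data and column names are invented for the example:

```python
import csv, io

raw_lines = "id,amount\n1,10.5\n2,3.25\n"

# Schema-on-write: interpretation cost paid once, at load time
def load_structured(text):
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader]

table = load_structured(raw_lines)          # typed rows, cheap to query repeatedly
total = sum(r["amount"] for r in table)

# Schema-on-read: raw bytes stored as-is; every query supplies its own reader
def query_raw(text, column, cast):
    reader = csv.DictReader(io.StringIO(text))  # the "SerDe" applied at read time
    return [cast(r[column]) for r in reader]

total_raw = sum(query_raw(raw_lines, "amount", float))
print(total, total_raw)  # 13.75 13.75
```

Both routes yield the same answer; the difference is where the parsing cost sits, which is exactly what the shape in the diagram represents.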
Information Management – Logical View: Data Layers – cost, quality and concurrency trade-off
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Increasing enrichment
• Increasing data quality
• Reducing concurrency costs
• Data under management includes 3 key layers – Raw, Foundation and Access and Performance layers.
• Data normally loaded into Raw and Foundation layers BUT BI Apps loads data directly into APL and federated warehouses may
well also load data at aggregate level from federated operating companies.
• Data Factory is responsible for loading and then managing data between layers.
• Work is done to elevate the data between layers – typically further enriching and improving data quality.
• Work done in processing the data between the layers significantly reduces query costs, i.e. higher levels of concurrency can be
sustained for the same processing power.
• Increasing formalisation of definition
Information Management – Logical View: Data Layers – Analytical processing
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Analytical processing capabilities of Hadoop and RDBMS used to elevate data between layers as previously described.
• These analytical capabilities can also be leveraged by tools that access the data directly.
Typically this would be by a Data Scientist for Discovery Lab operations or BI Tools and Services that are processing data using
a model previously defined by the Data Scientist.
OLAP
Data Mining
Statistics
OLAP
Text Mining
Other Analytical
Processing
Data Mining
Text Mining
Image Processing
• Increasing enrichment
• Increasing data quality
• Reducing concurrency costs
• Increasing formalisation of definition
Information Management – Logical View: Data Layers – Raw Data Reservoir
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Immutable data store with data at lowest level of grain.
• Typically implemented in Hadoop or NoSQL for cost reasons but not always.
• May be:
• Queried directly,
• Used to derive base level data for the Foundation Layer. Data may be represented logically in Foundation or physically as the
store is immutable, BUT this affects ILM policy,
• or used to derive values or aggregates for the Access and Performance Layer (e.g. propensity score or total monthly SMSs)
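The aggregate-derivation case above (e.g. total monthly SMSs pushed up to the Access and Performance Layer) can be sketched as follows; the record fields are illustrative, and in practice this would be a Hive/MapReduce or SQL aggregation:

```python
from collections import Counter

# Raw Data Reservoir: one record per SMS event, at lowest grain
raw_events = [
    {"msisdn": "447700900001", "month": "2014-05"},
    {"msisdn": "447700900001", "month": "2014-05"},
    {"msisdn": "447700900002", "month": "2014-05"},
    {"msisdn": "447700900001", "month": "2014-06"},
]

# Access & Performance Layer: derived aggregate, rebuildable from Raw at any time
monthly_sms = Counter((e["msisdn"], e["month"]) for e in raw_events)
print(monthly_sms[("447700900001", "2014-05")])  # 2
```

Because the reservoir is immutable, the aggregate can always be thrown away and recomputed, which is why such structures need no independent retention policy while the underlying raw data is kept.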
Information Management – Logical View: Data Layers – Foundation Data Layer
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Immutable, integrated and standardised store of enterprise class data – the data the business has agreed on and organises around.
• Data at lowest level of grain of value for Enterprise data.
• Stored in business process neutral fashion to avoid data maintenance tasks to keep in step with current business interpretations.
• Typically close to 3NF. Special attention to modelling hierarchy, flexible entity attributions, customer / supplier etc.
• ONLY implemented in relational technology BUT this could be logical, as previously noted for the Raw Data Reservoir.
• May be queried directly by a select few individuals. Wider access to detail data provided through views in APL, often with VPD
implemented to prevent queries to antecedent data.
• Data in the Foundation Layer should be retained for as long as possible.
• Consideration should be given to retaining data in Raw Data Reservoir rather than archiving.
Information Management – Logical View: Data Layers – Access and Performance Layer
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Layer facilitates access, navigation and performance of queries.
• Allows for multiple interpretations of data from Foundation or Raw data Reservoir.
• Most structures can be thrown away and re-built from scratch based on Foundation and Raw Reservoir.
• The exception is derived and aggregate data which may have to be retained if the underlying data/mechanism is archived.
• Most users presenting information in a standardised fashion on dashboards and reports will access this layer only.
Access and Performance Layer
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Data destined for Raw Data Reservoir may be loaded directly (e.g. through Flume) or may be stored temporarily in fs prior to
loading (e.g. Fuse fs)
• Relational data ingested via the most appropriate mechanism before persisting in the Foundation Data Layer (usual rules apply…)
• Ideally micro batch using simplest mechanism possible
• Only data of agreed quality loaded in FDL
• For efficient loading, relational data may be pre-staged in the fs so that a large number of small files can be concatenated
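The small-file concatenation mentioned above can be sketched as a minimal Python routine. Paths and naming are illustrative only; in a real deployment this would typically happen in the fs layer (e.g. Fuse fs or ZFS) rather than in application code:

```python
import pathlib, tempfile

def concat_small_files(src_dir, out_path, suffix=".csv"):
    """Concatenate many small pre-staged files into one file for bulk load."""
    with open(out_path, "w") as out:
        for f in sorted(pathlib.Path(src_dir).glob("*" + suffix)):
            out.write(f.read_text())

# demo with a temporary staging directory
stage = tempfile.mkdtemp()
for i in range(3):
    pathlib.Path(stage, f"part{i}.csv").write_text(f"row{i}\n")
out = pathlib.Path(stage, "load.dat")
concat_small_files(stage, out)
print(out.read_text())
```

Fewer, larger files mean fewer load invocations and better sequential I/O, which is the whole rationale for pre-staging.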
Information Management – Logical View: Data Factory Ingestion flow
Data Ingestion
Batch & Real-Time
ETL / ELT
CDC
Stream
File Ops.
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Access and Performance Layer
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Flow shown:
1. Data to be formalised is extracted from the HDFS store and loaded into the Foundation Data Layer.
e.g. where Flume/HDFS is being used as an ETL pre-processor for Enterprise Data
or where HDFS data is being logically modelled in the foundation layer
2. Data is re-structured and/or aggregated to facilitate access by users and business processes
3. Data may also be re-structured and/or aggregated from HDFS store where there are no specific
requirements to manage Enterprise Data in a more formal data store over time
Information Management – Logical View: Data Factory intra data processing flow
Access and Performance Layer
Information Management – Logical View: Information Provisioning – BI & Data Science Components
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Virtualisation & Query Federation
• Data Virtualisation and the various components to access the data are as per our previous view on BI tools.
• By far the majority of users will access data via the Access and Performance Layer, although data may come from the Raw Store or Foundation
• Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future state roadmap
• Big Data has focused considerable attention on Data Science
• Analytical capabilities delivered through analytical processing in the data layers and Advanced Analytical Tools used to drive capabilities
• Data Mining in particular often involves complex data processing to flatten data into a longitudinal form. This derived data and model results are
typically written to a project based sandbox.
• Agile discovery is often best served through a separate Discovery Lab infrastructure (see later details)
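The flattening step described above (turning per-event records into the longitudinal, one-row-per-customer form most mining algorithms expect) can be sketched as follows; customer IDs, months and the spend attribute are hypothetical:

```python
# Flatten per-event records into one wide row per customer for mining
events = [
    {"cust": "c1", "month": "m1", "spend": 10},
    {"cust": "c1", "month": "m2", "spend": 20},
    {"cust": "c2", "month": "m1", "spend": 5},
]
months = ["m1", "m2"]

flat = {}
for e in events:
    # One row per customer, one column per month, defaulting to zero
    row = flat.setdefault(e["cust"], {f"spend_{m}": 0 for m in months})
    row[f"spend_{e['month']}"] += e["spend"]

print(flat["c1"])  # {'spend_m1': 10, 'spend_m2': 20}
```

The derived wide table and any model results would then be written to a project-based sandbox rather than back into the managed layers.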
Data Science
Access and Performance Layer
Information Management – Logical View: Information Provisioning – Typical BI Flows
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
1. Typical access mechanism for Enterprise data via Access and Performance layer structures
2. Access to Foundation Layer Data to specific functions, processes and users only
3. Data interpretation & DQ assured through encoded logic, Avro, SerDe, FileReader, HCat etc.
4. Diagonal flows show how data can be joined between layers as well as accessed directly, e.g. Raw Data
can be queried directly through the Hive connector or joined to the RDBMS data and queried.
Information Management – Logical View: Data / Information Quality
Access and Performance Layer
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Quality of data at rest assured by a number of factors in addition to the underlying quality of data at source
– File and event handling to ensure data is not missed (e.g. missing log files assured by log file sequence numbering)
– The processing of data between Raw and FDL / APL layers. This can be seen as a DQ firewall to ensure only data of known and
acceptable quality is loaded. Typically this involves an element of synchronisation as some data will need to be held off until required
reference data is available due to the micro-batch incremental loading approach.
Quality of information presented to downstream tools and services determined by
– Model quality, understanding and performance of provisioning from modelled layers
– Consistency of definition, code quality and query performance when accessing Hadoop data (e.g. HR code, Avro definition…)
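The file-handling check above (missing log files detected via sequence numbering) can be sketched minimally; the file-naming convention used here is hypothetical:

```python
def missing_sequences(filenames):
    """Detect missing log files from their embedded sequence numbers.

    Assumes names like 'weblog.000123' (an illustrative convention).
    """
    seqs = sorted(int(name.rsplit(".", 1)[1]) for name in filenames)
    expected = set(range(seqs[0], seqs[-1] + 1))
    return sorted(expected - set(seqs))

arrived = ["weblog.000101", "weblog.000102", "weblog.000104"]
print(missing_sequences(arrived))  # [103]
```

A gap in the sequence becomes a concrete, reportable DQ event rather than silently missing data, which is what makes the non-judgemental dashboard reporting described above possible.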
Access and Performance Layer
Information Management – Logical View: Information Provisioning – Direct Flow from Source Systems
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Direct access from source systems to BI and Discovery or through the Data Virtualisation layer is also possible
• This is a fairly typical requirement for EPM and Data Science. Much less common for general BI other than as
part of a temporary expedient.
Data Sources
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View: Information Provisioning – Direct Flow from Source Systems
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Another view showing how the quality of data is altered between stores
Data Sources
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built & Ad-hoc BI Assets
Information
Services
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Data Science
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Discovery Lab Sandboxes Rapid Development Sandboxes
Project based data stores
to support specific
discovery objectives
Project based data stores
to facilitate rapid content /
presentation delivery
Data Sources
Data Reservoir & Enterprise Information Store – complete view
Master & Reference Data Sources
Discovery Lab Sandboxes
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Data scientist led discovery
• Domain expertise also critical
• Wide range of tools & data
• Data preparation is a significant challenge
• Able to quickly mashup & transform data
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Choice of deployment options
• Organisational learning
• Automated event and/or response
(e.g. inbound call and CSR support)
• Manual list generation based on detected risk events
• Tools support depending on deployment option
• Visualisations, numerical presentation… etc
• Provision for Marketing Analyst data mashup
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Agile incorporation into standard reporting framework
• Expose new risk indicators and interventions
• Track model lift and trigger perturbation or rebuild – automated or Data Science led activity
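The lift-tracking idea above can be sketched with a simplified lift definition (top-decile response rate over overall response rate) and an assumed rebuild threshold; both the tolerance and the baseline are illustrative parameters, not prescribed values:

```python
def lift_at(scores_with_labels, top_frac=0.1):
    """Lift of the top-scored fraction vs the overall response rate."""
    ranked = sorted(scores_with_labels, key=lambda s: -s[0])
    top = ranked[: max(1, int(len(ranked) * top_frac))]
    overall = sum(lbl for _, lbl in ranked) / len(ranked)
    top_rate = sum(lbl for _, lbl in top) / len(top)
    return top_rate / overall

def needs_rebuild(current_lift, baseline_lift, tolerance=0.8):
    """Flag a rebuild when lift decays below a fraction of its baseline."""
    return current_lift < baseline_lift * tolerance

# (score, actual outcome) pairs from a scored, monitored population
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
lift = lift_at(scored, top_frac=0.2)
print(lift, needs_rebuild(lift, baseline_lift=4.0))
```

Wired into the standard reporting framework, such a check lets the trigger be either automated or routed to the Data Scientist for investigation.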
Analysis Processing & Delivery
Discovery Lab & Data Science Tooling
Data Reservoir & Enterprise Data
Data
Science
(Primary
Toolset)
Statistics Tools
Data & Text Mining Tools
Faceted Query Tools
Programming & Scripting
Data Modeling Tools
Query & Search Tools
Pre-Built
Intelligence
Assets
Intelligence
Analysis
Tools
Ad Hoc Query & Analysis Tools
OLAP Tools
Forecasting & Simulation Tools
Reporting Tools
Data
Scientist
Virtualisation & Information Services
Data Factory flow
1. Data Factory responsible for
access provisioning to data
or replication (all or sample)
to Sandbox in Discovery Lab.
2. Direct connection from Data
Science tools and analysis
sandbox. Data Science tools
read and write data from/to
project sandboxes.
3. Data Scientist can also
access standard dashboards,
reports and KPIs through
Data Virtualisation layer
Data Quality & Profiling
Graphical rendering tools
Dashboards & Reports
Scorecards
Charts & Graphs
Sandbox – Project 3
Sandbox – Project 2
Sandbox – Project 1
Data store Analytical
Processing
Information Management – Logical View: Discovery Lab data flow
General BI flow
Rapid Development Sandboxes
Analysis Processing & Delivery
Development Environment Tooling
Pre-Built
Intelligence
Assets
Intelligence
Analysis
Tools
Ad Hoc Query & Analysis Tools
OLAP Tools
Forecasting & Simulation Tools
Reporting Tools
BICC
Virtualisation & Information Services
Data Factory flow
1. The majority of BI development
activity will be from existing
sources – developing new
reports to existing or new
channels.
2. BICC or other expert users
may quickly develop new
reporting through mashups
from any available sources.
Careful governance is required
once the report is completed to
ensure data and report are
professionally managed.
Dashboards & Reports
Scorecards
Charts & Graphs
Sandbox – Project 3
Sandbox – Project 2
Dev Sandbox – Project 1
Information Management – Logical View: Discovery Lab data flow
Data Reservoir & Enterprise Data
General BI flow
R/T event Engine – Logical View and Components
Real-time
Data Engine
To Event Subscribers
(Events / Data)
Privacy Filter
Data Transform
Rules & Models
Mediation
Next Best Action
Real-Time
Data Store
From Input Events
Reference
Data Models
& Rules
Privacy
Data
Analytics
Real-Time Data Engine – Logical View
Business Activity Monitoring
Real-Time event
monitoring
Real-Time Data Engine
Message mediation service
Privacy filter for event data, i.e. apply customer-specified privacy
and preference filters to the data stream
Transformation of the message data to outbound form
Apply declarative rules and models to the data stream to detect
events for further downstream processing
Next Best Activity (NBA) event detection and processing. NBA
typically also includes control group management and global
optimisation of rules
Business Activity Monitoring
Local data store – local persistence of rules and metadata
Components
Privacy Filter
Data Transform
Rules & Models
Mediation
Next Best Action
Real-Time Data
Store
BAM
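The component chain above (mediation, privacy filter, transform, rules) can be sketched as a toy event pipeline. All names, fields and the single declarative rule are hypothetical; a real engine would also cover NBA, control groups and BAM:

```python
# mediation -> privacy filter -> transform -> rules, producing outbound events

OPTED_OUT = {"cust42"}                      # customer-specified opt-outs

def privacy_filter(events):
    """Drop events for customers who have opted out."""
    return (e for e in events if e["cust"] not in OPTED_OUT)

def transform(events):
    """Transform message data to the outbound form."""
    for e in events:
        yield {**e, "amount_gbp": e["amount_pence"] / 100}

def apply_rules(events):
    """Declarative rule: flag high-value events for downstream processing."""
    return [e for e in events if e["amount_gbp"] > 50]

inbound = [
    {"cust": "cust1", "amount_pence": 7500},
    {"cust": "cust42", "amount_pence": 9900},   # filtered: opted out
    {"cust": "cust7", "amount_pence": 1200},
]
actionable = apply_rules(transform(privacy_filter(inbound)))
print([e["cust"] for e in actionable])  # ['cust1']
```

Note the ordering: the privacy filter sits ahead of rules and models, so opted-out customers never reach the decisioning stages at all.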
Mapping from the previous release of the architecture
Oracle’s Information Management Reference Architecture (3rd Edition)
More relevant to Big Data oriented audience
Better representation of pragmatic customer projects
Includes Raw data store as part of the architecture
Shows the effort / cost to store and interpret data that separates
schema-on-read and schema-on-write approaches
Aligned to Analytics 3.0
Consistent with Oracle’s engineering efforts
What’s changed?
Oracle’s Information Management Reference Architecture (3rd Edition)
“All those layers and definitions in your
Reference Architecture, I just don’t get
it… and it looks complicated !”
Hadoop developer knee-deep in complex MapReduce code
What’s changed?
Business Trends
Technology Trends
Data Trends
Information Management Reference Architecture – Version 2.0 of the Architecture
Information Management Reference Architecture
Interpretation layer
shows the relative cost
of reading data
depending on its
location
Previous staging layer
now split into Data
Ingestion and Raw
store.
Ingestion layer
includes methods and
processes to load data
and manage Data
Quality. Shape
represents the relative
cost of these
processes. i.e. from
none for HDFS to lots
in APL.
Raw Reservoir is
typically at the lowest
level of grain. Often
lower than the
enterprise cares about
and so may not have
been included in
previous
representation.
Renamed from
Knowledge Discovery
to Discovery Lab but
otherwise unchanged.
The role of Discovery
Labs is becoming
more central though so
additional operational
guidance will be
added.
Discovery Lab
Still an immutable
store but may be
physically
implemented in
relational or non-relational technologies
Key differences from 2.0 to 3.0 of the Architecture