bi masterclass slides (reference architecture v3)
DESCRIPTION
This presentation is covered in the Oracle BI Masterclass.
TRANSCRIPT
Information Management Reference Architecture
EMEA Enterprise Architecture
Contents Introduction
Conceptual view
Design Patterns
IM Logical view and component outline
Discovery Lab
R/T Event Engine logical view
Mapping to previous Reference Architecture release
Introduction
Introduction
This PPT documents the main architectural components of Oracle’s
Information Management Reference Architecture.
The architecture is intended to be practical and pragmatic, with many of the
ideas and experiences that inform the approach dating back almost 20 years
in Oracle.
These ideas and concepts have been continually refined through the
engagement of our Enterprise Architecture team on real world customer
engagements.
3rd Evolution of Oracle’s Information Management Reference Architecture
What is Information Management
“Information Management is the means by which an
organisation maximises the efficiency with which it plans,
collects, organises, uses, controls, stores, disseminates,
and disposes of its Information, and through which it
ensures that the value of that information is identified and
exploited to the maximum extent possible”
We define Information Management to mean…
Aligning analytical requirements and IM architecture
Enabling Analytics 3.0 with a pragmatic architecture
Analytics 2.0
Analytics 3.0
Analytics 1.0
• Reporting with limited use of
descriptive analytics
• Limited range of tabular data
• Batch oriented analysis
• Analysis bolted onto limited
set of business processes
• Firms “Competing on Analytics”
• Extended analytics to larger
and less structured datasets
• Emergence of Big Data into the
commercial world
• Recognition of Data Science
role in commercial orgs.
• Platform for monetisation
• Deeper analysis & more data
• Faster test-do-learn iterations
• Different types of data & wider
business process coverage
• Analysts focus on discovery and
driving business value
• “Agile” with operational elements
incorporated into design patterns
Adapted from Tom Davenport material
Conceptual View
Actionable
Events
Event Engine Data Reservoir
Data Factory Enterprise Information Store
Reporting
Discovery Lab
Actionable
Information
Actionable
Insights
Data
Streams
Execution
Innovation
Discovery Output
Events & Data
Conceptual View
Structured
Enterprise
Data
Other
Data
Component Outline
Event Engine Respond to R/T events in an appropriate and/or optimised fashion
Data Reservoir Raw data Reservoir – typically event data at lowest grain
Data Factory Managed ETL onto, within and between platforms
Enterprise Data Data stores for Information Management
Reporting BI tools and infrastructure components
Discovery Lab Platform, data and tools to support discovery process
Execution – things you do every day
Innovation – innovation to drive tomorrow's business
Line of Governance!
Discovery Output
– Possible outputs include new knowledge, mining models / parameters, scored data…
Design Patterns
Design Pattern: Discovery Lab
Specific focus on identifying commercial value for exploitation
Small group of highly skilled individuals (aka Data Scientists)
Iterative development approach – data oriented NOT development oriented
Wide range of tools and techniques applied
Data provisioned through
Data Factory or own ETL
Typically separate infrastructure
but could also be unified Reservoir
if resource managed effectively
Design Pattern : Information Platform
Build the next generation Information Management platform
Either Business Strategy driven or IT cost / capability driven initiative
Initial project may be specifically linked to lower data grain or retention
BUT it is the platform as a whole that forms the solution required
Platform for consolidating other IM assets onto
Key issues related to differences in
procurement, development process,
governance and skills differences
Discovery Lab may be implemented
as a pragmatic initial POV.
Design Pattern : Data Application
Big Data technologies applied to a specific business problem
e.g. Genome sequence analysis using BLAST or log data from
pharmaceutical production plant and machinery required for traceability
Limited or no integration to broader Information Management estate
Specific solution so Non-functional requirements have less impact
on solution quality or long term costs
Platform costs and scalability are
important considerations
Design Pattern: Information Solution
Specific solution based on Big Data technologies requiring broader
integration to the wider Information Management estate, e.g. an ETL pre-processor for the DW, or to affordably store a lower level of grain
Non-functional requirements more critical in this solution
Scalable integration to IM estate
an important factor for success
Analysis may take place in Reservoir
or Reservoir only used as an aggregator
Design Pattern: Real-Time Events
May take place at multiple locations between place of data origination and the
Data Centre – requiring careful design and implementation
May include Next-Best-Activity, declarative rules and Data Mining technologies
to optimise decisions. i.e. optimise across declarative, data mining, customer
preference & business-defined rules
May include considerations for
personal preferences and privacy
(e.g. opt-out) for customer related
events
Common component seen across
many industries & markets
e.g. connected vehicle
Real-Time optimisation of events
Design Pattern against component usage map
Design patterns compared: Discovery Lab | Information Platform | Data Application | Information Solution | R/T Events
Outline
– Discovery Lab: Data science lab to assess the value of the data
– Information Platform: Next generation information platform to align IM capability with business strategy
– Data Application: Addressing a specific data problem in Hadoop with no broader integration required
– Information Solution: Addressing a specific data problem that requires broader enterprise-wide integration, e.g. ETL pre-processing, or an Event Store at lower grain than the existing DW
– R/T Events: Execution platform to respond to R/T events
Examples
– Discovery Lab: Gov. Healthcare; Mobile operator
– Information Platform: Spanish Bank (business led); UK Gov. Dept. (tech. led)
– Data Application: Pharma Genome project; Pharma production archive
– Information Solution: Investment Bank – trade risk; Mobile Operator – ETL processing
– R/T Events: Mobile operator – location-based offers
Component usage
– Data Engine: Possible / Yes
– Data Reservoir: Yes / Yes / Yes
– Data Factory: Yes / Yes / Yes
– Enterprise Data: Yes
– Reporting: Yes
– Discovery Lab: Yes / Implied / Alternative approach to Reservoir + Factory above
IM Logical View and Components
Information Management – Logical View: Data Sources
Data Ingestion
Methods and process
to load data into our
managed data store
and manage data
quality
• Contemporary Information Management solutions must be able to ingest any type of data, from any source, in any format, via any
mechanism, and at any frequency, e.g. flat file loads, streaming…
• The data may be highly unstructured, mono-structured or highly poly-structured.
• Data will vary in volume and in Data Quality.
• Operational isolation should be considered to ensure operational applications will continue in the event of the loss of the
Information Management system
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View: Information Ingestion
Data Ingestion
Information Interpretation
Methods and process
to load data and
manage Data
Quality
Methods and
process needed to
access information
Managed Data
Load
All data under management
Query
• Data structure and processing required to load data into managed data stores
• Shape represents the work done on the data to load data and/or process between layers
• Layer may include file mechanism where required to facilitate loading
(e.g. Fuse fs or ZFS for operational isolation and file concat)
• The normal rules of micro-batch, taking all the data, and KISS principles are recommended
• DQ and loading stats presented through BI dashboards as a non-judgemental mechanism to improve DQ.
• Data may be landed in the Ingestion layer to facilitate loading but not typically stored for any length of time. e.g. Raw data loaded from web
logs but sessionised data then loaded to Raw. Another example is data used to manage CDC may be stored in this layer.
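The web-log example above (raw events landed in Ingestion, sessionised data then loaded to Raw) can be sketched minimally in Python. This is a toy illustration with hypothetical field names; at scale this step would run in the Hadoop/ETL tooling, not a single process:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed session timeout

def sessionise(events):
    """Group raw click events into sessions per user.

    events: iterable of (user_id, timestamp) tuples, e.g. parsed web-log lines.
    Returns {user_id: [[timestamps of session 1], [session 2], ...]}.
    """
    sessions = {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        # Start a new session if none exists or the gap exceeds the timeout
        if not user_sessions or ts - user_sessions[-1][-1] > SESSION_GAP:
            user_sessions.append([ts])
        else:
            user_sessions[-1].append(ts)
    return sessions

raw = [
    ("u1", datetime(2014, 5, 1, 9, 0)),
    ("u1", datetime(2014, 5, 1, 9, 10)),
    ("u1", datetime(2014, 5, 1, 11, 0)),   # > 30 min gap: new session
    ("u2", datetime(2014, 5, 1, 9, 5)),
]
print({u: len(s) for u, s in sessionise(raw).items()})  # {'u1': 2, 'u2': 1}
```

The point of the sketch is the layering: the raw lines need only exist transiently in Ingestion, while the sessionised output is what gets persisted to the Raw layer.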
Information Management – Logical View: Data Interpretation
Data Ingestion
Information Interpretation
Methods and process
to load data and
manage Data
Quality
Methods and
process needed to
access information
Managed Data
Load
All data under management
Query
• Methods and processes required to access information in each of the stores
• Shape represents the cost of interpreting the data under management
• For schema-on-read the cost may include the Avro schema, SerDe or reader class as well as the associated processing code to
select, filter and process the data.
• For schema-on-write the cost is represented by the complexity of the SQL required to access the data only – more complex
typically for 3NF than for a dimensional query.
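The interpretation-cost trade-off above can be illustrated with a toy Python sketch, using the csv module as a stand-in for an Avro/SerDe reader. The data and column names are invented for the example:

```python
import csv, io

raw_lines = "id,amount\n1,10.5\n2,3.25\n"

# Schema-on-write: interpretation cost paid once, at load time
def load_structured(text):
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader]

table = load_structured(raw_lines)          # typed rows, cheap to query repeatedly
total = sum(r["amount"] for r in table)

# Schema-on-read: raw bytes stored as-is; every query supplies its own reader
def query_raw(text, column, cast):
    reader = csv.DictReader(io.StringIO(text))  # the "SerDe" applied at read time
    return [cast(r[column]) for r in reader]

total_raw = sum(query_raw(raw_lines, "amount", float))
print(total, total_raw)  # 13.75 13.75
```

Both routes yield the same answer; the difference is where the parsing cost sits, which is exactly what the shape in the diagram represents.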
Information Management – Logical View: Data Layers – cost, quality and concurrency trade-off
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Increasing enrichment
• Increasing data quality
• Reducing concurrency costs
• Data under management includes 3 key layers – Raw, Foundation and Access and Performance layers.
• Data normally loaded into Raw and Foundation layers BUT BI Apps loads data directly into APL and federated warehouses may
well also load data at aggregate level from federated operating companies.
• Data Factory is responsible for loading and then managing data between layers.
• Work is done to elevate the data between layers – typically further enriching and improving data quality.
• Work done in processing the data between the layers significantly reduces query costs, i.e. higher levels of concurrency can be
sustained for the same processing power.
• Increasing formalisation of definition
Information Management – Logical View: Data Layers – Analytical processing
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Analytical processing capabilities of Hadoop and RDBMS used to elevate data between layers as previously described.
• These analytical capabilities can also be leveraged by tools that access the data directly.
Typically this would be by a Data Scientist for Discovery Lab operations or BI Tools and Services that are processing data using
a model previously defined by the Data Scientist.
OLAP
Data Mining
Statistics
OLAP
Text Mining
Other Analytical
Processing
Data Mining
Text Mining
Image Processing
• Increasing enrichment
• Increasing data quality
• Reducing concurrency costs
• Increasing formalisation of definition
Information Management – Logical View: Data Layers – Raw Data Reservoir
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Immutable data store with data at lowest level of grain.
• Typically implemented in Hadoop or NoSQL for cost reasons but not always.
• May be:
• Queried directly,
• Used to derive base level data for the Foundation Layer. Data may be represented logically in Foundation or physically as the
store is immutable, BUT this affects ILM policy,
• or used to derive values or aggregates for the Access and Performance Layer (e.g. propensity score or total monthly SMSs)
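The aggregate-derivation case above (e.g. total monthly SMSs pushed up to the Access and Performance Layer) can be sketched as follows; the record fields are illustrative, and in practice this would be a Hive/MapReduce or SQL aggregation:

```python
from collections import Counter

# Raw Data Reservoir: one record per SMS event, at lowest grain
raw_events = [
    {"msisdn": "447700900001", "month": "2014-05"},
    {"msisdn": "447700900001", "month": "2014-05"},
    {"msisdn": "447700900002", "month": "2014-05"},
    {"msisdn": "447700900001", "month": "2014-06"},
]

# Access & Performance Layer: derived aggregate, rebuildable from Raw at any time
monthly_sms = Counter((e["msisdn"], e["month"]) for e in raw_events)
print(monthly_sms[("447700900001", "2014-05")])  # 2
```

Because the reservoir is immutable, the aggregate can always be thrown away and recomputed, which is why such structures need no independent retention policy while the underlying raw data is kept.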
Information Management – Logical View: Data Layers – Foundation Data Layer
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Immutable, integrated and standardised store of enterprise class data – the data the business has agreed on and organises around.
• Data at lowest level of grain of value for Enterprise data.
• Stored in business process neutral fashion to avoid data maintenance tasks to keep in step with current business interpretations.
• Typically close to 3NF. Special attention to modelling hierarchy, flexible entity attributions, customer / supplier etc.
• ONLY implemented in relational technology BUT this could be logical, as previously noted for the Raw Data Reservoir.
• May be queried directly by a select few individuals. Wider access to detail data provided through views in APL, often with VPD
implemented to prevent queries to antecedent data.
• Data in the Foundation Layer should be retained for as long as possible.
• Consideration should be given to retaining data in Raw Data Reservoir rather than archiving.
Information Management – Logical View: Data Layers – Access and Performance Layer
Managed Data Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir
Raw data at rest is not interpreted
Immutable modelled data. Business
Process Neutral form. Abstracted
from business process changes
Past, current and future interpretation of
enterprise data. Structured to support
agile access & navigation
• Layer facilitates access, navigation and performance of queries.
• Allows for multiple interpretations of data from Foundation or Raw data Reservoir.
• Most structures can be thrown away and re-built from scratch based on Foundation and Raw Reservoir.
• The exception is derived and aggregate data which may have to be retained if the underlying data/mechanism is archived.
• Most users presenting information in a standardised fashion on dashboards and reports will access this layer only.
Access and Performance Layer
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Data destined for Raw Data Reservoir may be loaded directly (e.g. through Flume) or may be stored temporarily in fs prior to
loading (e.g. Fuse fs)
• Relational data ingested via the most appropriate mechanism before persisting in the Foundation Data Layer (usual rules apply…)
• Ideally micro batch using simplest mechanism possible
• Only data of agreed quality loaded in FDL
• For efficient loading, relational data may be pre-staged in the fs so that a large number of small files can be concatenated
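The small-file concatenation mentioned above can be sketched as a minimal Python routine. Paths and naming are illustrative only; in a real deployment this would typically happen in the fs layer (e.g. Fuse fs or ZFS) rather than in application code:

```python
import pathlib, tempfile

def concat_small_files(src_dir, out_path, suffix=".csv"):
    """Concatenate many small pre-staged files into one file for bulk load."""
    with open(out_path, "w") as out:
        for f in sorted(pathlib.Path(src_dir).glob("*" + suffix)):
            out.write(f.read_text())

# demo with a temporary staging directory
stage = tempfile.mkdtemp()
for i in range(3):
    pathlib.Path(stage, f"part{i}.csv").write_text(f"row{i}\n")
out = pathlib.Path(stage, "load.dat")
concat_small_files(stage, out)
print(out.read_text())
```

Fewer, larger files mean fewer load invocations and better sequential I/O, which is the whole rationale for pre-staging.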
Information Management – Logical View: Data Factory Ingestion flow
Data Ingestion
Batch & Real-Time
ETL / ELT
CDC
Stream
File Ops.
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Access and Performance Layer
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Flow shown:
1. Data to be formalised is extracted from the HDFS store and loaded into the Foundation Data Layer.
e.g. where Flume/HDFS is being used as an ETL pre-processor for Enterprise Data
or where HDFS data is being logically modelled in the foundation layer
2. Data is re-structured and/or aggregated to facilitate access by users and business processes
3. Data may also be re-structured and/or aggregated from HDFS store where there are no specific
requirements to manage Enterprise Data in a more formal data store over time
Information Management – Logical View: Data Factory intra data processing flow
Access and Performance Layer
Information Management – Logical View: Information Provisioning – BI & Data Science Components
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Virtualisation & Query Federation
• Data Virtualisation and the various components to access the data are as per our previous view on BI tools.
• By far the majority of users will access data via the Access and Performance Layer, although data may come from the Raw Store or Foundation
• Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future state roadmap
• Big Data has focused considerable attention on Data Science
• Analytical capabilities delivered through analytical processing in the data layers and Advanced Analytical Tools used to drive capabilities
• Data Mining in particular often involves complex data processing to flatten data into a longitudinal form. This derived data and model results are
typically written to a project based sandbox.
• Agile discovery is often best served through a separate Discovery Lab infrastructure (see later details)
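The flattening step described above (turning per-event records into the longitudinal, one-row-per-customer form most mining algorithms expect) can be sketched as follows; customer IDs, months and the spend attribute are hypothetical:

```python
# Flatten per-event records into one wide row per customer for mining
events = [
    {"cust": "c1", "month": "m1", "spend": 10},
    {"cust": "c1", "month": "m2", "spend": 20},
    {"cust": "c2", "month": "m1", "spend": 5},
]
months = ["m1", "m2"]

flat = {}
for e in events:
    # One row per customer, one column per month, defaulting to zero
    row = flat.setdefault(e["cust"], {f"spend_{m}": 0 for m in months})
    row[f"spend_{e['month']}"] += e["spend"]

print(flat["c1"])  # {'spend_m1': 10, 'spend_m2': 20}
```

The derived wide table and any model results would then be written to a project-based sandbox rather than back into the managed layers.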
Data Science
Access and Performance Layer
Information Management – Logical View: Information Provisioning – Typical BI Flows
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
1. Typical access mechanism for Enterprise data via Access and Performance layer structures
2. Access to Foundation Layer Data to specific functions, processes and users only
3. Data interpretation & DQ assured through encoded logic, Avro, SerDe, FileReader, HCat etc.
4. Diagonal flows show how data can be joined between layers as well as accessed directly, e.g. Raw Data
can be queried directly through the Hive connector or joined to the RDBMS data and queried.
Information Management – Logical View: Data / Information Quality
Access and Performance Layer
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Quality of data at rest assured by a number of factors in addition to the underlying quality of data at source
– File and event handling to ensure data is not missed (e.g. missing log files assured by log file sequence numbering)
– The processing of data between Raw and FDL / APL layers. This can be seen as a DQ firewall to ensure only data of known and
acceptable quality is loaded. Typically this involves an element of synchronisation as some data will need to be held off until required
reference data is available due to the micro-batch incremental loading approach.
Quality of information presented to downstream tools and services determined by
– Model quality, understanding and performance of provisioning from modelled layers
– Consistency of definition, code quality and query performance when accessing Hadoop data (e.g. HR code, Avro definition…)
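The file-handling check above (missing log files detected via sequence numbering) can be sketched minimally; the file-naming convention used here is hypothetical:

```python
def missing_sequences(filenames):
    """Detect missing log files from their embedded sequence numbers.

    Assumes names like 'weblog.000123' (an illustrative convention).
    """
    seqs = sorted(int(name.rsplit(".", 1)[1]) for name in filenames)
    expected = set(range(seqs[0], seqs[-1] + 1))
    return sorted(expected - set(seqs))

arrived = ["weblog.000101", "weblog.000102", "weblog.000104"]
print(missing_sequences(arrived))  # [103]
```

A gap in the sequence becomes a concrete, reportable DQ event rather than silently missing data, which is what makes the non-judgemental dashboard reporting described above possible.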
Access and Performance Layer
Information Management – Logical View: Information Provisioning – Direct Flow from Source Systems
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Direct access from source systems to BI and Discovery or through the Data Virtualisation layer is also possible
• This is a fairly typical requirement for EPM and Data Science. Much less common for general BI other than as
part of a temporary expedient.
Data Sources
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View: Information Provisioning – Direct Flow from Source Systems
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built &
Ad-hoc BI Assets
Information
Services
Data Science
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
• Another view showing how the quality of data is altered between stores
Data Sources
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Master & Reference Data Sources
Information Management – Logical View
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built & Ad-hoc BI Assets
Information
Services
Data Ingestion
Information Interpretation
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Data Science
Data Engines & Poly-structured sources
Content
Docs Web & Social Media
SMS
Structured Data Sources
• Operational Data
• COTS Data
• Streaming & BAM
Discovery Lab Sandboxes Rapid Development Sandboxes
Project based data stores
to support specific
discovery objectives
Project based data stores
to facilitate rapid content /
presentation delivery
Data Sources
Data Reservoir & Enterprise Information Store – complete view
Master & Reference Data Sources
Discovery Lab Sandboxes
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Data scientist led discovery
• Domain expertise also critical
• Wide range of tools & data
• Data preparation is a significant challenge
• Able to quickly mashup & transform data
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Choice of deployment options
• Organisational learning
• Automated event and/or response
(e.g. inbound call and CSR support)
• Manual list generation based on detected risk events
• Tools support depending on deployment option
• Visualisations, numerical presentation… etc
• Provision for Marketing Analyst data mashup
Data Mining Method – Conceptual Map
Understand Data
Prepare Data
Model
Evaluate
Deploy
Monitor
Discovery
Business
Goals
• Agile incorporation into standard reporting framework
• Expose new risk indicators and interventions
• Track model lift and trigger perturbation or rebuild – automated or Data Science led activity
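The lift-tracking idea above can be sketched with a simplified lift definition (top-decile response rate over overall response rate) and an assumed rebuild threshold; both the tolerance and the baseline are illustrative parameters, not prescribed values:

```python
def lift_at(scores_with_labels, top_frac=0.1):
    """Lift of the top-scored fraction vs the overall response rate."""
    ranked = sorted(scores_with_labels, key=lambda s: -s[0])
    top = ranked[: max(1, int(len(ranked) * top_frac))]
    overall = sum(lbl for _, lbl in ranked) / len(ranked)
    top_rate = sum(lbl for _, lbl in top) / len(top)
    return top_rate / overall

def needs_rebuild(current_lift, baseline_lift, tolerance=0.8):
    """Flag a rebuild when lift decays below a fraction of its baseline."""
    return current_lift < baseline_lift * tolerance

# (score, actual outcome) pairs from a scored, monitored population
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
lift = lift_at(scored, top_frac=0.2)
print(lift, needs_rebuild(lift, baseline_lift=4.0))
```

Wired into the standard reporting framework, such a check lets the trigger be either automated or routed to the Data Scientist for investigation.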
Analysis Processing & Delivery
Discovery Lab & Data Science Tooling
Data Reservoir & Enterprise Data
Data
Science
(Primary
Toolset)
Statistics Tools
Data & Text Mining Tools
Faceted Query Tools
Programming & Scripting
Data Modeling Tools
Query & Search Tools
Pre-Built
Intelligence
Assets
Intelligence
Analysis
Tools
Ad Hoc Query & Analysis Tools
OLAP Tools
Forecasting & Simulation Tools
Reporting Tools
Data
Scientist
Virtualisation & Information Services
Data Factory flow
1. Data Factory responsible for
access provisioning to data
or replication (all or sample)
to Sandbox in Discovery Lab.
2. Direct connection from Data
Science tools and analysis
sandbox. Data Science tools
read and write data from/to
project sandboxes.
3. Data Scientist can also
access standard dashboards,
reports and KPIs through
Data Virtualisation layer
Data Quality & Profiling
Graphical rendering tools
Dashboards & Reports
Scorecards
Charts & Graphs
Sandbox – Project 3
Sandbox – Project 2
Sandbox – Project 1
Data store Analytical
Processing
Information Management – Logical View: Discovery Lab data flow
General BI flow
Rapid Development Sandboxes
Analysis Processing & Delivery
Development Environment Tooling
Pre-Built
Intelligence
Assets
Intelligence
Analysis
Tools
Ad Hoc Query & Analysis Tools
OLAP Tools
Forecasting & Simulation Tools
Reporting Tools
BICC
Virtualisation & Information Services
Data Factory flow
1. The majority of BI development
activity will be from existing
sources – developing new
reports to existing or new
channels.
2. BICC or other expert users
may quickly develop new
reporting through mashups
from any available sources.
Careful governance is required
once the report is completed to
ensure data and report are
professionally managed.
Dashboards & Reports
Scorecards
Charts & Graphs
Sandbox – Project 3
Sandbox – Project 2
Dev Sandbox – Project 1
Information Management – Logical View: Discovery Lab data flow
Data Reservoir & Enterprise Data
General BI flow
R/T event Engine – Logical View and Components
Real-time
Data Engine
To Event Subscribers
(Events / Data)
Privacy Filter
Data Transform
Rules & Models
Mediation
Next Best Action
Real-Time
Data Store
From Input Events
Reference
Data Models
& Rules
Privacy
Data
Analytics
Real-Time Data Engine – Logical View
Business Activity Monitoring
Real-Time event
monitoring
Real-Time Data Engine
Message mediation service
Privacy filter for event data, i.e. apply customer-specified privacy
and preference filters to the data stream
Transformation of the message data to outbound form
Apply declarative rules and models to the data stream to detect
events for further downstream processing
Next Best Activity (NBA) event detection and processing. NBA
typically also includes control group management and global
optimisation of rules
Business Activity Monitoring
Local data store – local persistence of rules and metadata
Components
Privacy Filter
Data Transform
Rules & Models
Mediation
Next Best Action
Real-Time Data
Store
BAM
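The component chain above (mediation, privacy filter, transform, rules) can be sketched as a toy event pipeline. All names, fields and the single declarative rule are hypothetical; a real engine would also cover NBA, control groups and BAM:

```python
# mediation -> privacy filter -> transform -> rules, producing outbound events

OPTED_OUT = {"cust42"}                      # customer-specified opt-outs

def privacy_filter(events):
    """Drop events for customers who have opted out."""
    return (e for e in events if e["cust"] not in OPTED_OUT)

def transform(events):
    """Transform message data to the outbound form."""
    for e in events:
        yield {**e, "amount_gbp": e["amount_pence"] / 100}

def apply_rules(events):
    """Declarative rule: flag high-value events for downstream processing."""
    return [e for e in events if e["amount_gbp"] > 50]

inbound = [
    {"cust": "cust1", "amount_pence": 7500},
    {"cust": "cust42", "amount_pence": 9900},   # filtered: opted out
    {"cust": "cust7", "amount_pence": 1200},
]
actionable = apply_rules(transform(privacy_filter(inbound)))
print([e["cust"] for e in actionable])  # ['cust1']
```

Note the ordering: the privacy filter sits ahead of rules and models, so opted-out customers never reach the decisioning stages at all.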
Mapping from the previous release of the architecture
Oracle’s Information Management Reference Architecture (3rd Edition)
More relevant to Big Data oriented audience
Better representation of pragmatic customer projects
Includes Raw data store as part of the architecture
Shows the effort / cost to store and interpret data that separates
schema-on-read and schema-on-write approaches
Aligned to Analytics 3.0
Consistent with Oracle’s engineering efforts
What’s changed?
Oracle’s Information Management Reference Architecture (3rd Edition)
“All those layers and definitions in your
Reference Architecture, I just don’t get
it… and it looks complicated !”
Hadoop developer knee-deep in complex MapReduce code
What’s changed?
Business Trends
Technology Trends
Data Trends
Information Management Reference Architecture – Version 2.0 of the Architecture
Information Management Reference Architecture
Interpretation layer
shows the relative cost
of reading data
depending on its
location
Previous staging layer
now split into Data
Ingestion and Raw
store.
Ingestion layer
includes methods and
processes to load data
and manage Data
Quality. Shape
represents the relative
cost of these
processes. i.e. from
none for HDFS to lots
in APL.
Raw Reservoir is
typically at the lowest
level of grain. Often
lower than the
enterprise cares about
and so may not have
been included in
previous
representation.
Renamed from
Knowledge Discovery
to Discovery Lab but
otherwise unchanged.
The role of Discovery
Labs is becoming
more central though so
additional operational
guidance will be
added.
Discovery Lab
Still an immutable
store but may be
physically
implemented in
relational or non-relational technologies
Key differences from 2.0 to 3.0 of the Architecture