The evolution of big data and information management: a reference architecture
DESCRIPTION
Oracle's Big Data reference architecture and the evolutionary development of Information Management
TRANSCRIPT
Information Management Reference Architecture – 3rd Evolution
EMEA Enterprise Architecture
Contents
Introduction Conceptual view Design Patterns IM Logical view and component outline Discovery Lab R/T Event Engine logical view Mapping to previous Reference Architecture release
Introduction
Introduction
This PPT documents the main architectural components of Oracle’s Information Management Reference Architecture.
The architecture is intended to be practical and pragmatic; many of the ideas and experiences that inform the approach date back almost 20 years within Oracle and are based on real-world customer experience.
We define Information Management to mean the following. Please note that this definition embraces all types and forms of data as well as embracing aspects such as Information Discovery and Governance:
“Information Management is the means by which an organisation maximises the efficiency with which it plans, collects, organises, uses, controls, stores, disseminates, and disposes of its Information, and through which it ensures that the value of that information is identified and exploited to the maximum extent possible”
3rd Evolution of Oracle’s Information Management Reference Architecture
Oracle’s Information Management Reference Architecture (3rd Edition)
• More relevant to a Big Data oriented audience
• Better representation of pragmatic customer projects
• Includes a Raw data store as part of the architecture
• Shows the effort / cost to store and interpret data that separates schema-on-read and schema-on-write approaches
• Aligned to Analytics 3.0
• Consistent with Oracle's engineering efforts
What’s changed?
Aligning analytical requirements and IM architecture: Enabling Analytics 3.0 with a pragmatic architecture
Analytics 1.0
• Reporting with limited use of descriptive analytics
• Limited range of tabular data
• Batch oriented analysis
• Analysis bolted onto a limited set of business processes

Analytics 2.0
• Firms “Competing on Analytics”
• Extended analytics to larger and less structured datasets
• Emergence of Big Data into the commercial world
• Recognition of the Data Science role in commercial orgs.

Analytics 3.0
• Platform for monetisation
• Deeper analysis & more data
• Faster test-do-learn iterations
• Different types of data & wider business process coverage
• Analysts focus on discovery and driving business value
• “Agile” with operational elements incorporated into design patterns

Adapted from Tom Davenport material
Oracle’s Information Management Reference Architecture (3rd Edition)
“All those layers and definitions in your Reference Architecture, I just don’t get it… and it looks complicated!”
Hadoop developer knee deep in complex MapReduce code
What’s changed?
Business Trends
Technology Trends
Data Trends
Conceptual View
Components: Event Engine · Data Reservoir · Data Factory · Enterprise Information Store · Reporting · Discovery Lab
Inputs: Input Events · Events & Data
Outputs: Actionable Events · Actionable Information · Actionable Insights · Discovery Output
Divided into Execution and Innovation
Conceptual View
Structured Enterprise Data · Other Data
Component Outline
Data Engine – Respond to R/T events in an appropriate and/or optimised fashion
Data Reservoir – Raw data reservoir; typically event data at the lowest grain
Data Factory – Managed ETL onto, within and between platforms
Enterprise Data – Data stores for Information Management
Reporting – BI tools and infrastructure components
Discovery Lab – Platform, data and tools to support the discovery process
Execution – things you do every day
Innovation – innovation to drive tomorrow's business
Line of Governance!
Discovery Output – possible outputs include new knowledge, mining models / parameters, scored data…
Design Patterns
Design Pattern: Discovery Lab
• Specific focus on identifying commercial value for exploitation
• Small group of highly skilled individuals (aka Data Scientists)
• Iterative development approach – data oriented, NOT development oriented
• Wide range of tools and techniques applied
• Data provisioned through the Data Factory or own ETL
• Typically separate infrastructure, but could also be a unified Reservoir if resource managed effectively
Design Pattern: Information Platform
• Build the next generation Information Management platform
• Either a Business Strategy driven or an IT cost / capability driven initiative
• Initial project may be specifically linked to lower data grain or retention, BUT it is the platform as a whole that forms the solution required
• Platform for consolidating other IM assets onto
• Key issues relate to differences in procurement, development process, governance and skills
• Discovery Lab may be implemented as a pragmatic initial POV
Design Pattern: Data Application
• Big Data technologies applied to a specific business problem, e.g. genome sequence analysis using BLAST, or log data from pharmaceutical production plant and machinery required for traceability
• Limited or no integration to the broader Information Management estate
• Specific solution, so non-functional requirements have less impact on solution quality or long-term costs
• Platform costs and scalability are important considerations
Design Pattern: Information Solution
• Specific solution based on Big Data technologies requiring broader integration to the wider Information Management estate, e.g. an ETL pre-processor for the DW, or affordably storing a lower level of grain
• Non-functional requirements are more critical in this solution
• Scalable integration to the IM estate is an important factor for success
• Analysis may take place in the Reservoir, or the Reservoir may only be used as an aggregator
Design Pattern: Real-Time Events
• Real-Time optimisation of events
• May take place at multiple locations between the place of data origination and the Data Centre – requiring careful design and implementation
• May include Next-Best-Activity, declarative rules and Data Mining technologies to optimise decisions, i.e. optimise across declarative, data mining, customer preference & business-defined rules
• May include considerations for personal preferences and privacy (e.g. opt-out) for customer related events
• Common component seen across many industries & markets, e.g. connected vehicle
Design pattern against component usage map

Design patterns: Discovery Lab · Information Platform · Data Application · Information Solution · R/T Events

Outline:
– Discovery Lab: Data science lab; assess the value of the data
– Information Platform: Next generation information platform to align IM capability with business strategy
– Data Application: Addressing a specific data problem in Hadoop with no broader integration required
– Information Solution: Addressing a specific data problem but requiring broader enterprise-wide integrations, e.g. ETL pre-processing, an Event Store at lower grain than the existing DW
– R/T Events: Execution platform to respond to R/T events

Examples:
– Discovery Lab: Gov. Healthcare; Mobile operator
– Information Platform: Spanish Bank (business led); UK Gov. Dept. (tech. led)
– Data Application: Pharma Genome project; Pharma production archive
– Information Solution: Investment Bank – trade risk; Mobile Operator – ETL processing
– R/T Events: Mobile operator – location-based offers

Component usage:
– Data Engine: Possible; Yes
– Data Reservoir: Yes; Yes; Yes
– Data Factory: Yes; Yes; Yes
– Enterprise Data: Yes
– Reporting: Yes
– Discovery Lab: Yes; Implied; Alternative approach to Reservoir + Factory above
IM Logical View and Components
Information Management – Logical View: Data Sources
Data Ingestion
Methods and process to load data into our managed data store and manage data quality
• Contemporary Information Management solutions must be able to ingest any type of data from any source, in any format, via any mechanism and at any frequency, e.g. flat file loads, streaming…
• The data may be highly unstructured, mono-structured or highly poly-structured.
• Data will vary in volume and in Data Quality.
• Operational isolation should be considered to ensure operational applications will continue in the event of the loss of the Information Management system.
Data Engines & Poly-structured sources: Content, Docs, Web & Social Media, SMS
Structured Data Sources: Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM
Information Management – Logical View: Information Ingestion
Data Ingestion – methods and process to load data and manage Data Quality (Load)
Managed Data – all data under management
Information Interpretation – methods and process needed to access information (Query)
• Data structure and processing required to load data into managed data stores
• Shape represents the work done on the data to load data and/or process between layers
• Layer may include a file mechanism where required to facilitate loading (e.g. Fuse fs or ZFS for operational isolation and file concat)
• Normal rules of micro-batch, taking all the data, and KISS principles recommended
• DQ and loading stats presented through BI dashboards as a non-judgemental mechanism to improve DQ
• Data may be landed in the Ingestion layer to facilitate loading but is not typically stored for any length of time, e.g. raw data loaded from web logs but sessionised data then loaded to Raw. Another example is data used to manage CDC, which may be stored in this layer.
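The micro-batch, "take all the data" and non-judgemental DQ principles above can be sketched as follows. This is a minimal illustration, not part of the architecture: the field names and the DQ check are hypothetical.

```python
import csv
import io

def micro_batch_load(raw_csv, target):
    """Load one micro-batch: accept every record (KISS, take all the data)
    and record DQ stats for BI dashboards instead of rejecting rows."""
    stats = {"rows": 0, "missing_customer_id": 0}
    for row in csv.DictReader(io.StringIO(raw_csv)):
        stats["rows"] += 1
        if not row.get("customer_id"):      # hypothetical quality check
            stats["missing_customer_id"] += 1
        target.append(row)                  # nothing is dropped
    return stats

batch = "customer_id,amount\n42,9.99\n,3.50\n"
loaded = []
print(micro_batch_load(batch, loaded))  # {'rows': 2, 'missing_customer_id': 1}
print(len(loaded))                      # 2 – all rows kept
```

The point of the sketch is that quality problems are surfaced as statistics rather than used to reject data at the door.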
Information Management – Logical View: Data Interpretation
• Methods and processes required to access information in each of the stores
• Shape represents the cost of interpreting the data under management
• For schema-on-read the cost may include the Avro, SerDe or reader class as well as the associated processing code to select, filter and process the data.
• For schema-on-write the cost is represented by the complexity of the SQL required to access the data only – typically more complex for 3NF than for a dimensional query.
Information Management – Logical View: Data Layers – cost, quality and concurrency trade off
Managed Data
Access & Performance Layer
Foundation Data Layer
Raw Data Reservoir
Immutable raw data reservoir – raw data at rest is not interpreted
Immutable modelled data. Business Process Neutral form. Abstracted from business process changes
Past, current and future interpretation of enterprise data. Structured to support agile access & navigation
• Increasing enrichment
• Increasing data quality
• Reducing concurrency costs
• Data under management includes 3 key layers: Raw, Foundation, and Access and Performance.
• Data is normally loaded into the Raw and Foundation layers, BUT BI Apps loads data directly into the APL, and federated warehouses may well also load data at aggregate level from federated operating companies.
• The Data Factory is responsible for loading and then managing data between layers.
• Work is done to elevate the data between layers – typically further enriching and improving data quality.
• Work done in processing the data between the layers significantly reduces query costs, i.e. higher levels of concurrency can be sustained for the same processing power.
• Increasing formalisation of definition
Information Management – Logical View: Data Layers – Analytical processing
• Analytical processing capabilities of Hadoop and the RDBMS are used to elevate data between layers as previously described.
• These analytical capabilities can also be leveraged by tools that access the data directly – typically by a Data Scientist for Discovery Lab operations, or by BI tools and services processing data using a model previously defined by the Data Scientist.
Analytical processing capabilities include: OLAP, Data Mining, Statistics, Text Mining, Image Processing and other analytical processing.
Information Management – Logical View: Data Layers – Raw Data Reservoir
• Immutable data store with data at the lowest level of grain.
• Typically implemented in Hadoop or NoSQL for cost reasons, but not always.
• May be:
– queried directly,
– used to derive base-level data for the Foundation Layer. Data may be represented logically in Foundation, or physically, as the store is immutable – BUT this affects ILM policy,
– or used to derive values or aggregates for the Access and Performance layer (e.g. a propensity score or total monthly SMSs).
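The "derive aggregates for the Access and Performance layer" idea can be made concrete with a toy sketch; the event shape (`msisdn`, `month`, `type`) is an invented example, not a prescribed schema.

```python
from collections import Counter

def monthly_sms_totals(raw_events):
    """Derive an APL aggregate (total monthly SMSs per subscriber) from
    immutable raw events; the reservoir itself is never modified."""
    totals = Counter()
    for ev in raw_events:
        if ev["type"] == "sms":
            totals[(ev["msisdn"], ev["month"])] += 1
    return dict(totals)

raw = [
    {"type": "sms",  "msisdn": "07700-900001", "month": "2013-10"},
    {"type": "sms",  "msisdn": "07700-900001", "month": "2013-10"},
    {"type": "call", "msisdn": "07700-900001", "month": "2013-10"},
]
print(monthly_sms_totals(raw))  # {('07700-900001', '2013-10'): 2}
```

Because the raw store is immutable, an aggregate like this can always be thrown away and recomputed.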
Information Management – Logical View: Data Layers – Foundation Data Layer
• Immutable, integrated and standardised store of enterprise-class data – the things the business has agreed on and organises around.
• Data at the lowest level of grain of value for Enterprise data.
• Stored in a business-process-neutral fashion to avoid data maintenance tasks to keep in step with current business interpretations.
• Typically close to 3NF. Special attention to modelling hierarchies, flexible entity attributions, customer / supplier etc.
• ONLY implemented in relational technology, BUT this could be logical, as previously noted for the Raw Data Reservoir.
• May be queried directly by a select few individuals. Wider access to detail data is provided through views in the APL, often with VPD implemented to prevent queries to antecedent data.
• Data in the Foundation Layer should be retained for as long as possible.
• Consideration should be given to retaining data in the Raw Data Reservoir rather than archiving.
Information Management – Logical View: Data Layers – Access and Performance Layer
• This layer facilitates access, navigation and performance of queries.
• Allows for multiple interpretations of data from the Foundation layer or Raw Data Reservoir.
• Most structures can be thrown away and rebuilt from scratch based on Foundation and the Raw Reservoir.
• The exception is derived and aggregate data, which may have to be retained if the underlying data/mechanism is archived.
• Most users presenting information in a standardised fashion on dashboards and reports will access this layer only.
• Data destined for the Raw Data Reservoir may be loaded directly (e.g. through Flume) or may be stored temporarily in fs prior to loading (e.g. Fuse fs)
• Relational data is ingested via the most appropriate mechanism before persisting in the Foundation Data Layer (usual rules apply…)
• Ideally micro-batch, using the simplest mechanism possible
• Only data of agreed quality is loaded into the FDL
• For efficient relational loading, data may be pre-staged in fs so a large number of small files can be concatenated
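The small-file concatenation point above can be sketched as a simple batching step. The 128 MB threshold is an arbitrary illustration (chosen to echo a typical HDFS block size), not a recommendation.

```python
def concat_small_files(files, max_batch_bytes=128 * 1024 * 1024):
    """Pre-stage many small files into a few large batches so the
    relational loader sees fewer, bigger inputs. `files` is a list of
    (name, size_in_bytes) pairs."""
    batches, current, size = [], [], 0
    for name, nbytes in files:
        if current and size + nbytes > max_batch_bytes:
            batches.append(current)     # close the current batch
            current, size = [], 0
        current.append(name)
        size += nbytes
    if current:
        batches.append(current)
    return batches

files = [("log_%03d.csv" % i, 60 * 1024 * 1024) for i in range(5)]  # 5 x 60 MB
print(concat_small_files(files))  # three batches of 2, 2 and 1 files
```

The same shape of logic applies whether the staging area is a local fs, Fuse fs or HDFS.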
Information Management – Logical View: Data Factory ingestion flow
Data Ingestion (Batch & Real-Time): ETL / ELT, CDC, Stream, File Ops.
Flow shown:
1. Data to be formalised from the HDFS store is extracted and loaded into the Foundation Data Layer, e.g. where Flume/HDFS is being used as an ETL pre-processor for Enterprise Data, or where HDFS data is being logically modelled in the foundation layer.
2. Data is re-structured and/or aggregated to facilitate access by users and business processes.
3. Data may also be re-structured and/or aggregated from the HDFS store where there are no specific requirements to manage Enterprise Data in a more formal data store over time.
Information Management – Logical View: Data Factory intra-data processing flow
Information Management – Logical View: Information Provisioning – BI & Data Science Components
Virtualisation & Query Federation
Enterprise Performance Management
Pre-built & Ad-hoc BI Assets
Information Services
• Data Virtualisation and the various components used to access the data are as per our previous view on BI tools.
• Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future-state roadmap.
• Big Data has focused considerable attention on Data Science.
• Analytical capabilities are delivered through analytical processing in the data layers, with Advanced Analytical Tools used to drive capabilities.
• Data Mining in particular often involves complex data processing to flatten data into a longitudinal form. This derived data and the model results are typically written to a project-based sandbox.
• Agile discovery is often best served through a separate Discovery Lab infrastructure (see later details).
Data Science
Information Management – Logical View: Information Provisioning – BI flows
1. Typical access mechanism for Enterprise data is via Access and Performance layer structures.
2. Access to Foundation Layer data is for specific functions, processes and users only.
3. Data interpretation & DQ are assured through encoded logic, Avro, SerDe, FileReader, HCat etc.
4. Diagonal flows show how data can be joined between layers as well as accessed directly, e.g. Raw data can be queried directly through the Hive connector, or joined to the RDBMS data and queried.
Information Management – Logical View: Data / Information Quality
Quality of data at rest is assured by a number of factors in addition to the underlying quality of data at source:
– File and event handling to ensure data is not missed (e.g. missing log files detected by log file sequence numbering)
– The processing of data between the Raw and FDL / APL layers. This can be seen as a DQ firewall to ensure only data of known and acceptable quality is loaded. Typically this involves an element of synchronisation, as some data will need to be held off until the required reference data is available, due to the micro-batch incremental loading approach.
Quality of information presented to downstream tools and services is determined by:
– Model quality, understanding and performance of provisioning from modelled layers
– Consistency of definition, code quality and query performance when accessing Hadoop data (e.g. HR code, Avro definition…)
Information Management – Logical View: Data Reservoir & Enterprise Information Store
Discovery Lab
Analysis Processing & Delivery
Discovery Lab & Data Science Tooling
Data Reservoir & Enterprise Data
Data Science (primary toolset): Statistics Tools · Data & Text Mining Tools · Faceted Query Tools · Programming & Scripting · Data Modeling Tools · Query & Search Tools
Pre-Built Intelligence Assets · Intelligence Analysis Tools · Ad Hoc Query & Analysis Tools · OLAP Tools · Forecasting & Simulation Tools · Reporting Tools
Data Scientist
Virtualisation & Information Services
Data Factory flow:
1. The Data Factory is responsible for access provisioning to data, or replication (all or a sample) to a Sandbox in the Discovery Lab.
2. Direct connection from Data Science tools to the analysis sandbox. Data Science tools read and write data from/to project sandboxes.
3. The Data Scientist can also access standard dashboards, reports and KPIs through the Data Virtualisation layer.
Data Quality & Profiling · Graphical rendering tools · Dashboards & Reports · Scorecards · Charts & Graphs
Sandboxes: Project 1 · Project 2 · Project 3 (each a data store with analytical processing)
Information Management – Logical View: Discovery Lab data flow
R/T Event Engine – Logical View and Components
Real-Time Data Engine
From Input Events
Privacy Filter · Data Transform · Rules & Models · Mediation · Next Best Action · Real-Time Data Store
To Event Subscribers (Events / Data)
Reference Data · Models & Rules · Privacy Data · Analytics
Real-Time Data Engine – Logical View
Business Activity Monitoring
Real-Time event monitoring
Real-Time Data Engine
• Message mediation service
• Privacy filter for event data, i.e. apply customer-specified privacy and preference filters to the data stream
• Transformation of the message data to its outbound form
• Apply declarative rules and models to the data stream to detect events for further downstream processing
• Next Best Activity (NBA) event detection and processing. NBA typically also includes control group management and global optimisation of rules
• Business Activity Monitoring
• Local data store – local persistence of rules and metadata
Components
Privacy Filter
Data Transform
Rules & Models
Mediation
Next Best Action
Real-Time Data Store
BAM
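A minimal sketch of how these components might compose for a single event. The composition order, the dict-shaped events, the opt-out set and the rule functions are all illustrative assumptions, not the engine's actual design.

```python
def process_event(event, opted_out, rules):
    """Privacy filter → data transform → rules & models.
    Returns the outbound event for subscribers, or None if suppressed."""
    # Privacy filter: honour customer opt-out before any processing
    if event.get("customer_id") in opted_out:
        return None
    # Data transform: reshape to the outbound form
    outbound = {"id": event["customer_id"], "kind": event["type"]}
    # Rules & models: declarative rules decide the next best action
    for rule in rules:
        action = rule(event)
        if action:
            outbound["next_best_action"] = action
            break
    return outbound

rules = [lambda e: "send_topup_offer" if e["type"] == "low_balance" else None]
print(process_event({"customer_id": "c1", "type": "low_balance"}, set(), rules))
print(process_event({"customer_id": "c2", "type": "low_balance"}, {"c2"}, rules))  # None
```

Placing the privacy filter first means opted-out customers never reach the rules or NBA stages, which matches the opt-out consideration in the Real-Time Events design pattern.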
Real-Time data engine flows
Describe each of the data flows
Reference Data · Models & Rules · Privacy Data · Event Analytics
From Input Events / To Event Subscribers (Events / Data)
R/T Event Monitoring
To Do
Mapping from the previous release of the architecture
Information Management Reference Architecture: Version 2.0 of the Architecture
Information Management Reference Architecture
Interpretation layer shows the relative cost of reading data depending on its location
Previous staging layer now split into Data Ingestion and Raw store.
Ingestion layer includes methods and processes to load data and manage Data Quality. Shape represents the relative cost of these processes, i.e. from none for HDFS to lots in the APL.
Raw Reservoir is typically at the lowest level of grain. Often lower than the enterprise cares about and so may not have been included in previous representation.
Renamed from Knowledge Discovery to Discovery Lab but otherwise unchanged. The role of Discovery Labs is becoming more central though, so additional operational guidance will be added.
Discovery Lab
Still an immutable store but may be physically implemented in relational or non-relational technologies
Key differences from 2.0 to 3.0 of the Architecture
Discovery Lab and Governance considerations
Data discovery for the Enterprise
Discovery phase:
– Unbounded discovery
– Self-service sandbox
– Wide toolset
– Agile methods
Promotion to Exploitation:
– Commercial exploitation
– Narrower toolset
– Integration to operations
– Non-functional requirements
– Code standardisation & governance
Discovery and monetising steps have different requirements
[Chart: Business Value vs Time / Effort, showing the Discovery phase, Understanding of the data, Governance, and Commercial Exploitation]
To monetise fully you need to standardise
It's smart to standardise as part of Governance
Discovery process requires a broad toolset
Standardisation is essential for Commercial exploitation
Sustainability depends on standardisation / rationalisation
– Reduced training burden– Reduced support costs– Reduced license costs– Ongoing agility & alignment
Data Discovery Toolset Data Exploitation Toolset
Rationalised Components
• Cloudera CDH, Oracle, NoSQL
• Mammoth, YARN, EM plugin
• MR, Hive, Pig, Impala, Accum.
• Flume NG, Oozie
• …
Optional additions:
• Oracle Connectors
• Additional corporate standard components
Oracle standard deployment / Corporate standard
Standardised Hadoop Zoo – Standardised deployment
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
The kind of things we are looking to Discover
Data science skills required vary by the type of analysis.
Data Management skills vary by the amount of data and its structure.
So making data movement and manipulation easy will deliver a better result, and deliver it faster.
[Chart: Analytical Skills vs Business Impact – Descriptive, Diagnostic, Predictive and Prescriptive analysis, moving from Insight to Foresight]
Discovery is a Data process not a Development Process
Three Versions of the BI Development Process
What IT thinks it should be: Requirement Analysis → High Level Design → Low Level Design → Coding → Testing → Acceptance Testing
What normally happens: Excel Spreadsheet → Shared linked spreadsheets → Local Access Database → Shared Access Server → SQL Server Database → Oracle Data Warehouse
What Big Data is trying to achieve: Discovery & Profile → Model → Exploit
Sandboxes facilitate “Agile”: providing the technology platform for agile discovery
Sandbox delivery options
• Separate Data Lab environment
• Delivered as part of the Information Management architecture
Self-Service Sandboxes
• Self-service provisioning of new sandboxes for the Discovery phase
• Automation of data access rights, resources and tools provisioning
Data provision
• Quickly take on new data to rapidly make it available to Analysts
• Tools such as “Data Factory” can fully automate data flows
Monetise and Optimise steps are different
What happens when we want to exploit insights?
• New insights deployed into a business process in some form
– Technical: e.g. business rules, new customer segments
– Non-technical: e.g. observations about behaviours
• Business Intelligence systems adapted to provide monitoring, feedback and control optimisation
• The faster you iterate this cycle, the greater the benefit, BUT Big Data does not change the fundamental need for accurate, consistent and integrated information
New insights ↔ Business Process
Rules of thumb for data: Organised information leads to better analyses
Information needs to be organised in order to analyse it
RDBMS are great when information is organised
Hadoop minimises the penalty for disorganisation
The closer you are to insight, the more complete and organised information needs to be
Data needs to be organised to monetise it effectively
What that really means is…
We need to apply structure to data in order to analyse it.
Schema on read works well for us in Discovery, as we can be agile about interpretation.
As we move beyond Discovery, schema on read can cause Governance & quality issues.
Key lesson: The cost to store & manage is distinct from structural considerations between Big Data and RDBMS technologies
De-mystifying schema on read
DQ · Bus. Rules · Mapping
ETL
Data Reservoirs
Traditional “Schema on Write”
– Data quality managed by a formalised ETL process
– Data persisted in tabular, agreed and consistent form
– Data integration happens in ETL
– Structure must be decided before writing
Big Data “Schema on Read”
– Interpretation of data captured in code for each program accessing the data
– Data quality dependent on code quality
– Data integration happens in code
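The contrast can be made concrete with a toy sketch: on the write side, structure and quality are enforced once at load time; on the read side, every program carries its own interpretation code. The record layout here is invented for illustration.

```python
import json

RAW = '{"cust": "42", "amt": "9.99"}\n{"cust": "", "amt": "oops"}\n'

def _is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

# Schema on write: validate and shape once, at load time (the "ETL").
def load_schema_on_write(raw):
    table = []
    for line in raw.splitlines():
        rec = json.loads(line)
        if rec["cust"] and _is_number(rec["amt"]):   # DQ firewall at load
            table.append({"customer_id": int(rec["cust"]),
                          "amount": float(rec["amt"])})
    return table  # agreed, consistent, tabular form

# Schema on read: store raw lines as-is; every reader re-interprets them.
def read_schema_on_read(raw):
    for line in raw.splitlines():
        rec = json.loads(line)                       # interpretation lives in code
        yield rec if rec["cust"] else None           # DQ depends on this code

print(load_schema_on_write(RAW))       # one clean row survives the load
print(list(read_schema_on_read(RAW)))  # raw kept; quality is the reader's job
```

The sketch shows why schema on read suits Discovery (the raw lines are all still there to reinterpret) while schema on write suits Exploitation (every consumer sees the same agreed shape).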
Underlying storage capabilities are different
[Chart: Hadoop vs Relational vs My Application, scored 0–5 on: Tooling maturity, Stringent Non-Functionals, ACID transactional requirement, Security, Variety of data formats, Data sparsity, ETL simplicity, Cost effectively store low value data, Ingestion rate, Straight Through Processing (STP)]
It's smart to unify your data into a single Reservoir – fully expose your data for discovery and monetisation
An Analytics 3.0 platform includes both relational and non-relational technologies
Ken Rudin* refers to this as the genius of AND vs the tyranny of OR (see his TDWI '13 presentation)
A unified Reservoir simplifies access to all data regardless of characteristics & analysis requirements
All Data
* Ken Rudin is Director of Analytics at Facebook
Information Management – Logical View: Information Provisioning – Analysis Processing & Delivery
Information Management – Logical View: Analytical processing and delivery
Structures and processing required to load data (batch and Real-Time) and manage Data Quality
Structures required to interpret the data under management, i.e. logical interpretation
• Data Virtualisation and the various components used to access the data are as per our previous view on BI tools.
• Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future-state roadmap.
• What has changed is the focus on Analytics.
• Analytical capabilities are delivered through analytical processing in the data layers, with Advanced Analytical Tools used to drive capabilities.
• Data Mining in particular often involves complex data processing to flatten data into a longitudinal form. This derived data and the model results are typically written to a project-based sandbox.
• Agile discovery is often best served through a separate Discovery Lab infrastructure (described later).