Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Page 1: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Copyright © 2016, Oracle and/or its affiliates. All rights reserved. |

CON6624: Oracle Data Integration Platform, A Cornerstone for Big Data

Christophe Dupupet (@XofDup), Director | A-Team

Mark Rittman (@markrittman), Independent Analyst

Julien Testut (@JulienTestut), Senior Principal Product Manager

September 2016


Page 2: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Page 3: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Agenda

1. Oracle Data Integration for Big Data
2. Big Data Patterns
3. A Practitioner’s View on Oracle Data Integration for Big Data
4. Q & A

Page 4: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Five Core Capabilities

1. Business Continuity: DATA ALWAYS AVAILABLE
2. Data Movement: DATA ANYWHERE IT’S NEEDED
3. Data Transformation: DATA ACCESSIBLE IN ANY FORMAT
4. Data Governance: DATA THAT CAN BE TRUSTED
5. Streaming Data: DATA IN MOTION OR AT REST

Page 5: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Eight Core Products

Cloud or On-Premise

Page 6: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Most Innovative Technology

#1 Realtime / Streaming Data Integration Tool

#1 Pushdown / E-LT Data Integration Tool

1st to certify replication with Streaming Big Data

1st to certify E-LT tool with Apache Spark/Python

1st to power Data Preparation with ML + NLP + Graph Data

1st to offer Self-Service & Hybrid Cloud solution

Page 7: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Hybrid Open Source: Open Source at the core of the speed and batch processing engines; Enterprise Vendor tools for connecting to existing IT systems; and Cloud Platforms for the data fabric.

[Reference architecture diagram: data sources (data streams, social and logs, enterprise data, highly available databases) feed a Speed Layer (raw data stream processing) and a Batch Layer (batch processing, prepared data) via pub/sub, REST APIs, NoSQL and bulk data, landing in a Serving Layer that delivers business data to apps and analytics.]

Page 8: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Examples


Reference Architecture

[Reference architecture diagram, annotated with the Oracle products that cover it: GoldenGate, Active Data Guard, Oracle Data Integrator, Data Preparation, Dataflow ML, Stream Analytics, Connectors, and Data Quality, Metadata Management & Business Glossary, across the speed, batch, and serving layers.]

Comprehensive architecture covers key areas – #1 Data Ingestion, #2 Data Preparation & Transformation, #3 Streaming Big Data, #4 Parallel Connectivity, and #5 Data Governance – and Oracle Data Integration has them covered.

Page 9: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Oracle GoldenGate

Realtime Performance

Extensible & Flexible

Proven & Reliable

Oracle GoldenGate provides low-impact capture, routing, transformation, and delivery of database transactions across homogeneous and heterogeneous environments in real-time with no distance limitations.

[Diagram: data events and transaction streams captured from most databases and delivered to cloud, database, and big data targets.]

Supports Databases, Big Data and NoSQL.

* The most popular enterprise integration tool in history
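As an illustration of the delivery side, GoldenGate for Big Data can publish captured change records to targets such as Kafka in JSON form. The sketch below is a minimal, hypothetical Python consumer of such a topic; the topic name and payload field names are assumptions, not GoldenGate's fixed format.

```python
# A minimal sketch (assumptions: topic name "gg_orders", JSON payload fields
# "table", "op_type", "after") of consuming change records that GoldenGate
# for Big Data has delivered to Kafka. Requires: pip install kafka-python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "gg_orders",                               # hypothetical topic fed by GoldenGate
    bootstrap_servers="broker:9092",           # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for record in consumer:
    change = record.value
    # Print the source table, operation type and the after-image of the row,
    # if those fields are present in the payload.
    print(change.get("table"), change.get("op_type"), change.get("after"))
```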

Page 10: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


GoldenGate for Ingest

[Diagram: GoldenGate capture, trail, pump, route, and deliver processes move user and DBMS updates from applications and a databus into the speed layer and batch layer; streaming analytics runs on the speed layer, and the serving layer exposes results through REST services, visualization tools, reporting tools, and data marts.]

Page 11: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Oracle Data Preparation

Self-Service: zero software to install, easy-to-use browser-based interface

Better Recommendations: better automation and less grunt work for humans

Built-in Data Graph: graph database of real-world facts used for enrichment

[Diagram: files flow through Data Preparation and ETL into reporting and apps.]

Oracle Data Preparation is a self-service tool that makes it simple to transform, prepare, enrich and standardize business data – it can help IT accelerate solutions for the Business by giving control of data formatting directly to data analysts.

Page 12: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


MONTHS of effort spent on each new dataset

PROGRAMMERS writing scripts or complex ETL

DATA WRANGLING wastes time and money

“Big Data’s dirty little secret is that 90% of time spent on a project is devoted to preparing data… After all the preparation work, there isn’t enough time left to do sophisticated analytics on it…” Thomas H. Davenport

BDP for Data Preparation

[Diagram: unstructured data (internet logs) and structured data flow toward discovery & visualization, enterprise reporting, and enterprise ETL & data integration; the business value opportunity is delayed by weeks or months of data wrangling ("I want my data!!").]

Page 13: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Oracle Data Integrator

Bulk Data Performance

Non Invasive Footprint

Future Proof IT Skills

Oracle Data Integrator provides high performance bulk data movement, massively parallel data transformation using database or big data technologies, and block-level data loading that leverages native data utilities

[Diagram: bulk data movement and transformation from most apps, databases, and cloud sources to cloud, database, and big data targets.]

1000s of customers – more than other ETL tools

Flexible ELT workloads run anywhere: DBs, Big Data, Cloud

Up to 2x faster batch processes and 3x more efficient tooling
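To make the bulk-movement-into-big-data idea concrete, here is a minimal PySpark sketch of a parallel JDBC extract from a relational source into a Hive staging table, the kind of work an ODI bulk mapping would orchestrate. The connection URL, credentials, table names, the "staging" database, and the partitioning bounds are placeholders, and a suitable JDBC driver is assumed to be on the Spark classpath.

```python
# Minimal PySpark sketch of a parallel bulk extract into a Hive staging table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bulk_extract_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Parallel JDBC read: Spark splits the extract into numPartitions ranges on
# the partition column so the bulk movement runs across the cluster.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   # hypothetical source
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "change_me")
          .option("partitionColumn", "ORDER_ID")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())

# Land the extract in the batch layer as a Parquet-backed Hive table.
orders.write.mode("overwrite").format("parquet").saveAsTable("staging.orders")
```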

Page 14: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


ODI for Transformations

[Diagram: Oracle Data Integrator orchestrates ETL engines and big data frameworks – Spark Streaming, Spark SQL, Sqoop, Oozie, Pig, Hive, loaders, Kafka, NoSQL, OGG, and SQL – across the speed, batch, and serving layers of the reference architecture, from ERP and application sources through to REST services, visualization tools, reporting tools, and data marts.]

Page 15: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Business Value of ODI: Only Tool with Portable Mappings

• No ETL engine is required
• Separation of logical and physical design
• Physical execution on SQL, Hive, Pig, or Spark
• Runtime execution in Oozie or via the ODI Java Agent
• Rich set of pre-built operators
• User-defined functions
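The portable-mappings idea is that one logical design can be rendered on different physical engines. As a rough illustration only (not ODI-generated code), the same join-and-aggregate logic appears below twice in PySpark: once as SQL that could be pushed down to Hive/Spark SQL, and once as DataFrame operations for a Spark engine. Table and column names are hypothetical.

```python
# Illustrative only: one logical join-and-aggregate rendered two ways.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("portable_mapping_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Rendering 1: SQL, the form that could be pushed down to Hive / Spark SQL.
sql_result = spark.sql("""
    SELECT c.customer_id, c.region, SUM(o.amount) AS total
    FROM   staging.orders o
    JOIN   staging.customers c ON o.customer_id = c.customer_id
    WHERE  o.status = 'SHIPPED'
    GROUP  BY c.customer_id, c.region
""")

# Rendering 2: the same logic as DataFrame operations on a Spark engine.
orders = spark.table("staging.orders")
customers = spark.table("staging.customers")
df_result = (orders.filter(orders.status == "SHIPPED")
                   .join(customers, "customer_id")
                   .groupBy("customer_id", "region")
                   .agg({"amount": "sum"}))

# Both renderings express the same logical mapping; only the physical
# execution differs.
```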

Page 16: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Business Friendly

Extreme Performance

Spatial Awareness

Oracle Stream Analytics

[Diagram: event data and transaction streams from databases, web, and devices flow through Stream Analytics to downstream targets (e.g. Hadoop).]

Oracle Stream Analytics is a powerful analytic toolkit designed to work directly on data in motion – simple data correlations, complex event processing, geo-fencing, and advanced dashboards run on millions of events per second.

• Innovative dual model for Apache Spark or Coherence grid
• Simple-to-use spatial and geo-fencing features, an industry first
• Includes Oracle GoldenGate for streaming transactions
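For a feel of the kind of continuous query such a streaming engine runs, here is a minimal Spark Structured Streaming sketch (not Oracle Stream Analytics itself) that counts events per device over one-minute windows from a Kafka topic. The broker address, topic name, and JSON field are assumptions, and the spark-sql-kafka connector package must be on the classpath.

```python
# Minimal Structured Streaming sketch: windowed counts over a Kafka event stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream_agg_sketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "device_events")               # hypothetical topic
          .load())

# Pull the device id out of the JSON payload and keep the Kafka timestamp.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.device_id").alias("device_id"),
    F.col("timestamp"))

# Windowed count with a watermark so late events are bounded.
counts = (parsed
          .withWatermark("timestamp", "2 minutes")
          .groupBy(F.window("timestamp", "1 minute"), "device_id")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```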

Page 17: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Stream or Batch Data

Spark-based Pipelines

ML-powered Profiling

Oracle Dataflow ML: Oracle Dataflow ML is a big data solution for stream and batch processing in a single environment – Lambda-based applications that can run streaming ETL for cloud-based analytic solutions.

• Batch and stream processing at the same time
• Machine learning guides users for data profiling
• Data movement across Oracle PaaS services

[Diagram: bulk data movement and streaming data from most apps, databases, and cloud sources feed a big data pipeline targeting cloud, database, and big data platforms.]
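The "stream and batch in a single environment" idea can be illustrated, outside of Dataflow ML itself, with Spark's unified DataFrame API: one transformation function reused on a batch path and a streaming path. Paths, column names, and checkpoint locations below are hypothetical.

```python
# Sketch: one transformation reused on a batch path and a streaming path.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream_and_batch_sketch").getOrCreate()

def enrich(df: DataFrame) -> DataFrame:
    # One transformation definition used by both paths: flag large
    # transactions and normalise the currency code.
    return (df.withColumn("is_large", F.col("amount") > 1000)
              .withColumn("currency", F.upper(F.col("currency"))))

# Batch path: historical files already landed in the reservoir.
raw_batch = spark.read.json("/data/raw/transactions/")
enrich(raw_batch).write.mode("overwrite").parquet("/data/prepared/transactions/")

# Streaming path: the same logic applied to newly arriving files.
raw_stream = (spark.readStream
                   .schema(raw_batch.schema)        # reuse the inferred schema
                   .json("/data/raw/transactions_incoming/"))
(enrich(raw_stream).writeStream
                   .format("parquet")
                   .option("path", "/data/prepared/transactions_stream/")
                   .option("checkpointLocation", "/tmp/chk/transactions")
                   .start())
```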

Page 18: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Streaming Data

[Diagram: streaming data from devices and applications, and change data from databases, flow through Oracle GoldenGate, Oracle Stream Analytics, and Oracle Dataflow ML via a databus into the speed and batch layers; the serving layer exposes results through REST services, visualization tools, reporting tools, and data marts.]

Page 19: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

Business Glossary

End-to-End Lineage

100+ Supported Systems

Oracle Metadata Management: Oracle Metadata Management provides an integrated toolkit that combines business glossary, workflow, metadata harvesting and rich data steward collaboration features.

Supports Databases, Big Data, ETL Tools, BI Tools etc:

BI Report Lineage

Taxonomy Lineage

Data Model Lineage

Page 20: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


OEMM for Data Governance

[Diagram: Oracle Enterprise Metadata Management harvests metadata from across the architecture – Kafka, generated streaming and ETL code, Sqoop, OLTP databases, HDFS files, HCatalog, Hive, NoSQL, ETL tools, data warehouses, BI models, and ER models – into a data catalog spanning the speed, batch, and serving layers.]

140+ Supported Tools

Page 21: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Eight Core Products

Cloud or On-Premise

Page 22: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Agenda

1. Oracle Data Integration for Big Data
2. Big Data Patterns
3. A Practitioner’s View on Oracle Data Integration for Big Data
4. Q & A

Page 23: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Leverage Wide Range of Modern Analytic Styles

4 Business Patterns of Big Data Customer Adoption


[Diagram: a DBMS (on-prem or cloud) alongside Hadoop zones for staging, sandbox, ETL offload, and deep data storage.]

1. Analytic Data Sandbox:
– Stakeholder: Functional Line of Business (LoB)
– Core Value: Faster access to business data, faster time to value on Analytics
– Innovation: Schema-on-read empowers rapid data staging and true Data Discovery

2. ETL Offload:
– Stakeholder: Information Technology (IT)
– Core Value: Cost avoidance on DW/Marts
– Innovation: YARN/Hadoop empowers lower cost compute and lower cost storage

3. Deep Data Storage:
– Stakeholder: Risk / Compliance (LoB)
– Core Value: High fidelity aged data
– Innovation: SQL on Hadoop engines enable very low cost, queryable data access

4. Streaming:
– Stakeholder: Marketing (LoB) / Telematics (LoB)
– Core Value: New Data Services or Higher Click Rates
– Innovation: MPP-capable streaming platforms combined with modern in-motion analytics

Analytic styles: Data First Analytics, Model First Analytics, In-Motion Analytics (Streaming).

Page 24: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Discovery, Exploratory and Visualization Style Analytics: Oracle Endeca, Big Data Discovery; Tableau, Qlik, Spotfire; Datameer, etc.

Business Intelligence, Reporting and Dashboard Style Analytics: Oracle BIEE, Visual Analyzer; Cognos, SAS, MicroStrategy; Business Objects, Actuate, etc.

Analytic Data Sandbox


Analytic Data Sandbox:
– Stakeholder: Functional Line of Business (LoB)
– Core Value: Faster access to business data, faster time to value on Analytics
– Innovation: Schema-on-read empowers rapid data staging and true Data Discovery
– Industries: All industries

Supports “Data First” Style of Analytics:
– No schema required
– Staging data is simple and fast
– Minimal data preparation required (mainly for un/semi-structured data sets)

Typical Customer Data Types / Sets:
– Usually bringing in Structured Data from OLTP (primary data is their existing Application data)
– Often bringing in Semi-Structured data (secondary data is clickstream, logs, machine data)
– Business value is usually in the combination of the various data sets and the improved speed of discovery

[Diagram: data lands in the staging/sandbox zone alongside the DBMS (on-prem or cloud); data-first analytics and BI self-service run directly on the sandbox.]

Often the data flow may not require any ETL tooling; other data flows may still require ETL as a pipeline.
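Schema-on-read is what makes the sandbox pattern fast: raw files are staged as-is and a schema is applied only at query time. A minimal PySpark sketch of the idea follows; the file path and field names are hypothetical.

```python
# Sketch: schema-on-read over raw JSON staged "as is" in the sandbox zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema_on_read_sketch").getOrCreate()

# No upfront modelling: point Spark at the raw files and let it infer a
# schema at read time.
clicks = spark.read.json("/data/staging/clickstream/")   # hypothetical path

clicks.printSchema()                                      # see what arrived

# Query immediately with nothing more than the inferred schema.
(clicks.filter(clicks.page.isNotNull())
       .groupBy("page")
       .count()
       .orderBy("count", ascending=False)
       .show(20))
```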

Page 25: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Discovery, Exploratory and Visualization Style Analytics: Oracle Endeca, Big Data Discovery; Tableau, Qlik, Spotfire; Datameer, etc.

Business Intelligence, Reporting and Dashboard Style Analytics: Oracle BIEE, Visual Analyzer; Cognos, SAS, MicroStrategy; Business Objects, Actuate, etc.

ETL Offload


[Diagram: DBMS (on-prem or cloud) with Hadoop staging, sandbox, and ETL offload zones.]

2. ETL Offload:
– Stakeholder: Information Technology (IT)
– Core Value: Cost avoidance on DW/Marts
– Innovation: YARN/Hadoop empowers lower cost compute and lower cost storage
– Industries: Teradata, Netezza & Ab Initio customers

Supports “Model First” Style of Analytics:
– Schemas required (for working areas, sources and targets)
– Staging data requires modeled staging tables
– Data preparation required (mapping data sets); un/semi-structured data sets require pre-parsing

Typical Customer Data Types / Sets:
– Usually bringing in Structured Data from OLTP Apps (primary data is their existing Application data)
– Occasionally adding new data types to the EDW schema (secondary data is clickstream, logs, machine data)
– Business value is usually tied to the “cost avoidance” around escalating DW and ETL tooling costs

Data First Analytics | Model First Analytics. The primary data flow requires data integration tools.

Page 26: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Discovery, Exploratory and Visualization Style Analytics: Oracle Endeca, Big Data Discovery; Tableau, Qlik, Spotfire; Datameer, etc.

Business Intelligence, Reporting and Dashboard Style Analytics: Oracle BIEE, Visual Analyzer; Cognos, SAS, MicroStrategy; Business Objects, Actuate, etc.

Deep Data Storage


[Diagram: DBMS (on-prem or cloud) with Hadoop staging, sandbox, ETL offload, and deep data storage zones.]

3. Deep Data Storage:
– Stakeholder: Risk / Compliance (LoB)
– Core Value: High fidelity aged data
– Innovation: SQL on Hadoop engines enable very low cost, queryable data access
– Industries: Insurance and Banking

Typically Deep Storage of Relational Data:
– Schemas required (item detail records, not necessarily aggregates)
– Archival can be “on the way in” as part of routine loading, and also via “periodic” pruning from the EDW and data marts

Popular with SQL on Hadoop and Federation:
– Teradata Query Grid from Teradata/Aster
– IBM Big SQL from Netezza/PureData
– Oracle Big Data SQL from Exadata
– Pivotal HAWQ from Greenplum
– Cisco Composite Software also selling on this use case (in addition to BI Virtualization)

Data First Analytics | Model First Analytics | Pattern mining | Compliance | Queryable archive.

Page 27: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Streaming Big Data Analytics


[Diagram: DBMS (on-prem or cloud) with Hadoop staging, sandbox, ETL offload, and deep data storage zones.]

4. Streaming:
– Stakeholder: Marketing (LoB) / Telematics (LoB)
– Core Value: New Data Services or Higher Click Rates
– Innovation: MPP-capable streaming platforms combined with modern in-motion analytics
– Industries: Automotive, Aerospace, Industrial Manufacturing, some Energy/Oil & Gas

Decisions on Data Before It Hits Disk:
– Data volume may be too high to persist all data: only save the important data (see the sketch below)
– Data may be highly repetitive (sensor data)
– Correlations may need to happen with very low latency requirements based on LoB demand

Key Use Case for “Data Monetization”:
– Customers are standing up new Data Services (e.g. realtime equipment failure alerts and subscription-based monitoring)
– “Connected Car” services from most car makers
– Disaster preparedness centers – Energy/Aerospace
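To illustrate "decisions on data before it hits disk", here is a minimal Spark Structured Streaming sketch that keeps only out-of-range sensor readings and discards the repetitive normal traffic before anything is persisted. The thresholds, directory paths, and schema are hypothetical.

```python
# Sketch: persist only the "important" sensor readings.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("filter_before_disk_sketch").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Raw sensor JSON arriving continuously in a landing directory.
readings = spark.readStream.schema(schema).json("/data/incoming/sensors/")

# Decide on the data before it hits disk: keep only out-of-range readings
# and drop the highly repetitive "normal" traffic.
alerts = readings.filter((F.col("reading") > 90.0) | (F.col("reading") < 10.0))

(alerts.writeStream
       .format("parquet")
       .option("path", "/data/alerts/sensors/")
       .option("checkpointLocation", "/tmp/chk/sensor_alerts")
       .start())
```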

In-Motion Analytics | Streaming | Data First Analytics | Model First Analytics | Pattern mining. Other data flows may still require ETL as a pipeline.

Page 28: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Some Common Themes Across Use Cases


1. Nearly 100% Analytic Use Cases:
– Data Discovery directly in Hadoop
– ETL Offloading for analytics in SQL DB
– Deep Data Storage for analytics in SQL DB
– Streaming Analytics for data before it hits disk – Lambda Arch

2. Nearly all the Data is Structured Data:
– OLTP Sources: every customer starts with the trusted data sets that already drive the majority of business value – App Data
– New Sources: Clickstream Logs, Machine Data and other App Exhaust all have “structure” even if they may not have schema

3. Many more Sources are App/OLTP Sources:
– By Quantity of Sources: most customers have many (dozens or hundreds) of App/OLTP sources they are bringing in
– By Volume: by quantity of data, the amount of Machine Data or Log data may often exceed the OLTP data sets

4. Mainframes Matter:
– High Value Apps: most of the biggest customers are bringing mainframe (DB2/z, IMS, VSAM) data to Hadoop

5. Multiple Projects / Programs using Hadoop:
– Larger Customers: most of the biggest customers have multiple Hadoop projects running in parallel; some are IT led (DW/ETL Offload) and others are LoB led (Discovery/Telematics)

6. Customers are Starting in Phases:
– By Value: IT led vs. LoB led initiatives have different characteristics – even if the “Lake / Reservoir” factors in as a long-term goal, the initial phases are often quite small in scale

7. Size of Hadoop Clusters Varies Widely:
– Investment Sizes Differ (by a lot): some “start” with mega-commitments (1000s of Nodes) and others start very small

8. Commodity H/W Clusters Dominate:
– Commodity: for use cases designed to work across groups
– Appliances: for use cases attached to a single project

9. Data Lakes as a Way to Handle Vendor Diversity:
– Middleware for Data: bigger customers have DWs/DBs from every vendor and 6+ different BI tools; Hadoop is becoming the “canonical” data platform to sit in between

10. Open Source Data Platform is a Strategic Priority:
– Senior Stakeholder Feedback: as a design-point priority for their “next gen”, it is becoming more important that Open Source has a central role to play in the enterprise data platform

11. Industry Clusters:
– 1. Banking, 2. Insurance, 3. Manufacturing, 4. Media, 5. Retail

Page 29: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Agenda

1. Oracle Data Integration for Big Data
2. Big Data Patterns
3. A Practitioner’s View on Oracle Data Integration for Big Data
4. Q & A

Page 30: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


THOUGHTS ON ORACLE DATA INTEGRATION FOR BIG DATA - A PRACTITIONER'S VIEW

Mark Rittman, Oracle ACE Director

ORACLE OPENWORLD 2016, SAN FRANCISCO

Page 31: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)

(C) Mark Rittman 2016 | W: http://www.rittman.co.uk | T: @markrittman

• Oracle ACE Director, blogger + ODTUG member

• Regular columnist for Oracle Magazine

• Past ODTUG Executive Board Member

• Author of two books on Oracle BI

• Co-founder of Rittman Mead, now independent analyst

• 15+ Years in Oracle BI, DW, ETL + now Big Data

• Based in Brighton, UK

About the Presenter


Page 32: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Every engagement and customer discussion has Big Data central to the project
• Hadoop extending traditional DWs through scalability, flexibility, cost, RDBMS compatibility

• Hadoop as the ETL engine driven by ODI Big Data KMs

• New datatypes and methods of analysis enabled by Hadoop schema-on-read

• Project innovation driven by machine learning, streaming, ability to store + keep *all* data

Big Data Technology Core to Modern BI Platforms


• And what is driving the interest in these projects…?

[Diagram: a Data Reservoir on the Oracle Big Data Platform. Operational data (transactions, customer master data) and event, social + unstructured data (voice + chat transcripts) arrive as data streams through a Data Factory built from OGG for Big Data 12c, Oracle Stream Analytics, ODI 12c and Oracle Data Preparation. Raw customer data is stored in the original format (usually files) such as SS7, ASN.1, JSON etc.; mapped customer data sets are produced by mapping and transforming the raw data into an enriched customer profile used for modeling, scoring and machine-learning segments. Oracle Big Data Discovery provides a safe and secure discovery and development environment (data sets and samples, models and programs) feeding marketing/sales applications and Oracle Data Visualization.]

Oracle Big Data Appliance Starter Rack + Expansion:
• Cloudera CDH + Oracle software
• 18 high-spec Hadoop nodes with InfiniBand switches for internal Hadoop traffic, optimised for network throughput
• 1 Cisco management switch
• Single place for support for H/W + S/W

Page 33: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Page 34: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Page 35: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


The Big Data Secret? It’s All About Data Integration

• Data from all the sources will need to be integrated to create the single customer view
• Hadoop technologies (Flume, Kafka, Storm) can be used to ingest events, log data
• Files can be loaded “as is” into the HDFS filesystem
• Oracle/DB data can be bulk-loaded using Sqoop
• GoldenGate for trickle-feeding transactional data
• But the nature of the new data sources brings challenges (see the sketch below):
  • May be semi-structured or unknown schema
  • Joining schema-free datasets
  • Need to consider quality and resolve incorrect, incomplete, and inconsistent customer data
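As a simplified illustration of joining structured and schema-on-read data into a single customer view, the PySpark sketch below joins a replicated customer master table in Hive with raw clickstream JSON landed in HDFS. Table names, paths, and columns are hypothetical.

```python
# Sketch: join structured customer master data (already replicated into Hive,
# e.g. by GoldenGate or Sqoop) with schema-on-read clickstream JSON.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("customer_view_sketch")
         .enableHiveSupport()
         .getOrCreate())

customers = spark.table("reservoir.customer_master")      # structured side
clicks = spark.read.json("/data/raw/clickstream/")         # schema-on-read side

# Aggregate web behaviour and attach it to every customer record.
profile = (clicks.groupBy("customer_id")
                 .agg(F.count("*").alias("page_views"),
                      F.max("event_time").alias("last_seen"))
                 .join(customers, "customer_id", "right"))

profile.write.mode("overwrite").saveAsTable("reservoir.customer_profile")
```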

[Diagram: heterogeneous enterprise and web sources – structured and schema-on-read, including streaming sources with JSON payloads – need integrating, preparation and obfuscation, and application of schema to raw and semi-structured data, to build the single customer view (“who”, “what”, “how”, “why”, chat) and the ML-enriched customer profile.]

Page 36: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Finding raw data is easy; then the real work needs to be done - can be > 90% of project

• Four main tasks to land, prepare and integrate raw data to turn it into a customer profile:

1. Ingest it in real-time into the data reservoir
2. Apply schema to raw and semi-structured data
3. Remove sensitive data from any input files
4. Transform and map into your Customer 360-degree profile

Landing, Preparing and Securing Raw Data is *Hard*


Page 37: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Data enrichment tool aimed at domain experts, not programmers

• Uses machine-learning to automate data classification + profiling steps

• Automatically highlight sensitive data,and offer to redact or obfuscate

• Dramatically reduce the time requiredto onboard new data sources

• Hosted in Oracle Cloud for zero-install
• File upload and download from browser

• Automate for production data loads

Oracle Big Data Preparation Cloud Service

[Diagram: raw data stored in the original format (usually files) such as SS7, ASN.1, JSON etc., plus voice + chat transcripts, are turned into mapped data sets by mapping and transforming the raw data.]

Page 38: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Step 2: Apply Schema to Raw and Semi-Structured Data

[Diagram: batch loads from files and databases are easy; streams from APIs and HTTP are moderate; raw text loaded from blog entries and reviews carries embedded information in unstructured text with no reliable patterns, requiring NLP to extract entities, plus handling of invalid and missing data (e.g. invalid emails) and sensitive data.]
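A minimal PySpark sketch of this step, under assumed field names and paths: impose an explicit schema on semi-structured JSON and route rows with missing keys or obviously invalid emails to a quarantine area. A real pipeline would use richer validation, and the NLP/ML enrichment described above is not shown.

```python
# Sketch: apply an explicit schema to semi-structured JSON and flag bad rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("apply_schema_sketch").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("email", StringType()),
    StructField("comment", StringType()),
])

reviews = spark.read.schema(schema).json("/data/raw/reviews/")   # hypothetical path

# Crude validity flag: the email must match a simple user@domain.tld pattern.
checked = reviews.withColumn(
    "email_valid", F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"))

good = checked.filter(F.col("email_valid") & F.col("customer_id").isNotNull())
bad = checked.subtract(good)                     # everything else is quarantined

good.write.mode("overwrite").parquet("/data/clean/reviews/")
bad.write.mode("overwrite").parquet("/data/quarantine/reviews/")
```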

Page 39: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Step 3: Remove Sensitive Data from Any Input Files

• Automatically profile and analyse datasets
• Use machine learning to spot and obfuscate sensitive data automatically
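For illustration only, the sketch below redacts anything that looks like an email address or phone number from a text file using simple regular expressions; the cloud service itself relies on ML-driven profiling rather than hand-written rules, and the file names and patterns here are hypothetical.

```python
# Illustration only: rule-based redaction of likely emails and phone numbers.
import re

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(line: str) -> str:
    """Replace likely emails and phone numbers with placeholder tokens."""
    line = EMAIL.sub("<REDACTED_EMAIL>", line)
    line = PHONE.sub("<REDACTED_PHONE>", line)
    return line

with open("chat_transcript.txt") as src, open("chat_transcript_redacted.txt", "w") as dst:
    for line in src:
        dst.write(redact(line))
```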

Page 40: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Oracle Data Integration offers a wider set of products for managing Customer 360 data

• Oracle GoldenGate

• Oracle Enterprise Data Quality

• Oracle Data Integrator

• Oracle Enterprise Metadata Management

• All Hadoop enabled

• Works across Big Data, Relational and Cloud

Step 4: Transform, Join + Map into Polyglot Data Stores

Page 41: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Projects built yesterday using MapReduce need to be rewritten in Spark today
• Then Spark needs to be upgraded to Spark Streaming + Kafka for real time…

• Upgrades, and replatforming onto the latest tech, can bring “fragile” initiatives to a halt

• ODI’s pluggable KM approach to big data integration makes tech upgrades simple

• Focus time + investment on new big data initiatives, not rewriting fragile hand-coded scripts

Future-Proof Big Data Integration Platform

[Diagram: the Big Data Management Platform, all running natively under Hadoop – YARN (cluster resource management), HDFS (cluster filesystem holding raw data), Hive + Pig (log processing, UDFs etc.), Spark (in-memory data processing), Kafka + Spark Streaming, and perhaps Apache Beam – alongside the Discovery & Development Labs (a safe and secure environment for data sets, samples, models and programs), the data warehouse (curated data: historical view and business-aligned access), the ODI desktop client, and the enriched customer profile used for modeling and scoring.]

Page 42: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• Big data projects have had it “easy” so far in terms of data quality + data provenance
• Innovation labs + schema-on-read prioritise discovery + insight, not accuracy and audit trails

• But a data reservoir without any cleansing, management + data quality = data cesspool

• … and nobody knows where all the contamination came from, or who made it worse

And the Next Challenge : Data Quality + Provenance


Page 43: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


• From my perspective, this is what makes Oracle Data Integration my Hadoop DI platform of choice

• Most vendors can load and transform data in Hadoop (not as well, but basic capability)

• Only Oracle have the tools to tackle tomorrow’s Big Data challenge: Data Quality + Data Governance
  • Oracle Enterprise Data Quality
  • Oracle Enterprise Metadata Mgmt
• Seamlessly integrated with ODI
• Brings enterprise “smarts” to less mature Big Data projects

Data Governance : Why I Recommend Oracle DI Tools


Page 44: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Presentations on:


Data Integration Solutions Program - tinyurl.com/DISOOW16

Demo Stations: Oracle Enterprise Metadata Management, Oracle Enterprise Data Quality, Oracle GoldenGate, Oracle Data Integrator, Oracle Big Data Preparation Cloud Service – at the Middleware Demoground (Moscone South), Big Data Showcase (Moscone South), and Database Demoground (Moscone South)

Hands-on Labs: Oracle Enterprise Data Quality [HOL7466], Oracle GoldenGate Deep Dive [HOL7528], ODI and OGG for Big Data [HOL7434], Oracle Big Data Preparation Cloud Service [HOL7432]

Page 45: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Data Integration Solutions Program - tinyurl.com/DISOOW16

Monday, Sept 19
• Oracle Data Integration Solutions – Platform Overview and Roadmap [CON6619]
• Oracle Data Integration: the Foundation for Cloud Integration [CON6620]
• A Practical Path to Enterprise Data Governance with Cummins [CON6621]
• Oracle Data Integrator Product Update and Strategy [CON6622]
• Deep Dive into Oracle GoldenGate 12.3 New Features for the Oracle 12.2 Database [CON6555]

Tuesday, Sept 20
• Oracle Big Data Integration in the Cloud [CON7472]
• Oracle Data Integration Platform: a Cornerstone for Big Data [CON6624]
• Oracle Data Integrator and Oracle GoldenGate for Big Data [HOL7434]
• Oracle Enterprise Data Quality – Product Overview and Roadmap [CON6627]
• Self Service Data Preparation for Domain Experts – No Programming Required [CON6630]
• Oracle Big Data Preparation Cloud Service: Self-Service Data Prep for Business Users [HOL7432]
• Oracle GoldenGate 12.3 Product Update and Strategy [CON6631]
• New GoldenGate 12.3 Services Architecture [CON6551]
• Meet the Experts: Oracle GoldenGate Cloud Service [MTE7119]

Wednesday, Sept 21
• Data Quality for the Cloud: Enabling Cloud Applications with Trusted Data [CON6629]
• Transforming Streaming Analytical Business Intelligence to Business Advantage [CON7352]
• Oracle Enterprise Data Quality for All Types of Data [HOL7466]
• Oracle GoldenGate for Big Data [CON6632]
• Accelerate Cloud On-Boarding using Oracle GoldenGate Cloud Service [CON6633]
• Oracle GoldenGate Deep Dive and Oracle GoldenGate Cloud Service for Cloud Onboarding [HOL7528]

Thursday, Sept 22
• Best Practices for Migrating to Oracle Data Integrator [CON6623]
• Best Practices for Oracle Data Integrator: Hear from the Experts [CON6625]
• Dataflow, Machine Learning and Streaming Big Data Preparation [CON6626]
• Data Governance with Oracle Enterprise Data Quality and Metadata Management [CON6628]
• Faster Design, Development and Deployment with Oracle GoldenGate Studio [CON6634]
• Getting Started with Oracle GoldenGate [CON7318]
• Best Practice for High Availability and Performance Tuning for Oracle GoldenGate [CON6558]

Page 46: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Oracle Cloud Platform Innovation Awards

Meet the Most Impressive Cloud Platform Innovators

• Meet peers who implemented cutting-edge solutions with Oracle Cloud Platform

• Learn how you can Transform your Business

No registration or OpenWorld pass required to attend

Oracle PaaS Customer Appreciation Reception

Tuesday, Sep 20, 4:00 p.m. - 6:00 p.m. | YBCA Theater | 701 Mission St

Meet the Most Impressive Cloud Platform Innovators

• FREE Appreciation Reception for all Oracle PaaS Customers directly following the Innovation Awards Ceremony

No OpenWorld pass is required to attend this reception

Tuesday, Sep 20, 6:00 p.m. - 8:30 p.m. | YBCA Theater | 701 Mission St

Page 47: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Connect with Oracle Data Integration

@OracleDI

Blogs.oracle.com/DataIntegration/

Oracle Data Integration

Oracle Data Integration

Page 48: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Agenda

1. Oracle Data Integration for Big Data
2. Big Data Patterns
3. A Practitioner’s View on Oracle Data Integration for Big Data
4. Q & A

Page 49: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Page 50: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Page 51: Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


Safe Harbor Statement: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
