Big and Fast Data Strategy 2017 Jonathan Raspaud AVP - Big Data Architecture February, 2017


TRANSCRIPT

Page 1

Big and Fast Data Strategy 2017

Jonathan Raspaud

AVP - Big Data Architecture

February, 2017

Page 2

© Antuit 2016 Proprietary & Confidential; Not for circulation 2

Executive Summary

2017 Data Landscape

Vision

Strategy

Roadmap

Key Initiatives

High Level Architecture

High Level Data Flow

Data Validity Vendor Comparison

Page 3

About Jonathan Raspaud:

[Career timeline graphic, 1997–2017: Software Engineer (Teamlog), Software Engineer, Datawarehouse Engineer, Manager Business Intelligence, Mobility Practice Lead, Senior Principal Data Architect, AVP - Big Data Architecture. Education: Master of Science in Management of Information Systems, IAE Grenoble.]

Page 4

2017 Data Landscape (1): The Four V’s

• Data Volume: billions of rows

• Data Validity: format, process

• Data Velocity: real time, streaming (sensors, markets, networks, transportation, IoT, social)

• Data Variety: structured, semi-structured, unstructured (weblogs, clickstreams, IoT, text, call center, chat, social)

Page 5

2017 Data Landscape (2): Legacy RDBMS Databases are poor at:

• Scalability

• Fast Streaming Data

• Unstructured Data

• Schema Flexibility

• Search
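The schema-flexibility gap can be seen in miniature: a relational table needs DDL plus a backfill to gain a new attribute, while a semi-structured record simply carries the extra field. A minimal sketch using Python's built-in sqlite3 (the table names, fields, and records are illustrative, not from the deck):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('u1', 'Ada')")

# Relational path: a new attribute means DDL first, then a backfill.
# On billions of rows this is the pain point the slide calls out.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("UPDATE users SET email = 'ada@example.com' WHERE id = 'u1'")

# Semi-structured alternative: each record declares its own fields,
# stored here as a JSON text column with no schema change required.
conn.execute("CREATE TABLE users_doc (id TEXT, doc TEXT)")
conn.execute("INSERT INTO users_doc VALUES ('u2', ?)",
             (json.dumps({"name": "Lin", "email": "lin@example.com",
                          "tags": ["vip"]}),))

doc = json.loads(conn.execute(
    "SELECT doc FROM users_doc WHERE id = 'u2'").fetchone()[0])
assert doc["tags"] == ["vip"]
```

The trade-off, as the following slides note, is that the relational side keeps full SQL query power while the document side pushes field interpretation into the application.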

Page 6

2017 Data Landscape (3): MPP/Column-Store Databases:

The Good:

• SQL based, wide capability with BI tools

• Good performance

• Full support for aggregation and ad hoc filtering

The Bad:

• Need to move the data from operational systems

• Data loses freshness

• Ultimate scale limitations

• Hard to adapt schema

• Can be expensive
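The performance advantage comes from layout: a column store keeps each column contiguous, so an aggregate touches only the columns it needs instead of every full row. A toy sketch of the idea (not any vendor's engine; the sample rows are invented):

```python
# The same rows stored row-wise vs column-wise, and an aggregate
# that only needs one field.
rows = [
    {"region": "EU", "sku": "A1", "revenue": 120.0},
    {"region": "US", "sku": "B2", "revenue": 80.0},
    {"region": "EU", "sku": "C3", "revenue": 40.0},
]

# Row store: every row must be visited to read a single field.
total_row_store = sum(r["revenue"] for r in rows)

# Column store: each column is a contiguous array; the aggregate
# scans only the "revenue" column and skips the others entirely.
columns = {
    "region": [r["region"] for r in rows],
    "sku": [r["sku"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 240.0
```

On disk the same layout also compresses far better, since a column holds values of one type, which is why MPP/column stores handle the aggregation and ad hoc filtering workloads listed above so well.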

Page 7

2017 Data Landscape (4): Hadoop:

The Good:

• Distributed storage and processing of massive data sets

• Low-cost clusters built from commodity hardware

The Bad:

• SQL interfaces are improving but still not speed-of-thought

Page 8

2017 Data Landscape (5): NoSQL Databases:

The Good:

• Storage and retrieval of data modeled in means other than the tabular relations used in an RDBMS

• More and more application developers choose NoSQL databases as operational databases

• Scalability, schema-less flexibility, and fast response times for short-request queries

The Bad:

• Traditional BI tools lack native compatibility

• Not optimized for analytic queries

• Some don’t support aggregation or ad hoc filtering on arbitrary fields
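The document model behind these points can be sketched with a hypothetical in-memory key-value store (not any specific NoSQL product): schema-less writes and key lookups are trivial and fast, while ad hoc filtering on an arbitrary field degenerates to a full scan unless the store maintains a secondary index.

```python
# Hypothetical in-memory document store: records keyed by id, no schema.
store = {}

def put(doc_id, doc):
    store[doc_id] = doc

def get(doc_id):
    # Short-request query: a single key lookup, O(1).
    return store.get(doc_id)

# Two records with different shapes coexist without any migration.
put("u1", {"name": "Ada", "email": "ada@example.com"})
put("u2", {"name": "Lin", "tags": ["vip"], "last_login": "2017-02-01"})

assert get("u2")["tags"] == ["vip"]

# The flip side the slide notes: filtering on an arbitrary field
# means scanning every document, since no index covers "tags".
vips = [d for d in store.values() if "vip" in d.get("tags", [])]
assert len(vips) == 1
```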

Page 9

2017 Data Landscape (6): Search Databases:

The Good:

• Using a search index technology is a great way to enable access to big data in the enterprise

• Delivers fast access to unstructured or semi-structured information: blog posts and comments, customer product reviews, machine logs, JSON documents…

• Very effective with structured data too

The Bad:

• Lacks a SQL interface, so traditional BI tools are incompatible

• Native APIs are required to access data
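The structure that makes search databases fast is the inverted index: a map from token to the set of documents containing it, so a query intersects small posting lists instead of scanning every document. A toy sketch (the sample documents are invented):

```python
from collections import defaultdict

# Toy corpus: id -> text.
docs = {
    1: "great product fast shipping",
    2: "product arrived broken",
    3: "fast support great experience",
}

# Inverted index: token -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def search(*terms):
    # AND query: intersect the posting lists of all terms.
    results = set(docs)
    for term in terms:
        results &= index.get(term, set())
    return sorted(results)

assert search("great", "fast") == [1, 3]
```

Real search engines layer tokenization, ranking, and distribution on top of this, but the retrieval core is this same token-to-documents map, which is also why such systems expose native query APIs rather than SQL.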

Page 10

2017 Data Landscape (7): Cloud Big Data Stores:

The Good:

• Storing massive amounts of data in the cloud

• Low cost

• Easy to manage

• Range of storage options: file system, SQL database, Hadoop, Spark…

The Bad:

• Traditional BI tools lack performance-optimized native integration

Page 11

2017 Data Landscape (8): Fast Data:

The Good:

• Fast inserts/updates

• Fast analytics

The Bad:

• Traditional BI tools lack integration

• Traditional BI tools are not architected for streaming data

• Limited or no SQL interface
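Fast-data analytics typically means windowed aggregates over an event stream rather than queries over data at rest. A minimal pure-Python sketch of a sliding-window sum, standing in for what an engine such as Spark Streaming does at scale (the class name and event values are illustrative):

```python
from collections import deque

class WindowedSum:
    """Sliding-window aggregate over a stream of (timestamp, value)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value), oldest first

    def add(self, ts, value):
        self.events.append((ts, value))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()

    def total(self):
        return sum(v for _, v in self.events)

w = WindowedSum(window_seconds=10)
w.add(0, 5)
w.add(4, 3)
w.add(12, 2)   # the event at t=0 is now outside the 10s window
assert w.total() == 5
```

The point for BI integration: the "table" here is never finished. The aggregate changes with every event, which is exactly what report-oriented BI tools, built around point-in-time queries, are not architected for.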

Page 12

2017 Data Landscape (9): Conclusion

• Legacy BI is not designed for modern data:

• Hard to use: designed in an age of specialized skills
– Focus on the power user
– Complicated workbench interfaces
– Quickly require SQL coding

• Cannot scale: deployed on desktops or monolithic servers
– Limited user scalability
– Poor performance
– Not built for embedding in other applications

• Performance problems: designed for relational data only
– Loss of functionality
– Poor performance
– Limited data scalability

Page 13

Modern Big and Fast Data Platform Requirements: 5 V’s

Volume:
1. Immediate visualization & interaction regardless of size of data
2. Don’t move or copy data

Variety:
1. Support a broad range of modern sources without lock-in
2. Blend multi-source data on-the-fly
3. Extensible data connectors for different types of data

Velocity:
1. Support fast data (streaming)
2. Integrate streaming & historical data in a single view

Veracity:
1. Master Data Management
2. Definitions

Value:
1. Business insight, monetization, optimization, new customers

Page 14

Vision (Example):

“Business Insights at the Speed of Light”.

Page 15

Strategy (Example):

• Speed is our main strategic asset

• Spark is the engine that powers all our data initiatives

• Set the context and get out of the way

• Build Proofs of Concept that are ready for Production

• Public Cloud only

• Leverage key vendors as needed: Paxata, Cloudera, ZoomData, Google, Amazon…

Page 16

Roadmap (Example):

[Roadmap timeline graphic, Q1 2017 to Q1 2018. Workstreams: Strategy, Procurement, Ingestion, Infrastructure, Big BI, Insights. Milestones include: Lambda Architecture, Deskside, People, WorkDay, Oracle Financial, ServiceNow, Human Resource, Telecom, TEM, From BI To Big Data, IoT Real Time, Data Science Training, EDL, Mobile BI, Data Science Real Time / Self-Healing / AI-Aware, Transportation, Real Time ML. Vendors: ZoomData, PrestoDB, Paxata, IBM DS Platform.]

Page 17

Enterprise Data Lake – Ingestion (Example):

[Quarterly timeline graphic, Q1–Q4 2017; ✅ = complete. Milestone groups:]

• Data Ingestion: Snapchat; Other Source Systems: Billz, Workday; “Near Real Time” Update (Spark batch): Instagram; More-than-once-per-day update: Pinterest

• Data Ingestion: Facebook ✅, Twitter ✅, Pinterest ✅, Youtube ✅, Instagram ✅, DCM ✅; Other Source Systems: Adobe Analytics, Salesforce Marketing; Near Real Time Update (Spark batch): Facebook

• Data Ingestion: LinkedIn ✅, Google Maps ✅, Waze; Other Source Systems: GSA, Salesforce ✅; “Near Real Time” Update (Spark batch): Youtube ✅

• Data Ingestion: Wikipedia, STAT; Real Time Update (Spark Streaming): Twitter

Page 18

Enterprise Data Lake – Infrastructure (Example):

[Quarterly timeline graphic, Q1–Q4 2017; ✅ = complete. Milestone groups:]

• Scalable Database for Data Marts: RedShift vs. BigQuery; Security: Kerberos authentication, configure external authentication for Cloudera Manager using AD; Cluster Scaling; DB migration for Hive Metastore; configure high availability for Hive

• Scalable Database for Big BI Data Marts: RedShift vs. BigQuery; Configuration Database; Kafka Cluster

• Cloudera Upgrade ✅; Disaster Recovery ✅; Configuration Database ✅; Kafka Cluster (test cluster complete, Sprint 190 ✅); Subnet Migration; cluster resource upgrade, scaled out ✅

• Security: configure Sentry in Production cluster; configure external database for Cloudera Manager; Hue DB migration to external database

Page 19

Key Initiatives (Example):

• Focus on high impact/high dollar

• Machine Learning/Deep Learning

• Big BI

• Big MDM

Page 20

High Level Streaming Architecture (Example):

[Architecture diagram. Components: Device Events feed a Real Time Pipeline and a Batch Pipeline into the Big and Fast Data Stream and Data Store (Pivot), which serves Grid Data Visualization & Reporting.]

Page 21

High Level Data Flow (Example):

[Data flow diagram: Data Sources (Relational Data (CSV); Schema-Free Nested Data (JSON)) feed Ingestion into the Big Data Store, then Big BI (SQL, Interactive Reporting) and Data Visualization and Exploration (Tableau, PowerBI, Looker over ODBC/JDBC), leading to Data-Driven Decisions. The Enterprise Data Lake is the one source of truth for all reports.]
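The flow above, relational CSV-shaped rows and schema-free JSON blended into one store and served over plain SQL, can be sketched with Python's built-in sqlite3 (the table names and records are invented for illustration; a real lake would use Hive, RedShift, or BigQuery behind ODBC/JDBC):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational source: CSV-shaped rows land in a typed table.
conn.execute("CREATE TABLE orders (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("o1", "acme", 100.0), ("o2", "globex", 250.0)])

# Schema-free source: nested JSON events are flattened at ingestion.
events = json.loads('[{"order": {"id": "o1"}, "channel": "web"},'
                    ' {"order": {"id": "o2"}, "channel": "mobile"}]')
conn.execute("CREATE TABLE events (order_id TEXT, channel TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(e["order"]["id"], e["channel"]) for e in events])

# BI tools see one store and query it as ordinary SQL.
row = conn.execute("""
    SELECT e.channel, SUM(o.amount)
    FROM orders o JOIN events e ON o.order_id = e.order_id
    GROUP BY e.channel ORDER BY 2 DESC
""").fetchone()
assert row == ("mobile", 250.0)
```

The design point is that flattening happens once, at ingestion, so every downstream report joins against the same tabular shape: the "one source of truth" claim on this slide.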

Page 22

Big and Fast Data Validity: Vendor Comparison

Alteryx
• Primary user: Technical data developer
• Strengths: Data integration; data mapping; advanced analytics
• Weaknesses: Data cleansing; data manipulation; ease of use
• Analysis: Alteryx is a full-stack BI tool that includes a layer of data integration capabilities. Introducing another BI tool (in addition to Tableau, Qlik, Excel) is not ideal, particularly since it would only be able to address data migration use cases. It overlaps with Snaplogic, which Yahoo! already owns.

Paxata
• Primary user: Non-technical business analyst
• Strengths: Data integration and quality; comprehensive governance model; centralized collaboration workbench; no coding or scripting required
• Weaknesses: Limited enrichment today
• Analysis: Paxata has the most robust capabilities to address the broadest set of data preparation use cases. Its model for data governance is far above anything else on the market. It also appears to ingest the widest range of data sources and can scale to a billion rows. True enterprise capabilities for security and scale.

Trifacta
• Primary user: Technical data scientist
• Strengths: Visualization; batch processing
• Weaknesses: Only works with information loaded into Hadoop; only works with samples of data; feedback is not in real time; minimal data quality capabilities
• Analysis: Trifacta is not a good fit for our users, since they are all business analysts and it is very complex to make changes. Also, the information for these use cases comes from multiple data sources, many of which are not Hadoop. Trifacta does not have the data quality capabilities needed for the broadest number of use cases.