
© 2016 Impetus Technologies - Confidential

Swimming Across the Data Lake

Lessons Learned and Keys to Success

Impetus Technologies Inc.


Our 40 Minutes Today

• Critical Trends – Lessons Learned
• Solving the Big Data “DILEMMA”
– Data Democratization
– Enterprise Metadata Management
– Data Access
– Self Service BI
• Migrating Workloads
– Lift and Shift Automation


State of Play – Big Data

[Figure: Hadoop in Gartner’s Hype Cycle, Year 2013 and Year 2014]

Source Credit: http://www.cioinsight.com/it-strategy/big-data/slideshows/big-problems-for-big-data


State of Play – Big Data – 2015

2015: Gartner moves Big Data out of Hype Cycle - it is REAL

[Figure: Gartner Hype Cycle, Year 2013, with Hadoop marked]


“Through 2018, 90% of Hadoop installations will be useless as they are overwhelmed with information assets captured for uncertain use cases”

“Enterprises today are realizing about 15% of potential ROI on BI investments”


Blueprint for a Modern Data Architecture

• Landing and ingestion
– Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Provisioning, Workflow, Monitoring and Security
• Enterprise Data Lake
• Predictive applications
• Exploration & discovery
• Enterprise applications
• Real-Time applications
• Traditional data repositories
– RDBMS, MPP
• Governance, Information Lifecycle, Enterprise Meta Data Management


Don’t throw your users into the Data Lake!

Creating a data lake is only the beginning…


The Data Lake “DILEMMA”

Data
• Ingestion and Storage
• Governance
• Security & Compliance

Information Lifecycle Management
• Lineage

Enterprise Metadata Management
• Meta data discovery
• Ontology

Access
• Query Performance
• Search data

D IL EMM A

Effective use of the Data Lake as a true enterprise data reservoir introduces new challenges. We call these the Data Lake “DILEMMA”. Addressing them helps avoid turning the lake into a “data swamp”, which would inhibit or slow enterprise adoption.


Critical Trends – Planning for Success

“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases”



Making insights and data in the lake readily discoverable, accessible and usable

“Visual data-discovery, an important enabler of end user self-service, will grow 2.5x faster than the rest of the market, becoming by 2018 a requirement for all enterprises.”


The Path to Democracy: Data Discovery

Don’t tell them what they need. Help your stakeholders find what works best for them


The Path to Democracy: Data Discovery

• Identification of unknown data
• Consolidation of enterprise data dictionary
• Metadata capture throughout the data lifecycle
• Search tools to help users find what they need
• Tools to browse/sample available data sets
• Collaboration tools for users to share their data insights


The Path to Democracy: Data Accessibility

Lower adoption barriers for your stakeholders. Getting the data they want should be fast and easy.


The Path to Democracy: Data Accessibility

• Easy-to-use access request mechanisms
• Monitored access approval workflows
• Business rules for decisioning automation
• Fast data provisioning, integrated with approval workflow
• Automated data provisioning mechanisms
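
The access request, rule-based decisioning, and audited provisioning steps listed above could be wired together roughly as follows. This is only a minimal sketch: the `AccessRequest` fields, the sensitivity labels, the role names, and the rules themselves are illustrative assumptions, not part of any Impetus product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user_role: str
    dataset: str
    sensitivity: str  # assumed labels: "public", "internal", "restricted"


def decide(request: AccessRequest) -> str:
    """Apply hypothetical business rules; returns "approve", "deny" or "review"."""
    if request.sensitivity == "public":
        return "approve"
    if request.sensitivity == "internal" and request.user_role in {"analyst", "data_steward"}:
        return "approve"
    if request.sensitivity == "restricted" and request.user_role == "data_steward":
        # Restricted data is never auto-approved; it is routed to a human approver.
        return "review"
    return "deny"


def provision(request: AccessRequest, audit_log: list) -> bool:
    """Record every decision for monitoring; grant access only on approval."""
    decision = decide(request)
    audit_log.append((request.user_role, request.dataset, decision))
    return decision == "approve"
```

Keeping the rules in one small function means the approval workflow can be monitored through the audit log while routine requests are provisioned without manual steps.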


The Path to Democracy: Data Usability

Get the most out of your data lake. Make it simple to use.


The Path to Democracy: Data Usability

• Virtualize data views to hide storage platform complexity
• Provide business user friendly façades to technical tools
• Help users visualize data
• Link data views to business entities
• Help users access the data lake through familiar paradigms
• Wrap analytics algorithms into easy to use tools


Metadata Repository

Metadata Management – Unified Single View

• Automatic schema ingestion from heterogeneous data sources
• Data dictionary and catalog entity tagging
• Search for anything in the data catalog (e.g., free-text search)
• Incremental update and automatic metadata synchronization
• Define and manage each data catalog entity’s lifecycle
• Apply machine learning algorithms for attribute identification, mapping, and entity consolidation
• Domain experts analyse, approve and customize the results, and provide suggestions to the model
• Domain-based dictionary for attribute matching, classification and building composite
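
To make the first few capabilities concrete, here is a minimal in-memory sketch of schema ingestion, entity tagging, free-text search and incremental update. The class and method names (`MetadataCatalog`, `ingest_schema`, `tag`, `search`) are hypothetical and chosen only for illustration; a real repository would persist entries and connect to actual sources.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    source: str
    columns: dict  # column name -> type string
    tags: set = field(default_factory=set)


class MetadataCatalog:
    """Toy catalog: one entry per data set, searchable by free text."""

    def __init__(self):
        self._entries = {}

    def ingest_schema(self, name, source, columns):
        """Register or refresh an entry; tags survive a re-ingest (incremental update)."""
        existing = self._entries.get(name)
        tags = existing.tags if existing else set()
        self._entries[name] = CatalogEntry(name, source, dict(columns), tags)

    def tag(self, name, *tags):
        self._entries[name].tags.update(tags)

    def search(self, text):
        """Case-insensitive substring match over names, sources, columns and tags."""
        needle = text.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, entry.source, *entry.columns, *entry.tags]
            if any(needle in item.lower() for item in haystack):
                hits.append(entry.name)
        return sorted(hits)
```

For example, after ingesting an `orders` table with a `cust_email` column and tagging it `pii`, a search for either "email" or "PII" would return `["orders"]`.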


Access Big Data

• SQL: Specific Query & Reporting
• OLAP: Cross Dimensional Fast Slice, Dice and Drill Down
• Data Virtualization: Data from MPP, Relational and Hadoop
• Search: Finding the “Needle in a Haystack”
• Self Service Data Discovery: “Don’t Know What You Don’t Know”


“By 2017, most business users and analysts in organizations will have access to self-service tools to prepare data for analysis”

“Managed BI Self-Service Will Continue to Close the Business and Technology Gap.”


Self Service BI over Data Lake


Steps to Effective Self Service BI

• Provision Cluster
• Discover and Blend New Sources
• Data Access and Exploration
• Ingest and Transform Data
• Security and Governance
• BI, Analytics and Models


Blueprint for a Modern Data Architecture

• Landing and Ingestion
– Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Provisioning, Workflow, Monitoring and Governance
• Enterprise Data Lake
• Data Federation/Virtualization
• Exploration & Discovery
• Data Wrangling
• Real-time Applications
• Traditional Data Repositories
– RDBMS, MPP
• Enterprise Meta Data Management
• Accelerators


Why are customers Migrating Workloads to Hadoop?

• Free up capacity to contain costs
– Immediate ROI on Hadoop
– Contain expenditure on relational warehouse
• Create a multi-platform data warehouse environment
– One of the strongest tactics in data architecture today
– Create an “Adjunct” to the relational warehouse platform
• Get a platform better suited to advanced analytics
• Set up for success


Manual migrations often come with million-dollar price tags, and years of business logic must be recoded, debugged, and vetted in Hadoop (RISK)


Challenges to Migrate Workloads to Hadoop

• Extremely Complicated Process
– Manual Identification of Workloads to Migrate
    • What Tables, Data, Queries, Dependent Queries
– Figuring out the Data Model on Hadoop
    • Where to store on Hadoop?
    • Hadoop best practices are not known/missing
– Manual Migration
    • Time-consuming and risky
    • Offload/Migration Validation/QA problem
• Technology Readiness
– ANSI standard SQL and other complex relational technologies are not fully supported on Hadoop
– Support for DW specific keywords and data types
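
As a small illustration of the data type gap, the sketch below translates a few warehouse-specific column types into types that Hive accepts. The mapping table is an illustrative assumption rather than a vetted or complete conversion matrix, and `translate_column` is a hypothetical helper, not an existing tool.

```python
import re

# Illustrative mapping only: a few source-specific types and a plausible Hive target.
TYPE_MAP = {
    "BYTEINT": "TINYINT",     # Teradata
    "VARCHAR2": "STRING",     # Oracle
    "CLOB": "STRING",         # Oracle, DB2
    "DATETIME": "TIMESTAMP",  # SQL Server
}

# Matches "name TYPE" or "name TYPE(args)".
_COLUMN = re.compile(r"^\s*(\w+)\s+(\w+)(\s*\([^)]*\))?\s*$")


def translate_column(definition: str) -> str:
    """Translate one column definition into a Hive-style definition."""
    match = _COLUMN.match(definition)
    if match is None:
        raise ValueError(f"unrecognized column definition: {definition!r}")
    name, source_type, args = match.groups()
    target = TYPE_MAP.get(source_type.upper())
    if target is None:
        # Types not in the map pass through unchanged, length/precision included.
        return f"{name} {source_type.upper()}{(args or '').strip()}"
    # The mapped targets above take no length argument, so any args are dropped.
    return f"{name} {target}"
```

Even this toy version shows why manual migration is risky: every unmapped type, keyword, or precision rule is a place where a hand-written conversion can silently diverge from the source.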


“4 click” Paradigm

Connect to supported Data Warehouses
• Teradata, SQL Server, Oracle, Netezza, DB2
• SAP Hana and More on Roadmap

Full Intelligent Assessment & Identification of “Offload-able” Entities
– Analyze and Recommend offloadable queries and tables
– Recommend Query Engine to meet SLAs
– Recommend Data Store

Tool Sets for Automated Migration
• Data Migration
– Recommendation for data partitioning, clustering and buckets
– Migrate role based security
– Data Validation
• Workload Transformation
– Impetus UDF Library to support Source specific keywords
– Automatic conversion of SQL and PL/SQL scripts
• Workload Execution
– Support for Multiple Query Engines: HiveQL and SparkSQL
– Schedule execution for migrated code

Validate Migrated Workloads
• Establish functional equivalence
• Meet or exceed SLAs
– Support for HiveQL and SparkSQL
– Support Hive on Tez and Hive on Spark Engines
– Built-in recommendation for partitioning, clustering and number of buckets based on dataset
– Optimized parallelism (number of mappers) based on data source size
– Scale out on commodity hardware
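
One simple way to check functional equivalence is to run the same query on the source warehouse and on Hadoop, then compare the two result sets without depending on row order. The sketch below assumes the rows have already been fetched as tuples; the function names are hypothetical, and this is one possible technique rather than the method any specific tool uses.

```python
import hashlib
from collections import Counter

NULL_MARKER = "\\N"  # assumed placeholder so NULL and empty string stay distinct


def row_fingerprint(row) -> str:
    """Hash one row after normalizing its values to text."""
    normalized = "\x1f".join(NULL_MARKER if value is None else str(value) for value in row)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def summarize(rows) -> Counter:
    """Multiset of row fingerprints: ignores order but keeps duplicate counts."""
    return Counter(row_fingerprint(row) for row in rows)


def equivalent(source_rows, migrated_rows) -> bool:
    """True when both result sets contain the same rows the same number of times."""
    return summarize(source_rows) == summarize(migrated_rows)
```

Because values are compared as text, a type drift such as `1` becoming `1.0` is reported as a mismatch, which is usually the desired behavior when validating a migration. Meeting or exceeding SLAs would need a separate timing comparison.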


Impetus Enabling the Modern Analytical Platform

• Landing and Ingestion
– Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Provisioning, Workflow, Monitoring and Governance
• Enterprise Data Lake
• Data Federation/Virtualization
• Exploration & Discovery
• Data Wrangling
• Real-time Applications
• Traditional Data Repositories
– RDBMS, MPP
• Enterprise Meta Data Management
• Accelerators

KYVOS INSIGHTS

DATA BLENDING

WORKLOAD MIGRATION

METADATA & DISCOVERY

DATA GOVERNANCE (for Hadoop)

DATA ACCESS

STREAM ANALYTIX


Thank you. Questions?