TRANSCRIPT
How to Build a Successful Data Lake
Alex Gorelik
Founder and CEO, Waterline Data
Data Lakes Power Data-Driven Decision Making
Maximize Business Value With a Data Lake
How Do You Democratize the Data Lake to Maximize Business Value?
Data Lake vs. Data Puddle vs. Data Swamp
[Chart: data lakes, data puddles, and data swamps plotted on two axes: governance style, from tight control ("governed") to self-service, and business value, from no value to enterprise impact. DW off-loading and data swamps deliver limited value; the data lake combines self-service with enterprise impact through data democratization.]
Data Swamps
• Raw data
• Can’t find or use data
• Can’t allow access without protecting sensitive data
Data Warehouse Offloading: Cost Savings
• “I prefer a data warehouse--it’s more predictable”
• “It takes IT 3 months of data architecture and ETL work to add new data to the data lake”
• “I can’t get the original data”
Data Puddles: Limited Scope and Value
• Low variety of data and low adoption
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sandbox)
• Strong technical skill set requirement
What Makes a Successful Data Lake?
Right Platform + Right Data + Right Interface
Right Platform:
• Volume: massively scalable
• Variety: schema on read
• Future-proof: modular, so the same data can be used by many different projects and technologies
• Cost: an extremely attractive cost structure
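"Schema on read" means the lake stores data exactly as it arrives and each project applies its own structure only when it reads. A minimal sketch of the idea, with hypothetical field names and an in-memory store standing in for the lake:

```python
import io
import json

# Raw events are stored as-is -- no schema is imposed at write time.
raw_store = io.StringIO()
for event in [
    {"user": "u1", "amount": "19.99", "country": "US"},
    {"user": "u2", "amount": "5.00"},  # no country field -- that's fine
]:
    raw_store.write(json.dumps(event) + "\n")

def read_with_schema(store, schema):
    """Apply a schema only at read time: select fields and coerce types."""
    store.seek(0)
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append({field: cast(record.get(field)) for field, cast in schema.items()})
    return rows

# Two projects read the same raw data with different schemas.
billing = read_with_schema(raw_store, {"user": str, "amount": float})
geo = read_with_schema(raw_store, {"user": str, "country": lambda v: v or "unknown"})
```

Because nothing was thrown away at ingest time, a future project can define a third schema over the same bytes without any re-ingestion.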
Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later
Only a small portion of data in enterprises today is saved in data warehouses
Data Exhaust
Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be needed later
• Save as much data as possible now to analyze later
• Save raw data, so it can be treated correctly for each use case
Right Data Challenges: Data Silos and Data Hoarding
• Departments hoard and protect their data and do not share it with the rest of the enterprise
• Frictionless ingestion does not depend on data owners
Right Interface: Key to Broad Adoption
• Data marketplace for data self-service
• Providing data at the right level of expertise
Providing Data at the Right Level of Expertise
• Data scientists: raw data
• Business analysts: clean, trusted, prepared data
Roadmap to Data Lake Success
1. Organize the lake
2. Set up for self-service
3. Open the lake to the users
Organize the Data Lake into Zones
Multi-Modal IT: Different Governance Levels for Different Zones
• Raw or Landing zone (data scientists): minimal governance; make sure there is no sensitive data
• Gold or Curated zone (data scientists, business analysts): heavy governance; trusted, curated data with lineage and data quality
• Work zone (data engineers): minimal governance; make sure there is no sensitive data
• Sensitive zone (data stewards): heavy governance; restricted access
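One way to make zone policies concrete is a small policy table that tooling can consult. A sketch only: the zone names follow the slide, but the policy fields, role names, and the access check are illustrative assumptions, not any particular product's model:

```python
# Hypothetical zone policy table for a multi-modal governance setup.
ZONES = {
    "raw":       {"governance": "minimal", "sensitive_data_allowed": False,
                  "users": ["data scientists"]},
    "gold":      {"governance": "heavy",   "sensitive_data_allowed": False,
                  "users": ["data scientists", "business analysts"]},
    "work":      {"governance": "minimal", "sensitive_data_allowed": False,
                  "users": ["data engineers"]},
    "sensitive": {"governance": "heavy",   "sensitive_data_allowed": True,
                  "users": ["data stewards"]},
}

def can_access(role, zone):
    """Coarse access check: is this role an intended user of the zone?"""
    return role in ZONES[zone]["users"]
```

In practice these rules would live in a policy engine rather than application code, but a declarative table like this keeps the governance differences between zones explicit and auditable.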
Business Analyst Self-Service Workflow
Find and Understand → Provision → Prep → Analyze
Set up for self-service
Finding, understanding and governing data in a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive
• DATA SCIENTIST / BUSINESS ANALYST: needs data to use with self-service tools but can’t explore everything manually to find and understand data
• DATA STEWARD: can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
• BIG DATA ARCHITECT: can’t catalog all the data manually and keep up with data provisioning
Instead, Imagine Shopping on Amazon.com
Catalog → Find, Understand and Collaborate → Provision
Waterline Data is like Amazon for Data in Hadoop
Finding and Understanding Data
• Crowdsource metadata and automate creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data sets
• Establish trust: curated, annotated data sets; lineage; data quality; governance
Find and Understand
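Automated discovery usually starts with profiling: scan column values and suggest tags for what they appear to contain. A toy version of that idea, assuming nothing about any vendor's actual algorithm; the patterns, tag names, and threshold are all illustrative:

```python
import re

# Value-pattern fingerprints for a few common field types.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def suggest_tags(values, threshold=0.8):
    """Tag a column if most of its non-empty values match a known pattern."""
    values = [v for v in values if v]
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            tags.append(tag)
    return tags

column = ["alice@example.com", "bob@example.org", "", "carol@example.net"]
```

Real catalogs combine far richer signals (column names, value distributions, reference data, user curation), but even this sketch shows why machine tagging scales where manual cataloging of "100 million fields" cannot.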
Accessing and Provisioning Data
• You cannot give all access to all users
• You must protect PII data and sensitive business information
Provision
Agile/self-service approach:
• Create a metadata-only catalog
• When users request access, data is de-identified and provisioned
Top-down approach:
• Find and de-identify all sensitive data
• Provide access to every user for every dataset as needed
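De-identification at provisioning time often replaces sensitive fields with deterministic tokens, so analysts can still join and count on them without seeing real values. A minimal sketch using a keyed hash; the column names, record shape, and key are assumptions for illustration:

```python
import hashlib
import hmac

# Illustrative key and column list -- in a real deployment the key would be
# managed and rotated by a secrets service.
SECRET_KEY = b"rotate-me-in-a-real-deployment"
SENSITIVE_COLUMNS = {"ssn", "email"}

def pseudonymize(value):
    """Deterministic keyed hash: same input -> same token, but irreversible."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def provision(records):
    """Return a copy of the records with sensitive columns de-identified."""
    return [
        {k: (pseudonymize(v) if k in SENSITIVE_COLUMNS else v) for k, v in r.items()}
        for r in records
    ]

raw = [{"name": "Alice", "email": "alice@example.com", "plan": "gold"}]
safe = provision(raw)
```

Because the hash is keyed, two provisioned datasets de-identified with the same key still join on the tokenized column, which is what makes this approach usable for self-service analytics.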
Provide a Self-Service Interface to Find, Understand, and Provision Data
Prepare Data for Analytics (Prep)
Clean data
• Remove or fix bad data, fill in missing values, convert to common units of measure
Shape data
• Combine (join, concatenate)
• Resolve entities (create a single customer record from multiple records or sources)
• Transform (aggregate, bucketize, filter, convert codes to names, etc.)
Blend data
• Harmonize data from multiple sources to a common schema or model
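The clean/shape/blend steps above can be sketched in a few lines of plain Python. The field names, fallback value, and pound-to-kilogram conversion are illustrative, not from the talk:

```python
# Two small source datasets with hypothetical fields.
orders = [
    {"customer_id": 1, "weight_lb": 2.0},
    {"customer_id": 2, "weight_lb": None},   # missing value to clean up
]
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]

# Clean: fill missing values and convert to a common unit (lb -> kg).
fallback_lb = 2.0  # e.g., an average used as a fallback for missing weights
for o in orders:
    lb = o["weight_lb"] if o["weight_lb"] is not None else fallback_lb
    o["weight_kg"] = round(lb * 0.4536, 3)

# Shape and blend: join orders to customers on customer_id.
by_id = {c["customer_id"]: c for c in customers}
blended = [{**o, "name": by_id[o["customer_id"]]["name"]} for o in orders]
```

Dedicated wrangling tools wrap exactly these operations in visual interfaces; the point here is only that prep is a pipeline of small, inspectable transformations.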
Tooling
• Many great dedicated data wrangling tools on the horizon
• Some capabilities in BI and data visualization tools
• SQL and scripting languages for the more technical analysts
Data Analysis
• Many wonderful self-service BI and data visualization tools
• Mature space with many established and innovative vendors
Source: Gartner, “Magic Quadrant for Business Intelligence and Analytics Platforms,” 04 February 2016, ID G00275847. Analysts: Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
Analyze
Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog
Time to Value • Tribal Knowledge Sharing • Trust
Waterline Data Is The Only Smart Data Catalog For The Data Lake
• “Use an INFORMATION CATALOG TO MAXIMIZE BUSINESS VALUE From Information Assets”
• “automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”
• “tapped into an important and underserved opportunity”
• “comprehensive big data governance and discovery platform”
• “opens the data to a wider variety of people”
• “fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
"Opening up a data lake for self-service analytics requires a data catalog that's smart enough to
automatically catalog every field of data so business analysts can maximize time to value” -- Jerry Megaro,
Global Head Of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means in context is vital to making a data lake initiative
successful and not just another data quagmire – the catalog plays a critical component in this” -- Global
Head of Data Governance, Risk, and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data lake to business people” -- Paolo Arvati, Big
Data, CSI-Piemonte
We Run Natively On Hadoop And Integrate With Existing Tools
Workflow of Enabling Self-Service Analytics With Hortonworks
Hortonworks Atlas And Ranger
• Smart Data Discovery: profiling, sensitive data and data lineage discovery, automated tagging
• Data Stewardship: curate tags
• Self-Service Data Catalog: find, collaborate, and take action
• Data Prep, Analytics & Visualization
• Metadata, tags, and data lineage are exchanged with Atlas; roles and access control are managed with Ranger
A Successful Data Lake
Right Platform + Right Data + Right Interface
Come to Booth 303 to see a demoand talk to us about your data lake
Come to the Atlas session at 4:00 PM on Thursday in room 210C
Waterline Data: The Smart Data Catalog Company