TRANSCRIPT
How to Build a Successful Data Lake
Alex Gorelik
Founder and CEO, Waterline Data
Data Lakes Power Data-Driven Decision Making
Maximize Business Value With a Data Lake
How Do You Democratize the Data Lake to Maximize Business Value?
Data Lake vs. Data Puddle vs. Data Swamp
[Chart: data lakes, data puddles, and data swamps plotted on two axes: governance style, from tight control ("governed") to self-service, and business value, from no value to enterprise impact. DW off-loading and data swamps deliver limited value; the data lake combines self-service with enterprise impact through data democratization.]
Data Swamps
• Raw data
• Can’t find or use data
• Can’t allow access without protecting sensitive data
Data Warehouse Offloading: Cost Savings
• “I prefer a data warehouse--it’s more predictable”
• “It takes IT 3 months of data architecture and ETL work to add new data to the data lake”
• “I can’t get the original data”
Data Puddles: Limited Scope and Value
• Low variety of data and low adoption
• Focused use case (e.g., fraud detection)
• Fully automated programs (e.g., ETL off-loading)
• Small user community (e.g., data science sandbox)
• Strong technical skill set requirement
What Makes a Successful Data Lake?
Right Platform + Right Data + Right Interface
Right Platform:
• Volume: massively scalable
• Variety: schema on read
• Future-proof: modular, so the same data can be used by many different projects and technologies
• Cost: an extremely attractive cost structure
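"Schema on read" means the lake stores data exactly as it arrives and each project applies its own structure only when it reads. A minimal sketch of the idea, with hypothetical field names and an in-memory store standing in for the lake:

```python
import io
import json

# Raw events are stored as-is -- no schema is imposed at write time.
raw_store = io.StringIO()
for event in [
    {"user": "u1", "amount": "19.99", "country": "US"},
    {"user": "u2", "amount": "5.00"},  # no country field -- that's fine
]:
    raw_store.write(json.dumps(event) + "\n")

def read_with_schema(store, schema):
    """Apply a schema only at read time: select fields and coerce types."""
    store.seek(0)
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append({field: cast(record.get(field)) for field, cast in schema.items()})
    return rows

# Two projects read the same raw data with different schemas.
billing = read_with_schema(raw_store, {"user": str, "amount": float})
geo = read_with_schema(raw_store, {"user": str, "country": lambda v: v or "unknown"})
```

Because nothing was thrown away at ingest time, a future project can define a third schema over the same bytes without any re-ingestion.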
Right Data Challenges: Most Data Is Lost, So It Can’t Be Analyzed Later
Only a small portion of data in enterprises today is saved in data warehouses
Data Exhaust
Right Data: Save Raw Data Now to Analyze Later
• Don’t know now what data will be needed later
• Save as much data as possible now to analyze later
• Save raw data, so it can be treated correctly for each use case
Right Data Challenges: Data Silos and Data Hoarding
• Departments hoard and protect their data and do not share it with the rest of the enterprise
• Frictionless ingestion does not depend on data owners
Right Interface: Key to Broad Adoption
• Data marketplace for data self-service
• Providing data at the right level of expertise
Providing Data at the Right Level of Expertise
• Data scientists: raw data
• Business analysts: clean, trusted, prepared data
Roadmap to Data Lake Success
1. Organize the lake
2. Set up for self-service
3. Open the lake to the users
Organize the Data Lake into Zones
Multi-Modal IT: Different Governance Levels for Different Zones
• Raw or Landing zone (data scientists): minimal governance; make sure there is no sensitive data
• Gold or Curated zone (data scientists, business analysts): heavy governance; trusted, curated data with lineage and data quality
• Work zone (data engineers): minimal governance; make sure there is no sensitive data
• Sensitive zone (data stewards): heavy governance; restricted access
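One way to make zone policies concrete is a small policy table that tooling can consult. A sketch only: the zone names follow the slide, but the policy fields, role names, and the access check are illustrative assumptions, not any particular product's model:

```python
# Hypothetical zone policy table for a multi-modal governance setup.
ZONES = {
    "raw":       {"governance": "minimal", "sensitive_data_allowed": False,
                  "users": ["data scientists"]},
    "gold":      {"governance": "heavy",   "sensitive_data_allowed": False,
                  "users": ["data scientists", "business analysts"]},
    "work":      {"governance": "minimal", "sensitive_data_allowed": False,
                  "users": ["data engineers"]},
    "sensitive": {"governance": "heavy",   "sensitive_data_allowed": True,
                  "users": ["data stewards"]},
}

def can_access(role, zone):
    """Coarse access check: is this role an intended user of the zone?"""
    return role in ZONES[zone]["users"]
```

In practice these rules would live in a policy engine rather than application code, but a declarative table like this keeps the governance differences between zones explicit and auditable.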
Business Analyst Self-Service Workflow
Find and Understand → Provision → Prep → Analyze
Set up for self-service
Finding, understanding and governing data in a data lake is like shopping at a flea market
“We have 100 million fields of data – how can anyone find or trust anything?” – Telco Executive
• DATA SCIENTIST / BUSINESS ANALYST: needs data to use with self-service tools but can’t explore everything manually to find and understand data
• DATA STEWARD: can’t govern and trust data (unknown metadata, data quality, PII, data lineage)
• BIG DATA ARCHITECT: can’t catalog all the data manually and keep up with data provisioning
Instead, Imagine Shopping on Amazon.com
Catalog → Find, Understand and Collaborate → Provision
Waterline Data is like Amazon for Data in Hadoop
Finding and Understanding Data
• Crowdsource metadata and automate creation of a catalog
• Institutionalize tribal data knowledge
• Automate discovery to cover all data sets
• Establish trust: curated, annotated data sets; lineage; data quality; governance
Find and Understand
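Automated discovery usually starts with profiling: scan column values and suggest tags for what they appear to contain. A toy version of that idea, assuming nothing about any vendor's actual algorithm; the patterns, tag names, and threshold are all illustrative:

```python
import re

# Value-pattern fingerprints for a few common field types.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def suggest_tags(values, threshold=0.8):
    """Tag a column if most of its non-empty values match a known pattern."""
    values = [v for v in values if v]
    tags = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            tags.append(tag)
    return tags

column = ["alice@example.com", "bob@example.org", "", "carol@example.net"]
```

Real catalogs combine far richer signals (column names, value distributions, reference data, user curation), but even this sketch shows why machine tagging scales where manual cataloging of "100 million fields" cannot.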
Accessing and Provisioning Data
• You cannot give all access to all users
• You must protect PII data and sensitive business information
Provision
Agile/self-service approach:
• Create a metadata-only catalog
• When users request access, data is de-identified and provisioned
Top-down approach:
• Find and de-identify all sensitive data
• Provide access to every user for every dataset as needed
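De-identification at provisioning time often replaces sensitive fields with deterministic tokens, so analysts can still join and count on them without seeing real values. A minimal sketch using a keyed hash; the column names, record shape, and key are assumptions for illustration:

```python
import hashlib
import hmac

# Illustrative key and column list -- in a real deployment the key would be
# managed and rotated by a secrets service.
SECRET_KEY = b"rotate-me-in-a-real-deployment"
SENSITIVE_COLUMNS = {"ssn", "email"}

def pseudonymize(value):
    """Deterministic keyed hash: same input -> same token, but irreversible."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def provision(records):
    """Return a copy of the records with sensitive columns de-identified."""
    return [
        {k: (pseudonymize(v) if k in SENSITIVE_COLUMNS else v) for k, v in r.items()}
        for r in records
    ]

raw = [{"name": "Alice", "email": "alice@example.com", "plan": "gold"}]
safe = provision(raw)
```

Because the hash is keyed, two provisioned datasets de-identified with the same key still join on the tokenized column, which is what makes this approach usable for self-service analytics.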
Provide a Self-Service Interface to Find, Understand, and Provision Data
Prepare Data for Analytics (Prep)
Clean data
• Remove or fix bad data, fill in missing values, convert to common units of measure
Shape data
• Combine (join, concatenate)
• Resolve entities (create a single customer record from multiple records or sources)
• Transform (aggregate, bucketize, filter, convert codes to names, etc.)
Blend data
• Harmonize data from multiple sources to a common schema or model
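The clean/shape/blend steps above can be sketched in a few lines of plain Python. The field names, fallback value, and pound-to-kilogram conversion are illustrative, not from the talk:

```python
# Two small source datasets with hypothetical fields.
orders = [
    {"customer_id": 1, "weight_lb": 2.0},
    {"customer_id": 2, "weight_lb": None},   # missing value to clean up
]
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]

# Clean: fill missing values and convert to a common unit (lb -> kg).
fallback_lb = 2.0  # e.g., an average used as a fallback for missing weights
for o in orders:
    lb = o["weight_lb"] if o["weight_lb"] is not None else fallback_lb
    o["weight_kg"] = round(lb * 0.4536, 3)

# Shape and blend: join orders to customers on customer_id.
by_id = {c["customer_id"]: c for c in customers}
blended = [{**o, "name": by_id[o["customer_id"]]["name"]} for o in orders]
```

Dedicated wrangling tools wrap exactly these operations in visual interfaces; the point here is only that prep is a pipeline of small, inspectable transformations.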
Tooling
• Many great dedicated data wrangling tools on the horizon
• Some capabilities in BI and data visualization tools
• SQL and scripting languages for the more technical analysts
Data Analysis
• Many wonderful self-service BI and data visualization tools
• Mature space with many established and innovative vendors
Source: Gartner, “Magic Quadrant for Business Intelligence and Analytics Platforms,” 04 February 2016, ID G00275847. Analysts: Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich
Analyze
Unlock the Value of the Data Lake with the Waterline Data Smart Data Catalog
Time to Value • Tribal Knowledge Sharing • Trust
Waterline Data Is The Only Smart Data Catalog For The Data Lake
• “Use an INFORMATION CATALOG TO MAXIMIZE BUSINESS VALUE From Information Assets”
• “automatically identify, profile, and metatag files in HDFS and make them available for analysis and exploration”
• “tapped into an important and underserved opportunity”
• “comprehensive big data governance and discovery platform”
• “opens the data to a wider variety of people”
• “fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data”
Current Customers
Healthcare
Insurance
Life Sciences
Aerospace
Automotive
Banking
Government
Marketing
"Opening up a data lake for self-service analytics requires a data catalog that's smart enough to
automatically catalog every field of data so business analysts can maximize time to value” -- Jerry Megaro,
Global Head Of Data Analytics, Merck KGaA
“Understanding where your data came from and what it means in context is vital to making a data lake initiative
successful and not just another data quagmire – the catalog plays a critical component in this” -- Global
Head of Data Governance, Risk, and Standard, International Multi-Line Insurer
“A governed yet agile data catalog is key to open up the data lake to business people” -- Paolo Arvati, Big
Data, CSI-Piemonte
We Run Natively On Hadoop And Integrate With Existing Tools
Workflow of Enabling Self-Service Analytics With Hortonworks
Hortonworks Atlas And Ranger
• Smart Data Discovery: profiling, sensitive data and data lineage discovery, automated tagging
• Data Stewardship: curate tags
• Self-Service Data Catalog: find, collaborate, and take action
• Data Prep, Analytics & Visualization
• Metadata, tags, and data lineage are exchanged with Atlas; roles and access control are managed with Ranger
A Successful Data Lake
Right Platform + Right Data + Right Interface
Come to Booth 303 to see a demoand talk to us about your data lake
Come to the Atlas session at 4:00 PM on Thursday in room 210C
Waterline Data: The Smart Data Catalog Company