© 2016 Impetus Technologies - Confidential
Swimming Across the Data Lake
Lessons Learned and Keys to Success
Impetus Technologies Inc.
Our 40 Minutes Today
• Critical Trends – Lessons Learned
• Solving the Big Data “DILEMMA”
– Data Democratization
– Enterprise Metadata Management
– Data Access
– Self Service BI
• Migrating Workloads
– Lift and Shift Automation
State of Play – Big Data
[Figure: Hadoop in Gartner’s Hype Cycle, Year 2013 and Year 2014]
Source Credit: http://www.cioinsight.com/it-strategy/big-data/slideshows/big-problems-for-big-data
State of Play – Big Data - 2015
2015: Gartner moves Big Data out of Hype Cycle - it is REAL
[Figure: Hype Cycle chart; labels: Year 2013, Hadoop]
“Through 2018, 90% of Hadoop installations will be useless as they are overwhelmed with information assets captured for uncertain use cases”
“Enterprises today are realizing about 15% of potential ROI on BI investments”
Blueprint for a Modern Data Architecture
[Diagram; labels listed by group]
• Sources: Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Landing and ingestion
• Enterprise Data Lake
• Traditional data repositories: RDBMS, MPP
• Predictive applications, Exploration & discovery, Enterprise applications, Real-Time applications
• Provisioning, Workflow, Monitoring and Security
• Governance, Information Lifecycle, Enterprise Meta Data Management
Don’t throw your users into the Data Lake!
Creating a data lake is only the beginning…
The Data Lake “DILEMMA”
D – Data
• Ingestion and Storage
• Governance
• Security & Compliance
IL – Information Lifecycle Management
• Lineage
EMM – Enterprise Metadata Management
• Meta data discovery
• Ontology
A – Access
• Query Performance
• Search data
Effective use of the Data Lake as a true enterprise data reservoir introduces new challenges. We call these the Data Lake “DILEMMA”. Addressing them helps avoid turning the lake into a “data swamp”, which would inhibit or slow enterprise adoption.
Critical Trends – Planning for Success
“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases”
Critical Trends – Planning for Success
Making insights and data in the lake readily discoverable, accessible and usable
“Visual data-discovery, an important enabler of end user self-service, will grow 2.5x faster than the rest of the market, becoming by 2018 a requirement for all enterprises.”
The Path to Democracy: Data Discovery
Don’t tell them what they need. Help your stakeholders find what works best for them.
The Path to Democracy: Data Discovery
• Identification of unknown data
• Consolidation of enterprise data dictionary
• Metadata capture throughout the data lifecycle
• Search tools to help users find what they need
• Tools to browse/sample available data sets
• Collaboration tools for users to share their data insights
The Path to Democracy: Data Accessibility
Lower adoption barriers for your stakeholders. Getting the data they want should be fast and easy.
The Path to Democracy: Data Accessibility
• Easy-to-use access request mechanisms
• Monitored access approval workflows
• Business rules for decisioning automation
• Fast data provisioning, integrated with approval workflow
• Automated data provisioning mechanisms
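As a rough illustration of business rules for decisioning automation tied to provisioning, here is a minimal Python sketch. The sensitivity labels, role names, and the `AccessRequest`, `decide`, and `provision` names are hypothetical and are not taken from any Impetus product.

```python
from dataclasses import dataclass

# Hypothetical roles that may be approved without a human in the loop.
# A real deployment would load these from its own governance policy.
AUTO_APPROVE_ROLES = {"analyst", "data_scientist"}


@dataclass
class AccessRequest:
    user: str
    role: str
    dataset: str
    sensitivity: str  # assumed labels: "public", "internal", "restricted"


def decide(request: AccessRequest) -> str:
    """Return 'approve', 'review', or 'deny' for an access request."""
    if request.sensitivity == "public":
        return "approve"  # anyone may read public data
    if request.sensitivity == "internal":
        # Known roles are approved automatically; others go to a human.
        return "approve" if request.role in AUTO_APPROVE_ROLES else "review"
    if request.sensitivity == "restricted":
        return "review"  # restricted data always needs a human approver
    return "deny"  # unknown sensitivity labels are rejected


def provision(request: AccessRequest) -> dict:
    """Tie provisioning to the decision so access is granted only on approval."""
    decision = decide(request)
    return {
        "user": request.user,
        "dataset": request.dataset,
        "decision": decision,
        "provisioned": decision == "approve",
    }
```

Requests that come back as "review" would be routed to the monitored approval workflow; only "approve" triggers provisioning in this sketch.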
The Path to Democracy: Data Usability
Get the most out of your data lake. Make it simple to use.
The Path to Democracy: Data Usability
• Virtualize data views to hide storage platform complexity
• Provide business user friendly façades to technical tools
• Help users visualize data
• Link data views to business entities
• Help users access the data lake through familiar paradigms
• Wrap analytics algorithms into easy to use tools
Metadata Repository
Metadata Management – Unified Single View
• Automatic schema ingestion from heterogeneous data sources
• Data dictionary and catalog entity tagging
• Search for anything in the data catalog (e.g., free text search)
• Incremental update and automatic metadata synchronization
• Define and manage the lifecycle of data catalog entities
• Apply machine learning algorithms for attribute identification, mappings, and entity consolidation
• Domain experts analyse, approve and customize the results, and provide suggestions to the model
• Domain based dictionary for attributes matching, classification and building composite
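To make schema ingestion, tagging, and free-text search concrete, here is a small in-memory sketch in Python. The `Catalog` and `CatalogEntry` classes and the sample data set names are hypothetical. A production metadata repository would persist entries and use a real search index.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    source: str
    columns: dict            # column name -> type
    tags: set = field(default_factory=set)


class Catalog:
    def __init__(self):
        self.entries = {}

    def ingest_schema(self, name, source, columns):
        """Register or refresh a data set's schema; existing tags are kept."""
        existing = self.entries.get(name)
        tags = existing.tags if existing else set()
        self.entries[name] = CatalogEntry(name, source, dict(columns), tags)

    def tag(self, name, *tags):
        """Attach lower-cased tags to an already registered data set."""
        self.entries[name].tags.update(t.lower() for t in tags)

    def search(self, text):
        """Case-insensitive substring match over names, sources, columns, tags."""
        needle = text.lower()
        hits = []
        for entry in self.entries.values():
            haystack = [entry.name, entry.source, *entry.columns, *entry.tags]
            if any(needle in item.lower() for item in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = Catalog()
catalog.ingest_schema("orders", "teradata", {"order_id": "int", "customer_id": "int"})
catalog.ingest_schema("clickstream", "hdfs", {"session": "string", "url": "string"})
catalog.tag("orders", "Sales")
print(catalog.search("customer"))
```

Calling `ingest_schema` again for the same name stands in for incremental update: the column list is replaced while the tags added by users survive.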
Access Big Data
• SQL: Specific Query & Reporting
• OLAP: Cross Dimensional Fast Slice Dice and Drill Down
• Data Virtualization: Data from MPP, Relational and Hadoop
• Search: Finding the “Needle in a Haystack”
• Self Service Data Discovery: “Don’t Know What You Don’t Know”
Critical Trends – Planning for Success
Self Service BI over Data Lake
“By 2017, most business users and analysts in organizations will have access to self-service tools to prepare data for analysis”
“Managed BI Self-Service Will Continue to Close the Business and Technology Gap.”
Steps to Effective Self Service BI
• Provision Cluster
• Discover and Blend New Sources
• Data Access and Exploration
• Ingest and Transform data
• Security and Governance
• BI, Analytics and Models
Blueprint for a Modern Data Architecture
[Diagram; labels listed by group]
• Sources: Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Landing and Ingestion
• Enterprise Data Lake
• Traditional Data Repositories: RDBMS, MPP
• Data Federation/Virtualization
• Exploration & Discovery, Data Wrangling, Real-time Applications
• Accelerators
• Provisioning, Workflow, Monitoring and Governance
• Enterprise Meta Data Management
Why are customers Migrating Workloads to Hadoop?
• Free up capacity to contain costs
– Immediate ROI on Hadoop
– Contain expenditure on relational warehouse
• Create a multi-platform data warehouse environment
– One of the strongest tactics in data architecture today
– Create an “Adjunct” to the relational warehouse platform
• Get a platform better suited to advanced analytics
• Set up for success
Manual migrations often come with million-dollar price tags, and years of business logic must be recoded, debugged, and vetted in Hadoop (RISK)
Challenges to Migrate Workloads to Hadoop
• Extremely Complicated Process
  – Manual Identification of Workloads to Migrate
    • What Tables, Data, Queries, Dependent Queries
  – Figuring out the Data Model on Hadoop
    • Where to store on Hadoop?
    • Hadoop best practices are not known/missing
  – Manual Migration
    • Time-consuming and risky
    • Offload/Migration Validation/QA problem
• Technology Readiness
  – ANSI standard SQL and other complex relational technologies are not fully supported on Hadoop
  – Support for DW specific keywords and data types
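The data type gap can be pictured with a small Python sketch that maps warehouse column types to Hive types and flags anything it cannot resolve. The mapping table below is an illustrative assumption, not a vetted or complete list, and `map_columns` is a hypothetical helper. Length and precision are dropped for brevity, which a real migration could not afford to do.

```python
# Illustrative, incomplete mapping from warehouse-specific column types to
# Hive types. These pairs are assumptions for the sketch, not a vetted list.
TYPE_MAP = {
    "VARCHAR2": "STRING",
    "CLOB": "STRING",
    "NUMBER": "DECIMAL",
    "BYTEINT": "TINYINT",
    "DATETIME": "TIMESTAMP",
    "BIT": "BOOLEAN",
}
# Types assumed to carry over unchanged.
PASS_THROUGH = {"INT", "BIGINT", "SMALLINT", "DOUBLE", "DATE", "TIMESTAMP", "STRING"}


def map_columns(columns):
    """Split {column: source_type} into (mapped, unresolved)."""
    mapped, unresolved = {}, []
    for name, source_type in columns.items():
        # Drop any "(length)" or "(precision, scale)" suffix before lookup.
        base = source_type.upper().split("(")[0].strip()
        if base in TYPE_MAP:
            mapped[name] = TYPE_MAP[base]
        elif base in PASS_THROUGH:
            mapped[name] = base
        else:
            unresolved.append(name)  # needs a manual decision
    return mapped, unresolved


mapped, unresolved = map_columns(
    {"id": "NUMBER(10)", "name": "varchar2(50)", "shape": "ST_GEOMETRY", "day": "date"}
)
print(mapped, unresolved)
```

The unresolved list is the interesting output: it is the part of the schema that still needs a human decision before offload.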
“4 click” Paradigm
Connect to supported Data Warehouses
• Teradata
• SQL Server
• Oracle
• Netezza
• DB2
• SAP Hana and More on Roadmap
Full Intelligent Assessment & Identification of “Offload-able” Entities
– Analyze and Recommend offloadable queries and tables
– Recommend Query Engine to meet SLAs
– Recommend Data Store
Tool Sets for Automated Migration
• Data Migration
– Recommendation for data partitioning, clustering and buckets
– Migrate role based security
– Data Validation
• Workload Transformation
– Impetus UDF Library to support Source specific keywords
– Automatic conversion of SQL and PL/SQL scripts
• Workload Execution
– Support for Multiple Query Engine – HiveQL and SparkSQL
– Schedule execution for migrated code
Validate Migrated Workloads
• Establish functional equivalence
• Meet or exceed SLAs
– Support for HiveQL and SparkSQL
– Support Hive on Tez and Hive on Spark Engines
– Built-in recommendation for partitioning, clustering and number of buckets based on dataset
– Optimized parallelism (number of mappers) based on data source size
– Scale out on commodity hardware
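One simple way to think about establishing functional equivalence is to compare an order-independent fingerprint of the result set from the source warehouse with the one from Hadoop. The Python sketch below is only an illustration of that idea, not the method used by the Impetus tool. The `fingerprint` and `equivalent` names are hypothetical, and the sketch assumes both sides return values that already render to the same text (type normalisation is left out).

```python
import hashlib


def fingerprint(rows):
    """Order-independent summary of a result set: (row count, combined digest)."""
    combined = 0
    count = 0
    for row in rows:
        # Render each value as text; NULL gets its own marker so it does
        # not collide with an empty string.
        text = "\x1f".join("\x00" if v is None else str(v) for v in row)
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Summing digests modulo 2**256 ignores row order but, unlike XOR,
        # does not cancel out duplicated rows.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, combined


def equivalent(source_rows, target_rows):
    """True when both result sets hold the same rows, in any order."""
    return fingerprint(source_rows) == fingerprint(target_rows)


warehouse = [(1, "a"), (2, "b")]
hadoop = [(2, "b"), (1, "a")]
print(equivalent(warehouse, hadoop))
```

Because engines such as Hive and Spark do not guarantee row order without an ORDER BY, an order-independent comparison avoids false mismatches. A mismatch here only says the sets differ, so a row-level diff would still be needed to find out where.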
Impetus Enabling the Modern Analytical Platform
[Diagram; labels listed by group]
• Sources: Structured, Unstructured, External, Social, Machine, Geospatial, Time Series, Streaming
• Landing and Ingestion
• Enterprise Data Lake
• Traditional Data Repositories: RDBMS, MPP
• Data Federation/Virtualization
• Exploration & Discovery, Data Wrangling, Real-time applications
• Accelerators
• Provisioning, Workflow, Monitoring and Governance
• Enterprise Meta Data Management
• Overlay labels: KYVOS INSIGHTS, DATA BLENDING, WORKLOAD MIGRATION, METADATA & DISCOVERY, DATA GOVERNANCE (for Hadoop), DATA ACCESS, STREAM ANALYTIX
Thank you. Questions?