Optimizing Your Modern Data Architecture with Attunity, RCG Global Services and Hortonworks
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Optimizing the Modern Data Architecture with Attunity, Hortonworks and RCG Global Services
We do Hadoop.
Page 2
Speakers
Hortonworks
◦ Adis Cesir, Big Data Solution Engineer
RCG Global Services
◦ Ramu Kalvakuntla, Principal, Big Data Practice
Attunity
◦ Santosh Chitakki, Director of Product Management
Page 3
Partnership
Strategy and Solution Delivery
Hadoop Distribution, Support and Training
Any Data, Anywhere, Anytime
RCG Global Services, Hortonworks and Attunity are partnering to provide an EDW optimization solution that delivers real financial benefits by effectively implementing Apache Hadoop to augment current EDW platforms.
Page 4
Traditional systems under pressure
Challenges
• Can’t manage new data
• Constrains data to app
• Costly to scale
[Figure: business value of traditional data (ERP, CRM, SCM) versus new data (clickstream, geolocation, web data, Internet of Things, docs and emails, server logs). Data volume grows from 2.8 zettabytes in 2012 to 40 zettabytes in 2020. The chart contrasts industry leaders with laggards.]
Page 5
A Typical EDW Faces Three Challenges
1. Data Storage: storing cold data or throwing data away
2. Processing Capacity: wasting processing cycles on low value workloads
3. New Data Sources: unable to capture and use new data
[Diagram: Analytics (data marts, business analytics, visualization & dashboards) sit on top of Data Systems (systems of record: RDBMS, ERP, CRM, other). New Sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) sit outside the EDW. Callouts 1, 2 and 3 mark the three challenges.]
Page 6
Most EDWs Are Used Inefficiently
In a typical EDW*:
1. Data Storage:
– More than 50% of data is unused
2. Processing Capacity:
– 55% of CPU capacity is ETL
– 35% of CPU consumed by ETL is to load unused data
– 30-40% of CPU is consumed by only 5% of ETL workloads
[Diagram: Analytics (data marts, business analytics, visualization & dashboards) on top of Data Systems (systems of record: RDBMS, ERP, CRM, other), with data classed as hot, warm or cold.]
Why pay first class price for economy data?
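The processing-capacity figures on this slide can be combined into a rough estimate of how much total EDW CPU goes to loading data that nobody queries. The sketch below is a back-of-the-envelope calculation only. It assumes the 35% figure is a fraction of ETL CPU rather than of total CPU, and the function name is illustrative, not something from the slides.

```python
def cpu_share_loading_unused(etl_share: float, unused_share_of_etl: float) -> float:
    """Fraction of total EDW CPU spent loading unused data.

    etl_share: fraction of total CPU consumed by ETL (0.55 on the slide).
    unused_share_of_etl: fraction of ETL CPU that loads unused data
    (0.35 on the slide, under the assumption stated above).
    """
    return etl_share * unused_share_of_etl


share = cpu_share_loading_unused(0.55, 0.35)
print(f"About {share:.0%} of total CPU is spent loading unused data")
```

Under that reading, roughly a fifth of the warehouse's CPU is spent on data that is never read, which is the kind of workload the deck proposes moving to Hadoop.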
Page 7
Optimization: Realize Cost Savings with HDP
Archive data away from the EDW
• Move cold or rarely used data to Hadoop as active archive
• Store more data longer
Offload costly ETL processes
• Free your EDW to perform high-value functions like analytics & operations, not ETL
• Use Hadoop for advanced ELT
Enrich the value of your EDW
• Use Hadoop to refine new data sources, such as web and machine data, for new analytical context
HDP helps you reduce costs and optimize the value associated with your EDW
[Diagram: sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) and existing systems (ERP, CRM, SCM) feed a Hadoop cluster built on HDFS (Hadoop Distributed File System) and YARN: Data Operating System, which runs interactive, real-time, batch and partner ISV workloads alongside an MPP EDW. Analytics (data marts, applications, business analytics, visualization & dashboards) sit on top.]
Page 8
Challenge with traditional Architecture
• Time spent understanding source data and defining destination structure
• High latency between data generation and availability
• Incapable/high complexity when dealing with loosely structured data
• No linear scale
• High license cost
• Large code footprint
• Data discarded due to cost or performance
• Low or no visibility into transactional data
• EDW used as an ETL tool with 100s of staging tables
[Diagram: Source Layer (DB, structured data), then ETL / ELT (data collection & processing), then EDW (integration, storage & business view), then ETL into Data Marts (business / department specific).]
Page 9
Offload/Archive/Process – Hadoop based Platform
• Store transactional data
• Retain 7+ years of data (Hot archive)
• Data Lineage – ability to store intermediate data sets
• Becomes an analytics platform for data scientists
• Linearly scalable commodity hardware
• Massively parallel compute and storage
• Support for any type of data: structured or unstructured with any volume and velocity
• Data Warehouse can now focus less on storage and transformation and more on presentation
[Diagram: Source Layer (DB, structured data, clickstream, social, geo, sensor, server logs, unstructured) feeds Hadoop nodes 1 to N (data collection, integration, storage and processing: integrate, transform, archive, enrich), which feed the EDW and Data Marts.]
Page 10
Optimization Customer Stories
Archive: TrueCar stores data on millions of car purchases at $0.12 per GB with HDP, well below the $19 per GB possible with other solutions.
Offload: Luminar cut its ETL processing times from 3 days to 3 hours with HDP, quickly refreshing its models with new customer transaction data.
Enrich: ZirMed enriches its EDW with new data, including pharmacy receipts, text messages, and patient web searches.
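To put the archive story's per-GB prices in perspective, the sketch below compares the two rates for a given volume. Only the $0.12 and $19 per GB figures come from the slide. The 100 TB volume, the decimal terabyte conversion and the function name are assumptions made for illustration.

```python
GB_PER_TB = 1000  # decimal terabytes, assumed here to keep the numbers round


def storage_cost(terabytes: float, dollars_per_gb: float) -> float:
    """Total storage cost in dollars for a volume at a flat per-GB price."""
    return terabytes * GB_PER_TB * dollars_per_gb


volume_tb = 100  # hypothetical archive volume, not a figure from the deck
hadoop_cost = storage_cost(volume_tb, 0.12)  # slide figure for HDP
other_cost = storage_cost(volume_tb, 19.0)   # slide figure for other solutions
ratio = other_cost / hadoop_cost

print(f"HDP: ${hadoop_cost:,.0f}  other: ${other_cost:,.0f}  ratio: {ratio:.0f}x")
```

At these rates the gap is a little over two orders of magnitude regardless of volume, since both prices are assumed to scale linearly.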
Page 11
Hadoop Driver: Enabling the Data Lake
Data Lake Definition
• Centralized Architecture: Multiple applications on a shared data set with consistent levels of service
• Any App, Any Data: Multiple applications accessing all data affording new insights and opportunities.
• Unlocks ‘Systems of Insight’: Advanced algorithms and applications used to derive new value and optimize existing value.
Drivers:
1. Cost Optimization
2. Advanced Analytic Apps
Goal:
• Centralized architecture
• Data-driven business
[Figure: Journey to the Data Lake with Hadoop, plotted as scale versus scope and ending at the data lake and Systems of Insight.]
Page 12
Modern Data Architecture
• Reduce cost and improve performance by off-loading EDW data and processing to the Hortonworks Data Platform (HDP)
• Implement a platform that scales incrementally using low cost hardware and software
• Support unstructured, semi-structured and structured data in a single analytics platform
• Enable superior analytic capabilities providing insight that is not possible to achieve from their current environments
• Provide seamless access to data for analysis and business applications
Page 13
Solution Model - Modern Data Architecture
EDW Optimization Roadmap: Identify offload candidates, create architectural blueprint, implementation roadmap, business case and ROI
EDW Optimization Implementation: Execute data and ETL/ELT off-load, active archive, implement data ingestion and data service
Data Value Realization: Provide insight, data in motion, advanced analytics, information value creation, and visualization
Enterprise Enablement: Enterprise access, enriched data sources, service orchestration and data virtualization
Page 14
EDW Optimization – Roadmap and Analysis
Current State Analysis
• Assess current reporting, ELT/ETL, and analytical processes
• Review logical and physical data models
• Assess current technical architecture
Data Usage Analysis
• Analyze data usage
• Identify under-utilized schemas, tables / columns, and data
• Identify off-load opportunities
Workload Analysis
• Analyze EDW workload: reads vs. writes, ETL vs. ELT, analytical vs. batch SQL, CPU consumption, CPU utilization
Blueprint & Roadmap
• Prioritize opportunities
• Define future Hadoop architecture and capacity needs
• Develop implementation plan
• Create business case / ROI
• Create and review Executive Summary with clients
[Timeline: Current State Analysis, Data Usage Analysis, Workload Analysis, and Blueprint & Roadmap scheduled across Week 1 to Week 4.]
Page 15
EDW Optimization – Implementation
Data Off-load
• POC / Reference Implementation (if needed)
• Install / expand HDP cluster
• Analyze off-load data sets
• Automate data ingestion
• Implement active archiving
Process Off-load
• Migrate EDW ETL/ELT workload to Hadoop
• De-normalize data to optimize performance
• Load Hadoop ETL/ELT output data back into EDW
Data Services
• Provide data virtualization for data transparency across Hadoop and MPP databases
• Build business services for reporting and enterprise applications
Analysis & Reporting
• Provide schema-on-read for direct business analysis
• Migrate resource intensive analysis to Hadoop
• Connect analysis and visualization tools to Hadoop
[Timeline: Data Off-load, Process Off-load, Data Services, and Analysis & Reporting scheduled across Month 1, Month 2, Month 3 and beyond.]
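Schema-on-read, mentioned in the implementation activities on this slide, means raw data is stored as-is and a structure is applied only when the data is read. The sketch below illustrates the idea in plain Python rather than with any Hadoop tool. The pipe-delimited layout, the field names and the `read_with_schema` helper are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

# A schema maps field names (in column order) to a function that casts raw text.
Schema = Dict[str, Callable[[str], object]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema,
                     sep: str = "|") -> Iterator[dict]:
    """Parse raw delimited lines at read time, casting each field per the schema."""
    names = list(schema)
    for line in raw_lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != len(names):
            continue  # malformed rows are skipped at read time, not rejected at load
        yield {name: schema[name](value) for name, value in zip(names, fields)}


# Raw data is stored untouched; the third row does not fit the schema below.
raw = ["1001|2015-03-02|19.99", "1002|2015-03-03|5.00", "garbled row"]
sales_schema: Schema = {"order_id": int, "order_date": str, "amount": float}

rows = list(read_with_schema(raw, sales_schema))
print(rows)
```

Because the schema lives with the reader, a different analysis can apply a different schema to the same stored lines without reloading anything.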
Page 16
Data Warehouse Optimization - An Iterative Process
• Identify low-hanging fruit
• Get buy-in from stakeholders
• Plan and implement in increments
• Continuously assess and iterate
Page 17
Attunity Visibility Data Usage Analysis (Sample)
• Unused Data (e.g. Tables with no ‘SELECT’ statements)
70 Terabytes in Unused Databases
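The unused-data criterion on this slide (tables with no SELECT statements) can be approximated from a query log. The sketch below is not how Attunity Visibility works. It is a simplified illustration that assumes a plain list of SQL strings and uses a crude regular expression instead of a real SQL parser. The table names and the helper function are hypothetical.

```python
import re
from typing import Iterable, Set

# Crude pattern: the identifier that follows FROM or JOIN.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
_HAS_SELECT = re.compile(r"\bSELECT\b", re.IGNORECASE)


def tables_without_select(all_tables: Iterable[str],
                          query_log: Iterable[str]) -> Set[str]:
    """Return tables that are never read by any statement containing SELECT."""
    read_tables: Set[str] = set()
    for sql in query_log:
        if _HAS_SELECT.search(sql):
            read_tables.update(name.lower() for name in _TABLE_REF.findall(sql))
    return {t for t in all_tables if t.lower() not in read_tables}


tables = ["sales", "returns", "audit_2009"]
log = [
    "SELECT * FROM sales s JOIN returns r ON s.id = r.sale_id",
    "INSERT INTO audit_2009 VALUES (1)",  # written to, but never read
]
unused = tables_without_select(tables, log)
print(unused)
```

Tables that are loaded but never selected from, like `audit_2009` here, are the first candidates for offloading.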
Page 18
Attunity Visibility Data Usage Analysis (Sample)
• History of data used in large “Fact” table
• Queries go back only 2 years
• Maintains 8 years of data
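The gap between what is kept (8 years) and what is queried (2 years) identifies the partitions that could move to an active archive. The sketch below assumes the fact table is partitioned by year. The year range, the reference year and the function name are hypothetical, chosen only to mirror the 8-versus-2 split on the slide.

```python
from typing import List, Tuple


def split_hot_cold(partition_years: List[int], current_year: int,
                   queried_years: int) -> Tuple[List[int], List[int]]:
    """Split yearly partitions into queried (hot) and archive candidates (cold)."""
    cutoff = current_year - queried_years  # last year that is no longer queried
    hot = [y for y in partition_years if y > cutoff]
    cold = [y for y in partition_years if y <= cutoff]
    return hot, cold


years = list(range(2008, 2016))  # eight yearly partitions, 2008 through 2015
hot, cold = split_hot_cold(years, current_year=2015, queried_years=2)
print("keep in EDW:", hot)
print("archive candidates:", cold)
```

With these inputs, six of the eight yearly partitions fall outside the queried window.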
Page 19
Attunity Visibility Workload Analysis (Sample)
Almost 60% of CPU to load and ingest data
• Intensive ETL workloads
Page 20
Attunity Visibility Workload Analysis (Sample)
The top 100 repetitive SQL statements, out of 101,000 ETL SQL statements, account for 30+% of the CPU consumed by ETL.
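A concentration figure like the one on this slide can be computed from an execution log by summing CPU per distinct statement and comparing the most expensive statements with the total. The sketch below assumes a log of (SQL text, CPU seconds) pairs. The sample statements, the numbers and the function name are hypothetical, not output from Attunity Visibility.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def top_n_cpu_share(executions: Iterable[Tuple[str, float]], n: int) -> float:
    """Share of total CPU seconds used by the n most expensive distinct statements."""
    cpu_by_sql: Dict[str, float] = defaultdict(float)
    for sql_text, cpu_seconds in executions:
        cpu_by_sql[sql_text] += cpu_seconds  # repeated runs of a statement add up
    total = sum(cpu_by_sql.values())
    if total == 0:
        return 0.0
    top = sorted(cpu_by_sql.values(), reverse=True)[:n]
    return sum(top) / total


log = [
    ("INSERT INTO stage_a SELECT * FROM src_a", 40.0),
    ("INSERT INTO stage_a SELECT * FROM src_a", 20.0),  # same statement, run again
    ("UPDATE dim_b SET flag = 1", 30.0),
    ("DELETE FROM tmp_c", 10.0),
]
share = top_n_cpu_share(log, n=1)
print(f"Top statement uses {share:.0%} of ETL CPU")
```

A small set of statements with a large share is a natural first target when deciding which ETL jobs to migrate to Hadoop.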
Page 21
Attunity Visibility – The Data Dashboard
Completely Analyze Workloads And Data Usage
Reduce Cost | Optimize Performance | Justify Investments
User Activity Data Usage Workload Performance
Page 22
RCG Success Stories
Top Retailers
• Completed EDW optimization projects for two large retailers
• Offloading cold data and ELT to Hadoop
• Cost savings projected between $6M and $10M
Top Financial Services
• Currently working with two large Fortune 100 financial companies
• Offloading 40TB to 60TB of raw data from EDW platforms to Hadoop
• Re-architecting their batch decision processing, with savings between $10M and $15M
Page 23
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop
Learn more about our partnerships
http://hortonworks.com/partner/rcg-global-services/
http://hortonworks.com/partner/attunity/
Page 24
The Largest Hadoop Community Events in Europe and North America
BRUSSELS April 15-16
• Deep-dive technical content
• 65+ sessions and 5 tracks
• 1,000 attendees
• Sponsorships available
• Including pre and post event community meetups and BOFs
• Hadoop training available
SAN JOSE June 9-11
• 100+ sessions and 7 tracks
• Deep-dive technical content
• 5,000 attendees
• Sponsorships available
• Including pre and post event community meetups and BOFs
• Hadoop training available
www.hadoopsummit.org