TRANSCRIPT
The Role of Hadoop in BI and Data Warehousing
Colin White
BI Research
Sept. 16, 2014
Sponsor
Speakers
Chad Meley
VP of Product and
Services Marketing,
Teradata
Colin White
President,
BI Research
Colin White
President, BI Research
TDWI-Teradata Web Seminar
September 2014
The Role of Hadoop
in BI and Data Warehousing
Webinar Objectives
Review Hadoop Trends and Directions
Look at the Role of Hadoop in BI and Data Warehousing
Discuss Approaches for Integrating Hadoop into the Existing
BI/DW Environment
Copyright BI Research, 2014 5
Hadoop Origins: Apache
Source: Microsoft
“A framework for running applications on large clusters built of commodity hardware.”
wiki.apache.org/hadoop/
Focus was on programmatic and batch-oriented applications that processed
large amounts of multi-structured data (the original “big data”)
Systems were deployed by assembling Apache components or using Hadoop
distributions from companies such as Cloudera, Hortonworks and MapR
Hadoop Today: Cloudera & Hortonworks Examples
Source: Cloudera
Source: Hortonworks
Hadoop Today: MapR Example
Hadoop Today: The Component Wars
Component (and Founder)
Vendor Support
Cloudera MapR Amazon IBM Hortonworks
Impala (Cloudera) ✔ ✔ ✔ X X
Hue (Cloudera) ✔ ✔ X X ✔
Sentry (Cloudera) ✔ ✔ X ✔ X
Flume (Cloudera) ✔ ✔ X ✔ ✔
Parquet (Cloudera/Twitter) ✔ ✔ X ✔ X
Sqoop (Cloudera) ✔ ✔ ✔ ✔ ✔
Ambari (Hortonworks) X X X X ✔
Knox (Hortonworks) X X X X ✔
Tez (Hortonworks) X X X X ✔
Drill (MapR) X ✔ X X X
Source: Cloudera
Hadoop Today: Enterprise Integration Example
Source: Hortonworks
Hadoop Today: Summary
Hadoop ecosystem is growing rapidly
– has moved beyond batch MR
processing to support a wide range of
different application use cases
Classic software and hardware
vendors have joined the race to
support the use of Hadoop for use
on-premises and in the cloud
Many of these classic vendors
use distributions from “open source”
suppliers
Many leading-edge and traditional businesses have Hadoop
projects in evaluation mode and also some in production –
most of these projects are focused on specific LOB solutions
Hadoop Today: Key Questions
Why use Hadoop?
• Replace existing enterprise systems
• Enhance existing enterprise systems
What are the use cases for Hadoop?
What are the TCO considerations for Hadoop?
How mature is the Hadoop ecosystem?
What are the skill requirements for Hadoop?
Which Hadoop solution should we use?
How do we integrate Hadoop with existing
systems?
Driving Forces Behind Big Data and Hadoop
Big Data DRIVERS and TECHNOLOGIES:
• New business insights
• Reduced costs
• New technologies
• Enhanced data management
• Advanced analytics
• New deployment options
New Business Insights: Customer Marketing
Situational 1-to-1 Marketing – reach individual
customers with the right messages and offers
• Micro-segmentation
• Analyze all channels: web, stores, call centers,
purchases, buying patterns
• Analyze other information for influential factors:
geography, weather
Customer experience management – make all
experiences beneficial to customer/business
Customer perception management – analyze
trends in social channels and respond
appropriately
In all cases analysts need to be able to move
from analyzing past events to predicting future
outcomes
New Business Insights: Fraud Detection
New Business Insights: The Internet of Things
Further reading: GE Document - Industrial Internet: Pushing the Boundaries of Minds and Machines
New Technologies: eXtended Data Warehouse
Traditional EDW environment
Investigative computing platform
Data refinery
Data integration platform
Operational real-time environment
RT analysis platform
Other internal & external structured & multi-structured data
Real-time streaming data
Analytic tools & applications
Operational systems
RT BI services
Two Important New Components
Data refinery
Investigative computing platform
Analytic tools & applications
Investigative Computing Platform
o Used for exploring data and
developing new analyses and
analytic models
o Output used by an enterprise DW, real-time analysis engine, or stand-alone LOB application
o May employ RDBMS or Hadoop
Data Refinery
o Ingests raw detailed data in batch
and/or real-time into a managed
data store
o Distills the data into useful
information and distributes results
to other systems
o Primary use of Hadoop today
[Diagram flows: other internal & external data and RT streaming data feed the data refinery; EDW data and operational data feed the investigative computing platform; outputs include EDW data & analyses, models & rules, and applications]
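The refinery's ingest-distill-distribute flow described above can be sketched in a few lines of Python. This is a minimal illustration only; the record format and the downstream "targets" are invented for the example and do not correspond to any specific Hadoop tooling.

```python
# Minimal sketch of a data refinery flow: ingest raw detailed data into a
# managed store, distill it into useful information, distribute the results.
from collections import defaultdict

def ingest(raw_lines):
    """Parse raw comma-separated events into dicts (the managed data store)."""
    store = []
    for line in raw_lines:
        user, page, ts = line.strip().split(",")
        store.append({"user": user, "page": page, "ts": int(ts)})
    return store

def distill(store):
    """Reduce raw detail to page-view counts per user."""
    counts = defaultdict(int)
    for rec in store:
        counts[rec["user"]] += 1
    return dict(counts)

def distribute(summary, targets):
    """Push distilled results to downstream consumers (here: callables)."""
    for target in targets:
        target(summary)

raw = ["alice,/home,1000", "alice,/buy,1005", "bob,/home,1010"]
results = []
distribute(distill(ingest(raw)), [results.append])
print(results[0])  # {'alice': 2, 'bob': 1}
```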
The Role of Investigative Computing
Enables data scientists and analysts to blend new types
of data with existing information to discover ways of
improving business processes
Allows data scientists and analysts to experiment with
different types of data and analytics before committing
to a particular solution
May employ an analytic sandbox, analytic platform or a data refinery
Results may include data schemas, analyses, analytic models, business
rules, decision workflows, dashboards, LOB applications, etc.
Represents a shift in the way organizations build analytic solutions:
o Increases flexibility and provides faster time to value because data does not
have to be modeled or integrated into an EDW before it can be analyzed
o Extends traditional business decision making with solutions that increase the
use and business value of analytics throughout the enterprise
Teradata Example: Identify/Retain “At Risk” Users
Hadoop captures,
stores and transforms
social, images, and call records
Aster does web
sessionization, path and basic
sentiment analysis
with multi-structured
data
Data Sources
Multi-Structured Raw Data
Call Center Voice Records
Traditional Data Flow
Analysis + Marketing Automation (Customer Retention Campaign)
Capture, Retain and Refine Layer
ETL Tools
Hadoop
Call Data
Teradata
Integrated DW
Dimensional Data
Analytic Results
Aster Discovery Platform: Raw Sentiment Data
Sources: SOCIAL FEEDS, POS, Web Sale, Cust & Item Master, Mobile Sale, Surveys and Customer Feedback, WEB AND MOBILE CLICKSTREAM, Customer Feedback
Aster pre-built operators:
sessionization, n-path, many to
many basket and affinity,
collaborative filtering for
recommendations
Source: Teradata
Hadoop Today: Key Questions Revisited - 1
Why use Hadoop?
• Replace existing enterprise systems ✖
• Enhance existing enterprise systems ✔
What are the use cases for Hadoop?
• Data refinery (including archiving)
• Investigative computing platform for analyzing large volumes of
raw data (especially multi-structured data) for specific LOB
solutions
What are the TCO considerations for Hadoop?
• Need to consider more than just hardware and software costs
• Other factors include training, development, administration and
support costs, and floor space and utility requirements
Hadoop Today: Key Questions Revisited - 2
How mature is the Hadoop ecosystem?
• Still immature (especially in the areas of governance and systems
management), but improving rapidly
What are the skill requirements for Hadoop?
• Despite increasing SQL support, Hadoop still requires highly
technical skills in areas such as large-scale Linux and Java
Which Hadoop solution should we use?
• Hadoop is not a single product but a set of different components
that satisfy a variety of requirements
• Choice is between traditional and “open source” vendors
How do we integrate Hadoop with existing systems?
• A key issue – build an eXtended data warehouse infrastructure
Bottom Line: A Lot Has Changed in a Year!
Cloudera: “Enterprise Data Hub Complements the Ecosystem”
Relational and NoSQL Database Enterprise Data Hub
Data Applications
Data Sources
Custom Applications
Hortonworks: “HDP is Deeply Integrated in the Data Center”
MapR: “Optimized Data Architecture”
The Role of Hadoop in BI and Data Warehousing
Chad Meley, VP of Product and Services Marketing, Teradata
25 9/15/2014 Teradata Confidential
Key Trends
Economics have changed, increasing the amount of data you can capture
Tools have changed, expanding the types of analyses
Framework has evolved so you can use the right tool for the right job
TERADATA UNIFIED DATA ARCHITECTURE: System Conceptual View
SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
DATA PLATFORM: HADOOP (Access, Manage, Move)
INTEGRATED DATA WAREHOUSE: TERADATA DATABASE
INTEGRATED DISCOVERY PLATFORM: ASTER DATABASE
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
NoSchema Advantages in Hadoop
• Raw data format provides complete flexibility
• Non-traditional data types easily supported (graph, text, weblog, etc.)
• NoETL approach provides agility
• Late-binding gives more power to the data scientist
Load data, and figure it out later
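The "load data, figure it out later" idea can be illustrated with a small schema-on-read sketch in Python. The log line layout and field names below are assumptions made for the example, not from any Hadoop component.

```python
# Hedged sketch of schema-on-read / late binding: the raw file is loaded
# untouched, and structure is applied only when a question is asked.
raw_log = [
    "2014-09-16T10:00:01 alice GET /index.html 200",
    "2014-09-16T10:00:05 bob GET /missing 404",
]

def parse(line):
    # Late binding: interpret the record at query time, not at load time.
    ts, user, method, path, status = line.split()
    return {"ts": ts, "user": user, "method": method,
            "path": path, "status": int(status)}

# "Figure it out later": a new question only needs a new parse/filter,
# never a reload or a schema migration.
errors = [parse(l)["path"] for l in raw_log if parse(l)["status"] >= 400]
print(errors)  # ['/missing']
```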
NoSQL Advantages
• Flexibility in choice of programming languages
• Leverage existing programming skill sets
• Not constrained to SQL set processing model
• More natural framework for manipulating non-traditional data types
• Efficiency for parallelization of complex processing (e.g., image processing, text parsing, etc.)
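The procedural, non-SQL processing model listed above can be sketched as a toy map/reduce pair in plain Python (standing in for Hadoop's Java MapReduce API); the smart-meter records are invented for illustration.

```python
# Illustrative map/reduce pair in plain Python: procedural, non-SQL set
# processing over records, with an explicit shuffle/sort stand-in.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    for rec in records:                           # emit (key, value) pairs
        yield (rec["device"], rec["value"])

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))      # shuffle/sort stand-in
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))     # aggregate per key

readings = [{"device": "m1", "value": 5}, {"device": "m2", "value": 3},
            {"device": "m1", "value": 2}]
print(dict(reduce_phase(map_phase(readings))))  # {'m1': 7, 'm2': 3}
```

In a real Hadoop job the map and reduce phases run in parallel across the cluster; the point here is only the programming model.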
Weblogs: Click 1, Click 2, Click 3, Click 4 {user, page, time}
Sales Transactions: Purchase 1, Purchase 2, Purchase 3, Purchase 4 {user, product, time}
Smart Meters: Reading 1, Reading 2, Reading 3, Reading 4 {device, value, time}
Stock Tick Data: Tick 1, Tick 2, Tick 3, Tick 4 {stock, price, time}
Call Data Records: Call 1, Call 2, Call 3, Call 4 {user, number, time}
Example: Pattern Matching Analysis
Discover patterns in rows of sequential data.
MapReduce Approach: • Single-pass of data • Time series analysis • Gap recognition
Traditional SQL Approach: • Full Table Scans • Self-Joins for sequencing • Limited operators for ordered data
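A minimal single-pass sessionization in Python illustrates the MapReduce-style approach contrasted above: one scan of ordered clicks with gap recognition, instead of SQL self-joins. The 30-minute gap threshold and the click data are assumptions for the sketch.

```python
# Single-pass sessionization sketch: a new session starts when the gap
# between consecutive clicks for a user exceeds a threshold.
def sessionize(clicks, gap=1800):
    """clicks: list of (user, page, time) sorted by (user, time).
    Returns session counts per user."""
    sessions, last = {}, {}
    for user, page, t in clicks:
        if user not in last or t - last[user] > gap:   # gap recognition
            sessions[user] = sessions.get(user, 0) + 1
        last[user] = t
    return sessions

clicks = [("alice", "/a", 0), ("alice", "/b", 60),
          ("alice", "/c", 4000), ("bob", "/a", 10)]
print(sessionize(clicks))  # {'alice': 2, 'bob': 1}
```

A SQL formulation of the same question typically needs a self-join (or window functions) to pair each click with its predecessor, which is what the slide means by "self-joins for sequencing."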
Unlock New Insights with Late Binding
Early Binding
Data Providers:
> Evaluate Data (Quality, Structure, Source(s), Meaning)
> Define Data Structure (Data Model, Data Type, Rules)
> Collect Data (Ingest)
> Apply Structure (Transform to defined structure)
Data Consumers:
> Author Questions (Translate questions into scripts)
Ideal for…
• Reused & Known Data
• Consistent Results
• The Masses

Late Binding
Data Providers:
> Collect Data (Ingest)
Data Consumers:
> Evaluate Data (Quality, Structure, Source(s), Meaning)
> Define Data Structure (Data Model, Data Type, Rules)
> Apply Structure & Author Questions (Transform to defined structure and translate questions into a single script)
Ideal for…
• Unfamiliar & Unknown Data
• Infrequent usage
• Unstable source schema
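The contrast between the two binding styles can be sketched as two ingest paths in Python; the sensor feed format below is a made-up example. Early binding applies the schema once at ingest, late binding stores the data verbatim and applies a schema per question.

```python
# Early binding: schema applied at ingest; queries see typed rows.
def early_bind_ingest(lines):
    table = []
    for line in lines:
        device, value, ts = line.split(",")      # schema applied now
        table.append((device, float(value), int(ts)))
    return table

# Late binding: ingest stores raw, untyped lines.
def late_bind_ingest(lines):
    return list(lines)

# ...and each question applies whatever structure it needs.
def late_bind_query(raw, min_value):
    out = []
    for line in raw:                              # schema applied now
        device, value, ts = line.split(",")
        if float(value) >= min_value:
            out.append(device)
    return out

feed = ["m1,21.5,1000", "m2,19.0,1001"]
print(early_bind_ingest(feed)[0])   # ('m1', 21.5, 1000)
print(late_bind_query(late_bind_ingest(feed), 20.0))  # ['m1']
```

If the source schema changes, only the late-binding query functions need editing; the early-binding path forces a reload or migration, which is the trade-off the slide describes.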
MPP RDBMS and HADOOP – Right Tool for the Right Job
MPP RDBMS is cost advantaged for: Development Costs • Maintenance Costs • Usage Costs
HADOOP is cost advantaged for: Acquisition Costs • Development Costs • Usage Costs
MPP RDBMS is the right tool for the job when
there are increases in number of:
Analyses (Concurrency, Throughput, SLAs, ANSI SQL - ease and maturity)
Integrated Data Sources (High IO, Access Complexity for joins, groupings, seeks)
Reuse of Data (schema-on-write, business rule changes, governance)
And with needs for:
Fine Grain Security
Data Quality and Integrity
High Availability
Fast Response Times
HADOOP is the right tool for the job when there are
decreases in Analyses, Integrated Data Sources & Reuse of Data, plus increases in:
Data Variety (no schema, evolving schema, sparse data)
High Intensity, Batch Computation (High CPU)
Logic Complexity (Procedural Language Processing in Parallel)
And with needs for:
Extreme Data Ingest Rates
Open Source Development
Teradata QueryGrid™
• Multiple Teradata Systems: Teradata Database (IDW) and Teradata Aster Database (Discovery), with SQL, SQL-MR and SQL-GR
• Push-down to Hadoop System
• Push-down to Other Databases (RDBMS)
• Push-down to NoSQL Databases (MongoDB): Future
• Compute Cluster: run SAS, Perl, Ruby, Python, R
• Users: business users and data scientists
Key Principles | Teradata EDW | Teradata UDA

Data is More Valuable when Integrated
• Teradata EDW: Full Parallelism • Data Modeling IP • Experience • JSON Data Type
• Teradata UDA: Orchestration between EDW and Hadoop • QueryGrid Push Down Processing • Unity

Atomic Data Yields More Insights than Summary Data
• Teradata EDW: Scale out Architecture • Intelligent Memory • AJIs & PPIs • Hybrid Columnar
• Teradata UDA: EDW + 1:1 Interactions in Hadoop

Results Increase as Analytical Capabilities Mature from Reporting → Analyzing → Predicting → Operationalizing
• Teradata EDW: Parallel Set Processing • In-Database Analytics • Temporal • Geospatial • High Availability / Dual Active • Tactical Queries
• Teradata UDA: Parallel Set Processing + Parallel Procedural Programming • Streaming with Analytics • HBase, MongoDB • Value Add Hadoop engineering for Reliability

Amazing Things Happen when Data is Democratized Throughout the Enterprise
• Teradata EDW: ANSI SQL • Workload Management • Optimizer • Data Labs • High Concurrency
• Teradata UDA: EDW + Aster SQL Map Reduce
Architectural Principles
Technology continues to evolve, but the principles remain the same
Questions and Answers
Contact Information
If you have further questions or comments: Colin White, BI Research [email protected]
Chad Meley, Teradata