TRANSCRIPT
The Role of Hadoop in BI and Data Warehousing
Colin White
BI Research
Sept. 16, 2014
Sponsor
Speakers
Chad Meley
VP of Product and
Services Marketing,
Teradata
Colin White
President,
BI Research
Colin White
President, BI Research
TDWI-Teradata Web Seminar
September 2014
The Role of Hadoop
in BI and Data Warehousing
Webinar Objectives
Review Hadoop Trends and Directions
Look at the Role of Hadoop in BI and Data Warehousing
Discuss Approaches for Integrating Hadoop into the Existing
BI/DW Environment
Copyright BI Research, 2014 5
Hadoop Origins: Apache
Source: Microsoft
“A framework for running applications on large clusters built of commodity hardware.”
wiki.apache.org/hadoop/
Focus was on programmatic and batch-oriented applications that processed
large amounts of multi-structured data (the original “big data”)
Systems were deployed by assembling Apache components or using Hadoop
distributions from companies such as Cloudera, Hortonworks and MapR
Hadoop Today: Cloudera & Hortonworks Examples
Source: Cloudera
Source: Hortonworks
Hadoop Today: MapR Example
Hadoop Today: The Component Wars
Component (and Founder)
Vendor Support
Cloudera MapR Amazon IBM Hortonworks
Impala (Cloudera) ✔ ✔ ✔ X X
Hue (Cloudera) ✔ ✔ X X ✔
Sentry (Cloudera) ✔ ✔ X ✔ X
Flume (Cloudera) ✔ ✔ X ✔ ✔
Parquet (Cloudera/Twitter) ✔ ✔ X ✔ X
Sqoop (Cloudera) ✔ ✔ ✔ ✔ ✔
Ambari (Hortonworks) X X X X ✔
Knox (Hortonworks) X X X X ✔
Tez (Hortonworks) X X X X ✔
Drill (MapR) X ✔ X X X
Source: Cloudera
Hadoop Today: Enterprise Integration Example
Source: Hortonworks
Hadoop Today: Summary
Hadoop ecosystem is growing rapidly
– has moved beyond batch MR
processing to support a wide range of
different application use cases
Classic software and hardware
vendors have joined the race to
support the use of Hadoop for use
on-premises and in the cloud
Many of these classic vendors
use distributions from “open source”
suppliers
Many leading-edge and traditional businesses have Hadoop
projects in evaluation mode and also some in production –
most of these projects are focused on specific LOB solutions
Hadoop Today: Key Questions
Why use Hadoop?
• Replace existing enterprise systems
• Enhance existing enterprise systems
What are the use cases for Hadoop?
What are the TCO considerations for Hadoop?
How mature is the Hadoop ecosystem?
What are the skill requirements for Hadoop?
Which Hadoop solution should we use?
How do we integrate Hadoop with existing
systems?
Driving Forces Behind Big Data and Hadoop
Big Data DRIVERS and TECHNOLOGIES:
• New business insights
• Reduced costs
• New technologies
• Enhanced data management
• Advanced analytics
• New deployment options
New Business Insights: Customer Marketing
Situational 1-to-1 Marketing – reach individual
customers with the right messages and offers
• Micro-segmentation
• Analyze all channels: web, stores, call centers,
purchases, buying patterns
• Analyze other information for influential factors:
geography, weather
Customer experience management – make all
experiences beneficial to customer/business
Customer perception management – analyze
trends in social channels and respond
appropriately
In all cases analysts need to be able to move
from analyzing past events to predicting future
outcomes
New Business Insights: Fraud Detection
New Business Insights: The Internet of Things
Further reading: GE Document - Industrial Internet: Pushing the Boundaries of Minds and Machines
New Technologies: eXtended Data Warehouse
Traditional EDW environment
Investigative computing platform
Data refinery
Data integration platform
Operational real-time environment
RT analysis platform
Other internal & external structured & multi-structured data
Real-time streaming data
Analytic tools & applications
Operational systems
RT BI services
Two Important New Components
Data refinery
Investigative computing platform
Analytic tools & applications
Investigative Computing Platform
o Used for exploring data and
developing new analyses and
analytic models
o Output used by an enterprise DW, real-time analysis engine, or stand-alone LOB application
o May employ RDBMS or Hadoop
Data Refinery
o Ingests raw detailed data in batch
and/or real-time into a managed
data store
o Distills the data into useful
information and distributes results
to other systems
o Primary use of Hadoop today
[Diagram flows: other internal & external data and RT streaming data feed the data refinery; EDW data and operational data feed the investigative computing platform; outputs include EDW data & analyses, models & rules, and applications]
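The refinery's ingest-distill-distribute flow described above can be sketched in a few lines of Python. This is a minimal illustration only; the record format and the downstream "targets" are invented for the example and do not correspond to any specific Hadoop tooling.

```python
# Minimal sketch of a data refinery flow: ingest raw detailed data into a
# managed store, distill it into useful information, distribute the results.
from collections import defaultdict

def ingest(raw_lines):
    """Parse raw comma-separated events into dicts (the managed data store)."""
    store = []
    for line in raw_lines:
        user, page, ts = line.strip().split(",")
        store.append({"user": user, "page": page, "ts": int(ts)})
    return store

def distill(store):
    """Reduce raw detail to page-view counts per user."""
    counts = defaultdict(int)
    for rec in store:
        counts[rec["user"]] += 1
    return dict(counts)

def distribute(summary, targets):
    """Push distilled results to downstream consumers (here: callables)."""
    for target in targets:
        target(summary)

raw = ["alice,/home,1000", "alice,/buy,1005", "bob,/home,1010"]
results = []
distribute(distill(ingest(raw)), [results.append])
print(results[0])  # {'alice': 2, 'bob': 1}
```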
The Role of Investigative Computing
Enables data scientists and analysts to blend new types
of data with existing information to discover ways of
improving business processes
Allows data scientists and analysts to experiment with
different types of data and analytics before committing
to a particular solution
May employ an analytic sandbox, analytic platform or a data refinery
Results may include data schemas, analyses, analytic models, business
rules, decision workflows, dashboards, LOB applications, etc.
Represents a shift in the way organizations build analytic solutions:
o Increases flexibility and provides faster time to value because data does not
have to be modeled or integrated into an EDW before it can be analyzed
o Extends traditional business decision making with solutions that increase the
use and business value of analytics throughout the enterprise
Teradata Example: Identify/Retain “At Risk” Users
Hadoop captures,
stores and transforms
social, images, and call records
Aster does web
sessionization, path and basic
sentiment analysis
with multi-structured
data
Data Sources
Multi-Structured Raw Data
Call Center Voice Records
Traditional Data Flow
Analysis + Marketing Automation (Customer Retention Campaign)
Capture, Retain and Refine Layer
ETL Tools
Hadoop
Call Data
Teradata
Integrated DW
Dimensional Data
Analytic Results
Aster Discovery Platform: Raw Sentiment Data
Sources: SOCIAL FEEDS, POS, Web Sale, Cust & Item Master, Mobile Sale, Surveys and Customer Feedback, WEB AND MOBILE CLICKSTREAM, Customer Feedback
Aster pre-built operators:
sessionization, n-path, many to
many basket and affinity,
collaborative filtering for
recommendations
Source: Teradata
Hadoop Today: Key Questions Revisited - 1
Why use Hadoop?
• Replace existing enterprise systems ✖
• Enhance existing enterprise systems ✔
What are the use cases for Hadoop?
• Data refinery (including archiving)
• Investigative computing platform for analyzing large volumes of
raw data (especially multi-structured data) for specific LOB
solutions
What are the TCO considerations for Hadoop?
• Need to consider more than just hardware and software costs
• Other factors include training, development, administration and
support costs, and floor space and utility requirements
Hadoop Today: Key Questions Revisited - 2
How mature is the Hadoop ecosystem?
• Still immature (especially in the areas of governance and systems
management), but improving rapidly
What are the skill requirements for Hadoop?
• Despite increasing SQL support, Hadoop still requires highly
technical skills in areas such as large-scale Linux and Java
Which Hadoop solution should we use?
• Hadoop is not a single product but a set of different components
that satisfy a variety of requirements
• Choice is between traditional and “open source” vendors
How do we integrate Hadoop with existing systems?
• A key issue – build an eXtended data warehouse infrastructure
Bottom Line: A Lot Has Changed in a Year!
Cloudera: “Enterprise Data Hub Complements the Ecosystem”
Relational and NoSQL Database Enterprise Data Hub
Data Applications
Data Sources
Custom Applications
Hortonworks: “HDP is Deeply Integrated in the Data Center”
MapR: “Optimized Data Architecture”
The Role of Hadoop in BI and Data Warehousing
Chad Meley, VP of Product and Services Marketing, Teradata
25 9/15/2014 Teradata Confidential
Key Trends
Economics have changed, increasing the amount of data you can capture
Tools have changed, expanding the types of analyses
Framework has evolved so you can use the right tool for the right job
TERADATA UNIFIED DATA ARCHITECTURE: System Conceptual View
SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
DATA PLATFORM: HADOOP (Access, Manage, Move)
INTEGRATED DATA WAREHOUSE: TERADATA DATABASE
INTEGRATED DISCOVERY PLATFORM: ASTER DATABASE
ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing
USERS: Marketing Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts
NoSchema Advantages in Hadoop
• Raw data format provides complete flexibility
• Non-traditional data types easily supported (graph, text, weblog, etc.)
• NoETL approach provides agility
• Late-binding gives more power to the data scientist
Load data, and figure it out later
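The "load data, figure it out later" idea can be illustrated with a small schema-on-read sketch in Python. The log line layout and field names below are assumptions made for the example, not from any Hadoop component.

```python
# Hedged sketch of schema-on-read / late binding: the raw file is loaded
# untouched, and structure is applied only when a question is asked.
raw_log = [
    "2014-09-16T10:00:01 alice GET /index.html 200",
    "2014-09-16T10:00:05 bob GET /missing 404",
]

def parse(line):
    # Late binding: interpret the record at query time, not at load time.
    ts, user, method, path, status = line.split()
    return {"ts": ts, "user": user, "method": method,
            "path": path, "status": int(status)}

# "Figure it out later": a new question only needs a new parse/filter,
# never a reload or a schema migration.
errors = [parse(l)["path"] for l in raw_log if parse(l)["status"] >= 400]
print(errors)  # ['/missing']
```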
NoSQL Advantages
• Flexibility in choice of programming languages
• Leverage existing programming skill sets
• Not constrained to SQL set processing model
• More natural framework for manipulating non-traditional data types
• Efficiency for parallelization of complex processing (e.g., image processing, text parsing, etc.)
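The procedural, non-SQL processing model listed above can be sketched as a toy map/reduce pair in plain Python (standing in for Hadoop's Java MapReduce API); the smart-meter records are invented for illustration.

```python
# Illustrative map/reduce pair in plain Python: procedural, non-SQL set
# processing over records, with an explicit shuffle/sort stand-in.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    for rec in records:                           # emit (key, value) pairs
        yield (rec["device"], rec["value"])

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))      # shuffle/sort stand-in
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))     # aggregate per key

readings = [{"device": "m1", "value": 5}, {"device": "m2", "value": 3},
            {"device": "m1", "value": 2}]
print(dict(reduce_phase(map_phase(readings))))  # {'m1': 7, 'm2': 3}
```

In a real Hadoop job the map and reduce phases run in parallel across the cluster; the point here is only the programming model.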
Weblogs: Click 1, Click 2, Click 3, Click 4 {user, page, time}
Sales Transactions: Purchase 1, Purchase 2, Purchase 3, Purchase 4 {user, product, time}
Smart Meters: Reading 1, Reading 2, Reading 3, Reading 4 {device, value, time}
Stock Tick Data: Tick 1, Tick 2, Tick 3, Tick 4 {stock, price, time}
Call Data Records: Call 1, Call 2, Call 3, Call 4 {user, number, time}
Example: Pattern Matching Analysis
Discover patterns in rows of sequential data.
MapReduce Approach: • Single-pass of data • Time series analysis • Gap recognition
Traditional SQL Approach: • Full Table Scans • Self-Joins for sequencing • Limited operators for ordered data
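A minimal single-pass sessionization in Python illustrates the MapReduce-style approach contrasted above: one scan of ordered clicks with gap recognition, instead of SQL self-joins. The 30-minute gap threshold and the click data are assumptions for the sketch.

```python
# Single-pass sessionization sketch: a new session starts when the gap
# between consecutive clicks for a user exceeds a threshold.
def sessionize(clicks, gap=1800):
    """clicks: list of (user, page, time) sorted by (user, time).
    Returns session counts per user."""
    sessions, last = {}, {}
    for user, page, t in clicks:
        if user not in last or t - last[user] > gap:   # gap recognition
            sessions[user] = sessions.get(user, 0) + 1
        last[user] = t
    return sessions

clicks = [("alice", "/a", 0), ("alice", "/b", 60),
          ("alice", "/c", 4000), ("bob", "/a", 10)]
print(sessionize(clicks))  # {'alice': 2, 'bob': 1}
```

A SQL formulation of the same question typically needs a self-join (or window functions) to pair each click with its predecessor, which is what the slide means by "self-joins for sequencing."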
Unlock New Insights with Late Binding
Early Binding
Data Providers:
> Evaluate Data (Quality, Structure, Source(s), Meaning)
> Define Data Structure (Data Model, Data Type, Rules)
> Collect Data (Ingest)
> Apply Structure (Transform to defined structure)
Data Consumers:
> Author Questions (Translate questions into scripts)
Ideal for…
• Reused & Known Data
• Consistent Results
• The Masses

Late Binding
Data Providers:
> Collect Data (Ingest)
Data Consumers:
> Evaluate Data (Quality, Structure, Source(s), Meaning)
> Define Data Structure (Data Model, Data Type, Rules)
> Apply Structure & Author Questions (Transform to defined structure and translate questions into a single script)
Ideal for…
• Unfamiliar & Unknown Data
• Infrequent usage
• Unstable source schema
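The contrast between the two binding styles can be sketched as two ingest paths in Python; the sensor feed format below is a made-up example. Early binding applies the schema once at ingest, late binding stores the data verbatim and applies a schema per question.

```python
# Early binding: schema applied at ingest; queries see typed rows.
def early_bind_ingest(lines):
    table = []
    for line in lines:
        device, value, ts = line.split(",")      # schema applied now
        table.append((device, float(value), int(ts)))
    return table

# Late binding: ingest stores raw, untyped lines.
def late_bind_ingest(lines):
    return list(lines)

# ...and each question applies whatever structure it needs.
def late_bind_query(raw, min_value):
    out = []
    for line in raw:                              # schema applied now
        device, value, ts = line.split(",")
        if float(value) >= min_value:
            out.append(device)
    return out

feed = ["m1,21.5,1000", "m2,19.0,1001"]
print(early_bind_ingest(feed)[0])   # ('m1', 21.5, 1000)
print(late_bind_query(late_bind_ingest(feed), 20.0))  # ['m1']
```

If the source schema changes, only the late-binding query functions need editing; the early-binding path forces a reload or migration, which is the trade-off the slide describes.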
MPP RDBMS and HADOOP – Right Tool for the Right Job
MPP RDBMS is cost advantaged for: Development Costs • Maintenance Costs • Usage Costs
HADOOP is cost advantaged for: Acquisition Costs • Development Costs • Usage Costs
MPP RDBMS is the right tool for the job when
there are increases in number of:
Analyses (Concurrency, Throughput, SLAs, ANSI SQL - ease and maturity)
Integrated Data Sources (High IO, Access Complexity for joins, groupings, seeks)
Reuse of Data (schema-on-write, business rule changes, governance)
And with needs for:
Fine Grain Security
Data Quality and Integrity
High Availability
Fast Response Times
HADOOP is the right tool for the job when there are
decreases in Analyses, Integrated Data Sources & Reuse of Data, plus increases in:
Data Variety (no schema, evolving schema, sparse data)
High Intensity, Batch Computation (High CPU)
Logic Complexity (Procedural Language Processing in Parallel)
And with needs for:
Extreme Data Ingest Rates
Open Source Development
Teradata QueryGrid™
• Multiple Teradata Systems: Teradata Database (IDW) and Teradata Aster Database (Discovery), with SQL, SQL-MR and SQL-GR
• Push-down to Hadoop System
• Push-down to Other Databases (RDBMS)
• Push-down to NoSQL Databases (MongoDB): Future
• Compute Cluster: run SAS, Perl, Ruby, Python, R
• Users: business users and data scientists
Key Principles | Teradata EDW | Teradata UDA

Data is More Valuable when Integrated
• Teradata EDW: Full Parallelism • Data Modeling IP • Experience • JSON Data Type
• Teradata UDA: Orchestration between EDW and Hadoop • QueryGrid Push Down Processing • Unity

Atomic Data Yields More Insights than Summary Data
• Teradata EDW: Scale out Architecture • Intelligent Memory • AJIs & PPIs • Hybrid Columnar
• Teradata UDA: EDW + 1:1 Interactions in Hadoop

Results Increase as Analytical Capabilities Mature from Reporting → Analyzing → Predicting → Operationalizing
• Teradata EDW: Parallel Set Processing • In-Database Analytics • Temporal • Geospatial • High Availability / Dual Active • Tactical Queries
• Teradata UDA: Parallel Set Processing + Parallel Procedural Programming • Streaming with Analytics • HBase, MongoDB • Value Add Hadoop engineering for Reliability

Amazing Things Happen when Data is Democratized Throughout the Enterprise
• Teradata EDW: ANSI SQL • Workload Management • Optimizer • Data Labs • High Concurrency
• Teradata UDA: EDW + Aster SQL Map Reduce
Architectural Principles
Technology continues to evolve, but the principles remain the same
Questions and Answers
Contact Information
If you have further questions or comments: Colin White, BI Research [email protected]
Chad Meley, Teradata