Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models

Building Scalable Data Platforms: Hadoop and Netezza Deployment Models

Krishnan Parasuraman, Netezza

Greg Rokita, Edmunds.com

Hadoop World 2011

Talking Points

• Building scalable data platforms
  – Architectural considerations

• Hadoop and Massively Parallel Databases
  – Similarities and differences
  – Usage patterns

• Practitioner’s Viewpoint
  – Edmunds.com data warehouse platform


Building scalable data platforms
Typical Digital Media Information Processing Pipeline

Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations

Real Time Decision Engine
• Display Ads
• Recommendation
• Personalized Content

Data Processing
• Correlate
• Structure
• Consolidate

Analytics and Optimization
• Scoring
• Yield optimization
• Audience Analytics

Reporting
• Aggregate
• Summarize
• Ad-hoc analysis


Building scalable data platforms

Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations

DATA PLATFORM
• Data Processing
• Analytics and Optimization
• Reporting


Building scalable data platforms


Data Processing


Reporting

Workloads
• Real Time • High Concurrency • Transactional • High Throughput
• High Velocity • Linearly Scalable • Disk bound
• Cached Queries • Low Latency • High Concurrency
• Compute intensive • Full table scans • Disk bound

Data
• Structured • Unstructured • Key-Value pairs
• Structured • Unstructured • Machine Generated
• Mostly Structured • Some unstructured
• Structured • Relational

Capability
• Stream Processing • Memory resident • Key based lookups
• Low Disk I/O • Fast Processing • Low Cost/TB
• In-DB computation • SQL and MR • Analytic Libraries
• OLAP • Columnar


NoSQL Databases

Hadoop

Graph DB

Massively Parallel DB

Plain Ole’ DB on steroids

In-Memory DB


Myth

A single technology will meet all the considerations for our scalable data platform needs

Best Practices

Workloads scale differently – Monolithic architectures don’t work

Minimize components – Data movement is painful

Understand tradeoffs – Performance, Price, Effort

Start with the core architecture and work in the edge cases


Massively parallel data warehouses

[Diagram labels: Hosts (Host controllers); Network fabric; Massively parallel compute nodes, each with FPGA, Memory, and CPU; Distributed Storage; SQL and MR.]


Hadoop

[Diagram labels: Master Node (Name Node, Job Tracker); Network fabric; Parallel compute nodes, each a Data Node with a Task Tracker; Distributed Storage; Map Reduce.]


There are striking similarities…

• Highly Available
• Scalable
• Execute code & algorithms next to data
• Massive parallelism

But also key differences

Hadoop
• Data loading = file copy ("Look Ma, No ETL")
• Schema on Read – data loading is fast
• Batch mode data access
• Lower cost of data storage
• Process unstructured data

Netezza
• Optimized for performance
• Real time access, random reads, query optimizer, co-located joins
• SQL and Map Reduce
• Hardware accelerated queries
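The "schema on read" point can be illustrated with a minimal sketch in plain Python. It uses no Hadoop or Netezza API. The sample records, field names, and function names are all hypothetical and chosen only to show the tradeoff: with schema on read the load is a plain copy and the schema is applied per query, while with schema on write rows are validated once at load time.

```python
import json

# Hypothetical raw click events, as they might land in a file.
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cars", "ms": "n/a"}',            # bad type for "ms"
    '{"user": "u3", "page": "/cars", "ms": 80, "ref": "ad"}',  # extra field
]

def load_schema_on_read(lines):
    """'Load' is just a copy: nothing is parsed or validated up front."""
    return list(lines)

def query_schema_on_read(store, page):
    """The schema is applied at query time; values that do not fit are skipped."""
    total = 0
    for line in store:
        rec = json.loads(line)
        if rec.get("page") == page and isinstance(rec.get("ms"), int):
            total += rec["ms"]
    return total

def load_schema_on_write(lines):
    """Rows are checked against a fixed (user, page, ms) schema at load time."""
    table, rejected = [], []
    for line in lines:
        rec = json.loads(line)
        if isinstance(rec.get("ms"), int):
            table.append((rec["user"], rec["page"], rec["ms"]))
        else:
            rejected.append(line)
    return table, rejected

def query_schema_on_write(table, page):
    """Queries run against rows that are already typed."""
    return sum(ms for _, p, ms in table if p == page)
```

Both query functions give the same total for a page. The difference is where the parsing and validation cost is paid: on every query in the first case, once per load in the second.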


These differences create opportunities for Hadoop to co-exist in a Netezza environment

1. Scalable ETL engine
   – Complex data
   – Relationships not defined
   – Evolving schema

2. Queryable Archive
   – Moving computation is cheaper than moving data

3. Analytics sandbox
   – Exploratory analysis


Netezza-Hadoop: Deployment Patterns

unstructured data

semi-structured data

structured data

Create context (classification, text mining)

Analyze

Parse, aggregate

Analyze, report

Analyze, report

Active archival

Long running queries


Pattern 1: Data Processing Engine (ETL)

[Diagram labels: Raw Weblogs; Hadoop Cluster (NameNode/JobTracker and three DataNode/TaskTracker nodes); Netezza Environment.]
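A minimal sketch of what the data processing step in Pattern 1 might look like, assuming the raw weblogs are in Common Log Format and that the goal is a per-day page view count ready for a bulk load into the warehouse. This is plain Python that imitates a map step and a reduce step; it is not Hadoop API code. The log format, the choice to count only status 200 requests, the pipe-delimited output, and all function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed input: Common Log Format, for example
# 1.2.3.4 - - [10/Oct/2011:13:55:36 -0700] "GET /cars HTTP/1.1" 200 2326
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

def map_line(line):
    """Map step: parse one raw line into ((day, path), 1); skip unusable lines."""
    m = LOG_RE.match(line.strip())
    if m is None or m.group("status") != "200":
        return []
    day = m.group("ts").split(":", 1)[0]  # "10/Oct/2011"
    return [((day, m.group("path")), 1)]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (day, path) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def to_load_rows(totals):
    """Format the aggregates as pipe-delimited rows for a bulk load."""
    return [f"{day}|{path}|{n}" for (day, path), n in sorted(totals.items())]

def run_etl(lines):
    """Run map, then reduce, then format, over an iterable of raw lines."""
    pairs = [kv for line in lines for kv in map_line(line)]
    return to_load_rows(reduce_counts(pairs))
```

Malformed lines are dropped in the map step rather than failing the job, which suits raw, loosely structured input of the kind the slide describes.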


Pattern 2: Low cost storage and dynamic provisioning

[Diagram labels: Netezza Environment; Amazon Cloud with Amazon S3 and Elastic MapReduce; steps numbered 1, 2, 3.]


Pattern 3: Queryable Archive

[Diagram labels: Data Sources; Netezza Environment; steps numbered 1, 2, 3.]

No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.

Edmunds.com and Scale

o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
o 15 million unique visitors
o 210 million page views
o 1 million+ new inventory items per day
o 2 TB of new data every month
o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets



Edmunds Proposition

We have developed an iterative approach to data warehouse development that has cut the time it takes us to deliver reports to our users from months to weeks.



How did we do it?

o Process
o Technology
o Understanding of Value



Process: agile approach

o Continuous and fast delivery of new features
o Collaboration between users and developers
o Make new data available quickly and inexpensively
o Quick problem resolution
o No wasting of entire development cycle if data is not useful
o Encouragement of exploration and creation of new applications



Process

Post-process:
• Filtered
• Transformed
• Modeled as star schema
• Optimized
• Slow turn-around
• High retention
• Fast performance

Pre-process:
• Complete
• Raw
• Modeled as source data
• Generically loaded
• Quick turn-around
• Low retention
• Slower performance



Post-Process Sandbox

Prototype

• Yes: Develop Optimized Pipeline. Data is confirmed to be useful; effort is warranted.
• No: Discard. Prevents shadow production; little effort lost.



Technology


Edmunds Publishing System




Generic flow for pre-process

Generic, written once



What architecture enables a generic consumer?

o Message
o Delivery
o Routing
o Persistence
o Durability
o Retries
o Throttling
o Versioning
o Monitoring

ActiveMQ

Camel

Thrift



Flexibility for Producers and Consumers: Support for Topologies

Field         Example Values                  Purpose

Environment   PROD, TEST, DEV                 Promotion cycle of deployment units
Index         Blue, Green, Stage              Environment index
Data Center   LAX1, EC2                       The data center where the deployment unit is located
Site          Edmunds, Insideline             Company’s product
Application   HBase, Digital Asset Manager    Deployment unit



Producer-Consumer matching

Producer

Consumer

Prod / Lax / Edmunds / Inventory

Prod, Test / Lax, EC2 / Edmunds / Dealer

Prod / Lax, EC2 / Edmunds / Inventory

Test / EC2 / Edmunds / Dealer

BrokerDestinationInterceptor

PublishInventory

PublishInventory

Virtual Topic Name

QueueName

Match!
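The matching idea on this slide can be sketched as follows. This is not the BrokerDestinationInterceptor's actual logic, which the slides do not show. It assumes that the single-valued row (Prod, Lax, Edmunds, Inventory) is the producer, that the multi-valued rows are consumers, and that a consumer matches when every producer field falls inside the set of values the consumer accepts. The class names and the consumer labels `c1` to `c3` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Producer:
    environment: str
    data_center: str
    site: str
    application: str

@dataclass(frozen=True)
class Consumer:
    name: str                 # hypothetical label, not from the slides
    environments: frozenset
    data_centers: frozenset
    sites: frozenset
    applications: frozenset

def matches(producer, consumer):
    """True when every producer field is in the consumer's accepted set."""
    return (
        producer.environment in consumer.environments
        and producer.data_center in consumer.data_centers
        and producer.site in consumer.sites
        and producer.application in consumer.applications
    )

def matching_consumers(producer, consumers):
    """Return the names of all consumers that accept this producer."""
    return [c.name for c in consumers if matches(producer, c)]

# Values taken from the slide; the producer/consumer assignment is an assumption.
PRODUCER = Producer("Prod", "Lax", "Edmunds", "Inventory")
CONSUMERS = [
    Consumer("c1", frozenset({"Prod", "Test"}), frozenset({"Lax", "EC2"}),
             frozenset({"Edmunds"}), frozenset({"Dealer"})),
    Consumer("c2", frozenset({"Prod"}), frozenset({"Lax", "EC2"}),
             frozenset({"Edmunds"}), frozenset({"Inventory"})),
    Consumer("c3", frozenset({"Test"}), frozenset({"EC2"}),
             frozenset({"Edmunds"}), frozenset({"Dealer"})),
]
```

Under these assumptions only `c2` accepts the producer: `c1` fails on the application field and `c3` fails on environment, data center, and application.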



HBase: how to handle data generically

Column Family

Binary Discrete Type 2

Columns Serialized Thrift Object

Hashcode of the Thrift Object

Thrift Object

Field 1



Start Date

End Date

List of fields

Role System of record

Check if updates are necessary (optimization)

Versioning at the most granular level for lookups

Versioning for optimized dimension tables
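A minimal sketch of the two ideas labeled on this slide: using a hash of the serialized object to check whether an update is necessary, and keeping versions with a start date and an end date. It assumes "Type 2" refers to the usual slowly changing dimension style of versioning. An in-memory dict stands in for HBase, raw bytes stand in for the serialized Thrift object, and SHA-1 stands in for whatever hashcode the real system stores; all names are hypothetical.

```python
import hashlib
from datetime import date

def content_hash(serialized: bytes) -> str:
    """Stable digest of the serialized object (stand-in for the stored hashcode)."""
    return hashlib.sha1(serialized).hexdigest()

def upsert(history, key, serialized, today):
    """Append a new version only if the content changed; return True if written.

    history maps key -> list of version dicts holding start_date, end_date,
    hash, and payload. The current version has end_date None.
    """
    versions = history.setdefault(key, [])
    new_hash = content_hash(serialized)
    if versions and versions[-1]["hash"] == new_hash:
        return False                      # unchanged: skip the write
    if versions:
        versions[-1]["end_date"] = today  # close the previous version
    versions.append({"start_date": today, "end_date": None,
                     "hash": new_hash, "payload": serialized})
    return True

def as_of(history, key, day):
    """Return the payload that was current on the given day, or None."""
    for v in history.get(key, []):
        if v["start_date"] <= day and (v["end_date"] is None or day < v["end_date"]):
            return v["payload"]
    return None
```

Comparing hashes avoids rewriting a record whose content has not changed, and the start and end dates let a lookup return the version that was current on any given day.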




Netezza: Time is Money

Compared to Oracle: Up to 12x faster load times
Business Value:
• Can reload data more frequently
• Failed workflows are no longer a big problem
• Helps in transition to real time system: we can now create intraday reports for Leads!

Compared to Oracle: Up to 400x faster query times
Business Value:
• More productive Business Intelligence
• Queries that could ‘never’ finish in Oracle are now providing business value



Generic and reusable Oozie actions for Netezza




Value

o Data warehouse proves product value both internally and to our customers

o Failing fast and quick turnaround allow us to know when we are building the right reporting and analytical products without a large up-front investment

o By combining all data in a single system we are enabling new products to be developed that we previously could not


Building Scalable Data Platforms: Hadoop and Netezza Deployment Models

Krishnan Parasuraman, @kparasuraman

Greg Rokita, Edmunds.com
