Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models

Building Scalable Data Platforms: Hadoop and Netezza Deployment Models
Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com)

Upload: krishnan-parasuraman

Post on 25-Jun-2015


DESCRIPTION

Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.

TRANSCRIPT

Page 1: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building Scalable Data Platforms
Hadoop and Netezza Deployment Models

Krishnan Parasuraman, Netezza

Greg Rokita, Edmunds.com

Page 2: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Talking Points

• Building scalable data platforms
  – Architectural considerations

• Hadoop and Massively Parallel Databases
  – Similarities and differences
  – Usage patterns

• Practitioner’s View Point
  – Edmunds.com data warehouse platform

Page 3: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building scalable data platforms
Typical Digital Media Information Processing Pipeline

Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations

Real Time Decision Engine
• Display Ads
• Recommendation
• Personalized Content

Data Processing
• Correlate
• Structure
• Consolidate

Analytics and Optimization
• Scoring
• Yield optimization
• Audience Analytics

Reporting
• Aggregate
• Summarize
• Ad-hoc analysis

Page 4: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building scalable data platforms

[Diagram labeled DATA PLATFORM. Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations. Stages: Real Time Decision Engine, Data Processing, Analytics and Optimization, Reporting.]

Page 5: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building scalable data platforms

Stage | Real Time Decision Engine | Data Processing | Analytics and Optimization | Reporting

Workloads | Real Time, High Concurrency, Transactional, High Throughput | High Velocity, Linearly Scalable, Disk bound | Cached Queries, Low Latency, High Concurrency | Compute intensive, Full table scans, Disk bound

Data | Structured, Un-Structured, Key-Value pairs | Structured, Un-Structured, Machine Generated | Mostly Structured, Some unstructured | Structured, Relational

Capability | Stream Processing, Memory resident, Key based lookups | Low Disk I/O, Fast Processing, Low Cost/TB | In-DB computation, SQL and MR, Analytic Libraries | OLAP, Columnar

Page 6: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building scalable data platforms

[Same Workloads / Data / Capability matrix as on the previous slide, with candidate technologies added.]

• NoSQL Databases
• Hadoop
• Graph DB
• Massively Parallel DB
• Plain Ole’ DB on steroids
• In-Memory DB

Page 7: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Myth

A single technology will meet all the considerations for our scalable data platform needs

Best Practices

Workloads scale differently – Monolithic architectures don’t work

Minimize components – Data movement is painful

Understand tradeoffs – Performance, Price, Effort

Start with the core architecture and work in the edge cases

Page 8: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Massively parallel data warehouses

[Architecture diagram]
• Hosts (host controllers)
• SQL and MR
• Network fabric
• Massively parallel compute nodes: three shown, each with an FPGA, memory and a CPU
• Distributed storage

Page 9: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Hadoop

[Architecture diagram]
• Master node: Name Node, Job Tracker
• Map Reduce
• Network fabric
• Parallel compute nodes: three shown, each with a Data Node and a Task Tracker
• Distributed storage

Page 10: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

There are striking similarities…

• Highly available
• Scalable
• Execute code & algorithms next to data
• Massive parallelism

[Diagram: Hadoop cluster (Name Node, Job Tracker, three Data Node / Task Tracker pairs) running Map Reduce.]

Page 11: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

But also key differences

Hadoop
• Data loading = file copy ("Look Ma, No ETL")
• Schema on Read – Data loading is fast
• Batch-mode data access
• Lower cost of data storage
• Process unstructured data

Netezza
• Optimized for performance
• Real-time access, random reads, query optimizer, co-located joins
• SQL and Map Reduce
• Hardware-accelerated queries
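The "schema on read" point can be illustrated with a small sketch. Here loading is a plain copy of raw text, and structure is applied only when the data is read. The log layout, field names and function names are hypothetical, chosen for illustration; this is plain Python, not a Hadoop API.

```python
import csv
import io

# Hypothetical raw click log: tab-separated timestamp, user id, URL.
RAW_LOG = (
    "2011-11-08T10:00:00\tu1\t/cars/honda\n"
    "2011-11-08T10:00:05\tu2\t/cars/toyota\n"
    "malformed line\n"
)


def load_raw(store, raw_text):
    """Schema on read: 'loading' is a plain copy, with no parsing or validation."""
    store.append(raw_text)


def read_with_schema(store, fields):
    """Apply a schema at query time; rows that do not fit are skipped."""
    for blob in store:
        for row in csv.reader(io.StringIO(blob), delimiter="\t"):
            if len(row) == len(fields):
                yield dict(zip(fields, row))


store = []
load_raw(store, RAW_LOG)  # fast: nothing is parsed here
rows = list(read_with_schema(store, ["ts", "user", "url"]))
print(rows)
```

The trade-off is that the cost of parsing and of handling bad records moves from load time to every read, which fits batch access better than low-latency queries.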

Page 12: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

These differences lead to opportunities for co-existence for Hadoop in a Netezza environment

1. Scalable ETL engine
  – Complex data
  – Relationships not defined
  – Evolving schema

2. Queryable Archive
  – Moving computation is cheaper than moving data

3. Analytics sandbox
  – Exploratory analysis

Page 13: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Netezza-Hadoop: Deployment Patterns

• Unstructured data: create context (classification, text mining); analyze
• Semi-structured data: parse, aggregate; analyze, report
• Structured data: analyze, report
• Active archival
• Long running queries

Page 14: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Pattern 1: Data Processing Engine (ETL)

[Diagram components]
• Raw weblogs
• Hadoop cluster: NameNode / JobTracker, and three DataNode / TaskTracker nodes
• Netezza environment
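A minimal sketch of the ETL pattern: raw weblog lines are parsed and aggregated in a map step and a reduce step, and the result is shaped into delimited rows that a warehouse bulk loader could ingest. The log layout, the function names and the pipe-delimited output are assumptions for illustration. The code runs in a single process as a stand-in for a MapReduce job and does not use any Hadoop or Netezza API.

```python
from collections import defaultdict

# Hypothetical raw weblog lines in a simplified "ip path status" layout.
RAW_WEBLOGS = [
    "10.0.0.1 /new-cars 200",
    "10.0.0.2 /used-cars 200",
    "10.0.0.1 /new-cars 404",
    "garbage",
    "10.0.0.3 /new-cars 200",
]


def map_line(line):
    """Map step: parse one raw line and emit (path, 1) for successful hits."""
    parts = line.split()
    if len(parts) == 3 and parts[2] == "200":
        yield parts[1], 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def to_load_rows(totals):
    """Shape the aggregate as delimited rows for a bulk load into the warehouse."""
    return [f"{path}|{hits}" for path, hits in sorted(totals.items())]


pairs = (pair for line in RAW_WEBLOGS for pair in map_line(line))
load_rows = to_load_rows(reduce_counts(pairs))
print(load_rows)
```

Only the small, structured aggregate crosses into the warehouse; the bulky raw lines stay on the cheaper storage tier.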

Page 15: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Pattern 2: Low cost storage and dynamic provisioning

[Diagram components, with three numbered steps (1, 2, 3)]
• Netezza environment
• Amazon Cloud: Amazon S3, Elastic MapReduce

Page 16: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Pattern 3: Queryable Archive

[Diagram components, with three numbered steps (1, 2, 3)]
• Data sources
• Netezza environment

Page 17: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.

Edmunds.com and Scale

o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
o 15 million unique visitors
o 210 million page views
o 1 million+ new inventory items per day
o 2 TB of new data every month
o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets

Page 18: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Edmunds Proposition

We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.

Page 19: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


How did we do it?

o Process
o Technology
o Understanding of Value

Page 20: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Process: agile approach

o Continuous and fast delivery of new features
o Collaboration between users and developers
o Make new data available quickly and inexpensively
o Quick problem resolution
o No wasting of entire development cycle if data is not useful
o Encouragement of exploration and creation of new applications

Page 21: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Process

Post-process:
• Filtered
• Transformed
• Modeled as star schema
• Optimized
• Slow turn-around
• High retention
• Fast performance

Pre-process:
• Complete
• Raw
• Modeled as source data
• Generically loaded
• Quick turn-around
• Low retention
• Slower performance

Page 22: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Post-Process Sandbox

[Decision flow starting from "Prototype"]

o Yes: Develop Optimized Pipeline
  – data is confirmed to be useful
  – effort is warranted

o No: Discard
  – prevents shadow production
  – little effort lost

Page 23: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Technology


Page 24: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Edmunds Publishing System

Page 25: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Generic flow for pre-process

Generic, written once

Page 26: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


What architecture enables a generic consumer?

o Message
o Delivery
o Routing
o Persistence
o Durability
o Retries
o Throttling
o Versioning
o Monitoring

ActiveMQ

Camel

Thrift

Page 27: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Flexibility for Producers and Consumers: Support for Topologies

Field | Example Values | Purpose

Environment | PROD, TEST, DEV | Promotion cycle of deployment units

Index | Blue, Green, Stage | Environment index

Data Center | LAX1, EC2 | The data center where the deployment unit is located

Site | Edmunds, Insideline | Company’s product

Application | HBase, Digital Asset Manager | Deployment unit

Page 28: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Producer-Consumer matching

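[Diagram: Producer and Consumer topologies, listed as Environment | Data Center | Site | Application]

o Prod | Lax | Edmunds | Inventory
o Prod, Test | Lax, EC2 | Edmunds | Dealer
o Prod | Lax, EC2 | Edmunds | Inventory
o Test | EC2 | Edmunds | Dealer

BrokerDestinationInterceptor

Virtual Topic Name: PublishInventory

QueueName: PublishInventory

Match!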
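The matching on this slide can be sketched as a set-membership test over the topology fields. The sketch assumes that the single-valued tuple (Prod, Lax, Edmunds, Inventory) is the producer, that the other three tuples are consumers, and that a consumer matches when each producer field is among the values the consumer accepts. The class and function names are hypothetical, and this is not ActiveMQ's BrokerDestinationInterceptor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Producer:
    environment: str
    data_center: str
    site: str
    application: str


@dataclass(frozen=True)
class Consumer:
    # Each field holds the set of values this consumer accepts.
    environments: frozenset
    data_centers: frozenset
    sites: frozenset
    applications: frozenset


def matches(producer, consumer):
    """A consumer matches when every producer field is in the accepted set."""
    return (
        producer.environment in consumer.environments
        and producer.data_center in consumer.data_centers
        and producer.site in consumer.sites
        and producer.application in consumer.applications
    )


producer = Producer("Prod", "Lax", "Edmunds", "Inventory")
consumers = [
    Consumer(frozenset({"Prod", "Test"}), frozenset({"Lax", "EC2"}),
             frozenset({"Edmunds"}), frozenset({"Dealer"})),
    Consumer(frozenset({"Prod"}), frozenset({"Lax", "EC2"}),
             frozenset({"Edmunds"}), frozenset({"Inventory"})),
    Consumer(frozenset({"Test"}), frozenset({"EC2"}),
             frozenset({"Edmunds"}), frozenset({"Dealer"})),
]

# Indexes of the consumers that should receive the producer's messages.
matched = [i for i, c in enumerate(consumers) if matches(producer, c)]
print(matched)
```

Under these assumptions only the second consumer matches, which is consistent with the single "Match!" on the slide.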

Page 29: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


HBase: how to handle data generically

Column Family | Binary | Discrete | Type 2

Columns | Serialized Thrift object; hashcode of the Thrift object | Thrift object field 1; Thrift object field 2; Thrift object field 3 | Start date; end date; list of fields

Role | System of record; check if updates are necessary (optimization) | Versioning at the most granular level for lookups | Versioning for optimized dimension tables
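The "check if updates are necessary" optimization can be sketched as follows: keep a hash next to the serialized object and skip the write when an incoming object hashes to the same value. The slide does not show how the hashcode is computed, so SHA-1 over the serialized bytes is an assumption here. The class is an in-memory stand-in for the Binary column family, not the HBase client API, and all names are hypothetical.

```python
import hashlib


def content_hash(serialized):
    """Hash of the serialized object (assumed: SHA-1 of the raw bytes)."""
    return hashlib.sha1(serialized).hexdigest()


class BinaryFamilyStore:
    """In-memory stand-in: row key -> (serialized object, hash)."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def put_if_changed(self, row_key, serialized):
        """Write only when the content hash differs from the stored one."""
        new_hash = content_hash(serialized)
        existing = self.rows.get(row_key)
        if existing is not None and existing[1] == new_hash:
            return False  # identical content: skip the write
        self.rows[row_key] = (serialized, new_hash)
        self.writes += 1
        return True


store = BinaryFamilyStore()
first = store.put_if_changed("modelyear:1368", b"celica-1993")
second = store.put_if_changed("modelyear:1368", b"celica-1993")
third = store.put_if_changed("modelyear:1368", b"celica-1993-gts")
print(first, second, third, store.writes)
```

Comparing a short hash avoids deserializing and diffing the stored object on every incoming message.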

Page 30: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Generic Thrift Persistence in HBase

Column Name = Value

[ModelYear]|F:id|T:long|I:0 = 1368
[ModelYear]|F:midYear|T:boolean|I:1 = false
[ModelYear]|F:year|T:int|I:2 = 1993
[ModelYear]|F:name|T:java.lang.String|I:4 = Celica
[ModelYear]#[attributss][0]|F:_key|T:java.lang.Long = 64
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 = Standard Sport
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 = V:GT-S 2dr Hatchback
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 = 441
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 = V:GT-S
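A sketch of how a nested object could be flattened into self-describing column names of this shape. It works on plain Python dicts and lists rather than Thrift objects, uses Python type names instead of Java ones, and leaves out the `I:` component because the slide does not define it. The function name and the sample record are hypothetical, with values borrowed loosely from the slide.

```python
def flatten(obj, path):
    """Yield (column_name, value) pairs for a nested dict.

    Collections (dicts or lists) must contain dicts; each element adds a
    '#[field][key-or-index]' segment to the path.
    """
    for field, value in obj.items():
        if isinstance(value, dict):
            # Map-like collection: one path segment per key.
            for key, child in value.items():
                yield from flatten(child, f"{path}#[{field}][{key}]")
        elif isinstance(value, list):
            # List-like collection: one path segment per index.
            for index, child in enumerate(value):
                yield from flatten(child, f"{path}#[{field}][{index}]")
        else:
            # Leaf: encode the field name and type in the column name.
            yield f"{path}|F:{field}|T:{type(value).__name__}", value


# Simplified, hypothetical record.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"id": 441, "value": "V:GT-S"}],
}

columns = dict(flatten(model_year, "[ModelYear]"))
for name, value in columns.items():
    print(name, "=", value)
```

Because the path and type are carried in the column name, a generic consumer can persist any object without a table-specific mapping.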

Page 31: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Netezza: Time is Money

Compared to Oracle | Business Value

Up to 12x faster load times | Can reload data more frequently; failed workflows are no longer a big problem; helps in the transition to a real-time system: we can now create intraday reports for Leads!

Up to 400x faster query times | More productive Business Intelligence; queries that could ‘never’ finish in Oracle are now providing business value

Page 32: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Generic and reusable Oozie actions for Netezza


Page 33: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models


Value

o Data warehouse proves product value both internally and to our customers

o Failing fast and quick turnaround let us know when we are building the right reporting and analytical products, without a large up-front investment

o By combining all data in a single system we are enabling new products to be developed that we previously could not


Page 34: Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models

Building Scalable Data Platforms
Hadoop and Netezza Deployment Models

Krishnan Parasuraman, @kparasuraman

Greg Rokita, Edmunds.com