marketshare big data analytics

37
© 2011 MarketShare. All Rights Reserved. Confidential & Proprietary. MarketShare Big Data Analytics An Big Data Analytics architecture for the cloud

Upload: shea

Post on 24-Feb-2016

85 views

Category:

Documents


0 download

DESCRIPTION

MarketShare Big Data Analytics. An Big Data Analytics architecture for the cloud. With and Without Buzzwords. Big Data analytics on the Cloud. >50 TB analytics pipeline on Amazon Web Services using Hadoop. MarketShare : Big Data Analytics on the Cloud. Cloud architecture evolution - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: MarketShare Big Data Analytics

© 2011 MarketShare. All Rights Reserved. Confidential & Proprietary.

MarketShare Big Data AnalyticsAn Big Data Analytics architecture for the cloud

Page 2: MarketShare Big Data Analytics

>50 TB analytics pipeline on Amazon Web Services using Hadoop

With and Without Buzzwords

Big Data analytics on the Cloud

Page 3: MarketShare Big Data Analytics

3

• Cloud architecture evolution

• A real life big data workflow on the cloud

• Fixing issues in big data workflow

MarketShare: Big Data Analytics on the Cloud

MarketShare confidential and proprietary

* Source: Interbrand 2011 report

Page 4: MarketShare Big Data Analytics

Cloud + Big Data

Page 5: MarketShare Big Data Analytics

Traditional 3 tier architecture

ApplicationServer

User

AppServer

Database Server

Master

Eliminate accessibility restrictions

Page 6: MarketShare Big Data Analytics

Moving to web based applications

ApplicationServer(s)

User

AppServer

Database Server(s)

Master

WebServer

Corporate Data Center

Laptop

StorageServer(s)

Eliminate Hardware Expenditure

Page 7: MarketShare Big Data Analytics

Moving to the cloud

User

WebServer

Amazon Web Services Cloud

Laptop

EC2 Sharded Instancewith MySQL

EC2 Instance

AppServer

EC2 Instance

Eliminate Distributed Storage

Page 8: MarketShare Big Data Analytics

Moving big data to Hadoop

User

WebServer

Amazon Web Services Cloud

Laptop

EC2 Instance

AppServer

EC2 Instance

EC2 Instancewith MySQLEliminate distributed systems mgmt. HDFS Cluster

Page 9: MarketShare Big Data Analytics

Compute Elasticity

User

WebServer

Amazon EC2 permanent Instances

Laptop

EC2 Instance

AppServer

EC2 Instance

EC2 Instancewith MySQL

Amazon Elastic MapReduce

Amazon EC2 On Demand Instances

On demand hadoop instances

Page 10: MarketShare Big Data Analytics

Storage Elasticity

User

WebServer

Amazon EC2 permanent Instances

Laptop

EC2 Instance

AppServer

EC2 Instance

EC2 Instancewith MySQL

Amazon Elastic MapReduce

Amazon EC2 On Demand Instances

Managed Database ServicesAmazon Simple Storage Service

(S3)

Amazon Managed Storage

RDS Database Instance

Page 11: MarketShare Big Data Analytics

Network Elasticity

UserLaptop

WebServer

Amazon EC2 permanent Instances

EC2 Instance

AppServer

EC2 Instance Amazon Elastic MapReduce

Amazon EC2 On Demand Instances

Network ElasticityAmazon Simple Storage Service

(S3)

Amazon Managed Storage

RDS Database Instance

Elastic LoadBalancer

WebServer

Amazon EC2 permanent Instances

EC2 Instance

AppServer

EC2 Instance

Page 12: MarketShare Big Data Analytics

Cloud = Managed Storage + Network Elasticity + On Demand Compute

Defining the Cloud

Page 13: MarketShare Big Data Analytics

An Example Big Data Workflow

Page 14: MarketShare Big Data Analytics

© 2008 jovianDATA , Proprietary & Confidential, DO NOT Copy or Distribute jovianDATATM

Customer Data

Repository

Raw Data Repositor

y

Generic Schema Funnel LFIS

LUIS

FTP

Transformation

Validation

Sequencing Validation

Summarization

Validation

transformation

Report Generation

and Visualization

StatsStatistical Analysis

Raw Data Repository

SCP

hadoop fs -put

GUID propagation

Cleaning & truncation

Joining with dimensions

Transform in Generic Schema

Provision Cluster

Terminate Cluster

Generic Schema--Exception Handling!<cmd>

Add Nodes

Rebalance Partitions

Restart

MR job not launched

Mapred job timeout 95% done

Hadoop stopped responding

Provision Cluster

Estimate Cluster Size

Verify Cluster Topology

Fix Cluster & Deploy Binaries

....................................

....................................

-- GUID Propagation

CREATE TABLE userid_guid_map AS SELECT distinct user_id, guid from dbact_guid;

CREATE TABLE mapcount AS SELECT COUNT(*) AS cnt, user_id FROM userid_guid_map GROUP BY user_id;

CREATE TABLE impr_guid AS SELECTa.*, m.guidFROM impr i JOIN userid_guid_map m ON (i.user_id=m.user_id);

SELECT COUNT(*) FROM impr_guid;

....................................

....................................

....................................

Transformation Validation

Page 15: MarketShare Big Data Analytics

The big picture

Page 16: MarketShare Big Data Analytics

© 2011 MarketShare. All Rights Reserved. Confidential & Proprietary.

Data Management on the CloudBig Data + Dynamic Provisioning

Page 17: MarketShare Big Data Analytics

17

A dynamically provisioned commodity cluster of virtual machines with the following characteristics• Infinite

• A large number of nodes can be commissioned in minutes• Taxi Meter

• Most services are billed at an hourly usage level

Defining the cloud for databases

Page 18: MarketShare Big Data Analytics

18

• When should resources be added to a data processing system?• Partition management for Low Cost on the cloud

• How many of these resources should be permanent?• Materialization with Intermittent Scalability

• Where should these resources be added in the stack?• Replication to Improve Query Performance

3 new problems for database query processing

Page 19: MarketShare Big Data Analytics

Transforming Data to Actionable Insights

HighMediumLow

Engagement

Campaign Heat MapCube Materialization

Publishers Geograp

hy

Time Incremental updates Multi-dimensional

indexes Multi-dimensional

partitions

Page 20: MarketShare Big Data Analytics

Generic Schema

Dimension Level Level Level LevelTIME YEAR MONTH DAY HOUR

TIME_WEEKLY YEAR DAY_NAME

GEO COUNTRY STATE

PUBLISHER PUBLISHER SITE_NAME SITE_TYPE

PUBLISHER_PLACEMENT SITE_NAME PLACEMENT

CAMPAIGN CAMPAIGN_NAME CAMPAIGN_DESC INDUSTRY_SEGMENT

AD SIZE HEIGHT WIDTH

ADVERTISER ADVERTISER_NAME

FLIGHT FLIGHT_NAME FLIGHT_START_DATE FLIGHT_END_DATE FLIGHT_CREATIVE_ID

RID

USERID GENDER AGE_BUCKET AGE

Page 21: MarketShare Big Data Analytics

Life of a Query

Access Partitions

OptimizeQuery

IdentifyPartitions

Query

ConsolidateResults

21

Page 22: MarketShare Big Data Analytics

Tuple Access Layer YEAR MONTH DAY HOUR COUNTRY WEEK STATE PUBLISHE

RSITE_NAME

CAMPAIGN_NAME

CAMPAIGN_DESC

SIZE HEIGHT FLIGHT_NAME

FLIGHT_START_DATE

FLIGHT_END_DATE

FLIGHT_CREATIVE_ID

2009 June 3 1 US * * AOL Autos

* * * * * Ford * * Banner

MONTH DAY HOUR COUNTRY STATE WEEK

June 3 1 US * *

YEAR PUBLISHER SITE_NAME CAMPAIGN_NAME

CAMPAIGN_DESC

SIZE HEIGHT FLIGHT_NAME

FLIGHT_START_DATE

FLIGHT_END_DATE

FLIGHT_CREATIVE_ID

2009 AOL Autos

* * * * * Ford * * Banner

Locate the partition for the fast dimension values

Note here that STATE is set to ‘*’ but it has already been materializedTherefore, we eliminate any sum across all states

On the local partition, do the aggregation to calculate the ‘*’

Page 23: MarketShare Big Data Analytics

24

5 types of partitions maintained in the cloud

Exclusive

EC2 nodes that are allocated for specific keys

Permanent

EC2 nodes that are permanently allocated to service queries

Archive

S3 storage that is not accessible directly by the query engine

Intermittent

EC2 nodes that are allocated by the loading engine

Temporary

EC2 nodes that host temporary replicas

Page 24: MarketShare Big Data Analytics

25

Taxi MeterMaking resources transient

Page 25: MarketShare Big Data Analytics

Usage Patterns• Usage patterns vary throughout the day and throughout the week• A couple of periods of heavy usage daily, followed by moderate to low usage

4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm

Data LoadingCube Materialization

Users review standard reports looking for campaign exceptions

Users run ad hoc queries to understand campaign exceptions

Users make any necessary campaign adjustments

Quick review of key reports by users prior to heading home

Com

putin

g R

esou

rces

Page 26: MarketShare Big Data Analytics

Traditional Computing Approach• Traditional computing approach buys enough computing resources to meet peak usage

demand• Even many cloud “solutions” provide only the peak computing power option with no way

to dynamically reallocate the computing resources to match the current usage demand• Result: Substantial waste in computing resources and money

4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm

Com

putin

g R

esou

rces

$$ $$ $$ $$

Maximum Computing Resources

Page 27: MarketShare Big Data Analytics

“Adaptive” Computing Economics• Finely matching computing resources to user usage patterns can provide a 50% to 90%

cost savings versus the traditional computing resource allocation approach• Result: Lower cost with improvements in availability and performance

4 – 6am 8 – 10am 10 – Noon Noon – 4pm 4 – 6pm

Com

putin

g R

esou

rces

$$ $$ $$ $$

Maximum Computing Resources

Adaptive Computin

g Resources

Page 28: MarketShare Big Data Analytics

29

Intermittent scalabilityUsing large number of nodes during load time

Page 29: MarketShare Big Data Analytics

Managing CapEx with Role Based Clusters

SINGLECLUSTER FOR

DATA CLEANSING, LOAD AND QUERY

15TB100 NODES

Monthly Cost = $28,800

Page 30: MarketShare Big Data Analytics

Role Based Clusters

BUILD CUBE

HIBERNATE CUBE

QUERY READY PARTITIONS

UI Ad Server Data, Search Engine Data

DATA CLEANSING

2 hours daily for load on 10 nodesQuery on 5 nodes

Monthly Cost = $2,052

Page 31: MarketShare Big Data Analytics

32

Selective replication for hot partitions

Page 32: MarketShare Big Data Analytics

Partition level query slowdown• Dynamic statistics

• The query execution system logs status for each partition• If a particular partition is regularly lagging behind, it is marked for replication

• Static statistics• The query execution system identifies skews in specific partitions• Partitions with size skew etc are marked for replication

Operational (EC2)

P1 P2

P3 P4 P5 P6

P1 P2 P3 P4

P5 P6

Partition Size Average Execution Time

P1 1MB 1.2s

P2 2MB

P3 1.5MB 60s

Page 33: MarketShare Big Data Analytics

Fixing partition level slowdown• If the query execution system detects SLA violations• Adds two new temporary nodes (Temp 1)• Creates new replicas for the ‘hot’ partitions

Operational (EC2)

P1

Node 1

P2

P3 P4 P5

Node 2

P6

P1 P2 P3

Node 3

P4

P5 P6

Temporary (EC2)

P3

Temp 1

P6

P3

P6

Page 34: MarketShare Big Data Analytics

Key level query slowdown• Key Level Dynamic statistics

• A particular key takes time for materializing various facets of the cube

Operational (EC2)

P1 P2

P3 P4 P5 P6

P1 P2 P3 P4

P5 P6

Keys Size Average Execution Time

P3. K1 20MB 120s

Page 35: MarketShare Big Data Analytics

Fixing partition level slowdown• If the query execution system detects SLA violations for a particular key• Adds a new temporary node (Temp 2)• Denormalizes the key such that all data for that key is materialized

Operational (EC2)

P1

Node 1

P2

P3 P4 P5

Node 2

P6

P1 P2 P3

Node 3

P4

P5 P6

Temporary (EC2)

KN

Materialized

Node 5

Page 36: MarketShare Big Data Analytics

37

Partitions can be in 5 different states

Isolated (EC2)

Operational (EC2)

Data Retrieval Module

Access replicas if base partition is overwhelmed

Access base partitions is not materialized

Load (temporary EC2)

Create replicas if partition is hot

Isolate keys on separate machines

Retrieve data from materialized

Post load, archive the partitions

Replicated (temporary EC2)

Archive (S3/EBS)

Page 37: MarketShare Big Data Analytics

• Lots of challenges in cloud + modeling• Collaboration opportunities• We are hiring!

Next Steps