pinterest hadoop summit_talk

Upload: krishna-gade

Post on 10-Aug-2015

Category:

Data & Analytics


TRANSCRIPT

Confidential

Using Hadoop to build data driven Products

50 Billion pins and counting

Krishna Gade

What is Pinterest?

A visual bookmarking tool

Discover an inspiring idea

Save it to a board

Go do it

Krishna Gade

• Data Engineering at Pinterest

• Search and Data platforms at Twitter and Bing

• Follow @krishnagade

Who am I?

Pinterest is a data product

Why do we care about data?

How is Hadoop helping us to harness the power of the data?

What are some of the tools we built on top of Hadoop Platform?

3.375

5’10”

< uncertainty

> odds of making the best decisions

It is a capital mistake to theorize before one has data.

- Sherlock Holmes

Why do we care about data?

How is Hadoop helping us to harness the power of the data?

What are some of the tools we built on top of Hadoop Platform?

Data at Pinterest

• 50 Billion Pins
• 1 Billion boards
• 40 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 200 engineers

Pinterest Data Architecture

[Diagram components: App, events, Singer, Kafka, Secor, Skyline, Pinball, Redshift, Pinalytics, Features, Qubole (Hadoop)]

Hadoop Platform Requirements

• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters

Design Choices

Decoupling compute & storage

Hadoop Cluster 1

Transient HDFS

Hadoop Cluster 2

Transient HDFS

S3 Persistent Store

Centralized Hive Metastore

Hive Metastore

Pig

Cascading

Hive

HDFS/S3

Data

Metadata

Multi-layered Packaging

• Runtime Staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs
• Automated Configuration (Masterless Puppet): software packages/libs, configs (OS/Hadoop), misc sys admin
• Baked AMI: OS, bootstrap script, core SW

Executor Abstraction Layer

Hive Metastore

HDFS/S3

Qubole

Managed Hadoop

EMR

Executor

Pinball

Dev Server
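One way to picture the executor abstraction layer is a single interface that both the scheduler and a dev server submit jobs through, with one implementation per Hadoop backend. The sketch below is an assumption, not the actual Pinterest code: the class names, the `run` method, and the returned handles are all hypothetical, and the backends are stubs that do not call any real vendor API.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical common interface for submitting a job to some backend."""

    @abstractmethod
    def run(self, job_name: str, command: str) -> str:
        """Submit a job and return a backend-specific handle."""


class QuboleExecutor(Executor):
    def run(self, job_name, command):
        # Stub: a real implementation would call the managed service here.
        return f"qubole:{job_name}"


class EMRExecutor(Executor):
    def run(self, job_name, command):
        # Stub: a real implementation would call the managed service here.
        return f"emr:{job_name}"


def submit(executor: Executor, job_name: str, command: str) -> str:
    # Callers depend only on the Executor interface, not on the backend.
    return executor.run(job_name, command)


handles = [
    submit(executor, "daily_report", "hive -f report.hql")
    for executor in (QuboleExecutor(), EMRExecutor())
]
```

Because callers only see `Executor`, swapping the backend would not require changing workflow definitions.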

Why Qubole?

• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling

Scale of Processing

● Scale:
  o 50 Billion Pins
  o Hundreds of workflows
  o Thousands of jobs
  o 500+ jobs in a workflow
  o 3 petabytes processed daily
● Support:
  o Hadoop, Cascading, Hive, Spark …

job

workflow

Pinball

Why Pinball?

● Requirements
  o Simple abstractions
  o Extensible in future
  o Reliable stateless computing
  o Easy to debug
  o Scales horizontally
  o Can be upgraded w/o aborting workflows
  o Rich features like auto-retries, per-job emails, overrun policies…
● Options
  o Apache Oozie, Azkaban, Luigi

Pinball Design

Workflow Model

● Workflow
  o A directed graph of nodes called jobs
● Edge
  o Run-after dependence
● Node
  o Job is a node
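The workflow model can be illustrated with a small dependency graph. This is a minimal sketch, not Pinball's own code: the job names are made up, and Python's standard `graphlib` module (3.9+) stands in for whatever ordering logic Pinball uses internally.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must run after.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(jobs):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(workflow)
```

A job becomes runnable only once every job on its incoming run-after edges has finished, which is the ordering the sorter yields.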

Job State

● Job state is captured in a token
● Tokens are named hierarchically

Master

Job Token

version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
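A token with the fields shown on the slide could be modeled as below. This is a sketch under assumptions: the field types, the `Token` class, and the `tokens_under` helper are hypothetical, and the payload is kept as the opaque string from the slide rather than a real job template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    version: int
    name: str                   # hierarchical, path-like name
    owner: Optional[str]        # worker currently holding the token, if any
    expiration: Optional[int]   # time after which the ownership lapses
    data: str                   # opaque payload (a job template on the slide)


def tokens_under(tokens, prefix):
    """Select tokens whose hierarchical name falls under `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(123, "/workflow/w1/job", "worker_0", 1234567, "JobTemplate(....)"),
    Token(7, "/workflow/w2/job", None, None, "JobTemplate(....)"),
]
w1_tokens = tokens_under(tokens, "/workflow/w1")
```

Hierarchical names make it cheap to ask for everything belonging to one workflow with a single prefix query.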

Job State Machine

Master Worker Interaction

● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable

Worker Master Persistent Store

1: request 2: update

3: ack

Master

● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread – no concurrency issues
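The persist-before-reply rule can be sketched with a toy master. Everything here is an assumption made for illustration: the `Master` class, the lease-based `claim` method, and the JSON log file are hypothetical stand-ins, not Pinball's actual API or storage.

```python
import json
import os
import tempfile


class Master:
    """Toy single-threaded master: state lives in memory, and every update
    is written to a persistent log before the caller gets a reply."""

    def __init__(self, log_path):
        self._log_path = log_path
        self._owners = {}  # token name -> (owner, expiration)

    def _persist(self, record):
        # Append, flush, and fsync so the update is durable before replying.
        with open(self._log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def claim(self, name, worker, now, lease):
        """Grant `name` to `worker` unless another worker holds a live lease."""
        owner, expiration = self._owners.get(name, (None, 0))
        if owner is not None and owner != worker and expiration > now:
            return False  # still owned by someone else
        record = {"name": name, "owner": worker, "expiration": now + lease}
        self._persist(record)                       # update the store first
        self._owners[name] = (worker, now + lease)  # then the in-memory state
        return True                                 # only now acknowledge


log_path = os.path.join(tempfile.mkdtemp(), "master.log")
master = Master(log_path)
first = master.claim("/workflow/w1/job", "worker_0", now=100, lease=60)
second = master.claim("/workflow/w1/job", "worker_1", now=120, lease=60)
third = master.claim("/workflow/w1/job", "worker_1", now=200, lease=60)
```

Since a single thread handles every request, the check and the update in `claim` cannot interleave with another request, so no locking is needed in this sketch.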

Worker

Open Source

Git repo: https://github.com/pinterest/pinball

Mailing list: https://groups.google.com/forum/#!forum/pinball-users

Data Driven Products

Guided Search

Related Pins

Why do we care about insights?

How is Hadoop helping us to harness the power of data?

What are some of the tools we built on top of Hadoop Platform?

Scalable Data Analytics Engine

Pinalytics

Architecture

Main Components

1. Backend: Thrift services and HBase databases
2. Webapp: Rich UI components
3. Reporter: Generates formatted data
4. Metrics: Customized optimizations

User Interface

Visualizations
• Highcharts
• Time-series updated automatically daily

Customizability
• Dashboards
• Built-in or user-defined reports

User Interface

Pinomaly
• Anomalous metric tracking
• Email alerts

Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly

Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)

Data Model

Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation

E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
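The wildcard rows in the example can be derived from the fully segmented rows. The sketch below is one possible way to do that and is not taken from Pinalytics; the `rollup` function is hypothetical, and it assumes values are additive counts. The input rows are the four fully segmented rows from the example.

```python
from collections import defaultdict
from itertools import product


def rollup(rows):
    """Expand (date, seg1, seg2, ...) -> value rows into every wildcard combo."""
    totals = defaultdict(int)
    for (date, *segs), value in rows.items():
        # For each segment, keep the concrete value or replace it with "*".
        for combo in product(*[(seg, "*") for seg in segs]):
            totals[(date, *combo)] += value
    return dict(totals)


rows = {
    ("2015-01-01", "US", "Male"): 1,
    ("2015-01-01", "US", "Female"): 2,
    ("2015-01-01", "UK", "Male"): 3,
    ("2015-01-01", "UK", "Female"): 4,
}
totals = rollup(rows)
```

Here `totals[("2015-01-01", "UK", "*")]` is 7 and `totals[("2015-01-01", "*", "Male")]` is 4, matching the slide. With k segments each input row expands to 2^k stored rows, which is the storage cost of answering any segmentation with a direct lookup.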

Backend Architecture

[Diagram components: Webapp Server, Pinalytics Thrift Service, HBase with Region Servers 1..N hosting the Metric table as Region1..RegionM, each region with a coprocessor (CP)]

1. request
2. readMetrics()
3. Scan & Aggregate
4. Region aggregation
5. metrics

HBase

Horizontal Scalability
• No app-level sharding

Flexibility in Aggregation
• FuzzyRowFilter
• Coprocessor

Tables
• Report metadata
• Reports

Fuzzy Row Filter

Composite row key
• METRIC|TIME|SEG1|SEG2|...

Filters rows given a row key and a fuzzy row
• 0: match the byte, 1: don’t need to match the byte

E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row Key: MAU|2015-01-01|--|M-
• Fuzzy filter: 000|0000000000|11|00
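The matching rule can be illustrated in plain Python. This is not the HBase `FuzzyRowFilter` API, which works on byte arrays on the server side; it is a sketch of the rule only. The `fuzzy_match` function and the sample rows are hypothetical, fields are assumed to be fixed-width, and the `|` separators in the mask are treated as positions that must match.

```python
def fuzzy_match(row, key, mask):
    """True if `row` equals `key` at every position whose mask is not '1'."""
    if not (len(row) == len(key) == len(mask)):
        return False
    return all(m == "1" or r == k for r, k, m in zip(row, key, mask))


KEY = "MAU|2015-01-01|--|M-"
MASK = "000|0000000000|11|00"

rows = [
    "MAU|2015-01-01|US|M-",  # same metric, date, and gender: accepted
    "MAU|2015-01-01|UK|F-",  # gender differs: rejected
    "MAU|2015-01-02|US|M-",  # date differs: rejected
]
matches = [row for row in rows if fuzzy_match(row, KEY, MASK)]
```

Only the two positions masked with 1 (the first segment) are free, so rows for any value of that segment pass while metric, date, and gender stay fixed.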

• Region-local aggregation with coprocessor

• Final aggregation at the Thrift service

• Reduces Network I/O

• Low Latency

HBase Coprocessor
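The two-level aggregation can be sketched as follows. The functions and sample rows are hypothetical and simulate the idea in plain Python, assuming a sum aggregate; a real coprocessor runs inside the HBase region server.

```python
def region_aggregate(region_rows, wanted):
    """Region-local step: sum the values of rows accepted by `wanted`."""
    return sum(value for key, value in region_rows if wanted(key))


def service_aggregate(regions, wanted):
    """Final step: combine one partial sum per region."""
    partials = [region_aggregate(rows, wanted) for rows in regions]
    return sum(partials), len(partials)


regions = [
    [("MAU|2015-01-01|US|M", 1), ("MAU|2015-01-01|US|F", 2)],
    [("MAU|2015-01-01|UK|M", 3), ("MAU|2015-01-01|UK|F", 4)],
]
total, values_sent = service_aggregate(regions, lambda key: key.endswith("|M"))
```

Each region returns a single partial value instead of every matching row, which is where the network I/O saving comes from.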

Reporter

Flexible Python client library for generating reports
• Arbitrary metrics and segments

Easy-to-access data
• Data is automatically copied to S3
• Hive external table is generated

Reporter Example

WAU, WARC and MAU segmented by gender and country

class DemoWAUReport(PinalyticsWideReport):
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics WHERE dt>='2015-01-01';
    """

• Sample query output
['2015-01-01', 'male', 'US', 102, 53, 110]

Core Metrics

• Pre-compute a lot of core metrics
• Standard segmentation
  - Gender, Country, App
  - Spam-filtering

• Activity
• Event counts
• Retention
• Signups

Outcomes

Internal Tools Matter

Solving problems inside of our company

400 Unique users

800 Page views per day

1500 Custom charts created and updated daily

Thank You