monal daxini - beaming flink to the cloud @ netflix

70
Monal Daxini Engineering Manager Stream Processing, Real Time Data Infrastructure @monaldax @Netflix #keystone Beaming Flink to the Cloud @ Netflix Sep 12 2016

Upload: flink-forward

Post on 08-Jan-2017

358 views

Category:

Data & Analytics


5 download

TRANSCRIPT

Page 1: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Monal Daxini

Engineering ManagerStream Processing, Real Time Data Infrastructure

@monaldax @Netflix #keystone

Beaming Flink to the Cloud @ Netflix

Sep 12 2016

Page 2: Monal Daxini - Beaming Flink to the Cloud @ Netflix

World’s Leading Internet Streaming Service(Global launch Jan 6, 2016)

Page 3: Monal Daxini - Beaming Flink to the Cloud @ Netflix

● 83+ Million Members, 190+ Countries

● 1000+ device types

● 35% of downstream Internet traffic

Netflix Service Scale

Page 4: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Netflix Service Scale - Daily viewing hours

125,000,000,000+Whoa!

Page 5: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Events Processed / day

1,000,000,000,000+1.4 PB

That’s a huge number!

Page 6: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Event Scale

Peak

● 1T unique events ingested/day● 16M / sec● 43GB / sec● 10MB / message

Daily Averages

● 1T+ events processed● 600B unique events ingested● 1.4 PB / day

● 4K / event

Page 7: Monal Daxini - Beaming Flink to the Cloud @ Netflix

99.99% + Availability / Four 9s

Keystone Scale

Page 8: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Events Trend

1/2014 80B / day 1/2015 300B / day 1/2016 1T+ / day

Page 9: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Evolution

SPaaS

Page 10: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Goal - Migrate 1.3 PB of event data to a new Pipeline in flight, while ensuring data diff < 0.1%

Page 11: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Few years ago (Season 1)

EMR

EventProducers

Page 12: Monal Daxini - Beaming Flink to the Cloud @ Netflix

A year ago.. (Season 2)

EventProducer

Stream Consumers

EMR

ConsumerKafka

Suro Router

EventProducer

Suro

Kafka

SuroProxy

Page 13: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Season 3

Page 14: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Keystone Stream Processing

(SPaaS)

Keystone Management

Keystone Messaging

Schema Support

100% in AWS

Page 15: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Create DuploⓇ Blocks:Let reusability drive new value

Our Philosophy

Page 16: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Q4 2015

EventProducers

Sinks

SPaaS

Page 17: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Management

Page 18: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Per Stream Auto Dashboard

Page 19: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone SPaaS

Stream Consumers

Router

EMR

FrontingKafka

ConsumerKafka

EventProducer

KS

Prox

yControl PlaneSelf Service UI

100% in AWS

24 x 7

Region failover

Page 20: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Event flowKeystone Pipeline As a Service

Page 21: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

KS

Prox

y

KS

Lib Control Plane

Self Service UI

Page 22: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Page 23: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Page 24: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Page 25: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Page 26: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Routing

Page 27: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone

Stream Consumers

SamzaRouter

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Checkpoint Cluster

Page 28: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Details..

○ Massively parallel use-case

■ Per element processing - declarative filtering &

projection

○ Stateless except Kafka offset checkpointing state

Page 29: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Routing Infrastructure

+

CheckpointingCluster

+ 0.9.1Go

C language

Page 30: Monal Daxini - Beaming Flink to the Cloud @ Netflix

EC2 InstancesZookeeper

(Instance Id assignment)

JobJob

Job

NodeAgent

Checkpointing Cluster

ASG

Immutable Job Config

Self-Service UI / Control Plane

Page 31: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Custom Go Executor

./runJob

LogsSnapshots

Attach Volumes

./runJob./runJob

Reconcile Loop - 1 minHealth Check

What’s running on the host?

Zookeeper(Instance Id assignment)

Page 32: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Yes! You inferred right!

No Mesos & No Yarn

Page 33: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Salient Features

Page 34: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Scale

● 4000+ Kafka brokers● 14000+ Samza Jobs

○ Running in docker containers○ On 1600+ nodes

Page 35: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Salient Features

● At-least-once delivery semantics*● Multi-Tenant● Self Service● Scalable● Fault Tolerant

100% in AWS

Page 36: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Salient Features

● Event payload is immutable● Inject event metadata

○ guid, timestamp, host, app● Custom extensible wire protocol● Kafka producer wrapper

Page 37: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone Salient Features

● Kafka Cluster failover○ Kafka Kong

● Routing regional / failover● Scales based on historical traffic

○ Externally driven

Page 38: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Season 4Plot & Pilot...

Page 39: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Stream Processing As a Service (SPaaS)

(more capable)

Page 40: Monal Daxini - Beaming Flink to the Cloud @ Netflix

SPaaS Vision (plot)

● Multi-tenant support for stateful stream processing apps

● Autoscaling managed infrastructure

● Support for schemas

● Self Service Tooling

Page 41: Monal Daxini - Beaming Flink to the Cloud @ Netflix

SPaaS Architecture (plot)

SPaaS ManagerTitus Container

Runtime

Framework Specific APIor

Common API (Beam)[ Dockerized Job ]

1. Create

2. Submit 3. Launch

RunnerFlink / Spark /

Mantis

Running Job

1. Submit Job DSL (SQL)

Tooling/ Dashboard

Page 42: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Why Apache Beam?

○ Portable API layer to build sophisticated data processing apps

■ Support multiple execution engines

○ Unified model API over bounded and unbounded data sources

○ Millwheel, FlumeJava, Dataflow model lineage

SPaaS - “Beam Me Up, Scotty ! "

Page 43: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Iterative build out: then

● First - Flink on Titus in VPC, AWS○ Titus is a cloud runtime platform for container based jobs

● Next - Apache Beam and Flink runner

SPaaS - Pilot

Page 44: Monal Daxini - Beaming Flink to the Cloud @ Netflix

2.

Keystone SPaaS-Flink Pilot Use Cases

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

Demux Merge Control Plane

Self Service UI

Page 45: Monal Daxini - Beaming Flink to the Cloud @ Netflix

2.

1.

Keystone SPaaS-Flink Pilot Use Cases

Stream Consumers

Router

EMR

FrontingKafka

EventProducer

ConsumerKafka

3.

Demux Merge Control Plane

Self Service UI

Page 46: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Broker

Router - Massively parallel use case

Router

Page 47: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Titus Job

Task Manager

IP

Titus Host 4 Titus Host 5

Flink (1.2) Program Deployment (prod shadow)

Zookeeper

Job Manager (standby)

Job Manager (master)

Task Manager

Titus Host 1IP

Titus Host 2….

Task Manager

Titus Host 3IP

Titus JobIPIP

AWS VPC ENI

Page 48: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Titus High Level Architecture

Titus UITitus UI

Docker RegistryDocker Registry

Rheacontainer

containercontainer

docker

Titus Agentmetrics agent

containercontainerSPaaS-Flink

Titus executor

logging agent

zfsmesos agent

docker

RheaTitus API

Cassandra

Titus Master

Job Management & Scheduler

S3

ZookeeperDocker Registry

EC2 Autoscaling API

Mesos Master

Titus UI

(CI/CD)

Fenzo

Page 49: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Titus UI

Page 50: Monal Daxini - Beaming Flink to the Cloud @ Netflix

CD

Page 51: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Dashboard

Page 52: Monal Daxini - Beaming Flink to the Cloud @ Netflix
Page 53: Monal Daxini - Beaming Flink to the Cloud @ Netflix
Page 54: Monal Daxini - Beaming Flink to the Cloud @ Netflix
Page 55: Monal Daxini - Beaming Flink to the Cloud @ Netflix
Page 56: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone SPaaS-Flink Pilot Use Case - 2

Page 57: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Keystone SPaaS-Flink Pilot Use Case - 2

Page 58: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Flink Router perf test (YMMV)

○ Note

■ The tests were performed on a specific use case,

■ running in a specific environment, and with

■ with one specific event stream, and setup.

Page 59: Monal Daxini - Beaming Flink to the Cloud @ Netflix

1.

Keystone

Stream Consumers

SamzaRouter

EMR

FrontingKafka

EventProducer

ConsumerKafka

Control PlaneSelf Service UI

Page 60: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Details..

○ Different runtimes for Flink & Samza routers

○ Massively parallel use-case

■ Per element processing

○ Focused on net outcomes

Page 61: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Titus Job

Task Manager

IP

Titus Host 4 Titus Host 5

Flink (1.2) Router

Zookeeper

Job Manager (standby)

Job Manager (master)

Task Manager

Titus Host 1IP

Titus Host 2….

Task Manager

Titus Host 3IP

Titus JobIPIP

AWS VPC ENI

Backed state

Page 62: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Flink Router outcome (YMMV)

○ Cost ≅17% savings

○ Memory utilization ≅16% better

○ Cpu utilization ≅ 40% better

○ Network utilization ≅ 10% better

○ Msg. throughput ≅ 1% (avg) - 4% (peak) better

Page 63: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Awaiting Flink Features

Page 64: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Titus Job

Task Manager

IP

Titus Host 4 Titus Host 5

Fine Grained Recovery FLIP-1

Zookeeper

Job Manager (standby)

Job Manager (master)

Task Manager

Titus Host 1IP

Titus Host 2….

Task Manager

Titus Host 3IP

Titus JobIPIP

X Flink Job Restarts

Page 65: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Awaiting Flink Features (now)

● Checkpoints and savepoints

○ FLINK-4484 - Unify, Persistent checkpoints, periodic savepoints

○ Compatible across Flink version upgrades

○ Inspection tool for debugging

Page 66: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Awaiting Flink Features (now)

● Atomic savepoint and stop (pause) the job

● Dynamic Scaling

○ Resume from savepoint with different parallelism

○ Job elasticity

● FLINK-4545 Adjust TaskManager network buffer when scaling

DONE

Page 67: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Awaiting Flink Features (now)

● Cluster Runtime and Elasticity (FLIP - 6)

● Extending Window Function Metadata (FLIP-2)

● Large State support

○ Incremental checkpointing

○ hot standby

Page 68: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Awaiting Flink Features

● Metric tags as key-value pairs - FLINK-4245

● FLIP-9 Window Trigger DSL

● FLIP-11 Table API

● Side Inputs / Side Outputs (handle late data)

DONE

Page 69: Monal Daxini - Beaming Flink to the Cloud @ Netflix

Ponder over

Could Stream Processing Engine enable building non-analytics Applications?

Page 70: Monal Daxini - Beaming Flink to the Cloud @ Netflix

More brain food...

● Netflix Keystone Pipeline Evolution

● Netflix Kafka in Keystone Pipeline

● Samza Meetup Presentation

● Titus talk

● Netflix OSS