Cloud Native Data Pipelines (DataEngConf SF 2017)

Post on 22-Jan-2018


Cloud Native Data Pipelines

1

Sid Anand (@r39132) DataEngConf SF 2017

About Me

2

Work [ed | s] @

Committer & PPMC on

Father of 2

Co-Chair for

Apache Airflow

Agari

3

What We Do!


9

Agari : What We Do

Agari’s Previous EP Version (Batch)

Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score

10

Agari : What We Do

Agari’s Current EP Version (Near-real time)

Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score; Quarantine, Label, PassThrough

Data Pipelines

BI vs Predictive

11

Data Pipelines (BI)

12

Diagram labels: Web Servers; OLTP DB; Data Warehouse; Reporting Tools; Query Browsers; ETL (batch); MySQL, Oracle, Cassandra; Teradata, Redshift, BigQuery

Data Pipelines (Predictive)

13

Diagram labels: OLTP DB or cache; ETL (batch or streaming); MySQL, Oracle, Cassandra, Redis; Spark, Flink, Beam, Storm; Web Servers; Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention; Data Source

Data Products

14

BI Predictive

Common Focus of this talk

Data Pipelines

15

Diagram labels (BI): Web Servers; OLTP DB; Data Warehouse; Reporting Tools; Query Browsers; ETL (batch); MySQL, Oracle, Cassandra; Teradata, Redshift, BigQuery

Diagram labels (Predictive): OLTP DB or cache; ETL (batch or streaming); MySQL, Oracle, Cassandra, Redis; Spark, Flink, Beam, Storm; Web Servers; Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention; Data Source

Motivation

Cloud Native Data Pipelines

16

Cloud Native Data Pipelines

17

Big Data Companies like LinkedIn, Facebook, Twitter, & Google have large teams to manage their data pipelines

Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?

Cloud Native Data Pipelines

18

Cloud Native Techniques

Open Source Technologies

Data Pipelines seen in Big Data companies

~

Design Goals

Desirable Qualities of a Resilient Data Pipeline

19

21

Desirable Qualities of a Resilient Data Pipeline

Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness
• All output within time-bound SLAs

Operability
• Minimize Operational Fatigue / Automate Everything
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Cost
• Pay-as-you-go

Predictive Analytics @ Agari

Use Cases

22

Use Cases

24

Apply trust models (message scoring)

batch + near real time

Build trust models

batch

(Enterprise Protect)

Focus of this talk

Use-Case : Message Scoring (batch)

Batch Pipeline Architecture

25

Use-Case : Message Scoring

Diagram components, added one slide at a time: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB

26

An Avro file is uploaded to S3 every 15 minutes

27

Airflow kicks off a Spark message scoring job every hour (EMR)

28

Spark job writes scored messages and stats to another S3 bucket

29

This triggers SNS/SQS message events

30

An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages

31

The importers rapidly ingest scored messages and aggregate statistics into the DB

32

Users receive alerts of untrusted emails & can review them in the web app

33

Airflow manages the entire process
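The hand-off from the second S3 bucket to the importers can be sketched as follows. This is a minimal illustration rather than Agari's importer code: it assumes the bucket publishes standard S3 event notifications to SNS, that SNS fans them out to SQS without raw message delivery, and the bucket and key names are made up. The function name `s3_objects_from_sqs_body` is hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list:
    """Return (bucket, key) pairs named in one SQS message body."""
    payload = json.loads(body)
    # With SNS in front of SQS (no raw message delivery), the S3 event
    # arrives as a JSON string inside the envelope's "Message" field.
    if payload.get("Type") == "Notification" and "Message" in payload:
        payload = json.loads(payload["Message"])
    objects = []
    for record in payload.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical event: bucket and key are illustrative only.
s3_event = {"Records": [{"eventName": "ObjectCreated:Put",
                         "s3": {"bucket": {"name": "scored-messages"},
                                "object": {"key": "2017/04/26/part-0001.avro"}}}]}
sns_envelope = json.dumps({"Type": "Notification",
                           "Message": json.dumps(s3_event)})

print(s3_objects_from_sqs_body(sns_envelope))
```

An importer would then fetch each (bucket, key) pair from S3 and write the scored messages to the DB before deleting the SQS message.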

34

Architectural Components

Component | Role | Uses | Salient Features | Operability Model

S3 | Data Lake | • All data stored in S3 • All processing uses S3 | Scalable, Available, Performant | Serverless

SNS SQS | Messaging | • Reliable, Transactional, Pub/Sub | Scalable, Available, Performant | Serverless

ASG | General Processing | • Used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed

(logo not captured) | Data Science Processing | • Aggregation • Model Building • Scoring | Nice programming model at the cost of debugging complexity | We Operate

(logo not captured) | Workflow Engine | • Coordinates all Spark Jobs & complex flows | Lightweight, DAGs as Code, Steep learning curve | We Operate

DB | Persistence for WebApp | • Holds subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate

Tackling Cost & Timeliness

Leveraging the AWS Cloud

35

Tackling Cost

36

Between Daily Runs During Daily Runs

When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR

Tackling Cost

37

Between Hourly Runs During Hourly Runs


This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
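The arithmetic behind this point can be sketched in a few lines. The sketch assumes the hourly billing granularity the slide describes (every started instance-hour is billed in full) and an illustrative 45-minute run length that is not a figure from the talk. The function name is hypothetical.

```python
import math


def billed_instance_hours_per_day(runs_per_day: int, run_minutes: float) -> int:
    """Instance-hours billed per day for one instance that is launched for
    each run and terminated afterwards, assuming every started hour is
    billed in full."""
    hours_per_run = math.ceil(run_minutes / 60)
    # A day has only 24 billable hours per instance.
    return min(24, runs_per_day * hours_per_run)


# Assumed run length of 45 minutes, for illustration only.
daily = billed_instance_hours_per_day(runs_per_day=1, run_minutes=45)
hourly = billed_instance_hours_per_day(runs_per_day=24, run_minutes=45)
print(daily, hourly)  # 1 24
```

Under these assumptions a daily run pays for 1 of 24 hours, while hourly runs pay for all 24, so scaling to zero between runs saves nothing.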

Tackling Timeliness

Auto Scaling Group (ASG)

38

ASG - Overview

39

What is it?

A means to automatically scale out/in clusters to handle variable load/traffic

A means to keep a cluster/service of a fixed size always up

ASG - Data Pipeline

40

Diagram labels: SQS; Importer ASG (four importers, scale out / in); DB

41

ASG : CPU-based

Chart series: Sent, CPU, ACKd/Recvd

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

42

ASG : CPU-based

Chart series: Sent, CPU, Recv

Premature Scale-in:

• The CPU drops to noise-levels before all messages are consumed

• This causes scale in to occur while the last few messages are still being committed

43

Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow

Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink

ASG : Queue-based
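The two queue-based rules can be written as one small decision function. This is a sketch, not the actual scaling policy: the `hold` state and the choice to let scale-out take precedence when both rules could fire are assumptions added here. In CloudWatch, the two SQS counters are exposed as `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible`.

```python
def scaling_action(visible_messages: int, in_flight_messages: int) -> str:
    """Decide a scaling action from two SQS queue counters.

    visible_messages   - messages waiting in the queue (queue depth)
    in_flight_messages - messages received but not yet ACK'd ("invisible")
    """
    if visible_messages > 0:
        return "scale_out"  # work is waiting: grow the ASG
    if in_flight_messages == 0:
        return "scale_in"   # nothing waiting, nothing in flight: shrink
    return "hold"           # last messages are still being committed


print(scaling_action(12, 3))  # scale_out
print(scaling_action(0, 2))   # hold
print(scaling_action(0, 0))   # scale_in
```

The middle case is the one CPU-based scaling gets wrong: the queue is empty and CPU is near idle, but in-flight messages have not been committed yet.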

Auto Scaling Groups

Build & Deploy

44

ASG - Build & Deploy

45

Component | Role | Details

Terraform | Spins up Cloud Resources | • Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them using Terraform

Ansible | Sets up an EC2 instance | • A better version of Chef & Puppet • Agentless, idempotent, & declarative tool to set up EC2 instances, by installing & configuring packages, and more

Packer | Spins up an EC2 instance for the purposes of building an AMI! | • Can be used with Ansible & Terraform to bake AMIs & Launch Auto-Scaling Groups

ASG - Build & Deploy

46

Step 1 : Packer spins up a temporary EC2 node - a blank canvas!

47

Step 2 : Packer runs an Ansible role against the EC2 node to set it up.

48

Step 3 : Snapshots the machine & registers the AMI.

49

Step 4 : Terminates the EC2 instance!

50

Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG)

51

Desirable Qualities of a Resilient Data Pipeline

Operability Correctness

Timeliness : • ASG • EMR Spark

Cost : Daily • ASG • EMR Spark; Hourly ASG • No Cost Savings

Tackling Operability & Correctness

Leveraging Tooling

52

53

A simple way to author, configure, manage workflows

Provides visual insight into the state & performance of workflow runs

Integrates with our alerting and monitoring tools

Tackling Operability : Requirements

Apache Airflow

Workflow Automation & Scheduling

54

55

Airflow: Author DAGs in Python! No need to bundle many config files!

Apache Airflow - Authoring DAGs
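The "DAGs as code" idea can be illustrated without Airflow itself. The sketch below uses only Python's standard-library `graphlib` and is not Airflow's API. The task names are hypothetical and loosely follow the batch scoring pipeline described earlier.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are hypothetical, for illustration only.
dag = {
    "score_messages": {"wait_for_avro_upload"},
    "import_scored_messages": {"score_messages"},
    "send_alerts": {"import_scored_messages"},
}

# A scheduler must run tasks in an order that respects the dependencies.
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

Because the workflow is ordinary Python, it can be reviewed, versioned, and tested like any other code, which is the point the slide makes about avoiding bundles of config files.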

56

Airflow: Visualizing a DAG

Apache Airflow - Authoring DAGs

57

Airflow: It’s easy to manage multiple DAGs

Apache Airflow - Managing DAGs

Apache Airflow - Perf. Insights

58

Airflow: Gantt chart view reveals the slowest tasks for a run!

59

Apache Airflow - Perf. Insights

Airflow: Task Duration chart view shows task completion time trends!

60

Airflow: …And easy to integrate with Ops tools!

Apache Airflow - Alerting

61

Apache Airflow - Correctness

62

Desirable Qualities of a Resilient Data Pipeline

Operability Correctness

Timeliness Cost

Use-Case : Message Scoring (near-real time)

NRT Pipeline Architecture

63

Use-Case : Message Scoring

Diagram components, added one slide at a time: enterprise A, enterprise B, enterprise C; K (Kinesis); Scorers (ASG); Kinesis; Importers (ASG); DB; K (Kinesis); Alerters (ASG)

64

Kinesis batch put every second

65

An ASG of scorers is scaled up to one process per core per Kinesis shard

66

Scorers apply the trust model and send scored messages downstream

67

An ASG of importers is scaled up to rapidly import messages

68

Imported messages are also consumed by the alerter

69

Quarantine Email

70

Stream Processing Architecture

Component | Role | Details | Pros | Operability Model

S3 | Data Lake | • All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless

Kinesis | Messaging | • Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless

(logo not captured) | General Processing | • ASG Replacement except for Rails Apps | Scalable, Available, Serverless | Serverless

ASG | General Processing | • Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed

(logo not captured) | Data Science Processing | • Model Building | (none listed) | We Operate

(logo not captured) | Workflow Engine | • Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate

DB | Persistence for WebApp | • Holds smaller subset of data needed for Web App | Rails + Postgres ‘nuff said | We Operate

(logo not captured) | Persistence for WebApp | • Aggregation + Search moved from DB to ES • Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed

Innovations

NRT Pipeline Architecture

71

Apache Avro

What is Avro?

72


74

What is Avro?

Avro is a self-describing serialization format that supports

primitive data types : int, long, boolean, float, string, bytes, etc…

complex data types : records, arrays, unions, maps, enums, etc…

many language bindings : Java, Scala, Python, Ruby, etc…

The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc…

Supports Schema Evolution!

Apache Avro

Why is it useful?

75

76

Why is Avro Useful?

Agari is an IoT company!

Agari Sensors, deployed at customer sites, stream data to Agari’s Cloud SAAS

Data is sent via Kinesis!

77

At any point in time, customers run different versions of the Agari Sensor

78

These Sensors might send different format versions of the data!

79

Diagram labels: enterprise A : v1, enterprise B : v2, enterprise C : v3; Kinesis; Agari SAAS in AWS; v4

80

Why is Avro Useful?

Diagram labels: enterprise A : v1, enterprise B : v2, enterprise C : v3; Kinesis; Agari SAAS in AWS; v4

Avro allows Agari to seamlessly handle different IoT data format versions

from avro.io import DatumReader

datum_reader = DatumReader(writers_schema=writers_schema,
                           readers_schema=readers_schema)

Requirements:

• Schemas are backward-compatible
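What a reader gains from holding both the writer's and its own schema can be sketched with plain dictionaries. This is a toy model of Avro's record resolution rule (fields are matched by name, a field missing from the writer takes the reader's default, writer-only fields are dropped), not the Avro library. The v1 and v2 schemas and the function name `resolve_record` are hypothetical.

```python
def resolve_record(record: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]      # field known to both sides
        elif "default" in field:
            resolved[name] = field["default"]  # new field: use reader default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                            # writer-only fields are dropped


# Hypothetical v1 (older Sensor) and v2 (SaaS side) versions of a schema.
v1 = {"type": "record", "name": "User",
      "fields": [{"name": "name", "type": "string"}]}
v2 = {"type": "record", "name": "User",
      "fields": [{"name": "name", "type": "string"},
                 {"name": "favorite_color", "type": ["null", "string"],
                  "default": None}]}

print(resolve_record({"name": "Ann"}, writer_schema=v1, reader_schema=v2))
```

The backward-compatibility requirement shows up in the `ValueError` branch: a field added without a default cannot be filled in for data written by older Sensors.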

81

Why is Avro Useful?

Agari SAAS in AWS

S1 S2 S3

s3 Spark

Avro Everywhere!

Avro is so useful, we don’t just use it to communicate between our Sensors & our SAAS infrastructure

We also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment

82


Good Language Bindings :

Data Pipelines services are written in Java, Ruby, & Python

Apache Avro

By Example

83

84

Avro Schema Example

{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

complex type (record)

Schema name : User

3 fields in the record: 1 required, 2 optional
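The "1 required, 2 optional" reading can be checked mechanically with the standard-library `json` module. The sketch assumes the slide's sense of optional: a field whose type is a union that includes `null`. The helper `is_optional` is a name introduced here for illustration.

```python
import json

schema = json.loads("""
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")


def is_optional(field: dict) -> bool:
    """Treat a field as optional when its type is a union that includes null."""
    return isinstance(field["type"], list) and "null" in field["type"]


required = [f["name"] for f in schema["fields"] if not is_optional(f)]
optional = [f["name"] for f in schema["fields"] if is_optional(f)]
print(required, optional)
```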


88

Avro Schema Data File Example

Schema (0.0001 %): {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }

Data x 1,000,000,000 (99.999 %)

89

Avro Schema Streaming Example

Schema (99 %): {"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }

Binary Data block (1 %)

90

OVERHEAD!!
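The contrast between the data-file and streaming cases is simple arithmetic on payload sizes. The sketch below measures the JSON text of the example schema and uses an assumed 12-byte encoded record, so its percentages illustrate the effect rather than reproduce the slide's figures. The function name `schema_share` is hypothetical.

```python
import json

schema_json = json.dumps({
    "namespace": "agari", "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
})


def schema_share(schema_bytes: int, record_bytes: int, records: int) -> float:
    """Fraction of the total payload taken up by the schema."""
    return schema_bytes / (schema_bytes + record_bytes * records)


schema_bytes = len(schema_json.encode("utf-8"))
record_bytes = 12  # assumed size of one encoded record, illustration only

print(f"data file, 1e9 records: {schema_share(schema_bytes, record_bytes, 10**9):.8%}")
print(f"stream, 1 record/msg:   {schema_share(schema_bytes, record_bytes, 1):.1%}")
```

In a data file the schema is written once and amortized over every record; in a stream where each message carries its own schema, the schema dominates the payload.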

Apache Avro

Schema Registry

91

92

Avro Schema Registry

Message Producer (P) calls register_schema on the Schema Registry (Lambda) with the schema:

{"namespace": "agari", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] }

93

register_schema returns a UUID

94

Message Producer sends UUID + Data to the Message Consumer (C)

95

Message Consumer (C) calls getSchemaById (UUID) on the Schema Registry

96

getSchemaById (UUID) returns the schema

97

Message Consumers • download & cache the schema

• then decode the data
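The register, send, look up, cache, and decode flow can be sketched end to end in memory. Several things here are assumptions made for illustration: the registry is a plain class rather than a Lambda, the UUID is derived from the schema text, the wire format is a 16-byte UUID followed by the payload, and JSON stands in for Avro binary encoding. The class and function names are hypothetical.

```python
import json
import uuid


class SchemaRegistry:
    """In-memory stand-in for the registry service."""

    def __init__(self):
        self._schemas = {}

    def register_schema(self, schema: dict) -> uuid.UUID:
        canonical = json.dumps(schema, sort_keys=True)
        # Assumption: derive the ID from the schema text so that
        # re-registering the same schema returns the same UUID.
        schema_id = uuid.uuid5(uuid.NAMESPACE_URL, canonical)
        self._schemas[schema_id] = schema
        return schema_id

    def get_schema_by_id(self, schema_id: uuid.UUID) -> dict:
        return self._schemas[schema_id]


def encode_message(schema_id: uuid.UUID, record: dict) -> bytes:
    """Producer side: 16-byte schema UUID followed by the encoded record."""
    return schema_id.bytes + json.dumps(record).encode("utf-8")


class Consumer:
    def __init__(self, registry: SchemaRegistry):
        self._registry = registry
        self._cache = {}  # schema UUID -> schema
        self.registry_lookups = 0

    def decode_message(self, message: bytes):
        schema_id = uuid.UUID(bytes=message[:16])
        if schema_id not in self._cache:  # download once, then cache
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
            self.registry_lookups += 1
        schema = self._cache[schema_id]
        record = json.loads(message[16:].decode("utf-8"))
        return schema["name"], record


registry = SchemaRegistry()
user_schema = {"namespace": "agari", "type": "record", "name": "User",
               "fields": [{"name": "name", "type": "string"}]}
schema_id = registry.register_schema(user_schema)

consumer = Consumer(registry)
for name in ("Ann", "Bob"):
    print(consumer.decode_message(encode_message(schema_id, {"name": name})))
print("registry lookups:", consumer.registry_lookups)
```

Each message now carries 16 bytes of schema reference instead of the full schema, and the consumer contacts the registry once per schema version rather than once per message.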

Avro Schema Registry

98

Diagram labels: enterprise A, enterprise B, enterprise C; K (Kinesis); Scorers (ASG); Kinesis; Importers (ASG); DB; K (Kinesis); Alerters (ASG); SR (Schema Registry) shown three times

Imported messages are also consumed by the alerter

Acknowledgments

100

None of this work would be possible without the essential contributions of the team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Chris Buchanan • Neil Chapin • Wil Collins • Don Spencer

• Scot Kennedy • Natia Chachkhiani • Patrick Cockwell • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle • Gabriel Poon • Spencer Sun • Nathan Bryant

Questions? (@r39132)

101
