Cloud Native Data Pipelines Anand.pdf · 2016-10-20
Cloud Native Data Pipelines
Sid Anand QCon Shanghai & Tokyo 2016
About Me
Work [ed | s] @
Committer & PPMC on Apache Airflow
Father of 2
Co-Chair for
Agari : What We Do!
Agari’s Previous EP Version (Batch)
Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score
Agari’s Current EP Version (Near-real time)
Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score; Quarantine
Data Pipelines : BI vs Predictive
Data Pipelines (BI)
Web Servers
OLTP DB: MySQL, Oracle, Cassandra
ETL (batch)
Data Warehouse: Teradata, Redshift, BigQuery
Reporting Tools
Query Browsers
Data Pipelines (Predictive)
Data Source
ETL (batch or streaming): Spark, Flink, Beam, Storm
OLTP DB or cache: MySQL, Oracle, Cassandra, Redis
Web Servers
Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention
Data Products
BI Predictive
Common Focus of this talk
Data Pipelines
Motivation : Cloud Native Data Pipelines
Big Data companies like LinkedIn, Facebook, Twitter, and Google build custom, large-scale data pipelines that run in their own data centers
Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
Cloud Native Techniques and Open Source Technologies ~ Custom Data Pipeline Stacks seen in Big Data companies
Design Goals : Desirable Qualities of a Resilient Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Cost
• Pay-as-you-go
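The "expected data distributions" goal can be made concrete with a small check. The following is a minimal sketch, not the talk's implementation: the function name `distribution_ok`, the labels, the expected shares, and the 5% tolerance are all assumptions chosen for illustration.

```python
from typing import Dict


def distribution_ok(counts: Dict[str, int],
                    expected: Dict[str, float],
                    tolerance: float = 0.05) -> bool:
    """Return True when every observed share is within `tolerance` of its expected share."""
    total = sum(counts.values())
    if total == 0:
        return False  # an empty batch may itself indicate data loss
    return all(
        abs(counts.get(label, 0) / total - share) <= tolerance
        for label, share in expected.items()
    )


# Hypothetical numbers, for illustration only.
expected_shares = {"trusted": 0.90, "untrusted": 0.10}
normal_run = {"trusted": 880, "untrusted": 120}
skewed_run = {"trusted": 500, "untrusted": 500}

print(distribution_ok(normal_run, expected_shares))  # True
print(distribution_ok(skewed_run, expected_shares))  # False
```

A check like this could run as the last step of each pipeline run, so that a skewed output raises an alert before it reaches users.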
Quickly Recoverable
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR (mean time to recovery)
Predictive Analytics @ Agari : Use Cases
(Enterprise Protect)
• Apply trust models (message scoring): batch + near real time
• Build trust models: batch
Use-Case : Message Scoring (batch)
Batch Pipeline Architecture
Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB
• An Avro file is uploaded to S3 every 15 minutes
• Airflow kicks off a Spark message scoring job every hour (EMR)
• The Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process
Tackling Cost & Timeliness : Leveraging the AWS Cloud
Tackling Cost
Figure panels: Between Daily Runs, During Daily Runs
When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR
Figure panels: Between Hourly Runs, During Hourly Runs
This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
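The arithmetic behind this point can be sketched as follows. This is an illustration under stated assumptions: each run is billed in whole instance-hours (the hourly billing the slide describes), and the 60-minute and 15-minute run lengths are hypothetical values, not figures from the talk.

```python
import math


def billed_instance_hours(run_minutes: float, runs_per_day: int) -> int:
    """Instance-hours billed per day for one instance, rounding each run up to whole hours."""
    hours_per_run = math.ceil(run_minutes / 60)
    # A single instance can never be billed for more than 24 hours in a day.
    return min(24, hours_per_run * runs_per_day)


# Hypothetical run lengths, for illustration only.
daily = billed_instance_hours(run_minutes=60, runs_per_day=1)
hourly = billed_instance_hours(run_minutes=15, runs_per_day=24)

print(daily, hourly)  # 1 24
```

With one run per day, 23 of 24 hours go unbilled. With one short run per hour, every hour is billed in full, which matches leaving the instance running all day.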
Tackling Timeliness : Auto Scaling Group (ASG)
ASG - Overview
What is it?
A means to automatically scale out/in clusters to handle variable load/traffic
A means to keep a cluster/service of a fixed size always up
ASG - Data Pipeline
Diagram labels: SQS; Importer ASG (4 importers, scale out / in); DB
ASG : CPU-based
Chart series: Sent, CPU, ACKd/Recvd
CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
Premature Scale-in:
• The CPU drops to noise levels before all messages are consumed
• This causes scale-in to occur while the last few messages are still being committed
ASG : Queue-based
• Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow
• Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink
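The two queue-based rules can be expressed as a small decision function. This is a sketch, not Agari's scaling code: `QueueStats`, `scaling_decision`, and the "hold" outcome are illustrative names, and giving scale-out precedence when both conditions hold is an assumption. In SQS terms, the two counters correspond to the `ApproximateNumberOfMessagesVisible` and `ApproximateNumberOfMessagesNotVisible` metrics.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    visible: int    # messages waiting to be received (queue depth)
    in_flight: int  # "invisible" messages: received but not yet ACK'd


def scaling_decision(stats: QueueStats) -> str:
    """Apply the queue-based rules: grow on backlog, shrink only once nothing is in flight."""
    if stats.visible > 0:
        return "scale_out"
    if stats.in_flight == 0:
        return "scale_in"
    # Queue is drained but work is still being committed: do nothing yet.
    return "hold"


print(scaling_decision(QueueStats(visible=12, in_flight=3)))  # scale_out
print(scaling_decision(QueueStats(visible=0, in_flight=2)))   # hold
print(scaling_decision(QueueStats(visible=0, in_flight=0)))   # scale_in
```

The "hold" branch is what avoids the premature scale-in seen with CPU-based scaling: instances stay up until the last in-flight message is acknowledged.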
Desirable Qualities of a Resilient Data Pipeline
Operability
Correctness
Timeliness
• ASG
• EMR Spark
Cost
• Daily: ASG, EMR Spark
• Hourly: ASG, No Cost Savings
Tackling Operability & Correctness : Leveraging Tooling
Tackling Operability : Requirements
• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
Apache Airflow : Workflow Automation & Scheduling
Apache Airflow - Authoring DAGs
Airflow: Author DAGs in Python! No need to bundle many config files!
Airflow: Visualizing a DAG
Apache Airflow - Managing DAGs
Airflow: It’s easy to manage multiple DAGs
Apache Airflow - Perf. Insights
Airflow: Gantt chart view reveals the slowest tasks for a run!
Airflow: Task Duration chart view shows task completion time trends!
Apache Airflow - Alerting
Airflow: …And easy to integrate with Ops tools!
Apache Airflow - Correctness
Use-Case : Message Scoring (near-real time)
NRT Pipeline Architecture
Diagram labels: enterprise A, enterprise B, enterprise C; K (Kinesis); Scorers (ASG); Kinesis; Importers (ASG); DB; K; Alerters (ASG)
• Kinesis batch put every second
• An ASG of scorers is scaled up to one process per core per Kinesis shard
• Scorers apply the trust model and send scored messages downstream
• An ASG of importers is scaled up to rapidly import messages
• Imported messages are also consumed by the alerter
Innovations : NRT Pipeline Architecture
Innovation 1 : Repeatable Units
The architecture is composed of repeated patterns of:
• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry
Diagram labels (unit i): Kinesis_i; Compute_i; ASG_i; SR
You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
The example on the slide is a simple linear DAG with 3 units (each with Kinesis_i, Compute_i, ASG_i, and SR)
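The chaining of repeatable units can be sketched in plain Python. This is an in-memory illustration only: the `Unit` class, the `chain` helper, and the three compute functions are hypothetical stand-ins, with a `deque` playing the role of a Kinesis stream and a callable playing the role of the ASG-hosted consumer.

```python
from collections import deque
from typing import Callable, Deque, List


class Unit:
    """One repeatable unit: an input stream feeding a compute step."""

    def __init__(self, name: str, compute: Callable[[dict], dict]) -> None:
        self.name = name
        self.compute = compute
        self.stream: Deque[dict] = deque()   # stand-in for a Kinesis stream
        self.downstream: List["Unit"] = []

    def put(self, record: dict) -> None:
        self.stream.append(record)

    def drain(self) -> None:
        """Process every queued record and forward each result downstream."""
        while self.stream:
            result = self.compute(self.stream.popleft())
            for unit in self.downstream:
                unit.put(result)


def chain(*units: Unit) -> List[Unit]:
    """Wire units into a linear DAG: each unit feeds the next one."""
    for upstream, downstream in zip(units, units[1:]):
        upstream.downstream.append(downstream)
    return list(units)


delivered: List[dict] = []


def record_alert(msg: dict) -> dict:
    delivered.append(msg)
    return msg


# Hypothetical compute steps; the fixed score is a placeholder, not a trust model.
scorer = Unit("scorer", lambda msg: {**msg, "trust_score": 0.2})
importer = Unit("importer", lambda msg: {**msg, "imported": True})
alerter = Unit("alerter", record_alert)

pipeline = chain(scorer, importer, alerter)
scorer.put({"message_id": "m-1"})
for unit in pipeline:  # a linear chain is already in topological order
    unit.drain()

print(delivered)
```

Because every unit exposes the same `put`/`drain` shape, a non-linear DAG would only need more entries in `downstream`, which is the appeal of a repeatable unit.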
Innovation 2 : Avro Schema Registry
The message body is Avro-encoded, with one detail:
• The schema is not included in the Kinesis message!
• The schema would be 99% overhead for the message
• Instead, a schema_id is sent in the message header
Diagram labels: Compute1 (ASG1); Kinesis2; Compute2 (ASG2); SR
When the Compute 2 consumer receives the message, it
• First reads the schema_id out of the message header
• Contacts the Schema Registry for the schema (and caches it)
• Deserializes the Avro body using the newly acquired schema
Diagram call: SR.getSchemaById()…
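The consumer-side lookup can be sketched as follows. This is a simplified illustration, not the registry from the talk: an in-memory dict stands in for the Lambda-based registry, a positional JSON array stands in for the Avro-encoded body (like Avro binary, it carries values but no field names), and `get_schema_by_id`, `encode`, and `Consumer` are hypothetical names.

```python
import json
from typing import Dict

# Hypothetical in-memory stand-in for the schema registry.
REGISTRY: Dict[int, dict] = {
    1: {"name": "ScoredMessage", "fields": ["message_id", "trust_score"]},
}
registry_calls = 0


def get_schema_by_id(schema_id: int) -> dict:
    """Simulated registry lookup; counts calls so caching is observable."""
    global registry_calls
    registry_calls += 1
    return REGISTRY[schema_id]


def encode(schema_id: int, record: dict) -> dict:
    """Producer side: send only the schema_id in the header, values in the body."""
    fields = REGISTRY[schema_id]["fields"]
    body = json.dumps([record[name] for name in fields])
    return {"header": {"schema_id": schema_id}, "body": body}


class Consumer:
    def __init__(self) -> None:
        self._schemas: Dict[int, dict] = {}  # local schema cache

    def decode(self, message: dict) -> dict:
        schema_id = message["header"]["schema_id"]
        if schema_id not in self._schemas:
            # Cache miss: contact the registry once, then reuse the schema.
            self._schemas[schema_id] = get_schema_by_id(schema_id)
        fields = self._schemas[schema_id]["fields"]
        return dict(zip(fields, json.loads(message["body"])))


consumer = Consumer()
first = consumer.decode(encode(1, {"message_id": "m-1", "trust_score": 0.42}))
second = consumer.decode(encode(1, {"message_id": "m-2", "trust_score": 0.97}))

print(first, second, registry_calls)
```

Two messages are decoded but the registry is contacted only once, which is the point of caching the schema on the consumer.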
Diagram: the NRT pipeline (Scorers, Importers, Alerters), with a Schema Registry (SR) shown at each stage
Innovation 3 : Reactive-Scaling (WIP)
Airflow Job Reactively Scales
Innovation 4 : Anomaly-based Rollback (WIP)
If the Anomaly-detector & Reverter (ADR) is triggered and a model build or code push was recently done to Compute 1, the ADR will revert the last code or model push to ASG Compute 1
Diagram labels: Compute1 (ASG); Kinesis; Compute2 (ASG); SR; Anomaly-detector & Reverter
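The rollback rule can be sketched as a small decision function. The talk marks this feature as work in progress and does not define "recently", so the one-hour window, the `Push` record, and the `push_to_revert` function are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Push:
    kind: str            # "code" or "model"
    version: str
    pushed_at: datetime


def push_to_revert(anomaly: bool,
                   pushes: List[Push],
                   now: datetime,
                   recent: timedelta = timedelta(hours=1)) -> Optional[Push]:
    """Return the latest push if an anomaly fired shortly after it, else None."""
    if not anomaly or not pushes:
        return None
    last = max(pushes, key=lambda p: p.pushed_at)
    if now - last.pushed_at <= recent:
        return last
    return None  # anomaly is not attributable to a recent push


# Hypothetical deployment history for Compute 1.
now = datetime(2016, 10, 20, 12, 0)
history = [
    Push("code", "v41", now - timedelta(days=2)),
    Push("model", "m7", now - timedelta(minutes=20)),
]

candidate = push_to_revert(True, history, now)
print(candidate.version if candidate else None)  # m7
```

Returning the push rather than performing the revert keeps the decision testable on its own; the actual rollback mechanism is outside what the slide describes.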
Open Source Plans
Follow us to be notified when the following are open-sourced:
• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter
To be notified, follow @AgariEng & @r39132
Acknowledgments
None of this work would be possible without the contributions of the strong team below
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones
• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle
Questions? (@r39132)