scalable pipelines

Scalable Pipelines Vivek Nagarajan Insight Data Engineering Consulting Project

Upload: vivek-nagarajan

Post on 21-Feb-2017


TRANSCRIPT

Page 1: Scalable Pipelines

Scalable Pipelines

Vivek Nagarajan
Insight Data Engineering Consulting Project

Page 2: Scalable Pipelines
Page 3: Scalable Pipelines

My Role

• Reduce the latency of running a pipeline
• Set up infrastructure for scaling pipelines

Page 4: Scalable Pipelines

Pre-Pipeline Stage

input: <file to upload>
output: <file to output>
transforms:
  split on newline
  filter record by key

<filename, yaml>
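The spec on this slide reads as a declarative recipe: take an input, apply an ordered list of transforms, and produce an output. A minimal Python sketch of that idea follows. It is not the project's actual code: the dictionary layout, the names `SPEC`, `TRANSFORMS`, and `run_pipeline`, and the comma-separated record format (key in the first field) are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory form of the slide's YAML spec.
# The "input" string stands in for the contents of <file to upload>.
SPEC: Dict[str, Any] = {
    "input": "id1,alpha\nid2,beta\nid1,gamma\n",
    "transforms": [
        {"op": "split_on_newline"},
        {"op": "filter_by_key", "key": "id1"},
    ],
}


def split_on_newline(data: str) -> List[str]:
    """Split raw text into one record per non-empty line."""
    return [line for line in data.splitlines() if line]


def filter_by_key(records: List[str], key: str) -> List[str]:
    """Keep records whose first comma-separated field equals `key` (assumed format)."""
    return [r for r in records if r.split(",", 1)[0] == key]


# Registry mapping transform names in the spec to functions.
TRANSFORMS: Dict[str, Callable[..., Any]] = {
    "split_on_newline": split_on_newline,
    "filter_by_key": filter_by_key,
}


def run_pipeline(spec: Dict[str, Any]) -> Any:
    """Apply each transform in order, feeding the output of one into the next."""
    data = spec["input"]
    for step in spec["transforms"]:
        params = {k: v for k, v in step.items() if k != "op"}
        data = TRANSFORMS[step["op"]](data, **params)
    return data


if __name__ == "__main__":
    print(run_pipeline(SPEC))
```

Keeping the transforms in a name-to-function registry means a new transform can be supported by adding one entry, without touching the runner.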

Page 5: Scalable Pipelines

Pre-Pipeline Stage

Page 6: Scalable Pipelines

My ETL Pipeline

Page 7: Scalable Pipelines

Scaling Pipeline

Schedule

Page 8: Scalable Pipelines

Scaling Pipeline

Page 9: Scalable Pipelines

Scaling Pipeline

Page 10: Scalable Pipelines

Scaling Pipeline

Page 11: Scalable Pipelines

Demo

Airflow web server link: http://vivek-airlflow-pipeline.us/

Page 12: Scalable Pipelines

Challenges

• Understand existing framework and infrastructure
• Evolving set of requirements
• Quirks of scaling pipelines in distributed Airflow

Page 13: Scalable Pipelines

Performance Stats

• Reduced time taken to process pipeline by over 50 percent

• Running 30 pipelines concurrently takes an average of 2 minutes per pipeline
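An "average time per pipeline under concurrency" figure like the one above can be measured by timing each run individually and averaging. The sketch below shows one way to do that with Python's standard library. It is an assumption-laden illustration, not the benchmark used in the project: `run_one` is a hypothetical stand-in that only sleeps, threads replace the distributed Airflow workers, and the numbers it prints have no relation to the reported results.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_one(pipeline_id: int) -> float:
    """Run one (placeholder) pipeline and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for real pipeline work (assumption)
    return time.perf_counter() - start


def average_duration(n_pipelines: int, max_workers: int) -> float:
    """Launch n_pipelines concurrently and return the mean per-pipeline duration."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        durations = list(pool.map(run_one, range(n_pipelines)))
    return sum(durations) / len(durations)


if __name__ == "__main__":
    avg = average_duration(n_pipelines=30, max_workers=30)
    print(f"average per-pipeline time: {avg:.3f}s")
```

Timing inside each worker, rather than timing the whole batch and dividing, captures how long an individual pipeline takes while the others compete for the same resources.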

Page 14: Scalable Pipelines

Possible extensions

• Setting up HA on Flink cluster
• Benchmarking with Spark transformations
• Setting up multi-node Redis cluster
• More support for dynamic transformations

Page 15: Scalable Pipelines

About Me

Page 16: Scalable Pipelines

Thank You