Scalable Pipelines
Vivek Nagarajan
Insight Data Engineering Consulting Project
My Role
• Reduce the latency of running a pipeline
• Set up infrastructure for scaling pipelines
Pre-Pipeline Stage
input: <file to upload>
output: <file to output>
transforms:
  - split on newline
  - filter record by key
<filename, yaml>
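As a rough illustration of how a description like the one above could drive a pipeline, here is a minimal Python sketch. It is not the project's actual code. The dict stands in for the parsed YAML, and the file names, the transform names (split_on_newline, filter_record_by_key), the "key" parameter, and the comma-separated record layout are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical parsed form of the pipeline description.
# File names and the "key" value are illustrative assumptions.
spec = {
    "input": "events.txt",
    "output": "filtered.txt",
    "transforms": [
        {"name": "split_on_newline"},
        {"name": "filter_record_by_key", "key": "user42"},
    ],
}


def split_on_newline(chunks: Iterable[str]) -> Iterator[str]:
    """Flatten text chunks into individual non-empty lines."""
    for chunk in chunks:
        for line in chunk.splitlines():
            if line:
                yield line


def filter_record_by_key(records: Iterable[str], key: str) -> Iterator[str]:
    """Keep records whose first comma-separated field equals `key`.

    The "key,value" record layout is an assumption for this sketch.
    """
    for record in records:
        if record.split(",", 1)[0] == key:
            yield record


# Registry mapping transform names in the spec to functions.
TRANSFORMS: Dict[str, Callable[..., Iterator[str]]] = {
    "split_on_newline": split_on_newline,
    "filter_record_by_key": filter_record_by_key,
}


def run_transforms(text: str, transforms: List[dict]) -> List[str]:
    """Apply each transform in order, passing extra spec fields as parameters."""
    data: Iterable[str] = [text]
    for step in transforms:
        params = {k: v for k, v in step.items() if k != "name"}
        data = TRANSFORMS[step["name"]](data, **params)
    return list(data)
```

Calling run_transforms(raw_text, spec["transforms"]) applies the steps in the order listed. Keeping the transforms in a name-to-function registry means a new transform only needs a new entry, which is one possible way to support the dynamic transformations mentioned later in the deck.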
Pre-Pipeline Stage
My ETL Pipeline
Scaling Pipeline
Schedule
Scaling Pipeline
Demo
Airflow web server link: http://vivek-airlflow-pipeline.us/
Challenges
• Understanding the existing framework and infrastructure
• Evolving set of requirements
• Quirks of scaling pipelines in distributed Airflow
Performance Stats
• Reduced the time taken to process a pipeline by over 50 percent
• Running 30 pipelines concurrently takes an average of 2 minutes per pipeline
Possible extensions
• Setting up HA on Flink cluster
• Benchmarking with Spark transformations
• Setting up multi-node Redis cluster
• More support for dynamic transformations
About Me
Thank You