Building Data Pipelines in Python using Apache Airflow
STL Python Meetup Aug 2nd 2016 @conornash
What is Apache Airflow?
• Airflow is a platform to programmatically author, schedule and monitor workflows
• Designed for batch jobs, not for real-time data streams
• Originally developed at Airbnb by Maxime Beauchemin, now incubating as an Apache project
Why would you want to use it?
• As companies grow, they accumulate a complex network of processes and data with intricate dependencies
• Analytics and batch processing are becoming increasingly important
• You want to scale up analytics/batch processing while keeping the time spent writing, monitoring, and troubleshooting jobs to a minimum
• Useful even for small workflows/batch jobs
Airflow Features
• Dependency management (DAGs)
• Status visibility
• Scheduling
• Log storage/retrieval
• Parameterized retries
• Distributed task execution (Celery/RabbitMQ)
• Queues
• Pools
• Branching/Partial Success
• SLA monitoring
• Jinja templating
• Plugin system and more…
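Several of these features show up directly in task definitions. A minimal sketch using the Airflow 1.x API (the DAG and task names here are illustrative, not from the slides):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 3,                          # parameterized retries
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(hours=1),             # SLA monitoring
}

# Scheduling: cron presets like @daily are understood
dag = DAG('feature_demo', default_args=default_args,
          start_date=datetime(2016, 8, 1),
          schedule_interval='@daily')

# Jinja templating: {{ ds }} expands to the execution date;
# pool limits how many such tasks run concurrently
run_report = BashOperator(
    task_id='run_report',
    bash_command='echo "report for {{ ds }}"',
    pool='reports',
    dag=dag)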
Airflow: Dashboard
Airflow: DAG
Quick start requirements
• Python 2 or 3
• Make new project (virtualenv, pyenv, …)
• $ cd <project folder path> && export AIRFLOW_HOME=<project folder path>
• $ pip install airflow
• $ airflow initdb
• $ airflow webserver -p 8080
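Once the webserver is up, the CLI can exercise the bundled example DAGs; assuming the examples shipped with Airflow are enabled (the default), the tutorial DAG from the Airflow docs is available:
• $ airflow list_dags
• $ airflow test tutorial print_date 2016-08-02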
Airflow: First DAG
• Existing Python/Bash/Java/etc. script that is difficult to monitor
• Probably already set up as a cron (Unix) or scheduled task (Windows)
• Want to integrate it into an Airflow DAG (see the sketch below)
Airflow: First DAG
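The slide showed the DAG as a screenshot; a minimal sketch of such a first DAG, assuming the legacy job is a shell script at a hypothetical path /path/to/job.sh and the old cron entry ran at 2am daily:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# schedule_interval accepts the same cron syntax the old crontab used
dag = DAG('legacy_job',
          start_date=datetime(2016, 8, 1),
          schedule_interval='0 2 * * *')

# Wrap the existing script; Airflow now owns logging, retries, and alerting.
# The trailing space keeps Jinja from treating the .sh path as a template file.
run_job = BashOperator(
    task_id='run_job',
    bash_command='bash /path/to/job.sh ',  # hypothetical path
    dag=dag)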
Airflow: Complex DAG
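The complex-DAG slides were screenshots; what makes a DAG complex is the dependency wiring between tasks. A sketch with illustrative task names, using the Airflow 1.x set_downstream API:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('complex_demo', start_date=datetime(2016, 8, 1),
          schedule_interval='@daily')

# Stand-in tasks; in a real pipeline each wraps its own script
extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
clean = BashOperator(task_id='clean', bash_command='echo clean', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)
report = BashOperator(task_id='report', bash_command='echo report', dag=dag)

# Dependency management: clean runs after extract,
# then load and report fan out in parallel
extract.set_downstream(clean)
clean.set_downstream([load, report])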
Why would you want to use it?
• Data Warehousing
• Anomaly Detection
• Search Ranking
• Model Training
• Text Analysis
• Experimentation (e.g. A/B tests)
• Data Cleaning
• 3rd Party Data Integration