Intro to Airflow: goodbye cron, welcome scheduled workflow management

A I R F L O W

Upload: burasakorn-sabyeying

Post on 21-Jan-2018


TRANSCRIPT

A I R F L O W

MILS BURASAKORN, DATA ENGINEER

g r o u p

m a r k e t i n g t o o l s

Value driven “marketing as a service” agency for small business

Best-in-class marketing and productivity tools for small business

DEALS WITH LONG-RUNNING PROCESSES

IMAGINE YOU WORK FOR A DATA-DRIVEN COMPANY

NIGHTLY DATA LOADS INTO THE DATA WAREHOUSE

USES A WORKFLOW SCHEDULER TO COORDINATE

C R O N

10 1 * * * echo "hello world" >> hello.log

execute commands or scripts (groups of commands) automatically at a specified time/date

Every 1 minute      * * * * *
Every 15 minutes    */15 * * * *
Every 30 minutes    */30 * * * *
Every 1 hour        0 * * * *
Every 6 hours       0 */6 * * *
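To make the schedule expressions above concrete, here is a minimal Python sketch of how the minute and hour fields could be checked against a given time. It is an illustration only, not how cron itself is implemented: the helper names `field_matches` and `cron_matches` are hypothetical, and the sketch understands only `*`, `*/n` and plain numbers.

```python
def field_matches(field: str, value: int) -> bool:
    """Return True if one cron field ("*", "*/15", "0", ...) matches value."""
    if field == "*":
        return True                      # wildcard: every value matches
    if field.startswith("*/"):
        step = int(field[2:])
        return value % step == 0         # step syntax: every n-th value
    return int(field) == value           # plain number: exact match


def cron_matches(expr: str, minute: int, hour: int) -> bool:
    """Check the minute and hour fields of a five-field cron expression.

    Simplification (assumption of this sketch): the day-of-month, month
    and day-of-week fields must all be "*".
    """
    minute_f, hour_f, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles '*' in the last three fields")
    return field_matches(minute_f, minute) and field_matches(hour_f, hour)


if __name__ == "__main__":
    # "Every 15 minutes" fires at minute 30 but not at minute 31.
    print(cron_matches("*/15 * * * *", minute=30, hour=7))
    print(cron_matches("*/15 * * * *", minute=31, hour=7))
```

The same helper shows that `0 */6 * * *` matches 12:00 but not 13:00, which is what "every 6 hours" means in the table.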

“Time-based job scheduler”

C R O N =

good old cron scheduler to get started

However, we found it hard to manage and monitor the status of the jobs.

E T L

We want data to be processed. The process is made up of many steps (tasks or jobs).

WHY (NOT) CRON ?

IT CANNOT HANDLE DEPENDENCIES BETWEEN TASKS

USING CRON BECAME A HEADACHE

▸ It’s very difficult to add new jobs in complex crons.

▸ Hard to debug and maintain. The crontab is just a text file.

▸ Failure handling

▸ The developer needs to write a program for cron to call

▸ No scalability

https://danidelvalle.me/2016/09/12/im-sorry-cron-ive-met-airbnbs-airflow/

SO MAYBE....

Pinball

… workflow management tools …


“IF I HAD TO BUILD A NEW ETL SYSTEM TODAY FROM SCRATCH, I WOULD USE AIRFLOW.”

- MARTON TRENCSENI

HTTP://BYTEPAWN.COM/LUIGI-AIRFLOW-PINBALL.HTML

- started by Maxime Beauchemin at Airbnb in 2014

- joined the Apache Software Foundation’s incubation program in 2016

A I R F L O W ?

Airflow is a platform to programmatically author, schedule and monitor workflows.

- It’s been built to scale

- Python script (configuration as code)

- active development

- Rich web UI

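- In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies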

https://en.wikipedia.org/wiki/Directed_acyclic_graph

- define DAGs = define workflows (yes, in Python code!)
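The idea of a workflow as a directed acyclic graph can be sketched with the Python standard library alone. This is not Airflow code: the task names are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler's job of finding an order that respects every dependency.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL workflow: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return False if the dependency graph contains a cycle (so it is not a DAG)."""
    try:
        run_order(deps)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(run_order(dependencies))
    # Two tasks that depend on each other can never be scheduled.
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

The "acyclic" part is what makes scheduling possible: as soon as two tasks depend on each other, no valid order exists.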

DAG

task

While DAGs describe how to run a workflow, an operator describes a single task in a workflow.

Airflow is not a data streaming solution. Tasks do not move data from one to the other.

O P E R A T O R S

BashOperator - executes a bash command

PythonOperator - calls an arbitrary Python function

EmailOperator - sends an email

SimpleHttpOperator - sends an HTTP request

MySqlOperator, SqliteOperator, PostgresOperator, etc. - execute a SQL command

Sensor - waits for a certain time, file, database row, S3 key, etc…

and more in ….airflow/contrib/ directory

more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator

‣ Email notifications of task retries or failures.

‣ Specifying task dependencies is straightforward.

‣ Automatically retries failed jobs.

‣ A cool DAG visualization, which can also be used to perform some maintenance.

‣ A powerful CLI, useful for testing new tasks or DAGs.

‣ Logging! See the output of each task execution.

‣ Scaling! Integration with Apache Mesos and Celery.
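The automatic retry behaviour in the list above is the part cron users usually have to write by hand. A minimal sketch of that idea follows. The function `run_with_retries` is hypothetical; its `retries` and `retry_delay` parameters borrow the names of Airflow task arguments, but the retry loop here is an illustration, not Airflow's implementation.

```python
import time


def run_with_retries(task, retries=3, retry_delay=0.0):
    """Call task(); if it raises, retry up to `retries` more times.

    The last exception is re-raised once all retries are used up, which is
    the point where a scheduler would mark the task as failed and notify.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise                    # out of retries: give up
            attempt += 1
            time.sleep(retry_delay)      # wait before the next attempt


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_task():
        """Hypothetical task that fails twice, then succeeds."""
        calls["count"] += 1
        if calls["count"] < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    print(run_with_retries(flaky_task, retries=3))
    print(calls["count"])
```

With a crontab, this loop would have to live inside every script that cron calls. A workflow scheduler lets it be declared once per task.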

P R O S

▸ UI or webserver

U I / W E B S E R V E R


ex. Today is 06-05 (June 05, 2017)

actual run

data we want on that day

E X E C U T I O N   D A T E   vs   S T A R T   D A T E
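The difference between the two dates is easier to see with numbers. The sketch below assumes Airflow's interval-based scheduling, where a run stamped with a given execution date is triggered only once the period it covers has ended. The function name `earliest_run_time` is hypothetical and the arithmetic is a simplification of what the scheduler does.

```python
from datetime import datetime, timedelta


def earliest_run_time(execution_date, schedule_interval):
    """Return the earliest moment a run for `execution_date` can start.

    Assumption of this sketch: a run covers the period beginning at its
    execution date and is triggered once that period is over.
    """
    return execution_date + schedule_interval


if __name__ == "__main__":
    # A daily run that processes the data of June 04, 2017 ...
    execution_date = datetime(2017, 6, 4)
    # ... actually runs on June 05, 2017 at the earliest.
    print(earliest_run_time(execution_date, timedelta(days=1)))  # 2017-06-05 00:00:00
```

Under that assumption, a daily job that actually runs today (06-05) carries the execution date 06-04, the day whose data it processes.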

HTTP://SITE.CLAIRVOYANTSOFT.COM/SETTING-APACHE-AIRFLOW-CLUSTER/

Single Node

WEBSERVER + SCHEDULER + WORKER

WEBSERVER + SCHEDULER + WORKER

HTTP://SITE.CLAIRVOYANTSOFT.COM/SETTING-APACHE-AIRFLOW-CLUSTER/

Multi-Node (Cluster)

E X E C U T O R

▸ Sequential executor: runs only one task instance at a time.

▸ Local executor: executes tasks locally, in parallel.

▸ Celery executor: distributes the execution of task instances across multiple worker nodes.
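The difference between the first two executors can be illustrated with the standard library. This is only an analogy, not Airflow's executor code: `run_sequential` and `run_local_parallel` are hypothetical helpers, and a thread pool stands in for the local worker processes. The Celery case is left out because it needs a message broker and separate worker machines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(tasks):
    """Run the tasks one after another, like a sequential executor."""
    return [task() for task in tasks]


def run_local_parallel(tasks, parallelism=4):
    """Run the tasks on one machine with up to `parallelism` at the same time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order, whatever order they finish in.
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Hypothetical independent tasks: each one just squares a number.
    tasks = [lambda n=n: n * n for n in range(3)]
    print(run_sequential(tasks))
    print(run_local_parallel(tasks))
```

Both helpers return the same results. Only the degree of concurrency changes, which is why the executor can be swapped without touching the DAG definition.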

DAG File

} Importing modules

} Default Arguments

} DAG

} tasks

} dependencies

C O N S

“Time-based job scheduler”

C R O N

“Workflow scheduler / management”

A I R F L O W

▸ Documentation: https://airflow.incubator.apache.org/

▸ Install Documentation: https://airflow.incubator.apache.org/installation.html

▸ GitHub Repo: https://github.com/apache/incubator-airflow

www.facebook.com/girlswhodev/

Q&A