Data Workflows at Foursquare using Luigi

Upload: james-moore

Post on 01-Dec-2015

DESCRIPTION

OANYC Summit

TRANSCRIPT

Page 1: Luigi Presentation

Data Workflows at Foursquare using Luigi

Page 2: Luigi Presentation

Foursquare

• 35 million users

• Nearly 4 billion check-ins

• More than 5 million check-ins per day

• 50 million point-of-interest database

• 100's of GB of log data per day

Page 3: Luigi Presentation

Tools We Use

• Hive

  o Ad hoc analytics, data dumping ground

• Raw MapReduce

  o 100's of MapReduce jobs in our codebase

• Pig

  o Fits between structured Hive and free-form MapReduce

• Vertica

  o Low latency analytics

Page 4: Luigi Presentation

Cron

E.g.

0 0 * * * ./hadoop-script-1.sh

# Wait two hours for that job to finish...

0 2 * * * ./hadoop-script-2.sh

# And on and on and on

Page 5: Luigi Presentation

Cron - Problems

• Brittle

• Hard to reason about / visualize

• Spend a lot of time waiting

• Difficult to tell what succeeded or failed

• No one likes writing Bash scripts

Page 6: Luigi Presentation

Oozie

XML-based Workflow Engine, with support for Hadoop, Hive, and Pig

Workflows specify computations as a DAG, e.g., "Run this Hive query, then run these two MapReduce jobs in parallel"

Coordinators launch recurring workflows at a given frequency, when dependent data is available
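
To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not Oozie (Oozie workflows are written in XML); it only shows how a dependency graph determines which steps can run in parallel. The step names are hypothetical, chosen to mirror the quoted example, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each key maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave may run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(workflow))
# [['hive_query'], ['mapreduce_a', 'mapreduce_b']]
```

The Hive query forms the first wave on its own, and both MapReduce jobs land in the second wave because neither depends on the other.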

Page 7: Luigi Presentation

Oozie - Example

Page 8: Luigi Presentation

Oozie - Problems

• Workflows are all-or-nothing

  o Cannot just run step that failed

  o Very little code reuse

• Little to no extensibility

• Limited control flow

• Extremely verbose

• Difficult to test

• No one likes writing XML

Page 9: Luigi Presentation

Luigi

• Python framework for batch processing jobs

• Created by Spotify, open-sourced Sept. 2012

• Tasks are units of work that produce Targets

• Tasks can depend on one or more other Tasks

• A Task is only run once all of the Tasks it depends on are done

• Tasks are idempotent
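
The bullets above can be illustrated with a toy model in plain Python. This is a sketch of the concepts only, not Luigi's actual implementation or API: the `Target`, `Task`, and `build` definitions below are simplified stand-ins, the in-memory `store` replaces real files or HDFS paths, and `ExtractLogs` is a hypothetical upstream task.

```python
class Target:
    """Toy stand-in for an output; here just a name in an in-memory set."""
    store = set()

    def __init__(self, name):
        self.name = name

    def exists(self):
        return self.name in Target.store


class Task:
    """A unit of work that produces a Target."""

    def requires(self):
        return []  # Tasks this Task depends on

    def output(self):
        raise NotImplementedError

    def complete(self):
        # A Task counts as done when its Target exists.
        return self.output().exists()

    def run(self):
        raise NotImplementedError


class ExtractLogs(Task):  # hypothetical upstream task
    def output(self):
        return Target("logs")

    def run(self):
        Target.store.add("logs")


class WordCount(Task):
    def requires(self):
        return [ExtractLogs()]

    def output(self):
        return Target("word_count")

    def run(self):
        Target.store.add("word_count")


def build(task, ran):
    """Run dependencies first, and skip anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, ran)
    task.run()
    ran.append(type(task).__name__)


first, second = [], []
build(WordCount(), first)   # runs ExtractLogs, then WordCount
build(WordCount(), second)  # nothing to do: outputs already exist
print(first, second)
```

The second `build` call does no work because both Targets already exist, which is the sense in which re-running a Task is safe.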

Page 10: Luigi Presentation

Luigi - Example Task

Page 11: Luigi Presentation

Luigi - Running the Task

$ python word-count.py WordCount --date 2013-06-01

Page 12: Luigi Presentation

Luigi - Scheduler

Central scheduler ensures each Task is only run by a single worker.

A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)

Will retry failed Tasks after a configured timeout

Emails someone when a Task fails
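
As a rough sketch of the scheduler behaviour described on this slide, the toy model below identifies a task by its class name and parameters, lets only one worker claim it, and allows a retry once a configured delay has passed after a failure. It is written in plain Python and does not reflect how Luigi's central scheduler is actually implemented; `ToyScheduler`, `task_id`, the worker names, and the use of plain numbers for time (seconds) are all assumptions made for illustration.

```python
import datetime


def task_id(class_name, **params):
    """Build an identifier such as WordCount(date=2013-06-01)."""
    args = ", ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"{class_name}({args})"


class ToyScheduler:
    def __init__(self, retry_delay):
        self.retry_delay = retry_delay
        self.owner = {}      # task id -> worker that claimed it
        self.failed_at = {}  # task id -> time of the last failure

    def request(self, worker, tid, now):
        """Return True if this worker may run the task."""
        if tid in self.failed_at:
            if now - self.failed_at[tid] < self.retry_delay:
                return False  # still waiting out the retry delay
            del self.failed_at[tid]
        # The first worker to ask claims the task; others are refused.
        return self.owner.setdefault(tid, worker) == worker

    def report_failure(self, tid, now):
        """Release the claim so the task can be retried later."""
        self.owner.pop(tid, None)
        self.failed_at[tid] = now


tid = task_id("WordCount", date=datetime.date(2013, 6, 1))
scheduler = ToyScheduler(retry_delay=60)

granted_1 = scheduler.request("worker-1", tid, now=0)   # first claim
granted_2 = scheduler.request("worker-2", tid, now=1)   # already claimed
scheduler.report_failure(tid, now=2)
too_early = scheduler.request("worker-2", tid, now=30)  # delay not elapsed
retried = scheduler.request("worker-2", tid, now=62)    # delay elapsed
print(tid, granted_1, granted_2, too_early, retried)
```

Because the identifier is derived from the class name and parameters, two workers asking for `WordCount` on the same date are asking for the same task, while a different date yields a different task.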

Page 13: Luigi Presentation

Luigi - Visualizer

Page 14: Luigi Presentation

Luigi - Visualizer

Page 15: Luigi Presentation

Luigi - Visualizer

Page 16: Luigi Presentation

Luigi - Advantages over Cron

• Explicit dependencies

• No wasted time waiting

• Easy to tell what has failed

• Avoid duplicate work / partial failures

Page 17: Luigi Presentation

Luigi - Advantages over Oozie

• Explicit dependencies between workflows

• Easier to write

• Vastly more extensible

• Code reuse

• Can easily re-run individual steps

Page 18: Luigi Presentation

Thank you!

Check out Luigi:

https://github.com/spotify/luigi

Drop me a line:

Joe Ennever

[email protected]