
Maggy - Open-Source Asynchronous Distributed Hyperparameter Optimization Based on Apache Spark

Moritz Meister
[email protected]
@morimeister

FOSDEM'20, 02.02.2020



The Bitter Lesson (of AI)*

“Methods that scale with computation are the future of AI”**

“The two (general purpose) methods that seem to scale ... are search and learning.”*

Rich Sutton (Father of Reinforcement Learning)

* http://www.incompleteideas.net/IncIdeas/BitterLesson.html
** https://www.youtube.com/watch?v=EeMCEQa85tw

The Answer

Spark is the answer! Spark scales with available compute.

Distribution and Deep Learning

[Diagram: ways to reduce the generalization error — better regularization methods, designing better models, hyperparameter optimization, larger training datasets, better optimization algorithms, and distribution.]

Inner and Outer Loop of Deep Learning

[Diagram: the inner loop is distributed training — workers 1..N train on the training data with synchronization between them; the outer loop is a search method that issues trials (hyperparameters, architecture, etc.) to the inner loop and receives a metric back. Image: http://tiny.cc/51yjdz]

Inner and Outer Loop of Deep Learning

[Same diagram, annotated: the outer loop is SEARCH, the inner loop is LEARNING. Image: http://tiny.cc/51yjdz]

In Reality This Means Rewriting Training Code

[Diagram: an end-to-end ML pipeline — data pipelines (ingest & prep), feature store, machine learning experiments (explore/design model, hyperparameter optimization, ablation studies, data parallel training), and model serving.]

Towards Distribution Transparency

The distribution-oblivious training function (pseudo-code) — this is the inner loop:
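A minimal sketch of what such a distribution-oblivious training function can look like. The Keras model, the dataset, and the hyperparameter names (kernel, pool, dropout) are illustrative assumptions, not taken from the slides; the reporter argument and its broadcast call follow Maggy's documented examples as I recall them and should also be treated as assumptions.

```python
def training_function(kernel, pool, dropout, reporter):
    """Plain training code: no Spark, no distribution logic.

    The surrounding framework decides where and how often this function
    runs; it only receives hyperparameters and returns the metric.
    """
    import tensorflow as tf  # imported inside, since the function runs on executors

    # Illustrative dataset and model; any framework would do.
    (x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
    x_train, x_val = x_train[..., None] / 255.0, x_val[..., None] / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(pool),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=2, batch_size=128, verbose=0)

    # Report the metric back to the outer loop (the search method).
    _, accuracy = model.evaluate(x_val, y_val, verbose=0)
    reporter.broadcast(metric=accuracy)  # assumed reporter API: stream metric to the driver
    return accuracy
```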

Towards Distribution Transparency

• Trial and error is slow
• The iterative approach is greedy
• Search spaces are usually large
• Hyperparameters are sensitive and interact

[Diagram: the manual tuning cycle — set hyperparameters, train the model, evaluate performance, repeat.]

Sequential Black Box Optimization

[Diagram: the outer loop treats learning as a black box — a meta-level learning & optimization component draws trials from the search space, and the black box trains and returns a metric.]
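As a concrete picture of this sequential loop, here is a minimal sketch; the sample/update interface and the train_and_evaluate black box are hypothetical names used only for illustration.

```python
import random

# Hypothetical search space: name -> (low, high)
search_space = {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}

def sample(space):
    """Meta-level optimizer: here, plain random search."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in space.items()}

def train_and_evaluate(config):
    """The black box: trains a model and returns a metric (stubbed here)."""
    return -(config["learning_rate"] - 0.01) ** 2 - config["dropout"] * 0.01

best_config, best_metric = None, float("-inf")
for _ in range(25):                      # one trial at a time: sequential search
    config = sample(search_space)        # outer loop proposes a trial
    metric = train_and_evaluate(config)  # inner loop returns a metric
    if metric > best_metric:
        best_config, best_metric = config, metric

print(best_config, best_metric)
```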

Sequential Search

[Diagram: a global controller proposes one trial at a time to the learning black box and receives a metric back.]

Parallel Search

[Diagram: a global controller keeps a queue of trials, drawn from a search space or an ablation study, and dispatches them to parallel workers; each worker runs the learning black box and returns a metric.]

Which algorithm to use for search? How to monitor progress?
Fault tolerance? How to aggregate results?

Parallel Search

[Same diagram, with the global controller replaced by a meta-level learning & optimization component that fills the queue of trials for the parallel workers.]

Which algorithm to use for search? How to monitor progress?
Fault tolerance? How to aggregate results?

This should be managed with platform support!

Maggy

A flexible framework for asynchronous parallel execution of trials for ML experiments on Hopsworks:

ASHA, Random Search, Grid Search, LOCO Ablation, Bayesian Optimization, and more to come…
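A hedged sketch of how an experiment is launched with one of these optimizers. The Searchspace and experiment.lagom names follow Maggy's documentation, but the exact parameter names and value types are assumptions and may differ between versions.

```python
from maggy import experiment, Searchspace

# Define the search space (types and bounds are illustrative).
sp = Searchspace(kernel=('INTEGER', [2, 8]),
                 pool=('INTEGER', [2, 8]),
                 dropout=('DOUBLE', [0.01, 0.99]))

# Launch the experiment: the training_function above runs as asynchronous
# trials on the Spark executors, driven by the chosen optimizer
# (e.g. 'randomsearch' or 'asha').
result = experiment.lagom(training_function,
                          searchspace=sp,
                          optimizer='randomsearch',
                          direction='max',
                          num_trials=15,
                          name='MNIST')
```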

Synchronous Search

[Diagram: synchronous search on Spark — the driver launches stages of tasks (Task 1..N per stage); every stage ends at a barrier, after which the collected metrics (Metrics 1, 2, 3) are exchanged via HDFS and the next stage of trials starts.]

Add Early Stopping and Asynchronous Algorithms

[Same diagram: with early stopping or asynchronous algorithms, tasks whose trials finish or are stopped early still have to wait at each barrier — wasted compute in every stage.]

Performance Enhancement

Early Stopping:

● Median Stopping Rule (sketched below)

● Performance curve prediction

Multi-fidelity Methods:

● Successive Halving Algorithm

● Hyperband
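As an illustration of the simplest of these, a sketch of the median stopping rule; the bookkeeping and function names are my own, not Maggy's.

```python
from statistics import median

def should_stop(trial_history, other_trials_histories, step):
    """Median stopping rule: stop a trial at `step` if its best metric so far
    is worse than the median of the other trials' running averages at the
    same step. Assumes higher metric = better."""
    comparable = [h for h in other_trials_histories if len(h) > step]
    if not comparable:
        return False
    running_averages = [sum(h[:step + 1]) / (step + 1) for h in comparable]
    return max(trial_history[:step + 1]) < median(running_averages)
```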

Asynchronous Successive Halving Algorithm

Animation: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
Liam Li et al. “Massively Parallel Hyperparameter Tuning”. In: CoRR abs/1810.05934 (2018).
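The core of ASHA is its asynchronous promotion rule: whenever a worker asks for work, promote a configuration to the next rung if it is in the top 1/η of the trials completed at its current rung; otherwise start a fresh configuration at the bottom rung. A minimal sketch follows; the data structures and names are illustrative, not Maggy's implementation.

```python
import random

ETA = 3                        # reduction factor
rungs = {0: [], 1: [], 2: []}  # rung -> list of (config, metric) completed there
promoted = set()               # (rung, id(config)) pairs already promoted

def get_job():
    """Called whenever a worker becomes free (asynchronously)."""
    # Try to promote from the highest non-top rung downwards.
    for rung in sorted(rungs, reverse=True)[1:]:
        results = sorted(rungs[rung], key=lambda cm: cm[1], reverse=True)
        top_k = results[:len(results) // ETA]
        for config, _ in top_k:
            if (rung, id(config)) not in promoted:
                promoted.add((rung, id(config)))
                return config, rung + 1          # rerun with a larger budget
    # Nothing promotable: start a new random configuration at the bottom rung.
    new_config = {"lr": random.uniform(1e-4, 1e-1)}
    return new_config, 0

# When a trial finishes, the caller records it at its rung:
# rungs[rung].append((config, metric))
```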

Ablation Studies

[Diagram: feature ablation on the Titanic dataset — e.g. training with and without the Pclass, name, and sex features and comparing the "survived" prediction metric.]

Replacing the Maggy Optimizer with an Ablator:

• Feature Ablation using the Feature Store

• Leave-One-Layer-Out Ablation

• Leave-One-Component-Out (LOCO)
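A minimal sketch of leave-one-feature-out ablation as a loop of trials; the dataset handling and the train_and_evaluate callback are illustrative, not Maggy's Ablation API.

```python
import pandas as pd

def loco_feature_ablation(df: pd.DataFrame, label: str, train_and_evaluate):
    """Train once with all features, then once per feature left out,
    and report how much the metric drops for each ablated feature."""
    features = [c for c in df.columns if c != label]
    baseline = train_and_evaluate(df[features], df[label])

    results = {}
    for feature in features:
        kept = [f for f in features if f != feature]
        metric = train_and_evaluate(df[kept], df[label])
        results[feature] = baseline - metric   # importance: drop vs. baseline
    return baseline, results
```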

Challenge

How can we fit this into the bulk-synchronous execution model of Spark?
Mismatch: Spark tasks and stages vs. trials.

Databricks’ approach: Project Hydrogen (barrier execution mode) & SparkTrials in Hyperopt

The Solution

[Diagram: a single long-running Spark stage — the driver communicates with Tasks 1..N inside one barrier, receiving metrics and sending "new trial" and "early stop" messages.]

Long-running tasks and communication with the driver (contrast with Hyperopt: one Spark job per trial, requiring many threads on the driver).
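The idea can be sketched as a message loop inside each long-lived task; the rpc client, message names, and Reporter class below are hypothetical, used only to illustrate the pattern.

```python
class EarlyStopSignal(Exception):
    """Raised when the driver requests an early stop for the current trial."""

class Reporter:
    """Streams intermediate metrics to the driver and checks for early stops
    (hypothetical helper, mirroring the reporter in the training function)."""
    def __init__(self, rpc, trial_id):
        self.rpc, self.trial_id = rpc, trial_id
    def broadcast(self, metric):
        if self.rpc.heartbeat(self.trial_id, metric) == "EARLY_STOP":
            raise EarlyStopSignal()

def worker_loop(rpc, training_function):
    """Runs inside one long-lived Spark task for the whole experiment."""
    while True:
        msg = rpc.request_trial()                # ask the driver for work
        if msg is None:                          # experiment finished
            break
        trial_id, hparams = msg
        reporter = Reporter(rpc, trial_id)
        try:
            metric = training_function(**hparams, reporter=reporter)
            rpc.finalize(trial_id, metric)       # free the slot for the next trial
        except EarlyStopSignal:
            rpc.finalize(trial_id, None)
```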

Enter Maggy

User API

Developer API

Ablation API


Results

[Charts: hyperparameter optimization task and ASHA validation task — comparing ASHA, random search with early stopping (RS-ES), and random search without stopping (RS-NS).]

Conclusions

● Avoid iterative hyperparameter optimization
● Black box optimization is hard
● State-of-the-art algorithms can be deployed asynchronously
● Maggy: platform support for automated hyperparameter optimization and ablation studies
● Save resources with asynchrony
● Early stopping for sensible models

What’s next?

● More algorithms
● Distribution transparency
● Comparability/reproducibility of experiments
● Implicit provenance
● Support for PyTorch

Acknowledgements

Thanks to the entire Logical Clocks Team ☺

Contributions from colleagues:
Robin Andersson @robzor92
Sina Sheikholeslami @cutlash
Kim Hammar @KimHammar1
Alex Ormenisan @alex_ormenisan

• Maggy
  https://github.com/logicalclocks/maggy
  https://maggy.readthedocs.io/en/latest/

• Hopsworks
  https://github.com/logicalclocks/hopsworks
  https://www.logicalclocks.com/whitepapers/hopsworks

• Feature Store: the missing data layer in ML pipelines?
  https://www.logicalclocks.com/feature-store/

@hopsworks HOPSWORKS