Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016

BAYESIAN GLOBAL OPTIMIZATION
Using Optimal Learning to Tune ML Models

Scott Clark
[email protected]


TRANSCRIPT

Page 1

BAYESIAN GLOBAL OPTIMIZATIONUsing Optimal Learning to Tune ML Models

Scott Clark

[email protected]

Page 2

OUTLINE

1. Why is Tuning ML Models Hard?

2. Standard Tuning Methods

3. Bayesian Global Optimization

4. Comparing Optimizers

5. Real World Examples

Page 3
Page 4

Machine Learning is extremely powerful

Page 5

Machine Learning is extremely powerful

Tuning Machine Learning systems is extremely non-intuitive

Page 6

https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3

What is the most important unresolved problem in machine learning?

“...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”

Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)

Page 7

Photo: Joe Ross

Pages 8-10

TUNABLE PARAMETERS IN DEEP LEARNING
(animated build across three slides)

Page 11

Photo: Tammy Strobel

Page 12

STANDARD METHODS FOR HYPERPARAMETER SEARCH

Page 13

EXAMPLE: FRANKE FUNCTION

Page 14

Grid Search vs. Random Search

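To make the comparison concrete, here is a minimal sketch (an assumed setup, not the talk's code) pitting an 8×8 grid against the same budget of random points on the Franke function:

```python
# Minimal sketch: grid search vs. random search on the 2-D Franke function.
# The function below is the standard textbook definition; the budget of 64
# evaluations and the search loops are illustrative assumptions.
import numpy as np

def franke(x, y):
    return (0.75 * np.exp(-((9*x - 2)**2 + (9*y - 2)**2) / 4)
          + 0.75 * np.exp(-(9*x + 1)**2 / 49 - (9*y + 1) / 10)
          + 0.5  * np.exp(-((9*x - 7)**2 + (9*y - 3)**2) / 4)
          - 0.2  * np.exp(-(9*x - 4)**2 - (9*y - 7)**2))

budget = 64

# Grid search: an evenly spaced 8x8 lattice over [0, 1]^2
g = np.linspace(0, 1, 8)
grid_best = max(franke(x, y) for x in g for y in g)

# Random search: the same budget of uniformly random points
rng = np.random.default_rng(0)
rand_best = max(franke(x, y) for x, y in rng.uniform(0, 1, size=(budget, 2)))

print(f"grid best: {grid_best:.4f}   random best: {rand_best:.4f}")
```

With a fixed budget, random search probes 64 distinct values per dimension instead of 8, which is why it often edges out the grid on functions like this (Bergstra & Bengio, 2012).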

Page 15

TUNING MACHINE LEARNING MODELS

(Diagram: Big Data feeds Predictive Models, which produce an Objective Metric; the tuner suggests New Parameters, yielding Better Models.)

Page 16

BAYESIAN GLOBAL OPTIMIZATION

Page 17

“…the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.”
Prof. Warren Powell, Princeton

“What is the most efficient way to collect information?”
Prof. Peter Frazier, Cornell

“How do we make the most money, as fast as possible?”
Scott Clark, CEO, SigOpt

OPTIMAL LEARNING

Page 18

● Optimize objective function
  ○ Loss, accuracy, likelihood

● Given parameters
  ○ Hyperparameters, feature parameters

● Find the best hyperparameters
  ○ Sample the function as few times as possible
  ○ Training on big data is expensive

BAYESIAN GLOBAL OPTIMIZATION

Page 19

1. Build Gaussian Process (GP) with points sampled so far

2. Optimize the fit of the GP (covariance hyperparameters)

3. Find the point(s) of highest Expected Improvement within parameter domain

4. Return optimal next best point(s) to sample

HOW DOES IT WORK?
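A minimal sketch of the four steps above, assuming scikit-learn's GaussianProcessRegressor and a random candidate set for the EI maximization; names and defaults here are illustrative, not SigOpt's implementation:

```python
# Bayesian global optimization loop: fit a GP, maximize Expected
# Improvement, sample the suggested point, repeat.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, candidates, y_best):
    # EI(x) = E[max(f(x) - y_best, 0)] under the GP posterior
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)            # guard against zero variance
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=5, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    # Seed the model with a few random samples
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        # Steps 1-2: build the GP; its covariance hyperparameters are fit
        # internally by maximizing the marginal likelihood
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                      normalize_y=True).fit(X, y)
        # Step 3: maximize EI over a random candidate set (a production
        # system would use a proper multistart optimizer here)
        cand = rng.uniform(lo, hi, size=(2048, len(bounds)))
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        # Step 4: sample the suggested point and update the data
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()

# e.g. bayes_opt(lambda p: -((p[0]-0.3)**2 + (p[1]-0.7)**2), [(0, 1), (0, 1)])
```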

Pages 20-27

GAUSSIAN PROCESSES
(animated build across eight slides)

Page 28

(Panels: overfit, good fit, underfit)

GAUSSIAN PROCESSES

Pages 29-34

EXPECTED IMPROVEMENT
(animated build across six slides)
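For reference, the closed form behind these slides (a standard result, not shown explicitly in the deck): with GP posterior mean $\mu(x)$, standard deviation $\sigma(x)$, and best observed value $f(x^{+})$,

$$\mathrm{EI}(x) = \mathbb{E}\bigl[\max\bigl(f(x) - f(x^{+}),\,0\bigr)\bigr] = \bigl(\mu(x) - f(x^{+})\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z), \qquad z = \frac{\mu(x) - f(x^{+})}{\sigma(x)},$$

where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. Points score highly either because the predicted mean is good (exploitation) or because the uncertainty is large (exploration).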

Page 35

EVALUATING THE OPTIMIZER

Page 36

What is the best value found after optimization completes?

METRIC: BEST FOUND

             BLUE     RED
BEST_FOUND   0.7225   0.8949

Page 37

How quickly is the optimum found? (area under the best-seen curve)

METRIC: AUC

             BLUE     RED
BEST_FOUND   0.9439   0.9435
AUC          0.8299   0.9358

Page 38

● Optimization functions (e.g. Branin, Ackley, Rosenbrock)
● ML datasets (LIBSVM)

BENCHMARK SUITE

TEST FUNCTION TYPE        COUNT

Continuous Params         184
Noisy Observations        188
Parallel Observations     45
Integer Params            34
Categorical Params / ML   47
Failure Observations      30

TOTAL                     489

Page 39

● On-demand cluster in AWS for optimizing the evaluation functions in parallel

● A full evaluation consists of ~10,000 optimizations, taking ~4 hours

INFRASTRUCTURE

Page 40

VIZ TOOL: BEST SEEN TRACES

Page 41

METRICS: STOCHASTICITY

● Run each optimization 20 times

● Mann-Whitney U test for significance
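A minimal sketch of that test (synthetic stand-in data, not the benchmark's actual numbers): compare the 20 BEST_FOUND values from two optimizers with scipy's Mann-Whitney U.

```python
# Nonparametric significance test over repeated optimization runs.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
best_found_a = rng.normal(0.89, 0.01, size=20)  # stand-in: optimizer A's 20 runs
best_found_b = rng.normal(0.72, 0.01, size=20)  # stand-in: optimizer B's 20 runs

stat, p = mannwhitneyu(best_found_a, best_found_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.2e}")  # small p: the gap is unlikely to be noise
```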

Page 42

RANKING OPTIMIZERS

● Alternative methods exist for black-box optimization: Spearmint, TPE, SMAC, PSO, Random Search, Grid Search

● Important to understand and track how method performance differs across high-level categories of functions

● For a given test function, we want a partial ranking (allowing for ties) of method performance

Page 43

RANKING OPTIMIZERS

● First, Mann-Whitney U tests using BEST_FOUND

● Tied results are then partially ranked using AUC

● Any remaining ties stay as ties in the final ranking

Page 44

RANKING AGGREGATION

● Aggregate partial rankings across all evaluation functions using a Borda count (sum of methods ranked lower); a small sketch follows
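An illustrative Borda aggregation under the convention that rank 1 is best and ties share a rank (the method names and rankings below are made up):

```python
# Borda count: a method's score on each test function is the number of
# methods ranked strictly below it; scores are summed across functions.
from collections import defaultdict

def borda(partial_rankings):
    scores = defaultdict(int)
    for ranks in partial_rankings:  # one {method: rank} dict per test function
        for method, r in ranks.items():
            scores[method] += sum(1 for other in ranks.values() if other > r)
    return dict(scores)

# Two test functions; on the second, TPE and SMAC tie at rank 2.
print(borda([{"SigOpt": 1, "TPE": 2, "SMAC": 3},
             {"SigOpt": 1, "TPE": 2, "SMAC": 2}]))
# -> {'SigOpt': 4, 'TPE': 1, 'SMAC': 0}
```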

Page 45

SHORT RESULTS SUMMARY

Page 46

SIGOPT SERVICE

Page 47

HOW DOES SIGOPT INTEGRATE?

(Diagram: Big Data feeds Predictive Models, which report an Objective Metric to SigOpt; SigOpt suggests New Parameters, yielding Better Models.)

Page 48

SIMPLIFIED MANAGEMENT

(Diagram: the tuning workflow before SigOpt)

Page 49

DISTRIBUTED MODEL TRAINING

● SigOpt serves as an AWS-ready distributed scheduler for training models across workers

● Each worker accesses the SigOpt API for the latest parameters to try

● Enables distributed training of non-distributed algorithms
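A minimal sketch of one such worker, using the suggestion/observation cycle of the SigOpt Python client; the token, experiment ID, and training function are placeholders.

```python
# Each worker runs this loop independently; the SigOpt API coordinates
# which parameters each one receives.
from sigopt import Connection

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")  # placeholder
EXPERIMENT_ID = 12345                                    # placeholder

def train_and_evaluate(assignments):
    """Placeholder: train a model with these parameters, return its metric."""
    raise NotImplementedError

while True:
    # Fetch the latest parameters this worker should try
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()
    metric = train_and_evaluate(suggestion.assignments)
    # Report the result so it informs every worker's next suggestion
    conn.experiments(EXPERIMENT_ID).observations().create(
        suggestion=suggestion.id, value=metric)
```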

Page 50

INTEGRATIONS

REST API

Page 52

SHORT EXAMPLES

Page 53

EXAMPLE: LOAN DATA

(Diagram: Loan Applications (Income, Credit Score, Loan Amount) feed a Default Prediction model with tunable ML parameters; the objective metric is Prediction Accuracy, and suggested New Parameters yield Better Accuracy.)

Page 54

COMPARATIVE PERFORMANCE

(Chart: accuracy (AUC, ranging .675 to .698) vs. tuning cost ($1,000 / 100 hrs up to $100,000 / 10,000 hrs) for Grid Search and Random Search.)

● Better: 22% fewer bad loans vs baseline

● Faster/Cheaper: 100x less time and AWS cost than standard tuning methods

Page 55

EXAMPLE: ALGORITHMIC TRADING

(Diagram: Market Data (Closing Prices, Day of Week, Market Volatility) feed a Trading Strategy with tunable weights and thresholds; the objective metric is Expected Revenue, and suggested New Parameters yield Higher Returns.)

Page 56

COMPARATIVE PERFORMANCE

(Chart: model returns for SigOpt vs. the Standard Method vs. the Expert baseline.)

● Better: 200% higher model returns than the expert baseline

● Faster/Cheaper: 10x faster than standard methods

Page 57

1. SigOpt Live Demo

2. More Examples
   a. Text Classification
   b. Unsupervised + Supervised
   c. Neural Nets with TensorFlow

ADDITIONAL TOPICS

Page 58

AUTOMATICALLY TUNING TEXT SENTIMENT CLASSIFIER

Page 59

● Automatically tune text sentiment classifier

● Amazon product review dataset (35K labels), e.g.: “This DVD is great. It brings back all the memories of the holidays as a young child.”

● Logistic regression is generally a good place to start

PROBLEM

Page 60

● Maximize the mean of k-fold cross-validation accuracies

● k = 5 folds; train and validation randomly split 70% / 30%

OBJECTIVE FUNCTION
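A minimal sketch of this objective under an assumed scikit-learn setup (not the talk's exact code): the reported score is the mean accuracy over five random 70% / 30% splits.

```python
# Objective: mean cross-validation accuracy over 5 random 70/30 splits.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

def objective(clf, X, y):
    cv = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.3, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()

# e.g. objective(LogisticRegression(), X_bow, y_labels), where X_bow and
# y_labels are placeholders for the BOW features built in the next section.
```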

Page 61

● n-gram vocabulary selection parameters

● (min_n_gram, ngram_offset) determine which n-grams are used
● (log_min_df, df_offset) filter for n-grams within a document-frequency range

TEXT FEATURE PARAMETERS

Original Text: “SigOpt optimizes any complicated system”

1-grams: { “SigOpt”, “optimizes”, “any”, “complicated”, “system” }

2-grams: { “SigOpt_optimizes”, “optimizes_any”, “any_complicated”, … }

3-grams: { “SigOpt_optimizes_any”, “optimizes_any_complicated”, … }

Page 62

● Logistic regression error cost parameters

M = number of training examples

θ = vector of weights the algorithm will learn for each n-gram in the vocabulary

yᵢ = training data label: {-1, 1} for our two-class problem

xᵢ = training data input vector: the BOW vectors described in the previous section

α = weight of the regularization term (log_reg_coef in our experiment)

ρ = weight of the l1 norm term (l1_coef in our experiment)

ERROR COST PARAMETERS
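The cost function itself did not survive extraction; given the definitions above, it is presumably the standard logistic loss with a mixed l1/l2 penalty, along the lines of

$$J(\theta) = \frac{1}{M}\sum_{i=1}^{M} \log\Bigl(1 + e^{-y_i\,\theta^{\top}x_i}\Bigr) + \alpha\Bigl(\rho\,\lVert\theta\rVert_1 + \frac{1-\rho}{2}\,\lVert\theta\rVert_2^2\Bigr)$$

so α scales the overall regularization and ρ trades l1 sparsity against l2 shrinkage.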

Page 63

● 50-line Python snippet to train and tune the classifier with SigOpt

● 20 lines to define the 6-parameter experiment and run the optimization loop using SigOpt; a hypothetical sketch follows

PYTHON CODE
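A hypothetical sketch in the spirit of that snippet, using the SigOpt Python client; the six parameter names follow the talk, but every bound below is a made-up placeholder.

```python
# Define the 6-parameter experiment, then run the optimization loop.
from sigopt import Connection

conn = Connection(client_token="YOUR_SIGOPT_API_TOKEN")  # placeholder
experiment = conn.experiments().create(
    name="Sentiment classifier",
    parameters=[
        dict(name="min_n_gram",   type="int",    bounds=dict(min=1,   max=2)),
        dict(name="ngram_offset", type="int",    bounds=dict(min=0,   max=2)),
        dict(name="log_min_df",   type="double", bounds=dict(min=-16, max=-4)),
        dict(name="df_offset",    type="double", bounds=dict(min=0,   max=100)),
        dict(name="log_reg_coef", type="double", bounds=dict(min=-5,  max=5)),
        dict(name="l1_coef",      type="double", bounds=dict(min=0,   max=1)),
    ],
)

def evaluate(assignments):
    """Placeholder: build features from these assignments and return the
    mean CV accuracy (see the objective sketched earlier)."""
    raise NotImplementedError

# Same suggestion/observation cycle as the distributed-training sketch;
# 60 function evaluations, as in the talk.
for _ in range(60):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=evaluate(suggestion.assignments))
```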

Page 64

● E[f(λ)] after 20 runs, each run consisting of 60 function evaluations

● For Grid Search: 64 evenly spaced parameter configurations (order shuffled randomly)

● SigOpt shows statistical significance over grid and random search (p = 0.0001, Mann-Whitney U test)

PERFORMANCE

             SigOpt            Rnd. Search       Grid Search       No Tuning (Baseline)
Best Found   0.8760 (+5.72%)   0.8673 (+4.67%)   0.8680 (+4.76%)   0.8286

Page 65

EXPLOITING UNLABELLED DATA

Page 66

● Classify house-number digits with little labelled data

● Challenging digit variations, image clutter with neighboring digits

PROBLEM

Page 67

● In general we’ll search for an optimized ML pipeline

OBJECTIVE

Page 68

● Transform image patches into vectors of centroid distances, then pool to form final representation

● SigOpt optimizes selection of w, pool_r, K

UNSUPERVISED MODEL PARAMS

Page 69

● A whitening transform is often useful as an image pre-processing step; we expose εZCA to SigOpt

UNSUPERVISED MODEL PARAMS
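For reference, the standard ZCA whitening transform (assumed here; the slide exposes only the $\varepsilon_{\mathrm{ZCA}}$ knob): with patch covariance eigendecomposition $\mathrm{cov}(X) = U \Lambda U^{\top}$,

$$X_{\mathrm{ZCA}} = U\,\bigl(\Lambda + \varepsilon_{\mathrm{ZCA}} I\bigr)^{-1/2} U^{\top} X$$

Larger $\varepsilon_{\mathrm{ZCA}}$ damps the small-eigenvalue (noise) directions, and SigOpt tunes it alongside the other pipeline parameters.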

Page 70

● Tune sparsity of centroid distance transform

● SigOpt optimizes threshold (active_p) selection

UNSUPERVISED MODEL PARAMS

Page 71

● Learning rate, number of trees, and tree parameters (max_depth, sub_sample_sz) exposed to SigOpt

SUPERVISED MODEL PARAMS

Page 72

METRIC OPTIMIZATION

Page 73

● 20 optimization runs, each consisting of 90 / 40 function evaluations for the Unsup / Raw feature settings

● Optimized a single CV fold on the training set; ACC reported on the held-out test set

PERFORMANCE

               SigOpt            Rnd Search        SigOpt        Rnd Search    No Tuning RF
               (Unsup Feats)     (Unsup Feats)     (Raw Feats)   (Raw Feats)   (Raw Feats)
Hold Out ACC   0.8708 (+51.4%)   0.8583            0.6844        0.6739        0.5751

Page 74

EFFICIENTLY BUILDING CONVNETS

Page 75

● Classify house numbers with more training data and more sophisticated model

PROBLEM

Page 76

● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset?

CONVNET STRUCTURE

Page 77

● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best

● Still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms

STOCHASTIC GRADIENT DESCENT
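For reference, one common RMSProp-with-momentum formulation (an assumed standard form, not taken from the slides), showing where the three terms enter: with gradient $g_t$,

$$r_t = \gamma\, r_{t-1} + (1-\gamma)\, g_t^2, \qquad v_t = \beta\, v_{t-1} + \frac{\alpha\, g_t}{\sqrt{r_t} + \varepsilon}, \qquad \theta_t = \theta_{t-1} - v_t$$

Here $\alpha$ is the learning rate, $\beta$ the momentum, and $\gamma$ the decay of the squared-gradient average; small changes to any of them can noticeably change convergence.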

Page 78

● Comparison of several RMSProp SGD parametrizations

● Not obvious which configurations will work best on a given dataset without experimentation

STOCHASTIC GRADIENT DESCENT

Page 79

METRIC OPTIMIZATION

Page 80

● Average hold-out accuracy after 5 optimization runs, each consisting of 80 objective evaluations

● Optimized a single 80/20 CV fold on the training set; ACC reported on the held-out test set

PERFORMANCE

               SigOpt             Rnd Search         No Tuning      No Tuning
               (TensorFlow CNN)   (TensorFlow CNN)   (sklearn RF)   (TensorFlow CNN)
Hold Out ACC   0.8130 (+315.2%)   0.5690             0.5278         0.1958

Page 81

COST ANALYSIS

Model Performance     Random Search   SigOpt   SigOpt Cost   Potential Savings
(CV Acc. threshold)   Cost            Cost     Savings       In Production (50 GPUs)

87%                   $275            $42      84%           $12,530
85%                   $195            $23      88%           $8,750
80%                   $46             $21      55%           $1,340
70%                   $29             $21      27%           $400

Page 82

https://sigopt.com/getstarted

Try it yourself!