: autogenerating supervised learning programs

51
: Autogenerating Supervised Learning Programs José P. Cambronero, Martin Rinard MIT

Upload: others

Post on 20-Apr-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: : Autogenerating Supervised Learning Programs

: Autogenerating Supervised Learning

Programs

José P. Cambronero, Martin Rinard MIT

Page 2: : Autogenerating Supervised Learning Programs
Page 3: : Autogenerating Supervised Learning Programs

(Idealized) Supervised Learning

Labeled Data

SupervisedModel

Training

New Unlabeled Data

SupervisedModel

Inference

Page 4: : Autogenerating Supervised Learning Programs

(Reality) Pipelines

SupervisedModel

Missingdata

Non-numericfeatures

Featurescaling

Featureselection

Labeled Data

Labeled Data

Pipeline Code

Page 5: : Autogenerating Supervised Learning Programs

(Reality) Choices

Page 6: : Autogenerating Supervised Learning Programs

(Reality) Choices

Page 7: : Autogenerating Supervised Learning Programs

(Reality) Choices

Page 8: : Autogenerating Supervised Learning Programs

(Reality) Choices

• Just in Scikit-Learn and XGBoost: • 51 classes/functions for transformations • 96 classes/functions for learning

• Most pipelines have multiple steps

• Matching components requires expertise (and empirical evaluation)

Page 9: : Autogenerating Supervised Learning Programs

import xgboost import sklearn import sklearn.feature_extraction.text import sklearn.linear_model.logistic import sklearn.preprocessing.imputation import runtime_helpers

from sklearn.pipeline import Pipeline

# read inputs and split X, y = runtime_helpers.read_input('titanic.csv', 'Survived') X_train, y_train, X_val, y_val = train_test_split(X, y, test_size=0.25)

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop(sklearn.feature_extraction.text.CountVectorizer)), ('t1', sklearn.preprocessing.imputation.Imputer()), ('model', sklearn.linear_model.logistic.LogisticRegression()), ])

# fit pipeline pipeline.fit(X_train, y_train)

# evaluate on held-out data print(pipeline.score(X_val, y_val))

# train pipeline on train + val for new predictions pipeline.fit(X, y)

def predict(X_new): return pipeline.predict(X_new)

Page 10: : Autogenerating Supervised Learning Programs

import xgboost import sklearn import sklearn.feature_extraction.text import sklearn.linear_model.logistic import sklearn.preprocessing.imputation import runtime_helpers

from sklearn.pipeline import Pipeline

# read inputs and split X, y = runtime_helpers.read_input('titanic.csv', 'Survived') X_train, y_train, X_val, y_val = train_test_split(X, y, test_size=0.25)

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop(sklearn.feature_extraction.text.CountVectorizer)), ('t1', sklearn.preprocessing.imputation.Imputer()), ('model', sklearn.linear_model.logistic.LogisticRegression()), ])

# fit pipeline pipeline.fit(X_train, y_train)

# evaluate on held-out data print(pipeline.score(X_val, y_val))

# train pipeline on train + val for new predictions pipeline.fit(X, y)

def predict(X_new): return pipeline.predict(X_new)

Page 11: : Autogenerating Supervised Learning Programs

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop( sklearn.feature_extraction.text.CountVectorizer )), ('t1', sklearn.preprocessing.imputation.Imputer()) , ('model', sklearn.linear_model.logistic.LogisticRegression()) , ])

Page 12: : Autogenerating Supervised Learning Programs

ColumnLoop

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop( sklearn.feature_extraction.text.CountVectorizer )), ('t1', sklearn.preprocessing.imputation.Imputer()) , ('model', sklearn.linear_model.logistic.LogisticRegression()) , ])

Page 13: : Autogenerating Supervised Learning Programs

ColumnLoop

CountVectorizer

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop( sklearn.feature_extraction.text.CountVectorizer )), ('t1', sklearn.preprocessing.imputation.Imputer()) , ('model', sklearn.linear_model.logistic.LogisticRegression()) , ])

Page 14: : Autogenerating Supervised Learning Programs

ColumnLoop

CountVectorizer

Imputer

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop( sklearn.feature_extraction.text.CountVectorizer )), ('t1', sklearn.preprocessing.imputation.Imputer()) , ('model', sklearn.linear_model.logistic.LogisticRegression()) , ])

Page 15: : Autogenerating Supervised Learning Programs

ColumnLoop

CountVectorizer

Imputer Logistic Regression

# build pipeline with transforms and model pipeline = Pipeline([ ('t0', runtime_helpers.ColumnLoop( sklearn.feature_extraction.text.CountVectorizer )), ('t1', sklearn.preprocessing.imputation.Imputer()) , ('model', sklearn.linear_model.logistic.LogisticRegression()) , ])

Page 16: : Autogenerating Supervised Learning Programs

Requires

Page 17: : Autogenerating Supervised Learning Programs

RequiresInspecting data

Readingdocs

Writing code

Evaluating

Page 18: : Autogenerating Supervised Learning Programs

Why do I have to do this?

I’m not the first person to do machine learning.

Why can’t I just reuse what everyone else did?

Page 19: : Autogenerating Supervised Learning Programs

Conventional Answer

• Your data isn’t like everyone else’s data

• Their pipeline won’t work for your data

• You are special and need to have your own pipeline!

Page 20: : Autogenerating Supervised Learning Programs

Conventional Answer is Both True and False

• You do need your own pipeline

• Pipeline does depend on your data

• So you can’t just reuse someone else’s pipeline

• But your pipeline will be a lot like pipelines that other people have developed for similar data

Page 21: : Autogenerating Supervised Learning Programs

Conventional Answer is Both True and False

• We collected 500 pipelines, which use 9 datasets, from a public forum (Kaggle)

• On average 92% of the pipelines were different from pipelines targeting other datasets

• But within a dataset only 40% of pipelines are unique on average

Page 22: : Autogenerating Supervised Learning Programs
Page 23: : Autogenerating Supervised Learning Programs

Learning from Programs

Page 24: : Autogenerating Supervised Learning Programs

Approach

Instrument

Execute

Data Collection

ExtractCanonicalPipelines

Train PipelineModel New 

Dataset

Trained Model

Offline

Pipelines

AL Training

Generate

AL InferenceOnline

*.py

Traces

Page 25: : Autogenerating Supervised Learning Programs

Data Collection• Approximately 500 programs

• 9 datasets

• Collected from Kaggle

• Dynamic instrumentation inserted to collect • Functions called in key libraries • Argument summaries • Call dependences

Page 26: : Autogenerating Supervised Learning Programs

How do you generate a pipeline?

Page 27: : Autogenerating Supervised Learning Programs

DataCountVectorizer

Data'

???

DataCountVectorizer

Data'

InputFeatures

DataCountVectorizer

Data'

Page 28: : Autogenerating Supervised Learning Programs

DataCountVectorizer

Data'

???

DataCountVectorizer

Data'

InputFeatures

DataCountVectorizer

Data'

Page 29: : Autogenerating Supervised Learning Programs

DataCountVectorizer

Data'

???

DataCountVectorizer

Data'

InputFeatures

DataCountVectorizer

Data'

Page 30: : Autogenerating Supervised Learning Programs

Summarizing Inputs: Data Features

• Label probability (max, min mean) • Mean kurtosis • Count of columns and rows • Count of columns by type • Number of unique labels • Mean skewness • Label counts (max, min, mean) • Percentage missing values • Cross-column correlation (max, min, mean) • Non-zero counts (mean) • PDF/CDF with common distributions • …

Page 31: : Autogenerating Supervised Learning Programs

Dynamic Traces• Slicing-based algorithm to extract canonical pipeline from dynamic trace

TF-IDF Transform

SVD Transform

ScalingTransform

Random Forest

Page 32: : Autogenerating Supervised Learning Programs

Generating Multiple Pipelines: Beam Search with Probabilistic Model

Page 33: : Autogenerating Supervised Learning Programs

...

Previous Iteration

k currentpipelines

...

Previous Iteration

... ......

Extend eachwith top k

components basedon predictedprobability 

...

Previous Iteration

Sort based on pipeline probability and prune to k

...

... ......

Keep top kpipelines based on

predicted probability

Page 34: : Autogenerating Supervised Learning Programs

...

Previous Iteration

K currentpipelines

...

Previous Iteration

... ......

Extend eachwith top k

components basedon predictedprobability 

...

Previous Iteration

Sort based on pipeline probability and prune to k

...

... ......

Keep top kpipelines based on

predicted probability

Page 35: : Autogenerating Supervised Learning Programs

...

Previous Iteration

K currentpipelines

...

Previous Iteration

... ......

Extend eachwith top k

components basedon predictedprobability 

...

Previous Iteration

Sort based on pipeline probability and prune to k

...

... ......

Keep top kpipelines based on

predicted probability

Page 36: : Autogenerating Supervised Learning Programs

...

Previous Iteration

Sort based on pipeline probability  and prune to k

...

... ...

Next Iteration...

...

Iterate until provideddepth-bound

Page 37: : Autogenerating Supervised Learning Programs

...

Previous Iteration

Sort based on pipeline probability  and prune to k

...

... ...

Iterate to depth-bound

Sort onFinal Iteration

...

...

Use validation data tosort final set of

pipelines

Page 38: : Autogenerating Supervised Learning Programs

Evaluation

Page 39: : Autogenerating Supervised Learning Programs

AutoML Systems

• Genetic programming over tree-based pipelines

• Bayesian-optimization for pipelines, ensemble-based

• (Ours) learning from programs

Page 40: : Autogenerating Supervised Learning Programs

Benchmark Datasets• TPOT and Autosklearn paper datasets (21)

• Pre-processed so all systems run • Sourced from PLMB and OpenML

• Kaggle datasets (6) • Varied datatypes • No initial pre-processing

• Mulan datasets (4) • Multi-target regression/classification

Page 41: : Autogenerating Supervised Learning Programs

Search Time Configuration

• Autosklearn/TPOT given 1 hour budget, increased to 2 hours if search does not complete

• AL search depth bounded, time limit per operation in pipeline

Page 42: : Autogenerating Supervised Learning Programs

Evaluation Dimensions

• Pipeline generation time

• Ability to handle new datasets

• Accuracy of generated pipelines

Page 43: : Autogenerating Supervised Learning Programs

Results: Pipeline generation time

Page 44: : Autogenerating Supervised Learning Programs

Results: Ability to handle new datasets

Dataset Source # Datasets Outcomes

Kaggle 6 Only AL executed out-of-box

Mulan 4 Only AL executed out-of-box

Page 45: : Autogenerating Supervised Learning Programs

Results: Accuracy of generated pipelines

Dataset Source # Datasets Outcomes

TPOT Paper 9 / 9 7 / 9 Comparable

Autosklearn Paper 12 / 13 10 / 12 Comparable

Page 46: : Autogenerating Supervised Learning Programs

Improvement over Random Strategies Average performance of top 10 pipelines

Page 47: : Autogenerating Supervised Learning Programs

Average performance of top 10 pipelines

Improvement over Random Strategies

+8.9 F1on average

Page 48: : Autogenerating Supervised Learning Programs

Average performance of top 10 pipelines

Improvement over Random Strategies

+6.5 F1on average

Page 49: : Autogenerating Supervised Learning Programs

Instrumentation Impact• 80% of example programs have runtime ratio between 1.0 and 2.0 when instrumented

Page 50: : Autogenerating Supervised Learning Programs

Future Work• AL calls components with default hyperparameters

• Modify instrumentation and probability model to account for hyperparameter “equivalence classes”

• Or use AL pipelines as warm-start for hyper-parameter search in other tools

• Introduce new libraries • Adapt code generation/instrumentation as necessary

• Explore similar approach in other data analytics tasks (e.g. data visualization)

Page 51: : Autogenerating Supervised Learning Programs

Conclusion• Learning supervised learning pipelines from programs can:

• Speed up and simplify search

• Produce comparable performance to existing AutoML tools

• Extend to more datasets without additional effort