Data science for lazy people, Automated Machine Learning Big Data Days Moscow 2019 Diego Hueltes @diegohueltes



Data science for lazy people,
Automated Machine Learning

Big Data Days Moscow 2019

Diego Hueltes
@diegohueltes


https://www.sli.do


Raw data -> Data Cleaning -> Feature Selection -> Feature Preprocessing -> Feature Construction -> Model Selection -> Parameter Optimization -> Model Validation
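
As a rough illustration of the stages listed above, the sketch below chains them into a single scikit-learn Pipeline. The dataset (iris), the specific transformers and every parameter value are assumptions chosen for the example; none of them come from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed toy dataset standing in for the "raw data" box.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# One step per stage of the slide; each choice here is illustrative only.
pipeline = Pipeline([
    ("cleaning", SimpleImputer(strategy="median")),              # data cleaning
    ("selection", SelectKBest(f_classif, k=3)),                  # feature selection
    ("preprocessing", StandardScaler()),                         # feature preprocessing
    ("construction", PolynomialFeatures(degree=2, include_bias=False)),  # feature construction
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)), # model selection
])

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)  # simple hold-out validation
print(f"hold-out accuracy: {accuracy:.2f}")
```

Every step above is a manual decision; the automated tools covered later in the deck search over choices of this kind.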


Data cleaning


Feature selection


Feature preprocessing


Feature construction


Model selection


Parameter optimization

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                       oob_score=False, random_state=0, verbose=0, warm_start=False)
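
With this many parameters, tuning by hand gets repetitive quickly. A minimal sketch of a randomized search over a few of them with scikit-learn follows. The dataset and the candidate values are assumptions made for the example, not settings from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)  # assumed toy dataset

# Hypothetical candidate values for a subset of the parameters shown above.
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

# Try 5 random combinations, scoring each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2f}")
```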


Model validation
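
One common way to validate a model is k-fold cross-validation. The sketch below assumes scikit-learn, a toy dataset and 5 folds; all three are choices made for the example and are not stated in the slides.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # assumed toy dataset

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train and score the model on 5 different train/validation splits.
scores = cross_val_score(model, X, y, cv=5)

print("accuracy per fold:", scores.round(2))
print(f"mean accuracy: {scores.mean():.2f}")
```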


lazy: Unwilling to work or use energy

Oxford dictionary

lazy: Unwilling to work or use energy in repetitive tasks

Oxford dictionary, Diego dictionary


TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.

https://github.com/EpistasisLab/tpot


auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction.

auto-sklearn

https://github.com/automl/auto-sklearn


Genetic programming


Source: http://www.genetic-programming.org/gpbook4toc.html


Source: http://w3.onera.fr/smac/?q=tracker

Mutation Crossover
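
To make the two operators concrete, here is a deliberately simplified sketch in which an "individual" is a dictionary with one choice per pipeline stage. The search space, the stage names and the representation are all hypothetical. Real genetic programming (and TPOT) evolves tree-structured programs, so this shows the idea only and is not TPOT's implementation.

```python
import random

# Hypothetical search space: one option per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "kbest"],
    "model": ["logistic", "random_forest", "knn"],
}
STAGES = list(SEARCH_SPACE)


def mutate(individual, rng):
    """Return a copy with one randomly chosen stage swapped for a different option."""
    child = dict(individual)
    stage = rng.choice(STAGES)
    alternatives = [opt for opt in SEARCH_SPACE[stage] if opt != individual[stage]]
    child[stage] = rng.choice(alternatives)
    return child


def crossover(parent_a, parent_b, rng):
    """One-point crossover: stages before the cut come from parent_a, the rest from parent_b."""
    cut = rng.randint(1, len(STAGES) - 1)
    return {
        stage: (parent_a if i < cut else parent_b)[stage]
        for i, stage in enumerate(STAGES)
    }


rng = random.Random(0)
parent_a = {"scaler": "standard", "selector": "kbest", "model": "logistic"}
parent_b = {"scaler": "minmax", "selector": "none", "model": "random_forest"}

mutant = mutate(parent_a, rng)
child = crossover(parent_a, parent_b, rng)
print("mutant:", mutant)
print("child: ", child)
```

In a full evolutionary loop, each generation would score the individuals (for example by cross-validated accuracy), keep the best ones and produce the next population with these two operators.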


Bayesian optimization


source: https://advancedoptimizationatharvard.wordpress.com/2014/04/28/bayesian-optimization-part-ii/
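
A minimal sketch of the idea: fit a surrogate model to the evaluations made so far, then pick the next point to try by maximizing an acquisition function. The objective function, the Gaussian-process surrogate, the expected-improvement acquisition and all constants are assumptions for illustration. This is not how auto-sklearn implements its search.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Hypothetical expensive function to maximize (peak at x = 2)."""
    return -(x - 2.0) ** 2


def expected_improvement(candidates, gp, best_y, xi=0.01):
    """Expected improvement over best_y at each candidate point."""
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    improvement = mu - best_y - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=3)       # a few random initial evaluations
y = objective(X)
initial_best = y.max()
candidates = np.linspace(0.0, 5.0, 201)  # points the search may propose

for _ in range(10):
    # Surrogate: Gaussian process fitted to everything evaluated so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)
    gp.fit(X.reshape(-1, 1), y)
    # Acquisition: evaluate next where expected improvement is highest.
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

best_x = X[np.argmax(y)]
print(f"best x found: {best_x:.2f}, objective: {y.max():.3f}")
```

In hyperparameter tuning, the objective would be a model's validation score and `x` a hyperparameter setting, so each evaluation costs a full training run. That cost is the reason for choosing the next point carefully.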


TPOT

from tpot import TPOTClassifier, TPOTRegressor

# Classification
tpot = TPOTClassifier()
tpot.fit(X_train, y_train)
predictions = tpot.predict(X_test)

# Regression
tpot = TPOTRegressor()
tpot.fit(X_train, y_train)
predictions = tpot.predict(X_test)


TPOT - configuration

import numpy as np
from tpot import TPOTClassifier

# Restrict the search to the listed operators and hyperparameter values.
tpot = TPOTClassifier(config_dict={
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'criterion': ['gini', 'entropy'],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },
    'sklearn.feature_selection.RFE': {
        'step': np.arange(0.05, 1.01, 0.05),
        'estimator': {
            'sklearn.ensemble.ExtraTreesClassifier': {
                'n_estimators': [100],
                'criterion': ['gini', 'entropy'],
                'max_features': np.arange(0.05, 1.01, 0.05)
            }
        }
    }
})


auto-sklearn

Source: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf


auto-sklearn

import autosklearn.classification
import autosklearn.regression

# Classification
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

# Regression
automl = autosklearn.regression.AutoSklearnRegressor()
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)


auto-sklearn custom config

include_estimators

exclude_estimators

include_preprocessors

exclude_preprocessors


Olive oil full dataset...

Test in Google Colab, clean dataset.

TPOT: 55% Accuracy

auto-sklearn: 56% Accuracy

H2O AutoML: 51% Accuracy


Benchmarking Automatic Machine Learning Frameworks

Adithya Balaji, Alexander Allen

https://arxiv.org/abs/1808.06492


Automated Machine Learning — A Paradigm Shift That Accelerates Data Scientist Productivity @ Airbnb

https://medium.com/airbnb-engineering/automated-machine-learning-a-paradigm-shift-that-accelerates-data-scientist-productivity-airbnb-f1f8a10d61f8


What about neural networks?


http://www.asimovinstitute.org/neural-network-zoo/


http://torch.cogbits.com/doc/tutorials_supervised/


https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12


TensorFlow (With Keras)

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


https://xkcd.com/1425/


Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.

https://autokeras.com/


AutoKeras

import autokeras as ak

clf = ak.ImageClassifier(max_trials=100)
clf.fit(x_train, y_train)

# predict() takes only the inputs; use clf.evaluate(x_test, y_test) to score against labels.
y = clf.predict(x_test)


AutoKeras

ImageClassifier()

ImageRegressor()

TextClassifier()

TextRegressor()


Demo time!


Automated Machine Learning

- Beginners
- Exploratory analysis
- Selective discovering
- New ideas for your model
- Model optimization


Progress isn’t made by early risers. It’s made by lazy people trying to find easier ways to do something.

– Robert A. Heinlein


Thank you!

@diegohueltes

DiegoHueltes

Diego Hueltes


https://www.hueltes.com/automl