improving model predictions via stacking and hyper ... · stacking and hyper-parameters tuning...
TRANSCRIPT
![Page 1: Improving Model Predictions via Stacking and Hyper ... · Stacking and Hyper-Parameters Tuning Jo-fai (Joe) Chow Data Scientist joe@h2o.ai @matlabulus . ... •Use data science for](https://reader030.vdocuments.net/reader030/viewer/2022040409/5ec5fcae4f8ce2596d27b2cf/html5/thumbnails/1.jpg)
Improving Model Predictions via Stacking and Hyper-Parameters Tuning
Jo-fai (Joe) Chow
Data Scientist
@matlabulous
About Me
• 2005 - 2015
• Water Engineer
o Consultant for Utilities
o EngD Research
• 2015 - Present
• Data Scientist
o Virgin Media
o Domino Data Lab
o H2O.ai
Mango Data Science Radar
About This Talk
• Predictive modelling
o Kaggle as an example
• Improve predictions with simple tricks
• Use data science for social good 👍
About Kaggle
• World’s biggest predictive modelling competition platform
• 560k members
• Competition types:
  o Featured (prize)
  o Recruitment
  o Playground
  o 101
Predicting Shelter Animal Outcomes
• X: Predictors
  o Name
  o Gender
  o Type (dog or cat)
  o Date & Time
  o Age
  o Breed
  o Colour
• Y: Outcomes (5 types)
  o Adoption
  o Died
  o Euthanasia
  o Return to Owner
  o Transfer
• Data
  o Training (27k samples)
  o Test (11k samples)
Basic Feature Engineering
| X | Raw (Before) | Reformatted (After) |
|---|---|---|
| Name | Elsa, Steve, Lassie | [name_len]: 4, 5, 6 |
| Date & Time | 2014-02-12 18:22:00 | [year]: 2014, [month]: 2, [weekday]: 4, [hour]: 18 |
| Age | 1 year, 3 weeks, 2 days | [age_day]: 365, 21, 2 |
| Breed | German Shepherd, Pit Bull Mix | [is_mix]: 0, 1 |
| Colour | Brown Brindle/White | [simple_colour]: brown |
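
As a rough illustration, the transformations in the table could be written as below. This is a Python sketch, not the speaker's R code: the helper names, the 30-day month, and the rule of keeping the first word of the first colour are assumptions made for the example. Weekdays are numbered with Sunday = 1, which reproduces the 4 shown for 2014-02-12 (a Wednesday).

```python
from datetime import datetime

# Assumed conversion factors; the slide only shows year, week and day.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def name_len(name):
    """Number of characters in the animal's name (0 for a missing name)."""
    return len(name or "")


def date_parts(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into numeric features."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # isoweekday(): Monday=1 ... Sunday=7; shift so that Sunday=1 ... Saturday=7.
    weekday = dt.isoweekday() % 7 + 1
    return {"year": dt.year, "month": dt.month, "weekday": weekday, "hour": dt.hour}


def age_in_days(age):
    """Convert strings such as '3 weeks' into an approximate number of days."""
    count, unit = age.split()
    return int(count) * DAYS_PER_UNIT[unit.rstrip("s")]


def is_mix(breed):
    """1 if the breed description mentions 'Mix', else 0."""
    return int("mix" in breed.lower())


def simple_colour(colour):
    """Keep only the first word of the first listed colour, in lower case."""
    return colour.split("/")[0].split()[0].lower()


print([name_len(n) for n in ["Elsa", "Steve", "Lassie"]])
print(date_parts("2014-02-12 18:22:00"))
print([age_in_days(a) for a in ["1 year", "3 weeks", "2 days"]])
print([is_mix(b) for b in ["German Shepherd", "Pit Bull Mix"]])
print(simple_colour("Brown Brindle/White"))
```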
Common Machine Learning Techniques
• Ensembles
  o Bagging/boosting of decision trees
  o Reduces variance and increases accuracy
  o Popular R packages (used in next example)
    • “randomForest”
    • “xgboost”
• There are many more machine learning packages in R:
  o “caret”, “caretEnsemble”
  o “h2o”, “h2oEnsemble”
  o “mlr”
Simple Trick – Model Averaging
• Stratified sampling
  o 80% for training
  o 20% for validation
• Evaluation metric
  o Multi-class Log Loss
  o Lower is better
  o 0 = Perfect
• 50 runs
  o Different random seed for each run
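
The metric and the averaging step above can be sketched in a few lines of Python. This is an illustrative sketch only: `noisy_run` is a hypothetical stand-in for "train one model with a given seed and predict on the validation set", and the probabilities are made up. Classes are encoded as integers 0 to 4 for the five outcomes.

```python
import math
import random


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def average_predictions(runs):
    """Element-wise mean of the class-probability tables from several runs."""
    n_runs = len(runs)
    n_rows, n_classes = len(runs[0]), len(runs[0][0])
    return [[sum(run[i][k] for run in runs) / n_runs for k in range(n_classes)]
            for i in range(n_rows)]


def noisy_run(base_probs, seed):
    """Hypothetical stand-in for one model run: perturb and renormalise."""
    rng = random.Random(seed)
    out = []
    for row in base_probs:
        noisy = [p * rng.uniform(0.5, 1.5) for p in row]
        total = sum(noisy)
        out.append([p / total for p in noisy])
    return out


# Made-up validation labels and underlying probabilities (4 samples, 5 classes).
y_true = [0, 4, 2, 3]
base_probs = [
    [0.50, 0.05, 0.10, 0.15, 0.20],
    [0.20, 0.05, 0.10, 0.15, 0.50],
    [0.30, 0.05, 0.40, 0.15, 0.10],
    [0.25, 0.05, 0.10, 0.45, 0.15],
]

runs = [noisy_run(base_probs, seed) for seed in range(50)]  # 50 seeds
single_losses = [multiclass_log_loss(y_true, run) for run in runs]
averaged_loss = multiclass_log_loss(y_true, average_predictions(runs))

print("mean single-run log loss:", sum(single_losses) / len(single_losses))
print("log loss of averaged predictions:", averaged_loss)
```

Because the negative logarithm is convex, the log loss of the averaged probabilities can never exceed the mean log loss of the individual runs (Jensen's inequality), which is one reason this simple trick is a safe default.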
More Advanced Methods
• Model Stacking
  o Uses a second-level metalearner to learn the optimal combination of base learners
  o R Packages:
    • “SuperLearner”
    • “subsemble”
    • “h2oEnsemble”
    • “caretEnsemble”
• Hyper-parameters Tuning
  o Improves the performance of individual machine learning algorithms
  o Grid search
    • Full / Random
  o R Packages:
    • “caret”
    • “h2o”
For more info, see
https://github.com/h2oai/h2o-meetups/tree/master/2016_05_20_MLconf_Seattle_Scalable_Ensembles
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.Rmd
Trade-Off of Advanced Methods
• Strengths
  o Model tuning + stacking won nearly all Kaggle competitions.
  o A multi-algorithm ensemble may better approximate the true predictive function than any single algorithm.
• Weaknesses
  o Increased training and prediction times.
  o Increased model complexity.
  o Requires large machines or clusters for big data.
R + H2O = Scalable Machine Learning
• H2O is an open-source, distributed machine learning library written in Java with APIs in R, Python and more.
• “h2oEnsemble” is the scalable implementation of the Super Learner algorithm for H2O.
H2O Random Grid Search Example
[Slide image; callouts: “Define search range and criteria”, “Best models”]
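
The code on this slide is not part of the transcript. As a library-free illustration of the two callouts, the Python sketch below defines a search range and stopping criteria, samples random combinations, and ranks the resulting models. It does not use the H2O API: the function name, the hyper-parameter names and values, and the scoring function (a made-up stand-in for validation log loss) are all assumptions for the example.

```python
import itertools
import random


def random_grid_search(search_space, score_fn, max_models, seed):
    """Score a random subset of the full grid; return (loss, params), best first."""
    names = sorted(search_space)
    full_grid = [dict(zip(names, values))
                 for values in itertools.product(*(search_space[n] for n in names))]
    rng = random.Random(seed)
    # Sample without replacement so no combination is evaluated twice.
    candidates = rng.sample(full_grid, min(max_models, len(full_grid)))
    scored = [(score_fn(params), params) for params in candidates]
    return sorted(scored, key=lambda pair: pair[0])


# Search range (illustrative values only).
search_space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}


def toy_validation_loss(params):
    """Made-up stand-in for 'train a model and measure its validation loss'."""
    return (0.80
            + 0.01 * (params["max_depth"] - 7) ** 2
            + abs(params["learn_rate"] - 0.05)
            + 0.1 * abs(params["sample_rate"] - 0.8))


# Search criteria: stop after 10 models, fixed seed for reproducibility.
results = random_grid_search(search_space, toy_validation_loss,
                             max_models=10, seed=1234)

# Best models first.
for loss, params in results[:3]:
    print(round(loss, 4), params)
```

Setting `max_models` to the size of the grid (36 here) turns the same function into a full grid search, which makes the cost difference between the two strategies easy to see.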
H2O Model Stacking Example
17 out of 717 teams (≈ top 2%)
[Slide image; callouts: “Getting reasonable results”, “Using h2o.stack(…) to combine multiple models”]
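
The slide's code is not in the transcript, so the sketch below does not call `h2o.stack(…)`. It only illustrates the general stacking recipe in plain Python on a toy regression problem (the talk's task is multi-class classification): each base learner produces out-of-fold predictions on the training set, and a least-squares metalearner learns how to weight them. The two base learners, the data, and all function names are assumptions chosen to keep the example short.

```python
import random


def fit_mean(xs, ys):
    """Base learner 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_slope(xs, ys):
    """Base learner 2: least-squares line through the origin."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x


def out_of_fold(fit, xs, ys, k=5):
    """Predict each training row with a model that never saw that row."""
    oof = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):
            oof[i] = model(xs[i])
    return oof


def fit_metalearner(p1, p2, ys):
    """Least-squares weights (w1, w2) for w1*p1 + w2*p2 via normal equations."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * y for a, y in zip(p1, ys))
    b2 = sum(b * y for b, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Toy data: y = 2x + 5 plus noise. Neither base learner can fit this alone.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 5 + rng.gauss(0, 0.5) for x in xs]
x_train, y_train, x_test, y_test = xs[:150], ys[:150], xs[150:], ys[150:]

# Level 1: out-of-fold predictions. Level 2: metalearner fitted on them.
oof_mean = out_of_fold(fit_mean, x_train, y_train)
oof_slope = out_of_fold(fit_slope, x_train, y_train)
w_mean, w_slope = fit_metalearner(oof_mean, oof_slope, y_train)

# Refit the base learners on all training data, then combine on the test set.
mean_model = fit_mean(x_train, y_train)
slope_model = fit_slope(x_train, y_train)
pred_mean = [mean_model(x) for x in x_test]
pred_slope = [slope_model(x) for x in x_test]
pred_stack = [w_mean * a + w_slope * b for a, b in zip(pred_mean, pred_slope)]

print("mean-only MSE: ", mse(pred_mean, y_test))
print("slope-only MSE:", mse(pred_slope, y_test))
print("stacked MSE:   ", mse(pred_stack, y_test))
```

Fitting the metalearner on out-of-fold rather than in-sample predictions is the key design choice: it stops the second level from rewarding base learners that have simply memorised the training rows.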
Conclusions
• Many R packages for predictive modelling.
• Use hyper-parameters tuning to improve individual models.
• Use model averaging / stacking to improve predictions.
• Trade-off between model performance and computational costs.
• Use R + H2O for scalable machine learning.
• H2O random grid search and stacking.
• Use data science for social good 👍
Big Thank You!
• Mango Solutions
• RStudio
• Domino Data Lab
• H2O
o Erin LeDell
o Raymond Peck
o Arno Candel
1st LondonR Talk
Crime Map Shiny App
bit.ly/londonr_crimemap
2nd LondonR Talk
Domino API Endpoint
bit.ly/1cYbZbF
Any Questions?
• Contact
  o joe@h2o.ai
  o @matlabulous
  o github.com/woobe
• Slides & Code
  o github.com/h2oai/h2o-meetups
• H2O in London
  o Meetups / Office (soon)
  o www.h2o.ai/careers
• More H2O at Strata tomorrow
  o Innards of H2O (11:15)
  o Intro to Generalised Low-Rank Models (14:05)
Extra Slide (Stratified Sampling)
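
The content of this slide is not in the transcript. For reference, a stratified 80/20 split of the kind used earlier in the talk can be sketched as follows. This is a generic Python illustration, not the speaker's code; the function name and the made-up class counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split row indices so each class keeps roughly the same train share."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train, valid = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)  # shuffle within the class only
        cut = round(len(idx) * train_frac)
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


# Made-up, imbalanced outcome labels for 100 animals.
labels = (["Adoption"] * 50 + ["Transfer"] * 30 + ["Return to Owner"] * 10
          + ["Euthanasia"] * 5 + ["Died"] * 5)

train_idx, valid_idx = stratified_split(labels, train_frac=0.8, seed=2016)
print(len(train_idx), "training rows,", len(valid_idx), "validation rows")
```

Splitting within each class keeps rare outcomes such as "Died" represented in both partitions, which a purely random split of a small dataset does not guarantee.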