Winning Data Science Competitions
3. 29. 2017
Jeong-Yoon Lee, Ph.D.
Chief Data Scientist, Conversion Logic
70+ Competitions
6 Times Prize Winner (KDD Cup 2012 & 2015)
8 Top 10 Finishes (Deloitte, AARP, Liberty Mutual)
Top 10, Kaggle 2015
Father of 4 boys
About Conversion Logic
Advanced Marketing Attribution For Diverse Customers
Why Data Science Competition
Why Compete
For fun
For experience
For learning
For networking
Fun
Competing with others
Continuous improvement
Experience
Learning
Networking
Data Science Competitions
Since 1997
2006 - 2009
Since 2010
Competition Structure
[Diagram: Training Data and Test Data, with Feature and Label columns; labels: Provided, Submission, Public LB Score, Private LB Score]
Kaggle
250+ competitions since 2010
900K users
50K+ competitors
$3MM+ in prizes paid out
Misconceptions on Competitions
No ETL
No EDA
Not worth it
Not for production
No ETL? - Deloitte Western Australia Rental Prices
No ETL? - Outbrain Click Prediction
2B page views. 16.9MM clicks. 700MM users. 560 sites
No ETL? - YouTube-8M Video Understanding Challenge
1.7TB feature-level data. 31GB video-level data.
No ETL?
No EDA?
Most competitions provide actual labels - typical EDA
Anonymized data - more creative EDA
o People decode age, states, time intervals, income, etc.
Not worth it?
Performance matters
You walk easier when you can run
Not for Production?
Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB
Ensemble Pipeline at Conversion Logic
Best Practices
Feature Engineering
Diverse Algorithms
Cross Validation
Ensemble
Collaboration
Feature Engineering
| Types | Note |
| --- | --- |
| Numerical | Log, Log2(1 + x), Box-Cox, Normalization, Binning |
| Categorical | One-hot-encoding, Label-encoding, Count, Weight-of-Evidence |
| Text | Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram |
| Timeseries/Sensor data | Descriptive Statistics, Derivatives, FFT, MFCC, ERP |
| Network Graph | Degree, Closeness, Betweenness, PageRank |
| Numerical/Timeseries | Convert to categorical features using RF/GBM |
| Dimensionality Reduction | PCA, SVD, Autoencoder, Hashing Trick |
| Interaction | Addition/subtraction/multiplication/division. Hashing Trick |
* More comprehensive overview on feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
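A few of the numerical and categorical transforms listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speaker's code: the function names (`log1p_transform`, `label_encode`, `count_encode`, `one_hot_encode`, `equal_width_bin`) are hypothetical, and in practice a library such as pandas or Scikit-Learn would typically do this work.

```python
import math
from collections import Counter


def log1p_transform(values):
    """Natural log of (1 + x); compresses long right tails of non-negative features."""
    return [math.log1p(v) for v in values]


def equal_width_bin(values, n_bins):
    """Binning: map each value to one of n_bins equal-width buckets (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls in bucket 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def label_encode(values):
    """Label-encoding: integer id per category, assigned in sorted category order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def count_encode(values):
    """Count encoding: replace each category with how often it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


def one_hot_encode(values):
    """One-hot-encoding: one 0/1 column per category, columns in sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories
```

For example, `count_encode(["red", "blue", "red", "green"])` gives `[2, 1, 2, 1]`. Count encoding keeps a single column no matter how many categories exist, which is one reason it is handy for high-cardinality features where one-hot encoding would create very wide data.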
Diverse Algorithms
| Algorithm | Tool | Note |
| --- | --- | --- |
| Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions |
| Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM |
| Extremely Randomized Trees | Scikit-Learn | |
| Neural Networks/Deep Learning | Keras, MXNet, Torch, CNTK | Blends well with GBM. Best at image and speech recognition competitions |
| Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble. |
| Support Vector Machine | Scikit-Learn | |
| FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions |
| Factorization Machine | libFM, fastFM | Winning solution for KDD Cup 2012 |
| Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu) |
Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
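The stratified split described above can be sketched as follows. This is a minimal illustration, assuming binary or otherwise sortable labels; the function name `stratified_kfold` and its parameters are hypothetical, and Scikit-Learn's `StratifiedKFold` would be the usual tool in practice.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_folds=5, seed=42):
    """Return a fold id (0 .. n_folds - 1) for every row.

    Rows of each class are shuffled and dealt round-robin across folds,
    so each fold keeps roughly the same size and the same label rate.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    fold_of = [None] * len(labels)
    offset = 0  # continue dealing where the previous class stopped
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = (pos + offset) % n_folds
        offset += len(idxs)
    return fold_of
```

With 100 rows of which 20 are positive, each of the five folds ends up with 20 rows and 4 positives, so the positive rate of 20% is preserved in every fold. Fixing the seed makes the split reproducible, which matters when several models are compared on the same folds.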
Ensemble - Stacking
* For other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
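The core step of stacking is to build out-of-fold predictions from each base model and use them as input features for a second-stage model. The sketch below is an illustration under stated assumptions, not the pipeline shown in the slides: the helper names (`out_of_fold_predictions`, `stack_features`) and the two toy base models (`mean_model`, `nearest_neighbor_model`) are hypothetical stand-ins for real learners such as GBM or neural networks.

```python
def out_of_fold_predictions(fit_predict, X, y, fold_of, n_folds):
    """Predict every training row with a model trained only on the other folds."""
    oof = [None] * len(X)
    for k in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != k]
        valid = [i for i, f in enumerate(fold_of) if f == k]
        preds = fit_predict(
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in valid],
        )
        for i, p in zip(valid, preds):
            oof[i] = p
    return oof


def mean_model(train_X, train_y, test_X):
    """Toy base model: always predicts the mean of the training labels."""
    mean = sum(train_y) / len(train_y)
    return [mean for _ in test_X]


def nearest_neighbor_model(train_X, train_y, test_X):
    """Toy base model: 1-nearest-neighbor on a single numeric feature."""
    preds = []
    for x in test_X:
        nearest = min(range(len(train_X)), key=lambda j: abs(train_X[j] - x))
        preds.append(train_y[nearest])
    return preds


def stack_features(base_models, X, y, fold_of, n_folds):
    """One column of out-of-fold predictions per base model (stage-2 input)."""
    columns = [
        out_of_fold_predictions(model, X, y, fold_of, n_folds)
        for model in base_models
    ]
    return [list(row) for row in zip(*columns)]
```

A second-stage model is then trained on the rows returned by `stack_features` against the original labels. Because every prediction comes from a model that never saw that row, the second stage learns how much to trust each base model without leaking the training labels.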
KDDCup 2015 Solution
Collaboration
Collaboration – Git Repo + S3/Dropbox
Collaboration – Common Validation
Collaboration – Internal Leaderboard
Why Compete
For fun
For experience
For learning
For networking
Best Practices
Feature Engineering
Diverse Algorithms
Cross Validation
Ensemble
Collaboration
Things That Help
Keep competition journals and repos – both during and after competitions
Build and improve the automated pipeline and library for competitions
• https://github.com/jeongyoonlee/Kaggler
• https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master
• http://kaggler.com/kagglers-toolbox-setup/
Be humble and ready to try and learn something new
Make a commitment and work on competitions on a regular basis, no matter what
Resources
No Free Hunch by Kaggle
Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)
Feature Engineering, mlwave.com by HJ van Veen (Triskelion)
fastml.com by Zygmunt Zając (Foxtrot)
kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu
Tianqi Chen @ UW – Won KDDCup 2012, DSB 2015. Author of XGBoost, MXNet
Gilberto Titericz Junior in San Francisco - #1 at Kaggle
Active Competitions
Kaggle – 6 Featured, 1 Job Competitions
KDD Cup 2017
RecSys Challenge 2017
CIKM AnalytiCup 2017
Thank You