
Winning Data Science Competitions

March 29, 2017

Jeong-Yoon Lee, Ph.D.

Chief Data Scientist, Conversion Logic

70+ Competitions

6-Time Prize Winner (KDD Cup 2012 & 2015)

8 Top 10 Finishes (Deloitte, AARP, Liberty Mutual)

Top 10, Kaggle 2015

Father of 4 boys


About Conversion Logic


Advanced Marketing Attribution For Diverse Customers

Why Data Science Competitions

Why Compete

For fun

For experience

For learning

For networking


Fun

Competing with others

Continuous improvement


Experience


Learning


Networking


Data Science Competitions


Since 1997 (KDD Cup)

2006 - 2009 (Netflix Prize)

Since 2010 (Kaggle)

Competition Structure

Training Data: features and labels provided

Test Data: features provided, labels withheld

Submissions are scored against the withheld test labels on a public leaderboard (shown during the competition) and a private leaderboard (used for the final ranking).
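As a minimal sketch of this flow (synthetic data, AUC as a stand-in metric, and scikit-learn as one choice of tooling; none of these specifics come from the slides):

# Sketch of the competition structure described above: labels are known for
# the training data, hidden for the test data, and submissions are scored on
# a public leaderboard split during the competition and a private split after.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# Organizer's view: features for both sets, labels released only for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Competitor's view: fit on the training data, predict the hidden test labels.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
submission = model.predict_proba(X_test)[:, 1]

# Organizer's view: split the test set into public and private leaderboard halves.
public = np.arange(len(y_test)) < len(y_test) // 2
print("public LB :", roc_auc_score(y_test[public], submission[public]))
print("private LB:", roc_auc_score(y_test[~public], submission[~public]))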

Kaggle

250+ competitions since 2010

900K users

50K+ competitors

$3MM+ in prizes paid out


Misconceptions on Competitions


No ETL

No EDA

Not worth it

Not for production


No ETL? - Deloitte Western Australia Rental Prices


No ETL? - Outbrain Click Prediction


2B page views. 16.9MM clicks. 700MM users. 560 sites.

No ETL? - YouTube-8M Video Understanding Challenge


1.7TB of frame-level data. 31GB of video-level data.

No ETL?


No EDA?

Most competitions provide actual labels, so typical EDA still applies.

Anonymized data calls for more creative EDA: people decode ages, states, time intervals, income, etc.
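A hedged sketch of that kind of decoding, on synthetic data: if an anonymized numeric column is a linear rescaling of small integers (say, age in years), the even spacing of its unique values gives the original scale away.

# EDA sketch for an anonymized column (synthetic and illustrative only):
# a feature that is secretly "age in years" after standardization still has
# evenly spaced unique values, and the spacing reveals the original unit.
import numpy as np

rng = np.random.RandomState(42)
age = rng.randint(18, 80, size=10_000)               # hidden original values
anonymized = (age - age.mean()) / age.std()           # what competitors receive

gaps = np.diff(np.unique(anonymized))
ratios = gaps / gaps.min()
print("evenly spaced:", np.allclose(ratios, np.round(ratios)))

# Rescale by the smallest gap and guess the offset (18 is a hypothetical
# minimum plausible age, not something the data itself tells you).
decoded = np.round((anonymized - anonymized.min()) / gaps.min()) + 18
print("decoded range:", decoded.min(), decoded.max())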


No EDA?

Anonymized data - more creative EDA


Not worth it?

Performance matters

You walk easier when you can run


Not for Production?

Kaggle Kernels:

Max execution time: 10 minutes

Max file output: 500MB

Memory limit: 8GB


Ensemble Pipeline at Conversion Logic


Best Practices


Feature Engineering

Diverse Algorithms

Cross Validation

Ensemble

Collaboration


Feature Engineering


Type: Techniques

Numerical: Log, Log2(1 + x), Box-Cox, normalization, binning
Categorical: One-hot encoding, label encoding, count encoding, weight of evidence
Text: Bag-of-words, TF-IDF, n-grams, character n-grams, k-skip-n-grams
Time series / sensor data: Descriptive statistics, derivatives, FFT, MFCC, ERP
Network graph: Degree, closeness, betweenness, PageRank
Numerical / time series: Convert to categorical features using RF/GBM (leaf indices)
Dimensionality reduction: PCA, SVD, autoencoder, hashing trick
Interaction: Addition/subtraction/multiplication/division, hashing trick

* A more comprehensive overview of feature engineering by HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
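As a brief, hedged illustration of a few rows of this table using pandas and scikit-learn (column names and data are made up):

# Small feature-engineering sketch covering a few rows of the table above
# (one reasonable way to do it, not the only one).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "price":   [120.0, 15.5, 9800.0, 42.0],                        # numerical
    "city":    ["seoul", "la", "la", "nyc"],                        # categorical
    "comment": ["great fit", "too small", "great price", "ok"],     # text
})

# Numerical: log(1 + x) transform and simple quantile binning.
df["price_log1p"] = np.log1p(df["price"])
df["price_bin"] = pd.qcut(df["price"], q=2, labels=False)

# Categorical: one-hot encoding and count (frequency) encoding.
df = df.join(pd.get_dummies(df["city"], prefix="city"))
df["city_count"] = df["city"].map(df["city"].value_counts())

# Text: bag-of-words weighted by TF-IDF, with unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
text_features = tfidf.fit_transform(df["comment"])

print(df.head())
print("TF-IDF matrix shape:", text_features.shape)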

Diverse Algorithms

Algorithm: Tools (notes)

Gradient Boosting Machine: XGBoost, LightGBM (the most popular algorithm in competitions)
Random Forests: scikit-learn, randomForest (used to be popular before GBM)
Extremely Randomized Trees: scikit-learn
Neural Networks / Deep Learning: Keras, MXNet, Torch, CNTK (blends well with GBM; best at image and speech recognition competitions)
Logistic/Linear Regression: scikit-learn, Vowpal Wabbit (fastest; good for ensembles)
Support Vector Machine: scikit-learn
FTRL: Vowpal Wabbit (competitive solution for CTR estimation competitions)
Factorization Machine: libFM, fastFM (winning solution for KDD Cup 2012)
Field-aware Factorization Machine: libFFM (winning solution for CTR estimation competitions: Criteo, Avazu)
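A minimal sketch of training two of the families above on the same data; XGBoost's scikit-learn wrapper and logistic regression are assumed here, but any of the listed tools could stand in.

# Train two diverse model families on the same (synthetic) data and compare.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

gbm = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
lr = LogisticRegression(max_iter=1000)

for name, model in [("GBM", gbm), ("LogReg", lr)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: CV AUC = {auc:.3f}")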

Cross Validation

The training data are split into five equally sized folds, each preserving the overall dropout rate (stratified).
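For example, a stratified five-fold split along these lines might look as follows (synthetic labels; scikit-learn shown as one option):

# Stratified 5-fold split: each fold keeps roughly the same size and the same
# positive (dropout) rate as the full training set. Labels here are synthetic.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 200 + [0] * 800)                  # ~20% "dropout" rate
X = np.random.RandomState(0).normal(size=(len(y), 5))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    print(f"fold {i}: n_valid={len(valid_idx)}, dropout rate={y[valid_idx].mean():.2f}")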


Ensemble - Stacking

* For other types of ensembles, see http://mlwave.com/kaggle-ensembling-guide/
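A compact, hedged sketch of stacking: out-of-fold predictions from base models become the features of a second-level model. The base and meta models below are illustrative choices, not those from any particular winning solution.

# Stacking sketch: build out-of-fold predictions from base models, then fit a
# meta-model on those predictions (synthetic data, illustrative model choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=3000, n_features=25, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=1),
    LogisticRegression(max_iter=1000),
]

# Level-1 features: out-of-fold probabilities, so the meta-model never sees
# predictions made on data the base model was trained on.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression()
print("stacked CV AUC:", cross_val_score(meta, oof, y, cv=cv, scoring="roc_auc").mean())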

KDD Cup 2015 Solution


Collaboration

Collaboration – Git Repo + S3/Dropbox


Collaboration – Common Validation
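One hedged way to set up a common validation scheme for a team (the file name cv_folds.csv and its layout are arbitrary choices, not from the slides): fix the fold assignment once, commit it to the shared repo, and have every member load it instead of re-splitting.

# Shared validation split for team collaboration: generate fold indices once,
# save them to a file in the shared repo, and have everyone load the same file.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

y = pd.Series(np.random.RandomState(0).randint(0, 2, size=1000))

folds = pd.Series(-1, index=y.index, name="fold")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
for fold_id, (_, valid_idx) in enumerate(skf.split(np.zeros(len(y)), y)):
    folds.iloc[valid_idx] = fold_id

folds.to_csv("cv_folds.csv", header=True)                          # commit this once
shared_folds = pd.read_csv("cv_folds.csv", index_col=0)["fold"]    # everyone loads it
assert (shared_folds.values == folds.values).all()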


Collaboration – Internal Leaderboard


Why Compete

For fun

For experience

For learning

For networking

Best Practices

Feature Engineering

Diverse Algorithms

Cross Validation

Ensemble

Collaboration

Things That Help


Keep competition journals and repos – both during and after competitions

Build and improve the automated pipeline and library for competitions

• https://github.com/jeongyoonlee/Kaggler

• https://gitlab.com/jeongyoonlee/allstate-claims-severity/tree/master

• http://kaggler.com/kagglers-toolbox-setup/

Be humble, and ready to try and learn something new

Make a commitment and work on competitions regularly, no matter what

Resources


No Free Hunch by Kaggle

Winning Tips on Machine Learning Competitions by Marios Michailidis (KazAnova)

Feature Engineering, mlwave.com by HJ van Veen (Triskelion)

fastml.com by Zygmunt Zając (Foxtrot)

kaggler.com, facebook.com/Kaggler by Jeong-Yoon Lee @ CL and Hang Li @ Hulu

Tianqi Chen @ UW – Won KDD Cup 2012, DSB 2015. Author of XGBoost, MXNet

Gilberto Titericz Junior in San Francisco - #1 at Kaggle

Active Competitions


Kaggle – 6 Featured, 1 Jobs Competition

KDD Cup 2017

RecSys Challenge 2017

CIKM AnalytiCup 2017

Thank You
