stories behind kaggle competitions
![Page 2: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/2.jpg)
kaggle runs public machine learning competitions
![Page 3: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/3.jpg)
we worked with clients/hosts on various types of problems and data of different sizes
![Page 4: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/4.jpg)
my job as a data scientist at kaggle
![Page 5: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/5.jpg)
“data science is not just kaggle competitions”
whyyyy???
![Page 6: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/6.jpg)
machine learning processes
● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization
![Page 7: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/7.jpg)
not every problem can be turned into a kaggle competition
![Page 8: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/8.jpg)
![Page 9: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/9.jpg)
size matters! where bigger is better (most of the time)
![Page 10: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/10.jpg)
data cleaning/formatting:
● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity
![Page 11: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/11.jpg)
data privacy/anonymization
![Page 12: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/12.jpg)
metric: how do you measure success?
● Classification - AUC/ Logarithmic Loss/Accuracy
● Regression - RMSE/MAE
● Ranking - MAP/NDCG
● Other / Custom
https://www.kaggle.com/wiki/Metrics
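To make the metric families above concrete, here is a minimal sketch of a few of them in plain Python. The function names and the 0.5 accuracy threshold are illustrative assumptions, not Kaggle's own scoring code; binary labels are assumed to be 0/1, and the AUC helper assumes both classes are present.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error (regression)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (regression)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Share of rows where the thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(y_true, p_pred))
    return hits / len(y_true)

def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The same predictions can look very different under different metrics: a model that ranks every positive above every negative has a perfect AUC even if its thresholded accuracy is mediocre.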
![Page 13: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/13.jpg)
the design of a competition shapes how people are going to solve a problem
![Page 14: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/14.jpg)
Splitting dataset
● training/test
● public/private
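The two splits above can be sketched as follows: rows are first divided into training and test sets, then the test set is divided into a public part (scored on the live leaderboard) and a private part (used for the final standings). The function name, the fractions, and the random shuffle are assumptions for illustration; a real competition picks these per dataset.

```python
import random

def split_competition_data(rows, test_fraction=0.3, public_fraction=0.3, seed=0):
    """Return (train, public_test, private_test) from a random shuffle of rows."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)

    # First split: training vs. test.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Second split: public vs. private part of the test set.
    n_public = int(len(test) * public_fraction)
    public_test, private_test = test[:n_public], test[n_public:]
    return train, public_test, private_test
```

Keeping the private part hidden until the end is what discourages tuning a model to the public leaderboard.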
![Page 15: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/15.jpg)
Time series data
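With time series data, a random split would let models train on rows from the future of the test rows. One common alternative, sketched here as an assumption rather than as the method used in any particular competition, is to split on a cutoff time. The `time_key` field and `cutoff` value are hypothetical.

```python
def time_based_split(rows, time_key, cutoff):
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    train = [r for r in ordered if r[time_key] < cutoff]
    test = [r for r in ordered if r[time_key] >= cutoff]
    return train, test
```

For example, `time_based_split(rows, "day", 3)` puts days 1 and 2 in the training set and day 3 onward in the test set.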
![Page 16: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/16.jpg)
data leakage
“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from”
“the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions”
“Leakage in Data Mining: Formulation, Detection, and Avoidance”, S. Kaufman et al.
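As a rough illustration of the definition quoted above, the sketch below scans each column on its own and flags any that separates the binary target almost perfectly, for example a row ID assigned after the data was sorted by outcome. This is a simple heuristic assumed for illustration, not the detection method from the cited paper; the function names and the 0.95 threshold are hypothetical, and both classes must be present.

```python
def single_feature_auc(y_true, values):
    """AUC of one raw feature used directly as a score (ties count half)."""
    pos = [v for t, v in zip(y_true, values) if t == 1]
    neg = [v for t, v in zip(y_true, values) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious_features(rows, target_key, threshold=0.95):
    """Return {feature: strength} for features that alone nearly predict the target."""
    y = [r[target_key] for r in rows]
    flagged = {}
    for key in rows[0]:
        if key == target_key:
            continue
        score = single_feature_auc(y, [r[key] for r in rows])
        strength = max(score, 1 - score)   # direction of the relationship does not matter
        if strength >= threshold:
            flagged[key] = strength
    return flagged
```

A flagged column is not proof of leakage, since a legitimately strong predictor would also show up; it is a prompt to ask whether that information would really be available at prediction time.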
![Page 17: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/17.jpg)
do you have thousands of people reviewing your performance at work 24/7?
I do.
![Page 18: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/18.jpg)
1. people make mistakes. honesty is the best policy.
![Page 19: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/19.jpg)
2. crowdsourcing is powerful. anything that can go wrong will go wrong.
![Page 20: stories behind kaggle competitions](https://reader035.vdocuments.net/reader035/viewer/2022062406/55cad4acbb61ebae438b45c0/html5/thumbnails/20.jpg)