cm utaipei kaggle share
TRANSCRIPT
- 1. UTAIPEI Chih-Ming
- 2. About Me CM, Ph.D. student in TIGP-SNHCC; Research Assistant at AS CITI; Research Intern at KKBOX. Advisors: Prof. Victor Tsai (CLIP Lab) and Dr. Eric Yang (MAC Lab); Research, Machine Learning team. https://about.me/chewme
- 3. Kaggle https://www.facebook.com/groups/kaggletw/
- 5. Why Compete? For Fun: Competing with others like running or racing For Learning: Improving your abilities
- 6. Why Compete? For Fun: Competing with others like running or racing For Learning: Improving your abilities What's Your Motivation?
- 8. Why Compete?
- 9. Related Websites http://dc.dsp.im/index.php
- 10. Related Websites https://tianchi.aliyun.com/
- 12. Common Prediction Tasks Binary Classification Multi-label Classification Regression Recommendations
- 13. Other Prediction Tasks: route/path prediction, object detection; or propose a solution, design a webpage, do exploratory data analysis (EDA), etc.
- 14. Evaluation Metric https://www.kaggle.com/wiki/Metrics Many existing toolkits provide solvers that can optimize the loss for a given metric.
- 15. Why Optimize with the Given Metric? Ground truth: 5 4 3. Prediction 3 2 1: perfect ranking, bad loss. Prediction 3 4 3: bad ranking, better loss.
- 16. Why Optimize with the Given Metric? Ground truth: 5 4 3. Prediction 3+2 2+2 1+2: perfect ranking, perfect loss. Prediction 3 4 3: bad ranking, better loss.
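The loss-versus-ranking contrast on slides 15 and 16 can be checked numerically. A minimal sketch using only the slides' toy numbers (the `mse` and `same_ranking` helpers are illustrative, not from the talk):

```python
# Toy numbers from the slides: ground-truth ratings and two predictions.
truth = [5, 4, 3]
pred_a = [3, 2, 1]   # every value is off by 2, but the order is right
pred_b = [3, 4, 3]   # values closer to the truth, but the order is wrong

def mse(y_true, y_pred):
    """Mean squared error: the 'loss' in the slides."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def same_ranking(y_true, y_pred):
    """True if sorting items by prediction reproduces the true order."""
    def order(ys):
        return sorted(range(len(ys)), key=lambda i: -ys[i])
    return order(y_true) == order(y_pred)

print(mse(truth, pred_a), same_ranking(truth, pred_a))    # 4.0 True
print(mse(truth, pred_b), same_ranking(truth, pred_b))    # 1.33... False
# Shifting pred_a by +2 (slide 16) gives perfect loss AND perfect ranking:
shifted = [p + 2 for p in pred_a]
print(mse(truth, shifted), same_ranking(truth, shifted))  # 0.0 True
```

So if the competition scores ranking (e.g. AUC or NDCG), `pred_a` is the better submission even though its squared-error loss is worse; optimize the metric the leaderboard actually uses.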
- 17. Check the Provided Data The Distribution of Train/Test Data - random splitting - split by time - split by Ids Available Features - categorical, numerical - text - image, audio - time - sparse, dense
- 18. Cross Validation (1): rotate Train/Validation/Test across rounds. Round 1: TRAIN | VAL | TEST. Round 2: VAL | TEST | TRAIN. Round 3: TEST | TRAIN | VAL.
- 19. Common Given Data TRAIN TEST
- 20. Cross Validation (2): hold out a fixed TEST set and rotate VAL within the rest. Round 1: TRAIN | TRAIN | VAL. Round 2: TRAIN | VAL | TRAIN. Round 3: VAL | TRAIN | TRAIN.
- 21. Cross Validation (3): one round only, TRAIN | TRAIN | VAL plus the held-out TEST. Find the best single VAL split.
- 22. Hold a Proper Validation: match the split to how train/test were produced. Random splitting; or split by time (e.g. Train up to 5/13, Validation 5/13 to 5/20, Test 5/20 to 5/27: two 7-day windows); or split by id.
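The split strategies on slides 18 to 22 can be sketched with the standard library alone. The rows below are hypothetical; in practice you would reach for e.g. scikit-learn's KFold or TimeSeriesSplit:

```python
import random
from datetime import date, timedelta

def k_fold(items, k=3):
    """Rotate the validation fold, as in the Cross Validation slides."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

# Hypothetical rows: (row_id, event_date) spanning four weeks of May 2017.
rows = [(i, date(2017, 5, 1) + timedelta(days=i % 28)) for i in range(100)]

# Random split: appropriate when the organizers sampled test rows at random.
random.seed(0)
shuffled = random.sample(rows, len(rows))
train_rnd, val_rnd = shuffled[:80], shuffled[80:]

# Split by time: appropriate when the test set is the most recent period
# (the slide's 5/13 -> 5/20 -> 5/27 windows).
cutoff = date(2017, 5, 21)
train_time = [r for r in rows if r[1] < cutoff]
val_time = [r for r in rows if r[1] >= cutoff]

# Each k-fold round partitions the data: no row is in both train and val.
for tr, va in k_fold(list(range(9)), k=3):
    assert sorted(tr + va) == list(range(9))
```

The point of slide 22 is that the validation split should imitate the train/test split: validating randomly on time-split data makes the validation score look better than the leaderboard will.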
- 23. Data Cleaning / Preprocessing. Missing Values: drop the missing data; replace them with a statistical value (mean / median / mode, or via clustering / modeling methods); or label them as missing. Outlier Detection: https://en.wikipedia.org/wiki/Outlier Redundant Features: we usually remove them.
- 24. Data Cleaning / Preprocessing Python - Pandas
- 25. Data Cleaning / Preprocessing Python - Pandas - drop - replace - label
- 26. Data Cleaning / Preprocessing. Example column: Age. User A: 19, User B: 27, User C: 200.
- 27. Option 1: drop the outlier row (User C: 200).
- 28. Option 2: score each row's deviation (Std. 1.2 / 1.1 / 6.3) and either drop the outlier or keep it with an indicator label (Out. 1 / 1 / 0).
- 29. Option 3: replace the outlier with a statistical value (mean / median / mode, or via clustering / modeling methods), e.g. User C: 200 becomes 36.
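The drop / replace / label options from the Pandas slides can be sketched in plain Python (pandas' df.dropna() and df.fillna() do the same on DataFrames). User D and the MAD-based outlier rule are my additions for illustration; the slide's exact Std. computation is not shown:

```python
from statistics import median

# The slide's toy Age column, plus one missing value (User D is hypothetical).
ages = {"User A": 19, "User B": 27, "User C": 200, "User D": None}

# Drop rows with missing values (pandas: df.dropna()).
complete = {u: a for u, a in ages.items() if a is not None}

# Or replace missing values with a statistic (pandas: df.fillna(...)).
med = median(complete.values())
filled = {u: (med if a is None else a) for u, a in ages.items()}

# Flag outliers with a robust score (median absolute deviation); the slide's
# Std./Out. columns play the same role, whatever their exact computation was.
center = median(complete.values())
mad = median(abs(a - center) for a in complete.values())
robust_z = {u: abs(a - center) / mad for u, a in complete.items()}
outlier = {u: int(robust_z[u] > 3) for u in complete}  # flags User C: 200

# Then: drop the flagged row, keep `outlier` as an indicator feature,
# or replace the value with the mean/median of the inliers.
```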
- 30. Categorical Features. One-hot Encoding: Mayday 1 0 0 0, Sodagreen 0 1 0 0, SEKAI_NO_OWARI 0 0 1 0, The_Beatles 0 0 0 1. Clustering into groups by language id: Mayday 1 0 0, Sodagreen 1 0 0, SEKAI_NO_OWARI 0 1 0, The_Beatles 0 0 1.
- 31. Categorical Features. For tags T1 / T2 / T3 with counts 23 / 1 / 6: count-hot encoding keeps the counts (23, 1, 6); col-hot encoding keeps a binary indicator (1, 0, 1); likelihood encoding keeps probabilities (23/30, 1/30, 6/30).
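These encodings fit in a few lines of plain Python. The language codes for the grouping are my guess at the slide's "Language Id" table, and I read the slide's binary column (1, 0, 1) as thresholding the count at more than one occurrence; both are assumptions:

```python
from collections import Counter

# One-hot encoding (slide 30): one binary column per category.
artists = ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"]

def one_hot(value, vocabulary):
    return [int(value == v) for v in vocabulary]

# Grouping first (slide 30, right table): encode a coarser language id
# instead of the raw artist, shrinking the vocabulary.
language = {"Mayday": "zh", "Sodagreen": "zh",
            "SEKAI_NO_OWARI": "ja", "The_Beatles": "en"}

# Count / binary / likelihood encodings (slide 31, counts 23 / 1 / 6).
counts = Counter({"T1": 23, "T2": 1, "T3": 6})
total = sum(counts.values())                             # 30
binary = {t: int(c > 1) for t, c in counts.items()}      # 1, 0, 1
likelihood = {t: c / total for t, c in counts.items()}   # 23/30, 1/30, 6/30
```

Likelihood (target/frequency) encoding is powerful but leaks information if computed on the full training set; it is usually computed out-of-fold.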
- 32. Categorical Features (2) Latent Representations - Principal Component Analysis (PCA) - Linear Discriminant Analysis (LDA) - Laplacian Eigenmaps (LE) - Locally Linear Embedding (LLE) - Low-Rank Approximation / Latent Factorization - Latent Topic Model. These reduce the computation cost, alleviate overfitting, find meaningful components, and remove noise. https://en.wikipedia.org/wiki/Dimensionality_reduction
- 33. Numerical Features. Standardization / Normalization / Rescaling: required by many ML algorithms (https://en.wikipedia.org/wiki/Feature_scaling). Transform the distribution: logarithmic transformation, tf-idf-like transformation (https://en.wikipedia.org/wiki/Data_transformation_(statistics)). Binning / Sampling.
- 34. Categorical vs. Numerical: ordinal categories. HATE / DON'T MIND / LIKE / LOVE map to 0 / 1 / 2 / 3, or stretch the spacing with exp(value).
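The numerical transforms on slides 33 and 34 in one stdlib sketch (the Age values are reused from the cleaning example; everything else is illustrative):

```python
import math

values = [19, 27, 200]   # the Age column from the cleaning slides

# Rescaling to [0, 1] (min-max).
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# Standardization to zero mean / unit variance, required by many ML algorithms.
mu = sum(values) / len(values)
sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
standardized = [(v - mu) / sd for v in values]

# Logarithmic transformation to compress a long-tailed distribution.
logged = [math.log1p(v) for v in values]

# Ordinal categories (slide 34): equal spacing, or exp(value) when the gap
# between LIKE and LOVE should count for more than HATE vs. DON'T MIND.
ordinal = {"HATE": 0, "DON'T MIND": 1, "LIKE": 2, "LOVE": 3}
stretched = {k: math.exp(v) for k, v in ordinal.items()}
```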
- 35. Data Sampling Label Imbalance Problem 3:1 Find out more from: https://github.com/scikit-learn-contrib/imbalanced-learn
- 36. Data Sampling Label Imbalance Problem Over Sampling 3:1 1:1 Find out more from: https://github.com/scikit-learn-contrib/imbalanced-learn
- 37. Data Sampling Label Imbalance Problem Over Sampling Under Sampling 3:1 1:1 1:1 Find out more from: https://github.com/scikit-learn-contrib/imbalanced-learn
- 38. Data Sampling Label Imbalance Problem Over Sampling Under Sampling 3:1 1:1 1:1 Find out more from: https://github.com/scikit-learn-contrib/imbalanced-learn Over-Under Sampling?
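The over- and under-sampling pictures above reduce to a few lines of random sampling; imbalanced-learn wraps exactly this (plus smarter variants such as SMOTE). The 3:1 dataset here is hypothetical:

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical 3:1 imbalance, as in the slides' pictures.
data = [(i, "pos") for i in range(75)] + [(i, "neg") for i in range(75, 100)]
majority = [d for d in data if d[1] == "pos"]
minority = [d for d in data if d[1] == "neg"]

# Over-sampling: draw minority examples with replacement until 1:1.
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

# Under-sampling: keep a random subset of the majority until 1:1.
undersampled = random.sample(majority, len(minority)) + minority

print(Counter(label for _, label in oversampled))   # 75 pos, 75 neg
print(Counter(label for _, label in undersampled))  # 25 pos, 25 neg
```

Over-under sampling, the question on the last slide, simply combines the two: over-sample the minority part of the way and under-sample the majority to meet it.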
- 39. Other Feature Kinds Text-based - Natural Language Processing Image-based, Audio-based - Image/Signal Processing Time-based - Time Series Domain Knowledge is Important
- 40. Example (1) Text-based - Vector Space Model - Word Embeddings (e.g. the MAN : WOMAN :: KING : QUEEN analogy). Need stemming? Lemmatization? https://en.wikipedia.org/wiki/Vector_space_model
- 41. Example (2) Text: segment the text into words, filter stop words, and count term frequencies as dummy variables; optionally apply advanced weighting, or use word embeddings.
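The segment / filter / count / weight pipeline can be sketched as a bag-of-words with tf-idf-like weighting. The two sentences are borrowed from the EDA slide later in this deck; the stop-word list is illustrative:

```python
import math
from collections import Counter

docs = [
    "how do i read and find my youtube comments",
    "how can i see all my youtube comments",
]

# Segmentation + filtering: split into words and drop stop words.
stopwords = {"how", "do", "i", "can", "all", "and", "my"}
bags = [Counter(w for w in doc.split() if w not in stopwords) for doc in docs]

# tf-idf-like weighting: term frequency times log(N / document frequency),
# so words shared by every document ("youtube", "comments") get weight 0
# and discriminative words ("read", "see") keep positive weight.
n_docs = len(docs)
df = Counter(w for bag in bags for w in bag)
tfidf = [{w: c * math.log(n_docs / df[w]) for w, c in bag.items()}
         for bag in bags]
```

For Chinese text, as in the slide's original example, the `doc.split()` step would be replaced by a word segmenter, since there are no spaces to split on.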
- 42. Example (3) Image-based - SIFT - Convolutional NN https://en.wikipedia.org/wiki/Scale-invariant_feature_transform https://en.wikipedia.org/wiki/Convolutional_neural_network
- 43. Realize the Meaning Behind the Observed Features. Example: from "2017/05/20 08:00, Taipei" we can derive: Holiday? Weekday? Day? Night? Region: Asia; language: Mandarin.
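Unpacking the slide's timestamp into such features is a one-liner per feature with the standard library; the holiday and language lookups would come from external tables:

```python
from datetime import datetime

# The slide's example observation: 2017/05/20 08:00 in Taipei.
ts = datetime.strptime("2017/05/20 08:00", "%Y/%m/%d %H:%M")

features = {
    "weekday": ts.weekday(),          # 5 = Saturday
    "is_weekend": ts.weekday() >= 5,  # answers "Holiday? Weekday?" roughly
    "is_daytime": 6 <= ts.hour < 18,  # answers "Day? Night?"
    "month": ts.month,
}
# Location adds more: region (Asia), dominant language (Mandarin), and a
# Taiwan public-holiday table would answer "Holiday?" precisely.
```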
- 44. ML Libraries: scikit-learn; xgboost, lightgbm, vowpal wabbit; libsvm, liblinear, libfm, libffm; tensorflow, keras, h2o, caffe, mxnet, ...
- 45. Understand the Pros and Cons. Linear Model: simple, fast, and easy to tune; low memory use; but limited in modeling complex patterns. Random Forest: works very well in many competitions; fast and easy to tune; but memory hungry.
- 46. Understand the Pros and Cons (2). Neural Networks: easy end-to-end learning; flexible; but hard to tune/train. SVM: strong theoretical guarantees; good at preventing overfitting; but slow and memory heavy, and usually needs a grid search over hyperparameters.
- 47. Understand the Pros and Cons (3). Gradient Boosting Machine (GBM): usually unbeatable on dense feature sets. Factorization Machine (FM): the master at dealing with sparse data.
- 48. Understand the Pros and Cons (4). There are too many details; find some online courses or ML books: The Elements of Statistical Learning; Machine Learning: A Probabilistic Perspective; Programming Collective Intelligence; Pattern Recognition and Machine Learning (Information Science and Statistics).
- 49. Understand the Pros and Cons (5). I'll tell you everything.
- 50. Exploratory Data Analysis (EDA), with Quora duplicate questions as the example. Pair 1: "How do I read and find my YouTube comments?" vs. "How can I see all my Youtube comments?". Pair 2: "What is the alternative to machine learning?" vs. "How do I over-sample a multi-class imbalance data set?". Pair 3: "What is the biggest monster in Monster Hunter?" vs. "Is there a Monster Hunter PC game?".
- 51. Example EDA Statistics Helps - min, max, variance, mode, Data Visualization Helps
- 52. Model Ensemble: Voting / Averaging / Bagging / Boosting / Blending / Stacking. (Diagram: predictions 1 2 3 4 from several models are averaged into a single prediction.)
- 53. Diverse Models Ensemble https://mlwave.com/kaggle-ensembling-guide/
- 56. Model Ensemble: Voting / Averaging / Bagging / Boosting / Blending / Stacking. (Diagram: predictions 1 2 3 4 from several models are combined by an ensemble model.)
- 57. Model Ensembling: Voting / Averaging / Bagging / Boosting / Blending / Stacking. (Diagram: each model's predictions become a new feature for a second-level model, e.g. averaged.)
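Averaging, voting, and the feature-building step of stacking can be sketched in a few lines. The three models' scores here are hypothetical:

```python
from collections import Counter

# Hypothetical scores from three diverse models on four test rows.
preds = {
    "model_a": [0.9, 0.2, 0.6, 0.4],
    "model_b": [0.8, 0.3, 0.4, 0.5],
    "model_c": [0.7, 0.1, 0.5, 0.9],
}
rows = list(zip(*preds.values()))   # one tuple of model scores per row

# Averaging: mean score per row.
averaged = [sum(r) / len(r) for r in rows]

# Voting: threshold each model at 0.5, take the majority class.
voted = [Counter(score >= 0.5 for score in r).most_common(1)[0][0]
         for r in rows]

# Stacking (slide 57): the level-1 predictions themselves become the feature
# vector on which a second-level model is trained, using out-of-fold
# predictions to avoid leaking the training labels.
stacked_features = [list(r) for r in rows]
```

The diversity point from the slides above matters here: averaging three near-identical models gains little, while averaging models that make different mistakes is where ensembling pays off.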
- 58. Other Tricks Data Leakage Magic/Lucky Parameters
- 59. Overall - To Get into the Top: correct validation, good feature extraction, diverse models, proper model ensembling. Advanced way: understand and modify the model.
- 60. Step by Step Feature Set A Feature Set B Feature Set C Model A DATASET Prediction A. Focus on using one single model. Extract N features every day. Check the validation score.
- 61. Step by Step Feature Set A Feature Set B Feature Set C Model A DATASET Diversify the models. Try different features combination. Model B Model C Prediction A Prediction B Prediction C
- 62. Step by Step Feature Set A Feature Set B Feature Set C Model A DATASET Ensemble the models. Model B Model C Prediction A
- 63. Step by Step Feature Set A Feature Set B Feature Set C Model A DATASET Package it. Model B Model C Prediction A
- 64. Step by Step Feature Set D Feature Set E Feature Set F Model D Model E Model F DATASET Prediction B. Stacking: repeat the whole features-to-models-to-prediction pipeline and stack its outputs.
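The step-by-step slides read as a pipeline: feature sets feed diverse models, whose predictions are ensembled. A hypothetical sketch; every name and lambda is a stand-in for a real extractor or trained model:

```python
# Stand-ins for the workflow on slides 60-64 (all names are illustrative).
feature_sets = {
    "A": lambda row: [row["x"]],        # e.g. raw numerical features
    "B": lambda row: [row["x"] ** 2],   # e.g. engineered interactions
    "C": lambda row: [abs(row["x"])],   # e.g. cleaned/transformed values
}
models = {
    "model_a": lambda feats: sum(feats),   # stand-in for e.g. a GBM
    "model_b": lambda feats: max(feats),   # stand-in for e.g. a linear model
}

def predict(row):
    # 1) extract every feature set, 2) collect each model's prediction,
    # 3) ensemble the diverse predictions (here by simple averaging).
    feats = [f for extract in feature_sets.values() for f in extract(row)]
    level1 = [model(feats) for model in models.values()]
    return sum(level1) / len(level1)
```

Packaging the whole thing behind one `predict` call, as on slide 63, is what lets you then stack it: the packaged pipeline's predictions become inputs to the next level.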
- 65. https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom
- 66. https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng/
- 67. http://blog.kaggle.com/2017/02/27/allstate-claims-severity-competition-2nd-place-winners-interview-alexey-noskov/
- 68. http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
- 69. Learning from Others/Winners http://blog.kaggle.com/
- 70. https://docs.google.com/presentation/d/1bo7SahuYMzEEylVUTbJE29Oot7T4Cqk9SknOS9Lgsq4/edit?usp=sharing Learning from Others/Winners
- 71. ANY QUESTIONS? changecandy at gmail