
Page 1: Title

KDD CUP 2017 Travel Time Prediction

Predicting Travel Time – The Winning Solution of KDD CUP 2017

Team: Convolution

Page 2: Team Members

Pan Huang – Bing Ads, Microsoft

Peng Yan – Meituan-Dianping

Ke Hu – Bing Ads, Microsoft

Huan Chen – Beihang University (Master's candidate, CS)

Page 3: Outline

• Task understanding
• Features
• Models
• Summary

Page 4: Task Understanding – Problem Definition

Task
• Estimate the average travel time from designated intersections to tollgates

Metrics
• MAPE (mean absolute percentage error)

Data
• Vehicle trajectories along routes
• Weather data in the target area
• Road topology
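For reference, the MAPE metric referred to throughout this deck has the standard form below (stated here from the usual definition; the slide lists only the heading, and the competition averages over routes and time windows):

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Because each error is divided by the true value $y_i$, small true values amplify errors, which motivates the challenges on the next page and the 1/label sample weighting used later.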

Page 5: Task Understanding – Challenges

• The data is noisy and sparse, so it is easy to overfit to specific validation data, and offline experiments are hard to conduct
• Small abnormal values have a big impact on the final metric, especially under MAPE evaluation
• The time sequence is hard to predict, especially the pattern of the following six windows

Page 6: Task Understanding – More Effective Validation Methods

[Figure: six moving train/test splits along the timeline]

• Moving-window-based CV split (a sketch follows below)
  - Use the last 2–8 days as validation, the last 9–98 days as training
  - Use the last 3–9 days as validation, the last 10–98 days as training
• Interval K-fold-based CV split (to measure the impact of the last week's data on training)
  - Use the second week as validation, the other weeks as training
  - Use the third week as validation, the other weeks as training
• Use online leaderboard feedback to determine the weights of the validation sets
  - 13 CV sets were used in total, which was much more consistent with online feedback
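A minimal sketch of how such moving-window splits could be generated; the day indexing, the pandas layout, and the column name are assumptions, not from the slides:

```python
import pandas as pd

def moving_window_splits(df, spans, day_col="day"):
    """Yield (train_idx, valid_idx) pairs for moving-window CV.

    `spans` is a list of (valid_days, train_days) pairs, each a range
    of day offsets counted back from the last observed day.
    """
    last_day = df[day_col].max()
    for valid_days, train_days in spans:
        valid_mask = df[day_col].isin({last_day - d for d in valid_days})
        train_mask = df[day_col].isin({last_day - d for d in train_days})
        yield df.index[train_mask], df.index[valid_mask]

# The two moving-window splits on this slide:
splits = [(range(2, 9), range(9, 99)),    # valid: last 2-8 days, train: last 9-98
          (range(3, 10), range(10, 99))]  # valid: last 3-9 days, train: last 10-98
```

Combined with the interval K-fold splits and the leaderboard-derived weights, split pairs like these make up the 13 weighted CV sets.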

Page 7: Task Understanding – Workflow

[Figure: workflow diagram. An offline loop (offline training set → model → offline prediction result → offline score, with feedback) can be run many times a day; an online loop (online training set → online prediction result → online score, with feedback) runs once a day. Both feedback loops drive feature extraction, model tuning, and ensembling.]

Page 8: Features – Data Augmentation

[Figure: timeline diagram. A 2-hour feature-extraction window slides along the timeline in 20-minute steps; each position yields six prediction targets (the six following 20-minute windows), i.e. six samples per window position.]

• 250k training samples in total (see the sketch below)
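A rough sketch of this sliding-window augmentation; the 2-hour span, 20-minute stride, and six targets are from the slide, while the data layout (a pandas Series of per-window average travel times) is an assumption:

```python
import pandas as pd

STEP = pd.Timedelta("20min")     # stride of the sliding window
FEAT_SPAN = pd.Timedelta("2h")   # span the features are computed over
N_TARGETS = 6                    # the six following 20-minute windows

def make_samples(avg_tt: pd.Series, start, end):
    """Slide the 2-hour window along [start, end) in 20-minute steps.

    `avg_tt` maps each 20-minute window start to its average travel time.
    """
    samples, t = [], start
    while t + FEAT_SPAN + N_TARGETS * STEP <= end:
        feats = [avg_tt.get(t + i * STEP) for i in range(6)]  # 2h of history
        targets = [avg_tt.get(t + FEAT_SPAN + i * STEP) for i in range(N_TARGETS)]
        samples.append((feats, targets))
        t += STEP  # the 20-minute stride is what multiplies the sample count
    return samples
```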

Page 9: Features – Data Preprocessing

• Remove holiday traffic
• CV-based denoising

Page 10: Features – Feature Extraction

Aggregate information from two dimensions: time and road.

Basic Features
• Time
• Road

Session-Level Features
• Last-K-cars statistics
• Moving-window statistics

Long-Term Features
• Statistics computed with an exponential decay factor (see the sketch below)
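One plausible reading of the exponentially decayed statistics; the exact weighting form and the decay constant are assumptions:

```python
def decayed_mean(values, decay=0.9):
    """Exponentially decayed mean over a history of observations.

    The i-th most recent value gets weight decay**i, so long-term
    history contributes but fades smoothly instead of being cut off
    at a fixed window edge.
    """
    num = den = 0.0
    for i, v in enumerate(reversed(values)):  # newest observation first
        w = decay ** i
        num += w * v
        den += w
    return num / den if den else float("nan")

# e.g. a long-term feature: decayed_mean(route_daily_avg_times, decay=0.95)
```

Varying the decay factor across models is exactly the feature-level ensemble trick on Page 14.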

Page 11: Features – Feature Engineering

• More than 100 features were extracted
• Feature selection was based on CV experiments and online feedback
• Multiple feature combinations were used for ensembling, resolving feature conflicts

[Figure: the full feature set is split into feature set 1, feature set 2, and feature set 3, feeding model 1, model 2, and model 3 respectively.]

Page 12: Models – Tree-Based Models

XGBoost
• Level-wise tree-growth strategy
• Stable

LightGBM
• Leaf-wise tree-growth strategy
• Good algorithm for categorical features
• Very fast

Model Tuning
• Each CV fold has its own early-stopping round; the CV fold weights determine the stopping round used when training on the full data

Sample Weight
• Use 1/label as the sample weight, so that small-value samples are learned better, matching MAPE's relative-error penalty (see the sketch below)
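A minimal sketch of the 1/label weighting with per-fold early stopping, assuming LightGBM's scikit-learn API and pre-built arrays `X, y, X_val, y_val` for one CV fold:

```python
import lightgbm as lgb

# Hypothetical fold data: X, y (train) and X_val, y_val (validation).
model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.05)
model.fit(
    X, y,
    sample_weight=1.0 / y,              # 1/label: small travel times weigh more
    eval_set=[(X_val, y_val)],
    eval_sample_weight=[1.0 / y_val],   # score the fold under the same weighting
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
fold_stop_round = model.best_iteration_  # per-fold stop round; the team combined
                                         # these across folds using the CV weights
```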

Page 13: Models – Neural Network

One-Hot Embedding
• Use an embedding layer for the basic features; similar locations end up with close embedding vectors

Early Interaction vs. Late Interaction
• Balance the learning between the embedding part and the statistics part

MLP vs. RNN
• Multi-Layer Perceptron: powerful expressive ability, learns feature interactions
• Recurrent Neural Network: good at modeling sequence relationships, but unstable in generalization

[Figure: network architecture. One-hot inputs (time, tollgate id, location, etc.) pass through an embedding layer, are concatenated with dense statistical values, and flow through several hidden layers to the final output.]
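A compact PyTorch sketch of the embedding-plus-statistics architecture in the figure; the vocabulary sizes, embedding widths, and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TravelTimeNet(nn.Module):
    """Embeddings for one-hot inputs, concatenated with dense statistics."""

    def __init__(self, n_times=72, n_tollgates=5, n_locations=30, n_dense=50):
        super().__init__()
        self.time_emb = nn.Embedding(n_times, 8)     # e.g. 20-min slot of day
        self.toll_emb = nn.Embedding(n_tollgates, 4)
        self.loc_emb = nn.Embedding(n_locations, 8)  # close locations -> close vectors
        self.mlp = nn.Sequential(
            nn.Linear(8 + 4 + 8 + n_dense, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),                        # predicted average travel time
        )

    def forward(self, time_id, toll_id, loc_id, dense_stats):
        # "Late interaction": the statistics join after the embedding lookup.
        x = torch.cat([self.time_emb(time_id),
                       self.toll_emb(toll_id),
                       self.loc_emb(loc_id),
                       dense_stats], dim=-1)
        return self.mlp(x).squeeze(-1)
```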

Page 14: Models – Ensemble

Data-Level Ensemble
• N-fold validation model ensemble without retraining
• A global model plus separate per-district models
• Use 2-, 10-, and 20-minute moving windows to construct samples

Feature-Level Ensemble
• Different exponential decay factors or smoothing factors for the statistics features
• Bagging of general route features and link features
• Ensemble of long-term and short-term statistics

Page 15: Models – Ensemble (cont.)

Model-Level Ensemble
• Different ML models, including XGBoost, LightGBM, DNN, and RNN
• Loss-function changes: weighted L2 (norm-2) loss, and fair loss to approximate the L1 (norm-1) loss (see the formula below)
• Label transforms, such as a customized log transform, plus the sample-weight transform
• Model parameters, such as the NN structure and the GBDT parameters
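For reference, the fair loss mentioned above is the standard robust loss below (its role as a smooth L1 surrogate is from the slide; the constant $c$ is a tuning parameter):

$$L_{\text{fair}}(r) = c^2\left(\frac{|r|}{c} - \log\!\left(1 + \frac{|r|}{c}\right)\right),\qquad \frac{\partial L_{\text{fair}}}{\partial r} = \frac{r}{1 + |r|/c}$$

Near $r = 0$ it behaves like a scaled L2 loss, while its gradient saturates at $\pm c$ for large residuals, so it approximates the L1 loss while staying differentiable, which makes it usable as a GBDT objective.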

Result
• Chose 13 base models, based on multiple weighted CVs and leaderboard feedback
• Final score: 0.1748 MAPE on the leaderboard

[Chart: offline CV evaluation of ensemble evolvement. MAPE (axis 0.172–0.182) improves stepwise from a single GBDT model through model-level, feature-level, and data-level ensembling.]

[Chart: offline CV evaluation of model-level ensemble evolvement. MAPE (axis 0.176–0.182) improves from a single GBDT model through the NN ensemble, loss-function ensemble, and label transform.]

[Figure: stacking diagram. Tree models, NN models, and feature-level ensemble models feed a linear model; the weight vector W is determined on weighted CV data (see the sketch below).]
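A minimal sketch of how the linear stacking weights W might be fit on the weighted CV data; the 1/y row weighting (to track MAPE) and the least-squares formulation are assumptions:

```python
import numpy as np

def fit_stack_weights(oof_preds, y):
    """Fit linear stacking weights on out-of-fold CV predictions.

    oof_preds: (n_samples, n_models) out-of-fold predictions
    y:         (n_samples,) true average travel times
    Scaling each row by 1/y turns the squared error into a squared
    percentage error, a smooth proxy for the MAPE metric.
    """
    w = 1.0 / y
    A = oof_preds * w[:, None]
    b = y * w                   # an all-ones vector; kept explicit for clarity
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

# Final ensemble prediction: test_preds @ fit_stack_weights(oof_preds, y)
```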

Page 16: Summary

Feature: basic features, session-level features, long-term features
Model: XGBoost, LightGBM, Multi-Layer Perceptron
Ensemble: weighted-CV based, log transform, loss function, etc.

Page 17: Summary – Lessons Learned

• Pay close attention to the data distribution, e.g. distribution changes and noisy data
• Building scientific cross-validation sets is very important; rely on CV much more than on the leaderboard
• Treat error as composed of bias, variance, and noise; try to decrease variance when bias is hard to decrease
• Think more and fine-tune less, or it is easy to overfit to small data
• Understand the evaluation well, and design a corresponding loss function
• Separate modeling is useful if the data distribution differs a lot

Page 18: Thanks

Q&A