Intelligible Models for Classification and Regression
Authors: Yin Lou, Rich Caruana, Johannes Gehrke (Cornell University)
KDD, 2012
Presented by: Haotian Jiang
3.31.2015
Content
1. Motivation
2. Generalized Additive Models: Introduction and Methodology
3. Experiments
4. Results
5. Discussion
6. Conclusion
Motivation
Regression and Classification:
a. Complex models (SVMs, Random Forests, Deep Neural Nets)
   Advantage: high accuracy. Disadvantage: uninterpretable.
b. Linear models
   Advantage: interpretable. Disadvantage: low accuracy.
Common problem: in many applications, what is learned is just as important as the accuracy of the predictions.
Introduction to GAMs
We fit models of the form
g(E[y]) = f1(x1) + f2(x2) + … + fn(xn)    (Equation 1)
where the fi are shape functions. If the link function g is the identity, we obtain a regression model; if g is the logistic function, we obtain a classification model.
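To make the additive structure concrete, here is a minimal prediction sketch, assuming the shape functions have already been fit; the shape_funcs list of per-feature callables is hypothetical, not from the paper:

```python
# Minimal sketch of GAM prediction: sum per-feature shape functions,
# then apply the inverse link (identity or logistic).
import numpy as np

def gam_predict(shape_funcs, X, link="identity"):
    # g(E[y]) = f1(x1) + ... + fn(xn): evaluate each fi on its own column
    score = sum(f(X[:, i]) for i, f in enumerate(shape_funcs))
    if link == "identity":                  # regression model
        return score
    return 1.0 / (1.0 + np.exp(-score))     # logistic link: classification model
```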
Introduction to GAMs
Since the data in Example 1 was drawn from a model with no interactions between features, a model of the form in Equation 1 is able to fit the data perfectly.
Methodology of GAMs
Let D = {(x_i, y_i)}, i = 1, …, N, denote a training dataset of size N, where x_i = (x_i1, …, x_in) is a feature vector with n features and y_i is the target.
In this paper, the authors consider both regression problems, where y_i ∈ R, and binary classification problems, where y_i ∈ {0, 1}.
Given a model F, let F(x_i) denote the prediction of the model for data point x_i. The goal is to minimize the expected value of some loss function L(y_i, F(x_i)).
Two components must be chosen: the shape functions and the learning method.
Shape Functions
Regression Splines. We consider regression splines of the form
f(x) = Σ_{k=1}^{d} β_k b_k(x),
where the b_k are spline basis functions and d is the number of basis functions.
Assume for the moment a model containing only one smooth function of one covariate, for example y_i = f(x_i) + ε_i.
Recommended reading: "Generalized Additive Models: An Introduction with R" (Simon Wood).
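As an illustration of the basis-expansion view, here is a minimal sketch that fits f(x) = Σ β_k b_k(x) by ordinary least squares, assuming a simple truncated-power (hinge) basis rather than the thin plate basis used in the paper:

```python
# Minimal sketch: a regression spline as a linear basis expansion,
# fit by unpenalized least squares over an assumed hinge basis.
import numpy as np

def fit_regression_spline(x, y, knots):
    # Basis b_k: 1, x, and hinge functions (x - t)_+ at each knot t
    B = np.column_stack([np.ones_like(x), x] +
                        [np.maximum(x - t, 0.0) for t in knots])
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # beta_k coefficients
    return lambda xn: np.column_stack(
        [np.ones_like(xn), xn] +
        [np.maximum(xn - t, 0.0) for t in knots]) @ beta
```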
Shape Functions
Trees and Ensembles of Trees
We control tree complexity by either fixing the number of leaves or by disallowing leaves that have fewer than an α-fraction of the number of training examples.
Single Tree: We use a single regression tree as a shape function.
Bagged Trees: We use the well-known technique of bagging to reduce variance (a minimal sketch follows this list).
Boosted Trees: We use gradient boosting, where each successive tree tries to predict the overall residual from all preceding trees.
Boosted Bagged Trees: We use a bagged ensemble in each step of stochastic gradient boosting, resulting in a boosted ensemble of bagged trees.
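For illustration, a minimal sketch of one bagged-tree shape function fit to a single feature, using scikit-learn stand-ins; the function name and the residual-based usage are assumptions, not the authors' code, and the estimator= keyword assumes scikit-learn >= 1.2:

```python
# Minimal sketch: a bagged ensemble of size-limited regression trees
# as the shape function for one feature.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def bagged_shape_function(x_j, residual, n_bags=100, max_leaves=4):
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_leaf_nodes=max_leaves),
        n_estimators=n_bags)                   # bagging reduces variance
    model.fit(x_j.reshape(-1, 1), residual)    # one feature predicts the residual
    return model
```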
Shape Functions
A tree grown on a single feature partitions that feature's range into rectangular regions (intervals), so the resulting shape function is piecewise constant. Our purpose is to learn shape functions that fit the training data well.
Learning Method: Least Squares
Least Squares: We minimize the penalized least squares objective
Σ_{i=1}^{N} (y_i − f(x_i))² + λ ∫ [f''(x)]² dx
with smoothing parameter λ. Large values of λ lead to a straight line for f, while low values of λ allow the spline to fit closely to the data.
We use thin plate regression splines from the R package "mgcv", which automatically selects the best values for the parameters of the splines.
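As a rough illustration of the role of λ, the sketch below uses SciPy's smoothing spline, whose parameter s plays an analogous role; this is a stand-in for intuition, not mgcv's penalized least squares solver:

```python
# Minimal sketch: the smoothing parameter trades off fit vs. wiggliness.
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * np.random.randn(200)

wiggly = UnivariateSpline(x, y, s=1)     # small penalty: follows the data closely
smooth = UnivariateSpline(x, y, s=200)   # large penalty: approaches a straight line
```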
Learning Method: Gradient Boosting
We use standard gradient boosting with one difference: since we want to learn shape functions for all features, in each iteration of boosting we cycle sequentially through all features (a sketch follows below).
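A minimal sketch of this cyclic variant, assuming squared loss (so the negative gradient is the overall residual) and single scikit-learn trees as the base learner; bagging each stage, as in the paper's best method, is omitted for brevity:

```python
# Minimal sketch of GAM shaping via gradient boosting that cycles
# sequentially through all features each iteration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_boosting(X, y, n_iter=100, max_leaves=4):
    N, n = X.shape
    shape_funcs = [[] for _ in range(n)]   # after M iterations: M trees per feature
    pred = np.zeros(N)                     # current additive prediction
    for _ in range(n_iter):
        for j in range(n):                 # cycle sequentially through features
            residual = y - pred            # overall residual from all preceding trees
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            tree.fit(X[:, [j]], residual)  # one-dimensional tree on feature j
            shape_funcs[j].append(tree)
            pred += tree.predict(X[:, [j]])
    return shape_funcs
```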
Learning Method: Backfitting
Shape functions are learned one at a time: the first shape function f1 is learned using the training set with the goal of predicting y, and each subsequent shape function is fit to the residual left by the functions learned so far (a sketch follows below).
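A minimal sketch of backfitting with single trees as shape functions, assuming squared loss and a fixed number of passes; note how the previous fit for a feature is discarded and relearned each pass, which is what makes backfitting expensive (see the cost discussion later):

```python
# Minimal sketch of backfitting: refit each feature's tree to the
# partial residual left by all the other features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_backfitting(X, y, n_passes=10, alpha=0.05):
    N, n = X.shape
    trees = [None] * n
    contrib = np.zeros((n, N))                    # f_j(x_ij) per feature j
    min_leaf = max(1, int(alpha * N))             # alpha-fraction stopping rule
    for _ in range(n_passes):
        for j in range(n):
            partial_residual = y - (contrib.sum(axis=0) - contrib[j])
            tree = DecisionTreeRegressor(min_samples_leaf=min_leaf)
            tree.fit(X[:, [j]], partial_residual) # discard old f_j, refit from scratch
            trees[j] = tree
            contrib[j] = tree.predict(X[:, [j]])
    return trees
```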
Experimental Setup
Datasets: The "Concrete," "Wine," and "Music" regression datasets are from the UCI repository; "Delta" is the task of controlling the ailerons of an F-16 aircraft; "CompAct" is a regression dataset from the Delve repository that describes the state of multiuser computers. The synthetic dataset was described in Example 1.
Experimental Setup
In gradient boosting, we vary the number of leaves in the bagged or boosted trees: 2, 3, 4, 8, 12, and 16. Trained models will contain M such trees for each shape function after M iterations.
In backfitting, we control the complexity of the tree by adaptively choosing a parameter α that stops splitting nodes smaller than an α-fraction of the size of the training data; we vary α ∈ {0.00125, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}.
Dataset split:
Training set: 0.64N
Validation set: 0.16N
Test set: 0.2N
Cross-validation is used to check for convergence.
Purpose: choosing M, and bias-variance analysis.
Metrics: For regression problems, we report the root mean squared error (RMSE). For classification problems, we report the error rate. (A minimal sketch of the split and metrics follows.)
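For concreteness, a minimal sketch of the 0.64/0.16/0.20 split and the two metrics, assuming hypothetical arrays X and y:

```python
# Minimal sketch: evaluation metrics and the train/validation/test split.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def error_rate(y_true, y_pred_labels):
    return np.mean(y_true != y_pred_labels)

def split(N, seed=0):
    idx = np.random.default_rng(seed).permutation(N)
    n_tr, n_va = int(0.64 * N), int(0.16 * N)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]  # train/val/test
```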
Results
Discussion
Bias-Variance Analysis
Question: why are tree-based methods more accurate for feature shaping than spline-based methods?
Train L models on L samples of the data and define the average prediction for each point (x_i, y_i) in the test set as F̄(x_i) = (1/L) Σ_{l=1}^{L} F_l(x_i).
The squared bias: bias² = (1/N) Σ_{i=1}^{N} (y_i − F̄(x_i))².
The variance: variance = (1/N) Σ_{i=1}^{N} (1/L) Σ_{l=1}^{L} (F_l(x_i) − F̄(x_i))².
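A minimal sketch of this computation, assuming preds is an L × N array holding the L models' predictions on the N test points:

```python
# Minimal sketch: empirical squared bias and variance over L trained models.
import numpy as np

def bias_variance(preds, y_test):
    avg = preds.mean(axis=0)                 # average prediction per test point
    sq_bias = np.mean((y_test - avg) ** 2)   # squared bias
    variance = np.mean((preds - avg) ** 2)   # variance around the average
    return sq_bias, variance
```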
Discussion
Computation Cost: P-LS and P-IRLS are very fast on small datasets, but on the larger datasets they are slower than BST-TRx. Due to the extra cost of bagging, BST-bagTRx, BF-bagTR, and BF-bbTRx are much slower than P-LS/P-IRLS or BST-TRx. The slowest method we tested is backfitting, which is expensive because at each iteration the previous shape functions are discarded and a new fit for each feature must be learned.
Limitations and Extensions: dimensionality (only low-to-medium), incorporating feature selection, and allowing a few pairwise interactions.
Follow-up work: "Accurate Intelligible Models with Pairwise Interactions."
Conclusion
Our bias-variance analysis shows that spline-based methods tend to underfit and thus may miss important non-smooth structure in the shape functions. As expected, the bias-variance analysis also shows that tree-based methods are prone to overfitting and require careful regularization.
A GAM learning method based on gradient boosting of size-limited bagged trees yields significantly higher accuracy than previous algorithms on both regression and classification problems, while retaining the intelligibility of GAM models.
Thank you! Q & A
Haotian Jiang