Intelligible Models for Classification and Regression
Authors: Yin Lou, Rich Caruana, Johannes Gehrke (Cornell University)
KDD, 2012
Presented by: Haotian Jiang
3.31.2015
Content
1. Motivation
2. Generalized Additive Models: Introduction and Methodology
3. Experiments
4. Results
5. Discussion
6. Conclusion
Motivation
Regression and Classification:
a. Complex models (SVMs, Random Forests, Deep Neural Nets)
   Advantage: high accuracy. Disadvantage: uninterpretable.
b. Linear models
   Advantage: interpretable. Disadvantage: low accuracy.
Common problem: in many applications, what is learned is just as important as the accuracy of the predictions.
Introduction to GAMs
We fit models of the form
g(E[y]) = f1(x1) + f2(x2) + … + fn(xn)    (Equation 1)
where the fi are shape functions. If the link function g is the identity, we obtain a regression model; if g is the logistic function, we obtain a classification model.
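To make the additive structure concrete, here is a minimal prediction sketch, assuming the shape functions have already been fit; the shape_funcs list of per-feature callables is hypothetical, not from the paper:

```python
# Minimal sketch of GAM prediction: sum per-feature shape functions,
# then apply the inverse link (identity or logistic).
import numpy as np

def gam_predict(shape_funcs, X, link="identity"):
    # g(E[y]) = f1(x1) + ... + fn(xn): evaluate each fi on its own column
    score = sum(f(X[:, i]) for i, f in enumerate(shape_funcs))
    if link == "identity":                  # regression model
        return score
    return 1.0 / (1.0 + np.exp(-score))     # logistic link: classification model
```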
Introduction to GAMs
Since the data in Example 1 was drawn from a model with no interactions between features, a model of the form in Equation 1 is able to fit the data perfectly.
Methodology of GAMs
Let D = {(x_i, y_i)}, i = 1, …, N, denote a training dataset of size N, where x_i = (x_i1, …, x_in) is a feature vector with n features and y_i is the target.
In this paper, the authors consider both regression problems, where y_i ∈ R, and binary classification problems, where y_i ∈ {0, 1}.
Given a model F, let F(x_i) denote the prediction of the model for data point x_i. The goal is to minimize the expected value of some loss function L(y_i, F(x_i)).
Two components must be chosen: the shape functions and the learning method.
Shape Functions
Regression Splines. We consider regression splines of the form
f(x) = Σ_{k=1}^{d} β_k b_k(x),
where the b_k are spline basis functions and d is the number of basis functions.
Assume for the moment a model containing only one smooth function of one covariate, for example y_i = f(x_i) + ε_i.
Recommended reading: "Generalized Additive Models: An Introduction with R" (Simon Wood).
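As an illustration of the basis-expansion view, here is a minimal sketch that fits f(x) = Σ β_k b_k(x) by ordinary least squares, assuming a simple truncated-power (hinge) basis rather than the thin plate basis used in the paper:

```python
# Minimal sketch: a regression spline as a linear basis expansion,
# fit by unpenalized least squares over an assumed hinge basis.
import numpy as np

def fit_regression_spline(x, y, knots):
    # Basis b_k: 1, x, and hinge functions (x - t)_+ at each knot t
    B = np.column_stack([np.ones_like(x), x] +
                        [np.maximum(x - t, 0.0) for t in knots])
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # beta_k coefficients
    return lambda xn: np.column_stack(
        [np.ones_like(xn), xn] +
        [np.maximum(xn - t, 0.0) for t in knots]) @ beta
```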
Shape Functions
Trees and Ensembles of Trees
We control tree complexity by either fixing the number of leaves or by disallowing leaves that have fewer than an α-fraction of the number of training examples.
Single Tree: We use a single regression tree as a shape function.
Bagged Trees: We use the well-known technique of bagging to reduce variance (a minimal sketch follows this list).
Boosted Trees: We use gradient boosting, where each successive tree tries to predict the overall residual from all preceding trees.
Boosted Bagged Trees: We use a bagged ensemble in each step of stochastic gradient boosting, resulting in a boosted ensemble of bagged trees.
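For illustration, a minimal sketch of one bagged-tree shape function fit to a single feature, using scikit-learn stand-ins; the function name and the residual-based usage are assumptions, not the authors' code, and the estimator= keyword assumes scikit-learn >= 1.2:

```python
# Minimal sketch: a bagged ensemble of size-limited regression trees
# as the shape function for one feature.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def bagged_shape_function(x_j, residual, n_bags=100, max_leaves=4):
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_leaf_nodes=max_leaves),
        n_estimators=n_bags)                   # bagging reduces variance
    model.fit(x_j.reshape(-1, 1), residual)    # one feature predicts the residual
    return model
```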
Shape Functions
A tree grown on a single feature partitions that feature's range into rectangular regions (intervals), so the resulting shape function is piecewise constant. Our purpose is to learn shape functions that fit the training data well.
Learning Method: Least Squares
Least Squares: We minimize the penalized least squares objective
Σ_{i=1}^{N} (y_i − f(x_i))² + λ ∫ [f''(x)]² dx
with smoothing parameter λ. Large values of λ lead to a straight line for f, while low values of λ allow the spline to fit closely to the data.
We use thin plate regression splines from the R package "mgcv", which automatically selects the best values for the parameters of the splines.
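As a rough illustration of the role of λ, the sketch below uses SciPy's smoothing spline, whose parameter s plays an analogous role; this is a stand-in for intuition, not mgcv's penalized least squares solver:

```python
# Minimal sketch: the smoothing parameter trades off fit vs. wiggliness.
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * np.random.randn(200)

wiggly = UnivariateSpline(x, y, s=1)     # small penalty: follows the data closely
smooth = UnivariateSpline(x, y, s=200)   # large penalty: approaches a straight line
```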
Learning Method: Gradient Boosting
We use standard gradient boosting with one difference: since we want to learn shape functions for all features, in each iteration of boosting we cycle sequentially through all features (a sketch follows below).
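A minimal sketch of this cyclic variant, assuming squared loss (so the negative gradient is the overall residual) and single scikit-learn trees as the base learner; bagging each stage, as in the paper's best method, is omitted for brevity:

```python
# Minimal sketch of GAM shaping via gradient boosting that cycles
# sequentially through all features each iteration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_boosting(X, y, n_iter=100, max_leaves=4):
    N, n = X.shape
    shape_funcs = [[] for _ in range(n)]   # after M iterations: M trees per feature
    pred = np.zeros(N)                     # current additive prediction
    for _ in range(n_iter):
        for j in range(n):                 # cycle sequentially through features
            residual = y - pred            # overall residual from all preceding trees
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            tree.fit(X[:, [j]], residual)  # one-dimensional tree on feature j
            shape_funcs[j].append(tree)
            pred += tree.predict(X[:, [j]])
    return shape_funcs
```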
Learning Method: Backfitting
Shape functions are learned one at a time: the first shape function f1 is learned using the training set with the goal of predicting y, and each subsequent shape function is fit to the residual left by the functions learned so far (a sketch follows below).
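A minimal sketch of backfitting with single trees as shape functions, assuming squared loss and a fixed number of passes; note how the previous fit for a feature is discarded and relearned each pass, which is what makes backfitting expensive (see the cost discussion later):

```python
# Minimal sketch of backfitting: refit each feature's tree to the
# partial residual left by all the other features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_backfitting(X, y, n_passes=10, alpha=0.05):
    N, n = X.shape
    trees = [None] * n
    contrib = np.zeros((n, N))                    # f_j(x_ij) per feature j
    min_leaf = max(1, int(alpha * N))             # alpha-fraction stopping rule
    for _ in range(n_passes):
        for j in range(n):
            partial_residual = y - (contrib.sum(axis=0) - contrib[j])
            tree = DecisionTreeRegressor(min_samples_leaf=min_leaf)
            tree.fit(X[:, [j]], partial_residual) # discard old f_j, refit from scratch
            trees[j] = tree
            contrib[j] = tree.predict(X[:, [j]])
    return trees
```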
Experimental Setup
Datasets: The "Concrete," "Wine," and "Music" regression datasets are from the UCI repository; "Delta" is the task of controlling the ailerons of an F-16 aircraft; "CompAct" is a regression dataset from the Delve repository that describes the state of multiuser computers. The synthetic dataset was described in Example 1.
Experimental Setup
In gradient boosting, we vary the number of leaves in the bagged or boosted trees: 2, 3, 4, 8, 12, and 16. Trained models will contain M such trees for each shape function after M iterations.
In backfitting, we control the complexity of the tree by adaptively choosing a parameter α that stops splitting nodes smaller than an α-fraction of the size of the training data; we vary α ∈ {0.00125, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5}.
Dataset split:
Training set: 0.64N
Validation set: 0.16N
Test set: 0.2N
Cross-validation is used to check for convergence.
Purpose: choosing M, and bias-variance analysis.
Metrics: For regression problems, we report the root mean squared error (RMSE). For classification problems, we report the error rate. (A minimal sketch of the split and metrics follows.)
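For concreteness, a minimal sketch of the 0.64/0.16/0.20 split and the two metrics, assuming hypothetical arrays X and y:

```python
# Minimal sketch: evaluation metrics and the train/validation/test split.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def error_rate(y_true, y_pred_labels):
    return np.mean(y_true != y_pred_labels)

def split(N, seed=0):
    idx = np.random.default_rng(seed).permutation(N)
    n_tr, n_va = int(0.64 * N), int(0.16 * N)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]  # train/val/test
```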
Results
Discussion
Bias-Variance Analysis
Question: why are tree-based methods more accurate for feature shaping than spline-based methods?
Train L models on L samples of the data and define the average prediction for each point (x_i, y_i) in the test set as F̄(x_i) = (1/L) Σ_{l=1}^{L} F_l(x_i).
The squared bias: bias² = (1/N) Σ_{i=1}^{N} (y_i − F̄(x_i))².
The variance: variance = (1/N) Σ_{i=1}^{N} (1/L) Σ_{l=1}^{L} (F_l(x_i) − F̄(x_i))².
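A minimal sketch of this computation, assuming preds is an L × N array holding the L models' predictions on the N test points:

```python
# Minimal sketch: empirical squared bias and variance over L trained models.
import numpy as np

def bias_variance(preds, y_test):
    avg = preds.mean(axis=0)                 # average prediction per test point
    sq_bias = np.mean((y_test - avg) ** 2)   # squared bias
    variance = np.mean((preds - avg) ** 2)   # variance around the average
    return sq_bias, variance
```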
Discussion
Computation Cost: P-LS and P-IRLS are very fast on small datasets, but on the larger datasets they are slower than BST-TRx. Due to the extra cost of bagging, BST-bagTRx, BF-bagTR, and BF-bbTRx are much slower than P-LS/P-IRLS or BST-TRx. The slowest method we tested is backfitting, which is expensive because at each iteration the previous shape functions are discarded and a new fit for each feature must be learned.
Limitations and Extensions: dimensionality (only low-to-medium), incorporating feature selection, and allowing a few pairwise interactions.
Follow-up work: "Accurate Intelligible Models with Pairwise Interactions."
Conclusion
Our bias-variance analysis shows that spline-based methods tend to underfit and thus may miss important non-smooth structure in the shape functions. As expected, the bias-variance analysis also shows that tree-based methods are prone to overfitting and require careful regularization.
A GAM learning method based on gradient boosting of size-limited bagged trees yields significantly higher accuracy than previous algorithms on both regression and classification problems, while retaining the intelligibility of GAM models.
Thank you! Q & A
Haotian Jiang