
Page 1

Introduction to Modern Regression: From OLS to GPS® to MARS®

Dan Steinberg, Mikhail Golovnya
Salford Systems
September 2014

Page 2

Course Outline

Today's Topics
• Regression Problem – quick overview
• Classical OLS/LOGIT – the starting point
• RIDGE/LASSO/GPS – regularized regression
• MARS – adaptive non-linear regression

Next Week's Topics
• Regression Trees
• GBM – stochastic gradient boosting
• ISLE/RULE-LEARNER – model compression

Page 3

Regression

• Regression analysis is at least 200 years old
• Surely the most widely used predictive modeling technique (including logistic regression)
• The American Statistical Association reports 18,900 members
• The Bureau of Labor Statistics counted more than 22,000 statisticians in the U.S. workforce in 2008
• Many other professionals involved in sophisticated data analysis are not included in these counts
  o Statistical specialists in scientific disciplines such as economics, medicine, bioinformatics
  o Machine learning specialists, "data scientists", database experts
  o Market researchers studying traditional targeted marketing
  o Web analytics, social media analytics, text analytics
• Few of these other researchers call themselves statisticians, but many make extensive use of variations of regression

Page 4

Regression Challenges

• Preparation of data – errors, missing values, etc.
• Determination of predictors to include in the model
  o Hundreds, thousands, even tens or hundreds of thousands may be available
• Transformation or coding of predictors
  o Conventional approaches consider logarithm, power, inverse, etc.
• Detecting and modeling important interactions
• Possibly huge numbers of records
  o Super-large samples render all predictors "significant"
• Complexity of the underlying relationship
• Lack of external knowledge

Page 5

OLS Regression

• OLS – ordinary least squares regression
  o Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
  o Given a solid statistical foundation by Fisher in the 1920s
  o 1950s – use of electro-mechanical calculators
• The model is always of the form (see the numeric sketch below):

  Response = A + B1·X1 + B2·X2 + B3·X3 + …

• The response surface is a hyper-plane!
• A – the intercept term
• B1, B2, B3, … – parameter estimates
• A unique combination of values usually exists which minimizes the mean squared error of predictions on the learn sample
• Step-wise approaches are used to determine model size
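A minimal numeric sketch of the OLS model above, fitting A and B1–B3 by least squares with NumPy (the data and variable names are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # three predictors X1, X2, X3
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

X1 = np.column_stack([np.ones(len(X)), X])     # prepend a column of 1s for the intercept A
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares solution [A, B1, B2, B3]
print("intercept:", coef[0], "slopes:", coef[1:])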

Page 6

Logistic Regression

• Models a Bernoulli response (binary outcome)
• The log-odds of the probability of the event of interest is a linear combination of predictors:

  F(X) = Log[p/(1-p)] = A + B1·X1 + B2·X2 + B3·X3 + …

• The solution minimizes the loss (or, equivalently, maximizes the logistic likelihood) on the available data; assuming the response is coded as +1 and -1:

  Loss = Σ(i=1..N) Log[1 + exp(−Yi·Fi)]

• The solution has no closed form and is usually found by a series of iterations using Newton's algorithm (a small sketch of the loss follows below)
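A short sketch of the logistic loss above for +1/−1 labels (illustrative only; the variable names are hypothetical):

import numpy as np

def logistic_loss(beta, X, y):
    """Sum of log(1 + exp(-y_i * F_i)) with F = A + B·X."""
    F = beta[0] + X @ beta[1:]
    return np.sum(np.log1p(np.exp(-y * F)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0, 1.0, -1.0)
print(logistic_loss(np.array([0.0, 1.0, -0.5]), X, y))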

Page 7

Key Problems with OLS and LOGIT

• In addition to the challenges already mentioned above, note the following features:
  o OLS/LOGIT optimizes a specific loss function (mean squared error / log-likelihood) using all available data (the learn sample)
  o The solution over-fits to the learn sample
  o The solution becomes unstable in the presence of collinearity
  o A unique solution does not exist when the data becomes wide
• Look for alternative strategies to construct useful linear combinations of predictors:
  o By jointly optimizing MSE and the L2-norm of the coefficients – Ridge regression
  o By jointly optimizing MSE and the L1-norm of the coefficients – Lasso regression
  o A hybrid of the two above
  o All of the above plus extensions into very sparse solutions – Generalized Path Seeker (GPS)

Page 8

Motivation for Regularized Regression

• Unsatisfactory regression results on data modeling physical processes (1970) when predictors were correlated
  o Coefficient estimates could change dramatically with small changes in the data
  o Some coefficients judged to be much too large
  o Frequent appearance of coefficients with the "wrong" sign
  o Problem severe with substantial multicollinearity but always present
• The solution proposed by Hoerl and Kennard (Technometrics, 1970) was "ridge regression"
  o First proposed by Hoerl in 1962, just for stabilization of coefficients
  o Initially poorly received by the academic statistics profession
• Ridge is intentionally biased but yields more satisfactory coefficient estimates and superior generalization
  o Better performance (test MSE) on previously unseen data (lower variance)
• "Shrinkage" of regression coefficients towards zero
• The OLS coefficient vector length is biased upwards

Page 9

Lasso Regularized Regression

• Introduced by Tibshirani in 1996 explicitly as an improvement on ridge regression
• Least Absolute Shrinkage and Selection Operator
• Desire to gain the stability and lower variance of ridge regression while also performing variable selection
• Especially in the context of many possible predictors, looking for a simple, stable, low-predictive-variance model
• Historical note: the Lasso was inspired by related 1993 work by Leo Breiman (of CART and RandomForests fame) – the "non-negative garrote"
• Breiman's simulation studies showed the potential for improved prediction via selection and shrinkage

Page 10

Regularized Regression – Theory

• The regularized regression approach balances model performance and model complexity
• λ – regularization parameter
  o λ = ∞  zero-coefficient solution
  o λ = 0  OLS solution
• The model complexity expression defines the type of model

  OLS Regression:          Minimize  (Mean Squared Error)
  Regularized Regression:  Minimize  (Mean Squared Error + λ × Model Complexity)

  Model Complexity:
    Ridge: sum of squared coefficients
    Lasso: sum of absolute coefficients
    Best Subsets: number of coefficients

Page 11

Penalty Function – Model Complexity

• RIDGE penalty: Σ aj^2  (squared)
• LASSO penalty: Σ |aj|  (absolute value)
• COMPACT penalty: Σ |aj|^0  (count of non-zero coefficients)
• Extended power family: Σ |aj|^γ, 0 ≤ γ ≤ 2
• Elastic net family: Σ {(β − 1)·aj^2/2 + (2 − β)·|aj|}, 1 ≤ β ≤ 2
• The regularization parameter λ multiplies a penalty function of the coefficient vector a
• Each elasticity is based on fitting a model that minimizes the sum (residual sum of squared errors + penalty)
• Intermediate elasticities are mixtures, e.g. a 50/50 mix of RIDGE and LASSO (γ = 1.5)
• GPS extends beyond the "elastic net" of Tibshirani/Hastie to include the "other half" of the power family (0 ≤ γ ≤ 1); these penalty families are sketched in code below
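A small sketch of the penalty families listed above (my own helper function, not the GPS code):

import numpy as np

def penalty(a, kind="ridge", gamma=1.5, beta=1.5):
    a = np.asarray(a, dtype=float)
    if kind == "ridge":                      # Σ a_j^2
        return np.sum(a ** 2)
    if kind == "lasso":                      # Σ |a_j|
        return np.sum(np.abs(a))
    if kind == "compact":                    # Σ |a_j|^0 = number of non-zero coefficients
        return np.count_nonzero(a)
    if kind == "power":                      # Σ |a_j|^γ, 0 ≤ γ ≤ 2
        return np.sum(np.abs(a) ** gamma)
    if kind == "elastic":                    # Σ {(β−1)a_j²/2 + (2−β)|a_j|}, 1 ≤ β ≤ 2
        return np.sum((beta - 1) * a ** 2 / 2 + (2 - beta) * np.abs(a))
    raise ValueError(kind)

coefs = [0.0, 1.2, -0.4, 0.0, 2.0]
for k in ("ridge", "lasso", "compact", "power", "elastic"):
    print(k, penalty(coefs, kind=k))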

Page 12

Lambda-Centric Approach

• A number of regularized regression algorithms were developed to find the values of the coefficients for a given lambda
  o Ridge regression: a(ridge) = (X'X + λI)^(-1) X'y (see the sketch below)
  o LARS algorithm and its modification to obtain the LASSO regression
  o Elastic Net algorithm to generate a family of solutions "between" LASSO and RIDGE regressions that closely approximate the power family:

    Pβ(a) = Σ {(β − 1)·aj^2/2 + (2 − β)·|aj|}, 1 ≤ β ≤ 2
    Ridge: β = 2.0    Lasso: β = 1.0

• All of these approaches are lambda-centric: pick any combination of beta and lambda (usually on a user-defined grid) and then solve for the coefficients
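A minimal sketch of the closed-form ridge solution above, showing how the coefficients shrink as λ grows (data and names are made up for illustration; predictors are assumed already centered/scaled):

import numpy as np

def ridge_coefficients(X, y, lam):
    """a(ridge) = (X'X + λI)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.3, 0.0]) + rng.normal(scale=0.2, size=100)
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge_coefficients(X, y, lam), 3))   # coefficients shrink toward zero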

Page 13

Path-Centric Approach

• Instead of focusing on the lambda, observe that the solutions traverse a path in the parameter space
• The terminating points of the path are always known:
  o The OLS solution, corresponding to λ = 0
  o The zero-coefficient solution, corresponding to λ = ∞
• Start at either of the terminating points and work out an update algorithm to find the next point along the path
  o Starting at the OLS end requires finding the OLS solution first, which may be time consuming
  o Starting at the zero end is more convenient and allows early path termination upon reaching an excessive compute burden or acceptable model performance
• Every point on the path maps (implicitly) to a monotone sequence of lambdas

Page 14

Regularized Regression – Practical Algorithm

• Start with the zero-coefficient solution
• Run a series of iterations
  o Update one of the coefficients by a small amount
    • If the selected coefficient was zero, a new variable effectively enters the model
    • If the selected coefficient was not zero, the model is simply updated
• The end result is a collection of linear models which can be visualized as a path in the parameter space (a toy path-building sketch follows the tables below)

Introducing a New Variable (X4: 0.0 → 0.1)
  Current Model:  X1 0.0  X2 0.0  X3 0.2  X4 0.0  X5 0.4  X6 0.5  X7 0.0  X8 0.0
  Next Model:     X1 0.0  X2 0.0  X3 0.2  X4 0.1  X5 0.4  X6 0.5  X7 0.0  X8 0.0

Updating an Existing Model (X3: 0.2 → 0.3)
  Current Model:  X1 0.0  X2 0.0  X3 0.2  X4 0.0  X5 0.4  X6 0.5  X7 0.0  X8 0.0
  Next Model:     X1 0.0  X2 0.0  X3 0.3  X4 0.1  X5 0.4  X6 0.5  X7 0.0  X8 0.0
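To make the incremental-update idea concrete, here is an illustrative toy that builds a path by repeatedly nudging the single most promising coefficient. It is a sketch of the general idea (essentially forward-stagewise regression), not Friedman's GPS implementation:

import numpy as np

def build_path(X, y, step=0.01, n_steps=300):
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        grad = -X.T @ (y - X @ beta) / n      # gradient of 0.5*MSE w.r.t. each coefficient
        j = int(np.argmax(np.abs(grad)))      # most promising coefficient
        beta[j] -= step * np.sign(grad[j])    # small update; a zero coefficient "enters" here
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([0, 0, 0.3, 0.1, 0.4, 0.5, 0, 0]) + rng.normal(scale=0.1, size=200)
print(np.round(build_path(X, y)[-1], 2))      # the final point approaches the OLS end of the path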

Page 15

Path Building Process

• Elasticity parameter – controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
  o Elasticity = 2 – fast approximation of Ridge regression: introduces variables as quickly as possible and then jointly varies the magnitudes of the coefficients – lowest degree of compression
  o Elasticity = 1 – fast approximation of Lasso regression: introduces variables sparingly, letting the currently active variables develop their coefficients – good degree of compression versus accuracy
  o Elasticity = 0 – fast approximation of Stepwise regression: introduces new variables only after the currently active variables are fully developed – excellent degree of compression but may lose accuracy

Variable Selection Strategy along the path:
  Zero-Coefficient Model (λ = ∞) → a variable is added → sequence of 1-variable models → a variable is added → sequence of 2-variable models → a variable is added → sequence of 3-variable models → … → Final OLS Solution (λ = 0)

Page 16

Points Versus Steps

• Each path will have a different number of steps
• To facilitate model comparison among different paths, the Point Selection Strategy extracts a fixed collection of models into a points grid
  o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison

[Diagram: Paths 1, 2, and 3 each run from the zero solution to the OLS solution in differing numbers of steps; the point selection strategy maps each path onto a common grid of points 1–10.]

Page 17

OLS versus GPS

• GPS (Generalized Path Seeker) introduced by Jerome Friedman in 2008 ("Fast Sparse Regression and Classification")
• Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitudes of the coefficients
• The optimal model of any desirable size can then be selected based on its performance on the TEST sample

[Diagram: OLS regression optimizes on the learn sample (X1, X2, X3, X4, X5, X6, …) and ignores the test sample, producing a single sequence of 1-, 2-, 3-variable models. GPS regression learns from the learn sample a large collection of linear models (paths) – 1-, 2-, 3-variable models with varying coefficients – and selects the optimal model on the test sample.]

Page 18

Boston Housing Data Set

• Concerns housing values in the Boston area
• Harrison, D. and D. Rubinfeld, "Hedonic Prices and the Demand for Clean Air", Journal of Environmental Economics and Management, v5, 81-102, 1978
• Combined information from 10 separate governmental and educational sources to produce this data set
• 506 census tracts in the City of Boston for the year 1970
  o Goal: study the relationship between quality-of-life variables and property values
  o MV     median value of owner-occupied homes in tract ($1,000s)
  o CRIM   per capita crime rate
  o NOX    concentration of nitric oxides (parts per 10 million)
  o AGE    percent built before 1940
  o DIS    weighted distance to centers of employment
  o RM     average number of rooms per house
  o LSTAT  % lower status of the population
  o RAD    accessibility to radial highways
  o CHAS   borders Charles River (0/1)
  o INDUS  percent non-retail business
  o TAX    property tax rate per $10,000
  o PT     pupil-teacher ratio

Page 19

OLS on Boston Data

• 414 records in the learn sample
• 92 records in the test sample
• Good agreement between the two:
  o LEARN MSE = 27.455
  o TEST MSE = 26.147

[3-variable solution; reported coefficients: -0.597, +5.247, -0.858]

Page 20

Paths Produced by GPS

• Example of 21 paths with different variable selection strategies

Page 21

Path Points on Boston Data

• Each path uses a different variable selection strategy and separate coefficient updates

[Figure: path development shown at points 30, 100, 150, and 190]

Page 22

GPS on Boston Data: 3-variable Solution

• 414 records in the learn sample
• 92 records in the test sample
• 15% performance improvement on the test sample
  o TEST MSE = 22.669 (the OLS TEST MSE was 26.147)

[Reported coefficients: +5.247, -0.858, -0.597; OLS TEST MSE of 26.147 shown for comparison]

Page 23

Key Problems with GPS

• The GPS model is still a linear regression!
• The response surface is still a global hyper-plane
• It is incapable of discovering local structure in the data
• Look for non-linear algorithms that build the response surface locally, based on the data itself
  o By trying all possible data cuts as local boundaries
  o By fitting first-order adaptive splines locally

Page 24

Motivation for MARS (Multivariate Adaptive Regression Splines)

• Developed by Jerome H. Friedman in the late 1980s
• After his work on CART (for classification)
• Adapts many CART ideas for regression
  o Automatic variable selection
  o Automatic missing value handling
  o Allows for nonlinearity
  o Allows for interactions
  o Leverages the power of regression where linearity can be exploited

Page 25

From Linear to Non-linear

• Classical regression and regularized regression build globally linear models
• Further accuracy can be achieved by building locally linear models nicely connected to each other at boundary points called knots

[Figure: MV versus LSTAT for the Boston data – a single global regression line on the left, a localized piecewise-linear fit with knots on the right]

Page 26

Key Concept for a Spline is the "Knot"

• A knot marks the end of one region of data and the beginning of another
• A knot is where the behavior of the function changes
• In a classical spline, knot positions are predetermined and are often evenly spaced
• In MARS, knots are determined by a search procedure
• Only as many knots as needed end up in the MARS model
• If a straight line is an adequate fit, there will be no interior knots
  o In MARS there is always at least one knot
  o It could correspond to the smallest observed value of the predictor

Page 27

Placement of Knots

• With only one predictor and one knot to select, placement is straightforward:
  o Test every possible knot location
  o Choose the model with the best fit (smallest SSE)
  o Perhaps constrain by requiring a minimum amount of data in each interval
    • Prevents an interior knot from being placed too close to a boundary
• For computational efficiency, knots are always placed exactly at observed predictor values
  o This can cause rare modeling artifacts that sometimes appear due to the discrete nature of the data

Page 28

Finding Knots Automatically

• Stage-wise knot placement process on a flat-top function

[Figure: Y versus X for a flat-top function; the true knots are shown alongside knots 1–6 placed sequentially by the search]

Page 29

Basis Functions

• The knot selection example works very well to illustrate splines in one dimension
• Thinking in terms of knot locations is unwieldy when working with a large number of variables simultaneously
  o We need concise notation and programming expressions that are easy to manipulate
  o It is not clear how to construct or represent interactions using knot locations
• Basis Functions (BFs) provide the analytical machinery to express the knot placement strategy
• MARS creates sets of basis functions to decompose the information in each variable individually

Page 30

The Hockey Stick Basis Function

• The hockey stick basis function is the core MARS building block
  o It can be applied to a single variable multiple times
• Hockey stick function:
  o Direct: max(0, X − c)
  o Mirror: max(0, c − X)
  o Maps variable X to a new variable X*
  o X* = 0 for all values of X up to some threshold value c
  o X* = X − c for all values of X greater than c
    • i.e., the amount by which X exceeds the threshold c (see the sketch below)
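The direct and mirror hockey-stick basis functions defined above, as a minimal sketch:

import numpy as np

def bf_direct(x, c):
    """max(0, X - c): zero up to the knot c, then rises with slope 1."""
    return np.maximum(0.0, x - c)

def bf_mirror(x, c):
    """max(0, c - X): the mirror image, non-zero only below the knot c."""
    return np.maximum(0.0, c - x)

x = np.arange(0, 101, 10, dtype=float)
print(bf_direct(x, 40))   # [0 0 0 0 0 10 20 30 40 50 60]
print(bf_mirror(x, 40))   # [40 30 20 10 0 0 0 0 0 0 0]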

Page 31

Basis Function Example

• X ranges from 0 to 100
• 8 basis functions displayed (c = 10, 20, 30, …, 80)

[Figure: the eight hockey-stick basis functions BF10–BF80 plotted against X]

Page 32

Basis Functions: Separate Displays

• Each function is graphed with the same dimensions
• BF10 is offset from the original value by 10
• BF80 is zero for most of its range
• Basis functions can be constructed for any value of c
• MARS considers constructing one for EVERY actual data value

Page 33

Tabular Display of Basis Functions

• Each new BF results in a different number of zeroes in the transformed variable
• The resulting collection is resistant to multicollinearity issues
• Example: three basis functions with knots at 25, 55, and 65 (sketched below)
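A tiny sketch of that tabular view, assuming the same three knots (the sample X values are made up for illustration):

import numpy as np

x = np.array([10, 25, 40, 55, 65, 80], dtype=float)
for c in (25, 55, 65):
    print(f"BF{c}:", np.maximum(0.0, x - c))   # one transformed column per knot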

Page 34

Spline with 1 Basis Function

MV = 27.395 − 0.659 * (INDUS − 4)+

[Figure: MV versus INDUS; slope = 0 below the knot at INDUS = 4, slope = −0.659 above it]

Page 35

Spline with 2 Basis Functions

MV = 30.290 − 2.439 * (INDUS − 4)+ + 2.215 * (INDUS − 8)+

[Figure: MV versus INDUS; slope = 0 below the knot at 4, slope = −2.439 between knots 4 and 8, slope = −2.439 + 2.215 = −0.224 above the knot at 8]

Page 36

Adding the Mirror Image BF

• A standard basis function (X − knot)+ does not provide for a non-zero slope for values below the knot
• To handle this, MARS uses a "mirror image" basis function, which looks at the interval of variable X lying below the threshold c; the resulting spline is evaluated in the sketch below:

MV = 29.433 + 0.925 * (4 − INDUS)+ − 2.180 * (INDUS − 4)+ + 1.939 * (INDUS − 8)+

[Figure: MV versus INDUS; slope = −0.925 below the knot at 4, slope = −2.180 between knots 4 and 8, slope = −2.180 + 1.939 = −0.241 above the knot at 8]
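A quick sketch that evaluates the three-basis-function spline above at a few INDUS values (the coefficients are taken from the slide; the helper names are mine):

import numpy as np

def pos(z):
    """The (·)+ operator: max(0, z)."""
    return np.maximum(0.0, z)

def mv_hat(indus):
    indus = np.asarray(indus, dtype=float)
    return 29.433 + 0.925 * pos(4 - indus) - 2.180 * pos(indus - 4) + 1.939 * pos(indus - 8)

print(np.round(mv_hat([0, 4, 8, 20]), 3))   # predicted MV at a few INDUS values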

Page 37

MARS Algorithm

• MARS stands for Multivariate Adaptive Regression Splines
• Forward stage:
  o Add pairs of BFs (a direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner (see the sketch after this list)
  o The process stops once a user-specified upper limit is reached
  o Possible linear dependency is handled automatically by discarding redundant BFs
• Backward stage:
  o Remove BFs one at a time in a step-wise regression manner
  o This creates a sequence of candidate models of varying complexity
• Selection stage:
  o Select the optimal model based on TEST sample performance (modern approach)
  o Select the optimal model based on the GCV criterion (legacy approach)
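A toy sketch of the forward stage: greedily add the direct/mirror basis-function pair (variable, knot) that most reduces training SSE. This is only illustrative — the real MARS search is far more refined (fast updates, interaction terms, backward pruning, GCV):

import numpy as np

def forward_stage(X, y, max_pairs=5):
    n, p = X.shape
    B = [np.ones(n)]                                   # current basis: intercept only
    for _ in range(max_pairs):
        best = None
        for j in range(p):
            for c in np.unique(X[:, j]):               # knots at observed values
                cand = np.column_stack(B + [np.maximum(0, X[:, j] - c),
                                            np.maximum(0, c - X[:, j])])
                coef, *_ = np.linalg.lstsq(cand, y, rcond=None)
                sse = np.sum((y - cand @ coef) ** 2)
                if best is None or sse < best[0]:
                    best = (sse, j, c)
        _, j, c = best
        B += [np.maximum(0, X[:, j] - c), np.maximum(0, c - X[:, j])]
    return np.column_stack(B)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = np.maximum(0, X[:, 0] - 4) - 2 * np.maximum(0, X[:, 1] - 6) + rng.normal(scale=0.2, size=150)
basis = forward_stage(X, y, max_pairs=2)
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print("fitted coefficients:", np.round(coef, 2))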

Page 38

MARS on Boston Data: 9-BF (7-variable) Solution

• 414 records in the learn sample
• 92 records in the test sample
• 40% performance improvement on the test sample
  o TEST MSE = 15.749 (the OLS TEST MSE was 26.147)

Page 39

Non-linear Response Surface

• MARS automatically determined the transition points between the various local regions
• This model provides major insights into the nature of the relationship

Page 40

200 Replications

• All of the above models were repeated on 200 randomly selected 20% test partitions
• GPS shows a marginal performance improvement but a much smaller model
• MARS shows a dramatic performance improvement

[Figure: test-performance distributions across the 200 replications for Regression, GPS, and MARS]

Page 41

Regression Trees

• Regression trees result in piece-wise constant models (a multi-dimensional staircase) on an orthogonal partition of the data space
  o Thus usually not the best possible performer in terms of conventional regression loss functions
• Only a very limited number of controls is available to influence the modeling process
  o Priors and costs are no longer applicable
  o There are two splitting rules: LS and LAD
• Very powerful in capturing high-order interactions but somewhat weak in explaining simple main effects

Page 42

Split Improvement

• Compute the SSE of the parent node and of the left and right child nodes
• Improvement = the reduction in SSE when the parent's single constant prediction is replaced by separate constants in the two children; one can show this is never negative
• Find the split with the largest improvement by conducting an exhaustive search over all splits in the parent node (see the sketch below)
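A minimal sketch of that exhaustive search: score every candidate split of a single node by the reduction in sum of squared errors (the data and names are made up for illustration):

import numpy as np

def sse(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, improvement) of the split x <= threshold with the largest SSE reduction."""
    parent = sse(y)
    best_thr, best_imp = None, 0.0
    for thr in np.unique(x)[:-1]:                       # candidate split points
        left, right = y[x <= thr], y[x > thr]
        imp = parent - (sse(left) + sse(right))         # improvement is always >= 0
        if imp > best_imp:
            best_thr, best_imp = float(thr), imp
    return best_thr, best_imp

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x > 6, 5.0, 1.0) + rng.normal(scale=0.3, size=200)
print(best_split(x, y))                                  # the threshold should land near 6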

Page 43: September 2014 Dan Steinberg Mikhail Golovnya Salford Systems Salford Systems ©2014 Introduction to Modern Regression1 Introduction to Modern Regression:

Salford Systems ©2014 Introduction to Modern Regression43

Splitting the Root Node

• Improvement is defined in terms of the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants on each side

Page 44

Regression Tree Model

• All cases in a given node are assigned the same predicted response – the node average of the original target
• Nodes are color-coded according to the predicted response
• We obtain a convenient segmentation of the population according to average response levels

Page 45

The Best and the Worst Segments

Page 46

Stochastic Gradient Boosting

• A new approach to machine learning / function approximation developed by Jerome H. Friedman at Stanford University
  o Co-author of CART® with Breiman, Olshen and Stone
  o Author of MARS®, PRIM, Projection Pursuit
• Also known as Treenet®
• Good for classification and regression problems
• Built on small trees and thus:
  o Fast and efficient
  o Data driven
  o Immune to outliers
  o Invariant to monotone transformations of variables
• Resistant to over-training – generalizes very well
• Can be remarkably accurate with little effort
• BUT the resulting model may be very complex

Page 47

The Algorithm

• Begin with a very small tree as the initial model
  o Could be as small as ONE split generating 2 terminal nodes
  o A typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
  o The model is intentionally "weak"
• Compute "residuals" (prediction errors) for this simple model for every record in the data
• Grow a second small tree to predict the residuals from the first tree
• Compute residuals from this new 2-tree model and grow a 3rd tree to predict the revised residuals
• Repeat this process to grow a sequence of trees (see the sketch below)

  Tree 1 + Tree 2 + Tree 3 + more trees …
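A bare-bones sketch of the boosting loop above for squared-error loss: each small tree is fit to the residuals of the current ensemble. This illustrates the idea only, not the TreeNet engine (no subsampling, influence trimming, or other refinements), and assumes scikit-learn is available:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learn_rate=0.1, max_leaf_nodes=4):
    pred = np.full(len(y), y.mean())          # start from the overall mean
    trees = []
    for _ in range(n_trees):
        resid = y - pred                      # residuals of the current model
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, resid)
        pred += learn_rate * t.predict(X)     # add the new small tree, damped by the learn rate
        trees.append(t)
    return trees, pred

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]                         # the saddle / XOR surface from the next slide
trees, fit = boost(X, y)
print("train MSE:", round(float(np.mean((y - fit) ** 2)), 3))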

Page 48

Illustration: Saddle Function

• 500 {X1, X2} points randomly drawn from a [-3, +3] box to produce the XOR response surface Y = X1 * X2
• 3-node trees are used to show the evolution of the Treenet response surface

[Figure: the response surface after 1, 2, 3, 4, 10, 20, 30, 40, 100, and 195 trees]

Page 49

Notes on the Treenet Solution

• The solution evolves slowly and usually includes hundreds or even thousands of small trees
• The process is myopic – only the next best tree given the current set of conditions is added
• There is a high degree of similarity and overlap among the resulting trees
• Very large tree sequences make model scoring time- and resource-intensive
• Thus, there is an ever-present need to simplify (reduce) model complexity

Page 50

A Tree is a Variable Transformation

• Any tree in a Treenet model can be represented by a derived continuous variable that is a function of the inputs: TREE_1 = F(X1, X2) (written out in code below)

Example tree:
  Node 1 (Avg = -0.105, N = 506)
    X1 <= 1.47 → Node 2 (Avg = -0.068, N = 476)
      X2 <= -1.83 → Terminal Node 1 (Avg = -0.250, N = 15)
      X2 >  -1.83 → Terminal Node 2 (Avg = -0.062, N = 461)
    X1 >  1.47 → Terminal Node 3 (Avg = -0.695, N = 30)
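The example tree written out as the derived variable TREE_1 = F(X1, X2): each case is mapped to the average response of the terminal node it falls into (node values taken from the slide):

def tree_1(x1, x2):
    if x1 > 1.47:
        return -0.695        # Terminal Node 3
    if x2 <= -1.83:
        return -0.250        # Terminal Node 1
    return -0.062            # Terminal Node 2

print(tree_1(2.0, 0.0), tree_1(0.5, -2.0), tree_1(0.5, 0.0))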

Page 51

ISLE Compression of Treenet

• The original Treenet model combines all trees with equal coefficients
• ISLE accomplishes model compression by removing redundant trees and changing the relative contribution of the remaining trees by adjusting their coefficients
• Regularized regression provides the required machinery to accomplish this task (see the sketch below)

  TN Model:    1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
  ISLE Model:  1.3  0.0  0.4  0.0  0.0  1.8  1.2  0.3  0.0  0.0  0.0  0.1  0.0
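An ISLE-style post-processing sketch: treat each tree's prediction as a derived variable and run an L1-penalized (lasso) regression to re-weight the trees; many coefficients go to zero, which drops those trees from the compressed model. Illustrative only (SPM uses GPS for this step), and assumes scikit-learn is available:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]

# grow a boosted sequence of small trees (same loop idea as on page 47)
pred, trees = np.full(len(y), y.mean()), []
for _ in range(50):
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y - pred)
    pred += 0.1 * t.predict(X)
    trees.append(t)

T = np.column_stack([t.predict(X) for t in trees])   # one derived column per tree
isle = Lasso(alpha=0.01).fit(T, y)                   # re-weight / zero out trees
print("trees kept:", int(np.count_nonzero(isle.coef_)), "of", len(trees))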

Page 52

A Node is a Variable Transformation

• Any node in a Treenet model can be represented by a derived dummy variable that is a function of the inputs: NODE_X = F(X1, X2)
• The example tree from the previous page applies here as well: each internal or terminal node defines a 0/1 indicator of whether a case reaches that node

Page 53

RuleLearner Compression

• Create an exhaustive set of dummy variables for every node (internal and terminal) and every tree in a TN model
• Run GPS regularized regression to extract an informative subset of node dummies along with their modeling coefficients
• Model compression is thus achieved by eliminating redundant nodes
• Each selected node dummy represents a specific rule-set which can be interpreted directly for further insight (see the sketch below)

  TN Model: T1_N1 + T1_T1 + T1_T2 + T1_T3 + T2_N1 + T2_T1 + T2_T2 + T2_T3 + T3_N1 + T3_T1 + …
  GPS regression selects a subset of these node dummies, each paired with a coefficient:
    Coefficient 1 · Rule-set 1,  Coefficient 2 · Rule-set 2,  Coefficient 3 · Rule-set 3,  Coefficient 4 · Rule-set 4
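A RuleLearner-style sketch: build a 0/1 dummy for every node of every small tree (1 if a case passes through that node), then use an L1 regression to keep only an informative subset of node dummies. Illustrative only (SPM uses GPS for this step), and assumes scikit-learn is available:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]

# a few small trees fit to residuals, as in the boosting loop on page 47
pred, trees = np.full(len(y), y.mean()), []
for _ in range(20):
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y - pred)
    pred += 0.1 * t.predict(X)
    trees.append(t)

# node-indicator dummies: decision_path marks every node (internal and terminal) a case visits
D = np.hstack([t.decision_path(X).toarray() for t in trees])
rule_model = Lasso(alpha=0.005).fit(D, y)
print("node dummies kept:", int(np.count_nonzero(rule_model.coef_)), "of", D.shape[1])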

Page 54

Salford Predictive Modeler (SPM)

• Download a current version from our website: http://www.salford-systems.com
• The version will run without a license key for 10 days
• Request a license key: [email protected]
• Request a configuration to meet your needs
  o Data handling capacity
  o Data mining engines made available