
Page 1

Introduction to Modern Regression: From OLS to GPS® to MARS®

Dan Steinberg, Mikhail Golovnya
Salford Systems
September 2014

Page 2

Course Outline

Today's Topics
• Regression Problem – quick overview
• Classical OLS/LOGIT – the starting point
• RIDGE/LASSO/GPS – regularized regression
• MARS – adaptive non-linear regression

Next Week's Topics
• Regression Trees
• GBM – stochastic gradient boosting
• ISLE/RULE-LEARNER – model compression

Page 3

Regression

• Regression analysis is at least 200 years old
• Surely the most widely used predictive modeling technique (including logistic regression)
• The American Statistical Association reports 18,900 members
• The Bureau of Labor Statistics counted more than 22,000 statisticians in the U.S. workforce in 2008
• Many other professionals involved in sophisticated data analysis are not included in these counts
  o Statistical specialists in scientific disciplines such as economics, medicine, bioinformatics
  o Machine learning specialists, "data scientists", database experts
  o Market researchers studying traditional targeted marketing
  o Web analytics, social media analytics, text analytics
• Few of these other researchers call themselves statisticians, but many make extensive use of variations of regression

Page 4

Regression Challenges

• Preparation of data – errors, missing values, etc.
• Determination of predictors to include in the model
  o Hundreds, thousands, even tens or hundreds of thousands may be available
• Transformation or coding of predictors
  o Conventional approaches consider logarithm, power, inverse, etc.
• Detecting and modeling important interactions
• Possibly huge numbers of records
  o Super-large samples render all predictors "significant"
• Complexity of the underlying relationship
• Lack of external knowledge

Page 5

OLS Regression

• OLS – ordinary least squares regression
  o Discovered by Legendre (1805) and Gauss (1809) to solve problems in astronomy using pen and paper
  o Given a solid statistical foundation by Fisher in the 1920s
  o 1950s – use of electro-mechanical calculators
• The model is always of the form (see the numeric sketch below):

  Response = A + B1·X1 + B2·X2 + B3·X3 + …

• The response surface is a hyper-plane!
• A – the intercept term
• B1, B2, B3, … – parameter estimates
• A unique combination of values usually exists which minimizes the mean squared error of predictions on the learn sample
• Step-wise approaches are used to determine model size
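A minimal numeric sketch of the OLS model above, fitting A and B1–B3 by least squares with NumPy (the data and variable names are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # three predictors X1, X2, X3
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

X1 = np.column_stack([np.ones(len(X)), X])     # prepend a column of 1s for the intercept A
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares solution [A, B1, B2, B3]
print("intercept:", coef[0], "slopes:", coef[1:])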

Page 6

Logistic Regression

• Models a Bernoulli response (binary outcome)
• The log-odds of the probability of the event of interest is a linear combination of predictors:

  F(X) = Log[p/(1-p)] = A + B1·X1 + B2·X2 + B3·X3 + …

• The solution minimizes the loss (or, equivalently, maximizes the logistic likelihood) on the available data; assuming the response is coded as +1 and -1:

  Loss = Σ(i=1..N) Log[1 + exp(−Yi·Fi)]

• The solution has no closed form and is usually found by a series of iterations using Newton's algorithm (a small sketch of the loss follows below)
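A short sketch of the logistic loss above for +1/−1 labels (illustrative only; the variable names are hypothetical):

import numpy as np

def logistic_loss(beta, X, y):
    """Sum of log(1 + exp(-y_i * F_i)) with F = A + B·X."""
    F = beta[0] + X @ beta[1:]
    return np.sum(np.log1p(np.exp(-y * F)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0, 1.0, -1.0)
print(logistic_loss(np.array([0.0, 1.0, -0.5]), X, y))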

Page 7

Key Problems with OLS and LOGIT

• In addition to the challenges already mentioned above, note the following features:
  o OLS/LOGIT optimizes a specific loss function (mean squared error / log-likelihood) using all available data (the learn sample)
  o The solution over-fits to the learn sample
  o The solution becomes unstable in the presence of collinearity
  o A unique solution does not exist when the data becomes wide
• Look for alternative strategies to construct useful linear combinations of predictors:
  o By jointly optimizing MSE and the L2-norm of the coefficients – Ridge regression
  o By jointly optimizing MSE and the L1-norm of the coefficients – Lasso regression
  o A hybrid of the two above
  o All of the above plus extensions into very sparse solutions – Generalized Path Seeker (GPS)

Page 8

Motivation for Regularized Regression

• Unsatisfactory regression results on data modeling physical processes (1970) when predictors were correlated
  o Coefficient estimates could change dramatically with small changes in the data
  o Some coefficients judged to be much too large
  o Frequent appearance of coefficients with the "wrong" sign
  o Problem severe with substantial multicollinearity but always present
• The solution proposed by Hoerl and Kennard (Technometrics, 1970) was "ridge regression"
  o First proposed by Hoerl in 1962, just for stabilization of coefficients
  o Initially poorly received by the academic statistics profession
• Ridge is intentionally biased but yields more satisfactory coefficient estimates and superior generalization
  o Better performance (test MSE) on previously unseen data (lower variance)
• "Shrinkage" of regression coefficients towards zero
• The OLS coefficient vector length is biased upwards

Page 9

Lasso Regularized Regression

• Introduced by Tibshirani in 1996 explicitly as an improvement on ridge regression
• Least Absolute Shrinkage and Selection Operator
• Desire to gain the stability and lower variance of ridge regression while also performing variable selection
• Especially in the context of many possible predictors, looking for a simple, stable, low-predictive-variance model
• Historical note: the Lasso was inspired by related 1993 work by Leo Breiman (of CART and RandomForests fame) – the "non-negative garrote"
• Breiman's simulation studies showed the potential for improved prediction via selection and shrinkage

Page 10

Regularized Regression – Theory

• The regularized regression approach balances model performance and model complexity
• λ – regularization parameter
  o λ = ∞  zero-coefficient solution
  o λ = 0  OLS solution
• The model complexity expression defines the type of model

  OLS Regression:          Minimize  (Mean Squared Error)
  Regularized Regression:  Minimize  (Mean Squared Error + λ × Model Complexity)

  Model Complexity:
    Ridge: sum of squared coefficients
    Lasso: sum of absolute coefficients
    Best Subsets: number of coefficients

Page 11

Penalty Function – Model Complexity

• RIDGE penalty: Σ aj^2  (squared)
• LASSO penalty: Σ |aj|  (absolute value)
• COMPACT penalty: Σ |aj|^0  (count of non-zero coefficients)
• Extended power family: Σ |aj|^γ, 0 ≤ γ ≤ 2
• Elastic net family: Σ {(β − 1)·aj^2/2 + (2 − β)·|aj|}, 1 ≤ β ≤ 2
• The regularization parameter λ multiplies a penalty function of the coefficient vector a
• Each elasticity is based on fitting a model that minimizes the sum (residual sum of squared errors + penalty)
• Intermediate elasticities are mixtures, e.g. a 50/50 mix of RIDGE and LASSO (γ = 1.5)
• GPS extends beyond the "elastic net" of Tibshirani/Hastie to include the "other half" of the power family (0 ≤ γ ≤ 1); these penalty families are sketched in code below
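A small sketch of the penalty families listed above (my own helper function, not the GPS code):

import numpy as np

def penalty(a, kind="ridge", gamma=1.5, beta=1.5):
    a = np.asarray(a, dtype=float)
    if kind == "ridge":                      # Σ a_j^2
        return np.sum(a ** 2)
    if kind == "lasso":                      # Σ |a_j|
        return np.sum(np.abs(a))
    if kind == "compact":                    # Σ |a_j|^0 = number of non-zero coefficients
        return np.count_nonzero(a)
    if kind == "power":                      # Σ |a_j|^γ, 0 ≤ γ ≤ 2
        return np.sum(np.abs(a) ** gamma)
    if kind == "elastic":                    # Σ {(β−1)a_j²/2 + (2−β)|a_j|}, 1 ≤ β ≤ 2
        return np.sum((beta - 1) * a ** 2 / 2 + (2 - beta) * np.abs(a))
    raise ValueError(kind)

coefs = [0.0, 1.2, -0.4, 0.0, 2.0]
for k in ("ridge", "lasso", "compact", "power", "elastic"):
    print(k, penalty(coefs, kind=k))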

Page 12

Lambda-Centric Approach

• A number of regularized regression algorithms were developed to find the values of the coefficients for a given lambda
  o Ridge regression: a(ridge) = (X'X + λI)^(-1) X'y (see the sketch below)
  o LARS algorithm and its modification to obtain the LASSO regression
  o Elastic Net algorithm to generate a family of solutions "between" LASSO and RIDGE regressions that closely approximate the power family:

    Pβ(a) = Σ {(β − 1)·aj^2/2 + (2 − β)·|aj|}, 1 ≤ β ≤ 2
    Ridge: β = 2.0    Lasso: β = 1.0

• All of these approaches are lambda-centric: pick any combination of beta and lambda (usually on a user-defined grid) and then solve for the coefficients
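A minimal sketch of the closed-form ridge solution above, showing how the coefficients shrink as λ grows (data and names are made up for illustration; predictors are assumed already centered/scaled):

import numpy as np

def ridge_coefficients(X, y, lam):
    """a(ridge) = (X'X + λI)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.3, 0.0]) + rng.normal(scale=0.2, size=100)
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge_coefficients(X, y, lam), 3))   # coefficients shrink toward zero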

Page 13

Path-Centric Approach

• Instead of focusing on the lambda, observe that the solutions traverse a path in the parameter space
• The terminating points of the path are always known:
  o The OLS solution, corresponding to λ = 0
  o The zero-coefficient solution, corresponding to λ = ∞
• Start at either of the terminating points and work out an update algorithm to find the next point along the path
  o Starting at the OLS end requires finding the OLS solution first, which may be time consuming
  o Starting at the zero end is more convenient and allows early path termination upon reaching an excessive compute burden or acceptable model performance
• Every point on the path maps (implicitly) to a monotone sequence of lambdas

Page 14

Regularized Regression – Practical Algorithm

• Start with the zero-coefficient solution
• Run a series of iterations
  o Update one of the coefficients by a small amount
    • If the selected coefficient was zero, a new variable effectively enters the model
    • If the selected coefficient was not zero, the model is simply updated
• The end result is a collection of linear models which can be visualized as a path in the parameter space (a toy path-building sketch follows the tables below)

Introducing a New Variable (X4: 0.0 → 0.1)
  Current Model:  X1 0.0  X2 0.0  X3 0.2  X4 0.0  X5 0.4  X6 0.5  X7 0.0  X8 0.0
  Next Model:     X1 0.0  X2 0.0  X3 0.2  X4 0.1  X5 0.4  X6 0.5  X7 0.0  X8 0.0

Updating an Existing Model (X3: 0.2 → 0.3)
  Current Model:  X1 0.0  X2 0.0  X3 0.2  X4 0.0  X5 0.4  X6 0.5  X7 0.0  X8 0.0
  Next Model:     X1 0.0  X2 0.0  X3 0.3  X4 0.1  X5 0.4  X6 0.5  X7 0.0  X8 0.0
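To make the incremental-update idea concrete, here is an illustrative toy that builds a path by repeatedly nudging the single most promising coefficient. It is a sketch of the general idea (essentially forward-stagewise regression), not Friedman's GPS implementation:

import numpy as np

def build_path(X, y, step=0.01, n_steps=300):
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        grad = -X.T @ (y - X @ beta) / n      # gradient of 0.5*MSE w.r.t. each coefficient
        j = int(np.argmax(np.abs(grad)))      # most promising coefficient
        beta[j] -= step * np.sign(grad[j])    # small update; a zero coefficient "enters" here
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([0, 0, 0.3, 0.1, 0.4, 0.5, 0, 0]) + rng.normal(scale=0.1, size=200)
print(np.round(build_path(X, y)[-1], 2))      # the final point approaches the OLS end of the path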

Page 15

Path Building Process

• Elasticity parameter – controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
  o Elasticity = 2 – fast approximation of Ridge regression: introduces variables as quickly as possible and then jointly varies the magnitudes of the coefficients – lowest degree of compression
  o Elasticity = 1 – fast approximation of Lasso regression: introduces variables sparingly, letting the currently active variables develop their coefficients – good degree of compression versus accuracy
  o Elasticity = 0 – fast approximation of Stepwise regression: introduces new variables only after the currently active variables are fully developed – excellent degree of compression but may lose accuracy

Variable Selection Strategy along the path:
  Zero-Coefficient Model (λ = ∞) → a variable is added → sequence of 1-variable models → a variable is added → sequence of 2-variable models → a variable is added → sequence of 3-variable models → … → Final OLS Solution (λ = 0)

Page 16

Points Versus Steps

• Each path will have a different number of steps
• To facilitate model comparison among different paths, the Point Selection Strategy extracts a fixed collection of models into a points grid
  o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison

[Diagram: Paths 1, 2, and 3 each run from the zero solution to the OLS solution in differing numbers of steps; the point selection strategy maps each path onto a common grid of points 1–10.]

Page 17

OLS versus GPS

• GPS (Generalized Path Seeker) introduced by Jerome Friedman in 2008 ("Fast Sparse Regression and Classification")
• Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitudes of the coefficients
• The optimal model of any desirable size can then be selected based on its performance on the TEST sample

[Diagram: OLS regression optimizes on the learn sample (X1, X2, X3, X4, X5, X6, …) and ignores the test sample, producing a single sequence of 1-, 2-, 3-variable models. GPS regression learns from the learn sample a large collection of linear models (paths) – 1-, 2-, 3-variable models with varying coefficients – and selects the optimal model on the test sample.]

Page 18

Boston Housing Data Set

• Concerns housing values in the Boston area
• Harrison, D. and D. Rubinfeld, "Hedonic Prices and the Demand for Clean Air", Journal of Environmental Economics and Management, v5, 81-102, 1978
• Combined information from 10 separate governmental and educational sources to produce this data set
• 506 census tracts in the City of Boston for the year 1970
  o Goal: study the relationship between quality-of-life variables and property values
  o MV     median value of owner-occupied homes in tract ($1,000s)
  o CRIM   per capita crime rate
  o NOX    concentration of nitric oxides (parts per 10 million)
  o AGE    percent built before 1940
  o DIS    weighted distance to centers of employment
  o RM     average number of rooms per house
  o LSTAT  % lower status of the population
  o RAD    accessibility to radial highways
  o CHAS   borders Charles River (0/1)
  o INDUS  percent non-retail business
  o TAX    property tax rate per $10,000
  o PT     pupil-teacher ratio

Page 19

OLS on Boston Data

• 414 records in the learn sample
• 92 records in the test sample
• Good agreement between the two:
  o LEARN MSE = 27.455
  o TEST MSE = 26.147

[3-variable solution; reported coefficients: -0.597, +5.247, -0.858]

Page 20

Paths Produced by GPS

• Example of 21 paths with different variable selection strategies

Page 21

Path Points on Boston Data

• Each path uses a different variable selection strategy and separate coefficient updates

[Figure: path development shown at points 30, 100, 150, and 190]

Page 22

GPS on Boston Data: 3-variable Solution

• 414 records in the learn sample
• 92 records in the test sample
• 15% performance improvement on the test sample
  o TEST MSE = 22.669 (the OLS TEST MSE was 26.147)

[Reported coefficients: +5.247, -0.858, -0.597; OLS TEST MSE of 26.147 shown for comparison]

Page 23

Key Problems with GPS

• The GPS model is still a linear regression!
• The response surface is still a global hyper-plane
• It is incapable of discovering local structure in the data
• Look for non-linear algorithms that build the response surface locally, based on the data itself
  o By trying all possible data cuts as local boundaries
  o By fitting first-order adaptive splines locally

Page 24

Motivation for MARS (Multivariate Adaptive Regression Splines)

• Developed by Jerome H. Friedman in the late 1980s
• After his work on CART (for classification)
• Adapts many CART ideas for regression
  o Automatic variable selection
  o Automatic missing value handling
  o Allows for nonlinearity
  o Allows for interactions
  o Leverages the power of regression where linearity can be exploited

Page 25

From Linear to Non-linear

• Classical regression and regularized regression build globally linear models
• Further accuracy can be achieved by building locally linear models nicely connected to each other at boundary points called knots

[Figure: MV versus LSTAT for the Boston data – a single global regression line on the left, a localized piecewise-linear fit with knots on the right]

Page 26

Key Concept for a Spline is the "Knot"

• A knot marks the end of one region of data and the beginning of another
• A knot is where the behavior of the function changes
• In a classical spline, knot positions are predetermined and are often evenly spaced
• In MARS, knots are determined by a search procedure
• Only as many knots as needed end up in the MARS model
• If a straight line is an adequate fit, there will be no interior knots
  o In MARS there is always at least one knot
  o It could correspond to the smallest observed value of the predictor

Page 27

Placement of Knots

• With only one predictor and one knot to select, placement is straightforward:
  o Test every possible knot location
  o Choose the model with the best fit (smallest SSE)
  o Perhaps constrain by requiring a minimum amount of data in each interval
    • Prevents an interior knot from being placed too close to a boundary
• For computational efficiency, knots are always placed exactly at observed predictor values
  o This can cause rare modeling artifacts that sometimes appear due to the discrete nature of the data

Page 28

Finding Knots Automatically

• Stage-wise knot placement process on a flat-top function

[Figure: Y versus X for a flat-top function; the true knots are shown alongside knots 1–6 placed sequentially by the search]

Page 29

Basis Functions

• The knot selection example works very well to illustrate splines in one dimension
• Thinking in terms of knot locations is unwieldy when working with a large number of variables simultaneously
  o We need concise notation and programming expressions that are easy to manipulate
  o It is not clear how to construct or represent interactions using knot locations
• Basis Functions (BFs) provide the analytical machinery to express the knot placement strategy
• MARS creates sets of basis functions to decompose the information in each variable individually

Page 30

The Hockey Stick Basis Function

• The hockey stick basis function is the core MARS building block
  o It can be applied to a single variable multiple times
• Hockey stick function:
  o Direct: max(0, X − c)
  o Mirror: max(0, c − X)
  o Maps variable X to a new variable X*
  o X* = 0 for all values of X up to some threshold value c
  o X* = X − c for all values of X greater than c
    • i.e., the amount by which X exceeds the threshold c (see the sketch below)
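The direct and mirror hockey-stick basis functions defined above, as a minimal sketch:

import numpy as np

def bf_direct(x, c):
    """max(0, X - c): zero up to the knot c, then rises with slope 1."""
    return np.maximum(0.0, x - c)

def bf_mirror(x, c):
    """max(0, c - X): the mirror image, non-zero only below the knot c."""
    return np.maximum(0.0, c - x)

x = np.arange(0, 101, 10, dtype=float)
print(bf_direct(x, 40))   # [0 0 0 0 0 10 20 30 40 50 60]
print(bf_mirror(x, 40))   # [40 30 20 10 0 0 0 0 0 0 0]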

Page 31

Basis Function Example

• X ranges from 0 to 100
• 8 basis functions displayed (c = 10, 20, 30, …, 80)

[Figure: the eight hockey-stick basis functions BF10–BF80 plotted against X]

Page 32

Basis Functions: Separate Displays

• Each function is graphed with the same dimensions
• BF10 is offset from the original value by 10
• BF80 is zero for most of its range
• Basis functions can be constructed for any value of c
• MARS considers constructing one for EVERY actual data value

Page 33

Tabular Display of Basis Functions

• Each new BF results in a different number of zeroes in the transformed variable
• The resulting collection is resistant to multicollinearity issues
• Example: three basis functions with knots at 25, 55, and 65 (sketched below)
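A tiny sketch of that tabular view, assuming the same three knots (the sample X values are made up for illustration):

import numpy as np

x = np.array([10, 25, 40, 55, 65, 80], dtype=float)
for c in (25, 55, 65):
    print(f"BF{c}:", np.maximum(0.0, x - c))   # one transformed column per knot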

Page 34

Spline with 1 Basis Function

MV = 27.395 − 0.659 * (INDUS − 4)+

[Figure: MV versus INDUS; slope = 0 below the knot at INDUS = 4, slope = −0.659 above it]

Page 35

Spline with 2 Basis Functions

MV = 30.290 − 2.439 * (INDUS − 4)+ + 2.215 * (INDUS − 8)+

[Figure: MV versus INDUS; slope = 0 below the knot at 4, slope = −2.439 between knots 4 and 8, slope = −2.439 + 2.215 = −0.224 above the knot at 8]

Page 36

Adding the Mirror Image BF

• A standard basis function (X − knot)+ does not provide for a non-zero slope for values below the knot
• To handle this, MARS uses a "mirror image" basis function, which looks at the interval of variable X lying below the threshold c; the resulting spline is evaluated in the sketch below:

MV = 29.433 + 0.925 * (4 − INDUS)+ − 2.180 * (INDUS − 4)+ + 1.939 * (INDUS − 8)+

[Figure: MV versus INDUS; slope = −0.925 below the knot at 4, slope = −2.180 between knots 4 and 8, slope = −2.180 + 1.939 = −0.241 above the knot at 8]
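A quick sketch that evaluates the three-basis-function spline above at a few INDUS values (the coefficients are taken from the slide; the helper names are mine):

import numpy as np

def pos(z):
    """The (·)+ operator: max(0, z)."""
    return np.maximum(0.0, z)

def mv_hat(indus):
    indus = np.asarray(indus, dtype=float)
    return 29.433 + 0.925 * pos(4 - indus) - 2.180 * pos(indus - 4) + 1.939 * pos(indus - 8)

print(np.round(mv_hat([0, 4, 8, 20]), 3))   # predicted MV at a few INDUS values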

Page 37

MARS Algorithm

• MARS stands for Multivariate Adaptive Regression Splines
• Forward stage:
  o Add pairs of BFs (a direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner (see the sketch after this list)
  o The process stops once a user-specified upper limit is reached
  o Possible linear dependency is handled automatically by discarding redundant BFs
• Backward stage:
  o Remove BFs one at a time in a step-wise regression manner
  o This creates a sequence of candidate models of varying complexity
• Selection stage:
  o Select the optimal model based on TEST sample performance (modern approach)
  o Select the optimal model based on the GCV criterion (legacy approach)
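A toy sketch of the forward stage: greedily add the direct/mirror basis-function pair (variable, knot) that most reduces training SSE. This is only illustrative — the real MARS search is far more refined (fast updates, interaction terms, backward pruning, GCV):

import numpy as np

def forward_stage(X, y, max_pairs=5):
    n, p = X.shape
    B = [np.ones(n)]                                   # current basis: intercept only
    for _ in range(max_pairs):
        best = None
        for j in range(p):
            for c in np.unique(X[:, j]):               # knots at observed values
                cand = np.column_stack(B + [np.maximum(0, X[:, j] - c),
                                            np.maximum(0, c - X[:, j])])
                coef, *_ = np.linalg.lstsq(cand, y, rcond=None)
                sse = np.sum((y - cand @ coef) ** 2)
                if best is None or sse < best[0]:
                    best = (sse, j, c)
        _, j, c = best
        B += [np.maximum(0, X[:, j] - c), np.maximum(0, c - X[:, j])]
    return np.column_stack(B)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = np.maximum(0, X[:, 0] - 4) - 2 * np.maximum(0, X[:, 1] - 6) + rng.normal(scale=0.2, size=150)
basis = forward_stage(X, y, max_pairs=2)
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
print("fitted coefficients:", np.round(coef, 2))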

Page 38

MARS on Boston Data: 9-BF (7-variable) Solution

• 414 records in the learn sample
• 92 records in the test sample
• 40% performance improvement on the test sample
  o TEST MSE = 15.749 (the OLS TEST MSE was 26.147)

Page 39

Non-linear Response Surface

• MARS automatically determined the transition points between the various local regions
• This model provides major insights into the nature of the relationship

Page 40

200 Replications

• All of the above models were repeated on 200 randomly selected 20% test partitions
• GPS shows a marginal performance improvement but a much smaller model
• MARS shows a dramatic performance improvement

[Figure: test-performance distributions across the 200 replications for Regression, GPS, and MARS]

Page 41

Regression Trees

• Regression trees result in piece-wise constant models (a multi-dimensional staircase) on an orthogonal partition of the data space
  o Thus usually not the best possible performer in terms of conventional regression loss functions
• Only a very limited number of controls is available to influence the modeling process
  o Priors and costs are no longer applicable
  o There are two splitting rules: LS and LAD
• Very powerful in capturing high-order interactions but somewhat weak in explaining simple main effects

Page 42

Split Improvement

• Compute the SSE of the parent node and of the left and right child nodes
• Improvement = the reduction in SSE when the parent's single constant prediction is replaced by separate constants in the two children; one can show this is never negative
• Find the split with the largest improvement by conducting an exhaustive search over all splits in the parent node (see the sketch below)
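A minimal sketch of that exhaustive search: score every candidate split of a single node by the reduction in sum of squared errors (the data and names are made up for illustration):

import numpy as np

def sse(y):
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, improvement) of the split x <= threshold with the largest SSE reduction."""
    parent = sse(y)
    best_thr, best_imp = None, 0.0
    for thr in np.unique(x)[:-1]:                       # candidate split points
        left, right = y[x <= thr], y[x > thr]
        imp = parent - (sse(left) + sse(right))         # improvement is always >= 0
        if imp > best_imp:
            best_thr, best_imp = float(thr), imp
    return best_thr, best_imp

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x > 6, 5.0, 1.0) + rng.normal(scale=0.3, size=200)
print(best_split(x, y))                                  # the threshold should land near 6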

Page 43: September 2014 Dan Steinberg Mikhail Golovnya Salford Systems Salford Systems ©2014 Introduction to Modern Regression1 Introduction to Modern Regression:

Salford Systems ©2014 Introduction to Modern Regression43

Splitting the Root Node

• Improvement is defined in terms of the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants on each side

Page 44

Regression Tree Model

• All cases in a given node are assigned the same predicted response – the node average of the original target
• Nodes are color-coded according to the predicted response
• We obtain a convenient segmentation of the population according to average response levels

Page 45

The Best and the Worst Segments

Page 46

Stochastic Gradient Boosting

• A new approach to machine learning / function approximation developed by Jerome H. Friedman at Stanford University
  o Co-author of CART® with Breiman, Olshen and Stone
  o Author of MARS®, PRIM, Projection Pursuit
• Also known as Treenet®
• Good for classification and regression problems
• Built on small trees and thus:
  o Fast and efficient
  o Data driven
  o Immune to outliers
  o Invariant to monotone transformations of variables
• Resistant to over-training – generalizes very well
• Can be remarkably accurate with little effort
• BUT the resulting model may be very complex

Page 47

The Algorithm

• Begin with a very small tree as the initial model
  o Could be as small as ONE split generating 2 terminal nodes
  o A typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
  o The model is intentionally "weak"
• Compute "residuals" (prediction errors) for this simple model for every record in the data
• Grow a second small tree to predict the residuals from the first tree
• Compute residuals from this new 2-tree model and grow a 3rd tree to predict the revised residuals
• Repeat this process to grow a sequence of trees (see the sketch below)

  Tree 1 + Tree 2 + Tree 3 + more trees …
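A bare-bones sketch of the boosting loop above for squared-error loss: each small tree is fit to the residuals of the current ensemble. This illustrates the idea only, not the TreeNet engine (no subsampling, influence trimming, or other refinements), and assumes scikit-learn is available:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learn_rate=0.1, max_leaf_nodes=4):
    pred = np.full(len(y), y.mean())          # start from the overall mean
    trees = []
    for _ in range(n_trees):
        resid = y - pred                      # residuals of the current model
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, resid)
        pred += learn_rate * t.predict(X)     # add the new small tree, damped by the learn rate
        trees.append(t)
    return trees, pred

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]                         # the saddle / XOR surface from the next slide
trees, fit = boost(X, y)
print("train MSE:", round(float(np.mean((y - fit) ** 2)), 3))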

Page 48

Illustration: Saddle Function

• 500 {X1, X2} points randomly drawn from a [-3, +3] box to produce the XOR response surface Y = X1 * X2
• 3-node trees are used to show the evolution of the Treenet response surface

[Figure: the response surface after 1, 2, 3, 4, 10, 20, 30, 40, 100, and 195 trees]

Page 49

Notes on the Treenet Solution

• The solution evolves slowly and usually includes hundreds or even thousands of small trees
• The process is myopic – only the next best tree given the current set of conditions is added
• There is a high degree of similarity and overlap among the resulting trees
• Very large tree sequences make model scoring time- and resource-intensive
• Thus, there is an ever-present need to simplify (reduce) model complexity

Page 50

A Tree is a Variable Transformation

• Any tree in a Treenet model can be represented by a derived continuous variable that is a function of the inputs: TREE_1 = F(X1, X2) (written out in code below)

Example tree:
  Node 1 (Avg = -0.105, N = 506)
    X1 <= 1.47 → Node 2 (Avg = -0.068, N = 476)
      X2 <= -1.83 → Terminal Node 1 (Avg = -0.250, N = 15)
      X2 >  -1.83 → Terminal Node 2 (Avg = -0.062, N = 461)
    X1 >  1.47 → Terminal Node 3 (Avg = -0.695, N = 30)
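The example tree written out as the derived variable TREE_1 = F(X1, X2): each case is mapped to the average response of the terminal node it falls into (node values taken from the slide):

def tree_1(x1, x2):
    if x1 > 1.47:
        return -0.695        # Terminal Node 3
    if x2 <= -1.83:
        return -0.250        # Terminal Node 1
    return -0.062            # Terminal Node 2

print(tree_1(2.0, 0.0), tree_1(0.5, -2.0), tree_1(0.5, 0.0))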

Page 51

ISLE Compression of Treenet

• The original Treenet model combines all trees with equal coefficients
• ISLE accomplishes model compression by removing redundant trees and changing the relative contribution of the remaining trees by adjusting their coefficients
• Regularized regression provides the required machinery to accomplish this task (see the sketch below)

  TN Model:    1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
  ISLE Model:  1.3  0.0  0.4  0.0  0.0  1.8  1.2  0.3  0.0  0.0  0.0  0.1  0.0
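An ISLE-style post-processing sketch: treat each tree's prediction as a derived variable and run an L1-penalized (lasso) regression to re-weight the trees; many coefficients go to zero, which drops those trees from the compressed model. Illustrative only (SPM uses GPS for this step), and assumes scikit-learn is available:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]

# grow a boosted sequence of small trees (same loop idea as on page 47)
pred, trees = np.full(len(y), y.mean()), []
for _ in range(50):
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y - pred)
    pred += 0.1 * t.predict(X)
    trees.append(t)

T = np.column_stack([t.predict(X) for t in trees])   # one derived column per tree
isle = Lasso(alpha=0.01).fit(T, y)                   # re-weight / zero out trees
print("trees kept:", int(np.count_nonzero(isle.coef_)), "of", len(trees))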

Page 52

A Node is a Variable Transformation

• Any node in a Treenet model can be represented by a derived dummy variable that is a function of the inputs: NODE_X = F(X1, X2)
• The example tree from the previous page applies here as well: each internal or terminal node defines a 0/1 indicator of whether a case reaches that node

Page 53

RuleLearner Compression

• Create an exhaustive set of dummy variables for every node (internal and terminal) and every tree in a TN model
• Run GPS regularized regression to extract an informative subset of node dummies along with their modeling coefficients
• Model compression is thus achieved by eliminating redundant nodes
• Each selected node dummy represents a specific rule-set which can be interpreted directly for further insight (see the sketch below)

  TN Model: T1_N1 + T1_T1 + T1_T2 + T1_T3 + T2_N1 + T2_T1 + T2_T2 + T2_T3 + T3_N1 + T3_T1 + …
  GPS regression selects a subset of these node dummies, each paired with a coefficient:
    Coefficient 1 · Rule-set 1,  Coefficient 2 · Rule-set 2,  Coefficient 3 · Rule-set 3,  Coefficient 4 · Rule-set 4
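A RuleLearner-style sketch: build a 0/1 dummy for every node of every small tree (1 if a case passes through that node), then use an L1 regression to keep only an informative subset of node dummies. Illustrative only (SPM uses GPS for this step), and assumes scikit-learn is available:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]

# a few small trees fit to residuals, as in the boosting loop on page 47
pred, trees = np.full(len(y), y.mean()), []
for _ in range(20):
    t = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y - pred)
    pred += 0.1 * t.predict(X)
    trees.append(t)

# node-indicator dummies: decision_path marks every node (internal and terminal) a case visits
D = np.hstack([t.decision_path(X).toarray() for t in trees])
rule_model = Lasso(alpha=0.005).fit(D, y)
print("node dummies kept:", int(np.count_nonzero(rule_model.coef_)), "of", D.shape[1])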

Page 54

Salford Predictive Modeler (SPM)

• Download a current version from our website: http://www.salford-systems.com
• The version will run without a license key for 10 days
• Request a license key: [email protected]
• Request a configuration to meet your needs
  o Data handling capacity
  o Data mining engines made available