CSE 446: Machine Learning
Emily Fox, University of Washington, January 20, 2017
©2017 Emily Fox

Regularized Regression: Geometric intuition of the solution
Plus: Cross validation
Coordinate descent for lasso (for normalized features)
Coordinate descent for least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
        set: ŵj = ρj

Here ŷi(ŵ-j) is the prediction without feature j, so yi – ŷi(ŵ-j) is the residual without feature j.
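As a concrete sketch of this update in Python (the helper name, the feature matrix H, and the other variable names are mine, not the slides'):

```python
import numpy as np

def rho(j, H, y, w):
    # H: N x (D+1) matrix of features h_0..h_D (columns assumed normalized),
    # y: targets, w: current coefficients. Names are illustrative.
    y_hat_without_j = H @ w - H[:, j] * w[j]   # prediction without feature j
    return H[:, j] @ (y - y_hat_without_j)     # rho_j: feature j vs. residual

# Least-squares coordinate update for normalized features: simply
# w[j] = rho(j, H, y, w)
```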
Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
        set: ŵj = ρj + λ/2    if ρj < -λ/2
                  0           if ρj in [-λ/2, λ/2]
                  ρj – λ/2    if ρj > λ/2
Soft thresholding
[Plot: ŵj as a function of ρj, the soft-thresholding operator]

ŵj = ρj + λ/2    if ρj < -λ/2
     0           if ρj in [-λ/2, λ/2]
     ρj – λ/2    if ρj > λ/2
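This piecewise update is the soft-thresholding operator; a minimal Python sketch (function name assumed):

```python
def soft_threshold(rho_j, lam):
    # Lasso coordinate update for normalized features.
    if rho_j < -lam / 2:
        return rho_j + lam / 2
    elif rho_j > lam / 2:
        return rho_j - lam / 2
    else:
        return 0.0   # rho_j in [-lam/2, lam/2] is thresholded to exactly zero
```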
How to assess convergence?
(Same algorithm as above; the question is when to declare the "while not converged" loop finished.)
When to stop?
Convergence criteria: for convex problems, the algorithm will start to take smaller and smaller steps.
Measure the size of the steps taken in a full loop over all features, and stop when the max step < ε.
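Putting the pieces together, a sketch of the full coordinate descent loop with this stopping rule (assuming the rho and soft_threshold helpers sketched above; eps and the names are illustrative):

```python
import numpy as np

def lasso_cd(H, y, lam, eps=1e-6):
    # Coordinate descent for lasso with normalized features, reusing the
    # rho() and soft_threshold() helpers sketched above.
    w = np.zeros(H.shape[1])
    max_step = np.inf
    while max_step >= eps:              # stop when max step < eps
        max_step = 0.0
        for j in range(H.shape[1]):
            w_new = soft_threshold(rho(j, H, y, w), lam)
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
    return w
```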
Other lasso solvers
Classically: Least angle regression (LARS) [Efron et al. ‘04]
Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]
Now:
• Parallel CD (e.g., Shotgun [Bradley et al. ’11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
  - Parallel independent solutions, then averaging [Zhang et al. ’12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. ’11]
Coordinate descent for lasso (for unnormalized features)
Coordinate descent for lasso with unnormalized features
Precompute: zj = Σ_{i=1}^N hj(xi)²

Initialize ŵ = 0 (or smartly…)
while not converged:
    for j = 0, 1, …, D:
        compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
        set: ŵj = (ρj + λ/2)/zj    if ρj < -λ/2
                  0                if ρj in [-λ/2, λ/2]
                  (ρj – λ/2)/zj    if ρj > λ/2
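A sketch of the same loop adapted to unnormalized features (again assuming numpy and the rho and soft_threshold helpers above):

```python
def lasso_cd_unnormalized(H, y, lam, eps=1e-6):
    # Same loop as lasso_cd above, but divide by z_j = sum_i h_j(x_i)^2.
    z = (H ** 2).sum(axis=0)        # precompute z_j once, up front
    w = np.zeros(H.shape[1])
    max_step = np.inf
    while max_step >= eps:
        max_step = 0.0
        for j in range(H.shape[1]):
            w_new = soft_threshold(rho(j, H, y, w), lam) / z[j]
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
    return w
```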
Geometric intuition for sparsity of lasso solution
Geometric intuition for ridge regression
Visualizing the ridge cost in 2D
RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)

[Plot: contours of the RSS cost over (w0, w1), both axes from -10 to 10]
Visualizing the ridge cost in 2D
RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)

[Plot: contours of the L2 penalty (circles) over (w0, w1), both axes from -10 to 10]
Visualizing the ridge cost in 2D
RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)

[Plot: contours of the combined ridge cost]
Visualizing the ridge solution
RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)

[Plot: over (w0, w1), both axes from -10 to 10, the level sets of the RSS term and of the L2 penalty intersect at the ridge solution]
Geometric intuition for lasso
Visualizing the lasso cost in 2D
RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)

[Plot: contours of the RSS cost over (w0, w1), both axes from -10 to 10]
Visualizing the lasso cost in 2D
RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)

[Plot: contours of the L1 penalty (diamonds) over (w0, w1), both axes from -10 to 10]
Visualizing the lasso cost in 2D
RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)

[Plot: contours of the combined lasso cost]
Visualizing the lasso solution
RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)

[Plot: over (w0, w1), both axes from -10 to 10, the level sets of the RSS term and of the L1 penalty (diamond) intersect at the lasso solution; the intersection tends to land on a corner of the diamond, where one coefficient is exactly zero, giving a sparse solution]
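To reproduce a picture like this, here is a minimal matplotlib sketch; the quadratic stand-in for RSS and the penalty level are invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy RSS surface centered at (4, 3), plus the L1 ball (a diamond).
w0, w1 = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))
rss = 2 * (w0 - 4) ** 2 + (w0 - 4) * (w1 - 3) + (w1 - 3) ** 2
l1 = np.abs(w0) + np.abs(w1)

plt.contour(w0, w1, rss, levels=10, colors="blue")   # RSS level sets
plt.contour(w0, w1, l1, levels=[3], colors="red")    # L1 ball of radius 3
plt.xlabel("w0"); plt.ylabel("w1")
plt.title("Lasso: RSS contours meet the L1 diamond, often at a corner")
plt.show()
```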
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using lasso regression?
Will consider a few settings of λ …
How to choose λ: Cross validation
If sufficient amount of data…
[Diagram: data split into Training set | Validation set | Test set]
• fit ŵλ on the training set
• test performance of ŵλ on the validation set to select λ*
• assess generalization error of ŵλ* on the test set
Start with smallish dataset
[Diagram: all data]
Still form test set and hold out
[Diagram: all data split into Rest of data | Test set]
How do we use the other data?
[Diagram: rest of data]
Use it for both training and validation, but not so naively.
Recall naïve approach
Is the validation set enough to compare the performance of ŵλ across λ values?

[Diagram: Training set | small Valid. set]

No, because the validation set is small.
Choosing the validation set
• Didn’t have to use the last data points tabulated to form the validation set
• Can use any data subset
Choosing the validation set
Which subset should I use? ALL! Average performance over all choices.
K-fold cross validation

Preprocessing: Randomly assign data to K groups, each of size N/K (use the same split of data for all other steps).

[Diagram: rest of data divided into K blocks of N/K points each]
For k = 1, …, K:
1. Estimate ŵλ^(k) on the training blocks
2. Compute error on the validation block: errork(λ)

[Diagram: block 1 held out as the validation set, giving ŵλ^(1) and error1(λ)]
Cycling the validation block through the remaining positions likewise gives ŵλ^(2) and error2(λ), ŵλ^(3) and error3(λ), and so on through ŵλ^(K) and errorK(λ).

Compute the average error: CV(λ) = (1/K) Σ_{k=1}^K errork(λ)
Repeat procedure for each choice of λ
Choose λ* to minimize CV(λ)
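A compact sketch of the whole procedure in Python (assuming numpy, the lasso_cd solver sketched earlier, and mean squared error as errork(λ); names are illustrative):

```python
import numpy as np

def k_fold_cv_error(H, y, lam, K=5, seed=0):
    # Randomly assign data to K groups; fixing the seed keeps the same
    # split for all other steps (i.e., for every lambda tried).
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        w = lasso_cd(H[train], y[train], lam)   # estimate w_lambda^(k)
        resid = y[val] - H[val] @ w
        errors.append(np.mean(resid ** 2))      # error_k(lambda)
    return np.mean(errors)                      # CV(lambda)

# Repeat for each candidate lambda and choose lambda* minimizing CV(lambda):
# lam_star = min(lam_grid, key=lambda lam: k_fold_cv_error(H, y, lam))
```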
What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K = N): leave-one-out cross validation.
But this is computationally intensive: it requires computing N fits of the model per λ.
Typically, K = 5 or 10 is used (5-fold or 10-fold CV).
Choosing λ via cross validation for lasso
Cross validation chooses the λ that provides the best predictive accuracy.
This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.
Cf. “Machine Learning: A Probabilistic Perspective” (Murphy, 2012) for further discussion.
Practical concerns with lasso
Issues with the standard lasso objective:
1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - often we’d prefer to select them all together
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution

The elastic net aims to address these issues:
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties

See Zou & Hastie ’05 for further discussion.
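For concreteness, a common way to write the elastic net objective, combining the two penalties (the λ1, λ2 notation is assumed here, following Zou & Hastie ’05):

RSS(w) + λ1||w||₁ + λ2||w||₂²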
Summary for feature selection and lasso regression
Impact of feature selection and lasso
Lasso has changed machine learning, statistics, & electrical engineering.
But, for feature selection in general, be careful about interpreting the selected features:
- selection only considers the features included
- sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions
What you can do now…
• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret the lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select the lasso tuning parameter λ
CSE 446: Machine Learning
Emily Fox, University of Washington, January 20, 2017

Linear classifiers
Linear classifier: Intuition
Classifier
Input x: a sentence from a review, e.g., “Sushi was awesome, the food was awesome, but the service was awful.”

Classifier (MODEL)

Output: y, the predicted class: ŷ = +1 or ŷ = -1
Simple linear classifier

Input x: sentence from review

Score(x) = weighted sum of the features of the sentence

If Score(x) > 0:
    ŷ = +1
Else:
    ŷ = -1

Feature | Coefficient
…       | …
A simple example: Word counts
Feature                       | Coefficient
good                          | 1.0
great                         | 1.2
awesome                       | 1.7
bad                           | -1.0
terrible                      | -2.1
awful                         | -3.3
restaurant, the, we, where, … | 0.0
…                             | …
Input xi: “Sushi was great, the food was awesome, but the service was terrible.”
Called a linear classifier, because the score is a weighted sum of the features.
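A small runnable sketch of this word-count classifier (the dictionary and helper are illustrative, mirroring the table above):

```python
# Coefficients from the table; all other words get 0.0.
coef = {"good": 1.0, "great": 1.2, "awesome": 1.7,
        "bad": -1.0, "terrible": -2.1, "awful": -3.3}

def score(sentence):
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return sum(coef.get(w, 0.0) for w in words)

x = "Sushi was great, the food was awesome, but the service was terrible."
s = score(x)                   # 1.2 + 1.7 - 2.1 = 0.8
y_hat = +1 if s > 0 else -1    # Score(x) > 0, so y_hat = +1
```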
More generically…
feature 1 = h0(x) … e.g., the constant 1
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful, or log(x[7]) × x[2] = log(#bad) × #awful, or tf-idf(“awful”)
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]

Model: ŷi = sign(Score(xi))
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = Σ_{j=0}^D wj hj(xi) = wᵀh(xi)
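In code, this generic model is essentially one line; a sketch (the feature map h and the names are assumptions, not from the slides):

```python
import numpy as np

def predict(w, h, x):
    # Generic linear classifier: y_hat = sign(w^T h(x)),
    # where h maps an input to its feature vector [h_0(x), ..., h_D(x)].
    score = np.dot(w, h(x))
    return 1 if score > 0 else -1
```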
Decision boundaries
Suppose only two words had non-zero coefficients
[Plot: sentences as points in the (#awesome, #awful) plane, axes 0, 1, 2, 3, 4, …]

Sushi was awesome, the food was awesome, but the service was awful.

Score(x) = 1.0 #awesome – 1.5 #awful

Input    | Coefficient | Value
         | w0          | 0.0
#awesome | w1          | 1.0
#awful   | w2          | -1.5
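For the example sentence, #awesome = 2 and #awful = 1, so Score(x) = 1.0·2 – 1.5·1 = 0.5 > 0, and the prediction is ŷ = +1.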
Decision boundary example
[Plot: the line 1.0 #awesome – 1.5 #awful = 0 in the (#awesome, #awful) plane; Score(x) > 0 on the side where #awesome > 1.5 #awful, Score(x) < 0 on the other]

Score(x) = 1.0 #awesome – 1.5 #awful

Input    | Coefficient | Value
         | w0          | 0.0
#awesome | w1          | 1.0
#awful   | w2          | -1.5

The decision boundary separates the + and – predictions.
For more inputs (linear features)…
Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

[Plot: 3D axes x[1] = #awesome, x[2] = #awful, x[3] = #great]
For general features…
For more general classifiers (not just linear features), the decision boundary can take more complicated shapes.