Experiments on Logistic Regression
Ning Bao, UC Santa Cruz, CMPS242 Machine Learning
March 20, 2008
Preliminaries (same-o, same-o~)
• Logistic Regression
• Gradient Descent
• Dataset
• Cross Validation
Logistic Regression
• Data : $(x_t, y_t)$, $x_t \in \{0, 1\}^n$, $y_t \in \{0, 1\}$
• Linear activation : $a_t = w \cdot x_t$
• Sigmoid transfer function : $\hat{y}_t = \sigma(a_t) = \frac{1}{1 + e^{-a_t}} = \frac{1}{1 + e^{-w \cdot x_t}}$
• Logistic loss :
$$\mathrm{Loss}(y_t, \hat{y}_t) = y_t \ln\frac{y_t}{\hat{y}_t} + (1 - y_t)\ln\frac{1 - y_t}{1 - \hat{y}_t}
= \begin{cases} -\ln(1 - \hat{y}_t) = \ln(1 + e^{w \cdot x_t}) & \text{if } y_t = 0\\ -\ln \hat{y}_t = \ln(1 + e^{w \cdot x_t}) - w \cdot x_t & \text{if } y_t = 1 \end{cases}
= \ln(1 + e^{w \cdot x_t}) - y_t\, w \cdot x_t$$
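As a concrete reference, here is a minimal NumPy sketch of these quantities (my own code, not from the original experiments); the loss uses the numerically stable form matching $\ln(1 + e^{w \cdot x_t}) - y_t\, w \cdot x_t$ above:

```python
import numpy as np

def sigmoid(a):
    """Sigmoid transfer function: sigma(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def mean_logistic_loss(w, X, y):
    """Mean logistic loss: mean_t [ ln(1 + e^{w.x_t}) - y_t * (w.x_t) ]."""
    a = X @ w                                  # linear activations a_t = w . x_t
    return np.mean(np.logaddexp(0.0, a) - y * a)
```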
Gradient Descent
• Minimization problem :
$$\inf_w \frac{1}{T}\sum_t \mathrm{Loss}(y_t, \sigma(w \cdot x_t))$$
• Gradient :
$$\nabla_w = \frac{1}{T}\sum_t (\hat{y}_t - y_t)\, x_t = \frac{1}{T}\sum_t (\sigma(w \cdot x_t) - y_t)\, x_t$$
• Update formula :
$$w_{p+1} = w_p - \eta \nabla_w, \qquad \text{i.e. } \forall i,\ w_{p+1,i} = w_{p,i} - \eta \nabla_{w_i}$$
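A plain batch gradient-descent loop implementing this update might look as follows; a sketch only, with η and the pass count as illustrative values rather than the ones used in the experiments:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.2, passes=1000):
    """Minimize the mean logistic loss with the update w <- w - eta * grad."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(passes):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigma(w . x_t) for all t
        grad = (X.T @ (y_hat - y)) / T           # (1/T) sum_t (y_hat_t - y_t) x_t
        w -= eta * grad
    return w
```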
Dataset
• Spam dataset of hw3 : 2000 ∗ 2001, binary features/labels
• Visualization of dataset (Thanks Nikhila)
[Figure: maps of the original dataset (2000×2001) and of the permuted dataset (2000×2001)]
Cross Validation
• Split the dataset into training/testing sets
• For the training set :
  ◦ partition the training set into 3 parts
  ◦ for each of the 3 holdouts
    • train all models on the 2/3 part
    • record average logistic loss on the 1/3 part
  ◦ the best model is chosen as the one with the best average validation loss over the 3 holdouts
• Model : certain parameter + trained weights vector
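A sketch of this 3-fold scheme (my own code; `train` and `mean_loss` stand for any of the training procedures and the mean logistic loss from the earlier slides, and the contiguous fold split is an assumption, the permuted dataset is what makes it reasonable):

```python
import numpy as np

def three_fold_validation_loss(X_train, y_train, train, mean_loss):
    """Average holdout loss of one model (parameter setting) over 3 folds."""
    T = len(y_train)
    folds = np.array_split(np.arange(T), 3)
    losses = []
    for k in range(3):
        holdout = folds[k]
        rest = np.concatenate([folds[j] for j in range(3) if j != k])
        w = train(X_train[rest], y_train[rest])            # train on the 2/3 part
        losses.append(mean_loss(w, X_train[holdout], y_train[holdout]))
    return np.mean(losses)                                  # best model = lowest value
```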
Vanilla Logistic Regression, Early Stopping
• Vanilla version, log. reg. + GD
• Vanilla version, log. reg. + GD, 1-Norm stopping (plot)
• Vanilla version, log. reg. + GD, 2-Norm stopping
• 1-Norm vs 2-Norm stopping criteria (plot)
• 1-Norm vs 2-Norm stopping criteria
Vanilla version, log. reg. + GD
• Method outline
  ◦ train the neuron on 3/4 of the whole dataset, keeping 1/4 as the testing set
  ◦ early stopping with the criterion that the 1-Norm/2-Norm of the gradient converges to some level
  ◦ report average logistic loss on the training set & testing set for comparison
• 1-Norm stopping criterion :
$$\left\| \frac{1}{T}\sum_t (\sigma(w \cdot x_t) - y_t)\, x_t \right\|_1 \le 10^{-i}, \quad i \in [0, 3]$$
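The stopping rule folds directly into the descent loop; a sketch (names mine), with `norm_ord` selecting the 1-Norm or 2-Norm and $10^{-i}$ as the tolerance:

```python
import numpy as np

def gd_with_gradient_norm_stopping(X, y, eta=0.2, tol=1e-2, norm_ord=1,
                                   max_passes=200000):
    """Run gradient descent until ||grad||_ord <= tol (1-Norm or 2-Norm criterion)."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(max_passes):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = (X.T @ (y_hat - y)) / T
        if np.linalg.norm(grad, norm_ord) <= tol:   # early stopping criterion
            break
        w -= eta * grad
    return w
```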
Vanilla version, log. reg. + GD, 1-Norm stopping (plot)
[Plot: Vanilla Log. Reg., 1-Norm stopping criterion (η = 0.2); mean logistic loss (testing/training) vs. the gradient at the stopping point]
Vanilla version, log. reg. + GD, 2-Norm stopping
• 2-Norm stopping criterion :
$$\left\| \frac{1}{T}\sum_t (\sigma(w \cdot x_t) - y_t)\, x_t \right\|_2 \le 10^{-i}, \quad i \in [0, 4]$$
• The 2-Norm of the gradient is smaller than the 1-Norm =⇒ train more
1-Norm vs 2-Norm stopping criteria (plot)
[Plot: Vanilla Log. Reg., 1-Norm vs 2-Norm of gradient stopping criteria (η = 0.2); mean logistic loss (testing/training for each norm) vs. the gradient at the stopping point]
1-Norm vs 2-Norm stopping criteria
1. The two stopping criteria show no essential difference
2. The 2-Norm criterion needs more training
Overfitting
• Overfitting of Vanilla log. reg.
• A toy example of unbounded weights
• A toy example of unbounded weights (plot)
• Regularization/Weights control
Overfitting of Vanilla log. reg.
• Overfitting is observable
• Reason : weights get unbounded
• Derivative equation :
$$\nabla_w = 0 \iff \forall i,\ \frac{\sum_t \sigma(w \cdot x_t)\, x_{t,i}}{T} = \frac{\sum_t y_t\, x_{t,i}}{T} \iff \forall i,\ \frac{\sum_{t : x_{t,i}=1} \sigma(w \cdot x_t)}{T} = \frac{\sum_{t : x_{t,i}=1} y_t}{T}$$
As the l.h.s. of the above equation goes to 1 or 0, some weights in $w$ must get unbounded. [MW]
A toy example of unbounded weights
• Matrix :
        label  x1  x2  x3
          1     0   1   0
          0     1   0   1
          1     1   1   0
          1     0   1   0
  test    0     0   0   0
• w2 will keep growing while w3 keeps dropping
• Both are punished on the last row, the testing example
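Running plain gradient descent on the four labelled rows of this toy matrix makes the effect easy to reproduce; a sketch (step size and pass count are arbitrary choices of mine, the test row is left out of training):

```python
import numpy as np

# Toy data: columns x1, x2, x3; labels as in the table above.
X = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 1, 0],
              [0, 1, 0]], dtype=float)
y = np.array([1, 0, 1, 1], dtype=float)

w = np.zeros(3)
for _ in range(10000):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.2 * (X.T @ (y_hat - y)) / len(y)
print(w)   # w[1] (i.e. w2) keeps growing, w[2] (i.e. w3) keeps dropping with more passes
```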
A toy example of unbounded weights (plot)
[Plots: mean logistic loss (testing/training) and weights w1, w2, w3 vs. the gradient at the stopping point]
• The more we train, the worse we get
Regularization/Weights control
• Idea : weights should be bounded/controlled
• Techniques :
  1. Stochastic Gradient Descent w/ Simulated Annealing
  2. 2-Norm Regularization
  3. Modifying the transfer function when training (in the weights update formula)
  4. Label Shrinking
Stochastic Gradient Descent
• Stochastic Gradient Descent
• Stochastic Gradient Descent (plot)
Stochastic Gradient Descent
• Update per example : 'online setting' instead of batch
• Update formula :
$$w = w - \eta_0\, \alpha^{\,i-1}\, \nabla_w$$
in which $i$ is the number of training passes
• 'Smaller steps' as we train more : $\alpha < 1 \implies$ as $i \uparrow$, learning rate $\downarrow$
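A per-example (online) sketch of this update; the decay $\eta_0\,\alpha^{i-1}$ per pass follows the formula above, while shuffling the example order each pass is my own assumption:

```python
import numpy as np

def stochastic_gd(X, y, eta0=0.2, alpha=0.95, passes=100):
    """Per-example updates with an annealed learning rate eta0 * alpha^(i-1)."""
    T, n = X.shape
    w = np.zeros(n)
    for i in range(1, passes + 1):
        eta = eta0 * alpha ** (i - 1)            # smaller steps on later passes
        for t in np.random.permutation(T):       # one pass over the training set
            y_hat = 1.0 / (1.0 + np.exp(-np.dot(w, X[t])))
            w -= eta * (y_hat - y[t]) * X[t]
    return w
```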
Stochastic Gradient Descent (plot)
[Plots: Stochastic Gradient Descent (η0 = 0.2, α = 0.95); mean logistic loss (testing/training) and weights vs. the number of passes over the training set]
• Weights are updated less and less, controlled
2-Norm Regularization
• 2-Norm Regularization
• 2-Norm Regularization (plot)
2-Norm Regularization
• Modified minimization problem :
$$\inf_w \frac{1}{T}\sum_t \left( \frac{\lambda}{2}\,\|w\|_2^2 + \mathrm{Loss}(y_t, \sigma(w \cdot x_t)) \right)$$
• Update formula :
$$w_{p+1} = w_p - \eta\left( \lambda w_p + \frac{1}{T}\sum_t (\sigma(w_p \cdot x_t) - y_t)\, x_t \right)$$
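The regularized update differs from the vanilla one only by the $\lambda w_p$ term; a sketch under the same illustrative assumptions as before:

```python
import numpy as np

def ridge_logistic_gd(X, y, eta=0.2, lam=0.05, passes=1000):
    """Gradient descent on (lambda/2)*||w||^2 + mean logistic loss."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(passes):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = lam * w + (X.T @ (y_hat - y)) / T   # regularizer pulls weights toward 0
        w -= eta * grad
    return w
```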
2-Norm Regularization (plot)
[Plots: 2-Norm Regularized logistic reg. (η = 0.2, λ = .05); mean logistic loss (testing/training) and weights variation vs. the 1-Norm of the gradient at the stopping points]
Modifying transfer function
• New trick : modifying the transfer function
• Modified Sigmoid func.
• Modified Sigmoid func. (plot)
• Modified Sigmoid func., varying σthreshold
• Modified Sigmoid func., side-effect
• Step-like transfer func.
• Step-like transfer func. (plot)
• Step-like transfer func., varying σthreshold
• Modified sigmoid vs. Step-like transfer func.
• Modified sigmoid & Step-like transfer func.
New trick : modifying the transfer function
• Idea : if the linear activation is larger/smaller than a certain threshold, no weights update
• $w_{p+1} = w_p - \eta\,\frac{1}{T}\sum_t (\hat{y}_t - y_t)\, x_t$, in which $\hat{y}_t = \sigma(w \cdot x_t)$
  ◦ If $w \cdot x_t \ge \sigma_{\mathrm{threshold}}$ then $\hat{y}_t = 1 \implies w_{p+1} = w_p$ (given $y_t = 1$)
  ◦ If $w \cdot x_t \le -\sigma_{\mathrm{threshold}}$ then $\hat{y}_t = 0 \implies w_{p+1} = w_p$ (given $y_t = 0$)
• 'If the prediction is good, stay put'
Modified Sigmoid func.
• Prediction in training :
$$\hat{y}_t = \begin{cases} 0 & \text{if } w \cdot x_t \le -\sigma_{\mathrm{threshold}}\\ 1 & \text{if } w \cdot x_t \ge \sigma_{\mathrm{threshold}}\\ \frac{1}{1 + e^{-w \cdot x_t}} & \text{otherwise} \end{cases}$$
[Plot: Modified Sigmoid func. (σthreshold = 3) vs. the original Sigmoid func.]
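A sketch of this modified prediction (names mine); only the transfer function changes, the weight update formula stays as before:

```python
import numpy as np

def modified_sigmoid(a, threshold=3.0):
    """Sigmoid clipped to exactly 0/1 outside [-threshold, threshold]."""
    y_hat = 1.0 / (1.0 + np.exp(-a))
    return np.where(a >= threshold, 1.0,
                    np.where(a <= -threshold, 0.0, y_hat))
```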
Modified Sigmoid func. (plot)
[Plots: Bounded sigmoid prediction (η = .2, σthreshold = 3) vs. Vanilla Log. Reg. (η = 0.2) with 1-Norm stopping; mean logistic loss (testing/training) and weights vs. the gradient at the stopping points]
Modified Sigmoid func., varying σthreshold
• Using 3-fold Cross Validation to pick the best weights vector for each value of σthreshold.
[Plot: Bounded Sigmoid func., varying σthreshold (η = .2); mean logistic loss (testing/training) vs. σthreshold]
• Sanity check :
as σthreshold ↑, the function becomes more like the original Sigmoid =⇒ less control on weights, testing loss ↑
Modified Sigmoid func., side-effect
• Instability of the 1-Norm of the gradient.
[Plot: Modified sigmoid, 1-Norm of gradient vs. number of iterations]
• Explanation : the modified sigmoid function is discontinuous
• Observed : as σthreshold ↑, less fluctuation
Step-like transfer func.
• Prediction in training :
$$\hat{y}_t = \begin{cases} 0 & \text{if } w \cdot x_t \le -\sigma_{\mathrm{threshold}}\\ 1 & \text{if } w \cdot x_t \ge \sigma_{\mathrm{threshold}}\\ \frac{w \cdot x_t + \sigma_{\mathrm{threshold}}}{2\,\sigma_{\mathrm{threshold}}} & \text{otherwise} \end{cases}$$
[Plot: Step-like transfer func. (σthreshold = 3) vs. the original Sigmoid func.]
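The step-like variant replaces the sigmoid by a linear ramp between the two thresholds; a sketch (names mine):

```python
import numpy as np

def step_like_transfer(a, threshold=3.0):
    """Linear ramp (a + threshold) / (2 * threshold), clipped to [0, 1]."""
    return np.clip((a + threshold) / (2.0 * threshold), 0.0, 1.0)
```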
Step-like transfer func. (plot)
[Plots: Step-like transfer func. (η = 0.2, σthreshold = 3) vs. Vanilla Log. Reg. (η = 0.2) with 1-Norm stopping; mean logistic loss (testing/training) and weights vs. the gradient at the stopping points]
Step-like transfer func., varying σthreshold
• Using 3-fold Cross Validation to pick the best weights vector for each value of σthreshold.
[Plot: Step-like transfer func., varying σthreshold (η = .2); mean logistic loss (testing/training) vs. σthreshold]
• Sanity check :
as σthreshold ↑, less control on weights
Modified sigmoid vs. Step-like transfer func.
[Plots: Bounded sigmoid prediction (η = .2, σthreshold = 3) vs. Step-like transfer func. (η = .2, σthreshold = 3); mean logistic loss (testing/training) and weights vs. the 1-Norm of the gradient at the stopping points]
Modified sigmoid & Step-like transfer func.
• Same idea : control the weights during training, using the interval [−σthreshold, σthreshold]
• Usability?
  1. The results only show that they are effective in bounding the weights, thereby preventing overfitting
  2. Extensive experiments on various datasets are needed in practice
Label Shrinking
• Label Shrinking
• Label Shrinking, weights (plot)
• Label Shrinking, mean log. losses (plot)
• Label Shrinking, explanations
• Conclusion and Future Work
• Thanks
Label Shrinking
• $\{0, 1\} \to \{a, b\}$, $a > 0$, $b < 1$ :
$$y'_t := a + (b - a)\, y_t$$
• Or $y'_t := \varepsilon + (1 - 2\varepsilon)\, y_t$, $0 \le \varepsilon \le 1$, i.e. $\{0, 1\} \to \{\varepsilon, 1 - \varepsilon\}$
• Shrunk labels =⇒ a different loss function
[Plots: Label Shrinking effects, 1 → b and 0 → a; σ(a_h) and loss(y, a_h) for b ∈ {.99999, .8, .4} and a ∈ {.00001, .2, .4}]
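Shrinking the labels is a one-line transformation applied before training; a sketch of the ε form (my own code):

```python
import numpy as np

def shrink_labels(y, eps=0.01):
    """Map {0, 1} labels to {eps, 1 - eps}: y' = eps + (1 - 2*eps) * y."""
    return eps + (1.0 - 2.0 * eps) * np.asarray(y, dtype=float)

# Training then uses the shrunk labels in the gradient; the reported loss
# should still be the real logistic loss on the original {0, 1} labels.
y_shrunk = shrink_labels(np.array([0, 1, 1, 0]), eps=0.1)   # -> [0.1, 0.9, 0.9, 0.1]
```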
Label Shrinking, weights (plot)
[Plots: weights vs. the gradient at the stopping points for degenerated Label Shrinking (ε = 0), ε = 0.01, and ε = 0.1]
Label Shrinking, mean log. losses (plot)
[Plots: mean logistic loss (testing/training) vs. the gradient at the stopping point for degenerated Label Shrinking (ε = 0), ε = 0.01, and ε = 0.1]
Label Shrinking, explanations
• Effective weights control
• Poor loss results :
  1. dataset, permutation
  2. trained on a changed loss func., different from the real loss
  [Plot: Label Shrinking effects, 1 → b; σ(a_h) and loss(y, a_h) for b ∈ {.99999, .8, .4}]
  3. should be applied together with Prediction Stretching
Conclusion and Future Work
1. The 1-Norm/2-Norm stopping criteria show no essential difference; early stopping helps
2. Stochastic G.D. and 2-Norm Regularization help; their parameters need to be tuned
3. Modifying the transfer func. is effective; extensive experiments are needed to justify its usability
4. Label Shrinking controls the weights, but what about the loss? It should be applied w/ Prediction Stretching
Thanks
• Maya Hristakeva and Nikhila Arkalgud, Bruno Astuto Arouche Nunes
• Prof. Manfred Warmuth
Fin.