
Experiments on Logistic Regression

Ning Bao
UC Santa Cruz, CMPS242 Machine Learning

March 20, 2008

Preliminaries (same-o, same-o~)

• Logistic Regression
• Gradient Descent
• Dataset
• Cross Validation

Remaining sections:

• Vanilla Logistic Regression, Early Stopping
• Overfitting
• Stochastic Gradient Descent
• 2-Norm Regularization
• Modifying transfer function
• Label Shrinking

Logistic Regression

• Data: $(x_t, y_t)$, $x_t \in \{0,1\}^n$, $y_t \in \{0,1\}$
• Linear activation: $a_t = w \cdot x_t$
• Sigmoid transfer function: $\hat{y}_t = \sigma(a_t) = \frac{1}{1+e^{-a_t}} = \frac{1}{1+e^{-w \cdot x_t}}$
• Logistic loss:
$$\mathrm{Loss}(y_t, \hat{y}_t) = y_t \ln\frac{y_t}{\hat{y}_t} + (1-y_t)\ln\frac{1-y_t}{1-\hat{y}_t}
= \begin{cases} -\ln(1-\hat{y}_t) = \ln(1+e^{w \cdot x_t}) & \text{if } y_t = 0 \\ -\ln \hat{y}_t = \ln(1+e^{w \cdot x_t}) - w \cdot x_t & \text{if } y_t = 1 \end{cases}
= \ln(1+e^{w \cdot x_t}) - y_t \, (w \cdot x_t)$$
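
As a concrete aside (not from the slides): a minimal NumPy sketch of the sigmoid and the combined form of the loss above; the `logaddexp` trick for computing ln(1 + e^a) stably is our own choice.

```python
import numpy as np

def sigmoid(a):
    """Sigmoid transfer function sigma(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_loss(y, w, x):
    """Combined form of the slide's loss: ln(1 + e^(w.x)) - y * (w.x)."""
    a = np.dot(w, x)
    return np.logaddexp(0.0, a) - y * a  # logaddexp avoids overflow for large a

# Quick check against the two cases of the loss:
w, x = np.array([2.0, -1.0]), np.array([1.0, 1.0])
print(logistic_loss(1, w, x), -np.log(sigmoid(w @ x)))      # equal: -ln(y_hat)
print(logistic_loss(0, w, x), -np.log(1 - sigmoid(w @ x)))  # equal: -ln(1 - y_hat)
```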

Gradient Descent

• Minimization problem:
$$\inf_w \frac{1}{T} \sum_t \mathrm{Loss}(y_t, \sigma(w \cdot x_t))$$
• Gradient:
$$\nabla_w = \frac{1}{T} \sum_t (\hat{y}_t - y_t)\, x_t = \frac{1}{T} \sum_t (\sigma(w \cdot x_t) - y_t)\, x_t$$
• Update formula:
$$w_{p+1} = w_p - \eta \nabla_w, \quad \text{i.e.} \quad \forall i,\; w_{p+1,i} = w_{p,i} - \eta \nabla_{w_i}$$
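
A batch gradient-descent sketch implementing exactly these formulas (NumPy; the fixed pass count and zero initialization are our own assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_gradient(w, X, y):
    """grad = (1/T) * sum_t (sigma(w.x_t) - y_t) * x_t; rows of X are the x_t."""
    return (sigmoid(X @ w) - y) @ X / len(y)

def batch_gd(X, y, eta=0.2, passes=1000):
    """Update formula w_{p+1} = w_p - eta * grad, for a fixed number of passes."""
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        w -= eta * mean_gradient(w, X, y)
    return w
```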

Dataset

• Spam dataset of hw3: 2000 × 2001, binary features/labels
• Visualization of the dataset (thanks, Nikhila)

[Figure: "The map of the original dataset (2000×2001)" and "The map of the permuted dataset (2000×2001)", shown side by side.]

Cross Validation

• Split the dataset into training/testing sets
• For the training set (sketched below):
  ◦ partition the training set into 3 parts
  ◦ for each of the 3 holdouts:
    • train all models on the 2/3 part
    • record the average logistic loss on the held-out 1/3 part
  ◦ the best model is chosen as the one with the best average validation loss over the 3 holdouts
• Model: a certain parameter + trained weights vector
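
A sketch of this 3-fold protocol; `train_fn` and `loss_fn` are hypothetical helper callables, and the random permutation before splitting is our assumption:

```python
import numpy as np

def three_fold_cv(X, y, params, train_fn, loss_fn, seed=0):
    """For each candidate parameter, train on 2/3 and validate on the
    held-out 1/3, three times; return the parameter with the best
    average validation loss over the 3 holdouts."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), 3)
    best_param, best_loss = None, np.inf
    for p in params:
        losses = []
        for k in range(3):
            train = np.concatenate([folds[j] for j in range(3) if j != k])
            w = train_fn(X[train], y[train], p)                  # train on the 2/3 part
            losses.append(loss_fn(w, X[folds[k]], y[folds[k]]))  # loss on the 1/3 part
        if np.mean(losses) < best_loss:
            best_param, best_loss = p, np.mean(losses)
    return best_param, best_loss
```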

Vanilla Logistic Regression, Early Stopping

• Vanilla version, log. reg. + GD
• Vanilla version, log. reg. + GD, 1-Norm stopping (plot)
• Vanilla version, log. reg. + GD, 2-Norm stopping
• 1-Norm vs 2-Norm stopping criteria (plot)
• 1-Norm vs 2-Norm stopping criteria

Vanilla version, log. reg. + GD

• Method outline:
  ◦ train the neuron on 3/4 of the whole dataset, keeping 1/4 as the testing set
  ◦ early stopping, with the criterion that the 1-Norm/2-Norm of the gradient converges to some level
  ◦ report the average logistic loss on the training set and testing set for comparison
• 1-Norm stopping criterion:
$$\left\| \frac{1}{T} \sum_t (\sigma(w \cdot x_t) - y_t)\, x_t \right\|_1 \le 10^{-i}, \quad i \in [0, 3]$$
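
A sketch of this early-stopping loop (NumPy; the iteration cap is a safety net we added, not part of the method):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_until_gradient_small(X, y, eta=0.2, tol=1e-1, norm_ord=1, max_iter=10**6):
    """Batch GD that stops once the chosen norm of the mean gradient
    falls to the 10^-i level given by tol; norm_ord=1 or 2 picks the criterion."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        g = (sigmoid(X @ w) - y) @ X / len(y)
        if np.linalg.norm(g, ord=norm_ord) <= tol:
            break
        w -= eta * g
    return w
```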

Vanilla version, log. reg. + GD, 1-Norm stopping (plot)

[Figure: "Vanilla Log. Reg., 1-Norm stopping criterion (η = 0.2)". x-axis: the gradient at the stopping point (10^0 down to 10^-2); y-axis: mean logistic loss (0 to 0.16); curves: testing loss and training loss.]

Vanilla version, log. reg. + GD, 2-Norm stopping

• 2-Norm stopping criterion:
$$\left\| \frac{1}{T} \sum_t (\sigma(w \cdot x_t) - y_t)\, x_t \right\|_2 \le 10^{-i}, \quad i \in [0, 4]$$
• The 2-Norm of the gradient is smaller than the 1-Norm =⇒ train more

1-Norm vs 2-Norm stopping criteria (plot)

[Figure: "Vanilla Log. Reg., 1-Norm vs 2-Norm of gradient stopping criteria (η = 0.2)". x-axis: the gradient at the stopping point (10^0 down to 10^-2); y-axis: mean logistic loss (0 to 0.7); curves: testing and training loss under the 1-Norm and 2-Norm criteria.]

1-Norm vs 2-Norm stopping criteria

1. The two stopping criteria show no essential difference
2. The 2-Norm criterion needs more training

Overfitting

• Overfitting of Vanilla log. reg.
• A toy example of unbounded weights
• A toy example of unbounded weights (plot)
• Regularization/Weights control

Overfitting of Vanilla log. reg.

• Overfitting is observable
• Reason: weights get unbounded
• Derivative equation:
$$\nabla_w = 0 \iff \forall i,\; \frac{\sum_t \sigma(w \cdot x_t)\, x_{t,i}}{T} = \frac{\sum_t y_t\, x_{t,i}}{T} \iff \forall i,\; \frac{\sum_{t : x_{t,i}=1} \sigma(w \cdot x_t)}{T} = \frac{\sum_{t : x_{t,i}=1} y_t}{T}$$
As the l.h.s. of the above equation goes to 1 or 0 (which it must when the labels on the r.h.s. are all 1 or all 0), some weights in w must get unbounded, since σ only reaches 0 or 1 in the limit. [MW]

A toy example of unbounded weights

• Data matrix:

  label  x1  x2  x3
    1     0   1   0
    0     1   0   1
    1     1   1   0
    1     0   1   0
  test:
    0     0   0   0

• w2 will keep growing while w3 keeps dropping
• Both are punished on the last row, the testing example
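
A quick NumPy check on the four training rows above (the learning rate and pass counts are our own choices); it shows w2 drifting up and w3 drifting down as claimed, since x2 only ever appears with label 1 and x3 only with label 0:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = np.array([[0, 1, 0],   # label 1
              [1, 0, 1],   # label 0
              [1, 1, 0],   # label 1
              [0, 1, 0]],  # label 1
             dtype=float)
y = np.array([1, 0, 1, 1], dtype=float)

w = np.zeros(3)
for p in range(1, 100001):
    w -= 0.2 * (sigmoid(X @ w) - y) @ X / len(y)
    if p in (1000, 10000, 100000):
        print(p, np.round(w, 2))  # w2 keeps growing, w3 keeps dropping
```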

A toy example of unbounded weights (plot)

[Figure: two panels over the gradient at the stopping point (10^0 down to 10^-3). Left: mean logistic loss (0 to 8), testing loss vs. training loss. Right: the weights w1, w2, w3 (−6 to 8).]

• The more we train, the worse we get

Regularization/Weights control

• Idea: weights should be bounded/controlled
• Techniques:
  1. Stochastic Gradient Descent w/ Simulated Annealing
  2. 2-Norm Regularization
  3. Modifying the transfer function when training (in the weights update formula)
  4. Label Shrinking

Stochastic Gradient Descent

• Stochastic Gradient Descent
• Stochastic Gradient Descent (plot)

Stochastic Gradient Descent

• Updating per example: 'online setting' instead of batch
• Update formula (see the sketch below):
$$w = w - \eta_0\, \alpha^{i-1}\, \nabla_w$$
in which i is the number of training passes
• 'Smaller steps' as we train more: α < 1 =⇒ as i ↑, the learning rate ↓
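
A sketch of this annealed SGD (NumPy; shuffling the examples on each pass is our own assumption, not stated on the slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_annealed(X, y, eta0=0.2, alpha=0.95, passes=100, seed=0):
    """Per-example ('online') updates with rate eta0 * alpha^(i-1),
    where i is the current pass over the training set."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in range(1, passes + 1):
        eta = eta0 * alpha ** (i - 1)  # smaller steps on later passes
        for t in rng.permutation(len(y)):
            w -= eta * (sigmoid(w @ X[t]) - y[t]) * X[t]
    return w
```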

Stochastic Gradient Descent (plot)

[Figure: "Stochastic G.D. (η0 = 0.2, α = 0.95)", two panels over the number of passes over the training set (10 to 100). Left: mean logistic loss (0 to 0.2), testing vs. training. Right: the weights (−5 to 5).]

• Weights are updated less and less, i.e. controlled

2-Norm Regularization

• 2-Norm Regularization
• 2-Norm Regularization (plot)

2-Norm Regularization

• Modified minimization problem:
$$\inf_w \frac{1}{T} \sum_t \left( \frac{\lambda}{2} \|w\|_2^2 + \mathrm{Loss}(y_t, \sigma(w \cdot x_t)) \right)$$
• Update formula:
$$w_{p+1} = w_p - \eta \left( \lambda w_p + \frac{1}{T} \sum_t (\sigma(w_p \cdot x_t) - y_t)\, x_t \right)$$
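
A sketch of the regularized update (NumPy; the fixed pass count stands in for whichever stopping rule was actually used):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gd_l2(X, y, eta=0.2, lam=0.05, passes=1000):
    """w_{p+1} = w_p - eta * (lam * w_p + (1/T) sum_t (sigma(w_p.x_t) - y_t) x_t);
    the lam * w_p term is the gradient of (lam/2) * ||w||_2^2."""
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        g = (sigmoid(X @ w) - y) @ X / len(y)
        w -= eta * (lam * w + g)
    return w
```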

2-Norm Regularization (plot)

[Figure: "2-Norm Regularized logistic reg. (η = 0.2, λ = .05)", two panels over the 1-Norm of the gradient at the stopping points (10^0 down to 10^-6). Left: mean logistic loss (0.08 to 0.16), testing vs. training. Right: "Weights variation of 2-Norm Regularization", weights within −1.5 to 1.5.]

Modifying transfer function

• New trick: modifying the transfer function
• Modified Sigmoid func.
• Modified Sigmoid func. (plot)
• Modified Sigmoid func., varying σ_threshold
• Modified Sigmoid func., side-effect
• Step-like transfer func.
• Step-like transfer func. (plot)
• Step-like transfer func., varying σ_threshold
• Modified sigmoid vs. Step-like transfer func.
• Modified sigmoid & Step-like transfer func.

New trick: modifying transfer function

• Idea: if the linear activation is larger/smaller than a certain threshold, no weights update
• Update: $w_{p+1} = w_p - \eta \left( \frac{1}{T} \sum_t (\hat{y}_t - y_t)\, x_t \right)$ in which $\hat{y}_t = \sigma(w \cdot x_t)$
  ◦ If $w \cdot x_t \ge \sigma_{threshold}$ then $\hat{y}_t = 1$ =⇒ $w_{p+1} = w_p$ (given $y_t = 1$)
  ◦ If $w \cdot x_t \le -\sigma_{threshold}$ then $\hat{y}_t = 0$ =⇒ $w_{p+1} = w_p$ (given $y_t = 0$)
• 'If the prediction is good, stay put'

Modified Sigmoid func.

• Prediction in training:
$$\hat{y}_t = \begin{cases} 0 & \text{if } w \cdot x_t \le -\sigma_{threshold} \\ 1 & \text{if } w \cdot x_t \ge \sigma_{threshold} \\ \frac{1}{1+e^{-w \cdot x_t}} & \text{otherwise} \end{cases}$$

[Figure: "Modified Sigmoid func. (σ_threshold = 3)" next to "Original Sigmoid func.", both plotted over [−10, 10] with values in [0, 1].]
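
A sketch of this clamped transfer function (NumPy); once an example's activation clears the threshold and its label agrees, the residual ŷ_t − y_t is exactly zero, so the example drops out of the update:

```python
import numpy as np

def modified_sigmoid(a, thresh=3.0):
    """Sigmoid clamped to exactly 0/1 outside [-thresh, thresh]."""
    a = np.asarray(a, dtype=float)
    return np.where(a >= thresh, 1.0,
                    np.where(a <= -thresh, 0.0, 1.0 / (1.0 + np.exp(-a))))
```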

Modified Sigmoid func. (plot)

[Figure: four panels over the gradient at the stopping points (10^0 down to 10^-3). Top: mean logistic loss (0 to 0.16), testing vs. training, for "Bounded sigmoid prediction (η = .2, σ_threshold = 3)" and for "Vanilla Log. Reg., 1-Norm stopping criterion (η = 0.2)". Bottom: the corresponding weights (−5 to 5) for the bounded sigmoid and for the original sigmoid, both under 1-Norm-of-gradient early stopping.]

Modified Sigmoid func., varying σ_threshold

• Using 3-fold Cross Validation to pick the best weights vector for each value of σ_threshold.

[Figure: "Bounded Sigmoid func., varying σ_threshold (η = .2)". x-axis: σ_threshold from 2 to 7.5; y-axis: mean logistic loss (0 to 0.18); curves: testing and training loss.]

• Sanity check: as σ_threshold ↑, the function becomes more like the original sigmoid =⇒ less control on the weights, testing loss ↑

Modified Sigmoid func., side-effect

• Instability of the 1-Norm of the gradient.

[Figure: "Modified sigmoid, 1-Norm of gradient". x-axis: number of iterations (0 to 16×10^4); y-axis: 1-Norm of the gradient, fluctuating between 0 and 0.35.]

• Explanation: the modified sigmoid function is discontinuous
• Observed: as σ_threshold ↑, less fluctuation

Step-like transfer func.

• Prediction in training:
$$\hat{y}_t = \begin{cases} 0 & \text{if } w \cdot x_t \le -\sigma_{threshold} \\ 1 & \text{if } w \cdot x_t \ge \sigma_{threshold} \\ \frac{w \cdot x_t + \sigma_{threshold}}{2\,\sigma_{threshold}} & \text{otherwise} \end{cases}$$

[Figure: "Step-like transfer func. (σ_threshold = 3)" next to "Original Sigmoid func.", both plotted over [−10, 10] with values in [0, 1].]
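
The step-like variant is the same clamping idea with a linear ramp in between; a sketch:

```python
import numpy as np

def step_like(a, thresh=3.0):
    """Linear ramp (a + thresh) / (2 * thresh), clipped to 0 below
    -thresh and to 1 above +thresh. Unlike the modified sigmoid,
    this transfer function is continuous at the thresholds."""
    return np.clip((np.asarray(a, dtype=float) + thresh) / (2.0 * thresh), 0.0, 1.0)
```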

Step-like transfer func. (plot)

[Figure: four panels over the gradient at the stopping points (10^0 down to 10^-3). Top: mean logistic loss (0 to 0.16), testing vs. training, for "Step-like transfer func. (η = 0.2, σ_threshold = 3)" and for "Vanilla Log. Reg., 1-Norm stopping criterion (η = 0.2)". Bottom: the corresponding weights (−5 to 5) for the step-like transfer function and for the original sigmoid, both under 1-Norm-of-gradient early stopping.]

Step-like transfer func., varying σ_threshold

• Using 3-fold Cross Validation to pick the best weights vector for each value of σ_threshold.

[Figure: "Step-like transfer func., varying σ_threshold (η = .2)". x-axis: σ_threshold from 2 to 10; y-axis: mean logistic loss (0 to 0.35); curves: testing and training loss.]

• Sanity check: as σ_threshold ↑, less control on the weights

Modified sigmoid vs. Step-like transfer func.

[Figure: four panels over the 1-Norm of the gradient at the stopping points (10^0 down to 10^-4). Top: mean logistic loss (0 to 0.16), testing vs. training, for "Bounded sigmoid prediction (η = .2, σ_threshold = 3)" and for "Step-like transfer func. (η = .2, σ_threshold = 3)". Bottom: the corresponding weights (−5 to 5), both under early stopping by the 1-Norm of the gradient.]

Modified sigmoid & Step-like transfer func.

• Same idea: control the weights during training, via the activation band $[-\sigma_{threshold}, \sigma_{threshold}]$
• Usability?
  1. The results only show that they are effective in bounding the weights, thereby preventing overfitting
  2. Extensive experiments on various datasets are needed in practice

Label Shrinking

• Label Shrinking
• Label Shrinking, weights (plot)
• Label Shrinking, mean log. losses (plot)
• Label Shrinking, explanations
• Conclusion and Future Work
• Thanks

Label Shrinking

• Shrink the labels: $\{0, 1\} \to \{a, b\}$ with $a > 0$, $b < 1$:
$$y'_t := a + (b - a)\, y_t$$
• Or, symmetrically, $y'_t := \varepsilon + (1 - 2\varepsilon)\, y_t$ with $0 \le \varepsilon < \frac{1}{2}$, i.e. $\{0, 1\} \to \{\varepsilon, 1 - \varepsilon\}$
• Shrunk labels =⇒ a different loss function

[Figure: "Label Shrinking effects, 1 → b" and "Label Shrinking effects, 0 → a", plotting σ(a_h) and loss(y, a_h) against the activation a_h, for b ∈ {.99999, .8, .4} and a ∈ {.00001, .2, .4}.]
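
A sketch of the shrinking map and the resulting loss (NumPy); with y' strictly inside (0, 1), the loss is minimized at a finite activation where σ(a) = y', which is consistent with the weight control seen in the plots:

```python
import numpy as np

def shrink_labels(y, eps=0.01):
    """{0, 1} -> {eps, 1 - eps} via y' = eps + (1 - 2*eps) * y."""
    return eps + (1.0 - 2.0 * eps) * np.asarray(y, dtype=float)

def shrunk_loss(y_shrunk, a):
    """Logistic loss against a shrunk label at activation a:
    y' ln(y'/sigma(a)) + (1 - y') ln((1 - y')/(1 - sigma(a)))."""
    p = 1.0 / (1.0 + np.exp(-a))
    return (y_shrunk * np.log(y_shrunk / p)
            + (1.0 - y_shrunk) * np.log((1.0 - y_shrunk) / (1.0 - p)))

# The minimizer sits at sigma(a) = y', a finite activation:
yp = shrink_labels(1, eps=0.1)        # 0.9
a_star = np.log(yp / (1 - yp))        # logit of 0.9
print(shrunk_loss(yp, a_star), shrunk_loss(yp, a_star + 2.0))  # first is smaller
```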

Label Shrinking, weights (plot)

[Figure: three panels of the weights against the gradient at the stopping points (10^0 down to 10^-2): "Degenerated Label Shrinking (ε = 0)" (−4 to 3), "Label Shrinking (ε = 0.01)" (−3 to 3), and "Label Shrinking (ε = 0.1)" (−3 to 3).]

Label Shrinking, mean log. losses (plot)

[Figure: three panels of mean logistic loss (testing vs. training) against the gradient at the stopping point (10^0 down to 10^-2): "Degenerated Label Shrinking (ε = 0)" (0 to 0.16), "Label Shrinking (ε = 0.01)" (0 to 0.16), and "Label Shrinking (ε = 0.1)" (0 to 0.25).]

Label Shrinking, explanations

• Effective weights control
• Poor loss result. Why?
  1. the dataset and its permutation
  2. trained on the changed loss func., which differs from the real loss

[Figure: "Label Shrinking effects, 1 → b", plotting σ(a_h) and loss(y, a_h) against the activation a_h, for b ∈ {.99999, .8, .4}.]

  3. it should be applied together with Prediction Stretching

Conclusion and Future Work

1. The 1-Norm/2-Norm stopping criteria show no essential difference; early stopping helps
2. Stochastic G.D. and 2-Norm Regularization help, but the parameters need tuning
3. Modifying the transfer func. is effective; extensive experiments are needed to justify its usability
4. Label Shrinking controls the weights, but the loss? It should be applied w/ Prediction Stretching

Thanks

• Maya Hristakeva, Nikhila Arkalgud, and Bruno Astuto Arouche Nunes
• Prof. Manfred Warmuth

Fin.