Class22-Linear Regression Learning


Page 1: Class22-Linear Regression Learning


CS 2710 Foundations of AI, Lecture 22

Milos Hauskrecht
[email protected]
5329 Sennott Square

Supervised learning. Linear and logistic regression.


Supervised learning

Data: a set of n examples $D = \{D_1, D_2, \dots, D_n\}$, where $D_i = \langle \mathbf{x}_i, y_i \rangle$
– $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ is an input vector of size d
– $y_i$ is the desired output (given by a teacher)

Objective: learn the mapping $f : X \to Y$ s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Regression: Y is continuous
  Example: earnings, product orders → company stock price

• Classification: Y is discrete
  Example: handwritten digit → digit label

Page 2: Class22-Linear Regression Learning


Linear regression.


Linear regression

• Function $f : X \to Y$ is a linear combination of input components:

$$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$$

$w_0, w_1, \dots, w_d$ – parameters (weights); $w_0$ – the bias term

[Figure: a linear unit – the input vector $\mathbf{x} = (x_1, x_2, \dots, x_d)$ plus a constant bias input 1, weighted by $w_0, w_1, \dots, w_d$ and summed to give $f(\mathbf{x}, \mathbf{w})$]

Page 3: Class22-Linear Regression Learning


Linear regression

• Shorter (vector) definition of the model
  – Include the bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \dots, x_d)$

$$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$$

$w_0, w_1, \dots, w_d$ – parameters (weights)

[Figure: the same linear unit, with the bias input folded into $\mathbf{x}$]


Linear regression. Error.

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Error function
  – measures how much our predictions deviate from the desired answers

  Mean-squared error: $$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2$$

• Learning: we want to find the weights minimizing the error!
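A minimal numpy sketch of this error (an illustration, not part of the slides); `X` is assumed to carry a leading column of 1s for the bias:

```python
import numpy as np

def predict(X, w):
    """Linear model f(x) = w^T x; X has a leading bias column of 1s."""
    return X @ w

def mse(X, y, w):
    """Mean-squared error J_n = (1/n) * sum_i (y_i - f(x_i))^2."""
    return np.mean((y - predict(X, w)) ** 2)
```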

Page 4: Class22-Linear Regression Learning


Linear regression. Example

• 1-dimensional input: $\mathbf{x} = (x_1)$

[Figure: scatter of data points with the fitted regression line for a 1-dimensional input]


Linear regression. Example.

• 2-dimensional input: $\mathbf{x} = (x_1, x_2)$

[Figure: data points with the fitted regression plane for a 2-dimensional input]

Page 5: Class22-Linear Regression Learning


Linear regression. Optimization.

• We want the weights minimizing the error

• For the optimal set of parameters, derivatives of the error with respect to each parameter must be 0

• Vector of derivatives:

$$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

$$\operatorname{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$

For the j-th weight:

$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$$


Solving linear regression

By rearranging the terms we get a system of linear equations with d+1 unknowns

$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$$

Setting this derivative to 0 gives one linear equation per weight $w_j$, $j = 0, 1, \dots, d$; e.g., for $j = 0$ (recall $x_{i,0} = 1$) and for $j = 1$:

$$w_0 \sum_{i=1}^{n} x_{i,0} + w_1 \sum_{i=1}^{n} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} = \sum_{i=1}^{n} y_i$$

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,1} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,1} = \sum_{i=1}^{n} y_i x_{i,1}$$

and, in general, for the j-th equation:

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$$

In matrix form:

$$\mathbf{A}\mathbf{w} = \mathbf{b}$$

Page 6: Class22-Linear Regression Learning


Solving linear regression

• The optimal set of weights satisfies:

$$\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$

• Leads to a system of linear equations (SLE) with d+1 unknowns of the form

$$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$$

that is, $\mathbf{A}\mathbf{w} = \mathbf{b}$.

• Solution to the SLE: matrix inversion, $\mathbf{w} = \mathbf{A}^{-1}\mathbf{b}$
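In matrix form the system is $(\mathbf{X}^T\mathbf{X})\mathbf{w} = \mathbf{X}^T\mathbf{y}$, since $A_{jk} = \sum_i x_{i,j} x_{i,k}$ and $b_j = \sum_i y_i x_{i,j}$. A minimal numpy sketch (the function name is illustrative; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least-squares fit: solve A w = b with
    A = X^T X and b = X^T y (X carries a leading bias column of 1s)."""
    A = X.T @ X
    b = X.T @ y
    return np.linalg.solve(A, b)  # preferable to np.linalg.inv(A) @ b

# Example usage on 1-dimensional data:
# X = np.column_stack([np.ones_like(x), x]); w = fit_linear_regression(X, y)
```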


Gradient descent solution

Goal: the weight optimization in the linear regression model

$$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

An alternative to the SLE solution:
• Gradient descent

Idea:
– Adjust weights in the direction that improves the Error
– The gradient tells us what is the right direction

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error(\mathbf{w})$$

$\alpha > 0$ – a learning rate (scales the gradient changes)

Page 7: Class22-Linear Regression Learning


Gradient descent method

• Descend using the gradient information
• Change the value of w according to the gradient:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error(\mathbf{w})$$

[Figure: the $Error(w)$ curve; the gradient $\nabla_{w} Error(\mathbf{w})|_{w^*}$ at the current point $w^*$ determines the direction of the descent]


Gradient descent method

• New value of the parameter, for all j:

$$w_j^* \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error(\mathbf{w}) \Big|_{\mathbf{w}^*}$$

$\alpha > 0$ – a learning rate (scales the gradient changes)

[Figure: a single gradient step on the $Error(w)$ curve, moving from $w^*$ against the slope $\frac{\partial}{\partial w} Error(w)|_{w^*}$]
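A batch gradient-descent sketch for the linear-regression MSE (illustrative; the step size and iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    """Batch gradient descent for linear regression.
    Uses grad J_n = -(2/n) * sum_i (y_i - w^T x_i) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        residual = y - X @ w                  # y_i - f(x_i, w) for all i
        grad = -(2.0 / n) * (X.T @ residual)
        w = w - alpha * grad                  # w <- w - alpha * grad
    return w
```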

Page 8: Class22-Linear Regression Learning


Gradient descent method

• Iteratively converge to the optimum of the Error function

[Figure: successive iterates $w^{(0)}, w^{(1)}, w^{(2)}, w^{(3)}, \dots$ moving down the $Error(w)$ curve toward the optimum]


Online gradient algorithm

• The error function is defined for the whole dataset D, $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

• Error for a sample $D_i$:

$$J_{\text{online}} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

• Online gradient method: changes weights after every sample

$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$$

• Vector form:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error_i(\mathbf{w})$$

$\alpha > 0$ – learning rate that depends on the number of updates

Page 9: Class22-Linear Regression Learning


Online gradient method

On-line error for the linear model $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ and a sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$$J_{\text{online}} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

Annealed learning rate: $\alpha(i) \approx \frac{1}{i}$
– gradually rescales changes in weights

On-line algorithm: generates a sequence of online updates. The (i)-th update step for the j-th weight:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i)\, \frac{\partial Error_i}{\partial w_j} \Big|_{\mathbf{w}^{(i-1)}} = w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$$


Online regression algorithm

Online-linear-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$
  for i = 1 : number of iterations
    do select a data point $D_i = (\mathbf{x}_i, y_i)$ from D
       set $\alpha = 1/i$
       update weight vector: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, (y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

• Advantages: very easy to implement (see the numpy sketch below), works on continuous data streams, and adapts to changes in the model over time
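A runnable sketch of the procedure above, under the assumption that points are sampled uniformly from D (the slide leaves the selection rule open):

```python
import numpy as np

def online_linear_regression(X, y, n_iterations=1000, seed=0):
    """Online-linear-regression: stochastic updates with annealed rate 1/i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, n_iterations + 1):
        k = rng.integers(n)             # select a data point D_i from D
        alpha = 1.0 / i                 # annealed learning rate
        error = y[k] - X[k] @ w         # y_i - f(x_i, w)
        w = w + alpha * error * X[k]    # w <- w + alpha * error * x_i
    return w
```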

Page 10: Class22-Linear Regression Learning


On-line learning. Example

[Figure: four snapshots (1-4) of the on-line fit on 1-dimensional data; each panel shows the data points and the current regression line as updates accumulate]


Logistic regression

Page 11: Class22-Linear Regression Learning


Binary classification

• Two classes: $Y = \{0, 1\}$
• Our goal is to learn how to correctly classify two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn $f : X \to \{0, 1\}$
• First step: we need to devise a model of the function f
• Inspiration: neuron (nerve cells)


Neuron

• neuron (nerve cell) and its activities

Page 12: Class22-Linear Regression Learning


Neuron-based binary classification model

[Figure: a threshold unit – the input vector $\mathbf{x}$ plus a constant bias input 1, weights $w_0, w_1, w_2, \dots, w_k$, a summation node producing z, and a threshold function giving the binary output y]


Neuron-based binary classification

• Function we want to learn: $f : X \to \{0, 1\}$

[Figure: the same threshold unit with inputs $x^{(1)}, x^{(2)}, \dots, x^{(k)}$, bias term 1, weights $w_0, w_1, w_2, \dots, w_k$, weighted sum z, and a threshold function producing y]

Page 13: Class22-Linear Regression Learning


Logistic regression model

• A function model with smooth switching:

$$f(\mathbf{x}) = g(w_0 + w_1 x^{(1)} + \dots + w_k x^{(k)})$$

where w are the parameters of the model and g(z) is a logistic function

$$g(z) = 1 / (1 + e^{-z})$$

so that $f(\mathbf{x}) \in [0, 1]$

[Figure: the unit diagram with the threshold replaced by the logistic function: inputs $x^{(1)}, \dots, x^{(k)}$ plus bias 1, weights $w_0, \dots, w_k$, sum z, logistic output $f(\mathbf{x})$]


Logistic function

$$g(z) = \frac{1}{1 + e^{-z}}$$

• also referred to as a sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
• the output of the logistic regression has a probabilistic interpretation:

$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$$ – probability of class 1

[Figure: the sigmoid curve g(z), rising smoothly from 0 to 1 and crossing 0.5 at z = 0]
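A small sketch of the model's forward computation (illustrative names; `X` carries a leading bias column of 1s):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """p(y = 1 | x) = g(w^T x) for each row of X."""
    return sigmoid(X @ w)
```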

Page 14: Class22-Linear Regression Learning


Logistic regression. Decision boundary.

Classification with the logistic regression model:

If $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) \ge 0.5$ then choose class 1
Else choose class 0

The logistic regression model defines a linear decision boundary.

[Figure: example of two classes of 2-dimensional points separated by a linear decision boundary]
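Since $g(z) \ge 0.5$ exactly when $z = \mathbf{w}^T\mathbf{x} \ge 0$, the rule reduces to the sign of the linear score; a one-line sketch:

```python
def predict_class(X, w):
    """Class 1 iff p(y=1|x) >= 0.5, i.e. iff w^T x >= 0."""
    return (X @ w >= 0.0).astype(int)
```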


Logistic regression. Parameter optimization.

• We construct a probabilistic version of the error function based on the likelihood of the data
• Likelihood measures the goodness of fit (opposite of the error); for independent samples $(\mathbf{x}_i, y_i)$:

$$L(D, \mathbf{w}) = P(D \mid \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w})$$

• Log-likelihood trick:

$$l(D, \mathbf{w}) = \log L(D, \mathbf{w})$$

• Error is the opposite of the log-likelihood:

$$Error(D, \mathbf{w}) = -l(D, \mathbf{w})$$

Page 15: Class22-Linear Regression Learning


Logistic regression: parameter learning

• The error function decomposes into online error components:

$$Error(D, \mathbf{w}) = \sum_{i=1}^{n} Error_i(D_i, \mathbf{w}) = -\sum_{i=1}^{n} l(D_i, \mathbf{w})$$

• Derivatives of the online error component for the LR model (in terms of weights):

$$\frac{\partial}{\partial w_0} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))$$

$$\frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))\, x_{i,j}$$


Logistic regression: parameter learning.

• Assume $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Let $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then

$$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$

• Find weights w that maximize the likelihood of outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood

$$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] = -\sum_{i=1}^{n} J_{\text{online}}(D_i, \mathbf{w})$$
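A sketch of the resulting error (the negative log-likelihood); the clipping of $\mu$ is a numerical safeguard, not part of the slides:

```python
import numpy as np

def neg_log_likelihood(X, y, w):
    """Error(D, w) = -l(D, w) = -sum_i [y_i log mu_i + (1-y_i) log(1-mu_i)],
    with mu_i = g(w^T x_i)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    mu = np.clip(mu, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
```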

Page 16: Class22-Linear Regression Learning


Logistic regression: parameter learning

• Log-likelihood:

$$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$$

• Derivatives of the log-likelihood (nonlinear in weights!!):

$$-\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = -\sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = -\sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{x}_i, \mathbf{w}))$$

$$\frac{\partial}{\partial w_j} [-l(D, \mathbf{w})] = -\sum_{i=1}^{n} x_{i,j} (y_i - g(z_i))$$

• Gradient descent:

$$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} [-l(D, \mathbf{w})] \big|_{\mathbf{w}^{(k-1)}}$$

$$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}))\, \mathbf{x}_i$$
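A batch sketch of this descent (dividing the gradient by n and fixing the step size are my choices, not the slides'):

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, n_steps=500):
    """Batch gradient descent on Error(D, w) = -l(D, w).
    Gradient of -l: -sum_i (y_i - g(w^T x_i)) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # g(w^T x_i) for all i
        w = w + alpha * (X.T @ (y - mu)) / n  # step against the gradient
    return w
```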


Derivation of the gradient

• Log-likelihood:

$$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$$

• Derivatives of the log-likelihood, via the chain rule through $z_i = \mathbf{w}^T \mathbf{x}_i$:

$$\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$$

• The inner derivative:

$$\frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$$

• Derivative of a logistic function:

$$\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$$

• Substituting, the inner derivative simplifies:

$$y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$$

• Finally, $\frac{\partial z_i}{\partial w_j} = x_{i,j}$, so

$$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{x}_i, \mathbf{w}))$$

Page 17: Class22-Linear Regression Learning


Logistic regression. Online gradient.

• We want to optimize the online Error for $D_i = \langle \mathbf{x}_i, y_i \rangle$
• On-line gradient update for the j-th weight and the i-th step:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i)\, \frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) \Big|_{\mathbf{w}^{(i-1)}}$$

• The (i)-th update for the logistic regression:

$$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$$

$\alpha$ – annealed learning rate (depends on the number of updates)

The same update rule as used in the linear regression!!!


Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights $w_0, w_1, w_2, \dots, w_k$
  for i = 1 : number of iterations
    do select a data point d = <x, y> from D
       set $\alpha = 1/i$
       update weights (in parallel):
         $w_0 \leftarrow w_0 + \alpha\, [y - f(\mathbf{x}, \mathbf{w})]$
         $w_j \leftarrow w_j + \alpha\, [y - f(\mathbf{x}, \mathbf{w})]\, x_j$
  end for
  return weights
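A runnable sketch mirroring the procedure (uniform sampling from D is an assumption; with a bias column in X, the $w_0$ and $w_j$ updates collapse into one vector update):

```python
import numpy as np

def online_logistic_regression(X, y, n_iterations=1000, seed=0):
    """Online gradient updates for logistic regression; the update has the
    same form as in linear regression, but with f(x, w) = g(w^T x)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, n_iterations + 1):
        k = rng.integers(n)                    # select a data point from D
        alpha = 1.0 / i                        # annealed learning rate
        f = 1.0 / (1.0 + np.exp(-(X[k] @ w)))  # g(w^T x)
        w = w + alpha * (y[k] - f) * X[k]      # same rule as linear regression
    return w
```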

Page 18: Class22-Linear Regression Learning


Online algorithm. Example.


Online algorithm. Example.

Page 19: Class22-Linear Regression Learning


Online algorithm. Example.


Online updates

Linear regression:

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

Logistic regression:

$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$$

[Figure: side-by-side unit diagrams – a linear unit over inputs $1, x_1, x_2, \dots, x_d$ with weights $w_0, w_1, w_2, \dots, w_d$ producing $f(\mathbf{x})$, and the same unit followed by the logistic function producing $p(y = 1 \mid \mathbf{x})$]

On-line gradient update – the same for both models:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, (y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$$

Page 20: Class22-Linear Regression Learning


Limitations of basic linear units

Linear regression:

$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

Logistic regression:

$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$$

[Figure: the same two unit diagrams as on the previous slide]

Function linear in inputs!! Linear decision boundary!!