Class 22: Linear regression learning



CS 2710 Foundations of AI, Lecture 22

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Supervised learning. Linear and logistic regression.


Supervised learning

• Data: a set of n examples $D = \{D_1, D_2, \dots, D_n\}$, where $D_i = \langle \mathbf{x}_i, y_i \rangle$
  – $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ is an input vector of size d
  – $y_i$ is the desired output (given by a teacher)

• Objective: learn the mapping $f: X \to Y$ s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Regression: Y is continuous
  Example: earnings, product orders → company stock price

• Classification: Y is discrete
  Example: handwritten digit → digit label


Linear regression.


Linear regression

• Function $f: X \to Y$ is a linear combination of input components:

$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$

$w_0, w_1, \dots, w_d$ - parameters (weights)

[Figure: a linear unit; the components $x_1, x_2, \dots, x_d$ of the input vector $\mathbf{x}$, together with the constant 1 (bias term), are weighted by $w_0, w_1, w_2, \dots, w_d$ and summed to give $f(\mathbf{x}, \mathbf{w})$.]


Linear regression

• Shorter (vector) definition of the model
  – Include the bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \dots, x_d)$

$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$

$w_0, w_1, \dots, w_d$ - parameters (weights)

[Figure: the same linear unit; the input vector $\mathbf{x}$ with the bias term 1 as its first component is weighted by $w_0, w_1, w_2, \dots, w_d$ to give $f(\mathbf{x}, \mathbf{w})$.]
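A minimal NumPy sketch of this vector form (my own illustration, not from the slides; the helper names `add_bias` and `predict` are made up):

```python
import numpy as np

def add_bias(X):
    """Prepend the constant 1 to every input vector, so that x = (1, x_1, ..., x_d)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear model f(x) = w^T x applied row-by-row, with the bias folded into x."""
    return add_bias(X) @ w

# Example with d = 2 inputs and weights w = (w_0, w_1, w_2)
X = np.array([[1.0, 2.0], [0.5, -1.0]])
w = np.array([0.5, 2.0, -1.0])
print(predict(X, w))   # [0.5 2.5]
```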


Linear regression. Error.

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Error function
  – measures how much our predictions deviate from the desired answers
  – Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2$

• Learning: We want to find the weights minimizing the error!
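A small illustrative sketch (assumed helper code, not from the lecture) of computing this mean-squared error for a candidate weight vector:

```python
import numpy as np

def mse(X, y, w):
    """Mean-squared error J_n = (1/n) * sum_i (y_i - f(x_i))^2 for the linear model w^T x."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # fold the bias constant into the input
    residuals = y - Xb @ w
    return np.mean(residuals ** 2)

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(mse(X, y, np.array([1.0, 2.0])))   # 0.0, since y = 1 + 2*x fits these points exactly
```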


Linear regression. Example

• 1 dimensional input: $\mathbf{x} = (x_1)$

[Figure: example of linear regression with 1-dimensional input; data points with the fitted line.]


Linear regression. Example.

• 2 dimensional input: $\mathbf{x} = (x_1, x_2)$

[Figure: example of linear regression with 2-dimensional input; data points with the fitted plane.]


Linear regression. Optimization.

• We want the weights minimizing the error

• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0

• Vector of derivatives:

$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

$\operatorname{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$

$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$


Solving linear regression

By rearranging the terms we get a system of linear equations with d+1 unknowns:

$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$

For the j-th equation:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

For $j = 0$ (using $x_{i,0} = 1$):

$w_0 \sum_{i=1}^{n} 1 + w_1 \sum_{i=1}^{n} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} = \sum_{i=1}^{n} y_i$

For $j = 1$:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,1} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,1} = \sum_{i=1}^{n} y_i x_{i,1}$

In matrix form: $\mathbf{A}\mathbf{w} = \mathbf{b}$


Solving linear regression

• The optimal set of weights satisfies:

$\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$

Leads to a system of linear equations (SLE) with d+1 unknowns of the form

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

$\mathbf{A}\mathbf{w} = \mathbf{b}$

• Solution to SLE: matrix inversion

$\mathbf{w} = \mathbf{A}^{-1}\mathbf{b}$
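A NumPy sketch of this solution (my own illustration; here $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^T\mathbf{y}$ with the bias column included in X, and `np.linalg.solve` is used rather than forming $\mathbf{A}^{-1}$ explicitly):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve A w = b with A = X^T X and b = X^T y (bias column included in X)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ Xb
    b = Xb.T @ y
    return np.linalg.solve(A, b)      # numerically preferable to w = inv(A) @ b

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=100)
print(fit_linear_regression(X, y))    # approximately [2.0, 3.0]
```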


Gradient descent solution

Goal: the weight optimization in the linear regression model

$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

An alternative to the SLE solution:
• Gradient descent

Idea:
– Adjust weights in the direction that improves the Error
– The gradient tells us what is the right direction

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$

$\alpha > 0$ - a learning rate (scales the gradient changes)
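A minimal batch gradient-descent sketch for this update rule (illustrative only; the constant learning rate and step count are arbitrary choices, not values from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, steps=2000):
    """Minimize J_n = (1/n) sum_i (y_i - w^T x_i)^2 with steps w <- w - alpha * grad J_n(w)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    n = Xb.shape[0]
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = -(2.0 / n) * Xb.T @ (y - Xb @ w)   # gradient of the mean-squared error
        w = w - alpha * grad
    return w

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1]
print(gradient_descent(X, y))   # approaches [1.0, 0.5, -2.0]
```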


Gradient descent method

• Descend using the gradient information
• Change the value of w according to the gradient:

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$

[Figure: a 1-D error curve $Error(w)$; the gradient $\nabla_{w} Error(w)|_{w^*}$ at the current point $w^*$ determines the direction of the descent.]


Gradient descent method

• New value of the parameter, for all j:

$w_j \leftarrow w_j^* - \alpha \frac{\partial}{\partial w_j} Error(\mathbf{w}) \Big|_{\mathbf{w}^*}$

$\alpha > 0$ - a learning rate (scales the gradient changes)

[Figure: a 1-D error curve $Error(w)$ with the derivative $\frac{\partial}{\partial w} Error(w)|_{w^*}$ evaluated at the current point $w^*$.]


Gradient descent method

• Iteratively converge to the optimum of the Error function

[Figure: successive iterates $w^{(0)}, w^{(1)}, w^{(2)}, w^{(3)}$ moving down the error curve $Error(w)$ toward its optimum.]


Online gradient algorithm

• The error function is defined for the whole dataset D:

$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

• Error for a sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

• Online gradient method: changes weights after every sample

$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$

• vector form:

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error_i(\mathbf{w})$

$\alpha > 0$ - learning rate that depends on the number of updates


Online gradient method

On-line error: $J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

Annealed learning rate: $\alpha(i) \approx \frac{1}{i}$
– Gradually rescales changes in weights

On-line algorithm: generates a sequence of online updates.

(i)-th update step with $D_i = \langle \mathbf{x}_i, y_i \rangle$, for the j-th weight:

$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i) \frac{\partial Error_i}{\partial w_j} \Big|_{\mathbf{w}^{(i-1)}} = w_j^{(i-1)} + \alpha(i)\,(y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$


Online regression algorithm

Online-linear-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$
  for i = 1:1:number of iterations
    do  select a data point $D_i = (\mathbf{x}_i, y_i)$ from D
        set $\alpha = 1/i$
        update weight vector: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

• Advantages: very easy to implement, works on continuous data streams, adapts to changes in the model over time
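A runnable Python counterpart of the pseudocode above (my own sketch; the random selection of data points is one possible reading of "select a data point from D"):

```python
import numpy as np

def online_linear_regression(X, y, num_iterations=5000, seed=0):
    """Online gradient updates w <- w + alpha * (y_i - f(x_i, w)) * x_i with alpha = 1/i."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # include the bias constant in the input
    w = np.zeros(Xb.shape[1])                       # initialize weights
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))                    # select a data point D_k from D
        alpha = 1.0 / i                              # annealed learning rate
        w = w + alpha * (y[k] - Xb[k] @ w) * Xb[k]   # update weight vector
    return w

X = np.linspace(-2, 2, 50).reshape(-1, 1)
y = 0.5 + 1.5 * X[:, 0]
print(online_linear_regression(X, y))   # approaches [0.5, 1.5]
```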


On-line learning. Example

[Figure: on-line learning example; four snapshots (1-4) of the fitted model as the updates proceed.]


Logistic regression


Binary classification

• Two classes: $Y = \{0, 1\}$
• Our goal is to learn how to correctly classify two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn $f: X \to \{0, 1\}$
• First step: we need to devise a model of the function f
• Inspiration: neuron (nerve cells)


Neuron

• neuron (nerve cell) and its activities


Neuron-based binary classification model

[Figure: a neuron-like unit; the inputs, together with the constant 1, are weighted by $w_0, w_1, w_2, \dots, w_k$, summed into $z$, and passed through a threshold function to produce the output $y$.]


Neuron-based binary classification

• Function we want to learn: $f: X \to \{0, 1\}$

[Figure: the classification model; the input vector $\mathbf{x} = (1, x^{(1)}, x^{(2)}, \dots, x^{(k)})$ (with the bias term 1) is weighted by $w_0, w_1, w_2, \dots, w_k$, summed into $z$, and passed through a threshold function to give the output $y$.]


Logistic regression model

• A function model with smooth switching:

$f(\mathbf{x}) = g(w_0 + w_1 x^{(1)} + \dots + w_k x^{(k)})$

where $g(z) = 1/(1 + e^{-z})$, $\mathbf{w}$ are the parameters of the model, and $g(z)$ is a logistic function

[Figure: the logistic regression unit; the input vector $\mathbf{x}$ (with bias term 1) is weighted, summed into $z$, and passed through the logistic function to produce $f(\mathbf{x}) \in [0, 1]$.]


Logistic function

$g(z) = \frac{1}{1 + e^{-z}}$

• also referred to as the sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
• Output of the logistic regression has a probabilistic interpretation:

$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$ - probability of class 1

[Figure: plot of the logistic (sigmoid) function $g(z)$, rising smoothly from 0 to 1.]
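A short illustrative sketch (not from the slides; the helper names are made up) of the logistic function and the class-1 probability it produces:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z)); maps any real z into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def class1_probability(x, w):
    """p(y = 1 | x) = g(w^T x); x is assumed to already contain the leading bias 1."""
    return logistic(w @ x)

w = np.array([0.2, 1.5, -0.7])     # (w_0, w_1, w_2)
x = np.array([1.0, 0.8, 2.0])      # (1, x^(1), x^(2))
print(class1_probability(x, w))    # 0.5, since z = 0.2 + 1.2 - 1.4 = 0
```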


Logistic regression. Decision boundary

Classification with the logistic regression model:

• If $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) \ge 0.5$ then choose class 1
• Else choose class 0

• The logistic regression model defines a linear decision boundary

[Figure: example of a 2-D dataset with two classes separated by the linear decision boundary of a logistic regression model.]
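As a one-function sketch (illustrative; `classify` is a made-up name), the rule reduces to checking the sign of $\mathbf{w}^T\mathbf{x}$, which is why the boundary is linear:

```python
import numpy as np

def classify(x, w):
    """Choose class 1 if p(y = 1 | x) = g(w^T x) >= 0.5, else class 0.
    Since g(z) >= 0.5 exactly when z >= 0, the decision boundary w^T x = 0 is linear in x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return 1 if p >= 0.5 else 0

print(classify(np.array([1.0, 2.0, -1.0]), np.array([0.5, 1.0, 1.0])))   # 1, since z = 1.5
```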


Logistic regression. Parameter optimization.

• We construct a probabilistic version of the error function based on the likelihood of the data:

$L(D, \mathbf{w}) = P(D \mid \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w})$   (independent samples $(\mathbf{x}_i, y_i)$)

• likelihood measures the goodness of fit (opposite of the error)

• Log-likelihood trick:

$l(D, \mathbf{w}) = \log L(D, \mathbf{w})$

• Error is the opposite of the log-likelihood:

$Error(D, \mathbf{w}) = -l(D, \mathbf{w})$


Logistic regression: parameter learning

• Error function decomposes to online error components:

$Error(D, \mathbf{w}) = \sum_{i=1}^{n} Error_i(D_i, \mathbf{w}) = -\sum_{i=1}^{n} l(D_i, \mathbf{w})$

• Derivatives of the online error component for the LR model (in terms of weights):

$\frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))\, x_{i,j}$

$\frac{\partial}{\partial w_0} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))$


Logistic regression: parameter learning.

• Assume $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Let $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then

$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$

• Find weights w that maximize the likelihood of outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood

$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) = -\sum_{i=1}^{n} J_{online}(D_i, \mathbf{w})$
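A sketch of evaluating this log-likelihood for a candidate weight vector (my own illustration; the clipping of $\mu_i$ away from 0 and 1 is a numerical safeguard that is not part of the slides):

```python
import numpy as np

def log_likelihood(X, y, w, eps=1e-12):
    """l(D, w) = sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ] with mu_i = g(w^T x_i)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    mu = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    mu = np.clip(mu, eps, 1.0 - eps)     # avoid log(0); numerical safeguard only
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(log_likelihood(X, y, np.array([0.0, 1.0])))   # closer to 0 means a better fit
```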


Logistic regression: parameter learning

• Log-likelihood:

$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$

• Derivatives of the log-likelihood (nonlinear in weights!):

$-\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = -\sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T \mathbf{x}_i)) = -\sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{x}_i, \mathbf{w}))$

$\frac{\partial}{\partial w_j} \left[ -l(D, \mathbf{w}) \right] = -\sum_{i=1}^{n} x_{i,j}\,(y_i - g(z_i))$

• Gradient descent:

$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(k-1)}}$

$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}))\, \mathbf{x}_i$
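A sketch of this batch update in Python (illustrative only; a small constant learning rate stands in for the annealed $\alpha(k)$, and the data below are made up):

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, steps=500):
    """Gradient ascent on l(D, w): w <- w + alpha * sum_i (y_i - g(w^T x_i)) * x_i."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # mu_i = f(x_i, w)
        w = w + alpha * Xb.T @ (y - mu)        # no closed form: the gradient is nonlinear in w
    return w

X = np.array([[-2.0], [-1.5], [-0.5], [0.5], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic_regression(X, y))   # decision boundary -w_0/w_1 stays near x = 0
```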


Derivation of the gradient

• Log-likelihood:

$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$

• Derivatives of the log-likelihood:

$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{x}_i, \mathbf{w}))$

$\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$

$\frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$

Derivative of a logistic function: $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$

$= y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$

$\frac{\partial z_i}{\partial w_j} = x_{i,j}$


Logistic regression. Online gradient.

• We want to optimize the online Error

• On-line gradient update for the j-th weight and i-th step, with $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha\, \frac{\partial}{\partial w_j} \left[ Error_i(D_i, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(i-1)}}$

• (i)-th update for the logistic regression, j-th weight:

$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\,(y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$

$\alpha$ - annealed learning rate (depends on the number of updates)

• The same update rule as used in the linear regression!


Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights $w_0, w_1, w_2, \dots, w_k$
  for i = 1:1:number of iterations
    do  select a data point d = <x, y> from D
        set $\alpha = 1/i$
        update weights (in parallel):
          $w_0 \leftarrow w_0 + \alpha\,[y - f(\mathbf{x}, \mathbf{w})]$
          $w_j \leftarrow w_j + \alpha\,[y - f(\mathbf{x}, \mathbf{w})]\, x_j$
  end for
  return weights
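A runnable Python counterpart of this pseudocode (a sketch under the same assumptions as the online linear-regression code earlier: random point selection and $\alpha = 1/i$):

```python
import numpy as np

def online_logistic_regression(X, y, num_iterations=5000, seed=0):
    """Online updates w_j <- w_j + alpha * (y - f(x, w)) * x_j with f(x, w) = g(w^T x)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # bias term folded into the input vector
    w = np.zeros(Xb.shape[1])                        # initialize weights w_0 ... w_k
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))                     # select a data point d = <x, y> from D
        alpha = 1.0 / i
        f = 1.0 / (1.0 + np.exp(-(Xb[k] @ w)))       # logistic model output
        w = w + alpha * (y[k] - f) * Xb[k]           # update all weights in parallel
    return w

X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(online_logistic_regression(X, y))   # positive slope: larger x means class 1 is more likely
```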


Online algorithm. Example.



Online updates

Linear regression:
  Model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
  On-line gradient update: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$

Logistic regression:
  Model: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$
  On-line gradient update: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$

The on-line gradient updates are the same.

[Figure: side-by-side diagrams of the linear regression unit (weighted sum giving $f(\mathbf{x})$) and the logistic regression unit (weighted sum $z$ passed through the logistic function giving $p(y = 1 \mid \mathbf{x})$).]


Limitations of basic linear units

Linear regression: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ - function linear in inputs!

Logistic regression: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$ - linear decision boundary!

[Figure: the same two unit diagrams, emphasizing that both models are built on a linear combination of the inputs.]
