Class 22: Linear regression learning



CS 2710 Foundations of AI, Lecture 22

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Supervised learning. Linear and logistic regression.


Supervised learning

• Data: a set of n examples $D = \{D_1, D_2, \dots, D_n\}$, where $D_i = \langle \mathbf{x}_i, y_i \rangle$
  – $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d})$ is an input vector of size d
  – $y_i$ is the desired output (given by a teacher)

• Objective: learn the mapping $f: X \to Y$ s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Regression: Y is continuous
  Example: earnings, product orders → company stock price

• Classification: Y is discrete
  Example: handwritten digit → digit label


Linear regression.


Linear regression

• Function $f: X \to Y$ is a linear combination of input components:

$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$

$w_0, w_1, \dots, w_d$ - parameters (weights)

[Figure: a linear unit; the components $x_1, x_2, \dots, x_d$ of the input vector $\mathbf{x}$, together with the constant 1 (bias term), are weighted by $w_0, w_1, w_2, \dots, w_d$ and summed to give $f(\mathbf{x}, \mathbf{w})$.]


Linear regression

• Shorter (vector) definition of the model
  – Include the bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \dots, x_d)$

$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$

$w_0, w_1, \dots, w_d$ - parameters (weights)

[Figure: the same linear unit; the input vector $\mathbf{x}$ with the bias term 1 as its first component is weighted by $w_0, w_1, w_2, \dots, w_d$ to give $f(\mathbf{x}, \mathbf{w})$.]
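A minimal NumPy sketch of this vector form (my own illustration, not from the slides; the helper names `add_bias` and `predict` are made up):

```python
import numpy as np

def add_bias(X):
    """Prepend the constant 1 to every input vector, so that x = (1, x_1, ..., x_d)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear model f(x) = w^T x applied row-by-row, with the bias folded into x."""
    return add_bias(X) @ w

# Example with d = 2 inputs and weights w = (w_0, w_1, w_2)
X = np.array([[1.0, 2.0], [0.5, -1.0]])
w = np.array([0.5, 2.0, -1.0])
print(predict(X, w))   # [0.5 2.5]
```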


Linear regression. Error.

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$

• Error function
  – measures how much our predictions deviate from the desired answers
  – Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2$

• Learning: We want to find the weights minimizing the error!
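A small illustrative sketch (assumed helper code, not from the lecture) of computing this mean-squared error for a candidate weight vector:

```python
import numpy as np

def mse(X, y, w):
    """Mean-squared error J_n = (1/n) * sum_i (y_i - f(x_i))^2 for the linear model w^T x."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # fold the bias constant into the input
    residuals = y - Xb @ w
    return np.mean(residuals ** 2)

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
print(mse(X, y, np.array([1.0, 2.0])))   # 0.0, since y = 1 + 2*x fits these points exactly
```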


Linear regression. Example

• 1 dimensional input: $\mathbf{x} = (x_1)$

[Figure: example of linear regression with 1-dimensional input; data points with the fitted line.]


Linear regression. Example.

• 2 dimensional input: $\mathbf{x} = (x_1, x_2)$

[Figure: example of linear regression with 2-dimensional input; data points with the fitted plane.]


Linear regression. Optimization.

• We want the weights minimizing the error

• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0

• Vector of derivatives:

$J_n = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

$\operatorname{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$

$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$


Solving linear regression

By rearranging the terms we get a system of linear equations with d+1 unknowns:

$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \dots - w_d x_{i,d})\, x_{i,j} = 0$

For the j-th equation:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

For $j = 0$ (using $x_{i,0} = 1$):

$w_0 \sum_{i=1}^{n} 1 + w_1 \sum_{i=1}^{n} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} = \sum_{i=1}^{n} y_i$

For $j = 1$:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,1} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,1} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,1} = \sum_{i=1}^{n} y_i x_{i,1}$

In matrix form: $\mathbf{A}\mathbf{w} = \mathbf{b}$


Solving linear regression

• The optimal set of weights satisfies:

$\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$

Leads to a system of linear equations (SLE) with d+1 unknowns of the form

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \dots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \dots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

$\mathbf{A}\mathbf{w} = \mathbf{b}$

• Solution to SLE: matrix inversion

$\mathbf{w} = \mathbf{A}^{-1}\mathbf{b}$
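A NumPy sketch of this solution (my own illustration; here $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and $\mathbf{b} = \mathbf{X}^T\mathbf{y}$ with the bias column included in X, and `np.linalg.solve` is used rather than forming $\mathbf{A}^{-1}$ explicitly):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve A w = b with A = X^T X and b = X^T y (bias column included in X)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ Xb
    b = Xb.T @ y
    return np.linalg.solve(A, b)      # numerically preferable to w = inv(A) @ b

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=100)
print(fit_linear_regression(X, y))    # approximately [2.0, 3.0]
```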


Gradient descent solution

Goal: the weight optimization in the linear regression model

$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

An alternative to the SLE solution:
• Gradient descent

Idea:
– Adjust weights in the direction that improves the Error
– The gradient tells us what is the right direction

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$

$\alpha > 0$ - a learning rate (scales the gradient changes)
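A minimal batch gradient-descent sketch for this update rule (illustrative only; the constant learning rate and step count are arbitrary choices, not values from the lecture):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, steps=2000):
    """Minimize J_n = (1/n) sum_i (y_i - w^T x_i)^2 with steps w <- w - alpha * grad J_n(w)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    n = Xb.shape[0]
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = -(2.0 / n) * Xb.T @ (y - Xb @ w)   # gradient of the mean-squared error
        w = w - alpha * grad
    return w

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1]
print(gradient_descent(X, y))   # approaches [1.0, 0.5, -2.0]
```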


Gradient descent method

• Descend using the gradient information
• Change the value of w according to the gradient:

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error(\mathbf{w})$

[Figure: a 1-D error curve $Error(w)$; the gradient $\nabla_{w} Error(w)|_{w^*}$ at the current point $w^*$ determines the direction of the descent.]


Gradient descent method

• New value of the parameter, for all j:

$w_j \leftarrow w_j^* - \alpha \frac{\partial}{\partial w_j} Error(\mathbf{w}) \Big|_{\mathbf{w}^*}$

$\alpha > 0$ - a learning rate (scales the gradient changes)

[Figure: a 1-D error curve $Error(w)$ with the derivative $\frac{\partial}{\partial w} Error(w)|_{w^*}$ evaluated at the current point $w^*$.]


Gradient descent method

• Iteratively converge to the optimum of the Error function

[Figure: successive iterates $w^{(0)}, w^{(1)}, w^{(2)}, w^{(3)}$ moving down the error curve $Error(w)$ toward its optimum.]


Online gradient algorithm

• The error function is defined for the whole dataset D:

$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\dots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

• Error for a sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

• Online gradient method: changes weights after every sample

$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$

• vector form:

$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} Error_i(\mathbf{w})$

$\alpha > 0$ - learning rate that depends on the number of updates


Online gradient method

On-line error: $J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$

Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

Annealed learning rate: $\alpha(i) \approx \frac{1}{i}$
– Gradually rescales changes in weights

On-line algorithm: generates a sequence of online updates.

(i)-th update step with $D_i = \langle \mathbf{x}_i, y_i \rangle$, for the j-th weight:

$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i) \frac{\partial Error_i}{\partial w_j} \Big|_{\mathbf{w}^{(i-1)}} = w_j^{(i-1)} + \alpha(i)\,(y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$


Online regression algorithm

Online-linear-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \dots, w_d)$
  for i = 1:1:number of iterations
    do  select a data point $D_i = (\mathbf{x}_i, y_i)$ from D
        set $\alpha = 1/i$
        update weight vector: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

• Advantages: very easy to implement, works on continuous data streams, adapts to changes in the model over time
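A runnable Python counterpart of the pseudocode above (my own sketch; the random selection of data points is one possible reading of "select a data point from D"):

```python
import numpy as np

def online_linear_regression(X, y, num_iterations=5000, seed=0):
    """Online gradient updates w <- w + alpha * (y_i - f(x_i, w)) * x_i with alpha = 1/i."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # include the bias constant in the input
    w = np.zeros(Xb.shape[1])                       # initialize weights
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))                    # select a data point D_k from D
        alpha = 1.0 / i                              # annealed learning rate
        w = w + alpha * (y[k] - Xb[k] @ w) * Xb[k]   # update weight vector
    return w

X = np.linspace(-2, 2, 50).reshape(-1, 1)
y = 0.5 + 1.5 * X[:, 0]
print(online_linear_regression(X, y))   # approaches [0.5, 1.5]
```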


On-line learning. Example

[Figure: on-line learning example; four snapshots (1-4) of the fitted model as the updates proceed.]


Logistic regression


Binary classification

• Two classes: $Y = \{0, 1\}$
• Our goal is to learn how to correctly classify two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn $f: X \to \{0, 1\}$
• First step: we need to devise a model of the function f
• Inspiration: neuron (nerve cells)


Neuron

• neuron (nerve cell) and its activities


Neuron-based binary classification model

[Figure: a neuron-like unit; the inputs, together with the constant 1, are weighted by $w_0, w_1, w_2, \dots, w_k$, summed into $z$, and passed through a threshold function to produce the output $y$.]


Neuron-based binary classification

• Function we want to learn: $f: X \to \{0, 1\}$

[Figure: the classification model; the input vector $\mathbf{x} = (1, x^{(1)}, x^{(2)}, \dots, x^{(k)})$ (with the bias term 1) is weighted by $w_0, w_1, w_2, \dots, w_k$, summed into $z$, and passed through a threshold function to give the output $y$.]


Logistic regression model

• A function model with smooth switching:

$f(\mathbf{x}) = g(w_0 + w_1 x^{(1)} + \dots + w_k x^{(k)})$

where $g(z) = 1/(1 + e^{-z})$, $\mathbf{w}$ are the parameters of the model, and $g(z)$ is a logistic function

[Figure: the logistic regression unit; the input vector $\mathbf{x}$ (with bias term 1) is weighted, summed into $z$, and passed through the logistic function to produce $f(\mathbf{x}) \in [0, 1]$.]


Logistic function

$g(z) = \frac{1}{1 + e^{-z}}$

• also referred to as the sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
• Output of the logistic regression has a probabilistic interpretation:

$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$ - probability of class 1

[Figure: plot of the logistic (sigmoid) function $g(z)$, rising smoothly from 0 to 1.]
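A short illustrative sketch (not from the slides; the helper names are made up) of the logistic function and the class-1 probability it produces:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z)); maps any real z into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

def class1_probability(x, w):
    """p(y = 1 | x) = g(w^T x); x is assumed to already contain the leading bias 1."""
    return logistic(w @ x)

w = np.array([0.2, 1.5, -0.7])     # (w_0, w_1, w_2)
x = np.array([1.0, 0.8, 2.0])      # (1, x^(1), x^(2))
print(class1_probability(x, w))    # 0.5, since z = 0.2 + 1.2 - 1.4 = 0
```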


Logistic regression. Decision boundary

Classification with the logistic regression model:

• If $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) \ge 0.5$ then choose class 1
• Else choose class 0

• The logistic regression model defines a linear decision boundary

[Figure: example of a 2-D dataset with two classes separated by the linear decision boundary of a logistic regression model.]
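As a one-function sketch (illustrative; `classify` is a made-up name), the rule reduces to checking the sign of $\mathbf{w}^T\mathbf{x}$, which is why the boundary is linear:

```python
import numpy as np

def classify(x, w):
    """Choose class 1 if p(y = 1 | x) = g(w^T x) >= 0.5, else class 0.
    Since g(z) >= 0.5 exactly when z >= 0, the decision boundary w^T x = 0 is linear in x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return 1 if p >= 0.5 else 0

print(classify(np.array([1.0, 2.0, -1.0]), np.array([0.5, 1.0, 1.0])))   # 1, since z = 1.5
```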


Logistic regression. Parameter optimization.

• We construct a probabilistic version of the error function based on the likelihood of the data:

$L(D, \mathbf{w}) = P(D \mid \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w})$   (independent samples $(\mathbf{x}_i, y_i)$)

• likelihood measures the goodness of fit (opposite of the error)

• Log-likelihood trick:

$l(D, \mathbf{w}) = \log L(D, \mathbf{w})$

• Error is the opposite of the log-likelihood:

$Error(D, \mathbf{w}) = -l(D, \mathbf{w})$


Logistic regression: parameter learning

• Error function decomposes to online error components:

$Error(D, \mathbf{w}) = \sum_{i=1}^{n} Error_i(D_i, \mathbf{w}) = -\sum_{i=1}^{n} l(D_i, \mathbf{w})$

• Derivatives of the online error component for the LR model (in terms of weights):

$\frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))\, x_{i,j}$

$\frac{\partial}{\partial w_0} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))$


Logistic regression: parameter learning.

• Assume $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Let $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then

$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$

• Find weights w that maximize the likelihood of outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood

$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) = -\sum_{i=1}^{n} J_{online}(D_i, \mathbf{w})$
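A sketch of evaluating this log-likelihood for a candidate weight vector (my own illustration; the clipping of $\mu_i$ away from 0 and 1 is a numerical safeguard that is not part of the slides):

```python
import numpy as np

def log_likelihood(X, y, w, eps=1e-12):
    """l(D, w) = sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ] with mu_i = g(w^T x_i)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    mu = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    mu = np.clip(mu, eps, 1.0 - eps)     # avoid log(0); numerical safeguard only
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(log_likelihood(X, y, np.array([0.0, 1.0])))   # closer to 0 means a better fit
```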


Logistic regression: parameter learning

• Log-likelihood:

$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$

• Derivatives of the log-likelihood (nonlinear in weights!):

$-\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = -\sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T \mathbf{x}_i)) = -\sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{x}_i, \mathbf{w}))$

$\frac{\partial}{\partial w_j} \left[ -l(D, \mathbf{w}) \right] = -\sum_{i=1}^{n} x_{i,j}\,(y_i - g(z_i))$

• Gradient descent:

$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(k-1)}}$

$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}))\, \mathbf{x}_i$
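A sketch of this batch update in Python (illustrative only; a small constant learning rate stands in for the annealed $\alpha(k)$, and the data below are made up):

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, steps=500):
    """Gradient ascent on l(D, w): w <- w + alpha * sum_i (y_i - g(w^T x_i)) * x_i."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # mu_i = f(x_i, w)
        w = w + alpha * Xb.T @ (y - mu)        # no closed form: the gradient is nonlinear in w
    return w

X = np.array([[-2.0], [-1.5], [-0.5], [0.5], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic_regression(X, y))   # decision boundary -w_0/w_1 stays near x = 0
```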


Derivation of the gradient

• Log-likelihood:

$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$

• Derivatives of the log-likelihood:

$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{x}_i, \mathbf{w}))$

$\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$

$\frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$

Derivative of a logistic function: $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$

$= y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$

$\frac{\partial z_i}{\partial w_j} = x_{i,j}$


Logistic regression. Online gradient.

• We want to optimize the online Error

• On-line gradient update for the j-th weight and i-th step, with $D_i = \langle \mathbf{x}_i, y_i \rangle$:

$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha\, \frac{\partial}{\partial w_j} \left[ Error_i(D_i, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(i-1)}}$

• (i)-th update for the logistic regression, j-th weight:

$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\,(y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$

$\alpha$ - annealed learning rate (depends on the number of updates)

• The same update rule as used in the linear regression!


Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights $w_0, w_1, w_2, \dots, w_k$
  for i = 1:1:number of iterations
    do  select a data point d = <x, y> from D
        set $\alpha = 1/i$
        update weights (in parallel):
          $w_0 \leftarrow w_0 + \alpha\,[y - f(\mathbf{x}, \mathbf{w})]$
          $w_j \leftarrow w_j + \alpha\,[y - f(\mathbf{x}, \mathbf{w})]\, x_j$
  end for
  return weights
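A runnable Python counterpart of this pseudocode (a sketch under the same assumptions as the online linear-regression code earlier: random point selection and $\alpha = 1/i$):

```python
import numpy as np

def online_logistic_regression(X, y, num_iterations=5000, seed=0):
    """Online updates w_j <- w_j + alpha * (y - f(x, w)) * x_j with f(x, w) = g(w^T x)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # bias term folded into the input vector
    w = np.zeros(Xb.shape[1])                        # initialize weights w_0 ... w_k
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))                     # select a data point d = <x, y> from D
        alpha = 1.0 / i
        f = 1.0 / (1.0 + np.exp(-(Xb[k] @ w)))       # logistic model output
        w = w + alpha * (y[k] - f) * Xb[k]           # update all weights in parallel
    return w

X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(online_logistic_regression(X, y))   # positive slope: larger x means class 1 is more likely
```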


Online algorithm. Example.



Online updates

Linear regression:
  Model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
  On-line gradient update: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$

Logistic regression:
  Model: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$
  On-line gradient update: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$

The on-line gradient updates are the same.

[Figure: side-by-side diagrams of the linear regression unit (weighted sum giving $f(\mathbf{x})$) and the logistic regression unit (weighted sum $z$ passed through the logistic function giving $p(y = 1 \mid \mathbf{x})$).]


Limitations of basic linear units

Linear regression: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ - function linear in inputs!

Logistic regression: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$ - linear decision boundary!

[Figure: the same two unit diagrams, emphasizing that both models are built on a linear combination of the inputs.]
