CS 2710 Foundations of AI
Lecture 22

Milos Hauskrecht, [email protected], Sennott Square

Supervised learning. Linear and logistic regression.
Supervised learning
• Data: a set of n examples $D = \{D_1, D_2, \ldots, D_n\}$, where $D_i = \langle \mathbf{x}_i, y_i \rangle$
  – $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size d
  – $y_i$ is the desired output (given by a teacher)
• Objective: learn the mapping $f : X \to Y$ s.t. $f(\mathbf{x}_i) \approx y_i$ for all $i = 1, \ldots, n$
• Regression: Y is continuous
  Example: earnings, product orders → company stock price
• Classification: Y is discrete
  Example: handwritten digit → digit label
Linear regression.
Linear regression
• Function $f : X \to Y$ is a linear combination of the input components:
  $$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$$
  where $w_0, w_1, \ldots, w_d$ are parameters (weights)

[Diagram: input vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ plus a bias term 1, weighted by $w_0, w_1, \ldots, w_d$ and summed to produce $f(\mathbf{x}, \mathbf{w})$.]
Linear regression
• Shorter (vector) definition of the model
  – Include the bias constant in the input vector: $\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$
  $$f(\mathbf{x}) = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T \mathbf{x}$$
  where $w_0, w_1, \ldots, w_d$ are parameters (weights)

[Diagram: the same summation unit as before, with the bias constant folded into the input vector.]
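This bias-folding trick translates directly into code. A minimal numpy sketch (the helper names are mine, not from the slides):

```python
import numpy as np

def add_bias(X):
    """Prepend a constant-1 column so that w[0] plays the role of the bias w_0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear model f(x) = w^T x, applied to every row of X at once."""
    return add_bias(X) @ w
```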
Linear regression. Error.
• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $f(\mathbf{x}_i) \approx y_i$ for all $i = 1, \ldots, n$
• Error function
  – measures how much our predictions deviate from the desired answers
  – mean-squared error:
  $$J_n = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i))^2$$
• Learning: we want to find the weights minimizing the error!
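In code the mean-squared error is a one-liner. A sketch, assuming the inputs already carry the bias column from the previous snippet:

```python
import numpy as np

def mse(X1, y, w):
    """J_n = (1/n) * sum_i (y_i - w^T x_i)^2, with X1 an n x (d+1) matrix (bias included)."""
    residuals = y - X1 @ w
    return np.mean(residuals ** 2)
```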
Linear regression. Example
• 1-dimensional input: $\mathbf{x} = (x_1)$

[Plot: linear regression example on data with a 1-dimensional input.]
Linear regression. Example.
• 2-dimensional input: $\mathbf{x} = (x_1, x_2)$

[Plot: linear regression example on data with a 2-dimensional input.]
Linear regression. Optimization.
• We want the weights that minimize the error:
  $$J_n = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i))^2 = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$
• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0:
  $$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \cdots - w_d x_{i,d})\, x_{i,j} = 0$$
• Vector of derivatives:
  $$\operatorname{grad}_{\mathbf{w}}(J_n(\mathbf{w})) = \nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$
Solving linear regression
Setting each derivative to zero,
$$\frac{\partial}{\partial w_j} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - \cdots - w_d x_{i,d})\, x_{i,j} = 0,$$
and rearranging the terms, we get a system of linear equations with d+1 unknowns,
$$\mathbf{A}\mathbf{w} = \mathbf{b},$$
with one equation per weight index j:
$$w_0 \sum_{i=1}^{n} x_{i,0}\, x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1}\, x_{i,j} + \cdots + w_d \sum_{i=1}^{n} x_{i,d}\, x_{i,j} = \sum_{i=1}^{n} y_i\, x_{i,j}$$
For example, for j = 0 (where $x_{i,0} = 1$):
$$w_0 \sum_{i=1}^{n} 1 + w_1 \sum_{i=1}^{n} x_{i,1} + \cdots + w_d \sum_{i=1}^{n} x_{i,d} = \sum_{i=1}^{n} y_i$$
and for j = 1:
$$w_0 \sum_{i=1}^{n} x_{i,0}\, x_{i,1} + w_1 \sum_{i=1}^{n} x_{i,1}\, x_{i,1} + \cdots + w_d \sum_{i=1}^{n} x_{i,d}\, x_{i,1} = \sum_{i=1}^{n} y_i\, x_{i,1}$$
Solving linear regression
• The optimal set of weights satisfies:
  $$\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$$
• This leads to a system of linear equations (SLE) with d+1 unknowns, of the form
  $$w_0 \sum_{i=1}^{n} x_{i,0}\, x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1}\, x_{i,j} + \cdots + w_d \sum_{i=1}^{n} x_{i,d}\, x_{i,j} = \sum_{i=1}^{n} y_i\, x_{i,j}$$
  that is, $\mathbf{A}\mathbf{w} = \mathbf{b}$
• Solution to the SLE:
  – matrix inversion: $\mathbf{w} = \mathbf{A}^{-1}\mathbf{b}$
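A sketch of this closed-form solution in numpy, building A and b as above; np.linalg.solve is the numerically safer equivalent of forming $\mathbf{A}^{-1}$ explicitly:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations A w = b.

    A[j,k] = sum_i x_{i,j} x_{i,k}  and  b[j] = sum_i y_i x_{i,j}.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # fold the bias into the inputs
    A = X1.T @ X1
    b = X1.T @ y
    return np.linalg.solve(A, b)  # same result as w = inv(A) @ b, but more stable
```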
Gradient descent solution
Goal: weight optimization in the linear regression model
$$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$

An alternative to the SLE solution:
• Gradient descent

Idea:
– Adjust the weights in the direction that improves the error
– The gradient tells us what the right direction is
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error(\mathbf{w})$$
$\alpha > 0$ - a learning rate (scales the gradient changes)
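A sketch of the batch gradient descent loop for this error; the fixed learning rate and step count are illustrative choices, not values from the lecture:

```python
import numpy as np

def gradient_descent(X1, y, alpha=0.01, steps=1000):
    """Repeat w <- w - alpha * grad_w Error(w) for the mean-squared error.

    X1: n x (d+1) inputs with the bias column; the gradient is
    -(2/n) * sum_i (y_i - w^T x_i) x_i.
    """
    n = X1.shape[0]
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        grad = -(2.0 / n) * (X1.T @ (y - X1 @ w))
        w = w - alpha * grad
    return w
```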
Gradient descent method
• Descend using the gradient information
• Change the value of w according to the gradient:
  $$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error(\mathbf{w})$$

[Plot: $Error(w)$ as a function of w; the gradient $\nabla_{\mathbf{w}} Error(\mathbf{w})|_{\mathbf{w}^*}$ evaluated at the current point $w^*$ gives the direction of the descent.]
Gradient descent method
• New value of the parameter:
  $$w_j \leftarrow w_j^* - \alpha\, \frac{\partial}{\partial w_j} Error(\mathbf{w})\Big|_{\mathbf{w}^*} \quad \text{for all } j$$
  $\alpha > 0$ - a learning rate (scales the gradient changes)

[Plot: $Error(w)$ versus w, with the derivative $\frac{\partial}{\partial w} Error(w)\big|_{w^*}$ taken at the current point $w^*$.]
Gradient descent method
• Iteratively converge to the optimum of the Error function
[Plot: successive iterates $w^{(0)}, w^{(1)}, w^{(2)}, w^{(3)}$ moving down the $Error(w)$ curve toward the optimum.]
Online gradient algorithm
• The error function is defined for the whole dataset D:
  $$J_n = Error(\mathbf{w}) = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$
• Error for a single sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:
  $$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$
• Online gradient method: changes the weights after every sample
  $$w_j \leftarrow w_j - \alpha\, \frac{\partial}{\partial w_j} Error_i(\mathbf{w})$$
• Vector form:
  $$\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \nabla_{\mathbf{w}} Error_i(\mathbf{w})$$
  $\alpha > 0$ - a learning rate that depends on the number of updates
Online gradient method
On-line error for a sample $D_i = \langle \mathbf{x}_i, y_i \rangle$:
$$J_{online} = Error_i(\mathbf{w}) = \frac{1}{2} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2$$
Linear model: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

Annealed learning rate: $\alpha(i) \approx \frac{1}{i}$
– gradually rescales the changes in the weights

On-line algorithm: generates a sequence of online updates. The (i)-th update step for the j-th weight:
$$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha(i)\, \frac{\partial Error_i}{\partial w_j}\Big|_{\mathbf{w}^{(i-1)}}$$
which for the linear model becomes
$$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$$
Online regression algorithm
Online-linear-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  for i = 1 : number of iterations
    do select a data point $D_i = (\mathbf{x}_i, y_i)$ from D
       set $\alpha = 1/i$
       update the weight vector: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, (y_i - f(\mathbf{x}_i, \mathbf{w}))\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

• Advantages: very easy to implement; handles continuous data streams; adapts to changes in the model over time
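A direct Python transcription of the pseudocode above; the zero initialization and the uniform random choice of the data point are my assumptions where the slide leaves them open:

```python
import numpy as np

def online_linear_regression(X1, y, num_iterations, seed=0):
    """Online updates w <- w + alpha * (y_i - w^T x_i) * x_i with annealed alpha = 1/i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X1.shape[1])                 # initialize weights
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))              # select a data point D_k from D
        alpha = 1.0 / i                       # annealed learning rate
        w += alpha * (y[k] - X1[k] @ w) * X1[k]
    return w
```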
On-line learning. Example
[Plots: four snapshots (panels 1-4) of the model fit on 1-dimensional data as the online updates progress.]
Logistic regression
Binary classification
• Two classes: $Y = \{0, 1\}$
• Our goal is to learn how to correctly classify two types of examples
  – Class 0 - labeled as 0
  – Class 1 - labeled as 1
• We would like to learn $f : X \to \{0, 1\}$
• First step: we need to devise a model of the function f
• Inspiration: neuron (nerve cells)
Neuron
• neuron (nerve cell) and its activities
![Page 12: Class22-Linear Regression Learning](https://reader036.vdocuments.net/reader036/viewer/2022073114/577ccf371a28ab9e788f2db1/html5/thumbnails/12.jpg)
12
Neuron-based binary classification model
[Diagram: input vector $\mathbf{x}$ and a bias term 1, weighted by $w_0, w_1, w_2, \ldots, w_k$, summed to $z$, and passed through a threshold function to produce the output $y$.]
Neuron-based binary classification
• Function we want to learn: $f : X \to \{0, 1\}$

[Diagram: input vector $\mathbf{x} = (x^{(1)}, x^{(2)}, \ldots, x^{(k)})$ plus a bias term 1, weighted by $w_0, w_1, w_2, \ldots, w_k$, summed to $z$, and passed through a threshold function to produce $y$.]
Logistic regression model
• A function model with smooth switching:
  $$f(\mathbf{x}) = g(w_0 + w_1 x^{(1)} + \cdots + w_k x^{(k)})$$
  where $\mathbf{w}$ are the parameters of the model and $g(z)$ is a logistic function
  $$g(z) = 1/(1 + e^{-z})$$
  so that $f(\mathbf{x}) \in [0, 1]$

[Diagram: the same unit as before, with the threshold replaced by the logistic function, producing $f(\mathbf{x})$.]
Logistic function
$$g(z) = \frac{1}{1 + e^{-z}}$$
• also referred to as a sigmoid function
• replaces the ideal threshold function with smooth switching
• takes a real number and outputs a number in the interval [0,1]
• the output of the logistic regression has a probabilistic interpretation:
  $$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$$ - probability of class 1

[Plot: the logistic function rising smoothly from 0 to 1, crossing 0.5 at z = 0.]
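A sketch of the logistic function and the resulting model output in numpy (the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X1, w):
    """f(x) = g(w^T x): probability of class 1 for each row of X1 (bias column included)."""
    return sigmoid(X1 @ w)
```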
Logistic regression. Decision boundary
Classification with the logistic regression model:
• If $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}) \ge 0.5$, then choose class 1
• Else choose class 0
The logistic regression model defines a linear decision boundary.

[Plot: example of a linear decision boundary separating the two classes in a 2-dimensional input space.]
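The boundary is linear because $g(\mathbf{w}^T \mathbf{x}) \ge 0.5$ exactly when $\mathbf{w}^T \mathbf{x} \ge 0$. A sketch of the resulting classifier:

```python
def classify(X1, w):
    """Choose class 1 iff p(y=1|x) >= 0.5, i.e. iff w^T x >= 0 (a linear boundary)."""
    return (X1 @ w >= 0.0).astype(int)
```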
Logistic regression. Parameter optimization.
• We construct a probabilistic version of the error function based on the likelihood of the data
  $$L(D, \mathbf{w}) = P(D \mid \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w})$$
  assuming independent samples $(\mathbf{x}_i, y_i)$
• Likelihood measures the goodness of fit (the opposite of the error)
• Log-likelihood trick: $l(D, \mathbf{w}) = \log L(D, \mathbf{w})$
• Error is the opposite of the log-likelihood:
  $$Error(D, \mathbf{w}) = -l(D, \mathbf{w})$$
Logistic regression: parameter learning
• The error function decomposes into online error components:
  $$Error(D, \mathbf{w}) = \sum_{i=1}^{n} Error_i(D_i, \mathbf{w}) = -\sum_{i=1}^{n} l(D_i, \mathbf{w})$$
• Derivatives of the online error component for the LR model (in terms of the weights):
  $$\frac{\partial}{\partial w_0} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))$$
  $$\frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w}) = -(y_i - f(\mathbf{x}_i, \mathbf{w}))\, x_{i,j}$$
Logistic regression: parameter learning.
• Assume $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Let $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then
  $$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$
• Find the weights w that maximize the likelihood of the outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood:
  $$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) = -\sum_{i=1}^{n} J_{online}(D_i, \mathbf{w})$$
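The negative of this log-likelihood is the error being minimized. A numpy sketch (the epsilon clamp guarding log(0) is my addition):

```python
import numpy as np

def neg_log_likelihood(X1, y, w, eps=1e-12):
    """-l(D, w) = -sum_i [ y_i log(mu_i) + (1 - y_i) log(1 - mu_i) ]."""
    mu = 1.0 / (1.0 + np.exp(-(X1 @ w)))   # mu_i = g(w^T x_i)
    mu = np.clip(mu, eps, 1.0 - eps)       # keep log() finite
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
```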
Logistic regression: parameter learning
• Log likelihood:
  $$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$$
• Derivatives of the log-likelihood:
  $$-\nabla_{\mathbf{w}} l(D, \mathbf{w}) = -\sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = -\sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{x}_i, \mathbf{w}))$$
  $$\frac{\partial}{\partial w_j} (-l(D, \mathbf{w})) = -\sum_{i=1}^{n} x_{i,j}\, (y_i - g(z_i))$$
  Nonlinear in the weights!! (so, unlike linear regression, there is no closed-form solution)
• Gradient descent:
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} [-l(D, \mathbf{w})]\big|_{\mathbf{w}^{(k-1)}}$$
  $$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}))\, \mathbf{x}_i$$
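A batch gradient descent sketch following this update; the constant learning rate stands in for the annealed $\alpha(k)$, and the step count is arbitrary:

```python
import numpy as np

def logistic_gradient_descent(X1, y, alpha=0.1, steps=1000):
    """w^(k) <- w^(k-1) + alpha * sum_i (y_i - g(w^T x_i)) * x_i."""
    w = np.zeros(X1.shape[1])
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-(X1 @ w)))   # f(x_i, w) for every sample
        w += alpha * (X1.T @ (y - mu))         # follow the log-likelihood gradient
    return w
```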
Derivation of the gradient
• Log likelihood:
  $$l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i)$$
• Derivatives of the log-likelihood (chain rule through $z_i = \mathbf{w}^T \mathbf{x}_i$):
  $$\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \big[ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \big]\, \frac{\partial z_i}{\partial w_j}$$
  $$\frac{\partial}{\partial z_i} \big[ y_i \log g(z_i) + (1 - y_i) \log (1 - g(z_i)) \big] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$$
• Derivative of a logistic function:
  $$\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$$
  Substituting it in:
  $$y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$$
  and with $\frac{\partial z_i}{\partial w_j} = x_{i,j}$:
  $$\nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{x}_i, \mathbf{w}))$$
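A quick finite-difference check of this derivation; the random data and the tolerance are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # bias folded in
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

g = lambda z: 1.0 / (1.0 + np.exp(-z))
loglik = lambda w: np.sum(y * np.log(g(X1 @ w)) + (1 - y) * np.log(1 - g(X1 @ w)))

analytic = X1.T @ (y - g(X1 @ w))       # sum_i x_i (y_i - g(w^T x_i))
h = 1e-6
numeric = np.array([(loglik(w + h * e) - loglik(w - h * e)) / (2 * h)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-4)   # the two gradients agree
```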
Logistic regression. Online gradient.
• We want to optimize the online error
• On-line gradient update for the j-th weight at the i-th step:
  $$w_j^{(i)} \leftarrow w_j^{(i-1)} - \alpha\, \frac{\partial}{\partial w_j} Error_i(D_i, \mathbf{w})\big|_{\mathbf{w}^{(i-1)}}$$
• The (i)-th update for the logistic regression, with $D_i = \langle \mathbf{x}_i, y_i \rangle$:
  $$w_j^{(i)} \leftarrow w_j^{(i-1)} + \alpha(i)\, (y_i - f(\mathbf{x}_i, \mathbf{w}^{(i-1)}))\, x_{i,j}$$
  $\alpha$ - annealed learning rate (depends on the number of updates)

The same update rule as used in the linear regression!!!
Online logistic regression algorithm
Online-logistic-regression (D, number of iterations)
  initialize weights $w_0, w_1, w_2, \ldots, w_k$
  for i = 1 : number of iterations
    do select a data point d = <x, y> from D
       set $\alpha = 1/i$
       update the weights (in parallel):
         $w_0 \leftarrow w_0 + \alpha\, [y - f(\mathbf{x}, \mathbf{w})]$
         $w_j \leftarrow w_j + \alpha\, [y - f(\mathbf{x}, \mathbf{w})]\, x_j$
  end for
  return weights
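A Python transcription of this pseudocode; with the bias folded into the input vector, the $w_0$ and $w_j$ updates collapse into one vector update (zero initialization and random point selection are my choices):

```python
import numpy as np

def online_logistic_regression(X1, y, num_iterations, seed=0):
    """Online updates w <- w + alpha * (y - g(w^T x)) * x with annealed alpha = 1/i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X1.shape[1])                    # initialize weights
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))                 # select a data point d = <x, y> from D
        alpha = 1.0 / i
        f = 1.0 / (1.0 + np.exp(-(X1[k] @ w)))   # f(x, w) = g(w^T x)
        w += alpha * (y[k] - f) * X1[k]          # update all weights in parallel
    return w
```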
Online algorithm. Example.
Online updates
Linear regression:
$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$
On-line gradient update:
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, (y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$$

Logistic regression:
$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$$
On-line gradient update:
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, (y - f(\mathbf{x}, \mathbf{w}))\, \mathbf{x}$$

The same update in both cases.

[Diagrams: the two units side by side - a linear unit producing $f(\mathbf{x})$ and a sigmoid unit producing $p(y = 1 \mid \mathbf{x})$.]
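Since the two updates share one form, they can be implemented once, with only the model function swapped; a sketch:

```python
import numpy as np

def online_update(w, x, y, alpha, f):
    """Shared rule: w <- w + alpha * (y - f(x, w)) * x."""
    return w + alpha * (y - f(x, w)) * x

linear_f   = lambda x, w: x @ w                             # linear regression
logistic_f = lambda x, w: 1.0 / (1.0 + np.exp(-(x @ w)))    # logistic regression
```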
Limitations of basic linear units
Linear regression: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
Logistic regression: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T \mathbf{x})$

Function linear in the inputs!! Linear decision boundary!!

[Diagrams: the same two units - both compute a function of $\mathbf{w}^T \mathbf{x}$, so the linear unit is linear in the inputs and the sigmoid unit has a linear decision boundary.]