Class 22: Linear regression learning
CS 2710 Foundations of AI, Lecture 22
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square
Supervised learning. Linear and logistic regression.
Supervised learning
• Data: a set of n examples D = {D_1, D_2, .., D_n}, where D_i = <x_i, y_i>
  – x_i = (x_{i,1}, x_{i,2}, .., x_{i,d}) is an input vector of size d
  – y_i is the desired output (given by a teacher)
• Objective: learn the mapping f : X → Y such that y_i ≈ f(x_i) for all i = 1, .., n
• Regression: Y is continuous
  Example: earnings, product orders → company stock price
• Classification: Y is discrete
  Example: handwritten digit → digit label
Linear regression.

Linear regression
• Function f : X → Y is a linear combination of the input components:
  f(x) = w_0 + w_1 x_1 + w_2 x_2 + .. + w_d x_d = w_0 + Σ_{j=1..d} w_j x_j
• w_0, w_1, .., w_d are the parameters (weights); w_0 is the bias term
[Diagram: a linear unit; the input vector x = (x_1, x_2, .., x_d) and the constant input 1 feed a summation node Σ through weights w_0 (bias term), w_1, w_2, .., w_d, producing f(x, w)]
Linear regression
• Shorter (vector) definition of the model: include the bias constant in the input vector
  x = (1, x_1, x_2, .., x_d)
  f(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + .. + w_d x_d = w^T x
• w_0, w_1, .., w_d are the parameters (weights)
[Diagram: the same linear unit, with the constant input x_0 = 1 now carrying the bias weight w_0]
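As a concrete illustration of this vector form, here is a minimal numpy sketch (not from the original slides; the helper name linear_predict is made up):

import numpy as np

def linear_predict(w, x):
    # f(x) = w^T x with the bias folded in: prepend the constant input x_0 = 1
    x_ext = np.concatenate(([1.0], x))
    return w @ x_ext

# Example: d = 2, weights w = (w_0, w_1, w_2) = (0.5, 2.0, -1.0)
w = np.array([0.5, 2.0, -1.0])
print(linear_predict(w, np.array([1.0, 3.0])))  # 0.5 + 2*1.0 - 1*3.0 = -0.5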
Linear regression. Error.
• Data: D_i = <x_i, y_i>
• Function: x_i → f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, .., n
• Error function: measures how much our predictions deviate from the desired answers
  Mean-squared error: J_n = (1/n) Σ_{i=1..n} (y_i - f(x_i))^2
• Learning: we want to find the weights minimizing the error!
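The mean-squared error translates directly into code; a minimal self-contained sketch (illustrative names, bias folded into the inputs as above):

import numpy as np

def mean_squared_error(w, X, y):
    # J_n = (1/n) * sum_i (y_i - f(x_i))^2, with f(x) = w^T (1, x)
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])  # prepend the constant input 1
    return np.mean((y - X_ext @ w) ** 2)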
Linear regression. Example.
• 1-dimensional input x = (x_1)
[Plot: data points in the plane and the fitted regression line for a 1-dimensional input]
Linear regression. Example.
• 2-dimensional input x = (x_1, x_2)
[Plot: data points in 3-d space and the fitted regression plane for a 2-dimensional input]
Linear regression. Optimization.
• We want the weights minimizing the error:
  J_n = (1/n) Σ_{i=1..n} (y_i - f(x_i))^2 = (1/n) Σ_{i=1..n} (y_i - w^T x_i)^2
• For the optimal set of parameters, the derivatives of the error with respect to each parameter must be 0:
  ∂J_n/∂w_j = -(2/n) Σ_{i=1..n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - .. - w_d x_{i,d}) x_{i,j} = 0
• Vector of derivatives:
  grad_w(J_n(w)) = ∇_w J_n(w) = -(2/n) Σ_{i=1..n} (y_i - w^T x_i) x_i = 0
Solving linear regression
Setting each derivative to zero,
  ∂J_n/∂w_j = -(2/n) Σ_{i=1..n} (y_i - w_0 x_{i,0} - w_1 x_{i,1} - .. - w_d x_{i,d}) x_{i,j} = 0
and rearranging the terms, we get a system of linear equations with d+1 unknowns, one equation per weight w_j:
  w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + .. + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + .. + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
For example, for j = 0 (recall x_{i,0} = 1):
  w_0 Σ_{i=1..n} 1 + w_1 Σ_{i=1..n} x_{i,1} + .. + w_d Σ_{i=1..n} x_{i,d} = Σ_{i=1..n} y_i
and for j = 1:
  w_0 Σ_{i=1..n} x_{i,1} + w_1 Σ_{i=1..n} x_{i,1} x_{i,1} + .. + w_d Σ_{i=1..n} x_{i,d} x_{i,1} = Σ_{i=1..n} y_i x_{i,1}
In matrix form: A w = b
Solving linear regression
• The optimal set of weights satisfies:
  ∇_w J_n(w) = -(2/n) Σ_{i=1..n} (y_i - w^T x_i) x_i = 0
• This leads to a system of linear equations (SLE) with d+1 unknowns of the form
  w_0 Σ_{i=1..n} x_{i,0} x_{i,j} + w_1 Σ_{i=1..n} x_{i,1} x_{i,j} + .. + w_j Σ_{i=1..n} x_{i,j} x_{i,j} + .. + w_d Σ_{i=1..n} x_{i,d} x_{i,j} = Σ_{i=1..n} y_i x_{i,j}
  i.e. A w = b
• Solution to the SLE:
  – matrix inversion: w = A^{-1} b
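A sketch of this closed-form solution in numpy. One practical deviation from the slide: rather than forming A^{-1} explicitly, it is cheaper and numerically safer to solve A w = b directly, which np.linalg.solve does. The data below is synthetic, for illustration only:

import numpy as np

def fit_linear_regression(X, y):
    # Build and solve the normal equations A w = b for the MSE-optimal weights.
    # A_{jk} = sum_i x_{i,j} x_{i,k} and b_j = sum_i y_i x_{i,j}, with x_{i,0} = 1.
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])   # fold the bias into the inputs
    A = X_ext.T @ X_ext
    b = X_ext.T @ y
    return np.linalg.solve(A, b)              # preferred over np.linalg.inv(A) @ b

# Synthetic 1-d example: y ≈ 2 + 3*x_1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 2.0, size=(50, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0.0, 0.5, size=50)
print(fit_linear_regression(X, y))  # roughly [2, 3]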
Gradient descent solution
Goal: the weight optimization in the linear regression model
  J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i - f(x_i, w))^2
An alternative to the SLE solution:
• Gradient descent
Idea:
  – Adjust the weights in the direction that improves the error
  – The gradient tells us what the right direction is:
    w ← w - α ∇_w Error(w)
  – α > 0 is a learning rate (scales the gradient changes)
Gradient descent method
• Descend using the gradient information
• Change the value of w according to the gradient:
  w ← w - α ∇_w Error(w)
[Plot: Error(w) as a function of w; the gradient ∇_w Error(w) evaluated at a point w* gives the direction of the descent]
Gradient descent method
• New value of the parameter, for all j:
  w_j ← w_j* - α ∂Error(w)/∂w_j |_{w*}
• α > 0 is a learning rate (scales the gradient changes)
[Plot: the one-dimensional view; the slope ∂Error(w)/∂w evaluated at w* determines the update]
Gradient descent method
• Iteratively converge to the optimum of the error function
[Plot: Error(w) with a sequence of iterates w^(0), w^(1), w^(2), w^(3) descending toward the minimum]
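A minimal batch gradient descent sketch for the linear model, using the gradient ∇_w J_n = -(2/n) Σ_i (y_i - w^T x_i) x_i derived earlier; the fixed step size and iteration count are illustrative choices, not from the slides:

import numpy as np

def gradient_descent_linreg(X, y, alpha=0.1, num_steps=500):
    # Minimize J_n(w) = (1/n) sum_i (y_i - w^T x_i)^2 by gradient descent.
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])        # bias folded into the inputs
    w = np.zeros(X_ext.shape[1])                   # w^(0)
    for _ in range(num_steps):
        residuals = y - X_ext @ w                  # y_i - w^T x_i
        grad = -(2.0 / n) * (X_ext.T @ residuals)  # gradient of J_n at w
        w = w - alpha * grad                       # w <- w - alpha * grad
    return w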
Online gradient algorithm
• The error function is defined for the whole dataset D:
  J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i - f(x_i, w))^2
• Error for a sample D_i = <x_i, y_i>:
  J_online = Error_i(w) = (1/2) (y_i - f(x_i, w))^2
• Online gradient method: changes the weights after every sample
  w_j ← w_j - α ∂Error_i(w)/∂w_j
• Vector form:
  w ← w - α ∇_w Error_i(w)
• α > 0 is a learning rate that depends on the number of updates
Online gradient method
On-line error: J_online = Error_i(w) = (1/2) (y_i - f(x_i, w))^2
Linear model: f(x) = w^T x
(i)-th update step with D_i = <x_i, y_i>, for the j-th weight:
  w_j^(i) ← w_j^(i-1) - α(i) ∂Error_i(w)/∂w_j |_{w^(i-1)} = w_j^(i-1) + α(i) (y_i - f(x_i, w^(i-1))) x_{i,j}
Annealed learning rate: α(i) ≈ 1/i
  – Gradually rescales changes in the weights
On-line algorithm: generates a sequence of online updates
Online regression algorithm

Online-linear-regression (D, number of iterations)
  initialize weights w = (w_0, w_1, w_2, .., w_d)
  for i = 1 : number of iterations
    do select a data point D_i = <x_i, y_i> from D
       set α = 1/i
       update weight vector w ← w + α (y_i - f(x_i, w)) x_i
  end for
  return weights w

• Advantages: very easy to implement, works on continuous data streams, adapts to changes in the model over time
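A direct sketch of the pseudocode above in numpy. One assumption: "select a data point" is implemented here by cycling through D in order, which is just one simple choice:

import numpy as np

def online_linear_regression(X, y, num_iterations=1000):
    # Online (stochastic) updates: w <- w + (1/i) (y_i - f(x_i, w)) x_i
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])            # bias folded into the inputs
    w = np.zeros(X_ext.shape[1])                       # initialize weights
    for i in range(1, num_iterations + 1):
        x_i, y_i = X_ext[(i - 1) % n], y[(i - 1) % n]  # select a data point from D
        alpha = 1.0 / i                                # annealed learning rate
        w = w + alpha * (y_i - w @ x_i) * x_i          # update weight vector
    return w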
On-line learning. Example.
[Plots: four snapshots (panels 1-4) of the on-line regression fit to 1-dimensional data as the updates proceed]
Logistic regression
Binary classification
• Two classes: Y = {0, 1}
• Our goal is to learn how to correctly classify two types of examples
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn f : X → {0, 1}
• First step: we need to devise a model of the function f
• Inspiration: neuron (nerve cells)
Neuron
• Neuron (nerve cell) and its activities
[Figure: a biological neuron]
Neuron-based binary classification model
[Diagram: the input x and a constant input 1 feed a summation node Σ through weights w_0, w_1, w_2, .., w_k, producing z; a threshold function then outputs y]
Neuron-based binary classification
• Function we want to learn: f : X → {0, 1}
[Diagram: input vector x = (x^(1), x^(2), .., x^(k)) plus the bias term (constant input 1), weights w_0, w_1, w_2, .., w_k, summation producing z, and a threshold function producing y]
Logistic regression model
• A function model with smooth switching:
  f(x) = g(w_0 + w_1 x^(1) + .. + w_k x^(k))
  where g(z) = 1/(1 + e^{-z}), w are the parameters of the model, and g(z) is a logistic function
• f(x) ∈ [0, 1]
[Diagram: the same unit as before, with the threshold replaced by the logistic function, producing f(x)]
Logistic function
  g(z) = 1/(1 + e^{-z})
• Also referred to as a sigmoid function
• Replaces the ideal threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0, 1]
• The output of the logistic regression has a probabilistic interpretation:
  f(x) = p(y = 1 | x), the probability of class 1
[Plot: the sigmoid curve g(z) rising smoothly from 0 to 1, crossing 0.5 at z = 0]
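A sketch of the logistic function in numpy; the clipping of z is an implementation detail added here to avoid overflow in exp, not something the slides discuss:

import numpy as np

def logistic(z):
    # g(z) = 1 / (1 + e^{-z}); maps any real number into [0, 1]
    z = np.clip(z, -500.0, 500.0)  # added guard against overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0.0))    # 0.5, the middle of the smooth switch
print(logistic(-20.0))  # ~0, deep on the class-0 side
print(logistic(20.0))   # ~1, deep on the class-1 side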
Logistic regression. Decision boundary.
Classification with the logistic regression model:
• If f(x) = p(y = 1 | x) ≥ 0.5 then choose class 1, else choose class 0
• The logistic regression model defines a linear decision boundary
[Plot: example of a linear decision boundary separating the two classes in a 2-dimensional input space]
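In code, the decision rule is a threshold at 0.5; since g(0) = 0.5, it is equivalent to testing the sign of w^T x, which is why the boundary is linear. A minimal sketch (assumes trained weights w and an input with the leading 1 already prepended):

import numpy as np

def classify(w, x_ext):
    # choose class 1 iff f(x) = g(w^T x) >= 0.5, i.e. iff w^T x >= 0
    p_class1 = 1.0 / (1.0 + np.exp(-(w @ x_ext)))
    return 1 if p_class1 >= 0.5 else 0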
Logistic regression. Parameter optimization.
• We construct a probabilistic version of the error function based on the likelihood of the data
• Likelihood measures the goodness of fit (the opposite of the error):
  L(D, w) = P(D | w) = Π_{i=1..n} P(y = y_i | x_i, w)   (independent samples (x_i, y_i))
• Log-likelihood trick:
  l(D, w) = log L(D, w)
• The error is the opposite of the log-likelihood:
  Error(D, w) = -l(D, w)
Logistic regression: parameter learning
• The error function decomposes into online error components:
  Error(D, w) = Σ_{i=1..n} Error_i(D_i, w) = -Σ_{i=1..n} l(D_i, w)
• Derivatives of the online error component for the LR model (in terms of the weights):
  ∂Error_i(D_i, w)/∂w_0 = -(y_i - f(x_i, w))
  ∂Error_i(D_i, w)/∂w_j = -(y_i - f(x_i, w)) x_{i,j}
Logistic regression: parameter learning.
• Assume D_i = <x_i, y_i>
• Let μ_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Then
  L(D, w) = Π_{i=1..n} P(y = y_i | x_i, w) = Π_{i=1..n} μ_i^{y_i} (1 - μ_i)^{1 - y_i}
  l(D, w) = log Π_{i=1..n} μ_i^{y_i} (1 - μ_i)^{1 - y_i} = Σ_{i=1..n} [y_i log μ_i + (1 - y_i) log(1 - μ_i)] = -Σ_{i=1..n} J_online(D_i, w)
• Find the weights w that maximize the likelihood of the outputs
  – The optimal weights are the same for both the likelihood and the log-likelihood
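A small sketch of this log-likelihood computation for a fixed w (the clip guarding log(0) is an added implementation detail; X_ext is assumed to carry the leading column of 1s):

import numpy as np

def log_likelihood(w, X_ext, y):
    # l(D, w) = sum_i y_i log mu_i + (1 - y_i) log(1 - mu_i)
    mu = 1.0 / (1.0 + np.exp(-(X_ext @ w)))  # mu_i = g(w^T x_i)
    mu = np.clip(mu, 1e-12, 1.0 - 1e-12)     # added guard against log(0)
    return np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))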
Logistic regression: parameter learning
• Log-likelihood:
  l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 - y_i) log(1 - μ_i)
• Derivatives of the log-likelihood:
  -∇_w l(D, w) = -Σ_{i=1..n} x_i (y_i - g(w^T x_i)) = -Σ_{i=1..n} x_i (y_i - f(x_i, w))
  -∂l(D, w)/∂w_j = -Σ_{i=1..n} x_{i,j} (y_i - g(z_i))
  Nonlinear in the weights!!
• Gradient descent:
  w^(k) ← w^(k-1) - α(k) ∇_w [-l(D, w)] |_{w^(k-1)}
  w^(k) ← w^(k-1) + α(k) Σ_{i=1..n} x_i [y_i - f(x_i, w^(k-1))]
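A sketch of this batch update in numpy, ascending the log-likelihood with its gradient Σ_i x_i (y_i - g(w^T x_i)); the fixed step size stands in for the annealed α(k) as a simplification:

import numpy as np

def logreg_gradient_ascent(X, y, alpha=0.1, num_steps=200):
    # Maximize l(D, w); equivalently, descend on Error(D, w) = -l(D, w).
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])      # fold the bias into the inputs
    w = np.zeros(X_ext.shape[1])
    for _ in range(num_steps):
        mu = 1.0 / (1.0 + np.exp(-(X_ext @ w)))  # mu_i = f(x_i, w)
        w = w + alpha * (X_ext.T @ (y - mu))     # w <- w + alpha * sum_i x_i (y_i - mu_i)
    return w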
Derivation of the gradient
• Log-likelihood:
  l(D, w) = Σ_{i=1..n} y_i log μ_i + (1 - y_i) log(1 - μ_i)
• Derivatives of the log-likelihood, via the chain rule:
  ∂l(D, w)/∂w_j = Σ_{i=1..n} ∂/∂z_i [y_i log μ_i + (1 - y_i) log(1 - μ_i)] ∂z_i/∂w_j
• The inner derivative:
  ∂/∂z_i [y_i log μ_i + (1 - y_i) log(1 - μ_i)] = y_i (1/g(z_i)) ∂g(z_i)/∂z_i - (1 - y_i) (1/(1 - g(z_i))) ∂g(z_i)/∂z_i
• Derivative of a logistic function:
  ∂g(z_i)/∂z_i = g(z_i) (1 - g(z_i))
• Substituting:
  = y_i (1 - g(z_i)) - (1 - y_i) g(z_i) = y_i - g(z_i)
• Finally:
  ∂z_i/∂w_j = x_{i,j}
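The key identity ∂g(z)/∂z = g(z)(1 - g(z)) can be spot-checked numerically with a finite difference (a quick worked check, not part of the slides):

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))
z, eps = 0.7, 1e-6
finite_diff = (g(z + eps) - g(z - eps)) / (2.0 * eps)
print(finite_diff, g(z) * (1.0 - g(z)))  # both approximately 0.2217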
Logistic regression. Online gradient.
• We want to optimize the online error
• On-line gradient update for the j-th weight and the i-th step:
  w_j^(i) ← w_j^(i-1) - α ∂Error_i(D_i, w)/∂w_j |_{w^(i-1)}
• (i)-th update for the logistic regression:
  w_j^(i) ← w_j^(i-1) + α(i) (y_i - f(x_i, w^(i-1))) x_{i,j}
• The same update rule as used in the linear regression!!!
• α is an annealed learning rate (depends on the number of updates)
Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)
  initialize weights w_0, w_1, w_2, .., w_k
  for i = 1 : number of iterations
    do select a data point d = <x, y> from D
       set α = 1/i
       update weights (in parallel):
         w_0 ← w_0 + α [y - f(x, w)]
         w_j ← w_j + α [y - f(x, w)] x_j
  end for
  return weights
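A sketch of this algorithm in numpy (assumptions: data points are selected by cycling through D, and the bias update is handled uniformly as the j = 0 case with x_0 = 1):

import numpy as np

def online_logistic_regression(X, y, num_iterations=5000):
    # Per-sample updates w_j <- w_j + alpha [y - f(x, w)] x_j with alpha = 1/i
    n = X.shape[0]
    X_ext = np.hstack([np.ones((n, 1)), X])            # x_0 = 1 carries the bias w_0
    w = np.zeros(X_ext.shape[1])                       # initialize weights w_0 .. w_k
    for i in range(1, num_iterations + 1):
        x_i, y_i = X_ext[(i - 1) % n], y[(i - 1) % n]  # select a data point from D
        alpha = 1.0 / i                                # annealed learning rate
        f = 1.0 / (1.0 + np.exp(-(w @ x_i)))           # f(x, w) = g(w^T x)
        w = w + alpha * (y_i - f) * x_i                # update all weights in parallel
    return w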
Online algorithm. Example.
[Plots: three snapshots of the online logistic regression algorithm, showing the decision boundary adjusting as the updates proceed]
Online updates
Linear regression:
  f(x) = w^T x
  On-line gradient update: w ← w + α (y - f(x, w)) x
Logistic regression:
  f(x) = p(y = 1 | x, w) = g(w^T x)
  On-line gradient update: w ← w + α (y - f(x, w)) x
The same update rule in both cases.
[Diagrams: the linear unit (left, output f(x)) and the logistic unit (right, output p(y = 1 | x)); both take inputs 1, x_1, x_2, .., x_d with weights w_0, w_1, w_2, .., w_d]
Limitations of basic linear units
Linear regression:
  f(x) = w^T x
  The function is linear in the inputs!!
Logistic regression:
  f(x) = p(y = 1 | x, w) = g(w^T x)
  Linear decision boundary!!
[Diagrams: the same two units as on the previous slide]