ICS 178 Introduction Machine Learning & Data Mining
Instructor: Max Welling
Lecture 6: Logistic Regression
Logistic Regression
• This is also regression, but with targets Y ∈ {0,1}, i.e. it is classification!
• We will fit a regression function on P(Y=1|X)
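[Figure: two network diagrams, each mapping inputs x_j through weights A_ij to outputs f_i(x); left: linear regression, right: logistic regression]

linear regression:
$$Y_n = A X_n + b$$

logistic regression:
$$P(Y_n = 1 \mid X_n) = f(X_n), \qquad f(X) = \frac{1}{1 + \exp[-(AX + b)]}$$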
Sigmoid function f(x)
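[Figure: plot of the sigmoid f(x), with data-points with Y=0 and data-points with Y=1 marked; network diagram with inputs x_j, weights A_ij and outputs f_i(x)]

$$P(Y_n = 1 \mid X_n) = f(X_n), \qquad f(X) = \frac{1}{1 + \exp[-(AX + b)]}$$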
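As a minimal sketch of the sigmoid model above, assuming NumPy and a single output so that A is a weight vector and each row of X is one data-point (the names `sigmoid` and `predict_proba` and the numbers are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, A, b):
    """P(Y=1 | X) for each row of X (shape N x d), with A of shape (d,)."""
    return sigmoid(X @ A + b)

# Illustrative parameters and inputs (assumed values, one attribute).
A = np.array([2.0])
b = -1.0
X = np.array([[0.0], [0.5], [3.0]])
p = predict_proba(X, A, b)  # probabilities strictly between 0 and 1
```

The middle point sits where AX + b = 0, so its probability is exactly one half; points further along the direction of A get probabilities closer to 1.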
In 2 Dimensions
A, b determine the 1) orientation, 2) thickness (margin), and 3) offset of the decision surface.
[Figure: sigmoid f(x) in 2 dimensions]
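One way to see these three roles, offered as a worked step rather than something taken from the slide: the decision surface is where the sigmoid equals one half.

$$f(X) = \tfrac{1}{2} \iff \exp[-(AX + b)] = 1 \iff AX + b = 0$$

Under this reading, the surface is a hyperplane whose normal direction is A (orientation), which lies at distance $|b| / \lVert A \rVert$ from the origin (offset), and the width of the region where f rises from near 0 to near 1 scales as $1 / \lVert A \rVert$ (thickness).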
Cost Function
• We want a different error measure that is better suited for 0/1 data.
• This can be derived from maximizing the probability of the data again.
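$$P(Y_n \mid X_n) = f(X_n)^{Y_n} \, (1 - f(X_n))^{1 - Y_n}$$

$$P(Y_n = 1 \mid X_n) = f(X_n), \qquad P(Y_n = 0 \mid X_n) = 1 - f(X_n), \qquad f(X_n) = \frac{1}{1 + \exp[-(A X_n + b)]}$$

$$\text{Error} = -\sum_{n=1}^{N} \left[ Y_n \log f(X_n) + (1 - Y_n) \log(1 - f(X_n)) \right]$$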
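The error measure above can be sketched in a few lines, assuming NumPy, row-wise data-points and a single output. The function name `logistic_error`, the clipping constant `eps`, and the toy data are assumptions added for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_error(X, Y, A, b, eps=1e-12):
    """Negative log-probability of the 0/1 targets Y under the model."""
    f = sigmoid(X @ A + b)
    # Clip to avoid log(0); eps is an assumed numerical safeguard.
    f = np.clip(f, eps, 1.0 - eps)
    return -np.sum(Y * np.log(f) + (1.0 - Y) * np.log(1.0 - f))

# Toy 1-D data: negative inputs labelled 0, positive inputs labelled 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])

good = logistic_error(X, Y, np.array([3.0]), 0.0)   # slope agrees with labels
zero = logistic_error(X, Y, np.array([0.0]), 0.0)   # f = 0.5 everywhere
bad = logistic_error(X, Y, np.array([-3.0]), 0.0)   # slope contradicts labels
```

With all parameters at zero every data-point contributes log 2, and the error drops as the parameters start to agree with the labels.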
Learning A,b
• Again, we take the derivatives of the Error with respect to the parameters.
• This time however, we can’t solve them analytically, so we use gradient descent.
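$$A \leftarrow A - \eta \frac{d\,\text{Error}}{dA}, \qquad b \leftarrow b - \eta \frac{d\,\text{Error}}{db}$$

(with step size η)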
Gradients for Logistic Regression
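• After the math (on the white-board) we find:

$$\frac{\partial\,\text{Error}}{\partial A} = -\sum_n \left[ Y_n (1 - f(X_n)) X_n^T - (1 - Y_n) f(X_n) X_n^T \right]$$

$$\frac{\partial\,\text{Error}}{\partial b} = -\sum_n \left[ Y_n (1 - f(X_n)) - (1 - Y_n) f(X_n) \right]$$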
Note: the first term in each equation (multiplied by Y) only sums over data with Y=1, while the second term (multiplied by (1-Y)) only sums over data with Y=0.
Follow the gradient until the change in A,b falls below a small threshold (e.g. 1E-6).
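The gradients and the stopping rule above can be put together as a rough sketch, assuming NumPy, row-wise data-points and a single output. The step size `eta`, the iteration cap `max_iter`, the function names, and the toy data are assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, Y, A, b):
    """Gradients of the Error with respect to A and b."""
    f = sigmoid(X @ A + b)
    # First term is active only where Y=1, second term only where Y=0.
    r = Y * (1.0 - f) - (1.0 - Y) * f
    dA = -(X.T @ r)
    db = -np.sum(r)
    return dA, db

def fit(X, Y, eta=0.1, tol=1e-6, max_iter=100_000):
    """Gradient descent until the change in A, b falls below tol."""
    A = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iter):
        dA, db = gradients(X, Y, A, b)
        step_A, step_b = eta * dA, eta * db
        A, b = A - step_A, b - step_b
        if max(np.max(np.abs(step_A)), abs(step_b)) < tol:
            break
    return A, b

# Toy 1-D data with overlapping classes, so the optimum is finite.
X = np.array([[-3.0], [-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
A, b = fit(X, Y)
```

A fixed step size is the simplest choice here; if it is too large the updates can oscillate, so in practice it usually needs tuning for the data at hand.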
Classification
• Once we have found the optimal values for A,b we classify future data with:
$$Y_{new} = \text{round}(f(X_{new}))$$
• Least squares and Logistic regression are parametric methods since all the information in the data is stored in the parameters A,b, i.e. after learning you can toss out the data.
• Also, the decision surface is always linear; its complexity does not grow with the amount of data.
• We have imposed our prior knowledge that the decision surface should be linear.
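The classification rule above might be sketched as follows, assuming NumPy and already-learned parameters. The values of `A` and `b` and the new data-points are made up for illustration, and `classify` is a hypothetical helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(X_new, A, b):
    """Round P(Y=1 | X_new) to the nearest class label, 0 or 1."""
    p = sigmoid(X_new @ A + b)
    # np.round sends a probability of exactly 0.5 to 0 (round half to even).
    return np.round(p).astype(int)

# Assumed parameters for a problem with two attributes.
A = np.array([1.5, -0.5])
b = -1.0
X_new = np.array([[2.0, 1.0], [0.0, 0.0], [1.0, 3.0]])
labels = classify(X_new, A, b)
```

Only A and b are needed at this point, which matches the remark that the training data can be discarded after learning. Rounding the probability is the same as checking the sign of AX + b, so the decision surface is the linear set AX + b = 0.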
A Real Example
• Fingerprints are matched against a database.
• Each match is scored.
• Using logistic regression we try to predict if a future match is real or false.
• Human fingerprint examiners claim 100% accuracy. Is this true?
(collaboration with S. Cole)
Exercise
• You have laid your hands on a dataset where the data have a single attribute and a class label (0 or 1). You train a logistic regression classifier. A new data-case is presented. What do you do to decide in what class it falls (use an equation or pseudo-code)?
• How many parameters are there to tune for this problem? Explain what these parameters mean in terms of the function P(Y=1|X).