Machine Learning
10-701/15-781, Spring 2008
Logistic Regression
-- generative versus discriminative
classifier
Le Song
Lecture 5, September 4, 2012
Based on slides from Eric Xing, CMU
Reading: Chap. 3.1.3-4 CB
Generative vs. Discriminative
Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
Assume some functional form for P(X|Y), P(Y)
This is a ‘generative’ model of the data!
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X= xi)
Discriminative classifiers:
Directly assume some functional form for P(Y|X)
This is a ‘discriminative’ model of the data!
Estimate parameters of P(Y|X) directly from training data
[Graphical models: generative classifier with arrow Y_i → X_i; discriminative classifier with arrow X_i → Y_i]
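To make the contrast concrete, here is a minimal Python sketch (not part of the original slides) that fits one classifier of each kind on synthetic 2-D data. It assumes NumPy and scikit-learn are available; the data-generating parameters are arbitrary.

import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D (illustrative toy data)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)              # estimates class priors and per-class feature Gaussians
disc = LogisticRegression().fit(X, y)     # estimates the P(Y|X) parameters directly

x_new = np.array([[1.0, 1.0]])
print("generative     P(Y|x):", gen.predict_proba(x_new))
print("discriminative P(Y|x):", disc.predict_proba(x_new))

Both expose P(Y | X = x) at prediction time; they differ in what is modeled during training.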
Consider a Gaussian
Generative Classifier
learning f: X → Y, where
X is a vector of real-valued features, < X1…Xn >
Y is boolean
What does that imply about the form of P(Y|X)?
The joint probability of a datum and its label is:
p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu_k, \sigma) = \pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big( -\frac{(x_n - \mu_k)^2}{2\sigma^2} \Big)
Given a datum xn, we predict its label using the conditional probability of the label
given the datum:
p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \exp\!\big( -\frac{(x_n - \mu_k)^2}{2\sigma^2} \big)}{\sum_{k'} \pi_{k'} \exp\!\big( -\frac{(x_n - \mu_{k'})^2}{2\sigma^2} \big)}
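As a quick numeric illustration of the two formulas above (a sketch, not from the lecture; the priors, means, and variance below are made-up values), the posterior follows from the joint by Bayes rule:

import numpy as np

pi = np.array([0.4, 0.6])     # class priors p(y_n^k = 1)
mu = np.array([-1.0, 2.0])    # class-conditional means mu_k
sigma = 1.0                   # shared standard deviation

def class_posterior(x_n):
    # joint: p(x_n, y_n^k = 1) = pi_k * N(x_n | mu_k, sigma^2)
    joint = pi * np.exp(-(x_n - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    # conditional: normalize over classes (Bayes rule)
    return joint / joint.sum()

print(class_posterior(0.3))   # two posterior probabilities summing to 1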
Naïve Bayes Classifier
When X is a multivariate Gaussian vector:
The joint probability of a datum and its label is:
p(x_n, y_n^k = 1 \mid \mu, \Sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\big( -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \big)
The naïve Bayes simplification:
p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{\sqrt{2\pi\sigma_{k,j}^2}} \exp\!\Big( -\frac{(x_n^j - \mu_{k,j})^2}{2\sigma_{k,j}^2} \Big)
More generally:
p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)
where p(\cdot \mid \cdot) is an arbitrary conditional (discrete or continuous) 1-D density.
[Graphical model: label Y_i with children X_i^1, X_i^2, ..., X_i^m]
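The naïve Bayes simplification can be evaluated feature by feature; the following sketch (all parameter values are illustrative assumptions) computes p(x_n, y_n^k = 1) as pi_k times a product of 1-D Gaussians, one per feature.

import numpy as np

pi = np.array([0.5, 0.5])                    # p(y_n^k = 1)
mu = np.array([[0.0, 0.0, 0.0],              # mu_{k,j}: one row per class, one column per feature
               [1.0, 2.0, 3.0]])
sigma = np.array([[1.0, 1.0, 1.0],           # sigma_{k,j}
                  [1.0, 0.5, 2.0]])

def nb_joint(x_n):
    # p(x_n, y_n^k = 1) = pi_k * prod_j N(x_n^j | mu_{k,j}, sigma_{k,j}^2)
    per_feature = np.exp(-(x_n - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return pi * per_feature.prod(axis=1)

x_n = np.array([0.5, 1.5, 2.5])
print(nb_joint(x_n))                         # one joint value per class k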
The predictive distribution
Understanding the predictive distribution:
p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \mu, \sigma, \pi)}{p(x_n \mid \mu, \sigma)} = \frac{\pi_k\, N(x_n \mid \mu_k, \sigma_k)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \sigma_{k'})}   (*)
Under the naïve Bayes assumption:
p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \frac{\exp\!\big( -\sum_j \frac{(x_n^j - \mu_{k,j})^2}{2\sigma_{k,j}^2} + \log \pi_k + C_k \big)}{\sum_{k'} \exp\!\big( -\sum_j \frac{(x_n^j - \mu_{k',j})^2}{2\sigma_{k',j}^2} + \log \pi_{k'} + C_{k'} \big)}   (**)
For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function:
p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\!\big( \sum_j x_n^j \frac{\mu_{2,j} - \mu_{1,j}}{\sigma_j^2} + \sum_j \frac{\mu_{1,j}^2 - \mu_{2,j}^2}{2\sigma_j^2} + \log\frac{\pi_2}{\pi_1} \big)} = \frac{1}{1 + e^{-\theta^T x_n}}
where \theta^T x_n = \theta_0 + \sum_j \theta_j x_n^j, with \theta_j = \frac{\mu_{1,j} - \mu_{2,j}}{\sigma_j^2} and \theta_0 = \log\frac{\pi_1}{\pi_2} - \sum_j \frac{\mu_{1,j}^2 - \mu_{2,j}^2}{2\sigma_j^2}.
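The reduction of (**) to a logistic function can be checked numerically. The sketch below (all parameter values are arbitrary assumptions) computes the two-class Gaussian naïve Bayes posterior directly and via the sigmoid-of-a-linear-function form; the two results agree.

import numpy as np

pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0],       # class 1 means mu_{1,j}
               [2.0, -1.0]])     # class 2 means mu_{2,j}
sigma2 = np.array([1.5, 0.5])    # shared (class-independent) variances sigma_j^2

def gnb_posterior_class1(x):
    # log pi_k - sum_j (x_j - mu_{k,j})^2 / (2 sigma_j^2); the shared Gaussian
    # normalizing constants are identical for both classes and cancel below
    log_joint = np.log(pi) - ((x - mu) ** 2 / (2 * sigma2)).sum(axis=1)
    joint = np.exp(log_joint)
    return joint[0] / joint.sum()

# Equivalent logistic form: p(y^1 = 1 | x) = 1 / (1 + exp(-(theta_0 + theta^T x)))
theta = (mu[0] - mu[1]) / sigma2
theta0 = np.log(pi[0] / pi[1]) - ((mu[0] ** 2 - mu[1] ** 2) / (2 * sigma2)).sum()

x = np.array([0.7, -0.2])
print(gnb_posterior_class1(x))
print(1.0 / (1.0 + np.exp(-(theta0 + theta @ x))))   # same value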
The decision boundary
The predictive distribution:
p(y_n^1 = 1 \mid x_n) = \mu(x_n) = \frac{1}{1 + \exp\!\big( -\theta_0 - \sum_{j=1}^{M} \theta_j x_n^j \big)} = \frac{1}{1 + e^{-\theta^T x_n}}
The Bayes decision rule:
\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \frac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n
so we predict class 1 when \theta^T x_n > 0; the decision boundary is \theta^T x_n = 0.
For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:
p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}
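A small sketch of the K-class softmax predictive distribution (the theta_k vectors below are arbitrary assumptions); subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the probabilities.

import numpy as np

def softmax_posterior(theta, x_n):
    # p(y_n^k = 1 | x_n) = exp(theta_k^T x_n) / sum_j exp(theta_j^T x_n)
    scores = theta @ x_n
    scores = scores - scores.max()       # stability: shift scores before exponentiating
    p = np.exp(scores)
    return p / p.sum()

theta = np.array([[ 1.0, -0.5],          # one row theta_k per class
                  [ 0.2,  0.8],
                  [-1.0,  0.3]])
x_n = np.array([0.4, 1.2])
print(softmax_posterior(theta, x_n))     # K probabilities summing to 1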
Discussion: Generative and
discriminative classifiers
Generative:
Modeling the joint distribution
of all data
Discriminative:
Modeling only points
at the boundary
How? Regression!
Linear regression
The data: \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}
Both nodes are observed: X is an input vector
Y is a response vector
(we first consider y as a generic
continuous response vector, then
we consider the special case of
classification where y is a discrete
indicator)
A regression scheme can be
used to model p(y|x) directly,
rather than p(x,y)
[Graphical model: plate over the N observed pairs (X_i, Y_i)]
Logistic regression (sigmoid
classifier)
The conditional distribution is a Bernoulli:
p(y \mid x) = \mu(x)^y \big(1 - \mu(x)\big)^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We can use the brute-force gradient method as in LR
But we can also apply generic laws by observing that p(y|x) is
an exponential family function, more specifically, a
generalized linear model (see next lecture!)
Training Logistic Regression:
MCLE
Estimate parameters \theta = \langle \theta_0, \theta_1, \ldots, \theta_n \rangle to maximize the
conditional likelihood of training data
Training data D = \{(x_1, y_1), \ldots, (x_N, y_N)\}
Data likelihood = \prod_n p(x_n, y_n \mid \theta)
Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)
Expressing Conditional Log
Likelihood
Recall the logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
and conditional likelihood:
l(\theta) = \log \prod_n p(y_n \mid x_n, \theta) = \sum_n \big[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \big]
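A minimal sketch (with made-up data) of evaluating this conditional log likelihood for a candidate theta; the leading column of ones plays the role of the intercept term theta_0.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def conditional_log_likelihood(theta, X, y):
    # l(theta) = sum_n [ y_n log mu(x_n) + (1 - y_n) log(1 - mu(x_n)) ]
    mu = sigmoid(X @ theta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # first column = intercept feature
y = np.array([1.0, 0.0, 1.0])
print(conditional_log_likelihood(np.array([0.1, 0.8]), X, y))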
Maximizing Conditional Log
Likelihood
The objective: \theta^{MCLE} = \arg\max_\theta l(\theta) = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)
Good news: l(\theta) is a concave function of \theta
Bad news: no closed-form solution to maximize l(\theta)
[Figure: concavity of l(\theta) — the chord value \tfrac{1}{3} l(\theta_1) + \tfrac{2}{3} l(\theta_2) lies below the function value l(\tfrac{1}{3}\theta_1 + \tfrac{2}{3}\theta_2)]
Gradient Ascent
Property of the sigmoid function:
\mu(t) = \frac{1}{1 + e^{-t}}, \qquad \frac{d\mu}{dt} = \mu(1 - \mu)
The gradient:
\frac{\partial l(\theta)}{\partial \theta_j} = \sum_i (y_i - 1)\, x_i^j + \sum_i \big(1 - \mu(\theta^T x_i)\big)\, x_i^j = \sum_i \big(y_i - \mu(\theta^T x_i)\big)\, x_i^j
The gradient ascent algorithm iterates until the change is < \varepsilon: repeat, for all j,
\theta_j \leftarrow \theta_j + \eta \sum_i \big(y_i - \mu(\theta^T x_i)\big)\, x_i^j
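A minimal NumPy sketch of this batch gradient-ascent loop; the step size, stopping threshold, and synthetic data are illustrative assumptions rather than values from the lecture.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    # X: (N, d) with a leading column of ones for the intercept; y: (N,) in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (y - sigmoid(X @ theta))     # dl/dtheta_j = sum_i (y_i - mu_i) x_i^j
        theta_new = theta + eta * grad / len(y)   # averaged gradient for a stable step size
        if np.abs(theta_new - theta).max() < eps:
            return theta_new
        theta = theta_new
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)
print(fit_logistic(X, y))                         # roughly recovers true_theta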
How about MAP?
It is very common to use regularized maximum likelihood.
One common approach is to define priors on \theta – Normal distribution, zero mean, identity covariance
Helps avoid very large weights and overfitting
MAP estimate: \theta^{MAP} = \arg\max_\theta \log \big[ p(\theta) \prod_n p(y_n \mid x_n, \theta) \big]
MLE vs MAP
Maximum conditional likelihood estimate: \theta^{MCLE} = \arg\max_\theta \prod_n p(y_n \mid x_n, \theta)
Maximum a posteriori estimate: \theta^{MAP} = \arg\max_\theta p(\theta \mid X, Y) = \arg\max_\theta p(\theta) \prod_n p(y_n \mid x_n, \theta)
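With the zero-mean Gaussian prior above, the MAP gradient is just the MLE gradient plus a -lambda * theta shrinkage term (lambda is set by the prior variance). A small illustrative sketch, with lambda and the toy data chosen arbitrarily:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def map_gradient(theta, X, y, lam=0.1):
    # d/dtheta [ log P(y | X, theta) + log P(theta) ] with P(theta) = N(0, (1/lam) I)
    return X.T @ (y - sigmoid(X @ theta)) - lam * theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.uniform(size=50) < 0.5).astype(float)
theta = np.array([0.5, -1.0, 2.0])
print(map_gradient(theta, X, y))   # used in place of the MLE gradient in the ascent loop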
Logistic regression: practical
issues
IRLS takes O(Nd³) per iteration, where N = number of training
cases and d = dimension of input x.
Quasi-Newton methods, that approximate the Hessian, work
faster.
Conjugate gradient takes O(Nd) per iteration, and usually
works best in practice.
Stochastic gradient descent can also be used if N is large, cf. the
perceptron rule: \theta \leftarrow \theta + \eta \big(y_n - \mu(\theta^T x_n)\big)\, x_n (one training example at a time)
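A minimal sketch of that one-example-at-a-time update (step size, number of passes, and the synthetic data are illustrative assumptions):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_epoch(theta, X, y, eta=0.05):
    # one pass over the data, updating theta after each example
    for x_n, y_n in zip(X, y):
        theta = theta + eta * (y_n - sigmoid(theta @ x_n)) * x_n
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_theta = np.array([0.3, 1.5, -2.0])
y = (rng.uniform(size=500) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(3)
for _ in range(20):              # a few passes over the data
    theta = sgd_epoch(theta, X, y)
print(theta)                     # roughly approaches true_theta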
Naïve Bayes vs Logistic
Regression
Consider Y boolean, X continuous, X=<X1 ... Xn>
Number of parameters to estimate:
NB: 2n+1
LR: n+1
Estimation method:
NB parameter estimates are uncoupled (super easy)
LR parameter estimates are coupled (less easy)
Naïve Bayes vs Logistic
Regression
Asymptotic comparison (# training examples → ∞)
when model assumptions correct
NB, LR produce identical classifiers
when model assumptions incorrect
LR is less biased – does not assume conditional independence
therefore expected to outperform NB
Naïve Bayes vs Logistic
Regression
Non-asymptotic analysis (see [Ng & Jordan, 2002] )
convergence rate of parameter estimates – how many training
examples needed to assure good estimates?
NB order log n (where n = # of attributes in X)
LR order n
NB converges more quickly to its (perhaps less helpful)
asymptotic estimates
Rate of convergence: logistic
regression
Let h_{Dis,m} be logistic regression trained on m examples in n
dimensions. Then with high probability:
\varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + O\!\Big( \sqrt{\tfrac{n}{m}\log\tfrac{m}{n}} \Big)
Implication: if we want \varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + \epsilon_0
for some small constant \epsilon_0, it suffices to pick order n
examples
Converges to its asymptotic classifier in order n examples
The result follows from Vapnik's structural risk bound, plus the fact that the "VC
dimension" of n-dimensional linear separators is n
Rate of convergence: naïve Bayes
parameters
Let any \epsilon_1, \delta > 0, and any l \ge 0 be fixed.
Assume that for some fixed \rho > 0,
we have that \rho \le p(y = 1) \le 1 - \rho.
Let m = O\big( (1/\epsilon_1^2) \log(n/\delta) \big).
Then with probability at least 1 - \delta, after m examples:
1. For discrete inputs, for all i and b, the estimate of p(x_i = b \mid y) is within \epsilon_1 of its true value
2. For continuous inputs, for all i and b, the estimated class-conditional mean \mu_{i \mid b} and variance \sigma_{i \mid b}^2 are within \epsilon_1 of their true values
Some experiments from UCI data
sets
Summary
Logistic regression
– Functional form follows from Naïve Bayes assumptions
For Gaussian Naïve Bayes assuming variance \sigma_{j,k} = \sigma_j (i.e., variance independent of the class k)
For discrete-valued Naïve Bayes too
But training procedure picks parameters without the
conditional independence assumption
– MLE training: pick W to maximize P(Y | X; W)
– MAP training: pick W to maximize P(W | X, Y)
‘regularization’
helps reduce overfitting
Gradient ascent/descent
– General approach when closed-form solutions unavailable
Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff