Lecture 2: Generative Learning
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Generative Learning
Tuo Zhao — Lecture 2: Generative Learning 2/47
Modeling Dogs
Modeling Cats
Discriminative Learning
(Figure: side-by-side comparison of generative and discriminative approaches.)
Which One is Better for Classification?
Joint and Posterior Distributions
We consider a binary classification problem:
Feature: X ∈ R^d
Response: Y ∈ {0, 1}
Class Prior: P(Y = 1) = p and P(Y = 0) = 1 − p
Posterior: Conditional Probability of Y Given X, i.e.,

P(Y | X) = P(Y) P(X | Y) / P(X).
Discriminative Learning
Posterior is sufficient for prediction:
ŷ = argmax_y P(Y = y | X = x)
  = argmax_y P(Y = y) P(X = x | Y = y) / P(X = x)
  = argmax_y P(Y = y) P(X = x | Y = y)
  = argmax_y P(X = x, Y = y).
Which one to model?
Joint Distribution? Conditional Distribution?
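A minimal sketch of this prediction rule, with hypothetical one-dimensional class-conditional Gaussians (the parameter values are made up for illustration): since P(X = x) does not depend on y, comparing the joint probabilities P(Y = y) P(X = x | Y = y) suffices.

```python
import math

# Hypothetical one-dimensional generative model (illustration only):
# class prior p and two class-conditional Gaussian densities.
p = 0.3  # P(Y = 1)

def gauss_pdf(x, mu, sigma):
    # Univariate Gaussian density.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def predict(x, mu0=-1.0, mu1=2.0, sigma=1.0):
    # argmax_y P(Y = y) * P(X = x | Y = y); the evidence P(X = x) cancels.
    joint0 = (1 - p) * gauss_pdf(x, mu0, sigma)
    joint1 = p * gauss_pdf(x, mu1, sigma)
    return 1 if joint1 > joint0 else 0
```

Points near µ1 = 2 are assigned to class 1, points near µ0 = −1 to class 0, with the prior shifting the boundary.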
Gaussian Discriminant Analysis
Gaussian Discriminant Analysis
Multivariate Gaussian Distribution: X ∼ N(µ,Σ)
Probability Density Function
f(x; µ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
Expectation: EX = µ
Covariance: E(X − µ)(X − µ)> = Σ
Standard Gaussian Distribution: µ = 0 and Σ = Id.
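The density formula above can be evaluated directly. This NumPy sketch (not from the slides) checks that the standard Gaussian in d = 2 has density 1/(2π) at the origin.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # f(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^{d/2} |Sigma|^{1/2})
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Standard Gaussian: mu = 0, Sigma = I_d, evaluated at the origin.
val = mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2))  # equals 1 / (2 pi) for d = 2
```

Using `np.linalg.solve` instead of explicitly inverting Σ is the usual numerically safer choice.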
The covariance can also be defined as Cov(Z) = E[ZZᵀ] − (E[Z])(E[Z])ᵀ. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(µ, Σ), then

Cov(X) = Σ.

Here are some examples of what the density of a Gaussian distribution looks like:
(Figure: three surface plots of two-dimensional Gaussian densities over [−3, 3]².)
The left-most figure shows a Gaussian with mean zero (that is, the 2×1 zero vector) and covariance matrix Σ = I (the 2×2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I, and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more "spread out," and as it becomes smaller, the distribution becomes more "compressed."
Let’s look at some more examples.
(Figure: three more surface plots of two-dimensional Gaussian densities.)
The figures above show Gaussians with mean 0 and with covariance matrices, respectively,
Σ = [1, 0; 0, 1];   Σ = [1, 0.5; 0.5, 1];   Σ = [1, 0.8; 0.8, 1].
The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more "compressed" toward the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:
(Figure: contour plots of the same three densities.)
Here’s one last set of examples generated by varying Σ:
(Figure: contour plots of three densities with varying Σ.)
The plots above used, respectively,
Σ = [1, −0.5; −0.5, 1];   Σ = [1, −0.8; −0.8, 1];   Σ = [3, 0.8; 0.8, 1].
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes "compressed" again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).

As our last set of examples, fixing Σ = I and varying µ, we can also move the mean of the density around.
(Figure: three surface plots of Gaussian densities with identity covariance and different means.)
The figures above were generated using Σ = I, and respectively
µ = [1; 0];   µ = [−0.5; 0];   µ = [−1; −1.5].
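The qualitative behavior above can be reproduced without the plots: draw samples with a Cholesky factor and check the empirical covariance. A sketch; NumPy, the seed, and the sample size are my choices, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from N(0, Sigma) for one of the covariances above and
# check that the empirical covariance recovers Sigma.
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
z = rng.standard_normal((100_000, 2))    # standard normal samples
x = z @ L.T                              # x ~ N(0, Sigma)
emp = np.cov(x, rowvar=False)
# emp is close to Sigma; the large positive off-diagonal entry is what
# "compresses" the density toward the line x1 = x2.
```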
Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Y = y) = p^y (1 − p)^{1−y}

P(X | Y = 0) ∼ N(µ0, Σ), i.e.,

P(X = x | Y = 0) = exp( −(1/2) (x − µ0)ᵀ Σ⁻¹ (x − µ0) ) / ((2π)^{d/2} |Σ|^{1/2})

P(X | Y = 1) ∼ N(µ1, Σ), i.e.,

P(X = x | Y = 1) = exp( −(1/2) (x − µ1)ᵀ Σ⁻¹ (x − µ1) ) / ((2π)^{d/2} |Σ|^{1/2})
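The generative view is literally a sampling recipe: draw Y from Bernoulli(p), then draw X from the corresponding Gaussian. A sketch with assumed parameter values (d = 2, shared Σ; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GDA parameters (assumed for the sketch).
p = 0.4
mu0 = np.array([-1.0, 0.0])
mu1 = np.array([1.5, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

def sample(n):
    # Generative view: Y ~ Bernoulli(p), then X | Y ~ N(mu_Y, Sigma).
    y = rng.binomial(1, p, size=n)
    means = np.where(y[:, None] == 1, mu1, mu0)
    x = means + rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    return x, y

x, y = sample(10_000)
# The fraction of y = 1 samples concentrates around p.
```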
Generative Learning
Maximum Likelihood Estimation:
L(p, µ0, µ1, Σ) = log ∏_{i=1}^{n} f(xi, yi; p, µ0, µ1, Σ)
               = log ∏_{i=1}^{n} h(xi | yi; p, µ0, µ1, Σ) g(yi; p).
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimate of the parameters (see problem set 1) to be:

φ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}

µ0 = ( Σ_{i=1}^{m} 1{y^(i) = 0} x^(i) ) / ( Σ_{i=1}^{m} 1{y^(i) = 0} )

µ1 = ( Σ_{i=1}^{m} 1{y^(i) = 1} x^(i) ) / ( Σ_{i=1}^{m} 1{y^(i) = 1} )

Σ = (1/m) Σ_{i=1}^{m} (x^(i) − µ_{y^(i)})(x^(i) − µ_{y^(i)})ᵀ.
Pictorially, what the algorithm is doing can be seen as follows:

(Figure: training set with the contours of the two fitted Gaussian distributions and the linear decision boundary.)
Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary, we'll predict y = 1 to be the most likely outcome, and on the other side, we'll predict y = 0.
1.3 Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1|x; φ, µ0, µ1, Σ) as a function of x, we'll find that it can be expressed in the form p(y = 1|x) = 1/(1 + exp(−θᵀx)), where θ is an appropriate function of φ, µ0, µ1, Σ. This is exactly the form of a logistic regression model.
Generative Learning
Maximum Likelihood Estimation:

L(p, µ0, µ1, Σ) = log ∏_{i=1}^{n} h(xi | yi; p, µ0, µ1, Σ) g(yi; p).

Convex Minimization:

µ0 = ( Σ_{i=1}^{n} xi (1 − yi) ) / ( n − Σ_{i=1}^{n} yi )  and  µ1 = ( Σ_{i=1}^{n} xi yi ) / ( Σ_{i=1}^{n} yi ),

p = ( Σ_{i=1}^{n} yi ) / n  and  Σk = (1/nk) Σ_{i: yi = k} (xi − µ_{yi})(xi − µ_{yi})ᵀ.

d(d + 1) + 2d + 1 parameters to estimate.
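These closed-form estimators translate directly into code. A sketch following the slide's formulas (per-class Σk), with a tiny made-up dataset as a check; the function name and data are mine.

```python
import numpy as np

def gda_mle(x, y):
    # Closed-form MLE from the slide: p, class means, per-class covariances.
    n = len(y)
    p_hat = y.sum() / n                       # p = (sum_i y_i) / n
    mu0 = x[y == 0].mean(axis=0)              # sum_i x_i (1 - y_i) / (n - sum_i y_i)
    mu1 = x[y == 1].mean(axis=0)              # sum_i x_i y_i / sum_i y_i
    Sigmas = {}
    for k, mu in ((0, mu0), (1, mu1)):
        diff = x[y == k] - mu
        # (1/n_k) sum_{y_i = k} (x_i - mu_k)(x_i - mu_k)^T
        Sigmas[k] = diff.T @ diff / len(diff)
    return p_hat, mu0, mu1, Sigmas

# Made-up toy data (illustration only).
x = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])
p_hat, mu0, mu1, Sigmas = gda_mle(x, y)
```

No iterative optimizer is needed: every estimate is a sample average over the relevant class.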
Gaussian Discriminant Analysis
Prediction: Given X ∈ R^d, we predict

Ŷ = argmax_{Y ∈ {0,1}} P(Y | X; p, µ0, µ1, Σ).

Since we have [Analytical Problem in HW3]

log( P(Y = 1|X) / (1 − P(Y = 1|X)) )
    = −(1/2) (µ1 + µ0)ᵀ Σ⁻¹ (µ1 − µ0) + (µ1 − µ0)ᵀ Σ⁻¹ X + log( p / (1 − p) ),
this is actually a logistic regression model!
But different from maximizing the conditional log likelihood!
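This equivalence is easy to verify numerically: build the logistic weights from the log-odds expression above and compare against the posterior computed directly from Bayes rule. The parameter values here are assumptions for the sketch, not from the homework.

```python
import numpy as np

# Assumed illustrative GDA parameters (d = 2, shared Sigma).
p = 0.3
mu0 = np.array([-1.0, 0.0])
mu1 = np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Log-odds from the slide: linear in x, with weights w and intercept b.
w = Sinv @ (mu1 - mu0)
b = -0.5 * (mu1 + mu0) @ Sinv @ (mu1 - mu0) + np.log(p / (1 - p))

def posterior_direct(x):
    # P(Y = 1 | x) via Bayes rule with the Gaussian class conditionals;
    # the shared normalizing constant cancels in the ratio.
    def dens(mu):
        diff = x - mu
        return np.exp(-0.5 * diff @ Sinv @ diff)
    num = p * dens(mu1)
    return num / (num + (1 - p) * dens(mu0))

x = np.array([0.7, -0.3])
sigmoid = 1 / (1 + np.exp(-(w @ x + b)))
# sigmoid matches posterior_direct(x) up to floating-point error.
```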
GDA vs. Logistic Regression
Gaussian Discriminant Analysis
Modeling Assumption: Terrible
d(d+ 1)/2 + 2d+ 1 parameters: Terrible
Simple with a closed form solution: Not very useful!
Logistic Regression
Modeling Assumption: More Robust!
d parameters: Fewer!
Need an iterative optimization algorithm: Not bad!
Naive Bayes Classification
Naive Bayes Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(X|Y = 0) ∼ N(µ0,Σ)
P(X|Y = 1) ∼ N(µ1,Σ)
Σ = diag(σ1^2, σ2^2, …, σd^2)
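With a diagonal Σ, the features are modeled as conditionally independent given Y, so only d variances need estimating instead of d(d + 1)/2 covariance entries. A minimal sketch; the function names and toy data are mine.

```python
import numpy as np

def nb_gda_fit(x, y):
    # Naive Bayes GDA: class prior, class means, and pooled per-feature
    # variances (the diagonal of the shared covariance).
    p = y.mean()
    mu0, mu1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    resid = np.where(y[:, None] == 1, x - mu1, x - mu0)
    var = (resid ** 2).mean(axis=0)          # sigma_1^2, ..., sigma_d^2
    return p, mu0, mu1, var

def nb_gda_predict(x, p, mu0, mu1, var):
    # Log joint for each class; the (2 pi var) normalizers are shared and cancel.
    ll0 = -0.5 * ((x - mu0) ** 2 / var).sum(axis=1) + np.log(1 - p)
    ll1 = -0.5 * ((x - mu1) ** 2 / var).sum(axis=1) + np.log(p)
    return (ll1 > ll0).astype(int)

# Made-up, well-separated toy data (illustration only).
x = np.array([[-2.0, -2.0], [-1.5, -2.5], [2.0, 2.0], [2.5, 1.5]])
y = np.array([0, 0, 1, 1])
p, mu0, mu1, var = nb_gda_fit(x, y)
preds = nb_gda_predict(x, p, mu0, mu1, var)
```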
4
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
Here’s one last set of examples generated by varying Σ:
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
The plots above used, respectively,
Σ =
!1 -0.5
-0.5 1
"; Σ =
!1 -0.8
-0.8 1
"; .Σ =
!3 0.8
0.8 1
".
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “com-pressed” again, but in the opposite direction. Lastly, as we vary the pa-rameters, more generally the contours will form ellipses (the rightmost figureshowing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also movethe mean of the density around.
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
The figures above were generated using Σ = I, and respectively
µ =
!10
"; µ =
!-0.50
"; µ =
!-1
-1.5
".
4
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
Here’s one last set of examples generated by varying Σ:
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
The plots above used, respectively,
Σ =
!1 -0.5
-0.5 1
"; Σ =
!1 -0.8
-0.8 1
"; .Σ =
!3 0.8
0.8 1
".
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “com-pressed” again, but in the opposite direction. Lastly, as we vary the pa-rameters, more generally the contours will form ellipses (the rightmost figureshowing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also movethe mean of the density around.
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
The figures above were generated using Σ = I, and respectively
µ =
!10
"; µ =
!-0.50
"; µ =
!-1
-1.5
".
Tuo Zhao — Lecture 2: Generative Learning 19/47
![Page 40: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/40.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(X|Y = 0) ∼ N(µ0,Σ)
P(X|Y = 1) ∼ N(µ1,Σ)
Σ =
σ21σ22
. . .
σ2d
4
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
Here’s one last set of examples generated by varying Σ:
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
The plots above used, respectively,
Σ =
!1 -0.5
-0.5 1
"; Σ =
!1 -0.8
-0.8 1
"; .Σ =
!3 0.8
0.8 1
".
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “com-pressed” again, but in the opposite direction. Lastly, as we vary the pa-rameters, more generally the contours will form ellipses (the rightmost figureshowing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also movethe mean of the density around.
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
The figures above were generated using Σ = I, and respectively
µ =
!10
"; µ =
!-0.50
"; µ =
!-1
-1.5
".
4
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
Here’s one last set of examples generated by varying Σ:
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
−3 −2 −1 0 1 2 3−3
−2
−1
0
1
2
3
The plots above used, respectively,
Σ =
!1 -0.5
-0.5 1
"; Σ =
!1 -0.8
-0.8 1
"; .Σ =
!3 0.8
0.8 1
".
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “com-pressed” again, but in the opposite direction. Lastly, as we vary the pa-rameters, more generally the contours will form ellipses (the rightmost figureshowing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also movethe mean of the density around.
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
−3−2
−10
12
3
−3−2
−10
12
3
0.05
0.1
0.15
0.2
0.25
The figures above were generated using Σ = I, and respectively
µ =
!10
"; µ =
!-0.50
"; µ =
!-1
-1.5
".
Tuo Zhao — Lecture 2: Generative Learning 19/47
![Page 41: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/41.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(X|Y = 0) ∼ N(µ0,Σ)
P(X|Y = 1) ∼ N(µ1,Σ)
Σ =
σ21σ22
. . .
σ2d
Tuo Zhao — Lecture 2: Generative Learning 19/47
![Page 43: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/43.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Gaussian Discriminant Analysis
Conditional Independence:
P(X|Y) = ∏_{j=1}^{d} P(X_j|Y) = P(X₁|Y) P(X₂|Y) · · · P(X_d|Y)
A Simpler Decision Rule:
P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / P(X)

= [∏_{j=1}^{d} P(X_j|Y = 1) P(Y = 1)] / [∏_{j=1}^{d} P(X_j|Y = 1) P(Y = 1) + ∏_{j=1}^{d} P(X_j|Y = 0) P(Y = 0)]
Tuo Zhao — Lecture 2: Generative Learning 20/47
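The decision rule above can be sketched numerically. A minimal illustration (not the lecture's code), assuming the per-feature likelihoods P(X_j|Y = k) have already been evaluated:

```python
import numpy as np

def naive_bayes_posterior(lik1, lik0, p1):
    """P(Y=1|X) from per-feature likelihoods P(X_j|Y=k) and prior p1 = P(Y=1)."""
    num = np.prod(lik1) * p1                 # prod_j P(X_j|Y=1) * P(Y=1)
    den = num + np.prod(lik0) * (1.0 - p1)   # + prod_j P(X_j|Y=0) * P(Y=0)
    return num / den

# Toy example with d = 2 features and a uniform prior p = 0.5.
post = naive_bayes_posterior([0.8, 0.6], [0.2, 0.4], 0.5)
```

In practice the products are computed as sums of log-likelihoods to avoid underflow when d is large.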
![Page 45: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/45.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Gaussian Discriminant Analysis
Maximum Likelihood Estimation:
µ₀ = [∑_{i=1}^{n} x_i (1 − y_i)] / [n − ∑_{i=1}^{n} y_i] and µ₁ = [∑_{i=1}^{n} x_i y_i] / [∑_{i=1}^{n} y_i]

p = (1/n) ∑_{i=1}^{n} y_i and σ_j² = (1/n) ∑_{i=1}^{n} (x_{i,j} − µ_{y_i,j})²
3d+ 1 parameters to estimate.
Tuo Zhao — Lecture 2: Generative Learning 21/47
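A minimal NumPy sketch of these estimators (illustrative, not the lecture's code; the data here is synthetic):

```python
import numpy as np

def nb_gda_mle(X, y):
    """MLE for Naive Bayes GDA: class means, per-feature variances, prior."""
    n, d = X.shape
    p = y.mean()                                      # p = (1/n) sum_i y_i
    mu0 = X[y == 0].mean(axis=0)                      # class-conditional means
    mu1 = X[y == 1].mean(axis=0)
    resid = X - np.where(y[:, None] == 1, mu1, mu0)   # x_{i,j} - mu_{y_i, j}
    sigma2 = (resid ** 2).mean(axis=0)                # pooled per-feature variance
    return p, mu0, mu1, sigma2                        # 3d + 1 parameters in total

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 3)) + 2.0 * y[:, None]      # class 1 shifted by +2
p, mu0, mu1, s2 = nb_gda_mle(X, y)
```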
![Page 48: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/48.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Gaussian Discriminant Analysis
Missing values?
Example: X = (X₁, ..., X_{d−1}), with X_d unobserved

P(Y = 1|X) = P(X|Y = 1) P(Y = 1) / P(X)

= [∏_{j=1}^{d−1} P(X_j|Y = 1) P(Y = 1)] / [∏_{j=1}^{d−1} P(X_j|Y = 1) P(Y = 1) + ∏_{j=1}^{d−1} P(X_j|Y = 0) P(Y = 0)]
Tuo Zhao — Lecture 2: Generative Learning 22/47
![Page 49: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/49.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
GDA vs. Naive Bayes GDA
Gaussian Discriminant Analysis
Stronger Modeling Assumption: Terrible
d(d+ 1)/2 + 2d+ 1 parameters: Terrible
A simple closed form solution: Not very useful!
Naive Bayes GDA
Even Stronger Modeling Assumption: Terrible!
3d+ 1 parameters: Good!
A super simple closed form solution: Useful sometimes!
Tuo Zhao — Lecture 2: Generative Learning 23/47
![Page 50: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/50.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Bernoulli Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Xj |Y = 0) ∼ Bernoulli(γ(0)j )
P(Xj |Y = 1) ∼ Bernoulli(γ(1)j )
Conditional Independence:
P(X|Y) = ∏_{j=1}^{d} P(X_j|Y) = P(X₁|Y) P(X₂|Y) · · · P(X_d|Y)
Tuo Zhao — Lecture 2: Generative Learning 24/47
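A small sketch of the Bernoulli model's MLE and posterior (illustrative; the toy data and function names are ours, and no Laplace smoothing is applied):

```python
import numpy as np

def bernoulli_nb_fit(X, y):
    """MLE: gamma_j^(k) = P(X_j = 1 | Y = k) and class prior p = P(Y = 1)."""
    p = y.mean()
    gamma0 = X[y == 0].mean(axis=0)
    gamma1 = X[y == 1].mean(axis=0)
    return p, gamma0, gamma1

def bernoulli_nb_posterior(x, p, gamma0, gamma1):
    """P(Y = 1 | x) for a binary vector x, using conditional independence."""
    l1 = np.prod(gamma1 ** x * (1 - gamma1) ** (1 - x)) * p
    l0 = np.prod(gamma0 ** x * (1 - gamma0) ** (1 - x)) * (1 - p)
    return l1 / (l1 + l0)

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
p, g0, g1 = bernoulli_nb_fit(X, y)
post = bernoulli_nb_posterior(np.array([1, 0]), p, g0, g1)
```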
![Page 54: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/54.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Naive Bayes Poisson Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Xj |Y = 0) ∼ Poisson(λ(0)j )
P(Xj |Y = 1) ∼ Poisson(λ(1)j )
Conditional Independence:
P(X|Y) = ∏_{j=1}^{d} P(X_j|Y) = P(X₁|Y) P(X₂|Y) · · · P(X_d|Y)
Tuo Zhao — Lecture 2: Generative Learning 25/47
![Page 58: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/58.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Example: Spam Email Classification
Data Set:
4601 email messages
Goal: predict whether an email message is spam or good.
Features: the frequencies in a message of 48 of the most commonly occurring words in all these email messages.
We coded spam as 1 and email as 0.
Tuo Zhao — Lecture 2: Generative Learning 26/47
![Page 63: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/63.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Example: Spam Email Classification
Transforming Features:
Naive Bayes GDA:
Relative Frequency of “free” = (# of “free” in this email) / (# of all words in this email)
Naive Bayes Bernoulli DA:
Indicator of “free” = 1 if “free” appears in this email
Naive Bayes Poisson DA: No transformation needed
Coding Problem in HW2.
Tuo Zhao — Lecture 2: Generative Learning 27/47
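The three feature transformations can be sketched in a few NumPy lines (toy counts; the variable names are ours):

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [0, 2, 5]])   # toy word counts: rows = emails, cols = words

rel_freq  = counts / counts.sum(axis=1, keepdims=True)  # Naive Bayes GDA input
indicator = (counts > 0).astype(int)                    # Bernoulli DA input
# Poisson DA needs no transformation: it models the raw counts directly.
```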
![Page 67: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/67.jpg)
Multiclass Fisher Discriminant Analysis
![Page 68: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/68.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Revisiting GDA
A Dimensionality Reduction Perspective:
Between-Class Scatter Matrix:
Γ = ∑_{k=0,1} (n_k/n) (µ_k − µ)(µ_k − µ)ᵀ,

where

µ = (1/n) ∑_{i=1}^{n} x_i, n₁ = ∑_{i=1}^{n} y_i and n₀ = n − n₁.
Rayleigh Quotient Formulation
w = argmax_w (wᵀΓw) / (wᵀΣw) = argmax_w wᵀΓw s.t. wᵀΣw = 1.
Tuo Zhao — Lecture 2: Generative Learning 29/47
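A short NumPy sketch of the between-class scatter matrix for binary labels (illustrative; synthetic data). Since n₀(µ₀ − µ) = −n₁(µ₁ − µ), Γ has rank one in the binary case:

```python
import numpy as np

def between_class_scatter(X, y):
    """Gamma = sum_{k=0,1} (n_k/n) (mu_k - mu)(mu_k - mu)^T."""
    n, d = X.shape
    mu = X.mean(axis=0)                    # overall mean
    Gamma = np.zeros((d, d))
    for k in (0, 1):
        Xk = X[y == k]
        diff = Xk.mean(axis=0) - mu        # class mean minus overall mean
        Gamma += (len(Xk) / n) * np.outer(diff, diff)
    return Gamma

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 3)) + y[:, None]
Gamma = between_class_scatter(X, y)
```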
![Page 70: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/70.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
FDA and Dimension Reduction
Tuo Zhao — Lecture 2: Generative Learning 30/47
![Page 71: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/71.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Multiclass Fisher Discriminant Analysis
Generative View:
Y ∼ Discrete(p₁, ..., p_m) with ∑_{k=1}^{m} p_k = 1
P(X|Y = k) ∼ N(µk,Σ)
Between-Class Scatter Matrix:
Γ = (1/m) ∑_{k=1}^{m} (n_k/n) (µ_k − µ)(µ_k − µ)ᵀ with n_k = ∑_{i=1}^{n} 1(y_i = k).
Rayleigh Quotient Formulation
W = argmax_{W ∈ ℝ^{d×r}} trace(WᵀΓW) s.t. WᵀΣW = I_r.
Tuo Zhao — Lecture 2: Generative Learning 31/47
![Page 76: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/76.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Multiclass Fisher Discriminant Analysis
Tuo Zhao — Lecture 2: Generative Learning 32/47
![Page 77: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/77.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Eigenvalue Problem (Rank-1)
Rayleigh Quotient Formulation
w = argmax_{w ∈ ℝ^d} wᵀAw s.t. wᵀw = 1.

Lagrange Multiplier Method: λ ∈ ℝ

L(w, λ) = wᵀAw − λ(wᵀw − 1).
We only need eigenvectors of A, since
∇_w L(w, λ) = 2Aw − 2λw = 0,

∇_λ L(w, λ) = wᵀw − 1 = 0.
Tuo Zhao — Lecture 2: Generative Learning 33/47
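Assuming a symmetric A, the stationarity condition Aw = λw says the maximizer is the top eigenvector of A; a quick NumPy check (illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T                       # symmetric PSD test matrix

vals, vecs = np.linalg.eigh(A)    # stationary points of the Lagrangian
w = vecs[:, -1]                   # top eigenvector (unit norm)
rayleigh = w @ A @ w              # attains the maximum: the top eigenvalue
```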
![Page 80: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/80.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Eigenvalue Problem (Rank-r)
Rayleigh Quotient Formulation
U = argmax_{U ∈ ℝ^{d×r}} trace(UᵀAU) s.t. UᵀU = I_r.

Lagrange Multiplier Method: Λ ∈ ℝ^{r×r} and Λ = Λᵀ

L(U, Λ) = trace(UᵀAU) − trace(Λᵀ(UᵀU − I_r)).
We only need eigenvectors of A, since
∇_U L(U, Λ) = 2AU − 2UΛ = 0,

∇_Λ L(U, Λ) = UᵀU − I_r = 0.
Tuo Zhao — Lecture 2: Generative Learning 34/47
![Page 83: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/83.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Generalized Eigenvalue Problem
Rayleigh Quotient Formulation
W = argmax_{W ∈ ℝ^{d×r}} trace(WᵀΓW) s.t. WᵀΣW = I_r.

Substituting U = Σ^{1/2} W:

U = argmax_{U ∈ ℝ^{d×r}} trace(UᵀAU) s.t. UᵀU = I_r,

where A = Σ^{−1/2} Γ Σ^{−1/2}.
Eigenvalue Problem!
Tuo Zhao — Lecture 2: Generative Learning 35/47
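A sketch of the whitening substitution in NumPy (synthetic Σ and Γ; Σ^{−1/2} is formed from the eigendecomposition of Σ):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + 3 * np.eye(3)        # positive definite (within) matrix
v = rng.normal(size=3)
Gamma = np.outer(v, v)                 # rank-1 (between) matrix

# Sigma^{-1/2} from the eigendecomposition of Sigma.
d, Q = np.linalg.eigh(Sigma)
S_inv_half = Q @ np.diag(d ** -0.5) @ Q.T
A = S_inv_half @ Gamma @ S_inv_half    # ordinary symmetric eigenvalue problem

lam, U = np.linalg.eigh(A)
W = S_inv_half @ U[:, -1:]             # map back: W = Sigma^{-1/2} U
```

The recovered W satisfies the generalized eigenvalue equation ΓW = λΣW with WᵀΣW = 1.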
![Page 86: Lecture 2: Generative Learningtzhao80/Lectures/Lecture_2.pdf · Lecture 2: Generative Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech](https://reader034.vdocuments.net/reader034/viewer/2022042310/5ed82a150fa3e705ec0df49f/html5/thumbnails/86.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Eigenvalue Problem
Power Iteration:
U^{(t+1)} = QR(Θ U^{(t)}), where Θ = Σ^{−1/2} Γ Σ^{−1/2}.

When r = 1, we have

u^{(t+1)} = Θ u^{(t)} / ‖Θ u^{(t)}‖₂.

We need T = O(gap · log(1/ε)) iterations to guarantee

|uᵀu^{(T)}| ≥ 1 − ε, where gap = λ₁/(λ₁ − λ₂).
Tuo Zhao — Lecture 2: Generative Learning 36/47
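A minimal rank-1 power iteration in NumPy (illustrative; the diagonal Θ makes the limit obvious):

```python
import numpy as np

def power_iteration(Theta, T=200, seed=0):
    """u <- Theta u / ||Theta u||_2, converging to the top eigenvector."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=Theta.shape[0])
    u /= np.linalg.norm(u)                 # random unit-norm start
    for _ in range(T):
        u = Theta @ u
        u /= np.linalg.norm(u)             # renormalize each iterate
    return u

Theta = np.diag([3.0, 1.0, 0.5])           # top eigenvector is e_1
u = power_iteration(Theta)
```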
Quadratic Discriminant Analysis

Generative View:

Y ∼ Bernoulli(p):  P(Y = y) = p^y (1 − p)^{1−y}

X | Y = 0 ∼ N(µ_0, Σ_0):

P(X = x | Y = 0) = exp(−(1/2)(x − µ_0)^T Σ_0^{-1} (x − µ_0)) / ((2π)^{d/2} |Σ_0|^{1/2})

X | Y = 1 ∼ N(µ_1, Σ_1):

P(X = x | Y = 1) = exp(−(1/2)(x − µ_1)^T Σ_1^{-1} (x − µ_1)) / ((2π)^{d/2} |Σ_1|^{1/2})
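To classify with this generative model, combine the class-conditional densities with the prior via Bayes' rule. A small sketch (my own illustration; the parameter values are arbitrary placeholders):

```python
import numpy as np

# Evaluate the QDA class-conditional Gaussian densities and the
# posterior P(Y = 1 | X = x) via Bayes' rule.
def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)

p = 0.4                                      # prior P(Y = 1), placeholder value
mu0, Sigma0 = np.zeros(2), np.eye(2)
mu1, Sigma1 = np.ones(2), 2 * np.eye(2)

x = np.array([0.9, 1.1])
num = p * gaussian_pdf(x, mu1, Sigma1)
den = num + (1 - p) * gaussian_pdf(x, mu0, Sigma0)
posterior = num / den                        # P(Y = 1 | X = x)
print(posterior)
```

Predicting the class with the larger posterior gives the QDA classification rule; because Σ_0 ≠ Σ_1, the resulting decision boundary is quadratic in x.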
Quadratic Discriminant Analysis

Maximum Likelihood Estimation:

L(p, µ_0, µ_1, Σ_0, Σ_1) = log ∏_{i=1}^n h(x_i | y_i; p, µ_0, µ_1, Σ_0, Σ_1) g(y_i; p).

Convex minimization with closed-form solutions:

µ_0 = (∑_{i=1}^n x_i (1 − y_i)) / (n − ∑_{i=1}^n y_i)  and  µ_1 = (∑_{i=1}^n x_i y_i) / (∑_{i=1}^n y_i),

p = (∑_{i=1}^n y_i) / n  and  Σ_k = (1/n_k) ∑_{i: y_i = k} (x_i − µ_k)(x_i − µ_k)^T.

d(d + 1) + 2d + 1 parameters to estimate.
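The closed-form estimators above amount to per-class sample means and covariances plus the label frequency. A sketch on synthetic data (the data-generating shift is a placeholder, not from the slides):

```python
import numpy as np

# Closed-form MLE for QDA: class proportions, per-class means, and
# per-class covariance matrices.
rng = np.random.default_rng(2)
n, d = 200, 3
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, d)) + y[:, None] * 2.0   # class-1 points shifted by 2

p_hat = y.mean()                                     # p = (sum_i y_i) / n
mu0 = X[y == 0].mean(axis=0)                         # mean of class-0 points
mu1 = X[y == 1].mean(axis=0)                         # mean of class-1 points

def cov_mle(Xk, mu):
    # Sigma_k = (1/n_k) sum_{i: y_i = k} (x_i - mu_k)(x_i - mu_k)^T
    Z = Xk - mu
    return Z.T @ Z / len(Xk)

Sigma0 = cov_mle(X[y == 0], mu0)
Sigma1 = cov_mle(X[y == 1], mu1)
print(p_hat, mu0.round(2), mu1.round(2))
```

Note the MLE divides by n_k rather than n_k − 1, so these are the (slightly biased) maximum likelihood covariances rather than the unbiased sample covariances.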
GDA vs. QDA

Gaussian Discriminant Analysis:
- Stronger modeling assumption: terrible!
- d(d + 1)/2 + 2d + 1 parameters: terrible!
- A simple closed-form solution: not very useful!

Quadratic Discriminant Analysis:
- Weaker modeling assumption: still terrible!
- d(d + 1) + 2d + 1 parameters: even more terrible!
- A simple closed-form solution: not very useful!
Multiclass Classification
K-Nearest Neighbor Classification

Classify a point by a majority vote among its K nearest training examples. Very intuitive!
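The majority-vote rule can be sketched in a few lines (my own minimal illustration; the toy training points are placeholders):

```python
import numpy as np

# Minimal K-nearest-neighbor classifier: predict the most common label
# among the K training points closest to the query.
def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]              # indices of the K closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()           # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1])))   # -> 0
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))   # -> 1
```

There is no training phase at all: every prediction scans the full training set, which is why KNN is called a lazy or memory-based method.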
Model Complexity?

Is the model more flexible for larger K's? Not really! Larger K averages over more neighbors and smooths the decision boundary; flexibility actually grows as K shrinks (K = 1 gives the most jagged boundary).
Curse of Dimensionality

In high dimensions, pairwise distances concentrate: all points become nearly equidistant from a query, so the "nearest" neighbors are barely closer than the rest and KNN degrades.
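This concentration effect is easy to see in a quick simulation (my own illustration; uniform data and the relative-contrast statistic are assumptions, not from the slides):

```python
import numpy as np

# As dimension grows, the nearest and farthest neighbors of a query
# become nearly equidistant, so "nearest" loses its meaning.
rng = np.random.default_rng(3)

def relative_contrast(d, n=500):
    X = rng.uniform(size=(n, d))                 # n uniform points in [0, 1]^d
    q = rng.uniform(size=d)                      # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    return (dists.max() - dists.min()) / dists.min()

print(relative_contrast(2))     # large: some neighbors are much closer
print(relative_contrast(500))   # small: all points are about equally far
```

The shrinking contrast is exactly what breaks distance-based methods like KNN in high dimensions.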
Local Linear Regression

Build linear regression models using ONLY the neighbors of the query point.
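A minimal sketch of the idea (my own illustration with assumed choices: 1-D inputs and Gaussian kernel weights rather than a hard K-neighbor cutoff):

```python
import numpy as np

# Local linear regression: at each query x0, fit a weighted least-squares
# line where weights favor training points near x0.
def local_linear_predict(x_train, y_train, x0, bandwidth=0.3):
    w = np.exp(-((x_train - x0) ** 2) / (2 * bandwidth ** 2))  # neighbor weights
    A = np.stack([np.ones_like(x_train), x_train - x0], axis=1)
    # Weighted least squares: solve (A^T W A) beta = (A^T W) y.
    AtW = A.T * w
    beta = np.linalg.solve(AtW @ A, AtW @ y_train)
    return beta[0]                     # intercept = fitted value at x0

x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train)
print(local_linear_predict(x_train, y_train, np.pi / 2))   # close to sin(pi/2) = 1
```

Centering the design matrix at x0 makes the intercept the prediction, which is the standard local-polynomial trick.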
Local Logistic Regression

Build logistic regression models using ONLY the neighbors of the query point.
Multinomial Regression

Given x_1, ..., x_n ∈ R^d, y_1, ..., y_n ∈ {1, 2, ..., m}, and θ*_1, ..., θ*_{m−1} ∈ R^d, for k = 1, ..., m − 1 and i = 1, ..., n,

P(y_i = k) = exp(−x_i^T θ*_k) / (1 + ∑_{j=1}^{m−1} exp(−x_i^T θ*_j)),

P(y_i = m) = 1 / (1 + ∑_{j=1}^{m−1} exp(−x_i^T θ*_j)).

Maximum Likelihood Estimation: still a convex problem.
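The class probabilities above can be evaluated directly (a sketch with placeholder parameter values; the minus sign in the exponent follows the slide's convention, with class m as the reference class):

```python
import numpy as np

# Multinomial regression class probabilities with m - 1 parameter vectors
# and class m as the reference class.
def multinomial_probs(x, thetas):
    # thetas has shape (m - 1, d); returns probabilities for classes 1..m.
    scores = np.exp(-thetas @ x)                   # exp(-x^T theta_k) for k < m
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)  # class m gets 1 / denom

x = np.array([0.5, -1.0])
thetas = np.array([[1.0, 0.0], [0.0, 1.0]])        # m = 3 classes, d = 2 (placeholders)
probs = multinomial_probs(x, thetas)
print(probs, probs.sum())                          # probabilities sum to 1
```

Only m − 1 parameter vectors are needed because the probabilities must sum to one; fixing the reference class removes the redundant degree of freedom.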