Linear Classification Models: Generative

Prof. Navneet Goyal, CS & IS, BITS Pilani

Page 1: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Linear Classification Models: Generative

Prof. Navneet Goyal, CS & IS, BITS Pilani

Page 2: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

• Perceptron

Page 3: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Approaches to Classification

Probabilistic approaches separate the problem into two stages:

Inference stage – use the training data to learn a model for p(Ck|x)

Decision stage – use the posterior probabilities p(Ck|x) to make optimal class assignments

Three broad approaches:

Discriminant function – solve both problems together and simply learn a function that maps x directly onto a class label

Probabilistic generative – first solve the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck (together with the priors p(Ck)), then determine the posterior probabilities

Probabilistic discriminative – model the posterior probabilities p(Ck|x) directly

Page 4: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Models

Probabilistic view of classification! Models with linear decision boundaries arise from simple assumptions about the distribution of the data.

Two approaches:
Generative models (2 steps)
Discriminative models (1 step)

In both cases, decision theory is used to assign a new x to a class.

Page 5: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models

2-step process:
Model the class-conditional densities p(x|Ck) and the class priors p(Ck)
Use them to compute the class posterior probabilities p(Ck|x) according to Bayes' theorem

2-class case: the posterior probability for class C1 is

p(C_1|x) = \frac{p(x|C_1)\,p(C_1)}{p(x|C_1)\,p(C_1) + p(x|C_2)\,p(C_2)} = \sigma(a)

where

a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}

σ(a) is the logistic sigmoid function. The sigmoid is S-shaped and is also called the squashing function, because it maps the whole real line onto a finite interval (it maps a ∈ (−∞, +∞) to the finite interval (0, 1)).

It plays an important role in many classification algorithms.
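A minimal numerical sketch of this two-step computation, assuming one-dimensional Gaussian class-conditional densities with hand-picked parameters (the parameter values are illustrative, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_c1(x, mu1, mu2, var, prior1=0.5):
    """p(C1|x) computed as sigma(a), with a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]."""
    prior2 = 1.0 - prior1
    a = (np.log(gaussian_pdf(x, mu1, var) * prior1)
         - np.log(gaussian_pdf(x, mu2, var) * prior2))
    return 1.0 / (1.0 + np.exp(-a))          # logistic sigmoid of the log-odds

# Example: classes centred at +1 and -1 with equal variance and equal priors.
for x in (-2.0, 0.0, 2.0):
    print(x, posterior_c1(x, mu1=1.0, mu2=-1.0, var=1.0))
```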

Page 6: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models

Symmetry property: \sigma(-a) = 1 - \sigma(a)

The inverse of the logistic sigmoid function is given by

a = \ln\!\left( \frac{\sigma}{1 - \sigma} \right)

and is called the logit or log-odds function, because it represents the log of the ratio of the probabilities for the 2 classes:

a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}

Page 7: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models

The posterior probabilities have been written in an equivalent form using σ: p(C1|x) = σ(a).

What is the significance of doing so? We shall see this shortly, when a(x) takes a simple functional form. If a(x) is a linear function of x, then the posterior probability is governed by a generalized linear model.

Generalized Linear Model?

Page 8: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Generalized Linear Model

In linear regression models, the model prediction y(x, w) was given by a linear function of the parameters w. In the simplest case, the model is also linear in the input variables x,

y(x) = w^T x + w_0

so that y is a real number. In classification, we wish to predict discrete class labels, or more generally posterior probabilities in (0, 1). To achieve this, we consider a generalization of this model in which we transform the linear function of w using a non-linear function f(.), so that

y(x) = f(w^T x + w_0)

In ML, f(.) is known as an activation function, whereas its inverse is called a link function in the statistical literature.

Page 9: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Generalized Linear Model

The decision surface corresponds to y(x) = const, so that w^T x + w_0 = const.

The decision surfaces are therefore linear functions of x, even if the function f(.) is non-linear.

Generalized linear models are no longer linear in the parameters, due to the non-linear function f(.). They are more complex in terms of analytical and computational properties, but still simpler than the more general non-linear models.

Page 10: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – K>2 classes: Softmax Function

Page 11: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – K>2 classes: Softmax Function
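The equations for this slide are not reproduced in the transcript; the standard K-class generalization that the title refers to is

p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_j p(x|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\bigl(p(x|C_k)\,p(C_k)\bigr)

which is the normalized exponential, or softmax, function.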

Page 12: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Softmax Function

Page 13: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Softmax Function

Page 14: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Softmax Function

The soft maximum approximates the hard maximum and is a convex function, just like the hard maximum. But the soft maximum is smooth: it has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum; in fact, the function may well have been invented for optimization.

Page 15: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Softmax Function

• Accuracy of the soft maximum approximation depends on scale
• Multiplying x and y by a large constant brings the soft maximum closer to the hard maximum
• For example, g(1, 2) = 2.31, but g(10, 20) = 20.00004
• The "hardness" of the soft maximum can be controlled by generalizing it to depend on a parameter k:

g(x, y; k) = log( exp(kx) + exp(ky) ) / k

• The soft maximum can be made as close to the hard maximum as desired by making k large enough
• For every value of k the soft maximum is differentiable, but the size of the derivatives increases as k increases
• In the limit, the derivative becomes infinite as the soft maximum converges to the hard maximum
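A small sketch reproducing the numbers above (a direct translation of the formula, no other assumptions):

```python
import math

def soft_max(x, y, k=1.0):
    """Generalized soft maximum: log(exp(k*x) + exp(k*y)) / k."""
    return math.log(math.exp(k * x) + math.exp(k * y)) / k

print(soft_max(1, 2))        # ~2.313, a loose approximation to max(1, 2)
print(soft_max(10, 20))      # ~20.00005, much closer to max(10, 20)
print(soft_max(1, 2, k=10))  # increasing k sharpens the approximation at the same scale
```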

Page 16: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Forms of class-conditional densities

Continuous inputs (x follows a Gaussian distribution)
Discrete inputs (for example, x_i ∈ {0, 1})

Page 17: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Generative Models – Continuous inputs (x follows a Gaussian distribution)

Case 1: all classes share the same covariance matrix
Case 2: the classes do not share the same covariance matrix
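For the shared-covariance case, the log-odds a(x) reduces to a linear function w^T x + w_0, which is the standard result these slides build toward. A minimal sketch with made-up parameters (the means, covariance, and priors below are purely illustrative):

```python
import numpy as np

# Hypothetical parameters for two Gaussian classes sharing the covariance Sigma.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                      # linear weight vector
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1              # bias term
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(p1 / p2))

x = np.array([0.2, -0.4])
a = w @ x + w0                                   # log-odds, linear in x
print("p(C1|x) =", 1.0 / (1.0 + np.exp(-a)))     # posterior via the logistic sigmoid
```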

Page 18: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models

Logistic regression: 2-class and multi-class
Parameters estimated using maximum likelihood: Iterative Reweighted Least Squares (IRLS)
Probit regression

Page 19: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models Logistic Regression

Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous.

Consider a binary response variable – a variable with two outcomes, one outcome represented by a 1 and the other represented by a 0. Examples:
Does the person have a disease? Yes or No
Who is the person voting for? McCain or Obama
Outcome of a baseball game? Win or loss

Page 20: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models Logistic Regression Example Data Set

Response variable: Admission to Grad School (Admit) – 0 if admitted, 1 if not admitted

Predictor variables:
GRE Score (gre) – continuous
University Prestige (topnotch) – 1 if prestigious, 0 otherwise
Grade Point Average (gpa) – continuous

Page 21: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models First 10 Observations of the Data Set

ADMIT  GRE  TOPNOTCH  GPA
1      380  0         3.61
0      660  1         3.67
0      800  1         4.00
0      640  0         3.19
1      520  0         2.93
0      760  0         3.00
0      560  0         2.98
1      400  0         3.08
0      540  0         3.39
1      700  1         3.92

Page 22: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

Consider the linear probability model

E(Y_i) = P(Y_i = 1 | x_i) = \pi(x_i) = \alpha + \beta x_i

Issue: \pi(x_i) can take on values less than 0 or greater than 1.

Issue: the predicted probability for some subjects falls outside the [0, 1] range.

Page 23: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

Consider the logistic regression model

E(Y_i) = P(Y_i = 1 | x_i) = \pi(x_i) = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}

\mathrm{logit}\,\pi(x_i) = \log\!\left( \frac{\pi(x_i)}{1 - \pi(x_i)} \right) = \alpha + \beta x_i

This is a GLM with a binomial random component and logit link, g(μ) = logit(μ).

The range of values for \pi(x_i) is 0 to 1.
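A tiny numerical check of the two formulas above, with made-up values of α and β (hypothetical coefficients, not fitted to the admissions data):

```python
import math

alpha, beta = -4.0, 1.2   # hypothetical coefficients

def pi_logistic(x):
    """pi(x) = exp(alpha + beta*x) / (1 + exp(alpha + beta*x)); always in (0, 1)."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """logit(p) = log(p / (1 - p)); recovers the linear predictor."""
    return math.log(p / (1.0 - p))

p = pi_logistic(3.5)
print(p, logit(p))   # logit(p) equals alpha + beta*3.5 up to rounding
```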

Page 24: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

Consider the logistic regression model

\mathrm{logit}\,\pi(x_i) = \alpha + \beta \cdot gpa_i

and the linear probability model

\pi(x_i) = \alpha + \beta \cdot gpa_i

Then compare the graphs of the predicted probabilities for different grade point averages.

Page 25: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

What is Logistic Regression? In a nutshell:

A statistical method used to model dichotomous or binary outcomes (but not limited to them) using predictor variables. Used when the research question focuses on whether or not an event occurred, rather than when it occurred (time-course information is not used).

Page 26: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

What is Logistic Regression? What is the “Logistic” component?

Instead of modeling the outcome, Y, directly, the method models the log odds(Y) using the logistic function.

Page 27: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

Simple logistic regression = logistic regression with 1 predictor variable

Multiple logistic regression = logistic regression with multiple predictor variables

Multiple logistic regression = multivariable logistic regression = multivariate logistic regression

Page 28: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

\ln\!\left( \frac{P(Y)}{1 - P(Y)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K

Y is the dichotomous outcome, X_1, …, X_K are the predictor variables, and ln[ P(Y) / (1 − P(Y)) ] is the log(odds) of the outcome.

Page 29: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic Regression

\ln\!\left( \frac{P(Y)}{1 - P(Y)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K

\beta_0 is the intercept, \beta_1, …, \beta_K are the model coefficients, and ln[ P(Y) / (1 − P(Y)) ] is the log(odds) of the outcome.

Page 30: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Odds & Probability

\mathrm{Odds}(\text{event}) = \frac{\mathrm{Probability}(\text{event})}{1 - \mathrm{Probability}(\text{event})}

\mathrm{Probability}(\text{event}) = \frac{\mathrm{Odds}(\text{event})}{1 + \mathrm{Odds}(\text{event})}
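The same two identities in code (a straightforward translation, no assumptions beyond the formulas above):

```python
def odds_from_probability(p):
    """Odds = p / (1 - p)."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1.0 + odds)

print(odds_from_probability(0.667))   # about 2 (2-to-1 in favour)
print(probability_from_odds(0.5))     # about 0.333
```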

Page 31: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Maximum Likelihood

Flipped a fair coin 10 times: T, H, H, T, T, H, H, T, H, H

What is the Pr(Heads) given the data? 1/100? 1/5? 1/2? 6/10?

Page 32: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

T, H, H, T, T, H, H, T, H, H

What is the Pr(Heads) given the data? The most reasonable data-based estimate would be 6/10.

In fact,

\hat{p} = \frac{X}{N} = \frac{\#\text{ of heads}}{\text{total }\#\text{ of flips}}

is the ML estimator of p.

Maximum Likelihood

Page 33: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Discrete distribution, finite parameter space

How biased is an unfair coin? Call the probability of tossing a HEAD p; we want to determine p. Toss the coin 80 times; the outcome is 49 HEADS and 31 TAILS.

Suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3, one which gives HEADS with probability p = 1/2, and another which gives HEADS with probability p = 2/3. There are no labels on these coins.

Using maximum likelihood estimation, the coin that has the largest likelihood, given the data that were observed, can be found. Using the probability mass function of the binomial distribution with sample size equal to 80 and number of successes equal to 49, but different values of p (the "probability of success"), the likelihood function takes one of three values (see the sketch below).

Maximum Likelihood: Example
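A short sketch of that calculation: the binomial likelihood of 49 heads in 80 tosses, evaluated at each of the three candidate values of p (the numerical values are computed here rather than quoted from the slide):

```python
from math import comb

n, heads = 80, 49

def likelihood(p):
    """Binomial likelihood of observing `heads` successes in `n` tosses."""
    return comb(n, heads) * p ** heads * (1 - p) ** (n - heads)

for p in (1 / 3, 1 / 2, 2 / 3):
    print(f"p = {p:.3f}  L(p) = {likelihood(p):.6f}")
# The likelihood is largest at p = 2/3, so that coin is the ML choice among the three.
```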

Page 34: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

The likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.

Maximum Likelihood: Example
Discrete distribution, finite parameter space

Page 35: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Discrete distribution, continuous parameter space

Now suppose that there was only one coin, but its p could have been any value 0 ≤ p ≤ 1. The likelihood function to be maximized is

L(p) = f_D(H = 49 | p) = \binom{80}{49}\, p^{49} (1 - p)^{31}

and the maximization is over all possible values 0 ≤ p ≤ 1. Differentiating with respect to p and setting the derivative to zero,

0 = \frac{d}{dp}\left[ \binom{80}{49} p^{49} (1-p)^{31} \right] \;\propto\; 49\, p^{48} (1-p)^{31} - 31\, p^{49} (1-p)^{30}

which has solutions p = 0, p = 1, and p = 49/80. The solution which maximizes the likelihood is clearly p = 49/80; thus the maximum likelihood estimator for p is 49/80.

Maximum Likelihood: Example

Page 36: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Continuous distribution, continuous parameter space

Do it for the Gaussian distribution yourself! Two parameters, μ & σ.

Maximum Likelihood: Example

The expectation value of the ML estimator of the mean is equal to the parameter μ of the given distribution, which means that the maximum-likelihood estimator of μ is unbiased.

The expectation value of the ML estimator of the variance is not equal to σ², which means that this estimator is biased. However, it is consistent.

In this case the likelihood could be maximized individually for each parameter; in general, that may not be the case.
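For reference, the standard results of that exercise, stated rather than derived (a sketch of the answers, not the slide's own working):

\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2, \qquad
\mathbb{E}[\hat{\mu}] = \mu, \qquad
\mathbb{E}[\hat{\sigma}^2] = \frac{N-1}{N}\,\sigma^2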

Page 37: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

The method of maximum likelihood estimation chooses values for parameter estimates (regression coefficients) which make the observed data “maximally likely.” Standard errors are obtained as a by-product of the maximization process

Maximum Likelihood

Page 38: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

The Logistic Regression Model

\ln\!\left( \frac{P(Y)}{1 - P(Y)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K

\beta_0 is the intercept, \beta_1, …, \beta_K are the model coefficients, and ln[ P(Y) / (1 − P(Y)) ] is the log(odds) of the outcome.

Page 39: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Maximum Likelihood

We want to choose the β's that maximize the probability of observing the data we have:

L = \Pr(y_1, y_2, \dots, y_N) = \Pr(y_1)\,\Pr(y_2)\cdots\Pr(y_N) = \prod_{i=1}^{N} \Pr(y_i)

Assumption: independent y’s
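With Pr(y_i = 1) = exp(α + βx_i) / (1 + exp(α + βx_i)), that product is usually maximized through its logarithm. A minimal sketch with hypothetical data (no actual fitting is performed here):

```python
import numpy as np

def log_likelihood(alpha, beta, x, y):
    """Log of prod_i Pr(y_i) for independent binary outcomes y_i in {0, 1},
    with Pr(y_i = 1) = exp(alpha + beta*x_i) / (1 + exp(alpha + beta*x_i))."""
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))   # Pr(y_i = 1)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data; maximum likelihood searches over (alpha, beta) for the largest value.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([0, 0, 1, 1, 1])
print(log_likelihood(-3.0, 2.0, x, y))
```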

Page 40: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

An obvious possibility is to use the traditional linear regression model

\pi = \alpha + \beta x

But this has problems:
The distribution of the dependent variable is hardly normal
Predicted probabilities cannot be less than 0 or greater than 1

Linear probability model

Page 41: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Linear probability model predictions

[Figure: predicted probability versus x (x from −4 to 4, probability axis from −0.5 to 1.5).]

Page 42: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Instead, use the logistic transformation (logit) of the probability, the log of the odds:

\log\!\left( \frac{\pi}{1 - \pi} \right) = \alpha + \beta x

Logistic regression model

Page 43: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic regression model predictions

[Figure: predicted probability versus x (x from −4 to 4, probability axis from 0 to 1).]

Page 44: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Least squares is no longer the best way of estimating the parameters of the logistic regression model.

Instead, use maximum likelihood estimation: find the values of the parameters that have the greatest probability, given the data.

Estimation of the logistic regression model

Page 45: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Data on 24 space shuttle launches prior to Challenger

Dependent variable: whether the shuttle flight experienced a thermal distress incident

Independent variables:
Date – whether shuttle changes or age have an effect
Temperature – whether joint temperature on the booster has an effect

Space shuttle data

Page 46: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Dependent variable: any thermal distress on launch

Independent variable: Date (days since 1/1/60)

SPSS procedure: Regression, Binary logistic

First model: date as the single independent variable

Page 47: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Predicted probability of thermal distress using date

Page 48: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Exponential of B as change in odds

\log\!\left( \frac{\pi}{1 - \pi} \right) = \alpha + \beta x
\quad\Longrightarrow\quad
\frac{\pi}{1 - \pi} = e^{\alpha + \beta x} = e^{\alpha}\, e^{\beta x}
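A quick numerical illustration of this identity with made-up coefficients (hypothetical values, not from the shuttle model):

```python
import math

alpha, beta = -1.0, 0.7   # hypothetical intercept and slope

def odds(x):
    """Odds pi / (1 - pi) implied by the logistic model at x."""
    return math.exp(alpha + beta * x)

print(odds(3) / odds(2))   # each unit increase in x multiplies the odds by exp(beta)
print(math.exp(beta))      # the same number: the "exponential of B" interpretation
```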

Page 49: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Odds is the ratio of the probability of success to the probability of failure – like odds on horse races.

Even odds (odds = 1) implies probability equals 0.5
Odds = 2 means 2 to 1 in favor of success, implies probability of 0.667
Odds = 0.5 means 1 to 2 in favor of (or 2 to 1 against) success, implies probability of 0.333

What does "odds" mean?

Page 50: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Logistic regression can be extended to use multiple independent variables, exactly like linear regression:

\log\!\left( \frac{\pi}{1 - \pi} \right) = \alpha + \beta_1 x_1 + \beta_2 x_2

Multiple logistic regression

Page 51: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Dependent variable: any thermal distress on launch

Independent variables: Date (days since 1/1/60); joint temperature, degrees F

Adding joint temperature to the logistic regression model

Page 52: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models

Find the posterior class probabilities directly: use the functional form of the generalized linear model and determine its parameters directly with the maximum likelihood principle, via Iterative Reweighted Least Squares (IRLS).

This maximizes a likelihood function defined through the conditional distribution p(Ck|x), which represents a form of discriminative training.

Advantages of the discriminative approach: fewer adaptive parameters to be determined (linear in M), and improved predictive performance, particularly when the class-conditional density assumptions give a poor approximation of the true distributions.
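Since the slide names IRLS only in passing, here is a compact sketch of the standard Newton-Raphson/IRLS update for 2-class logistic regression (synthetic data; the function name and data are illustrative):

```python
import numpy as np

def irls_logistic(Phi, t, n_iter=10):
    """Iterative Reweighted Least Squares for 2-class logistic regression.
    Phi: N x M design matrix (first column of ones for the bias); t: N targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))        # predicted probabilities
        R = np.diag(y * (1.0 - y))                # weighting matrix
        H = Phi.T @ R @ Phi                       # Hessian of the negative log-likelihood
        grad = Phi.T @ (y - t)                    # gradient
        w = w - np.linalg.solve(H, grad)          # Newton-Raphson / IRLS step
    return w

# Tiny synthetic example: one feature plus a bias column (classes overlap, so the MLE is finite).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
t = np.array([0, 0, 1, 0, 1, 1])
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic(Phi, t))
```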

Page 53: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models

Classification methods so far work directly with the original input vector x. All such algorithms are still applicable if we first make a fixed non-linear transformation of the inputs using a vector of basis functions ϕ(x).

Decision boundaries are then linear in the feature space ϕ.

In linear models of regression, one of the basis functions is typically set to a constant, say ϕ_0(x) = 1, so that the corresponding parameter w_0 plays the role of a bias.

A fixed basis-function transformation ϕ(x) will be used in what follows.

Page 54: Linear Classification Models: Generative Prof. Navneet Goyal CS & IS BITS, Pilani

Probabilistic Discriminative Models

Original input space (x1, x2) and feature space (ϕ1, ϕ2)

Although we use linear classification models, linear separability in feature space does not imply linear separability in the input space.
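A small sketch of that idea (the particular basis functions here are an illustrative choice, not the slide's): points that are not linearly separable in the input space become separable after a fixed non-linear transformation ϕ(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated by radius: not linearly separable in (x1, x2).
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
t = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

# Fixed basis-function transformation phi(x) = (1, x1^2, x2^2).
Phi = np.column_stack([np.ones(n), X[:, 0] ** 2, X[:, 1] ** 2])

# In feature space a linear discriminant separates the classes:
# w . phi(x) = 1 - x1^2 - x2^2 is positive exactly inside the unit circle.
w = np.array([1.0, -1.0, -1.0])
pred = (Phi @ w > 0).astype(int)
print("training accuracy in feature space:", (pred == t).mean())   # 1.0
```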