Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression

Post on 20-Dec-2015


TRANSCRIPT

Page 1: Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression

Machine Learning

CUNY Graduate Center

Lecture 4: Logistic Regression

Page 2

Today

• Linear Regression
– Bayesians v. Frequentists
– Bayesian Linear Regression

• Logistic Regression
– Linear Model for Classification

Page 3

Regularization: Penalize large weights

• Introduce a penalty term in the loss function.

Regularized Regression (L2-Regularization or Ridge Regression)
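The penalized objective has a closed-form minimizer. A minimal sketch of ridge regression in NumPy (the function name and synthetic data are mine, not from the slides), solving (λI + XᵀX)w = Xᵀt:

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize ||Xw - t||^2 + lam * ||w||^2.

    Closed form: w = (lam * I + X^T X)^{-1} X^T t.
    """
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)
```

Increasing lam shrinks the weights toward zero, which is exactly the "penalize large weights" effect.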

Page 4

More regularization

• The penalty term defines the style of regularization.

• L2-Regularization
• L1-Regularization
• L0-Regularization
– The L0-norm selects the optimal subset of features

Page 5

Curse of dimensionality

• Increasing dimensionality of features increases the data requirements exponentially.

• For example, if a single feature can be accurately approximated with 100 data points, to optimize the joint over two features requires 100*100 data points.

• Models should be small relative to the amount of available data.

• Dimensionality reduction techniques – feature selection – can help.
– L0-regularization is explicit feature selection.
– L1- and L2-regularization approximate feature selection.
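The 100 × 100 example above generalizes to a simple count; a toy sketch (the function name is mine):

```python
def data_needed(points_per_dim, n_dims):
    # Covering a grid at fixed resolution per feature grows
    # exponentially with the number of dimensions.
    return points_per_dim ** n_dims
```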

Page 6

Bayesians v. Frequentists

• What is a probability?

• Frequentists
– A probability is the likelihood that an event will happen.
– It is approximated by the ratio of the number of observed events to the number of total events.
– Assessment is vital to selecting a model.
– Point estimates are absolutely fine.

• Bayesians
– A probability is a degree of believability of a proposition.
– Bayesians require that probabilities be prior beliefs conditioned on data.
– The Bayesian approach “is optimal”, given a good model, a good prior and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
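The contrast can be made concrete with coin flips. A sketch assuming a Beta(a, b) prior (the prior choice is mine; the slides do not specify one):

```python
def freq_estimate(heads, flips):
    # Frequentist: probability approximated by the observed ratio.
    return heads / flips

def bayes_posterior_mean(heads, flips, a=1.0, b=1.0):
    # Bayesian: a Beta(a, b) prior conditioned on the data gives a
    # Beta(a + heads, b + flips - heads) posterior; report its mean.
    return (a + heads) / (a + b + flips)
```

After 3 heads in 3 flips, the frequentist point estimate is 1.0, while the posterior mean hedges toward the prior.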

Page 7

Bayesian Linear Regression

• The previous MLE derivation of linear regression uses point estimates for the weight vector, w.

• Bayesians say, “hold it right there”.
– Use a prior distribution over w to estimate parameters

• Alpha is a hyperparameter over w, where alpha is the precision or inverse variance of the distribution.

• Now optimize:

Page 8

Optimize the Bayesian posterior

As usual it’s easier to optimize after a log transform.


Page 10

Optimize the Bayesian posterior

Ignoring terms that do not depend on w, the result is an IDENTICAL formulation to L2-regularization.
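The equivalence can be checked numerically: the MAP weights under a zero-mean Gaussian prior with precision alpha (and noise precision beta) match L2-regularized least squares with lambda = alpha / beta. A sketch with synthetic data (names are mine):

```python
import numpy as np

def map_weights(X, t, alpha, beta):
    # Maximize the log posterior: -(beta/2)||Xw - t||^2 - (alpha/2)||w||^2.
    d = X.shape[1]
    return np.linalg.solve(alpha * np.eye(d) + beta * X.T @ X, beta * X.T @ t)

def l2_weights(X, t, lam):
    # L2-regularized (ridge) least squares.
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ t)
```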

Page 11

Context

• Overfitting is bad.

• Bayesians vs. Frequentists
– Is one better?
– Machine Learning uses techniques from both camps.

Page 12

Logistic Regression

• Linear model applied to classification

• Supervised: target information is available
– Each data point xi has a corresponding target ti.

• Goal: Identify a function

Page 13

Target Variables

• In binary classification, it is convenient to represent ti as a scalar with a range of [0,1]
– Interpretation of ti as the likelihood that xi is a member of the positive class
– Used to represent the confidence of a prediction.

• For K > 2 classes, ti is often represented as a K-element vector.
– tij represents the degree of membership in class j.
– |ti| = 1
– E.g. a 5-way classification vector: [0, 0, 1, 0, 0]
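The K-element targets are one-hot (or soft-membership) vectors; a minimal sketch of the hard case:

```python
def one_hot(j, k):
    """Target vector with full membership in class j (0-indexed)."""
    t = [0.0] * k
    t[j] = 1.0
    return t
```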

Page 14

Graphical Example of Classification

Page 15

Decision Boundaries

Page 16

Graphical Example of Classification

Page 17

Classification approaches

• Generative
– Models the joint distribution between c and x
– Highest data requirements

• Discriminative
– Fewer parameters to approximate

• Discriminant Function
– May still be trained probabilistically, but not necessarily modeling a likelihood.

Page 18

Treating Classification as a Linear model

Page 19

Relationship between Regression and Classification

• Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression?
– Regression ranges from -infinity to infinity, while class labels are 0, 1.

• Can use a threshold, e.g.
– y >= 0.5 then class 1
– y < 0.5 then class 2

[Flow diagram: f(x) >= 0.5? Yes -> Happy/Good/Class A; No -> Sad/Not Good/Class B]

Page 20

Odds-ratio

• Rather than thresholding, we’ll relate the regression to the class-conditional probability.

• Ratio of the odds of predicting y = 1 vs. y = 0
– If p(y=1|x) = 0.8 and p(y=0|x) = 0.2
– Odds ratio = 0.8/0.2 = 4

• Use a linear model to predict odds rather than a class label.

Page 21

Logit – Log odds ratio function

• LHS: 0 to infinity
• RHS: -infinity to infinity

• Use a log function.
– Has the added bonus of dissolving the division, leading to easy manipulation
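The logit can be written directly; a sketch (the function name is mine):

```python
import math

def logit(p):
    # Log odds: maps probabilities in (0, 1) onto the whole real line,
    # and the log dissolves the division: log(p/(1-p)) = log p - log(1-p).
    return math.log(p / (1.0 - p))
```

At p = 0.8 this recovers the log of the odds ratio 4 from the previous slide.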

Page 22

Logistic Regression

• A linear model used to predict log-odds ratio of two classes

Page 23

Logit to probability

Page 24

Sigmoid function

• Squashing function to map the reals onto a bounded interval.
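The squashing function here is the logistic sigmoid, the inverse of the logit; a sketch with a numerically stable branch:

```python
import math

def sigmoid(a):
    # Maps (-inf, inf) to (0, 1); the branch avoids overflow for large |a|.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)
```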

Page 25

Gaussian Class-conditional

• Assume the data is generated from a Gaussian distribution for each class.

• Leads to a Bayesian formulation of logistic regression.

Page 26

Bayesian Logistic Regression

Page 27

Maximum Likelihood Estimation of Logistic Regression

• Class-conditional Gaussian.

• Multinomial Class distribution.

• As ever, take the derivative of this likelihood function w.r.t. the parameters.

Page 28

Maximum Likelihood Estimation of the prior

Page 31

Discriminative Training

• Take the derivatives w.r.t. the parameters.
– Be prepared for this for homework.

• In the generative formulation, we need to estimate the joint of t and x.
– But we get an intuitive regularization technique.

• Discriminative Training
– Model p(t|x) directly.
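Modeling p(t|x) directly leads to the cross-entropy error for binary targets; a sketch (the names are mine):

```python
import numpy as np

def cross_entropy(w, X, t):
    # E(w) = -sum_n [ t_n log y_n + (1 - t_n) log(1 - y_n) ],
    # where y_n = sigmoid(w . x_n).
    y = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```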

Page 32

What’s the problem with generative training?

• Formulated this way, in D dimensions, this function has D parameters.

• In the generative case, 2D means, and D(D+1)/2 covariance values

• Quadratic growth in the number of parameters.

• We’d rather have linear growth.
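The counts above can be tallied explicitly; a sketch (the bias/prior accounting is my reading of the slide):

```python
def discriminative_params(d):
    # Logistic regression: one weight per dimension (plus a bias) -> linear in d.
    return d + 1

def generative_params(d):
    # Two Gaussian means (2d values), a shared covariance (d(d+1)/2 values),
    # and one class prior -> quadratic in d.
    return 2 * d + d * (d + 1) // 2 + 1
```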

Page 33

Discriminative Training

Page 34

Optimization

• Take the gradient in terms of w
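For the cross-entropy error, the gradient w.r.t. w takes the form grad E = sum_n (y_n - t_n) x_n; a sketch (the names are mine):

```python
import numpy as np

def grad_cross_entropy(w, X, t):
    # Gradient of E(w) = -sum_n [t_n log y_n + (1 - t_n) log(1 - y_n)]:
    # grad E = X^T (y - t), with y = sigmoid(X w).
    y = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (y - t)
```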

Page 38

Optimization: putting it together

Page 39

Optimization

• We know the gradient of the error function, but how do we find the optimum?

• Setting the gradient to zero is nontrivial

• Numerical approximation

Page 40

Gradient Descent

• Take a guess.

• Move in the direction of the negative gradient

• Jump again.

• In a convex function this will converge

• Other methods include Newton-Raphson.
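The loop above, applied to the logistic-regression gradient; a toy sketch on synthetic data (the step size and iteration count are arbitrary choices of mine):

```python
import numpy as np

def gradient_descent(X, t, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])            # take a guess
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (y - t))       # move against the gradient
    return w
```

Because the cross-entropy error is convex in w, this converges toward the optimum; Newton-Raphson typically needs far fewer iterations.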

Page 41

Multi-class discriminant functions

• Can extend to multiple classes

• Other approaches include constructing K-1 binary classifiers.

• Each classifier compares cn to not cn

• Computationally simpler, but not without problems
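The direct multi-class extension replaces the sigmoid with the softmax over K activations; a sketch (the slides do not give this form explicitly):

```python
import numpy as np

def softmax(a):
    # Normalized exponentials over K activations; subtracting the max
    # avoids overflow without changing the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()
```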

Page 42

Exponential Model

• Logistic Regression is a type of exponential model.
– Linear combination of weights and features to produce a probabilistic model.

Page 43

Problems with Binary Discriminant functions

Page 44

K-class discriminant

Page 45

Entropy

• Measure of uncertainty, or measure of “information”.

• High uncertainty equals high entropy.

• Rare events are more “informative” than common events.
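The measure can be computed directly; a sketch in bits (the function name is mine):

```python
import math

def entropy(p):
    # H(p) = -sum_i p_i log2(p_i); terms with p_i = 0 contribute nothing.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

A fair coin carries one bit of uncertainty, a certain outcome carries none, and the uniform distribution maximizes H.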

Page 46

Entropy

• How much information is received when observing ‘x’?

• If independent, p(x,y) = p(x)p(y).
– H(x,y) = H(x) + H(y)
– The information contained in two unrelated events is equal to their sum.

Page 47

Entropy

• Binary coding of p(x): -log p(x)
– “How many bits does it take to represent a value p(x)?”
– How many “decimal” places? How many binary decimal places?

• Expected value of observed information

Page 48

Examples of Entropy

• Uniform distributions have the highest entropy.

Page 49

Maximum Entropy

• Logistic Regression is also known as Maximum Entropy.

• The entropy objective is concave, so the optimization is expected to converge.

• Constrain this optimization to enforce good classification.

• Increase the likelihood of the data while keeping the distribution of weights as even as possible.
– Include as many useful features as possible.

Page 50

Maximum Entropy with Constraints

• From the Klein and Manning tutorial.

Page 51

Optimization formulation

• If we let the weights represent likelihoods of value for each feature.

For each feature i

Page 52

Solving MaxEnt formulation

• Convex optimization with a concave objective function and linear constraints.

• Lagrange Multipliers

For each feature i

Dual representation of the maximum likelihood estimation of Logistic Regression

Page 53

Summary

• Bayesian Regularization
– Introduction of a prior over parameters serves to constrain weights

• Logistic Regression
– Log odds to construct a linear model
– Formulation with Gaussian Class Conditionals
– Discriminative Training
– Gradient Descent

• Entropy
– Logistic Regression as Maximum Entropy.

Page 54

Next Time

• Graphical Models

• Read Chapter 8.1, 8.2
