
Page 1: Lecture 2: Math Primer

Machine Learning, CUNY Graduate Center

Lecture 2: Math Primer

Page 2: Lecture 2: Math Primer

2

Today

• Probability and Statistics
  – Naïve Bayes Classification
• Linear Algebra
  – Matrix Multiplication
  – Matrix Inversion
• Calculus
  – Vector Calculus
  – Optimization
  – Lagrange Multipliers

Page 3: Lecture 2: Math Primer

3

Classical Artificial Intelligence

• Expert Systems
• Theorem Provers
• Shakey
• Chess

• Largely characterized by determinism.

Page 4: Lecture 2: Math Primer

4

Modern Artificial Intelligence

• Fingerprint ID
• Internet Search
• Vision – facial ID, object recognition
• Speech Recognition
• Asimo
• Jeopardy!

• Statistical modeling to generalize from data.

Page 5: Lecture 2: Math Primer

5

Two Caveats about Statistical Modeling

• Black Swans
• The Long Tail

Page 6: Lecture 2: Math Primer

6

Black Swans

• In the 17th Century, all known swans were white.
• Based on that evidence, it seemed impossible for a swan to be anything other than white.
• In the 18th Century, black swans were discovered in Western Australia.
• Black Swans are rare, sometimes unpredictable events that have extreme impact.

• Almost all statistical models underestimate the likelihood of unseen events.

Page 7: Lecture 2: Math Primer

7

The Long Tail
• Many events follow an exponential distribution.
• These distributions have a very long “tail”.
  – I.e., a large region with significant probability mass, but low likelihood at any particular point.

• Often, interesting events occur in the Long Tail, but it is difficult to accurately model behavior in this region.

Page 8: Lecture 2: Math Primer

8

Boxes and Balls

• 2 boxes, one red and one blue.
• Each contains colored balls.

Page 9: Lecture 2: Math Primer

9

Boxes and Balls

• Suppose we randomly select a box, then randomly draw a ball from that box.

• The identity of the Box is a random variable, B.

• The identity of the ball is a random variable, L.

• B can take 2 values: r or b.
• L can take 2 values: g or o.

Page 10: Lecture 2: Math Primer

10

Boxes and Balls

• Given some information about B and L, we want to ask questions about the likelihood of different events.

• What is the probability of selecting a green ball?

• If I chose an orange ball, what is the probability that I chose from the blue box?

Page 11: Lecture 2: Math Primer

11

Some basics
• The probability (or likelihood) of an event is the fraction of times that the event occurs out of n trials, as n approaches infinity.
• Probabilities lie in the range [0,1].
• Mutually exclusive events are events that cannot simultaneously occur.
  – The probabilities of a set of mutually exclusive, exhaustive events must sum to 1.
• If two events are independent, then:

  p(X, Y) = p(X)p(Y)
  p(X|Y) = p(X)

Page 12: Lecture 2: Math Primer

12

Joint Probability – P(X,Y)
• A joint probability function defines the likelihood of two (or more) events occurring together.
• Let nij be the number of times event i and event j simultaneously occur.

             Orange   Green   Total
  Blue box      1       3       4
  Red box       6       2       8
  Total         7       5      12

Page 13: Lecture 2: Math Primer

13

Generalizing the Joint Probability
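The equation on this slide did not survive the transcript; in terms of the counts nij defined above and the total number of trials N, the standard form is:

  p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}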

Page 14: Lecture 2: Math Primer

14

Marginalization
• Consider the probability of X irrespective of Y.
• The number of instances in column j is the sum of the instances in each cell of that column.

• Therefore, we can marginalize or “sum over” Y:
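The marginalization formula itself is not in the transcript; writing cj for the column total (the number of instances with X = xj), the standard statement is:

  p(X = x_j) = \frac{c_j}{N} = \sum_i p(X = x_j, Y = y_i)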

Page 15: Lecture 2: Math Primer

15

Conditional Probability

• Consider only instances where X = xj.
• The fraction of these instances where Y = yi is the conditional probability.
  – “The probability of y given x”
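The conditional probability formula is not preserved in the transcript; with cj the number of instances in column j, the standard form is:

  p(Y = y_i \mid X = x_j) = \frac{n_{ij}}{c_j}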

Page 16: Lecture 2: Math Primer

16

Relating the Joint, Conditional and Marginal

Page 17: Lecture 2: Math Primer

17

Sum and Product Rules
• In general, we’ll refer to a distribution over a random variable as p(X) and a distribution evaluated at a particular value as p(x).

Sum Rule

Product Rule
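The equations for these two rules are missing from the transcript; their standard statements are:

  \text{Sum rule:} \quad p(X) = \sum_Y p(X, Y)
  \text{Product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X)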

Page 18: Lecture 2: Math Primer

18

Bayes Rule
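The equation itself is not in the transcript; Bayes’ rule, which follows from the product rule and the symmetry p(X, Y) = p(Y, X), is:

  p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_Y p(X \mid Y)\, p(Y)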

Page 19: Lecture 2: Math Primer

19

Interpretation of Bayes Rule

• Prior: Information we have before observation.

• Posterior: The distribution of Y after observing X

• Likelihood: The likelihood of observing X given Y

(On the slide, the terms of the Bayes rule equation are labeled “Posterior”, “Likelihood”, and “Prior”.)

Page 20: Lecture 2: Math Primer

20

Boxes and Balls with Bayes Rule

• Assume I’m inherently more likely to select the red box (66.6%) than the blue box (33.3%).

• If I selected an orange ball, what is the likelihood that I selected the red box? – The blue box?

Page 21: Lecture 2: Math Primer

21

Boxes and Balls
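The worked solution on this slide is missing; a reconstruction using the stated prior (p(r) = 2/3, p(b) = 1/3) and the box compositions from the earlier table (red: 6 orange, 2 green; blue: 1 orange, 3 green) gives:

  p(o) = p(o \mid r)\, p(r) + p(o \mid b)\, p(b) = \tfrac{3}{4} \cdot \tfrac{2}{3} + \tfrac{1}{4} \cdot \tfrac{1}{3} = \tfrac{7}{12}
  p(r \mid o) = \frac{p(o \mid r)\, p(r)}{p(o)} = \frac{1/2}{7/12} = \tfrac{6}{7}, \qquad p(b \mid o) = \tfrac{1}{7}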

Page 22: Lecture 2: Math Primer

22

Naïve Bayes Classification

• This is a simple case of a simple classification approach.

• Here the Box is the class, and the colored ball is a feature, or the observation.

• We can extend this Bayesian classification approach to incorporate more independent features.

Page 23: Lecture 2: Math Primer

23

Naïve Bayes Classification

• Some theory first.

Page 24: Lecture 2: Math Primer

24

Naïve Bayes Classification

• Assuming independent features simplifies the math.
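The factorization the slide refers to is not in the transcript; for a class C and features f1, …, fn, the naïve Bayes assumption gives:

  p(C \mid f_1, \ldots, f_n) \propto p(C) \prod_{i=1}^{n} p(f_i \mid C)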

Page 25: Lecture 2: Math Primer

25

Naïve Bayes Example Data

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD LIGHT SOFT BLUE

COLD HEAVY FIRM BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT HEAVY FIRM ?????

Page 26: Lecture 2: Math Primer

26

Naïve Bayes Example Data

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD LIGHT SOFT BLUE

COLD HEAVY FIRM BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT HEAVY FIRM ?????

Prior:
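The prior computation is not shown in the transcript; from the eight labeled rows above (four RED, four BLUE), it is:

  p(\text{RED}) = \tfrac{4}{8} = \tfrac{1}{2}, \qquad p(\text{BLUE}) = \tfrac{4}{8} = \tfrac{1}{2}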

Page 27: Lecture 2: Math Primer

27

Naïve Bayes Example Data

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD LIGHT SOFT BLUE

COLD HEAVY SOFT BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT HEAVY FIRM ?????

Page 28: Lecture 2: Math Primer

28

Naïve Bayes Example Data

HOT LIGHT SOFT RED

COLD HEAVY SOFT RED

HOT HEAVY FIRM RED

HOT LIGHT FIRM RED

COLD LIGHT SOFT BLUE

COLD HEAVY SOFT BLUE

HOT HEAVY FIRM BLUE

HOT LIGHT FIRM BLUE

HOT HEAVY FIRM ?????
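The likelihood and posterior computations for the unlabeled row are not preserved in the transcript. The following Python sketch reproduces the naïve Bayes calculation from the table as shown on the last two slides (note that one row appears as COLD HEAVY FIRM BLUE on the earlier repetitions of the table and COLD HEAVY SOFT BLUE here); the variable names are illustrative, not from the original.

from collections import Counter

# Training rows: three categorical features followed by the class label.
data = [
    ("HOT",  "LIGHT", "SOFT", "RED"),
    ("COLD", "HEAVY", "SOFT", "RED"),
    ("HOT",  "HEAVY", "FIRM", "RED"),
    ("HOT",  "LIGHT", "FIRM", "RED"),
    ("COLD", "LIGHT", "SOFT", "BLUE"),
    ("COLD", "HEAVY", "SOFT", "BLUE"),
    ("HOT",  "HEAVY", "FIRM", "BLUE"),
    ("HOT",  "LIGHT", "FIRM", "BLUE"),
]
query = ("HOT", "HEAVY", "FIRM")  # the row to classify

class_counts = Counter(row[-1] for row in data)
n = len(data)

scores = {}
for c, n_c in class_counts.items():
    rows_c = [row for row in data if row[-1] == c]
    score = n_c / n  # prior p(C = c)
    # Multiply by p(feature_i = value | C = c), assuming the features
    # are conditionally independent given the class.
    for i, value in enumerate(query):
        score *= sum(1 for row in rows_c if row[i] == value) / n_c
    scores[c] = score

print(scores)                       # unnormalized posteriors
print(max(scores, key=scores.get))  # predicted class for the query row

With this table the unnormalized scores come out to 3/32 for RED and 1/16 for BLUE, so the query row HOT HEAVY FIRM would be labeled RED.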

Page 29: Lecture 2: Math Primer

29

Continuous Probabilities

• So far, X has been discrete: it can take one of M values.

• What if X is continuous?
• Now p(x) is a continuous probability density function.
• The probability that x will lie in an interval (a, b) is:
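The integral itself was an equation on the slide; the standard statement is:

  p(x \in (a, b)) = \int_a^b p(x)\, dx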

Page 30: Lecture 2: Math Primer

30

Continuous probability example

Page 31: Lecture 2: Math Primer

31

Properties of probability density functions

Sum Rule

Product Rule
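The equations for these properties are not in the transcript; for a probability density function they take the standard forms:

  p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1
  \text{Sum rule:} \quad p(x) = \int p(x, y)\, dy, \qquad \text{Product rule:} \quad p(x, y) = p(y \mid x)\, p(x)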

Page 32: Lecture 2: Math Primer

32

Expected Values

• Given a random variable, with a distribution p(X), what is the expected value of X?
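The definition was an equation on the slide; the standard forms, for a function f of x, are:

  \mathbb{E}[f] = \sum_x p(x)\, f(x) \ \text{(discrete)}, \qquad \mathbb{E}[f] = \int p(x)\, f(x)\, dx \ \text{(continuous)}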

Page 33: Lecture 2: Math Primer

33

Multinomial Distribution

• If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.

• The probability of x being in state k is μk
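The slide’s notation is not preserved; in the usual 1-of-K encoding, where x is a binary vector with exactly one component equal to 1, the distribution is:

  p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \mu_k \ge 0, \quad \sum_{k=1}^{K} \mu_k = 1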

Page 34: Lecture 2: Math Primer

34

Expected Value of a Multinomial

• The expected value is the vector of mean values, (μ1, …, μK).

Page 35: Lecture 2: Math Primer

35

Gaussian Distribution

• One Dimension

• D-Dimensions
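The density functions themselves did not survive the transcript; the standard forms are:

  \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)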

Page 36: Lecture 2: Math Primer

36

Gaussians

Page 37: Lecture 2: Math Primer

37

How machine learning uses statistical modeling

• Expectation
  – The expected value of a function is the hypothesis.
• Variance
  – The variance is the confidence in that hypothesis.

Page 38: Lecture 2: Math Primer

38

Variance
• The variance of a random variable describes how much variability there is around the expected value.
• Calculated as the expected squared error.
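In symbols (the equation is not in the transcript):

  \operatorname{var}[X] = \mathbb{E}\big[ (X - \mathbb{E}[X])^2 \big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2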

Page 39: Lecture 2: Math Primer

39

Covariance

• The covariance of two random variables expresses how they vary together.

• If two variables are independent, their covariance equals zero.
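In symbols (the equation is not in the transcript):

  \operatorname{cov}[X, Y] = \mathbb{E}\big[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \big] = \mathbb{E}[XY] - \mathbb{E}[X]\, \mathbb{E}[Y]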

Page 40: Lecture 2: Math Primer

40

Linear Algebra
• Vectors
  – A one-dimensional array.
  – If not specified, assume x is a column vector.
• Matrices
  – A higher-dimensional array.
  – Typically denoted with capital letters.
  – n rows by m columns.

Page 41: Lecture 2: Math Primer

41

Transposition

• Transposing a matrix swaps columns and rows.

Page 42: Lecture 2: Math Primer

42

Transposition

• Transposing a matrix swaps columns and rows.
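The example matrix from this slide is not in the transcript; an illustrative one (not from the original) is:

  (A^T)_{ij} = A_{ji}, \qquad A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \ \Rightarrow\ A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}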

Page 43: Lecture 2: Math Primer

43

Addition

• Two matrices can be added iff they have the same dimensions.
  – A and B must both be n-by-m matrices.

Page 44: Lecture 2: Math Primer

44

Multiplication
• To multiply two matrices, the inner dimensions must be the same.
  – An n-by-m matrix can be multiplied by an m-by-k matrix, giving an n-by-k result.
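The entry-wise formula was an equation on the slide; the standard definition is:

  (AB)_{ij} = \sum_{l=1}^{m} A_{il} B_{lj}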

Page 45: Lecture 2: Math Primer

45

Inversion

• The inverse of an n-by-n (square) matrix A is denoted A^-1, and has the property A A^-1 = A^-1 A = I.

• Here I is the identity matrix, an n-by-n matrix with ones along the diagonal.
  – Iij = 1 iff i = j, 0 otherwise.

Page 46: Lecture 2: Math Primer

46

Identity Matrix

• Matrices are invariant under multiplication by the identity matrix.

Page 47: Lecture 2: Math Primer

47

Helpful matrix inversion properties
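The identities on this slide are not in the transcript; the standard facts usually collected under this heading (assuming the relevant inverses exist) are:

  (AB)^{-1} = B^{-1} A^{-1}, \qquad (A^T)^{-1} = (A^{-1})^T, \qquad (A^{-1})^{-1} = A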

Page 48: Lecture 2: Math Primer

48

Norm

• The norm of a vector x represents its Euclidean length.
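In symbols (the equation is not in the transcript):

  \lVert x \rVert = \sqrt{x^T x} = \sqrt{\textstyle\sum_i x_i^2}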

Page 49: Lecture 2: Math Primer

49

Positive Definiteness

• Quadratic form
  – Scalar
  – Vector

• Positive Definite matrix M

• Positive Semi-definite
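The definitions on this slide are not preserved; the standard statements are:

  \text{Quadratic form:} \quad f(x) = x^T M x \ \text{(a scalar), for a vector } x \text{ and square matrix } M
  \text{Positive definite:} \quad x^T M x > 0 \ \text{for all } x \ne 0
  \text{Positive semi-definite:} \quad x^T M x \ge 0 \ \text{for all } x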

Page 50: Lecture 2: Math Primer

50

Calculus

• Derivatives and Integrals
• Optimization

Page 51: Lecture 2: Math Primer

51

Derivatives

• The derivative of a function gives the slope of the function at a point x.
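The definition was an equation on the slide; the standard form, with a simple example (not necessarily the slide’s), is:

  f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}, \qquad \text{e.g. } f(x) = x^2 \ \Rightarrow\ f'(x) = 2x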

Page 52: Lecture 2: Math Primer

52

Derivative Example

Page 53: Lecture 2: Math Primer

53

Integrals

• Integration is the inverse operation of differentiation (up to an additive constant).

• Graphically, an integral can be considered the area under the curve defined by f(x).

Page 54: Lecture 2: Math Primer

54

Integration Example
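The worked example on this slide is not recoverable from the transcript; a standard illustration of the same idea (not necessarily the slide’s example) is:

  \int_a^b f(x)\, dx = F(b) - F(a) \ \text{where } F' = f, \qquad \text{e.g. } \int_0^1 x^2\, dx = \tfrac{1}{3}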

Page 55: Lecture 2: Math Primer

55

Vector Calculus

• Differentiation with respect to a matrix or vector

• Gradient
• Change of Variables with a Vector

Page 56: Lecture 2: Math Primer

56

Derivative w.r.t. a vector

• Given a vector x and a function f(x), how can we find f’(x)?

Page 57: Lecture 2: Math Primer

57

Derivative w.r.t. a vector

• Given a vector x and a function f(x), how can we find f’(x)?
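The answer shown on the slide is missing; for a scalar-valued f and a column vector x with n components, the derivative is the vector of partial derivatives:

  \frac{\partial f}{\partial \mathbf{x}} = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T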

Page 58: Lecture 2: Math Primer

58

Example Derivation

Page 59: Lecture 2: Math Primer

59

Example Derivation

Also referred to as the gradient of a function.

Page 60: Lecture 2: Math Primer

60

Useful Vector Calculus identities

• Scalar Multiplication

• Product Rule
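The identities themselves are not in the transcript; standard ones of this kind (using the convention that the derivative of a scalar with respect to a column vector is a column vector) are:

  \frac{\partial}{\partial \mathbf{x}} \big( \mathbf{a}^T \mathbf{x} \big) = \mathbf{a}, \qquad \frac{\partial}{\partial \mathbf{x}} \big( f(\mathbf{x})\, g(\mathbf{x}) \big) = g(\mathbf{x})\, \frac{\partial f}{\partial \mathbf{x}} + f(\mathbf{x})\, \frac{\partial g}{\partial \mathbf{x}} \ \text{for scalar-valued } f, g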

Page 61: Lecture 2: Math Primer

61

Useful Vector Calculus identities

• Derivative of an inverse

• Change of Variable
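The corresponding formulas are missing; standard statements (not necessarily in the slide’s exact notation) are:

  \frac{\partial A^{-1}}{\partial x} = -A^{-1}\, \frac{\partial A}{\partial x}\, A^{-1}, \qquad \int f(\mathbf{x})\, d\mathbf{x} = \int f(\mathbf{g}(\mathbf{u}))\, \lvert \det J \rvert\, d\mathbf{u}, \quad J_{ij} = \frac{\partial g_i}{\partial u_j}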

Page 62: Lecture 2: Math Primer

62

Optimization

• We have an objective function, f(x), that we’d like to maximize or minimize.

• To find candidate optima, set the first derivative to zero and solve.
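A one-variable illustration (not from the original slides):

  f(x) = (x - 3)^2 \ \Rightarrow\ f'(x) = 2(x - 3) = 0 \ \Rightarrow\ x = 3, \ \text{a minimum since } f''(x) = 2 > 0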

Page 63: Lecture 2: Math Primer

63

Optimization with constraints

• What if I want to constrain the parameters of the model?
  – E.g., the mean is less than 10.

• Find the best likelihood, subject to a constraint.

• Two functions:
  – An objective function to maximize
  – An inequality that must be satisfied

Page 64: Lecture 2: Math Primer

64

Lagrange Multipliers

• Find maxima of f(x,y) subject to a constraint.

Page 65: Lecture 2: Math Primer

65

General form

• Maximizing:

• Subject to:

• Introduce a new variable (a Lagrange multiplier) and find a maximum.
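The general form was shown as equations on the slide; the standard construction for maximizing f(x) subject to g(x) = 0 is to introduce the Lagrangian

  L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x})

and set its derivatives with respect to x and λ to zero (the latter recovers the constraint g(x) = 0).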

Page 66: Lecture 2: Math Primer

66

Example

• Maximizing:

• Subject to:

• Introduce a new variable and find a maximum.

Page 67: Lecture 2: Math Primer

67

Example

We now have 3 equations with 3 unknowns.

Page 68: Lecture 2: Math Primer

68

Example
Eliminate λ, substitute, and solve.
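The specific functions used in this example are not recoverable from the transcript; an illustrative problem with the same structure (three equations, three unknowns; not necessarily the slide’s) is to maximize f(x, y) = xy subject to g(x, y) = x + y - 1 = 0:

  L(x, y, \lambda) = xy + \lambda (x + y - 1)
  \frac{\partial L}{\partial x} = y + \lambda = 0, \quad \frac{\partial L}{\partial y} = x + \lambda = 0, \quad \frac{\partial L}{\partial \lambda} = x + y - 1 = 0

Eliminating λ gives x = y; substituting into the constraint gives x = y = 1/2, with maximum value f = 1/4.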

Page 69: Lecture 2: Math Primer

69

Why does Machine Learning need these tools?

• Calculus
  – We need to identify the maximum likelihood or minimum risk solution: optimization.
  – Integration allows the marginalization of continuous probability density functions.
• Linear Algebra
  – Many features lead to high-dimensional spaces.
  – Vectors and matrices allow us to compactly describe and manipulate high-dimensional feature spaces.

Page 70: Lecture 2: Math Primer

70

Why does Machine Learning need these tools?

• Vector Calculus
  – All of the optimization needs to be performed in high-dimensional spaces.
  – Optimization of multiple variables simultaneously – gradient descent.
  – We want to take marginals over high-dimensional distributions like Gaussians.

Page 71: Lecture 2: Math Primer

71

Next Time

• Linear Regression and Regularization

• Read Sections 1.1, 3.1, and 3.3.