
Introduction to Machine Learning
Lecture 2

Chaohui Wang

October 14, 2019


Outline of This Lecture

Probability Theory (review)

Bayes Decision Theory

Probability Density Estimation



Basic Concepts

Let us consider the scenario where:
• Two discrete variables: X ∈ {x_i} and Y ∈ {y_j}
• N trials, and denote:
  n_ij = #{X = x_i ∧ Y = y_j},  c_i = #{X = x_i},  r_j = #{Y = y_j}

→ We then have:
• Joint probability: Pr(X = x_i, Y = y_j) = n_ij / N
• Marginal probability: Pr(X = x_i) = c_i / N
• Conditional probability: Pr(Y = y_j | X = x_i) = n_ij / c_i
• Sum rule: Pr(X = x_i) = (1/N) Σ_{j=1}^{L} n_ij = Σ_{j=1}^{L} Pr(X = x_i, Y = y_j), where L is the number of possible values of Y
• Product rule: Pr(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N) = Pr(Y = y_j | X = x_i) Pr(X = x_i)
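To make these definitions concrete, here is a minimal Python sketch (the contingency table of counts is made up for illustration) that builds the quantities above and checks the sum and product rules numerically:

```python
import numpy as np

# Hypothetical 2x3 contingency table of counts n_ij over N trials
# (rows index values of X, columns index values of Y).
n = np.array([[20, 10, 5],
              [15, 30, 20]], dtype=float)
N = n.sum()

joint = n / N                      # Pr(X = x_i, Y = y_j) = n_ij / N
c = n.sum(axis=1)                  # c_i = #{X = x_i}
r = n.sum(axis=0)                  # r_j = #{Y = y_j}
marginal_X = c / N                 # Pr(X = x_i) = c_i / N
cond_Y_given_X = n / c[:, None]    # Pr(Y = y_j | X = x_i) = n_ij / c_i

# Sum rule: marginal of X equals the joint summed over Y
assert np.allclose(joint.sum(axis=1), marginal_X)
# Product rule: joint = conditional * marginal
assert np.allclose(joint, cond_Y_given_X * marginal_X[:, None])

print(joint, marginal_X, cond_Y_given_X, sep="\n")
```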


The Rules of Probability

→ Thus we have:
• Sum rule: p(X) = Σ_Y p(X, Y)
• Product rule: p(X, Y) = p(Y|X) p(X)

→ Finally, we can derive:
• Bayes' Theorem:
  p(Y|X) = p(X|Y) p(Y) / p(X),  with  p(X) = Σ_Y p(X|Y) p(Y)
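As a quick numerical illustration (the probability tables below are assumed, not from the slides), the sum and product rules give the evidence p(X), and Bayes' theorem then inverts the conditioning:

```python
import numpy as np

# Assumed toy distributions: Y takes 2 values, X takes 3 values.
p_Y = np.array([0.6, 0.4])                 # p(Y)
p_X_given_Y = np.array([[0.7, 0.2, 0.1],   # p(X | Y = y_0)
                        [0.1, 0.3, 0.6]])  # p(X | Y = y_1)

# Evidence via the sum rule applied to the product rule:
# p(X) = sum_Y p(X|Y) p(Y)
p_X = p_X_given_Y.T @ p_Y

# Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X)
p_Y_given_X = (p_X_given_Y * p_Y[:, None]) / p_X[None, :]

print(p_X)                      # marginal over X
print(p_Y_given_X.sum(axis=0))  # each column sums to 1
```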


Probability Densities

• Probabilities over continuous variables are defined via their probability density function (pdf) p(x):
  Pr(x ∈ (a, b)) = ∫_a^b p(x) dx
• Cumulative distribution function: the probability that x lies in the interval (−∞, z):
  P(z) = ∫_{−∞}^z p(x) dx
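A minimal numerical sketch of these two integrals, assuming a standard Gaussian pdf as the example density (the slide does not fix a particular p(x)); a Riemann sum approximates Pr(x ∈ (a, b)) and the CDF P(z):

```python
import numpy as np
from math import erf, sqrt

def pdf(x):
    """Standard Gaussian density, used here only as an example p(x)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def prob_interval(a, b, num=100_000):
    """Pr(x in (a, b)) approximated by a Riemann sum over a fine grid."""
    xs = np.linspace(a, b, num)
    dx = (b - a) / (num - 1)
    return float(np.sum(pdf(xs)) * dx)

def cdf(z):
    """P(z) = integral of p(x) from -inf to z (lower limit truncated at -10)."""
    return prob_interval(-10.0, z)

# Compare against the closed-form Gaussian CDF via erf
z = 1.0
print(cdf(z), 0.5 * (1.0 + erf(z / sqrt(2.0))))
print(prob_interval(-1.0, 1.0))   # ~0.683 for the standard Gaussian
```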


Expectations

• Expectation: the average value of some function f(x) under a probability distribution
  discrete case: E[f] = Σ_x p(x) f(x)
  continuous case: E[f] = ∫ p(x) f(x) dx
→ Given N samples drawn from the pdf, the expectation can be approximated by:
  E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)
• Conditional expectation:
  discrete case: E_x[f|y] = Σ_x p(x|y) f(x)
  continuous case: E_x[f|y] = ∫ p(x|y) f(x) dx
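A short sketch of the sample approximation above, assuming a standard Gaussian as the underlying pdf and f(x) = x² (both chosen only for illustration, since E[f] = 1 is known exactly in that case):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x**2   # example function; E[f] = 1 under a standard Gaussian

# Draw N samples from the (assumed) pdf and average f over them:
# E[f] ~ (1/N) * sum_n f(x_n)
for N in (10, 1_000, 100_000):
    x = rng.standard_normal(N)
    print(N, f(x).mean())   # approaches 1 as N grows
```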


Variances and Covariances

• Variance of a function f(x):
  var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²
• Covariance between variables X and Y:
  cov[X, Y] = E_{x,y}[{x − E[x]}{y − E[y]}] = E_{x,y}[xy] − E[x] E[y]
→ Covariance matrix in case X and Y are vectors:
  cov[X, Y] = E_{x,y}[x yᵀ] − E[x] E[yᵀ]
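A small sanity check of the identity cov[X, Y] = E[xyᵀ] − E[x]E[yᵀ] on sampled vectors (the data here are random draws, used only to show that the two expressions agree empirically):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
x = rng.normal(size=(N, 2))                                  # samples of a 2-D vector X
y = x @ np.array([[1.0, 0.5],
                  [0.0, 1.0]]) + rng.normal(size=(N, 2))     # correlated 2-D vector Y

# Empirical expectations (averages over the N samples)
E_xy = (x[:, :, None] * y[:, None, :]).mean(axis=0)          # E[x y^T]
E_x = x.mean(axis=0)
E_y = y.mean(axis=0)

cov_formula = E_xy - np.outer(E_x, E_y)                      # E[x y^T] - E[x] E[y^T]
cov_direct = ((x - E_x)[:, :, None] * (y - E_y)[:, None, :]).mean(axis=0)

assert np.allclose(cov_formula, cov_direct)
print(cov_formula)
```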



Outline of This Lecture

Probability Theory (review)

Bayes Decision Theory

Probability Density Estimation


Classification Example

• Handwritten character recognition

→ Goal: Classify a letter in a test image such that the probability of misclassification is minimized.


Priors

• Concept 1: Priors (a priori probabilities) p(C_k)
• What we "know" (or assume in practice) about the probability before seeing the data.
  Example: C_1 = a, C_2 = b, p(C_1) = 0.75, p(C_2) = 0.25

→ In general: Σ_k p(C_k) = 1


Conditional probabilities

• Concept 2: Conditional probabilities p(x|C_k)
• Feature vector x: characterizes certain properties of the input.
• p(x|C_k): describes the likelihood of x for a given class C_k

  Example: [figure: class-conditional densities p(x|a) and p(x|b)]


How to decide?

• Example: [figure: class-conditional densities p(x|a) and p(x|b)]

• Question: Which class to choose for a given x?

→ Where p(x|b) is much smaller than p(x|a), the decision should be 'a'.
→ Where p(x|a) is much smaller than p(x|b), the decision should be 'b'.
→ Attention: p(a) = 0.75 and p(b) = 0.25! What should we do in this case?

Posterior probabilities

• Concept 3: Posterior probabilities p(C_k|x)
• p(C_k|x) characterizes the probability of class C_k given the feature vector x.
• Bayes' Theorem:
  p(C_k|x) = p(x|C_k) p(C_k) / p(x) = p(x|C_k) p(C_k) / Σ_i p(x|C_i) p(C_i)
• Interpretation:
  Posterior = (Likelihood × Prior) / Normalization Factor
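A small sketch tying this to the question raised above; the class-conditional densities are assumed Gaussians (the slides use a figure, so the specific densities here are illustrative only), with the priors p(a) = 0.75 and p(b) = 0.25 from the earlier example:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {"a": 0.75, "b": 0.25}
# Assumed class-conditional densities p(x | C_k), standing in for the figure:
likelihoods = {"a": lambda x: gauss(x, mu=0.0, sigma=1.0),
               "b": lambda x: gauss(x, mu=2.0, sigma=1.0)}

def posteriors(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / sum_i p(x | C_i) p(C_i)."""
    unnorm = {k: likelihoods[k](x) * priors[k] for k in priors}
    z = sum(unnorm.values())          # normalization factor p(x)
    return {k: v / z for k, v in unnorm.items()}

for x in (0.0, 1.2, 3.0):
    post = posteriors(x)
    print(x, post, "->", max(post, key=post.get))
```

In this toy setup, at x = 1.2 the likelihood favors 'b' but the larger prior on 'a' flips the decision, which is exactly the issue raised by the question above.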



Bayesian Decision Theory

• Goal: Minimize the probability of a misclassification

• Optimal decision rule:
  • Decide for C_1 if p(C_1|x) > p(C_2|x), and vice versa.

→ p(C_1|x) > p(C_2|x) is equivalent to:
  p(x|C_1) p(C_1) > p(x|C_2) p(C_2)

→ Further equivalent to the likelihood-ratio test:
  p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)
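A minimal sketch of the likelihood-ratio test, again with assumed Gaussian class-conditionals; it simply checks p(x|C_1)/p(x|C_2) against the prior ratio p(C_2)/p(C_1):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decide(x, p1=0.75, p2=0.25):
    """Likelihood-ratio test: choose C1 iff p(x|C1)/p(x|C2) > p(C2)/p(C1)."""
    lik1 = gauss(x, 0.0, 1.0)   # assumed p(x | C1)
    lik2 = gauss(x, 2.0, 1.0)   # assumed p(x | C2)
    return "C1" if lik1 / lik2 > p2 / p1 else "C2"

print([decide(x) for x in (-1.0, 1.0, 1.5, 3.0)])
```

Since the shared evidence p(x) cancels in the ratio, this gives exactly the same decisions as comparing the posteriors directly.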


Generalization to More Than 2 Classes

• Decide for class k if it has the greatest posterior probability of all classes:
  p(C_k|x) > p(C_j|x), ∀j ≠ k
  equivalently: p(x|C_k) p(C_k) > p(x|C_j) p(C_j), ∀j ≠ k

→ Example: [figure]

→ Likelihood-ratio test:
  p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k), ∀j ≠ k
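A short sketch for K classes, assuming three Gaussian class-conditionals and a prior vector (all illustrative); the decision is just the argmax of p(x|C_k) p(C_k):

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mus = np.array([-2.0, 0.0, 2.0])     # assumed class means
sigmas = np.array([1.0, 0.5, 1.0])   # assumed class standard deviations
priors = np.array([0.2, 0.5, 0.3])   # p(C_k), sums to 1

def classify(x):
    """Pick k maximizing p(x|C_k) p(C_k), i.e. the maximum posterior."""
    scores = gauss(x, mus, sigmas) * priors
    return int(np.argmax(scores))

print([classify(x) for x in (-2.5, 0.1, 1.0, 3.0)])
```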


Classifying with Loss Functions

• Generalization to decisions with a loss function
  • Allowing inhomogeneous loss for different kinds of misclassification
  • Can be asymmetric, for example:
    loss(decision = healthy | patient = sick) >> loss(sick | healthy)
• Formalized using a loss matrix: L_kj is the loss for choosing C_j while the truth is C_k

→ For example: [figure: loss matrix]


• Goal: choose the decision that minimizes the loss
  → But the loss depends on the true class, which is unknown
• Solution: Minimize the expected loss
  E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
→ This can be done by choosing the decision regions R_j so that each x is assigned to the class j for which
  Σ_k L_kj p(C_k|x)
  is minimized
→ It is still the posterior probability p(C_k|x) that matters!
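A minimal sketch of this rule, with an assumed asymmetric loss matrix and posterior vectors (both made up for illustration): for each x we pick the decision j with the smallest posterior-weighted column sum of L.

```python
import numpy as np

# Assumed loss matrix L[k, j]: loss of choosing C_j when the truth is C_k.
# Here class 0 = "sick", class 1 = "healthy"; missing a sick patient is costly.
L = np.array([[0.0, 100.0],   # truth sick:    correct / decide healthy
              [1.0,   0.0]])  # truth healthy: decide sick / correct

def decide(posterior):
    """Choose j minimizing sum_k L[k, j] * p(C_k | x)."""
    expected_loss = posterior @ L       # per-decision expected losses
    return int(np.argmin(expected_loss))

print(decide(np.array([0.05, 0.95])))   # even a 5% chance of "sick" -> decide 0 (treat as sick)
print(decide(np.array([0.001, 0.999]))) # only now decide 1 ("healthy")
```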


Classifying with Loss Functions

• For the binary classification problem: decide for C_1 if
  p(x|C_1) / p(x|C_2) > (L_21 − L_22) p(C_2) / ((L_12 − L_11) p(C_1))

→ Recall the likelihood-ratio test: p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)
→ Taking the loss function into account leads to the generalized threshold above
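A quick check of this threshold (a sketch, with assumed loss matrices): under the 0-1 loss, L_kj = 1 for k ≠ j and 0 otherwise, the factor (L_21 − L_22)/(L_12 − L_11) equals 1 and the rule reduces to the plain likelihood-ratio test.

```python
import numpy as np

def threshold(L, p1, p2):
    """Decision threshold on p(x|C1)/p(x|C2) for the loss matrix L[k, j]."""
    return (L[1, 0] - L[1, 1]) * p2 / ((L[0, 1] - L[0, 0]) * p1)

p1, p2 = 0.75, 0.25
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
asymmetric = np.array([[0.0, 100.0],   # truth C1 ("sick"), deciding C2 is very costly
                       [1.0,   0.0]])

print(threshold(zero_one, p1, p2))    # = p(C2)/p(C1) = 1/3
print(threshold(asymmetric, p1, p2))  # much smaller: decide C1 far more readily
```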


Classification via Discriminant Functions

• Formulate classification in terms of comparisons
• Discriminant functions: y_1(x), ..., y_K(x)
• Classify x as class C_k if:
  y_k(x) > y_j(x), ∀j ≠ k

→ Examples (Bayes decision theory):
  y_k(x) = p(C_k|x)
  y_k(x) = p(x|C_k) p(C_k)
  y_k(x) = log p(x|C_k) + log p(C_k)

→ Question: how do we represent and estimate the probabilities p(x|C_k) and p(C_k)?
→ Probability density estimation
  E.g., in supervised training, the data and class labels are known
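The three example discriminants differ only by a shared normalization and a monotonic transformation, so they give the same decision; a small sketch checking this with the assumed Gaussian class-conditionals used earlier:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mus, sigmas = np.array([-2.0, 0.0, 2.0]), np.array([1.0, 0.5, 1.0])
priors = np.array([0.2, 0.5, 0.3])

def argmax_class(x):
    lik = gauss(x, mus, sigmas)
    y_posterior = lik * priors / np.sum(lik * priors)   # p(C_k | x)
    y_unnorm = lik * priors                             # p(x|C_k) p(C_k)
    y_log = np.log(lik) + np.log(priors)                # log form
    ks = {int(np.argmax(y)) for y in (y_posterior, y_unnorm, y_log)}
    assert len(ks) == 1            # all three discriminants agree
    return ks.pop()

print([argmax_class(x) for x in np.linspace(-4, 4, 9)])
```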



Outline of This Lecture

Probability Theory (review)

Bayes Decision Theory

Probability Density Estimation


Probability Density Estimation

• Methods
  • Parametric
  • Non-parametric
  • Mixture models

Parametric Methods

• Given
  • Data X = {x_1, x_2, ..., x_N}
  • Parametric form of the distribution with parameters θ
    → e.g., Gaussian distribution: θ = (µ, σ)
• Learning
  → Estimation of the parameters θ

→ For example: using a Gaussian distribution as the parametric model, what is θ = (µ, σ)?


Maximum Likelihood Approach

• Likelihood L(θ) of θ: probability that the data X have indeed been generated from a probability density with parameters θ:
  L(θ) = p(X|θ)
• Computation of the likelihood
  • Single data point: p(x_n|θ)
  • Assuming that all data points are independent:
    L(θ) = ∏_{n=1}^{N} p(x_n|θ)
  • Negative log-likelihood:
    E(θ) = − log L(θ) = − Σ_{n=1}^{N} log p(x_n|θ)
• Estimation/learning of the parameters θ
  • Maximize the likelihood → minimize the negative log-likelihood
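A short sketch of the negative log-likelihood for a 1-D Gaussian model (the data are random draws used only for illustration); scanning a grid of candidate means shows the minimum sits at the sample mean, which anticipates the closed-form result on the next slide.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=0.8, size=200)   # assumed observed samples

def neg_log_likelihood(mu, sigma, x):
    """E(theta) = - sum_n log p(x_n | theta) for a Gaussian model."""
    log_p = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return -np.sum(log_p)

mus = np.linspace(0.0, 3.0, 301)
nlls = [neg_log_likelihood(mu, sigma=0.8, x=data) for mu in mus]
best_mu = mus[int(np.argmin(nlls))]

print(best_mu, data.mean())   # the grid minimum is (close to) the sample mean
```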


Maximum Likelihood Approach

• How to minimize the negative log-likelihood?
  → Take the derivative and set it to zero
• Result for the Normal distribution (1D case): θ̂ = (µ̂, σ̂)
  µ_ML = (1/N) Σ_{n=1}^{N} x_n,   σ²_ML = (1/N) Σ_{n=1}^{N} (x_n − µ_ML)²

→ Unfortunately, this is not quite correct ...
→ Assuming the samples {x_n} come from a true Gaussian distribution with mean µ and variance σ², we have:
  E(µ_ML) = µ,   E(σ²_ML) = ((N − 1)/N) σ²

• Corrected estimate: σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) Σ_{n=1}^{N} (x_n − µ̂)²
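A small simulation of this bias (purely illustrative: true µ = 0, σ² = 1, and many repeated datasets are assumed): averaging σ²_ML over repetitions lands near (N − 1)/N · σ², while the corrected estimator averages near σ².

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 5, 200_000
true_var = 1.0

samples = rng.standard_normal((trials, N))          # true mu = 0, sigma^2 = 1
mu_ml = samples.mean(axis=1, keepdims=True)
var_ml = ((samples - mu_ml) ** 2).mean(axis=1)      # divides by N (biased)
var_corrected = ((samples - mu_ml) ** 2).sum(axis=1) / (N - 1)

print(var_ml.mean(), (N - 1) / N * true_var)        # ~0.8 for N = 5
print(var_corrected.mean(), true_var)               # ~1.0
```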


Maximum Likelihood Approach - Limitations

• It systematically underestimates the variance of the distribution
  → Consider the extreme case N = 1, X = {x_1}: the maximum-likelihood estimate puts all its mass at x_1 (σ²_ML = 0)
• ML overfits to the observed data
• Although we often use ML, it is important to know this limitation


A Deeper Reason

• Maximum likelihood is a Frequentist concept
  • In the Frequentist view, probabilities are the frequencies of random, repeatable events
  • These frequencies are fixed, but can be estimated more precisely when more data are available
• This is in contrast to the Bayesian interpretation
  • In the Bayesian view, probabilities quantify the uncertainty about certain states or events
  • This uncertainty can be revised in the light of new evidence


Bayesian vs. Frequentist View

• To illustrate the difference ...
  • Suppose we want to estimate the uncertainty about whether the Arctic ice cap will have totally disappeared by 2100
  • This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times
  • In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting
  • If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior, via:
    Posterior ∝ Likelihood × Prior
• This generally allows us to obtain better uncertainty estimates in many situations
  → Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse
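A minimal sketch of "Posterior ∝ Likelihood × Prior" on a discrete grid (the coin-flip setting is assumed for illustration, not taken from the slides): a prior over the unknown head-probability is multiplied by the likelihood of the observed flips and renormalized.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 101)                    # candidate parameter values
prior = np.exp(-0.5 * ((theta - 0.5) / 0.15) ** 2)    # assumed prior belief (peaked at 0.5)
prior /= prior.sum()

heads, tails = 7, 3                                   # assumed observed evidence
likelihood = theta**heads * (1 - theta)**tails

posterior = likelihood * prior                        # Posterior ∝ Likelihood × Prior
posterior /= posterior.sum()                          # normalize

print(theta[np.argmax(prior)], theta[np.argmax(posterior)])  # belief shifts toward the data
```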

Bayesian Approach to Parameter Learning

• Conceptual shift
• Maximum Likelihood views the true parameter vector θ as unknown, but fixed
• In Bayesian learning, we consider θ to be a random variable
• This allows us to use knowledge about the parameters θ:
  • Use a prior for θ
  • Training data then converts this prior distribution on θ into a posterior probability density
→ The prior thus encodes the knowledge we have about the type of distribution we expect to see for θ

Chaohui Wang Introduction to Machine Learning 31 / 63

Bayesian Approach

• Bayesian view:
  • Consider the parameter vector θ as a random variable
  • When estimating the distribution, what we are interested in is ...

Chaohui Wang Introduction to Machine Learning 32 / 63

Summary: ML vs. Bayesian Learning

• Maximum Likelihood
  • Simple approach, often analytically possible
  • Problem: the estimation is biased and tends to overfit to the data
  → Often needs some correction or regularization
  • But: the approximation becomes accurate as N → +∞
• Bayesian Learning
  • General approach, avoids the estimation bias through a prior
  • Problems:
    • Need to choose a suitable prior (not always obvious)
    • Integral over θ often not analytically feasible anymore
    → Resort to efficient stochastic sampling techniques

Chaohui Wang Introduction to Machine Learning 33 / 63

Non-Parametric Methods

• Non-parametric representations
→ Often the functional form of the distribution is unknown
• Estimate probability density from data:
  • Histograms
  • Kernel density estimation (Parzen window / Gaussian kernels)
  • k-Nearest-Neighbor
  • etc.

Chaohui Wang Introduction to Machine Learning 34 / 63

Histograms

• Idea: Partition the data space into distinct bins with widths ∆i and count the number of observations, ni, in each bin (among N observations in total):

  pi = ni / (N ∆i)

Chaohui Wang Introduction to Machine Learning 35 / 63

Histograms

• Idea: Partition the data space into distinct bins with widths ∆i and count the number of observations, ni, in each bin (among N observations in total):

  pi = ni / (N ∆i)

• Usually the same width is used for all bins: ∆i = ∆
• In principle, this can be applied for any dimensionality D
→ But the number of bins grows exponentially with D!
→ A suitable N is required to get an informative histogram

Chaohui Wang Introduction to Machine Learning 36 / 63

Histograms

• The bin width ∆ acts as a smoothing factor

Chaohui Wang Introduction to Machine Learning 37 / 63
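
As a small illustration of the estimate pi = ni / (N ∆i), the following Python/NumPy sketch builds a histogram density from an assumed 1-D Gaussian sample; the sample and the number of bins are arbitrary, illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                        # assumed 1-D sample

counts, edges = np.histogram(x, bins=30)         # ni per bin
delta = edges[1] - edges[0]                      # common bin width ∆ (smoothing factor)
p = counts / (counts.sum() * delta)              # pi = ni / (N ∆)

print("integral of the estimate:", (p * delta).sum())   # = 1 by construction
# np.histogram(x, bins=30, density=True) would return the same pi directly.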

Towards More “Statistically”-founded Approaches

• A data point x comes from the underlying pdf p(x): the probability that x falls into a small region R is

  P = ∫R p(y) dy

• If R is sufficiently small such that p(x) is roughly constant over it:

  P = ∫R p(y) dy ≈ p(x) V,   where V denotes the volume of R

• If the number N of samples is sufficiently large, we can estimate P as:

  P = K/N   =⇒   p(x) ≈ K / (N V),   where K denotes the number of samples falling in R

Chaohui Wang Introduction to Machine Learning 38 / 63
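
A tiny sketch of the estimate p(x) ≈ K / (N V): count how many of N samples fall into a small region R around x and divide by N times the volume of R. The standard-normal sample and the interval half-width below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=100_000)        # N samples, assumed standard normal
x, h = 0.0, 0.05                          # query point and half-width of region R
V = 2 * h                                 # "volume" (length) of R = [x - h, x + h]

K = np.sum(np.abs(samples - x) <= h)      # number of samples falling in R
p_hat = K / (len(samples) * V)            # p(x) ≈ K / (N V)

print(p_hat, "vs. true density", 1 / np.sqrt(2 * np.pi))   # ≈ 0.399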

Towards More “Statistically”-founded Approaches

Chaohui Wang Introduction to Machine Learning 39 / 63

Kernel Methods

• Parzen Window: Determine the number K of data points inside a fixed hypercube
→ Unit hypercube kernel around the origin:

  k(u) = 1 if |ui| ≤ 1/2 for all i ∈ {1, . . . , D}, and 0 else

→ Considering a cube with side width h centered at x (volume V = h^D), the number of data points falling inside it is:

  K(x) = ∑_{n=1}^{N} k((x − xn) / h)

→ Probability density estimate:

  p(x) ≈ K(x) / (N V) = (1 / (N h^D)) ∑_{n=1}^{N} k((x − xn) / h) = (1/N) ∑_{n=1}^{N} (1 / h^D) k((x − xn) / h)

Chaohui Wang Introduction to Machine Learning 40 / 63
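
A minimal sketch of the resulting Parzen-window estimator with the hypercube kernel, p(x) ≈ (1 / (N h^D)) ∑n k((x − xn) / h); the 2-D Gaussian sample and the bandwidth h are illustrative assumptions.

import numpy as np

def parzen_hypercube(x, data, h):
    """Parzen-window density estimate at x with a hypercube kernel of side width h."""
    x = np.atleast_1d(x)
    N, D = data.shape
    u = (x - data) / h                            # rescaled differences (x - xn) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)     # k(u) = 1 iff |ui| <= 1/2 for all i
    return inside.sum() / (N * h**D)              # K(x) / (N V), with V = h^D

rng = np.random.default_rng(2)
data = rng.normal(size=(5000, 2))                 # illustrative 2-D sample
print(parzen_hypercube([0.0, 0.0], data, h=0.3))  # ≈ 1/(2π) ≈ 0.159 at the mode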

Kernel Methods

• Parzen Window - Interpretations
  • 1st interpretation: place a rescaled kernel window at location x and count how many data points fall inside it
  • 2nd interpretation: place a rescaled kernel window k around each data point xn and sum up their influences at location x
  → Direct visualization of the density
• Issue: artificial discontinuities at the cube boundaries
→ smoother k function (e.g., Gaussian) → smoother density model

Chaohui Wang Introduction to Machine Learning 41 / 63

Kernel Methods: Gaussian Kernel

• Gaussian kernel
  • Kernel function:

    k(u) = 1 / (2πh^2)^{D/2} · exp{ −‖u‖^2 / (2h^2) }

    K(x) = ∑_{n=1}^{N} k(x − xn),   V = ∫ k(u) du = 1

  • Probability density estimate:

    p(x) ≈ K(x) / (N V) = (1/N) ∑_{n=1}^{N} 1 / (2πh^2)^{D/2} · exp{ −‖x − xn‖^2 / (2h^2) }

Chaohui Wang Introduction to Machine Learning 42 / 63
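
The same estimator with the Gaussian kernel, as a short sketch; again the 2-D sample and the bandwidth h are illustrative assumptions.

import numpy as np

def gaussian_kde(x, data, h):
    """p(x) ≈ (1/N) Σn (2π h²)^(−D/2) exp(−‖x − xn‖² / (2h²))."""
    x = np.atleast_1d(x)
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)     # ‖x − xn‖² for every data point
    norm = (2 * np.pi * h**2) ** (D / 2)
    return np.sum(np.exp(-sq_dist / (2 * h**2))) / (N * norm)

rng = np.random.default_rng(3)
data = rng.normal(size=(5000, 2))
print(gaussian_kde([0.0, 0.0], data, h=0.2))      # ≈ 1/(2π) ≈ 0.159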

Kernel Methods - General Principle

• In general, any kernel satisfying the following properties can be used:

  k(u) ≥ 0,   ∫ k(u) du = 1

• Then:

  K(x) = ∑_{n=1}^{N} k(x − xn),   V = ∫ k(u) du = 1

• Then we get the probability density estimate:

  p(x) ≈ K(x) / (N V) = (1/N) ∑_{n=1}^{N} k(x − xn)

Chaohui Wang Introduction to Machine Learning 43 / 63
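
A sketch of this general recipe with another valid kernel, here a 1-D Epanechnikov kernel with the bandwidth folded into k so that k(u) ≥ 0 and ∫ k(u) du = 1; the kernel choice, bandwidth, and data are assumptions for illustration only.

import numpy as np

h = 0.5                                            # illustrative bandwidth, folded into k

def k(u):
    """1-D Epanechnikov kernel: k(u) >= 0 and ∫ k(u) du = 1."""
    return np.where(np.abs(u) <= h, 0.75 * (1 - (u / h) ** 2) / h, 0.0)

u = np.linspace(-1, 1, 100_001)
print("∫ k(u) du ≈", np.sum(k(u)) * (u[1] - u[0]))   # ≈ 1, so V = 1

rng = np.random.default_rng(4)
data = rng.normal(size=2000)                       # assumed 1-D standard-normal sample
x = 0.0
print("p(x) ≈", np.mean(k(x - data)))              # (1/N) Σn k(x − xn), roughly 0.4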

Towards More “Statistically”-founded Approaches

Chaohui Wang Introduction to Machine Learning 44 / 63

K-Nearest Neighbor Density Estimation

• Basic idea: increase the volume V until the Kth closest data point is found
• Fix K, consider a hypersphere centered on x, and let it grow to a volume V̂(x,K) that includes K of the given N data points. Then:

  p(x) ≈ K / (N V̂(x,K))

→ Note: Strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges.
  E.g. consider K = 1 and x = xj (i.e., x is exactly on a data point xj)
→ It is therefore often exploited in a relative manner to compare between classes, e.g., KNN classification (to see in a while)

Chaohui Wang Introduction to Machine Learning 45 / 63
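
A minimal 1-D sketch of the K-NN estimate p(x) ≈ K / (N V̂(x,K)), where V̂(x,K) is the length of the smallest interval around x that contains the K nearest samples; K and the data are illustrative choices.

import numpy as np

def knn_density_1d(x, data, K):
    """p(x) ≈ K / (N * V), with V the diameter of the K-nearest-neighbour ball."""
    dist = np.sort(np.abs(data - x))   # distances to x, ascending
    radius = dist[K - 1]               # distance to the K-th closest point
    V = 2 * radius                     # 1-D "volume" of the ball around x
    return K / (len(data) * V)

rng = np.random.default_rng(5)
data = rng.normal(size=10_000)
print(knn_density_1d(0.0, data, K=100))   # ≈ 0.399 for a standard normal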

K-Nearest Neighbor - Examples

Chaohui Wang Introduction to Machine Learning 46 / 63

K-Nearest Neighbor Classification

• Recall: Bayesian classification via the posterior probability

  p(Cj|x) = p(x|Cj) p(Cj) / p(x)

• Now we have:

  p(x) ≈ K / (N V̂(x,K)),   p(x|Cj) ≈ Kj(x,K) / (Nj V̂(x,K)),   p(Cj) ≈ Nj / N

→ p(Cj|x) ≈ [Kj(x,K) / (Nj V̂(x,K))] · (Nj / N) · (N V̂(x,K) / K) = Kj(x,K) / K

Chaohui Wang Introduction to Machine Learning 47 / 63
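
A small sketch of the resulting decision rule p(Cj|x) ≈ Kj(x,K) / K: find the K nearest training points of x and report the class fractions among them. The two-class toy data and the value of K are illustrative assumptions.

import numpy as np

def knn_posterior(x, X_train, y_train, K):
    """Approximate p(Cj|x) by Kj / K over the K nearest neighbours of x."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:K]                     # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels.tolist(), (counts / K).tolist()))

rng = np.random.default_rng(6)
X0 = rng.normal(loc=[-1.0, 0.0], size=(200, 2))        # class 0 (toy data)
X1 = rng.normal(loc=[+1.0, 0.0], size=(200, 2))        # class 1 (toy data)
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 200 + [1] * 200)

print(knn_posterior(np.array([0.8, 0.1]), X_train, y_train, K=15))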

K-Nearest Neighbor Classification

Chaohui Wang Introduction to Machine Learning 48 / 63

K-Nearest Neighbor Classification

• Results on an example data set
• K acts as a smoothing parameter
• Theoretical property: when N → ∞, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions)
→ However, N is usually quite small in real applications . . .

Chaohui Wang Introduction to Machine Learning 49 / 63

Mixture Models - Motivations

• A single parametric distribution is often not sufficient

Chaohui Wang Introduction to Machine Learning 50 / 63

Mixture of Gaussians (MoG)

• Sum of M individual Gaussian distributions
→ In the limit, every smooth distribution can be approximated in this way (if M is large enough)

  p(x|θ) = ∑_{m=1}^{M} πm p(x|θm),   πm = p(ln = m|θm)

→ Parameters for MoG: θ = (π1, µ1, σ1, π2, µ2, σ2, . . . , πM, µM, σM)

Chaohui Wang Introduction to Machine Learning 51 / 63

Mixture of Gaussians (MoG)

• Mixture of Gaussians (MoG):

  p(x|θ) = ∑_{m=1}^{M} πm p(x|θm)

• Prior of component m:

  πm = p(ln = m|θm),   with (∀m) 0 ≤ πm ≤ 1 and ∑_{m=1}^{M} πm = 1

• Likelihood of x given the component m:

  p(x|θm) = 1 / (2πσm^2)^{1/2} · exp{ −(x − µm)^2 / (2σm^2) }

• ∫ p(x) dx = 1

Chaohui Wang Introduction to Machine Learning 52 / 63
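
A sketch that evaluates a 1-D MoG density p(x|θ) = ∑m πm N(x; µm, σm²) for illustrative parameters and checks numerically that it integrates to 1; the parameter values and the grid are assumptions.

import numpy as np

def mog_pdf(x, pis, mus, sigmas):
    """1-D Mixture of Gaussians: p(x|θ) = Σm πm · N(x; µm, σm²)."""
    x = np.asarray(x, dtype=float)[..., None]     # broadcast over the M components
    comp = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return np.sum(pis * comp, axis=-1)

# Illustrative parameters θ = (π1, µ1, σ1, ..., πM, µM, σM) with Σ πm = 1
pis    = np.array([0.3, 0.5, 0.2])
mus    = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

grid = np.linspace(-8, 8, 4001)
p = mog_pdf(grid, pis, mus, sigmas)
print("∫ p(x) dx ≈", np.sum(p) * (grid[1] - grid[0]))   # ≈ 1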

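As a concrete illustration of the definition above (not part of the original slides), the following minimal NumPy sketch evaluates a 1D MoG density; the helper names and the two-component parameter values are purely illustrative.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x | mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def mog_pdf(x, pis, mus, sigmas):
    # p(x | theta) = sum_m pi_m * N(x | mu_m, sigma_m^2)
    return sum(pi * gaussian_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

# Illustrative two-component mixture
x = np.linspace(-6.0, 10.0, 1000)
p = mog_pdf(x, pis=[0.3, 0.7], mus=[0.0, 4.0], sigmas=[1.0, 2.0])
print(np.trapz(p, x))   # close to 1: the mixture integrates to one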


Mixture of Multivariate Gaussians (figure slide)
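The figure on this slide is not recoverable from the transcript; for reference (the formula below is standard, not copied from the slide), the multivariate case replaces each univariate component by a D-dimensional Gaussian, which is the form used in the EM slides that follow:

  p(x|θ) = Σ_{m=1}^{M} π_m N(x|µ_m, Σ_m),
  with N(x|µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)^T Σ^{−1} (x − µ) }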

Estimation of MoG

• Maximum Likelihood: there is no direct analytical solution

  ∂{−log L(θ)} / ∂µ_j = f(π_1, µ_1, Σ_1, π_2, µ_2, Σ_2, ..., π_M, µ_M, Σ_M)

• Complex gradient function (non-linear mutual dependencies)
  → The optimization of one Gaussian depends on all the other Gaussians
• Iterative numerical optimization could be applied, but there is a simpler method: the Expectation-Maximization (EM) algorithm
  → Note that its idea is widely used in CV-related fields

(The log-likelihood being differentiated is written out below.)

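For reference (this expression is not written out on the slide), Maximum Likelihood maximizes the log-likelihood of the whole training set {x_1, ..., x_N}, whose negative is

  −log L(θ) = −Σ_{n=1}^{N} log ( Σ_{m=1}^{M} π_m N(x_n|µ_m, Σ_m) ).

The log of a sum is what couples the components: differentiating with respect to µ_j brings in the normalization over all M components, so the closed-form solution available for a single Gaussian no longer applies.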

Preliminaries (1)

• Basic Strategy:
  • Model the unobserved component label via a hidden variable
  • Evaluate the probability that each training example was generated by each component


Preliminaries (2)

• Mixture Estimation with Labeled Data
  • When the examples are labeled, we can estimate the Gaussians independently
    → e.g., using Maximum Likelihood

  l_i : the label of sample x_i
  N : the total number of samples
  N̂_j : the number of samples labeled j

  π̂_j ← N̂_j / N,    µ̂_j ← (1/N̂_j) Σ_{n: l_n = j} x_n

  Σ̂_j ← (1/N̂_j) Σ_{n: l_n = j} (x_n − µ̂_j)(x_n − µ̂_j)^T

  (see the sketch below)

• But we don't have such labels l_i.
  → We may use some clustering results at first, but then...

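A minimal NumPy sketch of the labeled-data estimates above (not from the slides; the function name and argument conventions are assumptions):

import numpy as np

def fit_labeled_mixture(X, labels, M):
    # ML estimates of (pi_j, mu_j, Sigma_j) when each sample's component label is known.
    # X: (N, D) data matrix; labels: (N,) integers in {0, ..., M-1}.
    N, D = X.shape
    pis = np.zeros(M)
    mus = np.zeros((M, D))
    Sigmas = np.zeros((M, D, D))
    for j in range(M):
        Xj = X[labels == j]               # samples labeled j
        Nj = len(Xj)
        pis[j] = Nj / N
        mus[j] = Xj.mean(axis=0)
        diff = Xj - mus[j]
        Sigmas[j] = diff.T @ diff / Nj    # ML covariance: divide by N_j, not N_j - 1
    return pis, mus, Sigmas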

Preliminaries (3)

• Idea: Mixture Estimation with “Soft” Assignments
  • Given the mixture parameters θ, we can evaluate the posterior probability that x_n was generated by a specific component j:

  p(l_n = j|x_n, θ) = p(l_n = j, x_n|θ) / p(x_n|θ) = p(l_n = j, x_n|θ) / Σ_{m=1}^{M} π_m p(x_n|θ_m)

  p(l_n = j, x_n|θ) = p(l_n = j|θ) p(x_n|l_n = j, θ) = π_j p(x_n|θ_j)

  → p(l_n = j|x_n, θ) = π_j p(x_n|θ_j) / Σ_{m=1}^{M} π_m p(x_n|θ_m)   (see the sketch below)

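The posterior above is what the EM algorithm calls a responsibility. A tiny 1D sketch (illustrative, not from the slides; parameter values are arbitrary):

import numpy as np

def responsibilities(x_n, pis, mus, sigmas):
    # p(l_n = j | x_n, theta) for a 1D MoG, following the Bayes-rule derivation above
    joint = pis * np.exp(-(x_n - mus) ** 2 / (2.0 * sigmas ** 2)) / np.sqrt(2.0 * np.pi * sigmas ** 2)
    return joint / joint.sum()   # divide by p(x_n | theta) = sum of the joint terms

print(responsibilities(1.0, np.array([0.3, 0.7]), np.array([0.0, 4.0]), np.array([1.0, 2.0])))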

Expectation-Maximization (EM) Algorithm

• E-Step: softly assign samples to mixture components

  γ_j(x_n) ← π_j N(x_n|µ_j, Σ_j) / Σ_{k=1}^{M} π_k N(x_n|µ_k, Σ_k),   ∀ j = 1, ..., M,  n = 1, ..., N

• M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments (a compact implementation sketch follows below)

  N̂_j ← Σ_{n=1}^{N} γ_j(x_n)   (the "soft" number of samples assigned to component j)

  π̂_j^new ← N̂_j / N

  µ̂_j^new ← (1/N̂_j) Σ_{n=1}^{N} γ_j(x_n) x_n

  Σ̂_j^new ← (1/N̂_j) Σ_{n=1}^{N} γ_j(x_n) (x_n − µ̂_j^new)(x_n − µ̂_j^new)^T

→ How to initialize the algorithm then?

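Putting the E- and M-steps together, here is a compact NumPy sketch of EM for a mixture of multivariate Gaussians (not from the slides; the name em_mog, the fixed iteration count, the random choice of initial means and the tiny jitter on the initial covariances are illustrative assumptions):

import numpy as np

def gaussian(X, mu, Sigma):
    # N(x | mu, Sigma) evaluated at every row of X; returns shape (N,)
    D = X.shape[1]
    diff = X - mu
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(Sigma))

def em_mog(X, M, n_iters=100, seed=0):
    # EM for a mixture of M multivariate Gaussians; X has shape (N, D)
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # "Way 1" initialization: start from initial parameters, then run the E-step first
    pis = np.full(M, 1.0 / M)
    mus = X[rng.choice(N, size=M, replace=False)].copy()
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(M)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, j] = gamma_j(x_n)
        dens = np.stack([pi * gaussian(X, mu, S) for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from the soft assignments
        Nj = gamma.sum(axis=0)                      # soft counts N_hat_j
        pis = Nj / N
        mus = (gamma.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mus[j]
            Sigmas[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]
    return pis, mus, Sigmas

The initialization used here corresponds to "Way 1" on the next slide; "Way 2" starts from assignment weights instead.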

Expectation-Maximization (EM) Algorithm

• Initialization:
  • Way 1: initialize the algorithm with a set of initial parameters, then conduct an E-step
  • Way 2: start with a set of initial weights, then do a first M-step (a sketch of this option follows below)

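A possible sketch of "Way 2" (illustrative; it assumes hard labels are available from some preliminary clustering step): turn the hard assignments into one-hot weights and perform a first M-step.

import numpy as np

def init_from_hard_labels(X, labels, M):
    # One-hot "responsibilities" from hard cluster labels, followed by a first M-step
    N, D = X.shape
    gamma = np.zeros((N, M))
    gamma[np.arange(N), labels] = 1.0
    Nj = gamma.sum(axis=0)
    pis = Nj / N
    mus = (gamma.T @ X) / Nj[:, None]
    Sigmas = np.stack([((gamma[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j]
                       for j in range(M)])
    return pis, mus, Sigmas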

EM Algorithm - Example (figure slide)

EM Algorithm - Implementation

• One issue in practice: singularities in the estimation
  → Mixture components may collapse onto single data points
• Why? If component j is exactly centered on a data point x_n, that data point contributes a term to the likelihood function that grows without bound as σ_j → 0
• How? Introduce regularization, e.g., by enforcing a minimum width for the Gaussians: use (Σ + σ_min I)^{−1} instead of Σ^{−1} (see the snippet below)

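A one-line way to apply the regularization mentioned above inside the M-step (illustrative; the value of sigma_min is an arbitrary choice, not taken from the lecture):

import numpy as np

def regularized_precision(Sigma, sigma_min=1e-3):
    # Use (Sigma + sigma_min * I)^(-1) instead of Sigma^(-1) to enforce a minimum width
    return np.linalg.inv(Sigma + sigma_min * np.eye(Sigma.shape[0]))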

Gaussian Mixture Models - Applications

• Mixture models are used in many practical applications
  → wherever distributions with complex or unknown shapes need to be represented
• Popular applications in Computer Vision
  → e.g., modeling distributions of pixel colors (a sketch follows below)
  • Each pixel is one data point in, e.g., RGB space
  • Learn a MoG to represent each class-conditional density
  • Use the learned models to classify other pixels

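A sketch of the pixel-color application (purely illustrative: the two-class setup, the 0.5 prior, the function names, and the use of SciPy's multivariate normal are assumptions; the per-class MoG parameters are supposed to have been fitted beforehand, e.g. with EM as sketched earlier):

import numpy as np
from scipy.stats import multivariate_normal

def mog_density(X, pis, mus, Sigmas):
    # Class-conditional density p(x | class) under a fitted MoG, for every row of X
    return sum(pi * multivariate_normal.pdf(X, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

def classify_pixels(pixels, mog_fg, mog_bg, prior_fg=0.5):
    # Bayes decision between two pixel-color classes, each modeled by a MoG in RGB space;
    # pixels has shape (N, 3); mog_fg and mog_bg are (pis, mus, Sigmas) tuples
    p_fg = prior_fg * mog_density(pixels, *mog_fg)          # p(x | fg) * p(fg)
    p_bg = (1.0 - prior_fg) * mog_density(pixels, *mog_bg)  # p(x | bg) * p(bg)
    return p_fg > p_bg   # True where the "foreground" class has the larger posterior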