
MACHINE LEARNING EXERCISES

Elements of parametric techniques

UNICA Course – Machine Learning, Exercises, Cagliari, March-May 2017, Fabio Roli ©

All the course material is available on the course web site: http://pralab.diee.unica.it/MachineLearning


PART 4: Elements of parametric techniques

Required knowledge to solve these exercises

1. Discriminant functions g(x) for Gaussian classes

   a. General case (different priors, different covariance matrix per class)

      g(x; μ, Σ) = -1/2 x^T Σ^{-1} x + μ^T Σ^{-1} x - 1/2 μ^T Σ^{-1} μ + ln p(ω) - 1/2 ln|Σ|

   b. Degenerate cases

      i. Same priors, same isotropic covariance matrix Σ = σ²I for each class (Nearest Mean Centroid classifier):

         g(x) = μ^T x - 1/2 μ^T μ   or   g(x) = -||x - μ||²

      ii. Different priors, same isotropic covariance matrix Σ = σ²I for each class:

         g(x) = (1/σ²) μ^T x - 1/(2σ²) μ^T μ + ln p(ω)   or   g(x) = -||x - μ||² / (2σ²) + ln p(ω)

      iii. Same priors, (arbitrary) covariance matrix equal for each class:

         g(x) = μ^T Σ^{-1} x - 1/2 μ^T Σ^{-1} μ + ln p(ω)

Note that, if Gaussians are isotropic but different for each class, g(x) remains quadratic. The degenerate cases listed above hold only when the covariance matrix is the same for each class!

2. Classify a point using the MAP criterion:

   argmax_i g_i(x)

3. Plot decision boundaries among classes in feature space

   a. Determine the values of x* for which it holds that g_1(x*) = g_2(x*)

   b. Find the subset of points x* for which g_1 and g_2 are the dominant classes, i.e., g_1(x*) > g_3(x*), g_1(x*) > g_4(x*), etc. This subset of points will identify the active boundary between class 1 and class 2

c. Repeat for any other pair of classes

4. Estimate parameter values (mean and covariance matrices) from data (using MLE)

   a. Mean: μ = (1/n) Σ_{k=1..n} x_k

   b. Covariance: Σ = 1/(n-1) Σ_{k=1..n} (x_k - μ)(x_k - μ)^T

Note that the product inside the summation is an outer product between a column vector and a row vector of d elements, where d is the number of features, and that the sum is computed over the training samples. Accordingly, the covariance matrix is a d x d matrix. Depending on the assumptions made on the covariance matrix (e.g., equal for all classes, isotropic, etc.), the above estimate should be adjusted accordingly (see Exercise 3 for some examples). A minimal numerical sketch of these estimates is shown below.
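The following sketch illustrates these estimates (assuming NumPy is available; the data reuse the class ωA samples of Exercise 3, and all variable names are illustrative):

```python
import numpy as np

# Training samples of one class: n rows, d columns (here the class A data of Exercise 3)
X = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])

n = X.shape[0]
mu = X.mean(axis=0)            # mean: (1/n) * sum_k x_k

D = X - mu                     # deviations from the mean, one row per sample
Sigma = D.T @ D / (n - 1)      # (1/(n-1)) * sum_k (x_k - mu)(x_k - mu)^T, a d x d matrix

print(mu)                      # [0.05 0.05]
print(Sigma)                   # [[ 0.6967 -0.25 ], [-0.25  2.41 ]]
# Sigma coincides with np.cov(X, rowvar=False)
```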


Exercise 1 Let us consider a three-class problem, within a bi-dimensional feature space.

a. Classify the pattern xt = (1, 1/3)^T under the following assumptions:

• the probability density function of each class is Gaussian, with the same covariance matrix Σ = σ²I, where the parameter σ² (the variance) is unknown
• all the classes have the same prior probability
• the mean vectors of the three classes are: μ1 = (0, 0)^T; μ2 = (0, 2)^T; μ3 = (2, 1)^T.

b. Classify the same pattern xt = (1, 1/3)^T under different assumptions about the priors: P(ω1) = P(ω2) = 1/4, P(ω3) = 1/2.

Solution

The general form of a multivariate normal pdf is

p(x|ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) ]

The discriminant function g_i(x) is

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln|Σ_i| + ln P(ω_i)

Considering a Gaussian model with Σ_i = σ²I we obtain

Σ_i^{-1} = (1/σ²) I ;  |Σ_i| = σ^{2d}

and finally

g_i(x) = -||x - μ_i||² / (2σ²) + ln P(ω_i)

a) Equal priors

For this particular case (priors and σ² are the same for all classes, so the corresponding terms can be dropped), the discriminant function becomes:

g_i(x) = -||x - μ_i||²

For the pattern xt = (1, 1/3)^T:

g_1(x)|x=xt = -||xt - μ_1||² = -10/9


g_2(x)|x=xt = -||xt - μ_2||² = -34/9

g_3(x)|x=xt = -||xt - μ_3||² = -13/9

Comparing the values of the above discriminant functions g_1, g_2 and g_3 for the pattern xt we can see that the value of the posterior probability is maximum for class 1. Therefore, the pattern (1,1/3)T is assigned to class 1, regardless of the unknown value of 𝝈.
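The same computation can be reproduced with a few lines of code; this is only a sketch, assuming NumPy is available and using illustrative variable names:

```python
import numpy as np

mus = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]])   # mu_1, mu_2, mu_3
x_t = np.array([1.0, 1.0 / 3.0])

# Equal priors, same isotropic covariance: g_i(x) = -||x - mu_i||^2 (nearest mean rule)
g = -np.sum((x_t - mus) ** 2, axis=1)
print(g)                # [-1.1111 -3.7778 -1.4444] = [-10/9, -34/9, -13/9]
print(g.argmax() + 1)   # 1 -> the pattern is assigned to class 1
```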

b) Different priors

For this particular case, the discriminant function becomes

g_i(x) = -||x - μ_i||² / (2σ²) + ln P(ω_i),  with P(ω1) = P(ω2) = 1/4 and P(ω3) = 1/2

that is,

g_1(x) = -||x - μ_1||² / (2σ²) + ln(1/4)
g_2(x) = -||x - μ_2||² / (2σ²) + ln(1/4)
g_3(x) = -||x - μ_3||² / (2σ²) + ln(1/2)

for the pattern xt = (1, 1/3)^T.

Note that the inequality g1(x) > g2(x) does not depend on the value of ln(1/4), whereas the other two inequalities depend on the difference between ln(1/4) and ln(1/2).

Let us calculate the value of the discriminant functions for the considered pattern x = xt:

g_1(xt) = -(10/9) / (2σ²) + ln(1/4)
g_2(xt) = -(34/9) / (2σ²) + ln(1/4)
g_3(xt) = -(13/9) / (2σ²) + ln(1/2)


Class 1 vs class 2

For the considered pattern, the inequality g1(xt)>g2(xt) is always true.

Class 1 vs class 3

From the inequality g1(xt) > g3(xt), we obtain

-(10/9) / (2σ²) + ln(1/4) > -(13/9) / (2σ²) + ln(1/2)  ⟹  σ² < 3 / (18 (ln 4 - ln 2)) = 1 / (6 ln 2) ≈ 0.2404

Pattern (1, 1/3)^T is assigned to class 1 only if σ² < 0.2404

Pattern (1, 1/3)^T will be assigned to class 3 if σ² > 0.2404

Class 2 vs. class 3

From the inequality g2(xt) > g3(xt) we obtain

ln 2 - ln 4 > (7/6) · (1/σ²)

This inequality never holds; accordingly:

Pattern (1, 1/3)^T is assigned to class 1 only if σ² < 0.2404

Pattern (1, 1/3)^T will be assigned to class 3 if σ² > 0.2404
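A small sketch (assuming NumPy; names are illustrative) that evaluates the three discriminants for a chosen value of σ² and recovers the threshold derived above:

```python
import numpy as np

mus = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
priors = np.array([0.25, 0.25, 0.5])
x_t = np.array([1.0, 1.0 / 3.0])

def g(sigma2):
    # g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(omega_i)
    return -np.sum((x_t - mus) ** 2, axis=1) / (2.0 * sigma2) + np.log(priors)

print(1.0 / (6.0 * np.log(2.0)))   # 0.2404..., the threshold derived above
print(g(0.1).argmax() + 1)         # 1: for sigma^2 < 0.2404 the pattern goes to class 1
print(g(0.5).argmax() + 1)         # 3: for sigma^2 > 0.2404 the pattern goes to class 3
```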

Homework:

Plot boundaries for

a) equal priors
b) P1 = P2 = 1/4, σ = 0.1
c) P1 = P2 = 1/4, σ = 0.5



Exercise 2 Let us consider a hypothetical detection problem of network intrusions. We want to classify network traffic using two characteristic measures (two features). We have identified three data classes of interest (denial-of-service attack, probing attack, normal traffic). Traffic patterns follow a normal distribution with different, known parameters (e.g., the parameters have been previously estimated using the Maximum Likelihood method).

For the sake of simplicity, let us assume that covariance matrices are equal to each other:

Σ1 = Σ2 = Σ3 = Σ = [4 -3; -3 4]

whereas the Gaussian centers are as follows:

μ1 = (0, 0)^T ;  μ2 = (2, 2)^T ;  μ3 = (-2, -2)^T

Moreover, let us suppose that the prior class probabilities are as follows (e.g., they have been estimated from the amount of traffic of each class):

P(ω1) = P(ω2) = 1/4 ;  P(ω3) = 1/2

A)

Find the discriminant functions using the MAP decision rule

Classify the network traffic pattern xc=(1,1)t.

B)

Find the decision boundaries

SOLUTION

The general form of a multivariate normal pdf is

p(x|ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) ]

The discriminant function g_i(x) is in the form

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln|Σ_i| + ln P(ω_i)



A)

Find the discriminant functions with the MAP decision rule

Classify the network traffic pattern xc=(1,1)t.

From Part 4 of the course, we know that if Σ_i = Σ, patterns generate hyper-ellipsoidal “clusters” having the same size and shape, centered in μ_i, whose discriminant functions become

g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) + ln P(ω_i)

Alternatively, dropping the term -1/2 x^T Σ^{-1} x (which is common to all classes), we can write

g_i(x) = w_i^T x + w_i0

where

w_i = Σ^{-1} μ_i ;  w_i0 = -1/2 μ_i^T Σ^{-1} μ_i + ln P(ω_i)

Therefore, we can substitute the values of Σ and μ_i, compute g_i(x) for x = (1, 1)^T, and assign this pattern to the class i such that g_i(x) > g_j(x), j ≠ i.

The MAP decision criterion is

x ∈ ω_i ⟺ p(ω_i|x) > p(ω_j|x) ∀j ≠ i

and, using the Bayes rule,

x ∈ ω_i ⟺ p(x|ω_i) P(ω_i) > p(x|ω_j) P(ω_j) ∀j ≠ i

From this equation, we obtain the discriminant functions related to p(x|ω_i) · P(ω_i).

The covariance matrices are as follows:

Σ1 = Σ2 = Σ3 = Σ = [4 -3; -3 4],  |Σ| = 7,  Σ^{-1} = (1/7) [4 3; 3 4]

In the logarithmic space, we can skip common additive constants:



g_i(x) = -1/2 (x - μ_i)^T Σ^{-1} (x - μ_i) + ln P(ω_i) =
= -1/2 [ x^T Σ^{-1} x + (-x^T Σ^{-1} μ_i - μ_i^T Σ^{-1} x) + μ_i^T Σ^{-1} μ_i ] + ln P(ω_i)

Dropping the term -1/2 x^T Σ^{-1} x, which is common to all classes,

⟹ g_i(x) = -1/2 (-2 μ_i^T Σ^{-1} x + μ_i^T Σ^{-1} μ_i) + ln P(ω_i) = μ_i^T Σ^{-1} x - 1/2 μ_i^T Σ^{-1} μ_i + ln P(ω_i)

So, for each class we have (remember that P(ω1) = P(ω2) = 1/4 and P(ω3) = 1/2):

g_1(x) = ln P(ω1) = ln(1/4)
g_2(x) = (2, 2)·(x1, x2)^T - 4 + ln P(ω2) = 2(x1 + x2) - 4 + ln(1/4)
g_3(x) = (-2, -2)·(x1, x2)^T - 4 + ln P(ω3) = -2(x1 + x2) - 4 + ln(1/2)

As expected, since all covariance matrices are equal to each other, we obtain linear discriminant functions. Decision boundaries are defined by hyperplanes placed between each pair of “neighboring” classes. Let us classify the pattern xc = (1, 1)^T:

g1(xc) = -1.3863 ;  g2(xc) = -1.3863 ;  g3(xc) = -8.6931

Thus, this point can be assigned to class 1 or to class 2.
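These discriminant values can be double-checked numerically; the following sketch (assuming NumPy; variable names are illustrative) builds the linear discriminants g_i(x) = μ_i^T Σ^{-1} x - 1/2 μ_i^T Σ^{-1} μ_i + ln P(ω_i) directly:

```python
import numpy as np

Sigma = np.array([[4.0, -3.0], [-3.0, 4.0]])
mus = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.25, 0.25, 0.5])
Sigma_inv = np.linalg.inv(Sigma)

def g(x):
    # Linear discriminants for Gaussian classes sharing the same covariance matrix
    return np.array([m @ Sigma_inv @ x - 0.5 * m @ Sigma_inv @ m + np.log(p)
                     for m, p in zip(mus, priors)])

print(g(np.array([1.0, 1.0])))   # [-1.3863 -1.3863 -8.6931] -> tie between class 1 and class 2
```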



B) Find the decision boundaries.

Graphs shown below are not strictly necessary for the resolution of the problem; they are shown for clarity.

The eigenvectors of Σ are v1 = (√2/2, √2/2)^T and v2 = (√2/2, -√2/2)^T, with eigenvalues 1 and 7, respectively.

Gaussians show this shape:

Fig.1 Probability densities of classes (schematic representation of contour lines)- prior probabilities are not considered



Fig.2 Probability densities of classes - prior probabilities are not considered

Fig.3 Probability densities of classes - prior probabilities are not considered



Fig.4 Probability densities of classes - prior probabilities are 1/4 (class 1), 1/4 (class 2), 1/2 (class 3)




Decision boundaries can be found considering that they correspond to g_i(x) = g_j(x), that is, to the set of points for which the posterior probabilities of class i and class j are identical, provided that no other class has higher probability. The discriminant functions are:

g_1(x) = ln(1/4)
g_2(x) = 2(x1 + x2) - 4 + ln(1/4)
g_3(x) = -2(x1 + x2) - 4 + ln(1/2)

First decision boundary: g_1(x) = g_2(x) ⟹ 2(x1 + x2) - 4 = 0, so the candidate boundary x̂ is defined by the equation

x̂1 + x̂2 = 2

Substituting this constraint into the discriminant functions g1(), g2() and g3() we obtain their values on the hyperplane defined by g1() = g2():

g1(x̂) = -1.3863 ;  g2(x̂) = -1.3863 ;  g3(x̂) = -8.6931

Obviously, we find the equality g1(x̂) = g2(x̂). The inequality g3(x̂) < g1(x̂) = g2(x̂) tells us that the first hyperplane is a decision boundary between class 1 and class 2 at every point x̂ on it.



Second decision boundary

g_2(x) = g_3(x) ⟹ 2(x1 + x2) - 4 + ln(1/4) = -2(x1 + x2) - 4 + ln(1/2) ⟹ x1 + x2 = (1/4) ln 2

so the candidate boundary x̂ is defined by the equation x̂1 + x̂2 = (1/4) ln 2.

Substituting this constraint into the discriminant functions g1(), g2() and g3() we obtain their values on the hyperplane defined by g2() = g3():

g1(x̂) = -1.3863 ;  g2(x̂) = -5.0397 ;  g3(x̂) = -5.0397

Obviously, we find the equality g2(x̂) = g3(x̂). The inequality g1(x̂) > g2(x̂) = g3(x̂) tells us that the second hyperplane is not a decision boundary between class 2 and class 3, since for points x̂ on it the probability for class 1 is greater than the probability for classes 2 and 3. Therefore, this hyperplane is NOT a decision boundary, since g1(x̂) > g_j(x̂), j = 2, 3.

Third decision boundary

g_1(x) = g_3(x) ⟹ ln(1/4) = -2(x1 + x2) - 4 + ln(1/2) ⟹ x1 + x2 = (1/2) ln 2 - 2

so the candidate boundary x̂ is defined by the equation x̂1 + x̂2 = (1/2) ln 2 - 2.

Substituting this constraint into the discriminant functions g1(), g2() and g3() we obtain their values on the hyperplane defined by g1() = g3():

g1(x̂) = -1.3863 ;  g2(x̂) = -8.6931 ;  g3(x̂) = -1.3863

Obviously, the value of g1 is constant, and we find the equality g1(x̂) = g3(x̂). The inequality g2(x̂) < g1(x̂) = g3(x̂) tells us that the third hyperplane is a decision boundary between class 1 and class 3.



Decision boundaries:

Between classes 1 and 2:  x1 + x2 = 2

Between classes 1 and 3:  x1 + x2 = (1/2) ln 2 - 2 = -1.6534

Fig. 5 Probability densities and decision boundaries. The decision boundary between class 1 and class 3 is identified by (1,3). The decision boundary between class 1 and class 2 is identified by (1,2).

It is easy to note that the point xc = (1, 1)^T lies exactly on the hyperplane (1,2); in fact, this point can be assigned to class 1 or to class 2 (see Exercise 2-A).

From the figure above, it is easy to see that the separation lines are not symmetric with respect to the Gaussian centers, due to the different class prior probabilities.
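To verify which candidate hyperplanes are active boundaries (step 3.b of the required knowledge), one can evaluate the three discriminants at points of each candidate line; a sketch, assuming NumPy and illustrative names:

```python
import numpy as np

Sigma_inv = np.linalg.inv(np.array([[4.0, -3.0], [-3.0, 4.0]]))
mus = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.25, 0.25, 0.5])

def g(x):
    # g_i(x) = mu_i^T Sigma^{-1} x - 1/2 mu_i^T Sigma^{-1} mu_i + ln P(omega_i)
    return np.array([m @ Sigma_inv @ x - 0.5 * m @ Sigma_inv @ m + np.log(p)
                     for m, p in zip(mus, priors)])

# One point on each candidate hyperplane x1 + x2 = const (here g depends only on x1 + x2)
for name, const in [("g1=g2: x1+x2=2", 2.0),
                    ("g2=g3: x1+x2=(1/4)ln2", 0.25 * np.log(2)),
                    ("g1=g3: x1+x2=(1/2)ln2-2", 0.5 * np.log(2) - 2.0)]:
    x = np.array([0.0, const])
    print(name, np.round(g(x), 4))
# g1=g2 line: [-1.3863 -1.3863 -8.6931] -> classes 1 and 2 tie: active boundary
# g2=g3 line: [-1.3863 -5.0397 -5.0397] -> class 1 dominates: NOT an active boundary
# g1=g3 line: [-1.3863 -8.6931 -1.3863] -> classes 1 and 3 tie: active boundary
```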



Exercise 3 Let us consider the following training samples for class ωA and ωB, respectively.

A (-1, 0)T ( 1, 0)T (-0.1, 2)T ( 0.3, -1.8)T

B (5, 4)T (3, 3)T (4, 7)T (6.5, 2)T

Class-conditional distributions are as follows:

p(x|ωA) = N(x; μA, ΣA) ;  p(x|ωB) = N(x; μB, ΣB)

The parameters μA, ΣA, μB, ΣB are not given but they should be estimated.

Find the optimal discriminant function and classify the pattern x̂ = (2.2, 2.2)^T under the following hypotheses (here ω1 ≡ ωA and ω2 ≡ ωB):

1) ΣA = ΣB = σ²I and P(ω1) = P(ω2)
2) ΣA = ΣB = σ²I and P(ω1) = 2 P(ω2)
3) ΣA ≠ ΣB and P(ω1) = 2 P(ω2)
4) ΣA = ΣB (arbitrary) and P(ω1) = 2 P(ω2)

Note that by changing the hypotheses, the classification changes as well.

SOLUTION

The general form of a multivariate normal pdf is

p(x|ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) · exp[ -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) ]

Taking the logarithm of the above expression, one obtains the discriminant function g_i(x) as

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln|Σ_i| + ln P(ω_i)



1) Under the hypothesis ΣA = ΣB = σ²I and P(ω1) = P(ω2), find the optimal discriminant function and classify the pattern (2.2, 2.2)^T

1.a Classify the pattern

Under this hypothesis, we don't need to estimate the covariance matrix. We just need to exploit the general expression of g_i (see Part 4 of the course):

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln|Σ_i| + ln P(ω_i)

By removing the terms common to g_A and g_B, the discriminant functions are as follows:

g_A(x) = -||x - μ_A||² ;  g_B(x) = -||x - μ_B||²

And the optimal discriminant plane is:

g_A(x) = g_B(x)

We can estimate the mean vectors using

μ̂ = (1/n) Σ_{k=1..n} x_k

which gives

μ_A = (0.05, 0.05)^T ;  μ_B = (4.625, 4)^T

Squared distances of x̂ = (2.2, 2.2)^T from the above means:

d_A² = 9.2450 ;  d_B² = 9.1206

g_A(x̂) = -9.2450 ;  g_B(x̂) = -9.1206

Pattern is assigned to class B



1.b Find the boundary

The boundary is defined by the equation

||x - μ_A||² = ||x - μ_B||²

where the generic 2-dimensional vector x is x = (x1, x2)^T, that is,

(x1 - μ_A1)² + (x2 - μ_A2)² = (x1 - μ_B1)² + (x2 - μ_B2)²

After a simple calculation, we obtain

9.15 x1 + 7.9 x2 = 37.3856

An alternative approach starts from the same equality in vector form:

(x - μ_A)^T (x - μ_A) = (x - μ_B)^T (x - μ_B)
x^T x - 2 μ_A^T x + μ_A^T μ_A = x^T x - 2 μ_B^T x + μ_B^T μ_B
2 (μ_B - μ_A)^T x = μ_B^T μ_B - μ_A^T μ_A

which again gives 9.15 x1 + 7.9 x2 = 37.3856.

The point (2.2, 2.2) is slightly above the boundary, and thus classified as class B (as discussed before).



2) Under the hypothesis ΣA = ΣB = σ²I and P(ω1) = 2 P(ω2), find the optimal discriminant function and classify the pattern (2.2, 2.2)^T

As in the previous case, we can start our calculation from the general form of the discriminant function

g_i(x) = -1/2 (x - μ_i)^T Σ_i^{-1} (x - μ_i) - (d/2) ln 2π - 1/2 ln|Σ_i| + ln P(ω_i)

By removing the terms common to both discriminant functions (the terms -(d/2) ln 2π and -1/2 ln|Σ_i| are the same for each g_i() and can thus be ignored):

g_i(x) = -||x - μ_i||² / (2σ²) + ln P(ω_i)

We need to estimate the term σ². The estimates of the per-class covariance matrices are

Σ̂_i = 1/(n_i - 1) Σ_{k=1..n_i} (x_k - μ̂_i)(x_k - μ̂_i)^T

which gives

Σ̂_A = [0.6967 -0.25 ; -0.25 2.41]
Σ̂_B = [2.2292 -1.3333 ; -1.3333 4.6667]

But the hypothesis is ΣA = ΣB = σ²I.

When all classes share the same covariance matrix, it can be proven that the maximum likelihood estimate for such a covariance matrix is (see slides of Part 4 of the course):

Σ = Σ_{i=1..C} (n_i / n) Σ̂_i



or, if we know the priors,

Σ = Σ_{i=1..C} P(ω_i) Σ̂_i

In order to enforce compliance between the obtained estimates and the hypothesis ΣA = ΣB = σ²I, we may impose that these matrices are diagonal and proportional to the identity matrix (only zero values outside the diagonal). Essentially, we have to ignore the off-diagonal terms and average the terms on the main diagonal. This corresponds to computing the variance of each feature and then averaging the corresponding values (separately for each class). Finally, the average variance is computed using the above equation (a weighted mean), namely, weighting the average variance of each class with its prior probability.

Σ̂_A ≈ [(0.6967 + 2.41)/2  0 ; 0  (0.6967 + 2.41)/2] = 1.5533 I
Σ̂_B ≈ [(2.2292 + 4.6667)/2  0 ; 0  (2.2292 + 4.6667)/2] = 3.4479 I

The above diagonal terms have been obtained by averaging the previous variance values.

Finally, with P(ωA) = 2/3 and P(ωB) = 1/3,

Σ = Σ_i P(ω_i) Σ̂_i = P_A Σ̂_A + P_B Σ̂_B = 2.1849 I,  so σ² = 2.1849

2.a Classify the pattern xc = (2.2, 2.2)^T

g_A(xc) = -||xc - μ_A||² / (2σ²) + log P(ωA) = -2.5212
g_B(xc) = -||xc - μ_B||² / (2σ²) + log P(ωB) = -3.1858

gB(x)< gA(x); The pattern is assigned to class A
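The pooled isotropic estimate and the resulting classification can be reproduced with the following sketch (assuming NumPy; variable names are illustrative):

```python
import numpy as np

XA = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])
XB = np.array([[5.0, 4.0], [3.0, 3.0], [4.0, 7.0], [6.5, 2.0]])
P_A, P_B = 2.0 / 3.0, 1.0 / 3.0            # from P(omega_A) = 2 P(omega_B)
x_c = np.array([2.2, 2.2])

mu_A, mu_B = XA.mean(axis=0), XB.mean(axis=0)

def iso_var(X, mu):
    # average of the per-feature variances (1/(n-1)) -> sigma^2 for Sigma = sigma^2 I
    return np.sum((X - mu) ** 2) / ((X.shape[0] - 1) * X.shape[1])

sigma2 = P_A * iso_var(XA, mu_A) + P_B * iso_var(XB, mu_B)     # weighted by the priors
print(sigma2)                                                  # about 2.1849

g_A = -np.sum((x_c - mu_A) ** 2) / (2 * sigma2) + np.log(P_A)  # about -2.5212
g_B = -np.sum((x_c - mu_B) ** 2) / (2 * sigma2) + np.log(P_B)  # about -3.1858
print("A" if g_A > g_B else "B")                               # A
```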



Alternative approach

The discriminant functions can be expressed as

g_i(x) = w_i^T x + w_i0,  where  w_i = (1/σ²) μ_i  and  w_i0 = -(1/(2σ²)) μ_i^T μ_i + log P(ω_i)

(see the slides of Part 4). Substituting the estimated values:

w_A = (0.0229, 0.0229)^T ;  w_A0 = -0.4066
w_B = (2.1168, 1.8308)^T ;  w_B0 = -9.6554

g_A(xc) = w_A^T xc + w_A0 = -0.3059
g_B(xc) = w_B^T xc + w_B0 = -0.9706

g_B(xc) < g_A(xc); the pattern is assigned to class A.



2.b Find the boundary

g_A(x) = -||x - μ_A||² / (2σ²) + log P(ωA)
g_B(x) = -||x - μ_B||² / (2σ²) + log P(ωB)

The optimal discriminant plane is the hyperplane gA=gB, that is,

2.094𝑥. + 1.808𝑥5 = 9.2488

It is easy to see that (compare with point 1.b in the solution of this exercise) the boundary has been shifted towards the upper-right corner of the plot, due to the increase of the prior of the class A. Accordingly, the point (2.2, 2.2) is now below the decision boundary (thus in the decision region of class A).


(alternative approach)

The optimal discriminant plane is the hyperplane g_A = g_B, that is,

w_A^T x + w_A0 = w_B^T x + w_B0
(w_A - w_B)^T x + (w_A0 - w_B0) = 0

The hyperplane can also be written as w^T (x - x_0) = 0, that is,

2.094 x1 + 1.808 x2 = 9.2488


3) By assuming that ΣA ≠ ΣB and P(ω1) = 2 P(ω2), find the optimal discriminant boundary and classify the pattern (2.2, 2.2)^T

We can estimate the covariance matrices (see the previous case):

Σ̂_A = [0.6967 -0.25 ; -0.25 2.41]
Σ̂_B = [2.2292 -1.3333 ; -1.3333 4.6667]

g_i(x) is a quadratic function which can be written as

g_i(x) = x^T W_i x + w_i^T x + w_i0

where

W_i = -1/2 Σ_i^{-1} ;  w_i = Σ_i^{-1} μ_i ;  w_i0 = -1/2 μ_i^T Σ_i^{-1} μ_i - 1/2 ln|Σ_i| + ln P(ω_i)



Substituting:

W_A = [-0.7454 -0.0773 ; -0.0773 -0.2155] ;  w_A = (0.0823, 0.0293)^T ;  w_A0 = -0.6484
W_B = [-0.2705 -0.0773 ; -0.0773 -0.1292] ;  w_B = (3.1207, 1.7487)^T ;  w_B0 = -12.8893

we can compute the discriminant functions and classify the pattern xc = (2.2, 2.2)^T.

Discriminant functions:

g_A(x) = -0.7454 x1² - 0.2155 x2² - 0.1547 x1 x2 + 0.08227 x1 + 0.02928 x2 - 0.6484
g_B(x) = -0.2705 x1² - 0.1292 x2² - 0.1546 x1 x2 + 3.12 x1 + 1.749 x2 - 12.89

Boundary:

As we know from the slides of Part 4, we obtain a quadratic decision boundary:

g_A(x) - g_B(x) = 0 ⟹ x1² + 0.0001937 x1 x2 + 6.397 x1 + 0.1816 x2² + 3.621 x2 - 25.78 = 0


and finally

g_A(xc) = -5.825 ;  g_B(xc) = -4.8610

g_B(xc) > g_A(xc); the pattern is assigned to class B.
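A sketch of this quadratic case (assuming NumPy; variable names are illustrative) that builds W_i, w_i and w_i0 from the estimated parameters and evaluates the two discriminants at xc:

```python
import numpy as np

XA = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])
XB = np.array([[5.0, 4.0], [3.0, 3.0], [4.0, 7.0], [6.5, 2.0]])
P_A, P_B = 2.0 / 3.0, 1.0 / 3.0        # from P(omega_1) = 2 P(omega_2)
x_c = np.array([2.2, 2.2])

def quad_g(X, prior, x):
    # g(x) = x^T W x + w^T x + w0, with W = -1/2 Sigma^{-1}, w = Sigma^{-1} mu,
    # w0 = -1/2 mu^T Sigma^{-1} mu - 1/2 ln|Sigma| + ln P
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)     # unbiased estimate (1/(n-1))
    Si = np.linalg.inv(Sigma)
    W, w = -0.5 * Si, Si @ mu
    w0 = -0.5 * mu @ Si @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

gA, gB = quad_g(XA, P_A, x_c), quad_g(XB, P_B, x_c)
print(gA, gB, "-> class", "A" if gA > gB else "B")   # g_B > g_A: class B
```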


4) ΣA = ΣB unknown (arbitrary) and P(ω1) = 2 P(ω2)

Hint: we need to estimate the common covariance matrix as a weighted mean of the two covariance estimates (as done in case 2 above, but without forcing it to be isotropic), and apply the Gaussian model which satisfies the hypothesis ΣA = ΣB (linear discriminant functions, as in Exercise 2).



Exercise 4 Let us consider a classification task with 3 data classes, a two-dimensional feature space, and Gaussian distributions for the 3 classes:

μ1 = (0, 0)^T ;  μ2 = (0, 1)^T ;  μ3 = (1, 1)^T ;  for each class Σ_i = 2I

P(ω1) = P(ω2) = 1/4  (hence P(ω3) = 1/2)

Find the optimal decision boundaries and the related decision regions.

Solution

Given that Σ_i = σ²I (with σ² = 2), we have:

g_i(x) = -||x - μ_i||² / (2σ²) + log P_i

g_1(x) = -1/4 (x1² + x2²) + log(1/4)
g_2(x) = -1/4 (x1² + x2²) - 1/4 (1 - 2x2) + log(1/4)
g_3(x) = -1/4 (x1² + x2²) - 1/4 (1 - 2x1 + 1 - 2x2) + log(1/2)

Equal terms can be deleted, so obtaining:

g_1(x) = log(1/4)
g_2(x) = -1/4 (1 - 2x2) + log(1/4)
g_3(x) = -1/4 (1 - 2x1 + 1 - 2x2) + log(1/2)


Decision boundary between class 1 and class 2

g_1(x) = g_2(x) ⟹ x2 = 1/2

For x̂ on the line x2 = 1/2 we have

g_1(x̂) = g_2(x̂) = log(1/4) ;  g_3(x̂) = -1/4 (1 - 2x1) + log(1/2)

x2 = 1/2 is the boundary if and only if g_1(x̂) = g_2(x̂) > g_3(x̂), therefore

Boundary between class 1 and class 2:

x2 = 1/2  if  x1 < 1/2 - 2 log 2 ≅ -0.8863

Decision boundary between class 1 and class 3

g_1(x) = g_3(x) ⟹ x1 + x2 = 1 - 2 log 2

For x̂ on the line x1 + x2 = 1 - 2 log 2 we have

g_1(x̂) = g_3(x̂) = log(1/4) ;  g_2(x̂) = (1/2) x2 - 1/4 + log(1/4)

x1 + x2 = 1 - 2 log 2 is the boundary only if g_1(x̂) = g_3(x̂) > g_2(x̂), therefore

Boundary between class 1 and class 3:

x1 + x2 = 1 - 2 log 2 ≅ -0.3863  if  x2 < 1/2


Decision boundary between class 2 and class 3

g_2(x) = g_3(x) ⟹ x1 = 1/2 - 2 log 2

For x̂ on the line x1 = 1/2 - 2 log 2 we have

g_2(x̂) = g_3(x̂) = -1/4 (1 - 2x2) + log(1/4) ;  g_1(x̂) = log(1/4)

x1 = 1/2 - 2 log 2 is the boundary only if g_2(x̂) = g_3(x̂) > g_1(x̂), therefore

Boundary between class 2 and class 3:

x1 = 1/2 - 2 log 2 ≅ -0.8863  if  x2 > 1/2


Legend: due to the difference of priors, both μ1 and μ2 lie in the decision region of class 3.

[Figure: decision regions of the three classes in the (x1, x2) plane, delimited by the boundaries x2 = 1/2, x1 + x2 = 1 - 2 log 2 ≅ -0.3863, and x1 = 1/2 - 2 log 2 ≅ -0.8863, with μ1, μ2, μ3 marked.]
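A quick point-by-point check of the resulting decision regions (a sketch assuming NumPy; the test points are illustrative):

```python
import numpy as np

mus = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
priors = np.array([0.25, 0.25, 0.5])
sigma2 = 2.0

def classify(x):
    # g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + log P(omega_i), MAP decision
    g = -np.sum((x - mus) ** 2, axis=1) / (2.0 * sigma2) + np.log(priors)
    return g.argmax() + 1

# Test points chosen on either side of the boundaries found above
print(classify(np.array([-2.0, 0.3])))   # 1 (x2 < 1/2 and x1 + x2 < 1 - 2 log 2)
print(classify(np.array([-2.0, 0.8])))   # 2 (x2 > 1/2 and x1 < 1/2 - 2 log 2)
print(classify(np.array([1.0, 1.0])))    # 3
```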


All the course material is available on the web site.

Course web site: http://pralab.diee.unica.it/MachineLearning