UNICA Course – Machine Learning, Exercises, Cagliari, March–May 2017, Fabio Roli ©
MACHINE LEARNING EXERCISES

Elements of parametric techniques

All the course material is available on the website
Course website: http://pralab.diee.unica.it/MachineLearning
PART 4: Elements of parametric techniques
Required knowledge to solve these exercises
1. Discriminant functions g(x) for Gaussian classes
a. General case (different priors, different covariance matrix per class):

g(x; \mu, \Sigma) = -\frac{1}{2} x^T \Sigma^{-1} x + \mu^T \Sigma^{-1} x - \frac{1}{2} \mu^T \Sigma^{-1} \mu + \ln P(\omega) - \frac{1}{2} \ln|\Sigma|
b. Degenerate cases
i. Same priors, same isotropic covariance matrix \Sigma = \sigma^2 I for each class (Nearest Mean Centroid classifier):

g(x) = \mu^T x - \frac{1}{2} \mu^T \mu \quad\text{or}\quad g(x) = -\|x - \mu\|^2
ii. Different priors, same isotropic covariance matrix \Sigma = \sigma^2 I for each class:

g(x) = \frac{1}{\sigma^2} \mu^T x - \frac{1}{2\sigma^2} \mu^T \mu + \ln P(\omega) \quad\text{or}\quad g(x) = -\frac{\|x - \mu\|^2}{2\sigma^2} + \ln P(\omega)
iii. Same priors, (arbitrary) covariance matrix equal for each class:

g(x) = \mu^T \Sigma^{-1} x - \frac{1}{2} \mu^T \Sigma^{-1} \mu + \ln P(\omega)
Note that, if the Gaussians are isotropic but with a different \sigma_i^2 per class, g(x) remains quadratic. The degenerate cases listed above hold only when the covariance matrix is the same for each class!
2. Classify a point using the MAP criterion:

\arg\max_i g_i(x)
3. Plot decision boundaries among classes in the feature space:
a. Determine the values of x^* for which g_1(x^*) = g_2(x^*).
b. Find the subset of points x^* for which g_1 and g_2 are the dominant classes, i.e., g_1(x^*) > g_3(x^*), g_1(x^*) > g_4(x^*), etc. This subset of points identifies the active boundary between class 1 and class 2.
c. Repeat for any other pair of classes.
4. Estimate the parameter values (mean and covariance matrix) from data, using MLE:
a. Mean: \hat\mu = \frac{1}{n} \sum_{k=1}^{n} x_k
b. Covariance: \hat\Sigma = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat\mu)(x_k - \hat\mu)^T

Note that the product inside the summation is an outer product between a column and a row vector of d elements, where d is the number of features, and that the sum is computed over the training samples. Accordingly, the covariance matrix is a d × d matrix. Depending on the assumptions made on the covariance matrix (e.g., equal for all classes, isotropic, etc.), the above estimate should be adjusted accordingly (see Exercise 3 for some examples).
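As a quick numerical sketch of the estimates and of the general-case discriminant above (NumPy-based; the helper names are illustrative, not part of the exercises):

```python
import numpy as np

def mle_estimates(X):
    """Estimate mean and covariance from an (n, d) sample matrix.

    The course formulas use the 1/(n-1) covariance estimate, which is
    what np.cov computes by default.
    """
    mu = X.mean(axis=0)                      # (1/n) * sum_k x_k
    sigma = np.cov(X, rowvar=False)          # 1/(n-1) * sum_k (x_k - mu)(x_k - mu)^T
    return mu, sigma

def gaussian_discriminant(x, mu, sigma, prior):
    """General-case discriminant
    g(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln P(omega),
    dropping the class-independent constant -(d/2) ln 2*pi."""
    diff = x - mu
    inv = np.linalg.inv(sigma)
    return float(-0.5 * diff @ inv @ diff
                 - 0.5 * np.log(np.linalg.det(sigma))
                 + np.log(prior))
```

With `rowvar=False`, each row of `X` is one training sample, matching the summation over samples in the formulas above.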
Exercise 1

Let us consider a three-class problem in a two-dimensional feature space.
a. Classify the pattern x_t = (1, 1/3)^T under the following assumptions:
• the probability density function of each class is Gaussian, with the same covariance matrix \Sigma = \sigma^2 I, and the parameter \sigma^2 (the variance) is unknown;
• all the classes have the same prior probability;
• the mean vectors of the three classes are \mu_1 = (0, 0)^T, \mu_2 = (0, 2)^T, \mu_3 = (2, 1)^T.
b. Classify the same pattern x_t = (1, 1/3)^T under different assumptions about the priors: P(\omega_1) = P(\omega_2) = 1/4, P(\omega_3) = 1/2.
Solution
The general form of a multivariate normal pdf is

p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]
The discriminant function g_i(x) is

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

Considering a Gaussian model with \Sigma_i = \sigma^2 I (so that \Sigma_i^{-1} = \frac{1}{\sigma^2} I) and dropping the class-independent constants, we obtain

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i)

a) Equal priors

For this particular case, the priors and the (unknown, positive) common factor 1/(2\sigma^2) can also be dropped, so the discriminant function becomes

g_i(x) = -\|x - \mu_i\|^2

For the pattern x_t = (1, 1/3)^T:

g_1(x)\big|_{x = x_t} = -\|x_t - \mu_1\|^2 = -10/9
g_2(x)\big|_{x = x_t} = -\|x_t - \mu_2\|^2 = -34/9

g_3(x)\big|_{x = x_t} = -\|x_t - \mu_3\|^2 = -13/9
Comparing the values of the above discriminant functions g_1, g_2 and g_3 for the pattern xt we can see that the value of the posterior probability is maximum for class 1. Therefore, the pattern (1,1/3)T is assigned to class 1, regardless of the unknown value of 𝝈.
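The nearest-mean computation above can be checked with a few lines of NumPy (a sketch; the variable names are illustrative):

```python
import numpy as np

# Exercise 1(a): equal priors, Sigma = sigma^2 I with sigma unknown, so the
# MAP rule reduces to the nearest-mean rule g_i(x) = -||x - mu_i||^2.
means = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 1.0]])   # mu_1, mu_2, mu_3
x_t = np.array([1.0, 1.0 / 3.0])

g = -np.sum((x_t - means) ** 2, axis=1)   # [-10/9, -34/9, -13/9]
print(g, "-> class", np.argmax(g) + 1)    # class 1
```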
b) Different priors
For this particular case, with P(\omega_1) = P(\omega_2) = 1/4 and P(\omega_3) = 1/2, the discriminant function becomes

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i)

Note that the inequality g_1(x) > g_2(x) does not depend on the value of \ln(1/4), whereas the other two inequalities depend on the difference between \ln(1/4) and \ln(1/2).

Let us calculate the value of the discriminant functions for the considered pattern x = x_t:

g_1(x_t) = -\frac{10/9}{2\sigma^2} + \ln(1/4)

g_2(x_t) = -\frac{34/9}{2\sigma^2} + \ln(1/4)

g_3(x_t) = -\frac{13/9}{2\sigma^2} + \ln(1/2)
Class 1 vs class 2
For the considered pattern, the inequality g1(xt)>g2(xt) is always true.
Class 1 vs class 3
From the inequality g_1(x_t) > g_3(x_t), we obtain

-\frac{10/9}{2\sigma^2} + \ln(1/4) > -\frac{13/9}{2\sigma^2} + \ln(1/2) \;\Rightarrow\; \sigma^2 < \frac{3}{18\,(\ln 4 - \ln 2)} = 0.2404

Pattern (1, 1/3)^T is assigned to class 1 only if \sigma^2 < 0.2404.
Pattern (1, 1/3)^T will be assigned to class 3 if \sigma^2 > 0.2404.

Class 2 vs. class 3

From the inequality g_2(x_t) > g_3(x_t) we obtain

\ln 2 - \ln 4 > \frac{7}{6\sigma^2}

This inequality never holds (the left-hand side is negative). Accordingly:

Pattern (1, 1/3)^T is assigned to class 1 only if \sigma^2 < 0.2404.
Pattern (1, 1/3)^T will be assigned to class 3 if \sigma^2 > 0.2404.

Homework:

Plot the decision boundaries for
a) equal priors
b) P_1 = P_2 = 1/4, \sigma = 0.1
c) P_1 = P_2 = 1/4, \sigma = 0.5
Exercise 2

Let us consider a hypothetical detection problem of network intrusions. We want to classify network traffic using two characteristic measures (two features). We have identified three data classes of interest (denial-of-service attack, probing attack, normal traffic). Traffic patterns follow a normal distribution with different, known parameters (e.g., the parameters have been previously estimated using the Maximum Likelihood method).
For the sake of simplicity, let us assume that the covariance matrices are equal to each other:

\Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma = \begin{pmatrix} 4 & -3 \\ -3 & 4 \end{pmatrix}

whereas the Gaussians' centers are as follows:

\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} -2 \\ -2 \end{pmatrix}

Moreover, let us suppose that the prior class probabilities are as follows (e.g., they have been estimated from the amount of traffic of each class):

P(\omega_1) = P(\omega_2) = \frac{1}{4}, \quad P(\omega_3) = \frac{1}{2}

A) Find the discriminant functions using the MAP decision rule, and classify the network traffic pattern x_c = (1, 1)^T.

B) Find the decision boundaries.
SOLUTION
The general form of a multivariate normal pdf is

p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]

The discriminant function g_i(x) is in the form

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
A)
Find the discriminant functions using the MAP decision rule, and classify the network traffic pattern x_c = (1, 1)^T.
From Part 4 of the course, we know that if \Sigma_i = \Sigma, patterns form hyper-ellipsoidal "clusters" of the same size and shape, centered at \mu_i, whose discriminant functions become

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)

Alternatively, dropping the term -\frac{1}{2} x^T \Sigma^{-1} x (common to all classes), we can write

g_i(x) = w_i^T x + w_{i0}

where

w_i = \Sigma^{-1} \mu_i; \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)

Therefore, we can substitute the values of \Sigma and \mu_i, compute g_i(x) for x = (1, 1)^T, and assign this pattern to the class i such that g_i(x) > g_j(x) for all j ≠ i.

The MAP decision criterion is

x \in \omega_i \iff p(\omega_i|x) > p(\omega_j|x) \quad \forall j \ne i

and, using Bayes' rule,

x \in \omega_i \iff p(x|\omega_i) P(\omega_i) > p(x|\omega_j) P(\omega_j) \quad \forall j \ne i

From this criterion, we obtain the discriminant functions related to p(x|\omega_i) P(\omega_i).

The covariance matrices are as follows:

\Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma = \begin{pmatrix} 4 & -3 \\ -3 & 4 \end{pmatrix}, \quad |\Sigma| = 7, \quad \Sigma^{-1} = \frac{1}{7} \begin{pmatrix} 4 & 3 \\ 3 & 4 \end{pmatrix}

In the logarithmic space, we can skip the common additive constants.
g_i(x) = -\frac{1}{2} \left[ x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_i - \mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i \right] + \ln P(\omega_i) \;\Rightarrow

\Rightarrow\; g_i(x) = -\frac{1}{2} \left[ -2 \mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i \right] + \ln P(\omega_i) = \mu_i^T \Sigma^{-1} x - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)
So, for each class we have (remember that P(\omega_1) = P(\omega_2) = 1/4, P(\omega_3) = 1/2):

g_1(x) = \ln P(\omega_1) = \ln\frac{1}{4}

g_2(x) = (2,\ 2) \binom{x_1}{x_2} - 4 + \ln P(\omega_2) = 2(x_1 + x_2) - 4 + \ln\frac{1}{4}

g_3(x) = -(2,\ 2) \binom{x_1}{x_2} - 4 + \ln P(\omega_3) = -2(x_1 + x_2) - 4 + \ln\frac{1}{2}
As expected, since all covariance matrices are equal to each other, we obtain linear discriminant functions. Decision boundaries are defined by hyper-planes placed between each pair of "neighboring" classes.

Let us classify the pattern x_c = (1, 1)^T:

g_1(x_c) = -1.3863; \quad g_2(x_c) = -1.3863; \quad g_3(x_c) = -8.6931

Thus, such a point can be assigned to class 1 or to class 2.
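The linear discriminants above can be computed compactly (a NumPy sketch; the matrix `W` stacks the weight vectors w_i as rows):

```python
import numpy as np

# Exercise 2(A): equal covariance Sigma for all classes, priors 1/4, 1/4, 1/2
# -> linear discriminants g_i(x) = w_i^T x + w_i0.
Sigma = np.array([[4.0, -3.0], [-3.0, 4.0]])
means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.25, 0.25, 0.5])

Sinv = np.linalg.inv(Sigma)
W = means @ Sinv                                   # rows: w_i = Sigma^{-1} mu_i
w0 = -0.5 * np.sum(W * means, axis=1) + np.log(priors)

x_c = np.array([1.0, 1.0])
g = W @ x_c + w0
print(g)   # [-1.3863, -1.3863, -8.6931]: x_c ties between classes 1 and 2
```

Note that `means @ Sinv` gives the rows \mu_i^T \Sigma^{-1}, which equal w_i^T because \Sigma^{-1} is symmetric.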
B) Find the decision boundaries.
Graphs shown below are not strictly necessary for the resolution of the problem; they are shown for clarity.
The eigenvectors of \Sigma are v_1 = \left(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)^T and v_2 = \left(\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}\right)^T, with eigenvalues 1 and 7, respectively; the Gaussians are therefore elongated along the direction (1, -1).

[Fig. 1: Probability densities of the classes (schematic representation of contour lines), centered at \mu_1, \mu_2, \mu_3 in the (x_1, x_2) plane; prior probabilities are not considered.]
[Fig. 2: Probability densities of the classes (surface plot over x_1, x_2, probability density); prior probabilities are not considered.]

[Fig. 3: Probability densities of the classes (contour plot over x_1, x_2); prior probabilities are not considered.]
[Fig. 4: Probability densities of the classes (surface and contour plots over x_1, x_2); prior probabilities are 1/4 (class 1), 1/4 (class 2), 1/2 (class 3).]
Decision boundaries can be found considering that they correspond to g_i(x) = g_j(x), that is, to the set of points for which the posterior probabilities of class i and class j are identical, provided that no other class has higher probability. The discriminant functions are:

g_1(x) = \ln\frac{1}{4}

g_2(x) = 2(x_1 + x_2) - 4 + \ln\frac{1}{4}

g_3(x) = -2(x_1 + x_2) - 4 + \ln\frac{1}{2}

First decision boundary: g_1(x) = g_2(x) \Rightarrow 2(x_1 + x_2) - 4 = 0, so the candidate boundary \hat{x} = (\hat{x}_1, \hat{x}_2)^T is defined by the equation

\hat{x}_1 + \hat{x}_2 = 2

Substituting this into the discriminant functions g_1, g_2 and g_3, we obtain their values on the hyper-plane defined by g_1 = g_2:

g_1(\hat{x}) = -1.3863; \quad g_2(\hat{x}) = -1.3863; \quad g_3(\hat{x}) = -8.6931

Obviously, we find the equality g_1(\hat{x}) = g_2(\hat{x}). The inequality g_3(\hat{x}) < g_1(\hat{x}) = g_2(\hat{x}) tells us that this first hyper-plane is a decision boundary between class 1 and class 2, for all values of \hat{x}.
Second decision boundary

g_2(x) = g_3(x) \Rightarrow 2(x_1 + x_2) + \ln\frac{1}{4} = -2(x_1 + x_2) + \ln\frac{1}{2} \;\Rightarrow

\hat{x}_1 + \hat{x}_2 = \frac{1}{4} \ln 2

Substituting \hat{x} = (\hat{x}_1, \hat{x}_2)^T into g_1, g_2 and g_3, we obtain their values on the hyper-plane defined by g_2 = g_3:

g_1(\hat{x}) = -1.3863; \quad g_2(\hat{x}) = -5.0397; \quad g_3(\hat{x}) = -5.0397

Obviously, we find the equality g_2(\hat{x}) = g_3(\hat{x}). However, the inequality g_1(\hat{x}) > g_2(\hat{x}) = g_3(\hat{x}) tells us that this second hyper-plane is NOT a decision boundary between classes 2 and 3: at its points, the probability of class 1 is greater than those of classes 2 and 3, i.e., g_1(\hat{x}) > g_j(\hat{x}) for j = 2, 3.

Third decision boundary

g_1(x) = g_3(x) \Rightarrow \ln\frac{1}{4} = -2(x_1 + x_2) - 4 + \ln\frac{1}{2} \;\Rightarrow

\hat{x}_1 + \hat{x}_2 = \frac{1}{2} \ln 2 - 2

Substituting \hat{x} into g_1, g_2 and g_3, we obtain their values on the hyper-plane defined by g_1 = g_3:

g_1(\hat{x}) = -1.3863; \quad g_2(\hat{x}) = -8.6931; \quad g_3(\hat{x}) = -1.3863

Obviously, the value of g_1 is constant, and we find the equality g_1(\hat{x}) = g_3(\hat{x}). The inequality g_2(\hat{x}) < g_1(\hat{x}) = g_3(\hat{x}) tells us that this third hyper-plane is a decision boundary between class 1 and class 3.
Decision boundaries:

Between classes 1 and 2: x_1 + x_2 = 2

Between classes 1 and 3: x_1 + x_2 = \frac{1}{2} \ln 2 - 2 = -1.6534

[Fig. 5: Probability densities and decision boundaries in the (x_1, x_2) plane. The decision boundary between class 1 and class 3 is identified by (1,3); the decision boundary between class 1 and class 2 is identified by (1,2).]

It is easy to note that the point x_c = (1, 1)^T lies exactly on the hyper-plane (1,2); in fact, this point can be assigned to class 1 or class 2 (see Exercise 2-A).

From the graph above, it is easy to see that the separating lines are not symmetric with respect to the Gaussian centers, due to the different class prior probabilities.
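A brute-force check of which candidate hyper-planes are active boundaries (a NumPy sketch; `label` is an illustrative helper, not part of the original solution):

```python
import numpy as np

# Exercise 2(B): classify points on either side of each candidate hyper-plane.
Sigma = np.array([[4.0, -3.0], [-3.0, 4.0]])
means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -2.0]])
priors = np.array([0.25, 0.25, 0.5])
Sinv = np.linalg.inv(Sigma)
W = means @ Sinv
w0 = -0.5 * np.sum(W * means, axis=1) + np.log(priors)

def label(x1, x2):
    return int(np.argmax(W @ np.array([x1, x2]) + w0)) + 1

# Crossing x1 + x2 = 2 flips the decision between class 1 and class 2:
print(label(0.99, 1.0), label(1.01, 1.0))               # 1 2
# x1 + x2 = (1/4) ln 2 is NOT a boundary: class 1 wins on both sides of it.
s = 0.25 * np.log(2)
print(label(s/2 - 0.01, s/2), label(s/2 + 0.01, s/2))   # 1 1
```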
Exercise 3

Let us consider the following training samples for classes \omega_A and \omega_B, respectively:

A: (-1, 0)^T, (1, 0)^T, (-0.1, 2)^T, (0.3, -1.8)^T
B: (5, 4)^T, (3, 3)^T, (4, 7)^T, (6.5, 2)^T

The class-conditional distributions are as follows:

p(x|\omega_A) = N(x; \mu_A, \Sigma_A); \quad p(x|\omega_B) = N(x; \mu_B, \Sigma_B)

The parameters \mu_A, \mu_B, \Sigma_A, \Sigma_B are not given; they must be estimated.

Find the optimal discriminant function and classify the pattern \hat{x} = (2.2, 2.2)^T under the following hypotheses:

1) \Sigma_A = \Sigma_B = \sigma^2 I and P(\omega_1) = P(\omega_2)
2) \Sigma_A = \Sigma_B = \sigma^2 I and P(\omega_1) = 2 P(\omega_2)
3) \Sigma_A \ne \Sigma_B and P(\omega_1) = 2 P(\omega_2)
4) \Sigma_A = \Sigma_B arbitrary and P(\omega_1) = 2 P(\omega_2)

Note that, by changing the hypotheses, the classification changes as well.

SOLUTION

The general form of a multivariate normal pdf is

p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]

Taking the logarithm of the above expression yields the discriminant function g_i(x):

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
1) Under the hypothesis \Sigma_A = \Sigma_B = \sigma^2 I and P(\omega_1) = P(\omega_2), find the optimal discriminant function and classify the pattern \hat{x} = (2.2, 2.2)^T.

1.a Classify the pattern

Under this hypothesis, we do not need to estimate the covariance matrix. We just exploit the general expression of g_i (see Part 4 of the course):

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

By removing the terms common to g_A and g_B, the discriminant functions become:

g_A(x) = -\|x - \mu_A\|^2; \qquad g_B(x) = -\|x - \mu_B\|^2

and the optimal discriminant plane is g_A(x) = g_B(x).

We can estimate the mean vectors using \hat\mu = \frac{1}{n} \sum_{k=1}^{n} x_k:

\mu_A = \begin{pmatrix} 0.05 \\ 0.05 \end{pmatrix}; \qquad \mu_B = \begin{pmatrix} 4.625 \\ 4 \end{pmatrix}

Squared distances of \hat{x} from the above means:

d_A^2 = 9.2450; \qquad d_B^2 = 9.1206

g_A(\hat{x}) = -9.2450; \quad g_B(\hat{x}) = -9.1206

The pattern is assigned to class B.
1.b Find the boundary

The boundary is defined by the equation

\|x - \mu_A\|^2 = \|x - \mu_B\|^2

where the generic two-dimensional vector is x = (x_1, x_2)^T, that is,

(x_1 - \mu_{A1})^2 + (x_2 - \mu_{A2})^2 = (x_1 - \mu_{B1})^2 + (x_2 - \mu_{B2})^2

After a simple calculation, we obtain

9.15\, x_1 + 7.9\, x_2 = 37.3856

An alternative, coordinate-free approach expands (x - \mu_A)^T (x - \mu_A) = (x - \mu_B)^T (x - \mu_B):

x^T x - 2 \mu_A^T x + \mu_A^T \mu_A = x^T x - 2 \mu_B^T x + \mu_B^T \mu_B \;\Rightarrow\; 2 (\mu_B - \mu_A)^T x = \mu_B^T \mu_B - \mu_A^T \mu_A

which yields the same boundary 9.15\, x_1 + 7.9\, x_2 = 37.3856.

The point (2.2, 2.2)^T is slightly above the boundary, and thus classified as class B (as discussed before).
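The computations for hypothesis 1 can be reproduced with a short NumPy sketch (variable names are illustrative):

```python
import numpy as np

# Exercise 3, hypothesis 1: Sigma_A = Sigma_B = sigma^2 I, equal priors
# -> nearest-mean rule on the estimated means.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])
B = np.array([[5.0, 4.0], [3.0, 3.0], [4.0, 7.0], [6.5, 2.0]])
mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)      # (0.05, 0.05), (4.625, 4)

x = np.array([2.2, 2.2])
dA2 = np.sum((x - mu_A) ** 2)                    # 9.2450
dB2 = np.sum((x - mu_B) ** 2)                    # 9.1206
print("class", "B" if dB2 < dA2 else "A")        # B

# Boundary 2(mu_B - mu_A)^T x = ||mu_B||^2 - ||mu_A||^2:
w = 2 * (mu_B - mu_A)                            # (9.15, 7.9)
c = mu_B @ mu_B - mu_A @ mu_A                    # 37.3856
print(w, c)
```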
2) Under the hypothesis \Sigma_A = \Sigma_B = \sigma^2 I and P(\omega_1) = 2 P(\omega_2), find the optimal discriminant function and classify the pattern \hat{x} = (2.2, 2.2)^T.

As in the previous case, we can start from the general form of the discriminant function

g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)

By removing the terms common to both discriminants (the quadratic term is the same in each g_i and can thus be ignored):

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i)

We need to estimate the term \sigma^2. The estimates of the class covariance matrices,

\hat\Sigma_i = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} (x_k - \hat\mu_i)(x_k - \hat\mu_i)^T

give

\hat\Sigma_A = \begin{pmatrix} 0.6967 & -0.25 \\ -0.25 & 2.41 \end{pmatrix}; \qquad \hat\Sigma_B = \begin{pmatrix} 2.2292 & -1.3333 \\ -1.3333 & 4.6667 \end{pmatrix}

But the hypothesis is \Sigma_A = \Sigma_B = \sigma^2 I.

When all classes share the same covariance matrix, it can be proven that the maximum likelihood estimate of the common covariance matrix is (see the slides of Part 4 of the course):

\Sigma = \sum_{i} \frac{n_i}{n}\, \Sigma_i
or, if we know the priors,

\Sigma = \sum_{i} P(\omega_i)\, \Sigma_i

In order to enforce consistency between the estimated matrices and the hypothesis \Sigma_A = \Sigma_B = \sigma^2 I, we impose that these matrices are diagonal and proportional to the identity matrix (only zero values outside the diagonal). Essentially, we ignore the off-diagonal terms,

\hat\Sigma_A \to \begin{pmatrix} 0.6967 & 0 \\ 0 & 2.41 \end{pmatrix}; \qquad \hat\Sigma_B \to \begin{pmatrix} 2.2292 & 0 \\ 0 & 4.6667 \end{pmatrix}

and average the terms on the main diagonal. This corresponds to computing the variance of each feature and then averaging the corresponding values, separately for each class:

\hat\Sigma_A = \frac{0.6967 + 2.41}{2}\, I = 1.5533\, I; \qquad \hat\Sigma_B = \frac{2.2292 + 4.6667}{2}\, I = 3.4479\, I

Finally, the average variance is computed using the weighted mean above, weighting each average class variance by its prior probability:

\Sigma = \sum_i P(\omega_i)\, \hat\Sigma_i = P_A \hat\Sigma_A + P_B \hat\Sigma_B = 2.1849\, I

so \sigma^2 = 2.1849.
2.a Classify the pattern \hat{x} = (2.2, 2.2)^T

g_A(\hat{x}) = -\frac{\|\hat{x} - \mu_A\|^2}{2\sigma^2} + \ln P(\omega_A) = -2.5212

g_B(\hat{x}) = -\frac{\|\hat{x} - \mu_B\|^2}{2\sigma^2} + \ln P(\omega_B) = -3.1858

Since g_B(\hat{x}) < g_A(\hat{x}), the pattern is assigned to class A.
Alternative approach

The discriminant functions can be expressed as

g_i(x) = w_i^T x + w_{i0}, \quad\text{where}\quad w_i = \frac{\mu_i}{\sigma^2}; \quad w_{i0} = -\frac{\mu_i^T \mu_i}{2\sigma^2} + \ln P(\omega_i)

(see SLIDES, chapter 4). Substituting the estimates:

w_A = \begin{pmatrix} 0.0229 \\ 0.0229 \end{pmatrix}, \; w_{A0} = -0.4066; \qquad w_B = \begin{pmatrix} 2.1168 \\ 1.8308 \end{pmatrix}, \; w_{B0} = -9.6554

g_A(\hat{x}) = w_A^T \hat{x} + w_{A0} = -0.3059; \qquad g_B(\hat{x}) = w_B^T \hat{x} + w_{B0} = -0.9706

Since g_B(\hat{x}) < g_A(\hat{x}), the pattern is assigned to class A.
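The pooled-variance estimation and classification for hypothesis 2 can be sketched as follows (NumPy; `g` is an illustrative helper):

```python
import numpy as np

# Exercise 3, hypothesis 2: Sigma_A = Sigma_B = sigma^2 I, P(A) = 2 P(B).
A = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])
B = np.array([[5.0, 4.0], [3.0, 3.0], [4.0, 7.0], [6.5, 2.0]])
mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
P_A, P_B = 2 / 3, 1 / 3

# Average the per-feature variances of each class, then pool with the priors.
s2_A = np.cov(A, rowvar=False).diagonal().mean()       # 1.5533
s2_B = np.cov(B, rowvar=False).diagonal().mean()       # 3.4479
s2 = P_A * s2_A + P_B * s2_B                           # 2.1849

def g(x, mu, P):
    return -np.sum((x - mu) ** 2) / (2 * s2) + np.log(P)

x = np.array([2.2, 2.2])
print(g(x, mu_A, P_A), g(x, mu_B, P_B))   # ≈ -2.5212, -3.1858 -> class A
```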
2.b Find the boundary

g_A(x) = -\frac{\|x - \mu_A\|^2}{2\sigma^2} + \ln P(\omega_A); \qquad g_B(x) = -\frac{\|x - \mu_B\|^2}{2\sigma^2} + \ln P(\omega_B)

The optimal discriminant plane is the hyperplane g_A = g_B, that is,

2.094\, x_1 + 1.808\, x_2 = 9.2488

It is easy to see (compare with point 1.b in the solution of this exercise) that the boundary has shifted towards the upper-right corner of the plot, due to the increased prior of class A. Accordingly, the point (2.2, 2.2)^T now lies below the decision boundary, i.e., in the decision region of class A.
(Alternative approach)

The optimal discriminant plane is the hyperplane g_A = g_B, that is,

w_A^T x + w_{A0} = w_B^T x + w_{B0} \;\Rightarrow\; (w_A - w_B)^T x + (w_{A0} - w_{B0}) = 0

The hyperplane can also be written in the form w^T (x - x_0) = 0; substituting the values computed above, this again gives

2.094\, x_1 + 1.808\, x_2 = 9.2488
3) By assuming that \Sigma_A \ne \Sigma_B and P(\omega_1) = 2 P(\omega_2), find the optimal discriminant boundary and classify the pattern \hat{x} = (2.2, 2.2)^T.

We can estimate the covariance matrices (see the previous case):

\hat\Sigma_i = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} (x_k - \hat\mu_i)(x_k - \hat\mu_i)^T

\hat\Sigma_A = \begin{pmatrix} 0.6967 & -0.25 \\ -0.25 & 2.41 \end{pmatrix}; \qquad \hat\Sigma_B = \begin{pmatrix} 2.2292 & -1.3333 \\ -1.3333 & 4.6667 \end{pmatrix}

g_i(x) is a quadratic function which can be written as

g_i(x) = x^T W_i x + w_i^T x + w_{i0}

where

W_i = -\frac{1}{2} \Sigma_i^{-1}; \qquad w_i = \Sigma_i^{-1} \mu_i; \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln|\Sigma_i| + \ln P(\omega_i)
Substituting:

W_A = \begin{pmatrix} -0.7454 & -0.0773 \\ -0.0773 & -0.2155 \end{pmatrix}; \quad w_A = (0.0823,\ 0.0293)^T; \quad w_{A0} = -0.6484

W_B = \begin{pmatrix} -0.2705 & -0.0773 \\ -0.0773 & -0.1292 \end{pmatrix}; \quad w_B = (3.1207,\ 1.7487)^T; \quad w_{B0} = -12.8893

we can compute the discriminant functions and classify the pattern \hat{x} = (2.2, 2.2)^T.

Discriminant functions:

g_A(x) = -0.7454\, x_1^2 - 0.2155\, x_2^2 - 0.1547\, x_1 x_2 + 0.08227\, x_1 + 0.02928\, x_2 - 0.6484

g_B(x) = -0.2705\, x_1^2 - 0.1292\, x_2^2 - 0.1546\, x_1 x_2 + 3.12\, x_1 + 1.749\, x_2 - 12.89

Boundary:

As we know from the SLIDES, chapter 4, with different covariance matrices we obtain a quadratic decision boundary:

g_A(x) - g_B(x) = 0 \;\Rightarrow\; x_1^2 + 0.0001937\, x_1 x_2 + 0.1816\, x_2^2 + 6.397\, x_1 + 3.621\, x_2 - 25.78 = 0
and finally

g_A(\hat{x}) = -5.825; \qquad g_B(\hat{x}) = -4.8610

Since g_B(\hat{x}) > g_A(\hat{x}), the pattern is assigned to class B.
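The quadratic-discriminant computation for hypothesis 3 can be sketched as follows (NumPy; `quad_params` and `g` are illustrative helpers, and the last decimals may differ slightly from the hand-rounded values in the text):

```python
import numpy as np

# Exercise 3, hypothesis 3: Sigma_A != Sigma_B -> quadratic discriminants
# g_i(x) = x^T W_i x + w_i^T x + w_i0.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [-0.1, 2.0], [0.3, -1.8]])
B = np.array([[5.0, 4.0], [3.0, 3.0], [4.0, 7.0], [6.5, 2.0]])
priors = {"A": 2 / 3, "B": 1 / 3}

def quad_params(X, P):
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # 1/(n-1) estimate, as in the text
    Sinv = np.linalg.inv(S)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(S)) + np.log(P)
    return W, w, w0

def g(x, params):
    W, w, w0 = params
    return x @ W @ x + w @ x + w0

x = np.array([2.2, 2.2])
gA = g(x, quad_params(A, priors["A"]))   # ≈ -5.8
gB = g(x, quad_params(B, priors["B"]))   # ≈ -4.86
print("class", "B" if gB > gA else "A")  # B
```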
4) \Sigma_A = \Sigma_B unknown (arbitrary) and P(\omega_1) = 2 P(\omega_2)

Hint: estimate the common covariance matrix as a weighted mean of the two class covariances, and apply the Gaussian model that satisfies the hypothesis \Sigma_A = \Sigma_B, as in case 2 of this exercise.
Exercise 4

Let us consider a classification task with 3 data classes, a two-dimensional feature space, and Gaussian distributions for the 3 classes:

\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}; \quad \mu_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}; \quad \mu_3 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}; \quad \Sigma_i = 2I \text{ for each class}

P(\omega_1) = P(\omega_2) = \frac{1}{4} \quad \text{(so that } P(\omega_3) = \frac{1}{2}\text{)}

Find the optimal decision boundaries and the related decision regions.

Solution

Given that \Sigma_i = \sigma^2 I, we have

g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P_i

g_1(x) = -\frac{1}{4}(x_1^2 + x_2^2) + \ln\frac{1}{4}

g_2(x) = -\frac{1}{4}(x_1^2 + x_2^2) - \frac{1}{4}(1 - 2x_2) + \ln\frac{1}{4}

g_3(x) = -\frac{1}{4}(x_1^2 + x_2^2) - \frac{1}{4}\left[(1 - 2x_1) + (1 - 2x_2)\right] + \ln\frac{1}{2}

Equal terms can be deleted, so obtaining:

g_1(x) = \ln\frac{1}{4}

g_2(x) = -\frac{1}{4}(1 - 2x_2) + \ln\frac{1}{4}

g_3(x) = -\frac{1}{4}\left[(1 - 2x_1) + (1 - 2x_2)\right] + \ln\frac{1}{2}
Decision boundary between class 1 and class 2

g_1(x) = g_2(x) \Rightarrow x_2 = \frac{1}{2}

For \hat{x} on the line x_2 = \frac{1}{2} we have

g_1(\hat{x}) = g_2(\hat{x}) = \ln\frac{1}{4}; \qquad g_3(\hat{x}) = -\frac{1}{4}(1 - 2\hat{x}_1) + \ln\frac{1}{2}

x_2 = \frac{1}{2} is the boundary if and only if g_1(\hat{x}) = g_2(\hat{x}) > g_3(\hat{x}); therefore:

Boundary between class 1 and class 2: x_2 = \frac{1}{2}, if x_1 < \frac{1}{2} - 2\ln 2 \approx -0.8863

Decision boundary between class 1 and class 3

g_1(x) = g_3(x) \Rightarrow x_1 + x_2 = 1 - 2\ln 2

For \hat{x} on the line x_1 + x_2 = 1 - 2\ln 2 we have

g_1(\hat{x}) = g_3(\hat{x}) = \ln\frac{1}{4}; \qquad g_2(\hat{x}) = \frac{1}{2}\hat{x}_2 - \frac{1}{4} + \ln\frac{1}{4}

x_1 + x_2 = 1 - 2\ln 2 is the boundary only if g_1(\hat{x}) = g_3(\hat{x}) > g_2(\hat{x}); therefore:

Boundary between class 1 and class 3: x_1 + x_2 = 1 - 2\ln 2 \approx -0.3863, if x_2 < \frac{1}{2}
Decision boundary between class 2 and class 3

g_2(x) = g_3(x) \Rightarrow x_1 = \frac{1}{2} - 2\ln 2

For \hat{x} on the line x_1 = \frac{1}{2} - 2\ln 2 we have

g_2(\hat{x}) = g_3(\hat{x}) = -\frac{1}{4}(1 - 2\hat{x}_2) + \ln\frac{1}{4}; \qquad g_1(\hat{x}) = \ln\frac{1}{4}

x_1 = \frac{1}{2} - 2\ln 2 is the boundary only if g_2(\hat{x}) = g_3(\hat{x}) > g_1(\hat{x}); therefore:

Boundary between class 2 and class 3: x_1 = \frac{1}{2} - 2\ln 2 \approx -0.8863, if x_2 > \frac{1}{2}
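The three boundary segments can be sanity-checked by direct MAP classification (a NumPy sketch; `label` is an illustrative helper):

```python
import numpy as np

# Exercise 4: verify the boundary segments by classifying nearby points.
means = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
priors = np.array([0.25, 0.25, 0.5])

def label(x):
    # Sigma_i = 2I -> g_i(x) = -||x - mu_i||^2 / 4 + ln P_i
    g = -np.sum((x - means) ** 2, axis=1) / 4.0 + np.log(priors)
    return int(np.argmax(g)) + 1

# On x2 = 1/2 the 1-2 boundary is active only for x1 < 1/2 - 2 ln 2 ≈ -0.8863:
print(label(np.array([-2.0, 0.49])), label(np.array([-2.0, 0.51])))   # 1 2
# For x1 > -0.8863 on that line, class 3 wins on both sides:
print(label(np.array([0.0, 0.49])), label(np.array([0.0, 0.51])))     # 3 3
```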
[Figure: decision regions in the (x_1, x_2) plane. The boundary segments x_2 = \frac{1}{2} (for x_1 < -0.8863), x_1 + x_2 = 1 - 2\ln 2 \approx -0.3863 (for x_2 < \frac{1}{2}), and x_1 = \frac{1}{2} - 2\ln 2 \approx -0.8863 (for x_2 > \frac{1}{2}) delimit the decision regions of classes 1, 2 and 3.]

Due to the difference of the priors, both \mu_1 and \mu_2 lie in the decision region of class 3.