22 machine learning feature selection
Machine Learning for Data Mining: Feature Selection
Andres Mendez-Vazquez
July 19, 2015
Outline

1 Introduction
  What is Feature Selection?
  Preprocessing
    Outliers
    Data Normalization
    Missing Data
  The Peaking Phenomenon

2 Feature Selection
  Feature Selection
  Feature selection based on statistical hypothesis testing
  Application of the t-Test in Feature Selection
  Considering Feature Sets
  Scatter Matrices
  What to do with it?
  Sequential Backward Selection
What is this?

Main Question
"Given a number of features, how can one select the most important of them so as to reduce their number and at the same time retain as much as possible of their class discriminatory information?"

Why is it important?
1 If we select features with little discriminative power, the subsequent design of the classifier will lead to poor performance.
2 If information-rich features are selected, the design of the classifier can be greatly simplified.

Therefore
We want features that lead to
1 Large between-class distance.
2 Small within-class variance.
Then
Basically, we want nicely separated, dense clusters!!!
However, Before That...

It is necessary to do the following:
1 Outlier removal.
2 Data normalization.
3 Deal with missing data.

Actually: PREPROCESSING!!!
Outliers

Definition
An outlier is defined as a point that lies very far from the mean of the corresponding random variable.

Note: We use the standard deviation as the measure of distance.

Example
For a normally distributed random variable:
1 A distance of two times the standard deviation covers 95% of the points.
2 A distance of three times the standard deviation covers 99% of the points.

Note
Points with values very different from the mean value produce large errors during training and may have disastrous effects. These effects are even worse when the outliers are the result of noisy measurements.
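The standard-deviation rule above can be sketched directly; the data here is hypothetical, and `flag_outliers` is an illustrative helper, not part of the slides:

```python
import statistics

def flag_outliers(values, k=2.0):
    """Flag points lying more than k standard deviations from the sample mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by N - 1)
    return [abs(v - mean) > k * std for v in values]

# Hypothetical feature values with one noisy measurement at the end.
# Note that the extreme point inflates the standard deviation itself,
# which is why a modest k is used here.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1, 25.0]
flags = flag_outliers(data, k=2.0)  # only the last point is flagged
```

A caveat worth noting: because the outlier contributes to the mean and standard deviation, a very extreme point can mask itself at k = 3; robust estimators (as in Huber's book cited later) avoid this.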
Outlier Removal

Important
Removing outliers is therefore of the greatest importance.

Therefore
You can do the following:
1 If you have a small number of outliers ⇒ discard them!!!
2 Adopt cost functions that are not sensitive to outliers:
  For example, possibilistic clustering.
3 For more techniques, look at:
  Huber, P.J., "Robust Statistics," John Wiley and Sons, 2nd Ed., 2009.
Data Normalization

In the real world
In many practical situations a designer is confronted with features whose values lie within different dynamic ranges.

For Example
We can have two features with the following ranges:

x_i \in [0, 100{,}000], \qquad x_j \in [0, 0.5]

Thus
Many classification machines will be swamped by the first feature!!!
Data Normalization

We have the following situation
Features with large values may have a larger influence in the cost function than features with small values.

Thus!!!
This does not necessarily reflect their respective significance in the design of the classifier.
Example I

Be Naive
For each feature i = 1, ..., d obtain \max_i and \min_i, and set

\hat{x}_{ik} = \frac{x_{ik} - \min_i}{\max_i - \min_i} \quad (1)

Problem
This simple normalization sends everything into the unit hypercube, thus losing data resolution!!!
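Equation (1) can be sketched as follows; the two columns are hypothetical, chosen to mirror the earlier x_i in [0, 100,000] versus x_j in [0, 0.5] example:

```python
def min_max_normalize(column):
    """Rescale one feature column into [0, 1] per equation (1)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical feature columns with very different dynamic ranges.
income = [0.0, 25000.0, 50000.0, 100000.0]
rate = [0.0, 0.125, 0.25, 0.5]

income_n = min_max_normalize(income)
rate_n = min_max_normalize(rate)
# After normalization both columns occupy the same [0, 1] scale,
# so neither one swamps the other in a distance or cost computation.
```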
Example II

Use the idea of
Everything is Gaussian...

Thus
For each feature k = 1, 2, ..., d compute

1 \bar{x}_k = \frac{1}{N} \sum_{i=1}^{N} x_{ik}
2 \sigma_k^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ik} - \bar{x}_k)^2

Thus

\hat{x}_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k} \quad (2)
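A minimal sketch of equation (2) on a hypothetical column; `statistics.stdev` divides by N - 1, matching the slide's definition of σ_k:

```python
import statistics

def standardize(column):
    """Zero-mean, unit-variance scaling of one feature column per equation (2)."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # divides by N - 1, as in the slide
    return [(x - mean) / std for x in column]

col = [2.0, 4.0, 6.0, 8.0]
z = standardize(col)
# z now has (sample) mean 0 and standard deviation 1.
```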
Example II
Thus
All new features have zero mean and unit variance.

Further
Other linear techniques limit the feature values to the range [0, 1] or [−1, 1] by proper scaling.

However
We can also use a nonlinear mapping, for example softmax scaling.
Example III

Softmax Scaling
It consists of two steps.

First one

y_{ik} = \frac{x_{ik} - \bar{x}_k}{\sigma_k} \quad (3)

Second one

\hat{x}_{ik} = \frac{1}{1 + \exp\{-y_{ik}\}} \quad (4)
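The two steps, (3) then (4), can be combined in one sketch; the column below is hypothetical and includes one extreme value to show the squashing behavior:

```python
import math
import statistics

def softmax_scale(column):
    """Softmax scaling: standardize per eq. (3), then squash each value
    into (0, 1) with the logistic function per eq. (4)."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [1.0 / (1.0 + math.exp(-(x - mean) / std)) for x in column]

# Hypothetical column containing one extreme value.
col = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = softmax_scale(col)
# All outputs lie strictly inside (0, 1); the extreme point is squashed
# toward 1 instead of dominating the scale.
```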
Explanation
Notice the red area is almost flat!!!

Thus, we have that
The red region represents values of y inside the region defined by the mean and the variance (small values of y).
For those values, \hat{x} behaves as a linear function of y.

And for values too far away from the mean
They are squashed by the exponential part of the function.
If you want something more complex

A more complex analysis
You can use a Taylor expansion around a point a:

x = f(y) = f(a) + f'(a)(y - a) + f''(a)\frac{(y - a)^2}{2} + \cdots \quad (5)
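As a concrete instance of (5) (worked here, not in the slides), expanding the logistic function of equation (4) around a = 0 makes the "linear near the mean, squashed far away" claim explicit. With f(y) = \frac{1}{1 + e^{-y}} we have f(0) = \frac{1}{2}, f'(0) = \frac{1}{4}, and f''(0) = 0, so

f(y) \approx \frac{1}{2} + \frac{y}{4} - \frac{y^{3}}{48} + \cdots

For small |y| the cubic term is negligible and \hat{x} is approximately linear in y with slope \frac{1}{4}; for large |y| the expansion breaks down and the function saturates toward 0 or 1.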
Missing Data

This can happen
In practice, certain features may be missing from some feature vectors.

Examples where this happens:
1 Social sciences: incomplete surveys.
2 Remote sensing: sensors go off-line.
3 Etc.

Note
Completing the missing values in a set of data is also known as imputation.
Some traditional techniques to solve this problem

Use zeros and risk it!!!
The idea is not to add anything to the features.

The sample mean / unconditional mean
No matter what distribution you have, use the sample mean:

\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{ik} \quad (6)

Find the distribution of your data
Use the mean of that distribution. For example, if you have a beta distribution:

\bar{x}_i = \frac{\alpha}{\alpha + \beta} \quad (7)
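Sample-mean imputation per equation (6) can be sketched on a hypothetical column, with `None` marking missing entries (the "use zeros" option would simply substitute 0 instead):

```python
def impute_missing(column):
    """Replace missing (None) entries with the sample mean of the
    observed entries, per equation (6)."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

# Hypothetical feature column with two missing measurements.
col = [3.0, None, 5.0, None, 4.0]
filled = impute_missing(col)
# The observed mean is 4.0, so both gaps are filled with 4.0.
```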
The MOST traditional
Drop it
Remove that data.
Still, you need to have a lot of data to afford this luxury.
Something more advanced

Split the data samples into two sets of variables:

x_{complete} = \begin{pmatrix} x_{observed} \\ x_{missed} \end{pmatrix} \quad (8)

Generate the following probability distribution:

P(x_{missed} \mid x_{observed}, \Theta) = \frac{P(x_{missed}, x_{observed} \mid \Theta)}{P(x_{observed} \mid \Theta)} \quad (9)

where

p(x_{observed} \mid \Theta) = \int_{\mathcal{X}} p(x_{complete} \mid \Theta) \, dx_{missed} \quad (10)
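A minimal sketch of what (9) buys you, for the special case of a bivariate Gaussian where the conditional distribution has a closed form: the parameters mu, sigma, and rho stand in for Theta and are simply assumed known here (in practice they would be estimated, e.g. with EM, as the slides note). Imputing with the conditional mean E[x_missed | x_observed] uses the observed feature, unlike the unconditional mean of equation (6):

```python
def conditional_mean(x_obs, mu_obs, mu_mis, sigma_obs, sigma_mis, rho):
    """E[x_missed | x_observed] for a bivariate Gaussian with known
    parameters (standing in for Theta in equation (9)):
    mu_mis + rho * (sigma_mis / sigma_obs) * (x_obs - mu_obs)."""
    return mu_mis + rho * (sigma_mis / sigma_obs) * (x_obs - mu_obs)

# Hypothetical parameters for two positively correlated features:
# observing x_obs one unit above its mean pulls the imputed value
# of the missing feature above its own mean.
value = conditional_mean(x_obs=6.0, mu_obs=5.0, mu_mis=10.0,
                         sigma_obs=2.0, sigma_mis=4.0, rho=0.5)
```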
Something more advanced: A two-step process

Clearly, \Theta needs to be estimated
For this, we use the Expectation Maximization algorithm (look at the Dropbox for that).

Then, using Monte Carlo methods
We draw samples (with something as simple as a slice sampler) from

p(x_{missed} \mid x_{observed}, \Theta) \quad (11)
![Page 66: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/66.jpg)
THE PEAKING PHENOMENON

Remember
Normally, to design a classifier with good generalization performance, we want the number of samples N to be larger than the number of features d.

Why?
Let's look at the following example from the paper:

"A Problem of Dimensionality: A Simple Example" by G. V. Trunk

24 / 73
![Page 68: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/68.jpg)
THE PEAKING PHENOMENON

Assume the following problem
We have two classes ω1, ω2 such that

$$P(\omega_1) = P(\omega_2) = \frac{1}{2} \quad (12)$$

Both classes have the following Gaussian distributions:
1 $\omega_1 \Rightarrow$ mean $\mu$ and $\Sigma = I$
2 $\omega_2 \Rightarrow$ mean $-\mu$ and $\Sigma = I$

where

$$\mu = \left[1, \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{3}}, \ldots, \frac{1}{\sqrt{d}}\right]^T$$

25 / 73
![Page 71: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/71.jpg)
THE PEAKING PHENOMENON

Properties of the features
Since the features are jointly Gaussian and $\Sigma = I$, the involved features are statistically independent.

We use the following rule to classify
For any vector x:
1 If $\|x - \mu\|^2 < \|x + \mu\|^2$, equivalently $z \equiv x^T\mu > 0$, then $x \in \omega_1$.
2 If $z \equiv x^T\mu < 0$, then $x \in \omega_2$.

26 / 73
![Page 75: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/75.jpg)
A little bit of algebra

For the first case

$$\|x - \mu\|^2 < \|x + \mu\|^2$$
$$(x - \mu)^T (x - \mu) < (x + \mu)^T (x + \mu)$$
$$x^T x - 2x^T\mu + \mu^T\mu < x^T x + 2x^T\mu + \mu^T\mu$$
$$0 < x^T\mu \equiv z$$

We then have two cases:
1 Known mean value µ.
2 Unknown mean value µ.

27 / 73
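The equivalence derived above (the distance rule reduces to the sign of $x^T\mu$) can be spot-checked numerically; a small sketch, where the dimension and the number of random trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
mu = 1.0 / np.sqrt(np.arange(1, d + 1))  # mu_i = 1/sqrt(i), as in the example

agree = True
for _ in range(1000):
    x = rng.normal(size=d)
    # Distance rule: ||x - mu||^2 < ||x + mu||^2
    dist_rule = np.sum((x - mu) ** 2) < np.sum((x + mu) ** 2)
    # Sign rule: z = x^T mu > 0
    sign_rule = x @ mu > 0
    agree &= (dist_rule == sign_rule)
```

The two rules agree on every draw, matching the algebra: the quadratic terms cancel and only the sign of $x^T\mu$ remains.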
![Page 78: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/78.jpg)
Known mean value µ

Given that z is a linear combination of independent Gaussian variables:
1 It is a Gaussian variable.
2 $E[z] = \sum_{i=1}^{d} \mu_i E(x_i) = \sum_{i=1}^{d} \frac{1}{\sqrt{i}} \cdot \frac{1}{\sqrt{i}} = \sum_{i=1}^{d} \frac{1}{i} = \|\mu\|^2$.
3 $\sigma_z^2 = \|\mu\|^2$.

28 / 73
![Page 82: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/82.jpg)
Why the first statement?

Each element of the sum xᵀµ
Each $x_i$ can be seen as a random variable with mean $\frac{1}{\sqrt{i}}$ and variance 1, with no correlation between the terms.

What about the variance of z?

$$\begin{aligned} \operatorname{Var}(z) &= E\left[\left(z - \|\mu\|^2\right)^2\right] \\ &= E\left[z^2\right] - \|\mu\|^4 \\ &= E\left[\left(\sum_{i=1}^{d}\mu_i x_i\right)\left(\sum_{j=1}^{d}\mu_j x_j\right)\right] - \left(\sum_{i=1}^{d}\frac{1}{i^2} + \sum_{j=1}^{d}\sum_{\substack{h=1 \\ h \neq j}}^{d}\frac{1}{j}\cdot\frac{1}{h}\right) \end{aligned}$$

29 / 73
![Page 86: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/86.jpg)
Thus

But

$$E\left[x_i^2\right] = 1 + \frac{1}{i} \quad (13)$$

Remark: The rest is for you to solve, so $\sigma_z^2 = \|\mu\|^2$.

30 / 73
![Page 87: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/87.jpg)
We get the probability of error

We know that the error comes from the following equation (with decision threshold $x_0 = 0$):

$$P_e = \frac{1}{2}\int_{x_0}^{\infty} p(z|\omega_2)\,dz + \frac{1}{2}\int_{-\infty}^{x_0} p(z|\omega_1)\,dz \quad (14)$$

But, since we have equiprobable classes and the two densities are symmetric about $x_0$,

$$P_e = \int_{-\infty}^{x_0} p(z|\omega_1)\,dz$$

31 / 73
![Page 89: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/89.jpg)
Thus, we have that

Now, the exponent of $p(z|\omega_1)$ is

$$-\frac{1}{2\|\mu\|^2}\left(z - \|\mu\|^2\right)^2 \quad (15)$$

Because we have the rule
We can do a change of variable to a normalized z (and use the symmetry of the Gaussian):

$$P_e = \int_{b_d}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\}\,dz \quad (16)$$

32 / 73
![Page 91: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/91.jpg)
Known mean value µ

The probability of error is given by

$$P_e = \int_{b_d}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\}\,dz \quad (17)$$

where

$$b_d = \sqrt{\sum_{i=1}^{d} \frac{1}{i}} \quad (18)$$

How?

33 / 73
![Page 93: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/93.jpg)
Known mean value µ

Thus
Since the harmonic series diverges, $b_d$ tends to infinity as $d \to \infty$, so the probability of error tends to zero as the number of features increases.

34 / 73
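This can be made concrete: $P_e$ is the Gaussian tail probability $Q(b_d)$ with $b_d = \sqrt{\sum_{i=1}^d 1/i}$, and a few lines of code show it shrinking as d grows. The particular values of d below are illustrative:

```python
import math

def b_d(d):
    # b_d = sqrt(sum_{i=1}^d 1/i), Eq. (18)
    return math.sqrt(sum(1.0 / i for i in range(1, d + 1)))

def q_function(b):
    # Gaussian tail Q(b) = P(Z > b), via the complementary error function.
    return 0.5 * math.erfc(b / math.sqrt(2.0))

# Probability of error for increasing feature counts.
pe = [q_function(b_d(d)) for d in (1, 10, 100, 1000)]
```

The sequence decreases monotonically toward zero, exactly because the harmonic sum inside $b_d$ diverges.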
![Page 94: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/94.jpg)
Unknown mean value µ

For this, we use the maximum likelihood estimate

$$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} s_k x_k \quad (19)$$

where
1 $s_k = 1$ if $x_k \in \omega_1$
2 $s_k = -1$ if $x_k \in \omega_2$

Now we have a problem: z is no longer a Gaussian variable
Still, if we select d large enough, then since $z = \sum_i x_i \hat{\mu}_i$, by the central limit theorem we can consider z to be approximately Gaussian.

With mean and variance
1 $E[z] = \sum_{i=1}^{d} \frac{1}{i}$.
2 $\sigma_z^2 = \left(1 + \frac{1}{N}\right)\sum_{i=1}^{d}\frac{1}{i} + \frac{d}{N}$.

35 / 73
![Page 100: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/100.jpg)
Unknown mean value µ

Thus

$$b_d = \frac{E[z]}{\sigma_z} \quad (20)$$

Thus, using $P_e$
It can now be shown that $b_d \to 0$ as $d \to \infty$, and the probability of error tends to $\frac{1}{2}$ for any finite number N.

36 / 73
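A quick computation of $b_d = E[z]/\sigma_z$ for a fixed N shows it first rising and then falling toward 0 as d grows — the peaking behavior itself. The choice N = 50 and the values of d are illustrative:

```python
import math

def b_d(d, N):
    # E[z] = sum 1/i;  sigma_z^2 = (1 + 1/N) sum 1/i + d/N
    h = sum(1.0 / i for i in range(1, d + 1))
    return h / math.sqrt((1.0 + 1.0 / N) * h + d / N)

N = 50
values = [b_d(d, N) for d in (10, 100, 10_000, 1_000_000)]
```

Since the harmonic sum grows only like $\ln d$ while the $d/N$ term in $\sigma_z^2$ grows linearly, $b_d \approx \ln d / \sqrt{d/N} \to 0$, and $P_e = Q(b_d) \to \frac{1}{2}$.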
![Page 102: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/102.jpg)
Finally

Case I
If for any d the corresponding PDF is known, then we can perfectly discriminate the two classes by arbitrarily increasing the number of features.

Case II
If the PDFs are not known, then an arbitrary increase in the number of features leads to the maximum possible error rate, that is, $\frac{1}{2}$.

Thus
With a limited number of training data, we must try to keep the number of features relatively low.

37 / 73
![Page 105: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/105.jpg)
Graphically

(Figure: error rate versus number of features d for two training-set sizes $N_1$, $N_2$. For $N_2 \gg N_1$, the minimum occurs at $d = N/\alpha$, with $\alpha \in [2, 10]$.)

38 / 73
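The curve in the figure can be reproduced with a small Monte Carlo experiment: estimate µ from training data as in Eq. (19), classify with $z = x^T\hat{\mu}$, and measure the test error as d grows. All sample sizes below are illustrative, and the error is measured on class $\omega_1$ only (by symmetry it matches the overall error):

```python
import numpy as np

rng = np.random.default_rng(5)

def error_rate(d, n_per_class, n_test=1000):
    mu = 1.0 / np.sqrt(np.arange(1, d + 1))
    # Training samples: n_per_class from each class; the ML estimate of mu
    # averages s_k * x_k over all training samples (s_k = +1 or -1).
    X1 = rng.normal(size=(n_per_class, d)) + mu
    X2 = rng.normal(size=(n_per_class, d)) - mu
    mu_hat = (X1.sum(axis=0) - X2.sum(axis=0)) / (2 * n_per_class)
    # Fresh test samples from omega_1; an error occurs when z = x^T mu_hat <= 0.
    T = rng.normal(size=(n_test, d)) + mu
    return float(np.mean(T @ mu_hat <= 0))

errors = {d: error_rate(d, n_per_class=20) for d in (5, 50, 5000)}
```

With a fixed, small training set the error improves for moderate d and then climbs back toward $\frac{1}{2}$ for very large d, which is the peaking curve the figure depicts.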
![Page 106: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/106.jpg)
Back to Feature Selection

The Goal
1 Select the "optimum" number d of features.
2 Select the "best" d features.

Why? A large d has a three-fold disadvantage:
High computational demands.
Low generalization performance.
Poor error estimates.

39 / 73
![Page 112: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/112.jpg)
Back to Feature Selection

Given N
d must be large enough to learn what makes classes different and what makes patterns in the same class similar.

In addition
d must be small enough not to learn what makes patterns of the same class different.

In practice
d < N/3 has been reported to be a sensible choice for a number of cases.

41 / 73
![Page 115: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/115.jpg)
Thus

Once d has been decided
Choose the d most informative features:
Best: large between-class distance, small within-class variance.

The basic philosophy
1 Discard individual features with poor information content.
2 The remaining information-rich features are examined jointly as vectors.

42 / 73
![Page 119: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/119.jpg)
Example

Thus, we want: (Figure illustrating the desired behavior — large between-class distance and small within-class variance.)

43 / 73
![Page 121: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/121.jpg)
Using Statistics

Simplicity First Principles - Marcus Aurelius
A first step in feature selection is to look at each of the generated features independently and test their discriminatory capability for the problem at hand.

For this, we can use the following hypothesis testing
Assume the samples for two classes ω1, ω2 are vectors of random variables. Then, consider the following hypotheses:
1 H1: The values of the feature differ significantly.
2 H0: The values of the feature do not differ significantly.

Meaning
H0 is known as the null hypothesis and H1 as the alternative hypothesis.

45 / 73
![Page 126: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/126.jpg)
Hypothesis Testing Basics

We need to represent these ideas in a more mathematical way
For this, given an unknown parameter θ:

$$H_1 : \theta \neq \theta_0$$
$$H_0 : \theta = \theta_0$$

We want to generate a test statistic q
That measures the quality of our answer based on the sample features $x_1, x_2, \ldots, x_N$.

We ask for
1 An acceptance interval D: an interval where q lies with high probability under hypothesis H0.
2 Its complement $\bar{D}$, the critical region: the region where we reject H0.

46 / 73
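The acceptance/critical-region idea can be sketched for the simplest case, a two-sided z-test with known variance. Everything below — the data, the 5% significance level (whose standard-normal quantile is 1.96), and the function name — is illustrative, not from the slides:

```python
import math

def in_acceptance_region(samples, theta_0, sigma, z_crit=1.96):
    """True if q lies in the acceptance interval D, i.e. we do not reject H0.

    Test statistic: q = (x_bar - theta_0) / (sigma / sqrt(N)),
    approximately standard normal under H0."""
    n = len(samples)
    x_bar = sum(samples) / n
    q = (x_bar - theta_0) / (sigma / math.sqrt(n))
    return abs(q) <= z_crit

# Hypothetical data centered near 0 (x_bar = 0.02).
samples = [0.2, -0.1, 0.4, -0.3, 0.1, 0.0, -0.2, 0.3, -0.4, 0.2]
keep_h0 = in_acceptance_region(samples, theta_0=0.0, sigma=1.0)
reject_h0 = in_acceptance_region(samples, theta_0=5.0, sigma=1.0)
```

For $\theta_0 = 0$ the statistic falls well inside D, so H0 is retained; for $\theta_0 = 5$ it lands deep in the critical region $\bar{D}$ and H0 is rejected.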
![Page 129: 22 Machine Learning Feature Selection](https://reader034.vdocuments.net/reader034/viewer/2022051521/587bf2bc1a28ab765a8b73a9/html5/thumbnails/129.jpg)
Images/cinvestav-1.jpg
Example

Acceptance and critical regions for hypothesis testing. The area of the shaded region is the probability of an erroneous decision.

47 / 73
Known Variance Case

Assume x is a random variable and the xi are the resulting experimental samples.

Let
1. E[x] = µ
2. E[(x − µ)²] = σ²

We can estimate µ using the sample mean

x̄ = (1/N) ∑_{i=1}^{N} xi   (21)

48 / 73
Known Variance Case

It can be proved that x̄ is an unbiased estimate of the mean of x.

In a similar way, the variance σ²_x̄ of x̄ is

E[(x̄ − µ)²] = E[((1/N) ∑_{i=1}^{N} xi − µ)²] = E[((1/N) ∑_{i=1}^{N} (xi − µ))²]   (22)

which expands to

E[(x̄ − µ)²] = (1/N²) ∑_{i=1}^{N} E[(xi − µ)²] + (1/N²) ∑_i ∑_{j≠i} E[(xi − µ)(xj − µ)]   (23)

49 / 73
Known Variance Case

Because of independence,

E[(xi − µ)(xj − µ)] = E[xi − µ] E[xj − µ] = 0   (24)

Thus

σ²_x̄ = σ²/N   (25)

Note: the larger the number of measurement samples, the smaller the variance of x̄ around the true mean.

50 / 73
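Equation (25) is easy to check empirically. The following sketch (not from the slides; the values of µ, σ, and N are made up) draws many datasets of size N and verifies that the variance of the sample mean x̄ is close to σ²/N:

```python
import numpy as np

# Empirical check of Eq. (25): Var(x-bar) should be close to sigma^2 / N.
rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 100, 20000

# Draw `trials` independent datasets of size N and compute each sample mean.
means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)

empirical_var = means.var()        # observed variance of x-bar
theoretical_var = sigma ** 2 / N   # sigma^2 / N from Eq. (25)
print(empirical_var, theoretical_var)
```

Doubling N halves both quantities, which is exactly the note above: more samples, tighter concentration of x̄ around µ.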
What to do with it

Now, you are given µ, the parameter value to be tested (in our case, the mean). Thus:

H1 : E[x] ≠ µ
H0 : E[x] = µ

We define q as

q = (x̄ − µ) / (σ/√N)   (26)

Recalling the central limit theorem, the probability density function of x̄ under H0 is approximately Gaussian, N(µ, σ²/N).

51 / 73
Thus

q under H0 is approximately N(0, 1).

Then we can choose a significance level ρ with acceptance interval D = [−xρ, xρ] such that q lies in it with probability 1 − ρ.

52 / 73
Final Process

First Step: given the N experimental samples of x, compute x̄ and then q.

Second Step: choose the significance level ρ.

Third Step: compute from the corresponding tables for N(0, 1) the acceptance interval D = [−xρ, xρ] with probability 1 − ρ.

53 / 73
Final Process

Final Step: if q ∈ D, decide H0; if not, decide H1.

In other words, all we say is that we expect the resulting value q to lie in the high-percentage 1 − ρ interval. If it does not, then we decide that this is because the assumed mean value is not "correct."

54 / 73
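The four steps above can be sketched in a few lines. This is a minimal illustration of the known-variance test, with made-up sample values, an assumed σ, and a hypothesized µ0:

```python
import math
from statistics import NormalDist

# Hypothetical data; sigma is assumed known, mu0 is the value under H0.
samples = [5.2, 4.8, 5.5, 4.9, 5.1, 5.3, 4.7, 5.0, 5.4, 5.1]
sigma = 0.5   # assumed known standard deviation
mu0 = 5.0     # hypothesized mean under H0
rho = 0.05    # significance level

# Step 1: compute x-bar and then q, Eq. (26).
N = len(samples)
x_bar = sum(samples) / N
q = (x_bar - mu0) / (sigma / math.sqrt(N))

# Steps 2-3: acceptance interval D = [-x_rho, x_rho] with P(q in D | H0) = 1 - rho.
x_rho = NormalDist().inv_cdf(1 - rho / 2)   # about 1.96 for rho = 0.05

# Final step: decide H0 if q lies in D, otherwise H1.
decision = "H0" if -x_rho <= q <= x_rho else "H1"
print(q, x_rho, decision)
```

Here x̄ = 5.1, so q ≈ 0.63 lies well inside [−1.96, 1.96] and H0 is accepted.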
Application of the t-Test in Feature Selection

Very Simple: use the difference µ1 − µ2 for the testing. Note: each µi corresponds to a class ωi, i = 1, 2.

What is the logic? Assume that the variance of the feature values is the same in both classes:

σ1² = σ2² = σ²   (27)

56 / 73
What is the Hypothesis?

A very simple one:

H1 : ∆µ = µ1 − µ2 ≠ 0
H0 : ∆µ = µ1 − µ2 = 0

The new random variable is

z = x − y   (28)

where x, y denote the random variables corresponding to the values of the feature in the two classes.

Properties:
E[z] = µ1 − µ2
σz² = 2σ²

57 / 73
Then

It is possible to prove that the sample mean difference x̄ − ȳ follows the distribution

N(µ1 − µ2, 2σ²/N)   (29)

So we can use the following statistic

q = ((x̄ − ȳ) − (µ1 − µ2)) / (sz √(2/N))   (30)

where

sz² = (1/(2N − 2)) (∑_{i=1}^{N} (xi − x̄)² + ∑_{i=1}^{N} (yi − ȳ)²)   (31)

58 / 73
Testing

Thus q turns out to follow the t-distribution with 2N − 2 degrees of freedom.

59 / 73
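Equations (30) and (31) translate directly into code. The sketch below uses made-up feature values for the two classes ω1 and ω2 (equal sample sizes, as the slides assume) and computes the pooled variance and the statistic q under H0 (µ1 − µ2 = 0):

```python
import math
from statistics import mean

# Hypothetical feature values for the two classes.
x = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3, 3.1]   # class omega_1
y = [2.5, 2.7, 2.4, 2.6, 2.8, 2.3, 2.6, 2.5]   # class omega_2
N = len(x)                                      # equal sample sizes assumed

x_bar, y_bar = mean(x), mean(y)

# Pooled variance estimate, Eq. (31).
sz2 = (sum((xi - x_bar) ** 2 for xi in x)
       + sum((yi - y_bar) ** 2 for yi in y)) / (2 * N - 2)

# Test statistic under H0 (mu1 - mu2 = 0), Eq. (30).
q = (x_bar - y_bar) / (math.sqrt(sz2) * math.sqrt(2 / N))

dof = 2 * N - 2
print(q, dof)   # compare |q| against a t-table with 2N - 2 degrees of freedom
```

A large |q| (here about 6.1 with 14 degrees of freedom) rejects H0, i.e. the feature discriminates between the two classes.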
Considering Feature Sets

Something Notable: the emphasis so far was on individually considered features.

But: two features may each be rich in information, yet if they are highly correlated we need not consider both of them.

Then: combine features and search for the "best" combination after the poor features have been discarded.

61 / 73
What to do?

Possible: use different feature combinations to form the feature vector; train the classifier, and choose the combination resulting in the best classifier performance.

However: a major disadvantage of this approach is its high complexity. Also, local minima may give misleading results.

Better: adopt a class separability measure and choose the best feature combination against this cost.

62 / 73
Scatter Matrices

Definition: these are used as a measure of the way data are scattered in the respective feature space.

Within-class Scatter Matrix:

Sw = ∑_{i=1}^{C} Pi Si   (32)

where C is the number of classes, and
1. Si = E[(x − µi)(x − µi)ᵀ]
2. Pi is the a priori probability of class ωi, estimated as Pi ≈ ni/N, where ni is the number of samples in class ωi.

64 / 73
Scatter Matrices

Between-class scatter matrix:

Sb = ∑_{i=1}^{C} Pi (µi − µ0)(µi − µ0)ᵀ   (33)

where

µ0 = ∑_{i=1}^{C} Pi µi   (34)

is the global mean.

Mixture scatter matrix:

Sm = E[(x − µ0)(x − µ0)ᵀ]   (35)

Note: it can be proved that Sm = Sw + Sb.

65 / 73
Criterions

First One:

J1 = trace{Sm} / trace{Sw}   (36)

It takes large values when samples in the d-dimensional space are well clustered around their mean, within each class, and the clusters of the different classes are well separated.

Other criteria are:
1. J2 = |Sm| / |Sw|
2. J3 = trace{Sw⁻¹ Sm}

66 / 73
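The scatter matrices and criteria above can be computed directly. This is a sketch on assumed toy data (two well-separated 2-D Gaussian classes); the sample means and covariances stand in for the expectations in Eqs. (32)-(35):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical, well-separated 2-D classes.
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
X2 = rng.normal([4.0, 4.0], 0.5, size=(100, 2))
classes = [X1, X2]

N = sum(len(X) for X in classes)
P = [len(X) / N for X in classes]              # priors P_i ~ n_i / N
mus = [X.mean(axis=0) for X in classes]
mu0 = sum(p * mu for p, mu in zip(P, mus))     # global mean, Eq. (34)

# Within-class scatter, Eq. (32): Sw = sum_i P_i S_i.
Sw = sum(p * np.cov(X.T, bias=True) for p, X in zip(P, classes))

# Between-class scatter, Eq. (33): class means around the global mean.
Sb = sum(p * np.outer(mu - mu0, mu - mu0) for p, mu in zip(P, mus))

Sm = Sw + Sb                                   # mixture scatter, Sm = Sw + Sb

J1 = np.trace(Sm) / np.trace(Sw)               # Eq. (36)
J3 = np.trace(np.linalg.inv(Sw) @ Sm)
print(J1, J3)   # large values indicate well-separated, compact classes
```

Shrinking the within-class spread or pushing the class means apart increases both criteria, matching the figure that follows.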
Example

Classes with (a) small within-class variance and small between-class distances, (b) large within-class variance and small between-class distances, and (c) small within-class variance and large between-class distances.

67 / 73
What to do with it

We want to avoid high complexities.

For example:
1. Select a class separability criterion.
2. Then evaluate all possible combinations of features, (m choose l) for l = 1, 2, ..., m.

We can do better with:
1. Sequential Backward Selection
2. Sequential Forward Selection
3. Floating Search Methods

However, these are sub-optimal methods.

69 / 73
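To see why the exhaustive approach is avoided, it helps to count the subsets it would score. Summing (m choose l) over l = 1, ..., m gives 2^m − 1; the sketch below (m = 20 is an arbitrary illustration) enumerates them:

```python
from itertools import combinations

# Counting the feature subsets an exhaustive search would have to score.
m = 20
features = range(m)

# Subsets of every size l = 1, ..., m: sum of C(m, l) = 2^m - 1.
total = sum(1 for l in range(1, m + 1) for _ in combinations(features, l))
print(total)   # over a million subsets for only 20 features
```

Training or scoring a criterion on each of these subsets is infeasible for realistic m, which motivates the sequential methods below.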
For example: Sequential Backward Selection

We have the following example: given x1, x2, x3, x4, we wish to select two of them.

Step 1: adopt a class separability criterion C and compute its value for the feature vector [x1, x2, x3, x4]ᵀ.

Step 2: eliminate one feature; you get

[x1, x2, x3]ᵀ, [x1, x2, x4]ᵀ, [x1, x3, x4]ᵀ, [x2, x3, x4]ᵀ

70 / 73
For example: Sequential Backward Selection

Using your criterion C, suppose the winner is [x1, x2, x3]ᵀ.

Step 3: now eliminate another feature and generate [x1, x2]ᵀ, [x1, x3]ᵀ, [x2, x3]ᵀ.

Use criterion C to select the best one.

71 / 73
Complexity of the Method
Complexity
Starting from m features, at each step we drop one feature from the current "best" combination until we obtain a vector of l features.

Thus, we need
1 + 1/2 ((m + 1)m − l(l + 1)) evaluations of the criterion.

However
The method is sub-optimal.
It suffers from the so-called nesting effect:
Once a feature is discarded, there is no way to reconsider it at a later step.
Similar Problem
For
Sequential Forward Selection, which suffers from the analogous problem: once a feature is added, it is never removed.

We can overcome this by using
Floating Search Methods

More elegant methods are based on
Dynamic Programming
Branch and Bound
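A minimal sketch of the floating idea (Sequential Forward Floating Selection) under the assumption of a higher-is-better criterion; the function and variable names are mine, and this is an illustration of the backtracking step, not a full implementation:

```python
from itertools import combinations

def sffs(features, l, criterion):
    """After each forward step, conditionally remove features as long
    as doing so improves the criterion; this backtracking is what
    avoids the nesting effect of plain forward selection."""
    selected, remaining = [], list(features)
    while len(selected) < l:
        # Forward step: add the single best remaining feature.
        best = max(remaining, key=lambda f: criterion(tuple(selected + [f])))
        selected.append(best)
        remaining.remove(best)
        # Backward (floating) step: drop a feature if it strictly helps.
        while len(selected) > 2:
            alt = max(combinations(selected, len(selected) - 1), key=criterion)
            if criterion(alt) > criterion(tuple(selected)):
                remaining.extend(set(selected) - set(alt))
                selected = list(alt)
            else:
                break
    return tuple(selected)

# Toy criterion for illustration: overlap with an "informative" set.
toy_score = lambda s: len({"a", "b"}.intersection(s))
chosen = sffs(["a", "b", "c"], 2, toy_score)
```

Dynamic programming and branch-and-bound go further: under a monotonic criterion, branch-and-bound can prune whole subtrees of subsets and still guarantee the optimal l-feature subset.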