
Chapter 5 FEATURE SELECTION

5.1 Need for Feature Reduction

Many factors affect the success of machine learning on a given task. The representation and quality of the example data are first and foremost. Nowadays, the need to process large databases is becoming increasingly common. Learners for full-text databases typically deal with tens of thousands of features; vision systems, spoken word and character recognition problems all require hundreds of classes and may have thousands of input features. The majority of real-world classification problems require supervised learning, where the underlying class probabilities and class-conditional probabilities are unknown and each instance is associated with a class label. In real-world situations, relevant features are often unknown a priori. Therefore, many candidate features are introduced to better represent the domain. Theoretically, having more features should result in more discriminating power. However, practical experience with machine learning algorithms has shown that this is not always the case: current machine learning toolkits are insufficiently equipped to deal with contemporary datasets, and many algorithms exhibit poor complexity with respect to the number of features. Furthermore, when faced with many noisy features, some algorithms take an inordinately long time to converge, or never converge at all. And even if they do converge, conventional algorithms tend to construct poor classifiers [Kon94].

Many of the features introduced during the training of a classifier are either partially or completely irrelevant or redundant to the target concept: an irrelevant feature does not affect the target concept in any way, and a redundant feature does not add anything new to the target concept. In many applications, the size of a dataset is so large that learning might not work well before these unwanted features are removed. Recent research has shown that common machine learning algorithms are adversely affected by irrelevant and redundant training information. The simple nearest neighbor algorithm is sensitive to irrelevant attributes; its sample complexity (the number of training examples needed to reach a given accuracy level) grows exponentially with the number of irrelevant attributes (see [Lan94a, Lan94b, Aha91]). Sample complexity for decision tree algorithms can grow exponentially on some concepts (such as parity) as well. The naive Bayes classifier can be adversely affected by redundant attributes due to its assumption that attributes are independent given the class [Lan94c]. Decision tree algorithms such as C4.5 [Qui86, Qui93] can sometimes overfit the training data, resulting in large trees. In many cases, removing irrelevant and redundant information can result in C4.5 producing smaller trees [Koh96]. Neural networks are supposed to cope with irrelevant and redundant features when the amount of training data is enough to compensate for this drawback; otherwise, they too are affected by the amount of irrelevant information.

Reducing the number of irrelevant/redundant features drastically reduces the running time of a learning algorithm and yields a more general concept. This helps in getting a better insight into the underlying concept of a real-world classification problem. Feature selection methods try to pick a subset of features that are relevant to the target concept.

5.2 Feature Selection Process

The problem introduced in the previous section can be alleviated by preprocessing the dataset to remove noisy and low-information-bearing attributes.

"Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept."

Kira & Rendell

From the terms "necessary" and "sufficient" included in the given definition, it can be stated that feature selection attempts to select the minimally sized subset of features according to the following criteria (Figure 5.1):

1. the classification accuracy does not significantly decrease; and
2. the resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution, given all features.

Figure 5.1 General criteria for a feature selection method.


Ideally, feature selection methods search through the subsets of features and try to find the best one among the 2^N candidate subsets according to some evaluation function. However, this exhaustive search for the single best subset may be too costly and practically prohibitive even for a medium-sized feature set. Other methods, based on heuristic or random search, attempt to reduce the computational complexity by compromising performance. These methods need a stopping criterion to prevent an exhaustive search of subsets. There are four basic steps in a typical feature selection method (a schematic sketch of such a search is given after the list):

1. Starting point: Selecting a point in the feature subset space from which to begin the search can affect the direction of the search. One option is to begin with no features and successively add attributes. In this case, the search is said to proceed forward through the search space. Conversely, the search can begin with all features and successively remove them. In this case, the search proceeds backward through the search space. Another alternative is to begin somewhere in the middle and move outwards from this point.

2. Search organization: An exhaustive search of the feature subspace is prohibitive for all but a small initial number of features. With N initial features there exist 2^N possible subsets. Heuristic search strategies are more feasible than exhaustive ones and can give good results, although they do not guarantee finding the optimal subset.

3. Evaluation strategy: How feature subsets are evaluated is the single biggest differentiating factor among feature selection algorithms for machine learning. One paradigm, dubbed the filter [Koh95, Koh96], operates independently of any learning algorithm: undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Another school of thought argues that the bias of a particular induction algorithm should be taken into account when selecting features. This method, called the wrapper [Koh95, Koh96], uses an induction algorithm along with a statistical re-sampling technique such as cross-validation to estimate the final accuracy of feature subsets.

4. Stopping criterion: A feature selector must decide when to stop searching through the space of feature subsets. Depending on the evaluation strategy, a feature selector might stop adding or removing features when none of the alternatives improves upon the merit of the current feature subset. Alternatively, the algorithm might continue to revise the feature subset as long as the merit does not degrade. A further option is to continue generating feature subsets until reaching the opposite end of the search space and then select the best.
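The following sketch, written in R (the statistical environment also used later in this chapter), shows how these four steps fit together in a simple greedy forward search. It is an illustrative outline only, not an algorithm taken from this thesis; evaluate stands for any evaluation function returning a merit score for a candidate subset.

```r
# Greedy forward search skeleton (illustrative sketch).
# 'features' is a character vector of candidate feature names and 'evaluate'
# any user-supplied function returning a merit score for a feature subset.
forward_search <- function(features, evaluate) {
  selected <- character(0)                       # 1. starting point: empty subset
  best_merit <- -Inf
  repeat {
    candidates <- setdiff(features, selected)    # 2. search organization: heuristic,
    if (length(candidates) == 0) break           #    adding one feature at a time
    merits <- sapply(candidates,
                     function(f) evaluate(c(selected, f)))   # 3. evaluation strategy
    if (max(merits) <= best_merit) break         # 4. stopping criterion: no improvement
    best_merit <- max(merits)
    selected <- c(selected, candidates[which.max(merits)])
  }
  selected
}
```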

Many learning algorithms can be viewed as making a (biased) estimate of the probability of the class label given a set of features. This is a complex, high-dimensional distribution. Unfortunately, induction is often performed on limited data. This makes estimating the many probabilistic parameters difficult. In order to avoid overfitting the training data, many algorithms employ the Occam's Razor bias [Gam97] to build a simple model that still achieves some acceptable level of performance on the training data. This bias often leads an algorithm to prefer a small number of predictive attributes over a large number of features that, if used in the proper combination, are fully predictive of the class label. If there is too much irrelevant and redundant information present, or the data is noisy and unreliable, then learning during the training phase is more difficult.

Feature subset selection is the process of identifying and removing as much irrelevant and redundant information as possible. This reduces the dimensionality of the data and may allow learning algorithms to operate faster and more effectively. In some cases, accuracy on future classification can be improved; in others, the result is a more compact, easily interpreted representation of the target concept.


5.3 Feature Selection Methods Overview

Feature subset selection has long been a research area within statistics and pattern recognition [Dev82, Mil90]. It is not surprising that feature selection is as much of an issue for machine learning as it is for pattern recognition, as both fields share the common task of classification. In pattern recognition, feature selection can have an impact on the economics of data acquisition and on the accuracy and complexity of the classifier [Dev82]. This is also true of machine learning, which has the added concern of distilling useful knowledge from data. Fortunately, feature selection has been shown to improve the comprehensibility of extracted knowledge [Koh96].

Distance
  Heuristic: Relief [Kir92], Relief-F [Kon94], Segen [Seg84]
  Complete: Branch & Bound [Nar77], BFF [XuL88], Bobrowski [Bob88]
  Random: (blank)

Information
  Heuristic: DTM [Car93], Koller & Sahami [Kol96]
  Complete: MDLM [She90]
  Random: (blank)

Dependency
  Heuristic: POE1ACC [Muc71], PRESET [Mod93]
  Complete: (blank)
  Random: (blank)

Consistency
  Heuristic: (blank)
  Complete: Focus [Alm92], Schlimmer [Sch93], MIFES-1 [Oli92]
  Random: LVF [Liu96]

Classifier Error Rate
  Heuristic: SBS, SFS [Dev82], SBS-SLASH [Car94], PQSS, BDS [Doa92], Schemata search [Moo94], RC [Dom96], Queiros & Gelsema [Que84]
  Complete: Ichino & Sklansky [Ichi84, Ichi84b]
  Random: LVW [Liu96b], GA [Vaf94], SA, RGSS [Doa92], RMHC-PF1 [Ska94]

Table 5.1 Different feature selection methods, grouped by evaluation measure and generation procedure, as stated by M. Dash and H. Liu [Das97].


There are a huge number of different feature selection methods. A study carried out by M. Dash and H. Liu [Das97] presents 32 different methods, grouped according to the types of generation procedure and evaluation function used in them. If the original feature set contains N features, then the total number of competing candidate subsets to be generated is 2^N, which is a huge number even for a medium-sized N. Generation procedures are different approaches to dealing with this problem, namely: complete, in which all the subsets are evaluated; heuristic, in which subsets are generated by adding/removing attributes (incremental/decremental); and random, which evaluates a certain number of randomly generated subsets. On the other hand, the aim of an evaluation function is to measure the discriminating ability of a feature or a subset to distinguish the different class labels. There are two common approaches: a wrapper uses the intended learning algorithm itself to evaluate the usefulness of features, while a filter evaluates features according to heuristics based on general characteristics of the data. The wrapper approach is generally considered to produce better feature subsets but runs much more slowly than a filter [Hal99]. The study of M. Dash and H. Liu [Das97] divides evaluation functions into five categories: distance, which evaluates differences between class conditional probabilities; information, based on the information gain of a feature; dependence, based on correlation measures; consistency, in which an acceptable inconsistency rate is set by the user; and classifier error rate, which uses the classifier itself as the evaluation function. Of these, only the last evaluation function, classifier error rate, can be counted as a wrapper.
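To make the filter/wrapper distinction concrete, the sketch below scores one candidate subset in both ways. It is illustrative only and not code from this thesis: the data frame X, the class factor y, the correlation-based filter score and the nearest-centroid classifier inside the wrapper are all assumptions chosen for brevity.

```r
# Filter score: data characteristics only, no learning algorithm involved.
filter_score <- function(X, y, subset) {
  y_num <- as.numeric(y)                     # numeric coding of the class
  mean(abs(sapply(subset, function(f) cor(X[[f]], y_num))))
}

# Wrapper score: k-fold cross-validated accuracy of an induction algorithm
# (here a simple nearest-centroid classifier) trained on the subset only.
wrapper_score <- function(X, y, subset, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))
  acc <- numeric(k)
  for (i in 1:k) {
    train <- folds != i
    test  <- !train
    centroids <- aggregate(X[train, subset, drop = FALSE],
                           by = list(class = y[train]), FUN = mean)
    pred <- apply(X[test, subset, drop = FALSE], 1, function(row) {
      d <- apply(centroids[, -1, drop = FALSE], 1,
                 function(ctr) sum((row - ctr)^2))
      as.character(centroids$class)[which.min(d)]
    })
    acc[i] <- mean(pred == as.character(y[test]))
  }
  mean(acc)                                  # estimated accuracy of the subset
}
```

Either score can then drive a subset search such as the forward search sketched in section 5.2, e.g. forward_search(names(X), function(s) filter_score(X, y, s)).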

Table 5.1 summarises the classification of methods in [Das97]. The blank boxes in the table signify that no method exists yet for these combinations. Since a deeper analysis of each of the included feature selection techniques is beyond the purpose of this thesis, references for further information about them are given in the table.

In [Hal99], M. A. Hall and L. A. Smith present a particular approach to feature selection, Correlation-based Feature Selection (CFS), that uses a correlation-based heuristic to evaluate the worth of features. Although this method has not been directly employed in this work, various ideas about feature selection using correlation measures between features and between features and output classes have been extracted from it, and its direct application is seriously considered for future work. Consequently, a brief overview of this feature selection criterion is given in section 5.3.1.

5.3.1 Correlation-based Feature Selection

The CFS algorithm relies on a heuristic for evaluating the worth or merit of a subset of features. This heuristic takes into account the usefulness of individual features for predicting the class label along with the level of intercorrelation among them. The hypothesis on which the heuristic is based can be stated as:

"Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other" [Hal99]

Along the same lines, Gennari [Gen89] states that "features are relevant if their values vary systematically with category membership." In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant. Empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant information should be eliminated as well (see [Lan94c, Koh96, Koh95]). A feature is said to be redundant if one or more of the other features are highly correlated with it. The above definitions of relevance and redundancy lead to the idea that the best features for a given classification are those that are highly correlated with one of the classes and have an insignificant correlation with the rest of the features in the set.

If the correlation between each of the components in a test and the outside variable is known, and the inter-correlation between each pair of components is given, then the correlation between a composite¹ consisting of the summed components and the outside variable can be predicted from [Ghi64, Hog77, Zaj62]:

$$ r_{zc} = \frac{k\,\overline{r}_{zi}}{\sqrt{k + k(k-1)\,\overline{r}_{ii}}} \qquad (5.1) $$

Where

$r_{zc}$ = correlation between the summed components and the outside variable.
$k$ = number of components (features).
$\overline{r}_{zi}$ = average of the correlations between the components and the outside variable.
$\overline{r}_{ii}$ = average inter-correlation between components.

¹ Subset of features selected for evaluation.
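As a concrete illustration of equation 5.1, the short function below computes the merit of a candidate composite directly from sample correlations. This is a sketch written for this overview, not the CFS implementation of [Hal99]; feats (a data frame of features) and class_num (a numeric coding of the class) are hypothetical names.

```r
# Merit of a feature composite according to equation 5.1 (illustrative sketch).
composite_merit <- function(feats, class_num, subset) {
  k    <- length(subset)
  # average absolute correlation between the selected features and the class
  r_zi <- mean(abs(cor(feats[, subset, drop = FALSE], class_num)))
  # average absolute inter-correlation among the selected features
  r_ii <- if (k > 1) {
    cm <- abs(cor(feats[, subset, drop = FALSE]))
    mean(cm[upper.tri(cm)])
  } else {
    0
  }
  (k * r_zi) / sqrt(k + k * (k - 1) * r_ii)    # equation 5.1
}
```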

Equation 5.1 represents Pearson's correlation coefficient, where all the variables have been standardised. The numerator can be thought of as giving an indication of how predictive of the class a group of features is; the denominator, of how much redundancy there is among them. Thus, equation 5.1 shows that the correlation between a composite and an outside variable is a function of the number of component variables in the composite and the magnitude of the inter-correlations among them, together with the magnitude of the correlations between the components and the outside variable. Some conclusions can be extracted from (5.1):

- The higher the correlations between the components and the outside variable, the higher the correlation between the composite and the outside variable.
- As the number of components in the composite increases, the correlation between the composite and the outside variable increases.
- The lower the inter-correlation among the components, the higher the correlation between the composite and the outside variable.

Theoretically, when the number of components in the composite increases, the correlation between the composite and the outside variable also increases. However, it is unlikely that a group of components that are highly correlated with the outside variable will at the same time bear low correlations with each other [Ghi64]. Furthermore, Hogarth [Hog77] notes that, when the inclusion of an additional component is considered, low inter-correlation with the already selected components may well predominate over high correlation with the outside variable.

5.4 Feature Selection Procedures employed in this work


Neural networks, the principal classifier used in this thesis, are able to handle redundant and irrelevant features, provided that enough training patterns are available to estimate suitable weights during the learning process. Since our database hardly satisfies this requirement, pre-processing of the input features becomes indispensable to improve prediction accuracy. Regression models have mainly been applied as the selection procedure in this work. However, other procedures, such as Fisher's discriminant (F-Ratio), NN pruning, correlation analysis and graphical analysis of feature statistics (boxplots), have also been tested. The following subsections describe the basis of each of these methods.

5.4.1 Regression models

Linear regression models are mainly employed in this thesis to select a subset of features from a larger set in order to reduce the input dimensionality of the classifier, as stated in the previous sections. For some specific cases (speaker independent), quadratic regression models were also used.

Linear regression models the emotions as a linear combination of features and selects only those features that significantly modify the model. Quadratic models work in a similar way but allow quadratic combinations in the modelling of the emotions. Both models are implemented using R², a language and environment for statistical computing and graphics. The resulting selected features are scored by R according to their influence on the model. There are four different scores, ordered by grade of importance:

- three points (most influential),
- two points,
- one point, and
- just one remark (least influential).

Which features are taken into account for the classification tasks varies among the different experiments and is specified in the condition descriptions in chapters 8 and 9.
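As an illustration of how such a selection can be read off an R model fit, the fragment below ranks features by the significance R reports for their coefficients; presumably the four scores above correspond to R's significance codes ('***', '**', '*', '.'). This is a sketch, not the actual thesis scripts: the data frame feats, its emotion column (a numeric coding of the target) and the p-value cut-off are assumptions.

```r
# Fit a linear model of the (numerically coded) emotion on all candidate
# features; 'feats' is a hypothetical data frame containing an 'emotion' column.
fit <- lm(emotion ~ ., data = feats)

# summary(fit) prints one significance code per coefficient:
#   '***', '**', '*', '.'   (from most to least influential)
coefs <- summary(fit)$coefficients      # Estimate, Std. Error, t value, Pr(>|t|)
pvals <- coefs[-1, "Pr(>|t|)"]          # drop the intercept row

# Keep every feature flagged with at least '.', i.e. p < 0.1 (assumed cut-off).
selected <- names(pvals)[pvals < 0.1]
selected
```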

Feature sets selected through regression models are tested in the following experiments:

                   PROSODIC EXP.                          QUALITY EXP.
SPKR DEPENDENT     8.2.1.1, 8.2.1.4, 8.1.2.5, 8.2.2.2     9.2.2.2
SPKR INDEPENDENT   8.3.1.2                                9.3.2.1, 9.3.2.2

Table 5.2 Experiments where the regression-based feature selection is tested.

² http://www.r-project.org/

5.4.2 Fisher's discriminant: F-Ratio

Fisher's discriminant is a measure of the separability of the recognition classes. It is based on the idea that the ability of a feature to separate two classes depends on the distance between the classes and the scatter within the classes. In figure 5.2 it becomes clear that, although the means of X are more widely separated, Y is better at separating the two classes, because there is no overlap between the distributions. Overlap depends on two factors:

- the distance between the distributions for the two classes and,
- the width of (i.e. scatter within) the distributions.

[Figure: distributions of Feature X and Feature Y for Class 1 and Class 2.]

Figure 5.2 Performance of a recognition feature depends on the class-to-class difference relative to the scatter within classes.


A reasonable way to characterise this numerically would be to take the ratio of the difference of the means to the standard deviation of the measurements or, since there are two sets of measurements, to the average of the two standard deviations. Fisher's discriminant is based on this principle:

$$ F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad (5.2) $$

Where

$\mu_n$ = mean of the feature for the class n.
$\sigma_n^2$ = variance of the feature for the class n.

Generally there will be many more than two classes. In that case, the class-to-class separation of the feature over all the classes has to be considered. This can be estimated by representing each class by its mean and taking the variance of those means. This variance is then compared to the average width of the distribution for each class, i.e. the mean of the individual variances. This measure is commonly called the F-Ratio:

$$ F\text{-Ratio} = \frac{\frac{1}{m}\sum_{j=1}^{m}(\mu_j - \mu)^2}{\frac{1}{m}\sum_{j=1}^{m}\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \mu_j)^2} \qquad (5.3) $$

Where

n = number of measurements for each class.
m = number of different classes.
$x_{ij}$ = ith measurement for class j.
$\mu_j$ = mean of all measurements for class j.
$\mu$ = mean of all measurements over all classes.
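A compact way to compute this measure per feature is sketched below; it follows equation 5.3 directly and is not taken from the thesis code. x (one feature's values) and class (a factor of class labels) are hypothetical names.

```r
# F-Ratio of a single feature (illustrative sketch following equation 5.3).
f_ratio <- function(x, class) {
  mu_j  <- tapply(x, class, mean)             # per-class means
  var_j <- tapply(x, class, var)              # per-class variances
  mean((mu_j - mean(x))^2) / mean(var_j)      # variance of means / mean of variances
}

# Ranking all candidate features (columns of a data frame 'feats'):
# sort(sapply(feats, f_ratio, class = class), decreasing = TRUE)
```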

The F-Ratio feature selection method is tested in experiment 9.3.2.3. There, it is observed that the performance of the classifier does not improve when the set resulting from the F-Ratio analysis of the features is tested. However, an evident reason can explain this lack of success: the F-Ratio evaluates a single feature, and when many potential features are to be evaluated it is not safe simply to rank the features by their F-Ratio and pick the best ones unless all the features are uncorrelated, which they usually are not. Two possible responses to this problem are proposed for future work:

- use techniques that evaluate combinations of features;
- transform the features into independent ones and then pick the best ones in the transformed space.

5.4.3 NN pruning

This procedure is implicitly performed during the neural network training phase if the pruning algorithm is selected. A pruned neural network eliminates all those nodes (features) and/or links that are not relevant to the classification task. Since this selection method is closer to being a neural network learning method, detailed information about its functioning can be found in section 6.3.3.3.
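For orientation only, the toy fragment below shows one common way such pruning can expose irrelevant inputs, namely by removing weak input-to-hidden links and keeping the inputs that still carry weight. It is a generic sketch, not the pruning algorithm described in section 6.3.3.3; W and the threshold are hypothetical.

```r
# Generic magnitude-based pruning sketch (not the algorithm of section 6.3.3.3).
# 'W' is an input-to-hidden weight matrix with one row per input feature.
prune_inputs <- function(W, threshold = 0.05) {
  W[abs(W) < threshold] <- 0         # prune weak links
  relevance <- rowSums(abs(W))       # remaining total weight per input feature
  which(relevance > 0)               # indices of the input features kept
}
```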

5.4.4 Correlation analysis

Although many algorithms have been proposed to implement correlation-based feature selection, using heuristic generation of subsets and selecting the best composite among many different possibilities, the implementation, testing and search for an optimal algorithm would widely exceed the scope of the present thesis and was proposed as a separate topic for a future Diploma Thesis. Therefore, the correlation analysis employed in the first experiments of this thesis simply applies the main ideas presented in section 5.3.1.

First, the correlation matrix was calculated taking into account both the features and the outputs of the specific problem. Then, following the conclusions extracted from section 5.3.1, the features that were most correlated with one of the outputs and, at the same time, had weak correlation with the rest of the features were selected among all candidates.
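A minimal sketch of this selection rule is given below; it is illustrative rather than the code actually used, and the data frame names and the two correlation thresholds are assumptions.

```r
# Select features that are strongly correlated with some output and only
# weakly correlated with the other features (illustrative sketch).
# 'feats': data frame of candidate features; 'outputs': data frame with one
# numerically coded column per output class.
corr_select <- function(feats, outputs, min_class_cor = 0.4, max_feat_cor = 0.3) {
  class_cor <- abs(cor(feats, outputs))    # feature-to-output correlations
  feat_cor  <- abs(cor(feats))             # feature-to-feature correlations
  diag(feat_cor) <- 0                      # ignore self-correlation

  keep <- apply(class_cor, 1, max) >= min_class_cor &   # tied to at least one output
          apply(feat_cor,  1, max) <= max_feat_cor      # weakly tied to other features
  names(feats)[keep]
}
```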

This procedure of feature selection was employed in the preliminary experiments carried out with prosodic features. The results showed that, although the selected features actually seem to carry relevant information, the optimisation of a correlation-based method would be a valuable proposal for future work, as already introduced.


5.4.5 Graphical analysis of feature statistics (“boxplot”)

A "boxplot" provides an excellent visual summary of many important aspects of a distribution. The box stretches from the lower hinge (defined as the 25th percentile³) to the upper hinge (the 75th percentile), and therefore the length of the box contains the middle half of the scores in the distribution.

[Figure 5.3 appears here: four side-by-side boxplots, one box per activation level (classes 1 to 3), for the four features listed in the caption.]

Figure 5.3 Boxplot graphical representations of features (a) P1.5, (b) P1.7, (c) P1.14 and (d) P1.23 used for the selection in experiment 8.2.2.2. Each box represents one of the three activation levels (3 classes).

³ A percentile rank is the proportion of scores in a distribution that a specific score is greater than or equal to. For instance, if you received a score of 95 on a math test and this score was greater than or equal to the scores of 88% of the students taking the test, then your percentile rank would be 88. You would be in the 88th percentile.


The median⁴ is shown as a line across the box. Therefore 1/4 of the distribution is between this line and the top of the box, and 1/4 of the distribution is between this line and the bottom of the box.

It is often useful to compare data from two or more groups by viewing boxplots from the groups side by side. Boxplots make it possible to compare two or more samples by comparing the center value (median) and the variation (length of the box), and thus to assess how well a given feature is capable of separating predetermined classes. For instance, figure 5.3 shows the boxplot representations of four different features when three outputs are considered (experiment 8.2.2.2). The first feature (a) would a priori represent a good feature for discriminating among the three given classes, because its median values are well separated and the boxes do not overlap significantly. Features (b) and (c) were also considered for the selected set after observation of their statistics. However, feature (d) was omitted because its statistics are not able to make distinctions among the classes, as depicted in figure 5.3 (d).
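Plots like those in figure 5.3 can be produced with R's built-in boxplot function; the one-liner below is an illustrative sketch (the data frame feats, the column name P1.5 and the factor activation are hypothetical), not the command used to generate the figure.

```r
# One box per activation level for a single feature (illustrative sketch).
boxplot(feats$P1.5 ~ activation,
        xlab = "activation level", ylab = "feature P1.5")
```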

This graphical method has been employed during this thesis in combination with other procedures, mainly linear regression, in order to achieve an enhanced analysis of the features.

⁴ The median is the middle of a distribution: half the scores are above the median and half are below the median.
