Feature Selection: Concepts and Methods


1

Feature Selection

Concepts and Methods

Electronic & Computer Department

Isfahan University Of Technology

Reza Ramezani

What are Features?

Features are the attributes whose values make up an instance.

With features we can identify instances.

Features are the determinant values that decide which class an instance belongs to.

2

Classifying Features

Relevance: features that have an influence on the output and whose role cannot be assumed by the rest.

Irrelevance: features that do not have any influence on the output, and whose values are generated at random for each example.

Redundancy: a redundancy exists whenever a feature can take over the role of another.

3

What is Feature Selection?

Feature selection is a preprocessing step to machine learning that chooses a subset of the original features according to a certain evaluation criterion. It is effective in:

removing or reducing the effect of irrelevant data
removing redundant data
reducing dimensionality (binary model)
increasing learning accuracy
improving result comprehensibility.

4

Other Definitions

A process which selects a subset of features, defined by one of three approaches:

1) the subset of a specified size that optimizes an evaluation measure

2) the subset of the smallest size that satisfies a certain restriction on the evaluation measure

3) the subset with the best compromise between its size and the value of its evaluation measure (the general case).

5

Feature Selection Algorithm (FSA)

An FSA is a computational solution that is motivated by a certain definition of relevance.

1) The relevance of a feature may have several definitions, depending on the objective that is sought.

2) Find a compromise between minimizing the size of the subset and maximizing the evaluation measure (the general case).

3) An irrelevant feature is not useful for induction, but not all relevant features are necessarily useful for induction.

6

Classifying FSAs

FSAs can be classified according to the kind of output they yield:

1) Algorithms that give a weighted linear order of the features (continuous feature selection problem).

2) Algorithms that give a subset of the original features (binary feature selection problem).

Note that both types can be seen in a unified way by noting that in (2) the weighting is binary.

7

Notation

X = feature set
X' = feature subset
x_i = a single feature
E = instance space (the set of instances)
p = probability distribution on E
W = space of labels (e.g. classes)
c = objective function c: E → W that labels each instance according to its relevant features (the classifier)
S = data set (training set)

8

Relevance of a feature

The purpose of an FSA is to identify relevant features according to a definition of relevance.

Unfortunately the notion of relevance in machine learning has not yet been rigorously defined by common agreement.

Let us define relevance from several aspects:

9

Relevance with respect to an objective

A feature x_i is relevant to the objective function c if there exist two examples A and B in the instance space E such that A and B differ only in their assignment to x_i and c(A) ≠ c(B).

10

Strong relevance with respect to S

A feature x_i is strongly relevant to the sample S if there exist two examples A and B in S that differ only in their assignment to x_i and c(A) ≠ c(B).

That is to say, it is the same as the previous definition, but now A, B ∈ S and the definition is with respect to S.

11

Example data set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class label):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Decision tree figure built from this data, with internal nodes MarSt (Married / Single, Divorced), Refund (Yes / No) and TaxInc (< 80K / > 80K), and leaf labels YES and NO]

12

Strong relevance with respect to p

A feature x_i is strongly relevant to the objective c in the distribution p if there exist two examples A and B with p(A) > 0 and p(B) > 0 that differ only in their assignment to x_i and c(A) ≠ c(B).

This definition is the natural extension of the previous one and, contrary to it, the distribution is assumed to be known.

13

Weak relevance with respect to S

A feature x_i is weakly relevant to the sample S if it is not strongly relevant and there exists at least one proper subset X' of X, with x_i ∈ X', for which x_i is strongly relevant with respect to S.

A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.

14

Weak relevance with respect to p

A feature x_i is weakly relevant to the objective c in the distribution p if it is not strongly relevant and there exists at least one proper subset X' of X, with x_i ∈ X', for which x_i is strongly relevant with respect to p.

These five definitions are important for deciding which features should be conserved and which can be eliminated.

15

Strongly Relevant Features

The strongly relevant features are, in theory, important to maintain a structure in the domain, and they should be conserved by any feature selection algorithm in order to avoid the addition of ambiguity to the sample.

16

Weakly Relevant Features

Weakly relevant features could be important or not, depending on:

The other features already selected.

The evaluation measure that has been chosen (accuracy, simplicity, consistency, etc.).

17

Relevance as a complexity measure

Define r(S, c) as the smallest number of features relevant to c such that the error in S is the least possible for the inducer.

In other words, it refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.

18

Incremental usefulness

Given a data sample S, a learning algorithm L and a subset of features X', the feature x_i is incrementally useful to L with respect to X' if the accuracy of the hypothesis that L produces using the group of features {x_i} ∪ X' is better than the accuracy reached using only the subset of features X'.

19

Example (features X1 …… X11 …… X21 …… X30; one instance per row, class label at the end):

100000000000000000000000000000  +
111111111100000000000000000000  +
000000000011111111110000000000  +
000000000000000000001111111111  +
000000000000000000000000000000  -

X1 is strongly relevant; the rest are weakly relevant.
r(S, c) = 3
Incremental usefulness: after choosing {X1, X2}, none of X3 … X10 would be incrementally useful, but any of X11 … X30 would.

20
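To make the strong-relevance test concrete, the following minimal Python sketch (my own illustration, not part of the slides) applies the definition above, i.e. two instances that differ only in x_i and carry different labels, to this toy sample; it confirms that only X1 is strongly relevant with respect to S.

```python
# A minimal sketch (not from the slides): test strong relevance with respect
# to the sample S on the toy data set above.
import numpy as np

n = 30
S = np.zeros((5, n), dtype=int)
S[0, 0] = 1        # instance 1: only X1 is set
S[1, 0:10] = 1     # instance 2: X1..X10 are set
S[2, 10:20] = 1    # instance 3: X11..X20 are set
S[3, 20:30] = 1    # instance 4: X21..X30 are set
                   # instance 5: all zeros
labels = ["+", "+", "+", "+", "-"]

def strongly_relevant(i, X, y):
    """x_i is strongly relevant to S if two instances of S differ only in x_i
    and belong to different classes."""
    for a in range(len(X)):
        for b in range(a + 1, len(X)):
            differing = np.flatnonzero(X[a] != X[b])
            if len(differing) == 1 and differing[0] == i and y[a] != y[b]:
                return True
    return False

print([f"X{i + 1}" for i in range(n) if strongly_relevant(i, S, labels)])
# ['X1']  -- only X1 is strongly relevant with respect to this sample
```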

General Schemes for Feature Selection

Relationship between an FSA and the inducer. The inducer is:

• Chosen process to evaluate the usefulness of the features

• Learning Process

Filter Scheme

Wrapper Scheme

Embedded Scheme

21

Filter Scheme

The feature selection process takes place before the induction step.

This scheme is independent of the induction algorithm.

• High speed
• Low accuracy

22

Wrapper Scheme

Uses the learning algorithm as a subroutine to evaluate the feature subsets.

The inducer must be known.

• Low speed
• High accuracy

23
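As a rough illustration of the two schemes just described (not taken from the slides), the sketch below contrasts a filter, where features are scored independently of any inducer (here by mutual information), with a wrapper, where the inducer itself evaluates candidate subsets (here via recursive feature elimination), using scikit-learn on synthetic data.

```python
# A hedged sketch (not from the slides) contrasting the filter and wrapper
# schemes on a small synthetic problem, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=4, random_state=0)

# Filter scheme: features are scored before induction, independently of the
# inducer (mutual information with the class).
filt = SelectKBest(mutual_info_classif, k=4).fit(X, y)
print("filter keeps:", filt.get_support(indices=True))

# Wrapper scheme: the inducer (a decision tree) is used as a subroutine to
# evaluate candidate subsets; slower, usually more accurate.
wrap = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4).fit(X, y)
print("wrapper keeps:", wrap.get_support(indices=True))
```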

Embedded Scheme

Similar to the wrapper approach: features are specifically selected for a certain inducer, and the inducer selects the features in the process of learning (explicitly or implicitly).

24

[Figure: the decision tree from the earlier example, with internal nodes MarSt (Married / Single, Divorced), Refund (Yes / No) and TaxInc (< 80K / > 80K), and leaf labels YES and NO]

25

Embedded Scheme Example

Refund and Marital Status are categorical, Taxable Income and Age are continuous, Cheat is the class label:

Refund  Marital Status  Taxable Income  Age  Cheat
Yes     Single          125K            18   No
No      Married         100K            30   No
No      Single          70K             28   No
Yes     Married         120K            19   No
No      Divorced        95K             18   Yes
No      Married         60K             20   No
Yes     Divorced        220K            25   No
No      Single          85K             30   Yes
No      Married         75K             20   No
No      Single          90K             18   Yes

A decision tree learning algorithm will automatically remove the 'Age' feature.
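A small sketch of how an embedded inducer performs this removal in practice (my own illustration, assuming pandas and scikit-learn; the slide only shows the table): a decision tree is fitted on the data above, and any column the tree never splits on, typically Age here, ends up with zero feature importance.

```python
# A small sketch (mine, not from the slides) of the embedded scheme:
# the inducer itself (a decision tree) drops features while learning.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Age":    [18, 30, 28, 19, 18, 20, 25, 30, 20, 18],
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
X = pd.get_dummies(data.drop(columns="Cheat"))   # one-hot encode categoricals
y = data["Cheat"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
for name, imp in zip(X.columns, tree.feature_importances_):
    print(f"{name:16s} {imp:.2f}")
# On this tiny sample the classes can be separated using TaxInc alone, so Age
# (and the other unused columns) typically receive zero importance: the
# inducer has implicitly removed them during learning.
```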

Characterization of FSAs

Search Organization: the general strategy with which the space of hypotheses is explored.

Generation of Successors: Mechanism by which possible successor candidates of the current state are proposed.

Evaluation Measure: Function by which successor candidates are evaluated.

26

Types of Search Organization

We consider three types of search:

Exponential

Sequential

Random

27

Exponential Search

Algorithms that carry out searches with cost O(2^n). The best solution is guaranteed.

Exhaustive search is an optimal search, but an optimal search need not be exhaustive:

• Branch and Bound, for a monotonic evaluation measure
• A* search with an admissible heuristic
• A measure J is monotonic if for any two subsets X1 ⊆ X2, then J(X1) ≤ J(X2).

28

Sequential Search

This strategy selects one among all the successors of the current state. Once the state is selected it is not possible to go back.

The number of such steps is bounded by the number of features n. Let k be the number of subsets evaluated in each state change; the cost of this search is therefore polynomial in n.

These methods do not guarantee an optimal result.

29

Random Search

Uses randomness to prevent the algorithm from getting stuck in a local minimum.

Allows temporarily moving to other states with worse solutions.

These are anytime algorithms.

Can give several optimal subsets as solution.

30

Types of Successors Generation

Forward

Backward

Compound

Weighting

Random

31

Forward Successors Generation

Starting with X' = ∅.

Adds features to the current solution X', among those that have not been selected yet.

In each step, the feature that makes J(X') greatest is added to the solution.

The cost of the operator is at most n evaluations per step (one for each feature not yet selected).

32

Backward Successors Generation

Starting with X' = X.

Removes features from the current solution X', among those that have not been removed yet.

In each step, the feature whose removal makes J greatest is removed from the solution.

The cost of the operator is at most n evaluations per step (one for each feature still in the solution).

33

Forward and Backward Methods, Stopping Criteria

The number of selected features has reached a size fixed in advance.

The value of J has not increased in the last k steps.

The value of J has reached a prefixed value.

In practice the backward method demands more computation than its forward counterpart.

34
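The following sketch (mine, not from the slides) implements forward successor generation with the stopping criteria above; the evaluation measure J is taken here to be the cross-validated accuracy of a k-NN inducer, which is only one of the possible choices.

```python
# A minimal sketch (not from the slides) of sequential forward generation:
# stop when J no longer improves or when a prefixed subset size is reached.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

def J(subset):
    """Evaluation measure for a feature subset (higher is better)."""
    clf = KNeighborsClassifier()
    return cross_val_score(clf, X[:, sorted(subset)], y, cv=5).mean()

selected, best_J, max_size = set(), 0.0, 5
while len(selected) < max_size:
    candidates = [i for i in range(X.shape[1]) if i not in selected]
    # In each step, add the feature that makes J greatest.
    scores = {i: J(selected | {i}) for i in candidates}
    best_i = max(scores, key=scores.get)
    if scores[best_i] <= best_J:        # J has not increased: stop
        break
    selected.add(best_i)
    best_J = scores[best_i]

print("selected features:", sorted(selected), "J =", round(best_J, 3))
```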

Compound Successors Generation

Apply f consecutive forward steps and b consecutive backward ones. If f > b the net result is a forward operator; otherwise it is a backward one.

This method allows discovering new interactions among features.

Other stopping conditions should be established if f = b.

In a sequential FSA, the condition f ≠ b assures a maximum number of steps and thus a bounded total cost.

35

Weighting Successors Generation

In weighting operators (continuous feature selection problem), all of the features are present in the solution to a certain degree.

A successor state is a state with a different weighting.

This is typically done by iteratively sampling the available set of instances.

36

Random Successors Generation

Includes those operators that can potentially generate any other state in a single step.

They are restricted by some criterion of advance:
• in the number of features
• in improving the measure J at each step.

37

Evaluation Measures

Probability of error
Divergence
Dependence
Interclass distance
Information or uncertainty
Consistency

The relative values assigned to different subsets reflect their greater or lesser relevance to the objective function.

Let J(X') be an evaluation measure to be maximized, where X' is a (weighted) feature subset.

38

Evaluation Measures, Probability of Error

The ultimate goal is to build a classifier that minimizes the (Bayesian) probability of error.

The error rate (equivalently, the accuracy) of the classifier therefore seems to be the most natural choice.

39

Evaluation Measures, Probability of Error

Since the class-conditional densities are usually unknown, they can either be explicitly modeled (using parametric or non-parametric methods) or the probability of error can be estimated empirically from data.

40

Evaluation Measures, Probability of Error

Provided the classifier has been built using only a subset X' of the features, we have:

S_T is a test data sample and S_C is the subset of S_T where the classifier performed correctly.

Finally we have: J(X') = |S_C| / |S_T|.

41

Evaluation Measures, Divergence

These measures compute a probabilistic distance or divergence among the class-conditional probability densities P(x | c), using (for two classes) the general formula:

J(X') = ∫ f( P(x | c1), P(x | c2) ) dx

42

Evaluation Measures, Divergence

For a valid measure, the function f must be such that the value of J satisfies the following conditions:

1) J = 0 only when the class-conditional densities are equal

2) J is maximum when the densities are non-overlapping

If the features used in a solution are good ones, the divergence will be significant.

43

Divergence, some classical choices:

[formulas not reproduced in the transcript]

44

45

Evaluation Measures, Dependence

These measures quantify how strongly two features are associated with one another: knowing the value of one feature, it is possible to predict the value of the other.

The correlation coefficient is a classical measure that is still used for these methods.

46

Evaluation Measures, Interclass Distance

These measures are based on the assumption that instances of different classes are distant in the instance space. A typical measure averages the distances between instances of different classes, with x_i^(c) being the i-th instance of class c and n_c the number of instances of class c.

The most usual distances belong to the Euclidean family.

47

Evaluation Measures, Consistency

An inconsistency in X' and S is defined as two instances in S that are equal when considering only the features in X' but that belong to different classes.

The aim is thus to find the minimum subset of features leading to zero inconsistencies.

48

Evaluation Measures, Consistency

The inconsistency count of an instance A ∈ S is defined as:

IC_X'(A) = n_X'(A) − max_c n_X'(A, c)

where n_X'(A) is the number of instances in S equal to A using only the features in X', and n_X'(A, c) is the number of instances in S of class c equal to A using only the features in X'.

49

Evaluation Measures, Consistency

The inconsistency rate of a feature subset X' in a sample S sums the inconsistency counts of the distinct instances (patterns) and divides by the sample size:

IR_X'(S) = ( Σ_A IC_X'(A) ) / |S|

This measure lies in [0, 1] and must be minimized.

50
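A small Python sketch of the consistency measure as reconstructed above (my own code, not from the slides): instances are grouped by their projection onto X', and each group contributes its size minus its majority-class count.

```python
# A small sketch (mine) of the inconsistency rate: group the sample by the
# values of the selected features and count, in each group, the instances
# that do not belong to the majority class.
from collections import Counter, defaultdict

def inconsistency_rate(S, labels, subset):
    """S: list of instance tuples; labels: class labels; subset: feature indices."""
    groups = defaultdict(list)
    for instance, c in zip(S, labels):
        key = tuple(instance[i] for i in subset)   # projection onto X'
        groups[key].append(c)
    # inconsistency count of a pattern = group size - majority class count
    total = sum(len(cs) - max(Counter(cs).values()) for cs in groups.values())
    return total / len(S)

S = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
labels = ["a", "a", "b", "a"]
print(inconsistency_rate(S, labels, [0]))      # 0.25: {x1} alone leaves one clash
print(inconsistency_rate(S, labels, [0, 2]))   # 0.0: {x1, x3} are consistent
```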

51

General Algorithm for Feature Selection

All FSAs can be represented in a space of characteristics according to the criteria of: search organization (Org), generation of successor states (GS) and evaluation measure (J).

This space <Org, GS, J> encompasses the whole spectrum of possibilities for an FSA.

An FSA is called hybrid when it requires more than one point in the same coordinate to be characterized.

52

53

FCBF: Fast Correlation-Based Filter

(Filter Mode)

<Sequential, Compound, Information>

54

Previous Works and Their Defects

1) Huge Time Complexity

Binary mode: subset search algorithms search through candidate feature subsets guided by a certain search strategy and an evaluation measure.

Different search strategies, namely exhaustive, heuristic and random search, are combined with this evaluation measure to form different algorithms.

55

Previous Works and Their Defects

The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search.

The complexity can be linear in the number of iterations for a random search, but experiments show that in order to find the best feature subset, the number of iterations required is usually at least quadratic in the number of features.

56

Previous Works and Their Defects

2) Inability to recognize redundant features.

Relief: the key idea of Relief is to estimate the relevance of features according to how well their values distinguish between the instances of the same and different classes that are near each other.

Relief randomly samples a number (m) of instances from the training set and updates the relevance estimation of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes.

57

Previous Works and Their Defects

The time complexity of Relief for a data set with M instances and N features is O(mMN).

With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and a very high dimensionality.

However, Relief does not help with removing redundant features.

58

Good Feature

A feature is good if it is relevant to the class concept but is not redundant to any of the other relevant features.

Correlation as Goodness Measure

A feature is good if it is highly correlated to the class but not highly correlated to any of the other features.

59

Approaches to Measure The Correlation

Classical Linear Correlation (Linear Correlation Coefficient)

Information theory (Entropy or Uncertainty)

60

Linear Correlation Coefficient

For a pair of variables (X, Y) the linear correlation coefficient r is given by the formula:

r = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_i (x_i − x̄)²) · sqrt(Σ_i (y_i − ȳ)²) )

If X and Y are completely correlated, r takes the value of 1 or -1.

If X and Y are totally independent, r is zero.

61

Advantages

It helps to remove features with near-zero linear correlation to the class.

It helps to reduce redundancy among the selected features.

Disadvantages

It may not be able to capture correlations that are not linear in nature.

The calculation requires that all features contain numerical values.

62
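A quick numerical illustration of the stated disadvantage (not from the slides): the linear correlation coefficient is near 1 for a linear dependence but near 0 for a perfectly deterministic quadratic one.

```python
# A quick illustration (mine): the linear correlation coefficient misses a
# deterministic but non-linear dependence.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y_linear = 2 * x + 1          # linear dependence
y_quad = x ** 2               # non-linear dependence

print(np.corrcoef(x, y_linear)[0, 1])  # ~ 1.0
print(np.corrcoef(x, y_quad)[0, 1])    # ~ 0.0, although y is fully determined by x
```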

Entropy

The entropy of a variable (feature) X is defined as:

H(X) = − Σ_i P(x_i) log2 P(x_i)

The entropy of X after observing values of another variable Y is defined as:

H(X | Y) = − Σ_j P(y_j) Σ_i P(x_i | y_j) log2 P(x_i | y_j)

63

Entropy, Information Gain

The amount by which the entropy of X decreases reflects additional information about X provided by Y:

IG(X | Y) = H(X) − H(X | Y)

Feature Y is regarded as more correlated to feature X than to feature Z if IG(X | Y) > IG(Z | Y).

Information gain is symmetrical for two random variables X and Y: IG(X|Y) = IG(Y|X)

64

Entropy, Symmetrical Uncertainty

Information gain is biased in favor of features with more values; thus we must normalize it:

SU(X, Y) = 2 · IG(X | Y) / ( H(X) + H(Y) )

SU(X, Y) values are normalized to the range [0, 1]:

The value 1 indicates that knowledge of the value of either variable completely predicts the value of the other.

The value 0 indicates that X and Y are independent.

65
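The entropy-based measures above are easy to compute for nominal features; the sketch below (my own, not from the slides) estimates H, H(·|·) and SU from empirical frequencies.

```python
# A minimal sketch (not from the slides) of the entropy-based measures for
# nominal features, computed from empirical frequencies.
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        x_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        h += cy / n * entropy(x_given_y)
    return h

def symmetrical_uncertainty(xs, ys):
    ig = entropy(xs) - conditional_entropy(xs, ys)   # IG(X|Y)
    hx, hy = entropy(xs), entropy(ys)
    return 2 * ig / (hx + hy) if hx + hy > 0 else 0.0

X = ["a", "a", "b", "b"]
Y = ["u", "u", "v", "v"]   # Y determines X exactly
Z = ["p", "q", "p", "q"]   # Z is independent of X
print(symmetrical_uncertainty(X, Y))  # 1.0
print(symmetrical_uncertainty(X, Z))  # 0.0
```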

Entropy, Symmetrical Uncertainty

Symmetrical uncertainty still treats a pair of features symmetrically.

Entropy-based measures require nominal features.

Entropy-based measures can be applied to measure correlations between continuous features as well, if the values are discretized properly in advance.

66

Algorithm Steps

Aspects of developing a procedure to select good features for classification:

1) How to decide whether a feature is relevant to the class or not (C-correlation).

2) How to decide whether such a relevant feature is redundant or not when considering it with other relevant features (F-correlation).

Select features with SU greater than a threshold.

67

Predominant Correlation

The correlation SU(F_i, C) between a feature F_i and the class C is predominant iff:

SU(F_i, C) ≥ δ (the threshold), and there exists no feature F_j (j ≠ i) such that SU(F_j, F_i) ≥ SU(F_i, C).

68

Redundant Feature

If F_j is redundant to feature F_i, we use S_Pi to denote the set of all redundant peers of F_i. We divide S_Pi into two parts: S_Pi+ (peers more correlated to the class than F_i) and S_Pi− (peers less correlated to the class than F_i).

69

Predominant Feature

A feature is predominant to the class iff its correlation to the class is predominant, or it can become predominant after removing its redundant peers.

Feature selection for classification is a process that identifies all predominant features to the class concept and removes the rest.

70

Heuristic

We must use heuristics in order to avoid pairwise analysis of the F-correlations between all relevant features.

Heuristic: if S_Pi+ = ∅, treat F_i as a predominant feature, remove all features in S_Pi−, and skip identifying redundant peers for them.

71

72

FCBF Algorithm
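The FCBF pseudocode on this slide is not reproduced in the transcript. The sketch below is my own reconstruction of the procedure described on the previous slides (relevance filtering by SU with the class, then redundancy removal via the predominance heuristic); it reuses the symmetrical_uncertainty helper from the earlier sketch.

```python
# My own reconstruction of the FCBF procedure described above; it assumes the
# symmetrical_uncertainty() function defined in the earlier entropy sketch.
def fcbf(features, class_labels, delta=0.0):
    """features: dict {name: list of nominal values}; returns selected names."""
    # 1) C-correlation: keep features whose SU with the class exceeds delta,
    #    ordered by decreasing SU value.
    su_c = {f: symmetrical_uncertainty(vals, class_labels)
            for f, vals in features.items()}
    ranked = [f for f in sorted(su_c, key=su_c.get, reverse=True)
              if su_c[f] > delta]

    # 2) F-correlation: walk the list and drop every remaining feature that is
    #    more correlated with an already-kept (predominant) feature than with
    #    the class.
    selected = []
    while ranked:
        fp = ranked.pop(0)
        selected.append(fp)
        ranked = [f for f in ranked
                  if symmetrical_uncertainty(features[f], features[fp]) < su_c[f]]
    return selected
```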

73

74

GA-SVM: Genetic Algorithm + Support Vector Machine

(Wrapper Mode)

<Sequential, Compound, Classifier>

75

Support Vector Machine (SVM)

SVM is one of the best techniques for pattern classification and is widely used in many application areas.

SVM classifies data by determining a set of support vectors and their distance to the hyperplane.

SVM provides a generic mechanism that fits the hyperplane surface to the training data.

76

SVM Main Idea

Under the hypothesis that the classes are linearly separable, build the hyperplane with maximum margin that separates the classes.

When the classes are not linearly separable, map them to a higher-dimensional space in order to separate them linearly.

77

[Figure: separating surface between the two classes A+ and A−]

Support Vectors

The nearest training set instances to the hyperplane: use the support vectors instead of the whole training set.

Hyperplane equation: w · x + b = 0 (w and b are unknown).

78

[Figure: the two classes (+1 and −1) in the (X1, X2) plane, with the support vectors (SV) lying on the margin]

Kernel

79

[Figure: points 1, 2, 4, 5, 6 on a line, with class 1 on either side of class 2: not linearly separable in 1 dimension, but separable after mapping into 2 dimensions]

Kernel

Map the data into a higher-dimensional space! The user may select a kernel function for the SVM during the training process.

The kernel parameter settings used when training an SVM impact the classification accuracy.

The parameters that should be optimized include the penalty parameter C and the kernel function parameters.

80

Linear SVM

SVM concepts for typical two-class classification problems:

Training set of instance-label pairs (x_i, y_i), i = 1, …, m, with y_i ∈ {+1, −1}.

For the linearly separable case, the data points will be correctly classified by

y_i (w · x_i + b) ≥ 1.

81

Linear SVM

Find an optimal separating hyperplane with the maximum margin by solving the following optimization problem:

minimize (1/2) ||w||²  subject to  y_i (w · x_i + b) ≥ 1

To solve this quadratic optimization problem one must find the saddle point of the Lagrange function:

L(w, b, α) = (1/2) ||w||² − Σ_i α_i [ y_i (w · x_i + b) − 1 ]

where α_i denotes the Lagrange multipliers, hence α_i ≥ 0.

82

Linear SVM

After differentiating and applying the Karush-Kuhn-Tucker (KKT) conditions we obtain

w = Σ_i α_i y_i x_i  and  Σ_i α_i y_i = 0.

The α_i values determine the parameters w and b of the optimal hyperplane. Thus, we obtain the optimal decision hyperplane

f(x) = sign( Σ_i α_i y_i (x_i · x) + b ).

83

Linear Generalized SVM

When the data cannot be linearly separated, the goal is to construct a hyperplane that makes the smallest number of errors, using non-negative slack variables ξ_i ≥ 0.

Solve:

minimize (1/2) ||w||² + C Σ_i ξ_i  subject to  y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

C is a tradeoff parameter between error and margin, and the slack variables ξ_i account for the misclassified instances.

84

Linear Generalized SVM

This optimization model can be solved using the Lagrangian method.

The penalty parameter C is now the upper bound on the Lagrange multipliers: 0 ≤ α_i ≤ C.

85

NonLinear SVM

The nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function Φ. Inner products in the feature space are then computed by a kernel function:

K(x_i, x_j) = Φ(x_i) · Φ(x_j)

86

NonLinear SVM, Kernels

The final hyperplane (decision function) equation:

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )

87

NonLinear SVM, Kernels

In order to improve classification accuracy, the parameters of the kernel functions should be properly set.

Polynomial kernel: K(x_i, x_j) = (γ x_i · x_j + r)^d

Radial basis function kernel: K(x_i, x_j) = exp(−γ ||x_i − x_j||²)

Sigmoid kernel: K(x_i, x_j) = tanh(γ x_i · x_j + r)

88
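As a brief, non-authoritative illustration of kernel and parameter choice (the data set and settings are mine, not the authors'), scikit-learn's SVC can be cross-validated with each of the kernels above; C and gamma are exactly the quantities that the GA described next is meant to tune.

```python
# A brief illustration (not from the slides) of kernel and parameter choice
# with scikit-learn's SVC.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} accuracy = {acc:.2f}")   # rbf usually wins on this data
```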

Genetic Algorithm (GA)

Genetic algorithms (GAs), as an optimization search methodology, are a promising alternative to conventional heuristic methods.

GAs work with a set of candidate solutions called a population.

Based on the Darwinian principle of 'survival of the fittest', the GA obtains the optimal solution after a series of iterative computations.

The GA generates successive populations of alternative solutions, each represented by a chromosome.

A fitness function assesses the quality of a solution in the evaluation step.

89

90

GA Feature Selection Structure

The chromosome comprises three parts: C, γ, and the feature mask. (Different parameters apply when other types of kernel functions are used.)

A binary coding system is used to represent the chromosome: n_C is the number of bits representing parameter C, n_γ is the number of bits representing parameter γ, and n_f is the number of bits representing the feature mask.

n_C and n_γ are chosen according to the required calculation precision.

91

Evaluation Measure

Three criteria are used to design the fitness function:
classification accuracy
the number of selected features
the feature cost

Thus, an individual (chromosome) with high classification accuracy, a small number of features and a low total feature cost produces a high fitness value.

92

Evaluation Measure

fitness = W_A × SVM_accuracy + …

93
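The remaining slides sketch the GA-SVM loop; since their content is not reproduced in the transcript, the following condensed Python sketch is my own reconstruction under simplified assumptions (RBF kernel only, a fitness that rewards accuracy and fewer features, no feature costs), not the authors' implementation.

```python
# A condensed sketch (mine, not the authors' code) of the GA-SVM wrapper:
# each chromosome encodes C, gamma and a feature mask; its fitness combines
# cross-validated SVM accuracy with a reward for using fewer features.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
n_feat = X.shape[1]
rng = random.Random(0)

def random_chrom():
    return {"logC": rng.uniform(-2, 3), "logG": rng.uniform(-4, 1),
            "mask": [rng.random() < 0.5 for _ in range(n_feat)]}

def fitness(ch, w_acc=0.9, w_nfeat=0.1):
    cols = [i for i, m in enumerate(ch["mask"]) if m]
    if not cols:
        return 0.0
    clf = SVC(C=10 ** ch["logC"], gamma=10 ** ch["logG"], kernel="rbf")
    acc = cross_val_score(clf, X[:, cols], y, cv=3).mean()
    return w_acc * acc + w_nfeat * (1 - len(cols) / n_feat)

def crossover(a, b):
    child = {"logC": rng.choice([a["logC"], b["logC"]]),
             "logG": rng.choice([a["logG"], b["logG"]]),
             "mask": [rng.choice(p) for p in zip(a["mask"], b["mask"])]}
    if rng.random() < 0.2:                      # mutation: flip one mask bit
        i = rng.randrange(n_feat)
        child["mask"][i] = not child["mask"][i]
    return child

pop = [random_chrom() for _ in range(20)]
for gen in range(10):
    pop.sort(key=fitness, reverse=True)         # selection: keep the fittest
    parents = pop[:10]
    pop = parents + [crossover(rng.choice(parents), rng.choice(parents))
                     for _ in range(10)]

best = max(pop, key=fitness)
print("best fitness:", round(fitness(best), 3),
      "features kept:", sum(best["mask"]))
```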

94

95

Thank You for Your Attention
