Exploratory Analysis: Feature Selection and Clustering
Davide Bacciu
Computational Intelligence & Machine Learning Group, Dipartimento di Informatica
Università di Pisa - [email protected]
Introduzione all'Intelligenza Artificiale (Introduction to Artificial Intelligence) - A.Y. 2012/2013
Lecture Outline
1. Feature Selection: Introduction, Objective Functions, Search Strategies
2. Clustering: Introduction, k-means Clustering, Advanced Clustering
3. Conclusion
Exploratory Data Analysis
- Discover structure in data
  - Find unknown patterns in the data that cannot be predicted using current expert knowledge
  - Formulate new hypotheses about the causes of the observed phenomena
- Finding informative attributes: Feature Extraction, Feature Selection
- Finding natural groups: Clustering
Feature Selection vs. Feature Extraction

Two approaches to dimensionality reduction:
- Feature Extraction - Create a new, lower-dimensional representation of the input data by transforming the existing features with a given function
- Feature Selection - Select a subset of the existing features without transforming the input data

Feature extraction generates new features by optimizing:
- Signal representation (PCA)
- Signal classification (LDA)

Feature selection looks at dimensionality reduction from a different perspective:
- Different methodologies (e.g. computational costs, ...)
- Different applications
Why Feature Selection?
The straightforward answer is: for several of the good reasons discussed previously for feature extraction:
- Reduce problem complexity
- Reduce noise
- Find good data representations

So, why not always use feature extraction?
- It can be computationally expensive to generate new features
- We don't want to transform the data, cancelling its semantics
- Data is not always numeric
- Data may fail to meet the theoretical assumptions underlying feature extraction
- Feature selection allows us to explicitly select good predictors, i.e. features that behave well on a specific supervised task
What About Projection?
Feature selection is rarely used for visualization
Nevertheless, it actually implements a projection
It is equivalent to projecting the data onto the lower-dimensional linear subspace perpendicular to the removed feature axes
Some Use Cases
Case I - Lung cancer
- Features are biomedical information or aspects of a patient's medical history
- Which features best predict whether lung cancer will develop?

Case II - Image Understanding
- Samples are images and features are pixels
- Which regions of an image are more likely to provide useful/discriminative information?

We seek a compact data representation that is interpretable and that tells us which features are relevant
Feature Selection
Definition - A process that chooses a D′-dimensional subset of all the features according to an objective function

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} \xrightarrow{\;\text{select } i_1,\ldots,i_{D'}\;} \mathbf{y} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_{D'}} \end{bmatrix}$$

Three characterizing aspects
- Subset ⇒ no creation of new features
- Objective function ⇒ measure of subset optimality
- Process ⇒ need a subset search strategy
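A minimal NumPy illustration of the definition above: selecting features is simply indexing the existing columns of the data matrix with the chosen indices (the indices below are an arbitrary example, not the output of a real selection process):

```python
import numpy as np

# Toy data matrix: N = 4 samples, D = 5 features
X = np.random.rand(4, 5)

# Hypothetical outcome of a feature selection process: keep features i1=0, i2=2, i3=4
selected = [0, 2, 4]

# No new features are created: y is just a column subset of x
Y = X[:, selected]
print(Y.shape)   # (4, 3), i.e. D' = 3
```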
Search Strategy and Objective Function
Feature selection requires:
- A search strategy to select candidate subsets
- An objective function to evaluate these candidates

Objective function - Evaluates candidate subsets and returns a measure of their optimality that is used to select new candidates

Search strategy - Exhaustive evaluation of all feature subsets is $O(2^D)$; we need a smart strategy to direct the subset selection process
Objective Functions - Two Approaches
- Filters - Evaluate feature subsets by their information content, e.g. interclass distance, statistical dependence or information-theoretic measures
- Wrappers - Use a pattern classifier which evaluates feature subsets by their predictive accuracy (e.g. recognition rate on validation data) using re-sampling or cross-validation
Filter Approach
Basic Idea - Assign a heuristic score to each feature to filter out the useless ones

- Separate the feature selection phase from pattern recognition/classifier learning
- Evaluate features by measuring their informative content:
  - Does the individual feature seem to help prediction?
  - Is it redundant?
  - Is it reliable?
- The scoring metric can be unsupervised or exploit supervised information
Measuring Feature Relevance
Distance metrics
- Measure separability with respect to some target attribute (e.g. a class)
- Select those features which have the largest separability

Correlation metrics
- Good features are highly correlated with the class but are uncorrelated with each other

Information-theoretic measures
- Measure the reduction in uncertainty (entropy) between a target variable (e.g. the class) and the feature
- E.g. mutual information (see the sketch below)

Consistency measures
- Find a minimum number of features that separate classes as consistently as the full set
- An inconsistency is two samples from different classes having the same feature values
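A small filter-style sketch of the mutual-information measure, assuming scikit-learn is available (the dataset and the cutoff K are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Labelled data: features X, class labels y
X, y = load_breast_cancer(return_X_y=True)

# Filter scoring: mutual information between each feature and the class
scores = mutual_info_classif(X, y, random_state=0)

# Rank by informative content and keep an (arbitrary) top-K cutoff
K = 5
ranking = np.argsort(scores)[::-1]
print("Top-%d features by mutual information:" % K, ranking[:K])
```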
Wrapper Approach
Basic Idea - Select those features that make the chosen supervised learning model perform best

- Feature selection relies on the preselected supervised learning model
- The learning model is re-trained each time a new subset is selected
- The quality of the feature subset is measured by the empirical error of the learned model (a sketch follows below)
- We need robust validation strategies to ensure generalization
- Selected features are, typically, effective class predictors for the model
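A minimal wrapper-style objective, assuming scikit-learn: each candidate subset is scored by the cross-validated accuracy of a classifier re-trained on those features only (the classifier and the example subsets are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def wrapper_score(feature_subset):
    """Wrapper objective: cross-validated accuracy of a classifier
    re-trained on the candidate feature subset."""
    clf = LogisticRegression(max_iter=5000)
    return cross_val_score(clf, X[:, feature_subset], y, cv=5).mean()

# Compare two illustrative candidate subsets
print(wrapper_score([0, 1, 2]))
print(wrapper_score([0, 5, 10, 20]))
```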
Filters vs. Wrappers (I)
Filter Approaches

Pros
- Fast execution - Non-iterative dataset processing which can execute much faster than classifier training
- Generality - Evaluate intrinsic properties of the data, rather than interactions with a particular classifier, hence solutions will be good for a larger family of classifiers

Cons
- Select large subsets - Since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution; this forces the user to choose an arbitrary cutoff on the number of features to be selected
Filters vs. Wrappers (II)
Wrapper Approaches

Pros
- Accuracy - Achieve better recognition rates since the selected features are tuned to the classifier
- Test generalization - Use cross-validation measures of predictive accuracy to avoid overfitting

Cons
- Slow - Must train a classifier for each feature subset
- Lack of generality - Solutions are tied to the bias of the classifier used in the evaluation function
Characterization of a Search Routine
Search starting point
- Empty set
- Full set
- Random subset

Search directions
- Sequential selection/elimination
- Random generation - Perform a randomized exploration of the search space where the next direction is sampled from a given probability distribution (e.g. genetic algorithms, simulated annealing, ...)

Search strategy
- Exhaustive/complete search
- Heuristic search
- Nondeterministic search
Sequential Feature Ranking
The simplest strategy:
- Weight and rank each feature
- Select the top-K ranked features
- No need for a subset search

Advantages
- O(D) complexity
- Easy to implement

Disadvantages
- How to determine the top-K threshold?
- Does not consider feature correlation (see the sketch below)
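A plain NumPy sketch of sequential feature ranking, using an absolute-correlation score as one possible weighting (the synthetic data and the choice of K are illustrative); note that each feature is scored in isolation, so redundant or correlated features are not detected:

```python
import numpy as np

def rank_features(X, y, k):
    """Score each feature independently (O(D)) by its absolute correlation
    with the target, then keep the indices of the top-k ranked features."""
    scores = np.array([abs(np.corrcoef(X[:, d], y)[0, 1]) for d in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k], scores

# Toy data: only the first two of five features carry class information
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     y + 0.5 * rng.normal(size=200),
                     rng.normal(size=(200, 3))])

top, scores = rank_features(X, y, k=2)
print(top)   # expected to contain features 0 and 1
```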
Sequential Search (I)
Greedy search algorithms that add or remove single features sequentially

Sequential Forward Selection
- Starting from the empty set, sequentially add the feature $x_d$ that results in the highest objective function $J(X_{D'} \cup \{x_d\})$ when combined with the features $X_{D'}$ that have already been selected (a sketch follows below)
- Performs best when the optimal subset has a small number of features
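A minimal sketch of sequential forward selection, assuming a generic scoring function J is supplied (for instance the wrapper_score defined earlier, or any filter measure); score_subset is a placeholder name:

```python
def sequential_forward_selection(n_features, target_size, score_subset):
    """Greedy SFS: grow the selected set one feature at a time,
    always adding the feature that maximizes the objective J."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < target_size:
        # Evaluate J(selected + {d}) for every remaining candidate feature d
        best = max(remaining, key=lambda d: score_subset(selected + [d]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example usage with the wrapper objective sketched earlier:
# best_subset = sequential_forward_selection(X.shape[1], 5, wrapper_score)
```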
Sequential Search (II)
Sequential Backward Elimination
- Starting from the full set, sequentially remove the feature $x_d \in X_{D'}$ that results in the smallest decrease of $J(X_{D'} \setminus \{x_d\})$
- Performs best when the optimal subset has a large number of features

Bi-directional Search
- Forward and backward searches are performed in parallel
- They are forced to converge to the same solution by ensuring that:
  - Features selected in the forward pass are not removed by the backward search
  - Features removed in the backward search are not selected in the forward pass
Search Strategies Cookbook
Unsupervised Feature Selection
- So far, feature selection approaches seem to exploit only relevance measures relying on supervised class information
- Is there any unsupervised feature selection approach? The answer is... yes (surprised?)
- Feature redundancy and relevance is measured with respect to clusters instead of classes
Definition (Clustering)
The process of organizing objects into natural groups whose members are similar in a way determined by a given metric
Clustering = Seeking a natural grouping
What is a natural grouping?
[Figure: the same individuals grouped in different natural ways - by family, by school, by females/males]
Credit goes to Eamonn Keogh @ UCR
The Clustering Problem (I)
Clustering objective
- Maximizing intra-cluster similarity
- Minimizing inter-cluster similarity

Prototype-based clustering
- Find a representative $c_i$ (prototype) for each cluster $i$
- Minimize the distance w.r.t. the cluster members

Results depend on the choice of the distance metric (see the sketch below):
- Euclidean: $\|c_i - x_n\|^2$
- Mahalanobis: $(c_i - x_n)^T S^{-1} (c_i - x_n)$
- ...
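A small NumPy illustration of the two metrics listed above, assuming $S$ is the covariance matrix of the data (the toy data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [0.8, 0.5]])  # correlated 2-D data
c = X.mean(axis=0)   # a cluster prototype
x = X[0]             # a sample

# Squared Euclidean distance ||c - x||^2
d_euclidean = np.sum((c - x) ** 2)

# Mahalanobis distance (c - x)^T S^{-1} (c - x), with S the data covariance
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = (c - x) @ S_inv @ (c - x)

print(d_euclidean, d_mahalanobis)
```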
The Clustering Problem (II)
It is not always easy to define what is a cluster and what is not
Hierarchical Clustering
- Prototype-based
- Finds a hierarchy of nested clusters
- Useful to convey a multi-resolution view of the data, e.g. bio-medical data:
  - Clusters ⇒ tumors
  - Sub-clusters ⇒ tumor subtypes
The Clustering Problem (III)

As we have seen with the Swiss roll in feature extraction, some data might lie on particularly nasty surfaces

Prototype-based clustering may not be effective with non-linearly separable clusters
k-means Clustering

- The simplest clustering algorithm
- Lots of limitations
- Still widely used (easy to understand and implement)

The algorithm in brief
- Receives as input k, i.e. the number of clusters to seek in the data
- Starts by picking k points at random as cluster prototypes (centroids)
- Repeatedly assigns each sample to the nearest prototype and recomputes the cluster centroids
The Algorithm
Algorithm k-means(X, K)
  N ← |X|
  for i = 1 to K do
    c[i] ← rand()                          {initialize prototypes at random}
  end for
  repeat                                   {update prototypes}
    cnew ← 0; tot ← 0
    for n = 1 to N do
      winner ← argmin_{i=1,...,K} ||c[i] − x_n||²
      tot[winner] ← tot[winner] + 1
      cnew[winner] ← cnew[winner] + x_n
    end for
    for i = 1 to K do
      c[i] ← cnew[i] / tot[i]              {new prototype = mean of assigned samples}
    end for
  until MaxIterations reached or the prototypes change by less than ε
  return cluster prototypes c
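A runnable NumPy sketch of the same procedure (variable names mirror the pseudocode; the initialization picks K samples from X as prototypes, a common practical variant of random initialization):

```python
import numpy as np

def k_means(X, K, max_iterations=100, eps=1e-6, seed=0):
    """Plain k-means: alternate nearest-prototype assignment and centroid update."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].copy()   # initial prototypes
    for _ in range(max_iterations):
        # Assignment step: index of the nearest prototype for every sample
        dist = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        winner = dist.argmin(axis=1)
        # Update step: each prototype becomes the mean of its assigned samples
        c_new = np.array([X[winner == i].mean(axis=0) if np.any(winner == i) else c[i]
                          for i in range(K)])
        if np.linalg.norm(c - c_new) < eps:                          # convergence check
            return c_new
        c = c_new
    return c

# Example usage on toy 2-D data with three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
print(k_means(X, K=3))
```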
k-Means Example (I)
Input data X
k-Means Example (II)
Initialize K = 3 prototypes
k-Means Example (III)
Assign samples to the nearest prototype/cluster
k-Means Example (IV)
Compute the updated prototypes $c'_i$
k-Means Example (V)
Update cluster assignment
Cluster number estimate
- In practice, we seldom know the cluster number K
- Trial-and-error to find the most suitable K
- Use algorithms that estimate the cluster number

Several approaches, but no killer algorithm:
- Agglomerative Clustering - Start with every sample being a separate cluster and iteratively aggregate existing clusters until an error criterion is satisfied
- Partitional Clustering - Start with a single cluster and keep splitting it until an error criterion is satisfied

The error criterion is often composed of (see the sketch below):
- A similarity term favoring the aggregation of samples into clusters
- A penalization term discouraging the creation of too many clusters
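An illustrative composite criterion, reusing the k_means function sketched earlier: the within-cluster distortion plays the role of the similarity term and a term proportional to the cluster count plays the role of the penalization (the lam weight below is an arbitrary choice, not prescribed by the slides):

```python
import numpy as np

def clustering_error(X, prototypes, lam=0.1):
    """Composite criterion: within-cluster distortion + penalty on the cluster count."""
    dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    distortion = dist.min(axis=1).mean()   # similarity term: mean distance to the nearest prototype
    penalty = lam * len(prototypes)        # penalization term: discourages too many clusters
    return distortion + penalty

# Trial-and-error over candidate cluster numbers, keeping the K with the lowest error
# (assumes X and k_means from the previous example)
errors = {K: clustering_error(X, k_means(X, K)) for K in range(1, 8)}
print(min(errors, key=errors.get), errors)
```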
Clustering and Feature Selection
Cluster labels can be used for performing feature selection when class information is not available

Filter approach (sketched below)
- Step 1 - Partition the data using a clustering algorithm
- Step 2 - Run a filter model evaluating feature relevance w.r.t. how much an attribute helps in separating the clusters

Wrapper approach
- Step 1 - Select a set of features
- Step 2 - Evaluate whether they optimize the error criterion set by the clustering algorithm
Feature selection can be used to help clustering algorithms!
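A minimal sketch of the filter variant, assuming scikit-learn and reusing the earlier k_means function and data X: the cluster assignments play the role of the missing class labels when scoring feature relevance (the mutual-information scorer is just one possible choice):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Step 1: partition the data; cluster assignments act as pseudo-class labels
prototypes = k_means(X, K=3)
dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
cluster_labels = dist.argmin(axis=1)

# Step 2: filter scoring of feature relevance w.r.t. cluster separation
scores = mutual_info_classif(X, cluster_labels, random_state=0)
print("Features ranked by relevance to the clustering:", np.argsort(scores)[::-1])
```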
Take-home Messages
Feature selection
- Extracts a subset of informative input features
- Preserves the data semantics
- Faster than feature extraction

Two main strategies
- Filters - Separate feature selection from pattern recognition
- Wrappers - Select features that work best with a given learning model

Clustering
- Finding natural groups in the data
- Cluster number estimation ⇒ no definitive answer
- Can be used to support feature selection when class information is not available