Exploratory Analysis: Feature Selection and Clustering
Davide Bacciu
Computational Intelligence & Machine Learning Group, Dipartimento di Informatica
Università di Pisa - [email protected]
Introduzione all'Intelligenza Artificiale (Introduction to Artificial Intelligence) - A.Y. 2012/2013
Lecture Outline
1. Feature Selection: Introduction, Objective Functions, Search Strategies
2. Clustering: Introduction, k-means Clustering, Advanced Clustering
3. Conclusion
Exploratory Data Analysis
- Discover structure in data
  - Find unknown patterns in the data that cannot be predicted using current expert knowledge
  - Formulate new hypotheses about the causes of the observed phenomena
- Finding informative attributes: Feature Extraction, Feature Selection
- Finding natural groups: Clustering
Feature Selection vs. Feature Extraction

Two approaches to dimensionality reduction:
- Feature Extraction - Create a new, lower-dimensional representation of the input data by transforming the existing features with a given function
- Feature Selection - Select a subset of the existing features without transforming the input data

Feature extraction generates new features by optimizing:
- Signal representation (PCA)
- Signal classification (LDA)

Feature selection looks at dimensionality reduction from a different perspective:
- Different methodologies (e.g. computational costs, ...)
- Different applications
Why Feature Selection?
The straightforward answer is: for several of the good reasons discussed previously for feature extraction:
- Reduce problem complexity
- Reduce noise
- Find good data representations

So, why not always use feature extraction?
- It can be computationally expensive to generate new features
- We don't want to transform the data, cancelling its semantics
- Data is not always numeric
- Data may fail to meet the theoretical assumptions underlying feature extraction
- Feature selection allows us to explicitly select good predictors, i.e. features that behave well on a specific supervised task
What About Projection?
Feature selection is rarely used for visualization
Nevertheless, it actually implements a projection
It is equivalent to projecting the data onto the lower-dimensional linear subspace perpendicular to the removed feature axes
Some Use Cases
Case I - Lung cancer
- Features are biomedical information or aspects of a patient's medical history
- Which features best predict whether lung cancer will develop?

Case II - Image Understanding
- Samples are images and features are pixels
- Which regions of an image are more likely to provide useful/discriminative information?

We seek a compact data representation that is interpretable and that tells us which features are relevant
Feature Selection
Definition - A process that chooses a D′-dimensional subset of all the features according to an objective function

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} \xrightarrow{\;\text{select } i_1,\ldots,i_{D'}\;} \mathbf{y} = \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_{D'}} \end{bmatrix}$$

Three characterizing aspects
- Subset ⇒ no creation of new features
- Objective function ⇒ measure of subset optimality
- Process ⇒ need a subset search strategy
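A minimal NumPy illustration of the definition above: selecting features is simply indexing the existing columns of the data matrix with the chosen indices (the indices below are an arbitrary example, not the output of a real selection process):

```python
import numpy as np

# Toy data matrix: N = 4 samples, D = 5 features
X = np.random.rand(4, 5)

# Hypothetical outcome of a feature selection process: keep features i1=0, i2=2, i3=4
selected = [0, 2, 4]

# No new features are created: y is just a column subset of x
Y = X[:, selected]
print(Y.shape)   # (4, 3), i.e. D' = 3
```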
Search Strategy and Objective Function
Feature selection requires:
- A search strategy to select candidate subsets
- An objective function to evaluate these candidates

Objective function - Evaluates candidate subsets and returns a measure of their optimality that is used to select new candidates

Search strategy - Exhaustive evaluation of all feature subsets is $O(2^D)$; we need a smart strategy to direct the subset selection process
Objective Functions - Two Approaches
- Filters - Evaluate feature subsets by their information content, e.g. interclass distance, statistical dependence or information-theoretic measures
- Wrappers - Use a pattern classifier which evaluates feature subsets by their predictive accuracy (e.g. recognition rate on validation data) using re-sampling or cross-validation
Filter Approach
Basic Idea - Assign a heuristic score to each feature to filter out the useless ones

- Separate the feature selection phase from pattern recognition/classifier learning
- Evaluate features by measuring their informative content:
  - Does the individual feature seem to help prediction?
  - Is it redundant?
  - Is it reliable?
- The scoring metric can be unsupervised or exploit supervised information
Measuring Feature Relevance
Distance metrics
- Measure separability with respect to some target attribute (e.g. a class)
- Select those features which have the largest separability

Correlation metrics
- Good features are highly correlated with the class but are uncorrelated with each other

Information-theoretic measures
- Measure the reduction in uncertainty (entropy) between a target variable (e.g. the class) and the feature
- E.g. mutual information (see the sketch below)

Consistency measures
- Find a minimum number of features that separate classes as consistently as the full set
- An inconsistency is two samples from different classes having the same feature values
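A small filter-style sketch of the mutual-information measure, assuming scikit-learn is available (the dataset and the cutoff K are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Labelled data: features X, class labels y
X, y = load_breast_cancer(return_X_y=True)

# Filter scoring: mutual information between each feature and the class
scores = mutual_info_classif(X, y, random_state=0)

# Rank by informative content and keep an (arbitrary) top-K cutoff
K = 5
ranking = np.argsort(scores)[::-1]
print("Top-%d features by mutual information:" % K, ranking[:K])
```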
Wrapper Approach
Basic Idea - Select those features that make the chosen supervised learning model perform best

- Feature selection relies on the preselected supervised learning model
- The learning model is re-trained each time a new subset is selected
- The quality of the feature subset is measured by the empirical error of the learned model (a sketch follows below)
- We need robust validation strategies to ensure generalization
- Selected features are, typically, effective class predictors for the model
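A minimal wrapper-style objective, assuming scikit-learn: each candidate subset is scored by the cross-validated accuracy of a classifier re-trained on those features only (the classifier and the example subsets are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def wrapper_score(feature_subset):
    """Wrapper objective: cross-validated accuracy of a classifier
    re-trained on the candidate feature subset."""
    clf = LogisticRegression(max_iter=5000)
    return cross_val_score(clf, X[:, feature_subset], y, cv=5).mean()

# Compare two illustrative candidate subsets
print(wrapper_score([0, 1, 2]))
print(wrapper_score([0, 5, 10, 20]))
```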
Filters vs. Wrappers (I)
Filter Approaches

Pros
- Fast execution - Non-iterative dataset processing which can execute much faster than classifier training
- Generality - Evaluate intrinsic properties of the data, rather than interactions with a particular classifier, hence solutions will be good for a larger family of classifiers

Cons
- Select large subsets - Since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution; this forces the user to choose an arbitrary cutoff on the number of features to be selected
Filters vs. Wrappers (II)
Wrapper Approaches

Pros
- Accuracy - Achieve better recognition rates since the selected features are tuned to the classifier
- Test generalization - Use cross-validation measures of predictive accuracy to avoid overfitting

Cons
- Slow - Must train a classifier for each feature subset
- Lack of generality - Solutions are tied to the bias of the classifier used in the evaluation function
Characterization of a Search Routine
Search starting point
- Empty set
- Full set
- Random subset

Search directions
- Sequential selection/elimination
- Random generation - Perform a randomized exploration of the search space where the next direction is sampled from a given probability distribution (e.g. genetic algorithms, simulated annealing, ...)

Search strategy
- Exhaustive/complete search
- Heuristic search
- Nondeterministic search
Sequential Feature Ranking
The simplest strategy:
- Weight and rank each feature
- Select the top-K ranked features
- No need for a subset search

Advantages
- O(D) complexity
- Easy to implement

Disadvantages
- How to determine the top-K threshold?
- Does not consider feature correlation (see the sketch below)
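A plain NumPy sketch of sequential feature ranking, using an absolute-correlation score as one possible weighting (the synthetic data and the choice of K are illustrative); note that each feature is scored in isolation, so redundant or correlated features are not detected:

```python
import numpy as np

def rank_features(X, y, k):
    """Score each feature independently (O(D)) by its absolute correlation
    with the target, then keep the indices of the top-k ranked features."""
    scores = np.array([abs(np.corrcoef(X[:, d], y)[0, 1]) for d in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k], scores

# Toy data: only the first two of five features carry class information
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     y + 0.5 * rng.normal(size=200),
                     rng.normal(size=(200, 3))])

top, scores = rank_features(X, y, k=2)
print(top)   # expected to contain features 0 and 1
```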
Sequential Search (I)
Greedy search algorithms that add or remove single features sequentially

Sequential Forward Selection
- Starting from the empty set, sequentially add the feature $x_d$ that results in the highest objective function $J(X_{D'} \cup \{x_d\})$ when combined with the features $X_{D'}$ that have already been selected (a sketch follows below)
- Performs best when the optimal subset has a small number of features
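A minimal sketch of sequential forward selection, assuming a generic scoring function J is supplied (for instance the wrapper_score defined earlier, or any filter measure); score_subset is a placeholder name:

```python
def sequential_forward_selection(n_features, target_size, score_subset):
    """Greedy SFS: grow the selected set one feature at a time,
    always adding the feature that maximizes the objective J."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < target_size:
        # Evaluate J(selected + {d}) for every remaining candidate feature d
        best = max(remaining, key=lambda d: score_subset(selected + [d]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example usage with the wrapper objective sketched earlier:
# best_subset = sequential_forward_selection(X.shape[1], 5, wrapper_score)
```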
Sequential Search (II)
Sequential Backward Elimination
- Starting from the full set, sequentially remove the feature $x_d \in X_{D'}$ that results in the smallest decrease of $J(X_{D'} \setminus \{x_d\})$
- Performs best when the optimal subset has a large number of features

Bi-directional Search
- Forward and backward searches are performed in parallel
- They are forced to converge to the same solution by ensuring that:
  - Features selected in the forward pass are not removed by the backward search
  - Features removed in the backward search are not selected in the forward pass
Search Strategies Cookbook
Unsupervised Feature Selection
- So far, feature selection approaches seem to exploit only relevance measures relying on supervised class information
- Is there any unsupervised feature selection approach? The answer is... yes (surprised?)
- Feature redundancy and relevance is measured with respect to clusters instead of classes
Definition (Clustering)
The process of organizing objects into natural groups whose members are similar in a way determined by a given metric
Clustering = Seeking a natural grouping
What is a natural grouping?
[Figure: the same individuals grouped in different natural ways - by family, by school, by females/males]
Credit goes to Eamonn Keogh @ UCR
The Clustering Problem (I)
Clustering objective
- Maximizing intra-cluster similarity
- Minimizing inter-cluster similarity

Prototype-based clustering
- Find a representative $c_i$ (prototype) for each cluster $i$
- Minimize the distance w.r.t. the cluster members

Results depend on the choice of the distance metric (see the sketch below):
- Euclidean: $\|c_i - x_n\|^2$
- Mahalanobis: $(c_i - x_n)^T S^{-1} (c_i - x_n)$
- ...
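A small NumPy illustration of the two metrics listed above, assuming $S$ is the covariance matrix of the data (the toy data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [0.8, 0.5]])  # correlated 2-D data
c = X.mean(axis=0)   # a cluster prototype
x = X[0]             # a sample

# Squared Euclidean distance ||c - x||^2
d_euclidean = np.sum((c - x) ** 2)

# Mahalanobis distance (c - x)^T S^{-1} (c - x), with S the data covariance
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = (c - x) @ S_inv @ (c - x)

print(d_euclidean, d_mahalanobis)
```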
The Clustering Problem (II)
It is not always easy to define what is a cluster and what is not
Hierarchical Clustering
- Prototype-based
- Finds a hierarchy of nested clusters
- Useful to convey a multi-resolution view of the data, e.g. bio-medical data:
  - Clusters ⇒ tumors
  - Sub-clusters ⇒ tumor subtypes
The Clustering Problem (III)

As we have seen with the Swiss roll in feature extraction, some data might lie on particularly nasty surfaces

Prototype-based clustering may not be effective with non-linearly separable clusters
k-means Clustering

- The simplest clustering algorithm
- Lots of limitations
- Still widely used (easy to understand and implement)

The algorithm in brief
- Receives as input k, i.e. the number of clusters to seek in the data
- Starts by picking k points at random as cluster prototypes (centroids)
- Repeatedly assigns each sample to the nearest prototype and recomputes the cluster centroids
The Algorithm
Algorithm k-means(X, K)
  N ← |X|
  for i = 1 to K do
    c[i] ← rand()                          {initialize prototypes at random}
  end for
  repeat                                   {update prototypes}
    cnew ← 0; tot ← 0
    for n = 1 to N do
      winner ← argmin_{i=1,...,K} ||c[i] − x_n||²
      tot[winner] ← tot[winner] + 1
      cnew[winner] ← cnew[winner] + x_n
    end for
    for i = 1 to K do
      c[i] ← cnew[i] / tot[i]              {new prototype = mean of assigned samples}
    end for
  until MaxIterations reached or the prototypes change by less than ε
  return cluster prototypes c
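A runnable NumPy sketch of the same procedure (variable names mirror the pseudocode; the initialization picks K samples from X as prototypes, a common practical variant of random initialization):

```python
import numpy as np

def k_means(X, K, max_iterations=100, eps=1e-6, seed=0):
    """Plain k-means: alternate nearest-prototype assignment and centroid update."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].copy()   # initial prototypes
    for _ in range(max_iterations):
        # Assignment step: index of the nearest prototype for every sample
        dist = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        winner = dist.argmin(axis=1)
        # Update step: each prototype becomes the mean of its assigned samples
        c_new = np.array([X[winner == i].mean(axis=0) if np.any(winner == i) else c[i]
                          for i in range(K)])
        if np.linalg.norm(c - c_new) < eps:                          # convergence check
            return c_new
        c = c_new
    return c

# Example usage on toy 2-D data with three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
print(k_means(X, K=3))
```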
k-Means Example (I)
Input data X
k-Means Example (II)
Initialize K = 3 prototypes
k-Means Example (III)
Assign samples to the nearest prototype/cluster
k-Means Example (IV)
Compute the updated prototypes $c'_i$
k-Means Example (V)
Update cluster assignment
Cluster number estimate
- In practice, we seldom know the cluster number K
- Trial-and-error to find the most suitable K
- Use algorithms that estimate the cluster number

Several approaches, but no killer algorithm:
- Agglomerative Clustering - Start with every sample being a separate cluster and iteratively aggregate existing clusters until an error criterion is satisfied
- Partitional Clustering - Start with a single cluster and keep splitting it until an error criterion is satisfied

The error criterion is often composed of (see the sketch below):
- A similarity term favoring the aggregation of samples into clusters
- A penalization term discouraging the creation of too many clusters
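An illustrative composite criterion, reusing the k_means function sketched earlier: the within-cluster distortion plays the role of the similarity term and a term proportional to the cluster count plays the role of the penalization (the lam weight below is an arbitrary choice, not prescribed by the slides):

```python
import numpy as np

def clustering_error(X, prototypes, lam=0.1):
    """Composite criterion: within-cluster distortion + penalty on the cluster count."""
    dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    distortion = dist.min(axis=1).mean()   # similarity term: mean distance to the nearest prototype
    penalty = lam * len(prototypes)        # penalization term: discourages too many clusters
    return distortion + penalty

# Trial-and-error over candidate cluster numbers, keeping the K with the lowest error
# (assumes X and k_means from the previous example)
errors = {K: clustering_error(X, k_means(X, K)) for K in range(1, 8)}
print(min(errors, key=errors.get), errors)
```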
Clustering and Feature Selection
Cluster labels can be used for performing feature selection when class information is not available

Filter approach (sketched below)
- Step 1 - Partition the data using a clustering algorithm
- Step 2 - Run a filter model evaluating feature relevance w.r.t. how much an attribute helps in separating the clusters

Wrapper approach
- Step 1 - Select a set of features
- Step 2 - Evaluate whether they optimize the error criterion set by the clustering algorithm
Feature selection can be used to help clustering algorithms!
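A minimal sketch of the filter variant, assuming scikit-learn and reusing the earlier k_means function and data X: the cluster assignments play the role of the missing class labels when scoring feature relevance (the mutual-information scorer is just one possible choice):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Step 1: partition the data; cluster assignments act as pseudo-class labels
prototypes = k_means(X, K=3)
dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
cluster_labels = dist.argmin(axis=1)

# Step 2: filter scoring of feature relevance w.r.t. cluster separation
scores = mutual_info_classif(X, cluster_labels, random_state=0)
print("Features ranked by relevance to the clustering:", np.argsort(scores)[::-1])
```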
Take-home Messages
Feature selection
- Extracts a subset of informative input features
- Preserves the data semantics
- Faster than feature extraction

Two main strategies
- Filters - Separate feature selection from pattern recognition
- Wrappers - Select features that work best with a given learning model

Clustering
- Finding natural groups in the data
- Cluster number estimation ⇒ no definitive answer
- Can be used to support feature selection when class information is not available