[Slide 1] (source: gbrown/publications/mcs2010plenary-microsoft.pdf)
Some Thoughts at the Interface of Ensemble Methods and Feature Selection
Gavin Brown, University of Manchester, UK
Cairo Microsoft Innovation Centre; adapted (slightly) from MCS 2010
[Slide 2]
What type of document is this?
- potentially thousands of "features"
- highly spatially correlated, often noisy
[Slide 3]
Statistics of High-dimensional data
In data mining, a large number of features is a big problem:
- overfitting (the model only works on in-sample data)
- computationally expensive
- low interpretability of the final model
- in certain domains, features cost money!
[Slide 4]
Handling Hi-D data?
Feature Extraction
S = f(Ω), usually S = w^T Ω.
With extraction (e.g. PCA), we lose the meaning of the original features.

Feature Selection
S ⊆ Ω.
With selection we identify meaningful subsets of the originals.
Combinatorial optimization over the space of possible feature subsets.
[Slide 5]
Feature Selection - “Filters”
PROCEDURE: FILTER
Input: large feature set Ω
Returns: useful feature subset S ⊆ Ω
10 Identify candidate subset S ⊆ Ω
20 While !stop_criterion()
       Evaluate relevancy index J using S.
       Adapt subset S.
30 Return S.

Pro: generic to any feature set, and fast!
Con: task-specific design is an open problem.
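As a rough sketch (not from the slides), the filter loop above can be written as greedy forward selection; `relevancy_index` stands in for any criterion J and is an assumed interface:

```python
def filter_select(features, relevancy_index, k):
    """Greedy forward-selection filter: repeatedly add whichever
    remaining feature maximises the relevancy index J, until |S| = k
    (the stop criterion assumed here)."""
    selected = []                      # S, the growing subset
    remaining = list(features)         # candidates from Omega
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: relevancy_index(f, selected))
        selected.append(best)          # adapt subset S
        remaining.remove(best)
    return selected
```

Any of the criteria on the later slides (MIM, MIFS, MRMR, ...) can be plugged in as `relevancy_index`.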
[Slide 6]
How ’relevant’ is any given feature?
Y is a human-labeled document set.
X is the presence/absence of a particular image feature.

I(X;Y) - a measure of dependence between feature X and target Y.
Zero when X is independent of Y; increases as X and Y become dependent.
Defined as a KL-divergence:

I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(xy) log [ p(xy) / ( p(x) p(y) ) ]
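A minimal plug-in estimator of this quantity from paired samples (my own sketch, not from the slides; natural log, so the result is in nats):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) = sum over (x,y) of p(x,y) * log[ p(x,y) / (p(x)p(y)) ],
    with all probabilities estimated by simple counting."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent -> 0.0
```

With identical inputs the estimate equals H(X), e.g. log 2 for a fair binary variable.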
[Slide 7]
Filters using Mutual Information
Rank features Xi, ∀i, by their values of J = I(Xi;Y).
Retain the highest ranked features, discard the lowest ranked.
  i  | J(Xi)
-----|------
  35 | 0.846
  42 | 0.811
  10 | 0.810
 654 | 0.611
  22 | 0.443
  59 | 0.388
 ... | ...
 212 | 0.09
  39 | 0.05

Cut-off point decided by the user, e.g. |S| = 5, so S = {35, 42, 10, 654, 22}.
Problem: F42 and F10 are almost identical!
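For concreteness, the ranking step can be sketched in a few lines, using the scores from the table above:

```python
# Univariate ranking: score each feature by J(Xi) = I(Xi;Y),
# then keep the top |S| = 5 (scores taken from the slide's table).
J = {35: 0.846, 42: 0.811, 10: 0.810, 654: 0.611, 22: 0.443,
     59: 0.388, 212: 0.09, 39: 0.05}

ranked = sorted(J, key=J.get, reverse=True)   # feature indices, best first
S = ranked[:5]
print(S)   # [35, 42, 10, 654, 22]
```

Note the ranking happily keeps both F42 and F10, which is exactly the redundancy problem just raised.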
[Slide 8]
A Problem: Design of Filter Criteria
Q. What is the relevance of feature Xi?
J_mi(Xi) = I(Xi;Y)
"its own mutual information with the target"

J_mifs(Xi) = I(Xi;Y) − ∑_{Xk∈S} I(Xi;Xk)
"as above, but penalised by correlations with features already chosen"

J_mrmr(Xi) = I(Xi;Y) − (1/|S|) ∑_{Xk∈S} I(Xi;Xk)
"as above, but averaged, smoothing out noise"

J_jmi(Xi) = ∑_{Xk∈S} I(XiXk;Y)
"how well it pairs up with other features chosen"
[Slide 9]
The Confusing Literature of Feature Selection Land
Criterion | Full name                               | Author
----------|-----------------------------------------|--------------------
MI        | Mutual Information Maximisation         | Various (1970s - )
MIFS      | Mutual Information Feature Selection    | Battiti (1994)
JMI       | Joint Mutual Information                | Yang & Moody (1999)
MIFS-U    | MIFS-'Uniform'                          | Kwak & Choi (2002)
IF        | Informative Fragments                   | Vidal-Naquet (2003)
FCBF      | Fast Correlation Based Filter           | Yu et al (2004)
CMIM      | Conditional Mutual Info Maximisation    | Fleuret (2004)
JMI-AVG   | Averaged Joint Mutual Information       | Scanlon et al (2004)
MRMR      | Max-Relevance Min-Redundancy            | Peng et al (2005)
ICAP      | Interaction Capping                     | Jakulin (2005)
CIFE      | Conditional Infomax Feature Extraction  | Lin & Tang (2006)
DISR      | Double Input Symmetrical Relevance      | Meyer (2006)
MINRED    | Minimum Redundancy                      | Duch (2006)
IGFS      | Interaction Gain Feature Selection      | El-Akadi (2008)
MIGS      | Mutual Information Based Gene Selection | Cai et al (2009)
Why should we trust any of these? How do they relate?
[Slide 10]
The Land of Feature Selection: A Summary
Problem: construct a useful set of features
- Need features to be relevant and not redundant.
Accepted research practice: invent heuristic measures
- Encouraging "relevant" features
- Discouraging correlated features
Sound familiar? For ‘feature’ above, read ‘classifier’...
[Slide 11]
The “Duality” of MCS and FS
[Diagrams: an ensemble of classifiers, each reading the feature values (object description), feeding a combiner that outputs the class label. Successive builds annotate the same picture: the combiner is itself a classifier, and each member classifier acts as "a fancy feature extractor". © L.I. Kuncheva, ICPR Plenary 2008]
[Slide 12]
Regression Ensembles
Loss function: (f(x) − y)²

Combiner function: f(x) = (1/M) ∑_{i=1}^{M} f_i(x)

Method: take the objective function, decompose it into constituent parts.

(f − y)² = (1/M) ∑_{i=1}^{M} (f_i − y)² − (1/M) ∑_{i=1}^{M} (f_i − f)²
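This identity (the "ambiguity decomposition") can be checked numerically; the member predictions below are made-up numbers for illustration:

```python
# Check: (f_bar - y)^2 = avg_i (f_i - y)^2 - avg_i (f_i - f_bar)^2
preds = [2.0, 3.5, 1.5]      # hypothetical member outputs f_i(x)
y = 2.5                      # target
M = len(preds)
f_bar = sum(preds) / M       # simple-average combiner

avg_err = sum((f - y) ** 2 for f in preds) / M         # mean individual error
ambiguity = sum((f - f_bar) ** 2 for f in preds) / M   # spread around the mean
ensemble_err = (f_bar - y) ** 2

assert abs(ensemble_err - (avg_err - ambiguity)) < 1e-12
```

The ensemble error is the average individual error minus the spread of the members: disagreement (on top of individual accuracy) directly reduces the loss.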
[Slide 13]
An MCS native visits the Land of Feature Selection
Loss function: I(F;Y)

'Combiner' function: F = X_{1:M} (joint random variable)

Method: take the objective function, decompose it into constituent parts.

I(X_{1:M};Y) = ∑_{∀i} I(Xi;Y)
             + ∑_{∀i,j} I(Xi,Xj,Y)
             + ∑_{∀i,j,k} I(Xi,Xj,Xk,Y)
             + ∑_{∀i,j,k,l} I(Xi,Xj,Xk,Xl,Y)
             + ...

Multiple "levels" of correlation!
Each term is a multivariate mutual information (McGill, 1954).
[Slide 14]
Linking theory to heuristics....
Take only the terms involving the Xi we want to evaluate - an exact expression:

I(Xi;Y|S) = I(Xi;Y) − ∑_{k∈S} I(Xi;Xk) + ∑_{k∈S} I(Xi;Xk|Y) + ∑_{j,k∈S} I(Xi,Xj,Xk,Y) + ...

J_mi = I(Xi;Y)

J_mifs = I(Xi;Y) − ∑_{k∈S} I(Xi;Xk)

J_mrmr = I(Xi;Y) − (1/|S|) ∑_{k∈S} I(Xi;Xk)

...and others can be re-written to this form:

J_jmi = ∑_{k∈S} I(XiXk;Y)
      = I(Xi;Y) − (1/|S|) ∑_{k∈S} I(Xi;Xk) + (1/|S|) ∑_{k∈S} I(Xi;Xk|Y)
[Slide 15]
A “Template” Criterion
J_mifs = I(Xi;Y) − ∑_{k∈S} I(Xi;Xk)

J_mrmr = I(Xi;Y) − (1/|S|) ∑_{k∈S} I(Xi;Xk)

J_jmi = I(Xi;Y) − (1/|S|) ∑_{k∈S} I(Xi;Xk) + (1/|S|) ∑_{k∈S} I(Xi;Xk|Y)

J_cife = I(Xi;Y) − ∑_{k∈S} I(Xi;Xk) + ∑_{k∈S} I(Xi;Xk|Y)

J_cmim = I(Xi;Y) − max_k [ I(Xi;Xk) − I(Xi;Xk|Y) ]

Template: J = I(Xn;Y) − β ∑_{∀k∈S} I(Xn;Xk) + γ ∑_{∀k∈S} I(Xn;Xk|Y)
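The template collapses to a single parameterised function. A sketch: `rel`, `red`, and `cred` are assumed dictionaries of precomputed mutual-information values (my interface, not from the slides):

```python
def J_template(i, S, rel, red, cred, beta, gamma):
    """J = I(Xi;Y) - beta * sum_k I(Xi;Xk) + gamma * sum_k I(Xi;Xk|Y)
    rel[i]     : I(Xi;Y)      relevancy
    red[i, k]  : I(Xi;Xk)     redundancy
    cred[i, k] : I(Xi;Xk|Y)   conditional redundancy
    """
    return (rel[i]
            - beta * sum(red[i, k] for k in S)
            + gamma * sum(cred[i, k] for k in S))

# Named criteria as (beta, gamma) settings, per the slides:
#   MIM:  beta = 0,     gamma = 0
#   MIFS: beta = 1,     gamma = 0
#   MRMR: beta = 1/|S|, gamma = 0
#   JMI:  beta = 1/|S|, gamma = 1/|S|
#   CIFE: beta = 1,     gamma = 1
```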
[Slide 16]
The β/γ Space of Possible Criteria
[Plot: the (β, γ) unit square, with criteria marked at their (β, γ) settings: MIM; MIFS / MRMR n=2; MRMR n=3; MRMR n=4; FOU / CIFE; JMI n=2; JMI n=3; JMI n=4.]

I(X_{1:M};Y) ≈ I(Xn;Y) − β ∑_k I(Xn;Xk) + γ ∑_k I(Xn;Xk|Y)

where I(Xn;Y) is the "relevancy", ∑_k I(Xn;Xk) the "redundancy", and ∑_k I(Xn;Xk|Y) the "conditional redundancy".
[Slide 17]
The β/γ Space of Possible Criteria
I(X_{1:M};Y) ≈ I(Xn;Y) − β ∑_k I(Xn;Xk) + γ ∑_k I(Xn;Xk|Y)

where I(Xn;Y) is the "relevancy", ∑_k I(Xn;Xk) the "redundancy", and ∑_k I(Xn;Xk|Y) the "conditional redundancy".
[Slide 18]
Exploring β/γ space
Figure 3: ARCENE data (cancer diagnosis).
Figure 4: GISETTE data (handwritten digit recognition).
[Slide 19]
Exploring β/γ space
Figure 3: ARCENE data (cancer diagnosis).
Figure 4: GISETTE data (handwritten digit recognition).
[Slide 20]
Exploring β/γ space
Seems straightforward? Just use the diagonal? The top right corner?
But remember: these are only the low-order components.

It is easy to construct problems that have ZERO in the low-order terms, and positive terms in the high orders.

- e.g. data drawn from a Bayesian net with some nodes exhibiting deterministic behavior (e.g. the parity problem).
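The parity point can be made concrete (my own sketch): for Y = X1 XOR X2 with independent uniform inputs, each feature alone carries zero information about Y, while the pair determines it completely:

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits (base-2 log)."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Parity: Y = X1 XOR X2, over all four equally likely input pairs.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

print(mi(x1, y))                 # 0.0 : each feature alone is useless
print(mi(list(zip(x1, x2)), y))  # 1.0 : jointly they determine Y exactly
```

Any criterion built only from low-order terms such as I(Xi;Y) would rank both parity features as worthless.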
[Slide 21]
ARCENE data
3-nn classifier (left), and SVM (right).
[Slide 22]
Image Segment data
3-nn classifier (left), and SVM (right).
Pink line ('UnitSquare') is the top right corner of β/γ space.
[Slide 23]
GISETTE data
Low-order components insufficient... heuristics can triumph over theory!
[Slide 24]
Exports & Imports
Exported a perspective from the Land of MCS...
...solved an open problem in the Land of FS.
But could the MCS natives also learn from this?
[Slide 25]
Committee members should be ‘diverse’
[Diagram: the ensemble picture as before; classifiers reading feature values, a combiner producing the class label. © L.I. Kuncheva, ICPR Plenary 2008]
[Slide 26]
Exports & Imports : Understanding Ensemble Diversity
(Step 1) Take an objective function...
The log-likelihood, for an ensemble combiner g with M members:

L = E_{xy}[ log g(y|φ_{1:M}) ]

(Step 2) ...decompose into constituent parts.

L = const + I(φ_{1:M};Y) − KL( p(y|x) || g(y|φ_{1:M}) )

where the I(·) term involves the ensemble members and the KL term involves the combiner.
"Information Theoretic Views of Ensemble Learning". G. Brown, Manchester MLO Tech Report, Feb 2010.
[Slide 27]
Exports & Imports : Understanding Ensemble Diversity
I(X_{1:M};Y) ≈ ∑_{i=1}^{M} I(Xi;Y) − ∑_{j=1}^{M} ∑_{k=j+1}^{M} I(Xj;Xk) + ∑_{j=1}^{M} ∑_{k=j+1}^{M} I(Xj;Xk|Y)

The first sum is the "relevancy"; the remaining sums are the "diversity".
“An Information Theoretic Perspective on Multiple Classifier Systems”, MCS 2009.
[Slide 28]
Exports & Imports : Understanding Ensemble Diversity
[Plots for Bagging and Adaboost, each annotated with the individual relevancy term ∑_{i=1}^{M} I(Xi;Y).]
"An Information Theoretic Perspective on Multiple Classifier Systems", MCS 2009.
Zhou & Li, "Multi-Information Ensemble Diversity" (tomorrow, 9.45am!)
[Slide 29]
Exports & Imports : Understanding Ensemble Diversity
[Plots: test error vs number of classifiers (0 to 30), for "Adaboost : breast" (left) and "RandomForests : breast" (right); the test-error axis runs from 0 to 0.2.]
(ongoing work with Zhi-Hua Zhou)
[Slide 30]
Exports & Imports : Understanding Ensemble Diversity
[Plots: "Individual MI" and "Ensemble Gain" vs number of classifiers (0 to 30), for "Adaboost : breast (r = −0.94079)" (left) and "RandomForests : breast (r = −0.98425)" (right).]
(ongoing work with Zhi-Hua Zhou)
[Slide 31]
Conclusions
It's getting really hard to contribute meaningful research to MCS... and to ML/PR in general!

- I'm starting to look at importing ideas from other fields
- Information Theory seems natural
- Unified framework for feature selection
- Deeper understanding of committee learning methods