# Feature Selection as Relevant Information Encoding

Naftali Tishby
School of Computer Science and Engineering
The Hebrew University, Jerusalem, Israel

NIPS 2001
## Many Thanks To

- Noam Slonim
- Amir Globerson
- Bill Bialek
- Fernando Pereira
- Nir Friedman
## Feature Selection?

- NOT generative modeling!
  - no assumptions about the source of the data
- Extracting relevant structure from data
  - functions of the data (statistics) that preserve information
- Information about what?
- Approximate sufficient statistics
- We need a principle that is both general and precise.
  - Good principles survive longer!
## A Simple Example...

|       | Israel | Health | www | Drug | Jewish | Dos | Doctor | ... |
|-------|--------|--------|-----|------|--------|-----|--------|-----|
| Doc1  | 12     | 0      | 0   | 0    | 8      | 0   | 0      | ... |
| Doc2  | 0      | 9      | 2   | 11   | 1      | 0   | 6      | ... |
| Doc3  | 0      | 10     | 1   | 6    | 0      | 0   | 20     | ... |
| Doc4  | 9      | 1      | 0   | 0    | 7      | 0   | 1      | ... |
| Doc5  | 0      | 3      | 9   | 0    | 1      | 10  | 0      | ... |
| Doc6  | 1      | 11     | 0   | 6    | 0      | 1   | 7      | ... |
| Doc7  | 0      | 0      | 8   | 0    | 2      | 12  | 2      | ... |
| Doc8  | 15     | 0      | 1   | 1    | 10     | 0   | 0      | ... |
| Doc9  | 0      | 12     | 1   | 16   | 0      | 1   | 12     | ... |
| Doc10 | 1      | 0      | 9   | 0    | 1      | 11  | 2      | ... |
| ...   | ...    | ...    | ... | ...  | ...    | ... | ...    | ... |
## Simple Example

|       | Israel | Jewish | Health | Drug | Doctor | www | Dos | ... |
|-------|--------|--------|--------|------|--------|-----|-----|-----|
| Doc1  | 12     | 8      | 0      | 0    | 0      | 0   | 0   | ... |
| Doc4  | 9      | 7      | 1      | 0    | 1      | 0   | 0   | ... |
| Doc8  | 15     | 10     | 0      | 1    | 0      | 1   | 0   | ... |
| Doc2  | 0      | 1      | 9      | 11   | 6      | 2   | 0   | ... |
| Doc3  | 0      | 0      | 10     | 6    | 20     | 1   | 0   | ... |
| Doc6  | 1      | 0      | 11     | 6    | 7      | 0   | 1   | ... |
| Doc9  | 0      | 0      | 12     | 16   | 12     | 1   | 1   | ... |
| Doc5  | 0      | 1      | 3      | 0    | 0      | 9   | 10  | ... |
| Doc7  | 0      | 2      | 0      | 0    | 2      | 8   | 12  | ... |
| Doc10 | 1      | 1      | 0      | 0    | 2      | 9   | 11  | ... |
| ...   | ...    | ...    | ...    | ...  | ...    | ... | ... | ... |
## A New Compact Representation

|          | Israel | Jewish | Health | Drug | Doctor | www | Dos | ... |
|----------|--------|--------|--------|------|--------|-----|-----|-----|
| Cluster1 | 36     | 25     | 1      | 1    | 1      | 1   | 0   | ... |
| Cluster2 | 1      | 1      | 42     | 39   | 45     | 4   | 2   | ... |
| Cluster3 | 1      | 4      | 3      | 0    | 4      | 26  | 33  | ... |
| ...      | ...    | ...    | ...    | ...  | ...    | ... | ... | ... |

The document clusters preserve the relevant information between the documents and the words.
[Figure: documents $X = \{x_1, x_2, \ldots, x_n\}$ are compressed into clusters $C_x = \{c_1, c_2, \ldots, c_k\}$ that preserve the mutual information $I(C_x; Y)$ with the words $Y = \{y_1, y_2, \ldots, y_m\}$.]
## Mutual Information

How much does $X$ tell about $Y$?

$I(X;Y)$ is a function of the joint probability distribution $p(x,y)$: the minimal number of yes/no questions (bits) we need to ask about $x$ in order to learn all we can about $Y$.

Uncertainty removed about $X$ when we know $Y$:

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

[Figure: Venn diagram of $H(X|Y)$, $I(X;Y)$, and $H(Y|X)$.]
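A minimal sketch of this definition (ours, not from the talk), assuming the joint distribution is given as a discrete NumPy array; the helper name `mutual_information` is illustrative:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a discrete joint distribution p(x, y).

    p_xy is assumed to be a 2-D array of probabilities summing to 1,
    rows indexed by x and columns by y.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # 0 log 0 = 0 by convention
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())
```

Applied to the word-document counts above (normalized to a joint distribution), this measures how many bits the document identity carries about the words.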
## Relevant Coding

- What are the questions that we need to ask about $X$ in order to learn about $Y$?
- We need to partition $X$ into relevant domains, or clusters, between which we really need to distinguish...

[Figure: the conditional distributions $P(x|y_1)$ and $P(x|y_2)$ over $X$ define the relevant domains $X|y_1$ and $X|y_2$ for two values $y_1, y_2$ of $Y$.]
## Bottlenecks and Neural Nets

- Auto-association: forcing compact representations
- $\hat X$ is a relevant code of $X$ w.r.t. $Y$

[Figure: bottleneck network mapping input $X$ through the compact code $\hat X$ to output $Y$; the pair $(X, Y)$ may be input/output, sample 1/sample 2, or past/future.]
- Q: How many bits are needed to determine the relevant representation $p(\hat x|x)$?
  - We need to index the maximal number of non-overlapping "green blobs" (each of size $2^{H(X|\hat X)}$) inside the "blue blob" (of size $2^{H(X)}$):

$$\frac{2^{H(X)}}{2^{H(X|\hat X)}} = 2^{H(X) - H(X|\hat X)} = 2^{I(X;\hat X)}$$

(mutual information!)
- The idea: find a compressed signal $\hat X$ that needs a short encoding (small $I(X;\hat X)$) while preserving as much as possible of the information on the relevant signal ($I(\hat X;Y)$).

[Figure: Markov chain $X \to \hat X \to Y$ with encoder $p(\hat x|x)$, marginal $p(\hat x)$, and decoder $p(y|\hat x)$; $I(\hat X;Y)$ is compared against the original $I(X;Y)$.]
## A Variational Principle

We want a short representation of $X$ that keeps the information about another variable, $Y$, if possible:

- compression: $I_{in}(X;\hat X)$
- relevance: $I_{out}(\hat X;Y)$

$$\mathcal{L}\big[p(\hat x|x)\big] = I_{in}(X;\hat X) - \beta\, I_{out}(\hat X;Y)$$
## The Self-Consistent Equations

- Marginal: $p(\hat x) = \sum_x p(\hat x|x)\, p(x)$
- Markov condition: $p(y|\hat x) = \sum_x p(y|x)\, p(x|\hat x)$
- Bayes' rule: $p(x|\hat x) = \dfrac{p(\hat x|x)\, p(x)}{p(\hat x)}$

Setting the variation to zero, $\dfrac{\delta \mathcal{L}[p(\hat x|x)]}{\delta p(\hat x|x)} = 0$, gives

$$p(\hat x|x) = \frac{p(\hat x)}{Z(x,\beta)}\, \exp\!\big(-\beta\, D_{KL}[x, \hat x]\big)$$
## The Emerged Effective Distortion Measure

$$D_{KL}\big[p(y|x)\,\big\|\,p(y|\hat x)\big] = \sum_y p(y|x)\, \log \frac{p(y|x)}{p(y|\hat x)}$$

- Regular if $p(y|x)$ is absolutely continuous w.r.t. $p(y|\hat x)$
- Small if $\hat x$ predicts $y$ as well as $x$ does

[Figure: $x$ is mapped to $\hat x$ through $p(\hat x|x)$; the distortion compares the two predictions of $y$, $p(y|x)$ and $p(y|\hat x)$.]
## The Iterative Algorithm (Generalized Blahut-Arimoto)

[Figure: the generalized BA algorithm cycles among the three distributions $p(\hat x)$, $p(\hat x|x)$, and $p(y|\hat x)$.]
## The Information Bottleneck Algorithm

$$\min_{p(y|\hat x)}\; \min_{p(\hat x)}\; \min_{p(\hat x|x)}\; \Big( I(X;\hat X) + \beta\, \big\langle D_{KL}\big[p(y|x)\,\big\|\,p(y|\hat x)\big]\big\rangle \Big) \;=\; -\big\langle \log Z(x,\beta) \big\rangle \qquad \text{("free energy")}$$

The alternating updates, with iteration index $t$:

$$p_t(\hat x|x) = \frac{p_t(\hat x)}{Z_t(x,\beta)}\, \exp\!\big(-\beta\, D_{KL}\big[p(y|x)\,\big\|\,p_t(y|\hat x)\big]\big)$$

$$p_{t+1}(\hat x) = \sum_x p(x)\, p_t(\hat x|x)$$

$$p_{t+1}(y|\hat x) = \sum_x p(y|x)\, p_t(x|\hat x)$$
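A minimal runnable sketch of these alternating updates (ours, not from the talk), assuming `p_xy` is a discrete joint distribution such as the normalized document-word counts above; `ib_iterate`, the random initialization, and the fixed iteration count are illustrative choices:

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Alternate the three self-consistent IB updates (generalized BA)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                 # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]      # rows: p(y|x)

    # Random soft initialization of the encoder p(x^|x).
    p_c_given_x = rng.random((n_x, n_clusters))
    p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # p(x^) = sum_x p(x) p(x^|x)
        p_c = p_x @ p_c_given_x
        # Bayes' rule: p(x|x^), then Markov condition: p(y|x^)
        p_x_given_c = p_c_given_x * p_x[:, None] / (p_c + 1e-12)
        p_y_given_c = p_x_given_c.T @ p_y_given_x
        # Effective distortion D_KL[p(y|x) || p(y|x^)] for every (x, x^).
        log_ratio = (np.log(p_y_given_x + 1e-12)[:, None, :]
                     - np.log(p_y_given_c + 1e-12)[None, :, :])
        d_kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # Encoder update: p(x^|x) = p(x^) exp(-beta D_KL) / Z(x, beta).
        p_c_given_x = p_c * np.exp(-beta * d_kl)
        p_c_given_x /= p_c_given_x.sum(axis=1, keepdims=True)

    return p_c_given_x, p_y_given_c
```

For small $\beta$ the encoder stays nearly uniform (maximal compression); as $\beta$ grows it becomes increasingly deterministic and preserves more of $I(\hat X;Y)$.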
## The Information Plane

In the information plane, the optimal $I(\hat X;Y)$ for a given $I(\hat X;X)$ is a concave function, with slope

$$\frac{\delta I(\hat X;Y)}{\delta I(\hat X;X)} = \beta^{-1}$$

[Figure: normalized information plane, $I(\hat X;X)/H(X)$ on the horizontal axis and $I(\hat X;Y)/I(X;Y)$ on the vertical axis; the concave information curve separates the possible phase below from the impossible region above.]
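To trace this curve numerically, one can sweep $\beta$ and record both informations at the converged solution; a sketch reusing `ib_iterate` and `mutual_information` from the earlier sketches (the $\beta$ grid and cluster count are arbitrary choices):

```python
import numpy as np

betas = np.geomspace(0.5, 50, 20)          # arbitrary annealing grid
curve = []
for beta in betas:
    enc, dec = ib_iterate(p_xy, n_clusters=3, beta=beta)
    p_x = p_xy.sum(axis=1)
    p_c = p_x @ enc
    i_xc = mutual_information(enc * p_x[:, None])  # I(X; X^) from joint p(x, x^)
    i_cy = mutual_information(dec * p_c[:, None])  # I(X^; Y) from joint p(x^, y)
    # The curve's slope is 1/beta when both informations use the same units.
    curve.append((i_xc, i_cy))
```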
## Manifold of Relevance

The self-consistent equations, in log form:

$$\log p(\hat x|x) = \log p(\hat x) - \log Z(x,\beta) - \beta \sum_y p(y|x)\, \log\frac{p(y|x)}{p(y|\hat x)}$$

Assuming a continuous manifold for $\hat X$, differentiating with respect to $\hat x$ yields coupled (local in $\hat x$) eigenfunction equations relating $\frac{\partial}{\partial \hat x}\log p(x|\hat x)$ and $\frac{\partial}{\partial \hat x}\log p(y|\hat x)$, with $\beta$ as an eigenvalue.
## Document Classification: Information Curves

[Figure: information curves for document classification.]
## Multivariate Information Bottleneck

- Complex relationships between many variables
- Multiple unrelated dimensionality-reduction schemes
- Trade between known and desired dependencies
- Express IB in the language of graphical models
- Multivariate extension of Rate-Distortion Theory
## Multivariate Information Bottleneck: Extending the Dependency Graphs

$$\min_{p(T|x),\, p(x|T),\, p(T)}\; \mathcal{L} = I^{G_{in}}(X,T) - \beta\, I^{G_{out}}(T,Y)$$

where $\tilde I$ denotes the multi-information of a set of variables:

$$\tilde I(X_1, X_2, \ldots, X_n) = \sum_{x_1,\ldots,x_n} p(x_1,\ldots,x_n)\, \log \frac{p(x_1,\ldots,x_n)}{p(x_1)\, p(x_2) \cdots p(x_n)}$$
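A small sketch of this quantity for a discrete joint distribution stored as an $n$-dimensional array (the helper name `multi_information` is ours):

```python
import numpy as np

def multi_information(p):
    """~I(X_1,...,X_n) = sum_x p(x) log [ p(x) / prod_i p(x_i) ], in nats."""
    log_marginals = np.zeros_like(p)
    for i in range(p.ndim):
        axes = tuple(j for j in range(p.ndim) if j != i)
        p_i = p.sum(axis=axes, keepdims=True)   # marginal p(x_i)
        log_marginals = log_marginals + np.log(p_i + 1e-12)
    mask = p > 0
    return float((p[mask] * (np.log(p[mask]) - log_marginals[mask])).sum())
```

For $n = 2$ this reduces to the ordinary mutual information $I(X;Y)$.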
## Sufficient Dimensionality Reduction (with Amir Globerson)

- Exponential families have sufficient statistics.
- Given a joint distribution $P(x,y)$, find an approximation of the exponential form:

$$P(x,y) \sim \frac{1}{Z}\, \exp\!\left(\sum_{r=1}^{d} \phi_r(x)\, \psi_r(y)\right)$$

This can be done by alternating maximization of entropy under the constraints

$$\langle\phi_r\rangle_p = \langle\phi_r\rangle_{\tilde p}\,, \quad \langle\psi_r\rangle_p = \langle\psi_r\rangle_{\tilde p}\,, \qquad 1 \le r \le d\,.$$

The resulting functions are our relevant features at rank $d$.
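As a rough illustration only: the sketch below fits the rank-$d$ exponential form by simultaneous gradient ascent on $\langle \log P \rangle$ under the given joint, a simpler stand-in for the alternating entropy maximization named above; all names (`sdr_fit`, `lr`, `n_iter`) are ours:

```python
import numpy as np

def sdr_fit(p_xy, d, lr=0.5, n_iter=5000, seed=0):
    """Fit p(x,y) ~ exp(sum_r phi_r(x) psi_r(y)) / Z by gradient ascent."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    phi = rng.normal(scale=0.1, size=(n_x, d))
    psi = rng.normal(scale=0.1, size=(n_y, d))
    for _ in range(n_iter):
        logits = phi @ psi.T
        q = np.exp(logits - logits.max())
        q /= q.sum()                      # model distribution over (x, y)
        # The gradient of <log q>_p is a moment-matching difference.
        g_phi = (p_xy - q) @ psi
        g_psi = (p_xy - q).T @ phi
        phi += lr * g_phi
        psi += lr * g_psi
    return phi, psi
```

The learned columns of `phi` play the role of the $d$ relevant feature functions $\phi_r(x)$.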
## Summary

- We present a general information-theoretic approach for extracting relevant information.
- It is a natural generalization of Rate-Distortion theory, with similar convergence and optimality proofs.
- It unifies learning, feature extraction, filtering, and prediction...
- Applications (so far) include:
  - Word sense disambiguation
  - Document classification and categorization
  - Spectral analysis
  - Neural codes
  - Bioinformatics
  - Data clustering based on multi-distance distributions
  - ...