Feature Selection as Relevant Information Encoding


TRANSCRIPT

Page 1: Feature Selection as Relevant Information Encoding

Feature Selection as Relevant Information Encoding

Naftali Tishby
School of Computer Science and Engineering
The Hebrew University, Jerusalem, Israel

NIPS 2001

Page 2: Feature Selection as Relevant Information Encoding

Many thanks to:

Noam Slonim
Amir Globerson
Bill Bialek
Fernando Pereira

Nir Friedman

Page 3: Feature Selection as Relevant Information Encoding

Feature Selection?

• NOT generative modeling!

– no assumptions about the source of the data

• Extracting relevant structure from data
– functions of the data (statistics) that preserve information

• Information about what?

• Approximate Sufficient Statistics

• Need a principle that is both general and precise.
– Good Principles survive longer!

Page 4: Feature Selection as Relevant Information Encoding

A Simple Example...

        Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1    12      0       0    0     8       0    0       ...
Doc2    0       9       2    11    1       0    6       ...
Doc3    0       10      1    6     0       0    20      ...
Doc4    9       1       0    0     7       0    1       ...
Doc5    0       3       9    0     1       10   0       ...
Doc6    1       11      0    6     0       1    7       ...
Doc7    0       0       8    0     2       12   2       ...
Doc8    15      0       1    1     10      0    0       ...
Doc9    0       12      1    16    0       1    12      ...
Doc10   1       0       9    0     1       11   2       ...
...     ...     ...     ...  ...   ...     ...  ...     ...

Page 5: Feature Selection as Relevant Information Encoding

Simple Example

        Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1    12      8       0       0     0       0    0    ...
Doc4    9       7       1       0     1       0    0    ...
Doc8    15      10      0       1     0       1    0    ...

Doc2    0       1       9       11    6       2    0    ...
Doc3    0       0       10      6     20      1    0    ...
Doc6    1       0       11      6     7       0    1    ...
Doc9    0       0       12      16    12      1    1    ...

Doc5    0       1       3       0     0       9    10   ...
Doc7    0       2       0       0     2       8    12   ...
Doc10   1       1       0       0     2       9    11   ...
...     ...     ...     ...     ...   ...     ...  ...  ...

Page 6: Feature Selection as Relevant Information Encoding

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...

A new compact representation

The document clusters preserve the relevant information between the documents and the words.
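A minimal numpy sketch of this toy example: summing the word counts of each document group reproduces the cluster table above. The counts and the three document groups are read off the slides; the variable names are only illustrative.

```python
import numpy as np

# Word columns as on the reordered slide: Israel, Jewish, Health, Drug, Doctor, www, Dos
words = ["Israel", "Jewish", "Health", "Drug", "Doctor", "www", "Dos"]

# Document-word counts from the "Simple Example" slide (Doc1..Doc10)
counts = np.array([
    [12,  8,  0,  0,  0,  0,  0],   # Doc1
    [ 0,  1,  9, 11,  6,  2,  0],   # Doc2
    [ 0,  0, 10,  6, 20,  1,  0],   # Doc3
    [ 9,  7,  1,  0,  1,  0,  0],   # Doc4
    [ 0,  1,  3,  0,  0,  9, 10],   # Doc5
    [ 1,  0, 11,  6,  7,  0,  1],   # Doc6
    [ 0,  2,  0,  0,  2,  8, 12],   # Doc7
    [15, 10,  0,  1,  0,  1,  0],   # Doc8
    [ 0,  0, 12, 16, 12,  1,  1],   # Doc9
    [ 1,  1,  0,  0,  2,  9, 11],   # Doc10
])

# Document clusters suggested by the slide: {1,4,8}, {2,3,6,9}, {5,7,10}
clusters = [[0, 3, 7], [1, 2, 5, 8], [4, 6, 9]]

# The compact representation: one row of summed counts per document cluster
cluster_counts = np.array([counts[idx].sum(axis=0) for idx in clusters])
print(cluster_counts)   # reproduces the Cluster1/2/3 table of the slide
```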

Page 7: Feature Selection as Relevant Information Encoding

[Diagram: the documents X = {x_1, ..., x_n} are mapped to clusters C_x = {c_1, ..., c_k} that preserve the information I(C_x, Y) about the words Y = {y_1, ..., y_m}.]

Page 8: Feature Selection as Relevant Information Encoding

Mutual information

How much is X telling about Y?

I(X;Y) is a function of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) needed to ask about x in order to learn all we can about Y.

Uncertainty removed about X when we know Y:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

[Venn diagram: H(X|Y) and H(Y|X) are the parts of the two entropies outside the overlap; I(X;Y) is the overlap.]
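A small numpy sketch of this definition, computing I(X;Y) directly from a joint distribution table; the function name and the example channel are illustrative, not from the talk.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution given as a 2-D array p(x,y)."""
    pxy = pxy / pxy.sum()                # normalize to a probability table
    px = pxy.sum(axis=1, keepdims=True)  # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = pxy > 0                       # 0 * log 0 = 0 by convention
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# Example: a noisy binary channel; I(X;Y) is well below H(X) = 1 bit
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))   # roughly 0.28 bits
```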

Page 9: Feature Selection as Relevant Information Encoding

Relevant Coding

• What are the questions that we need to ask about X in order to learn about Y?
• Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

[Diagram: the conditional distributions P(x|y1) and P(x|y2) mark the regions of X that matter for distinguishing y1 from y2.]

Page 10: Feature Selection as Relevant Information Encoding

Bottlenecks and Neural Nets

• Auto-association: forcing compact representations
• X̂ is a relevant code of X w.r.t. Y

[Diagram: a bottleneck network maps X to Y through the compressed layer X̂; the X / Y pair can be Input / Output, Sample 1 / Sample 2, or Past / Future.]

Page 11: Feature Selection as Relevant Information Encoding

• Q: How many bits are needed to determine the relevant representation?
– need to index the maximal number of non-overlapping green blobs inside the blue blob (mutual information!):

2^H(X) / 2^H(X|X̂) = 2^I(X,X̂)

[Diagram: p(x̂|x) maps X into X̂; the blue blob has size 2^H(X) and each green blob has size 2^H(X|X̂).]

Page 12: Feature Selection as Relevant Information Encoding

• The idea: find a compressed signal X̂ that needs a short encoding (small I(X,X̂)) while preserving as much as possible of the information on the relevant signal (I(X̂,Y)).

[Diagram: p(x̂|x) compresses X into X̂ with marginal p(x̂), and p(y|x̂) carries the preserved information I(X̂,Y) about Y, out of the original I(X,Y).]

Page 13: Feature Selection as Relevant Information Encoding

A Variational Principle

We want a short representation of X that keeps the information about another variable, Y, if possible. Minimize over the encoder p(x̂|x) the functional

L[ p(x̂|x) ] = I_in(X, X̂) - β I_out(X̂, Y)

trading the compression I_in(X, X̂) against the preserved relevant information I_out(X̂, Y).

Page 14: Feature Selection as Relevant Information Encoding

The Self Consistent Equations

• Marginal:           p(x̂) = Σ_x p(x̂|x) p(x)
• Markov condition:   p(y|x̂) = Σ_x p(y|x) p(x|x̂)
• Bayes' rule:        p(x|x̂) = p(x̂|x) p(x) / p(x̂)

Setting the variation δL / δp(x̂|x) = 0 gives

p(x̂|x) = [ p(x̂) / Z(x,β) ] exp( -β D_KL[x, x̂] )

with the effective distortion D_KL[x, x̂] defined on the next slide.

Page 15: Feature Selection as Relevant Information Encoding

The emerged effective distortion measure:

D_KL[x, x̂] = D_KL[ p(y|x) || p(y|x̂) ] = Σ_y p(y|x) log [ p(y|x) / p(y|x̂) ]

• Regular if p(y|x) is absolutely continuous w.r.t. p(y|x̂)
• Small if x̂ predicts y as well as x

[Diagram: the Markov chain X̂ - X - Y with the maps p(x̂|x) and p(y|x), and the induced p(y|x̂).]
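A short numpy sketch of this effective distortion for discrete variables, with the conditionals stored as rows of stochastic matrices; the epsilon floor and the names are choices of the sketch.

```python
import numpy as np

def ib_distortion(py_given_x, py_given_xhat, eps=1e-12):
    """d(x, x_hat) = D_KL[ p(y|x) || p(y|x_hat) ] for all (x, x_hat) pairs.

    py_given_x:     array of shape (n_x,    n_y), rows sum to 1
    py_given_xhat:  array of shape (n_xhat, n_y), rows sum to 1
    Returns an (n_x, n_xhat) matrix of KL divergences (in nats).
    """
    p = py_given_x[:, None, :]      # (n_x, 1, n_y)
    q = py_given_xhat[None, :, :]   # (1, n_xhat, n_y)
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
```

The slide's regularity condition corresponds to py_given_xhat having support wherever py_given_x does; the eps floor merely keeps the sketch finite when it does not.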

Page 16: Feature Selection as Relevant Information Encoding

The iterative algorithm (Generalized Blahut-Arimoto):

[Diagram: cycle between the three self-consistent distributions, p(x̂|x) → p(x̂) → p(y|x̂) → p(x̂|x) → ...]

Page 17: Feature Selection as Relevant Information Encoding

The Information Bottleneck Algorithm

The IB problem can be written as the minimization of a "free energy"

F = I(X, X̂) + β < D_KL[x, x̂] >

over the three distributions p(x̂|x), p(x̂), p(y|x̂); at the minimum F = -< log Z(x,β) >.

Alternating minimization gives the updates (iteration index t):

p_t(x̂|x)     = [ p_t(x̂) / Z_t(x,β) ] exp( -β D_KL[ p(y|x) || p_t(y|x̂) ] )
p_{t+1}(x̂)   = Σ_x p(x) p_t(x̂|x)
p_{t+1}(y|x̂) = Σ_x p(y|x) p_t(x|x̂)
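A minimal self-contained numpy sketch of these alternating updates for discrete variables and a fixed number of clusters; the random initialization, the fixed iteration count, and all names are choices of this sketch rather than part of the slides.

```python
import numpy as np

def information_bottleneck(pxy, n_clusters, beta, n_iter=200, seed=0):
    """Generalized Blahut-Arimoto iteration for the IB self-consistent equations (discrete case)."""
    rng = np.random.default_rng(seed)
    pxy = pxy / pxy.sum()                 # joint p(x, y); assumes every x has positive probability
    px = pxy.sum(axis=1)                  # p(x)
    py_x = pxy / px[:, None]              # p(y|x), rows sum to 1

    # Random soft initialization of the encoder p(x_hat|x)
    q_xhat_x = rng.random((len(px), n_clusters))
    q_xhat_x /= q_xhat_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_xhat = q_xhat_x.T @ px                                 # p(x_hat)   = sum_x p(x) p(x_hat|x)
        q_x_xhat = q_xhat_x * px[:, None] / (q_xhat + 1e-12)     # Bayes: p(x|x_hat)
        q_y_xhat = q_x_xhat.T @ py_x                             # p(y|x_hat) = sum_x p(y|x) p(x|x_hat)

        # effective distortion d(x, x_hat) = D_KL[ p(y|x) || p(y|x_hat) ]
        d = np.sum(py_x[:, None, :] *
                   np.log((py_x[:, None, :] + 1e-12) / (q_y_xhat[None, :, :] + 1e-12)),
                   axis=-1)

        # p(x_hat|x) proportional to p(x_hat) exp(-beta d); Z(x, beta) normalizes each row
        logits = np.log(q_xhat + 1e-12)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)              # numerical stability
        q_xhat_x = np.exp(logits)
        q_xhat_x /= q_xhat_x.sum(axis=1, keepdims=True)

    return q_xhat_x, q_xhat, q_y_xhat
```

Run on a small joint distribution such as the normalized document-word counts above, sweeping β from small to large values should move the solution from a single effective cluster toward nearly deterministic assignments of documents to the three topical clusters.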

Page 18: Feature Selection as Relevant Information Encoding

• The information plane: the optimal I(X̂,Y) for a given I(X̂,X) is a concave function:

[Plot: normalized relevant information I(X̂,Y)/I(X,Y) versus normalized compression I(X̂,X)/H(X); the concave information curve, with slope δI(X̂,Y)/δI(X̂,X) = β^{-1}, separates the possible phase below from the impossible region above.]
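To trace such a curve numerically one can sweep β and record the two coordinates; a brief sketch, assuming the mutual_information and information_bottleneck helpers and the counts array from the earlier sketches are in scope (all of them illustrative, not part of the talk).

```python
import numpy as np

def information_coordinates(pxy, q_xhat_x):
    """Return ( I(X;X_hat), I(X_hat;Y) ) for a given encoder p(x_hat|x)."""
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)
    p_x_xhat = q_xhat_x * px[:, None]   # joint p(x, x_hat)
    p_xhat_y = q_xhat_x.T @ pxy         # joint p(x_hat, y)
    return mutual_information(p_x_xhat), mutual_information(p_xhat_y)

# Sweep beta and collect points of the information curve
pxy = counts / counts.sum()             # e.g. the toy document-word counts above
curve = [information_coordinates(pxy, information_bottleneck(pxy, n_clusters=3, beta=b)[0])
         for b in np.logspace(-1, 2, 20)]
```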

Page 19: Feature Selection as Relevant Information Encoding

Manifold of relevance

The self consistent equations, in log form:

log p(x̂|x) = log p(x̂) - log Z(x,β) - β Σ_y p(y|x) log [ p(y|x) / p(y|x̂) ]

Assuming a continuous manifold for x̂ and differentiating with respect to x̂ gives coupled (local in x̂) eigenfunction equations for ∂ log p(x̂|x)/∂x̂ and ∂ log p(y|x̂)/∂x̂, with β as an eigenvalue.

Page 20: Feature Selection as Relevant Information Encoding

Document classification - information curves

Page 21: Feature Selection as Relevant Information Encoding

Multivariate Information Bottleneck

• Complex relationships between many variables
• Multiple unrelated dimensionality reduction schemes
• Trade between known and desired dependencies
• Express IB in the language of Graphical Models
• Multivariate extension of Rate-Distortion Theory

Page 22: Feature Selection as Relevant Information Encoding

Multivariate Information Bottleneck: Extending the dependency graphs

L[ p(T|x), p(x|T), p(T) ] = I^{G_in}(X, T) - β I^{G_out}(T, Y)

where the informations are taken with respect to the input and output dependency graphs G_in and G_out, and the multi-information of a set of variables is

Ĩ(X_1, X_2, ..., X_n) = < log [ p(x_1, ..., x_n) / ( p(x_1) p(x_2) ··· p(x_n) ) ] >

(Multi-information)
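A small numpy sketch of the multi-information for a discrete joint distribution given as an n-dimensional probability array (it is the KL divergence between the joint and the product of its one-dimensional marginals); the names are illustrative.

```python
import numpy as np

def multi_information(p):
    """Multi-information (in bits) of an n-dimensional joint probability array p."""
    p = p / p.sum()
    indep = np.ones_like(p)              # product of the 1-D marginals, same shape as p
    for axis in range(p.ndim):
        other = tuple(a for a in range(p.ndim) if a != axis)
        marginal = p.sum(axis=other)
        shape = [1] * p.ndim
        shape[axis] = -1
        indep = indep * marginal.reshape(shape)
    mask = p > 0                         # 0 log 0 = 0 by convention
    return float((p[mask] * np.log2(p[mask] / indep[mask])).sum())

# For two variables this reduces to the ordinary mutual information I(X1; X2).
```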

Page 23: Feature Selection as Relevant Information Encoding
Page 24: Feature Selection as Relevant Information Encoding

Sufficient Dimensionality Reduction
(with Amir Globerson)

• Exponential families have sufficient statistics
• Given a joint distribution P(x,y), find an approximation of the exponential form:

P̃(x,y) = (1/Z) exp( Σ_{r=1}^{d} φ_r(x) ψ_r(y) )

This can be done by alternating maximization of entropy under the constraints

< φ_r >_P = < φ_r >_P̃ ,   < ψ_r >_P = < ψ_r >_P̃ ,   r = 1, ..., d

The resulting functions φ_r(x), ψ_r(y) are our relevant features at rank d.
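A rough numpy sketch of fitting such an exponential-form approximation. For simplicity it uses plain gradient descent on KL(P || P̃) instead of the alternating maximum-entropy projections described on the slide; at a fixed point the same moment-matching constraints hold. All names, the learning rate, and the iteration count are assumptions of the sketch.

```python
import numpy as np

def sdr_features(P, d, lr=0.1, n_iter=5000, seed=0):
    """Fit P_tilde(x,y) = exp( sum_r phi_r(x) psi_r(y) ) / Z by gradient descent on KL(P || P_tilde)."""
    rng = np.random.default_rng(seed)
    P = P / P.sum()
    nx, ny = P.shape
    phi = 0.01 * rng.standard_normal((nx, d))   # phi_r(x), one column per feature r
    psi = 0.01 * rng.standard_normal((ny, d))   # psi_r(y)

    for _ in range(n_iter):
        scores = phi @ psi.T                    # sum_r phi_r(x) psi_r(y)
        P_tilde = np.exp(scores - scores.max())
        P_tilde /= P_tilde.sum()                # the 1/Z normalization
        diff = P_tilde - P                      # gradient of KL(P || P_tilde) w.r.t. the scores
        grad_phi = diff @ psi                   # chain rule through phi_r(x) psi_r(y)
        grad_psi = diff.T @ phi
        phi -= lr * grad_phi
        psi -= lr * grad_psi

    return phi, psi, P_tilde

# At a fixed point the feature expectations match: <phi_r>_P = <phi_r>_P_tilde, and likewise for psi_r.
```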

Page 25: Feature Selection as Relevant Information Encoding

Summary

• We present a general information-theoretic approach for extracting relevant information.

• It is a natural generalization of Rate-Distortion theory with similar convergence and optimality proofs.

• Unifies learning, feature extraction, filtering, and prediction...

• Applications (so far) include:
– Word sense disambiguation
– Document classification and categorization
– Spectral analysis
– Neural codes
– Bioinformatics, …
– Data clustering based on multi-distance distributions
– …