mvfl_part3 cvpr2012 spatio-temporal and higher-order feature learning


Outline

1 Introduction
    Feature Learning
    Correspondence in Computer Vision
    Relational feature learning

2 Learning relational features
    Sparse Coding Review
    Encoding relations
    Inference
    Learning

3 Factorization, eigen-spaces and complex cells
    Factorization
    Eigen-spaces, energy models, complex cells

4 Applications
    Applications
    Conclusions

Roland Memisevic (Uni Frankfurt) Multiview Feature Learning Tutorial at CVPR 2012 71 / 174




Complexity

The number of parameters is about n × n × n (!)

More, if we want sparse, overcomplete hiddens.

There is a simple, yet far-reaching, way to reduce that number.





Factorization

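[Figure: the three-way tensor $w_{ijk}$ expressed through the factor matrices $w^x_{if}$, $w^y_{jf}$, $w^z_{kf}$]

$$w_{ijk} = \sum_f w^x_{if}\, w^y_{jf}\, w^z_{kf}$$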
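As a rough illustration of the saving (a sketch, not taken from the slides), the snippet below builds the three-way tensor from three factor matrices and compares parameter counts. The sizes `n` and `F` and the array names `Wx`, `Wy`, `Wz` are illustrative assumptions.

```python
import numpy as np

n, F = 16, 32                      # hypothetical sizes: n units per group, F factors
rng = np.random.default_rng(0)
Wx = rng.standard_normal((n, F))   # w^x_{if}
Wy = rng.standard_normal((n, F))   # w^y_{jf}
Wz = rng.standard_normal((n, F))   # w^z_{kf}

# w_{ijk} = sum_f w^x_{if} w^y_{jf} w^z_{kf}
W = np.einsum("if,jf,kf->ijk", Wx, Wy, Wz)

full_params = W.size                              # n * n * n
factored_params = Wx.size + Wy.size + Wz.size     # 3 * n * F
print(full_params, factored_params)
```

The factored count grows like n times F instead of n cubed, so the gap widens quickly as the images get larger.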





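Factorization is filter matching

[Figure: inputs $x$ ($x_i$) and $y$ ($y_j$) pass through filters $W^x$ and $W^y$; mapping units $z$ ($z_k$) pool through $W^z$]

$$z_k = \sum_{ij} w_{ijk}\, x_i y_j = \sum_{ij} \sum_f w^x_{if}\, w^y_{jf}\, w^z_{kf}\, x_i y_j = \sum_f w^z_{kf} \Big( \sum_i w^x_{if} x_i \Big) \cdot \Big( \sum_j w^y_{jf} y_j \Big)$$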
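The identity can be checked numerically. The following is a minimal sketch, assuming random factor matrices and inputs (all names and sizes are illustrative): it computes the mapping units once through the full tensor and once through filter responses that are multiplied and pooled.

```python
import numpy as np

rng = np.random.default_rng(0)
n, F = 8, 12                                   # hypothetical sizes
Wx, Wy, Wz = (rng.standard_normal((n, F)) for _ in range(3))
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Full three-way tensor route: z_k = sum_ij w_ijk x_i y_j
W = np.einsum("if,jf,kf->ijk", Wx, Wy, Wz)
z_full = np.einsum("ijk,i,j->k", W, x, y)

# Factored route: filter both images, multiply responses, pool with W^z
fx = Wx.T @ x            # sum_i w^x_{if} x_i
fy = Wy.T @ y            # sum_j w^y_{jf} y_j
z_fact = Wz @ (fx * fy)

print(np.allclose(z_full, z_fact))
```

The factored route never forms the n by n by n tensor, which is the practical point of the factorization.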





Factorization is filter matching

$$E = \sum_{ijk} \Big( \sum_f w^x_{if}\, w^y_{jf}\, w^z_{kf} \Big) x_i y_j z_k = \sum_f \Big( \sum_i w^x_{if} x_i \Big) \Big( \sum_j w^y_{jf} y_j \Big) \Big( \sum_k w^z_{kf} z_k \Big)$$





Factorized models

Factored Gated Boltzmann machines

Exponentiate and normalize energy (just like RBM).

Learning and inference exactly like before.

(Taylor, 2009), (Memisevic, Hinton; 2009)





Factorized models

Factored Relational Autoencoders

Again, everything like before. Back-propagate through the filters.

Conditional learning trivial.

Joint learning by adding two asymmetric objectives.





Square pooling models

Square pooling:

Another way to learn filter matching models is to use square pooling models, for example:

ASSOM (Kohonen, 1996)

ISA (Hyvarinen, 2000)

Product of T-distributions (Osindero et al., 2006)

(Karklin, Lewicki; 2008)

cRBM (Ranzato et al., 2009)

Often, $W^z$ is constrained so each hidden sees only a few squared inputs. That way hiddens can be thought of as encoding subspace norms.

[Figure: inputs $x$ and $y$ are filtered by $W^x$ and $W^y$, the summed responses are squared, $(\cdot)^2$, and pooled through $W^z$ into $z_k$]






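Square pooling:

Why is square pooling the same?

The activity that a hidden unit gets is:

$$\sum_f w^z_{kf} \left( W^{x\,T}_{\cdot f}\, x + W^{y\,T}_{\cdot f}\, y \right)^2 = \sum_f w^z_{kf} \left[ 2 \left( W^{x\,T}_{\cdot f}\, x \right) \left( W^{y\,T}_{\cdot f}\, y \right) + \left( W^{x\,T}_{\cdot f}\, x \right)^2 + \left( W^{y\,T}_{\cdot f}\, y \right)^2 \right]$$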
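The expansion of the square can be verified with a few lines of NumPy. This is a sketch under assumed random weights; the sizes and the names `Wx`, `Wy`, `Wz` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, F, K = 8, 6, 4                      # hypothetical sizes: pixels, factors, hiddens
Wx = rng.standard_normal((n, F))
Wy = rng.standard_normal((n, F))
Wz = rng.standard_normal((K, F))       # w^z_{kf}
x, y = rng.standard_normal(n), rng.standard_normal(n)

fx, fy = Wx.T @ x, Wy.T @ y            # filter responses for each factor f

square_pool = Wz @ (fx + fy) ** 2      # pooled squares of summed responses
cross = 2 * (Wz @ (fx * fy))           # the product (filter matching) term
squares = Wz @ fx ** 2 + Wz @ fy ** 2  # the extra single-image square terms

print(np.allclose(square_pool, cross + squares))
```

The `cross` term is exactly twice the factored gating response, so square pooling differs from it only by the two single-image terms.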






Square pooling:

Learning is somewhat more difficult than with factored gated feature learning.

Example ISA: Gradient-based, while enforcing $W^{xy\,T} W^{xy} = I$ after every gradient step (eigen-decomposition).
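One common way to enforce such an orthogonality constraint is symmetric orthogonalization, $W \leftarrow W (W^T W)^{-1/2}$, computed through an eigen-decomposition. Whether the ISA training referred to here uses exactly this update is an assumption; the sketch below only shows the mechanics, with a random perturbation standing in for a gradient step.

```python
import numpy as np

def orthogonalize(W):
    """Return W (W^T W)^{-1/2}, using an eigen-decomposition of W^T W."""
    evals, evecs = np.linalg.eigh(W.T @ W)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return W @ inv_sqrt

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))              # hypothetical: 20 inputs, 5 filters
W = W - 0.1 * rng.standard_normal(W.shape)    # stand-in for one gradient step
W = orthogonalize(W)                          # re-impose W^T W = I

print(np.allclose(W.T @ W, np.eye(5)))
```

This assumes the filter matrix has full column rank, so that all eigenvalues of $W^T W$ are strictly positive.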





Examples

Toy examples:

There is no structure in these images.

Only in how they change.





Learned filters $w^x_{if}$

Learned filters $w^y_{jf}$

Frequency/orientation histograms

Velocity tuning of mapping units

Filters learned from split-screen shifts

Affine filters

“Filtering” filters

Rotation filters

Filters learned by watching TV

“Bag-Of-Warps”





Linear image warps

Consider a linear transformation in pixel space (“warp”):

$$y = L x$$

Now consider the following task:

Given two images $x$, $y$, what is the warp that relates them?

This is exactly the problem that mapping units should be able to solve.


Orthogonal image warps


$$y = L x$$

We restrict our attention to orthogonal warps in the following, that is:

$$L^T L = I$$

These include all permutations (“shuffling pixels”).

Orthogonal warps are the only transformations we can see anyway, if all our images are white:

$$I = C_y = L C_x L^T = L L^T$$

(Bethge, 2007)

To get a better understanding of what mapping units really do, we make use of two properties of orthogonal image warps:


Properties of orthogonal image warps


(I) Orthogonal transformations decompose into 2-D rotations

An orthogonal matrix is similar to a matrix that performs axis-aligned two-dimensional rotations:

$$V^T L V = \begin{pmatrix} R_1 & & \\ & \ddots & \\ & & R_k \end{pmatrix}, \qquad R_i = \begin{pmatrix} \cos(\theta_i) & -\sin(\theta_i) \\ \sin(\theta_i) & \cos(\theta_i) \end{pmatrix}$$

This follows, for example, from the fact that the eigen-decomposition

$$L = V D V^T$$

has complex eigenvalues of length 1.

The eigenspaces are also known as invariant subspaces.
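A quick numerical check of the unit-length eigenvalues, as a sketch: the orthogonal matrix here is drawn at random through a QR decomposition, which is an illustrative choice and not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                              # hypothetical number of pixels
L, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a random orthogonal matrix

eigvals = np.linalg.eigvals(L)
moduli = np.abs(eigvals)      # all eigenvalues lie on the unit circle
angles = np.angle(eigvals)    # rotation angles; complex eigenvalues come in +/- pairs

print(np.allclose(L.T @ L, np.eye(n)), np.allclose(moduli, 1.0))
```

Each complex-conjugate eigenvalue pair corresponds to one 2-D rotation block $R_i$ with angle $\theta_i$ equal to the magnitude of the pair's phase.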




Example: Translation and the Fourier spectrum

Translation is an example of an orthogonal warp.

1-D translation matrices are circulants, which have ones along an off-diagonal, like so:

$$L = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}$$

The two-dimensional eigen-features of this matrix turn out to be sine-/cosine-pairs (Fourier features).
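The claim can be checked directly for the 5 × 5 example. In this sketch the frequency index `k` and the variable names are illustrative assumptions; the matrix itself has the layout given in the slide.

```python
import numpy as np

n, k = 5, 1
L = np.roll(np.eye(n), 1, axis=1)        # ones on the off-diagonal, wrapping around
idx = np.arange(n)
v = np.exp(2j * np.pi * k * idx / n)     # complex Fourier feature
lam = np.exp(2j * np.pi * k / n)         # its eigenvalue, on the unit circle

v_R, v_I = v.real, v.imag                # cosine and sine feature
alpha = 2 * np.pi * k / n

# L maps the plane spanned by (v_R, v_I) onto itself as a rotation by alpha
rotated_R = np.cos(alpha) * v_R - np.sin(alpha) * v_I
rotated_I = np.sin(alpha) * v_R + np.cos(alpha) * v_I

print(np.allclose(L @ v, lam * v))
print(np.allclose(L @ v_R, rotated_R), np.allclose(L @ v_I, rotated_I))
```

The real and imaginary parts of the complex eigenvector are the sine-/cosine-pair, and the phase of the eigenvalue is the rotation angle in that plane.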




Quadrature pairs

Since the invariant subspaces of orthogonal warps are two-dimensional, eigenvectors come in pairs:

$$v_R, \; v_I$$

They form an orthogonal basis for the invariant subspace.

In the case of translation, $v_I$ is a sine and $v_R$ is a cosine feature.

Waves with 90 degrees phase difference are known as a “quadrature pair”.

But the concept is more general and applies to all orthogonal matrices.

The eigenvector pairs of orthogonal transformations have been referred to as “generalized quadrature pairs” (Bethge et al., 2007).


Properties of commuting image warps




(II) Commuting transformations share an eigen-basis

Any two transformations that commute share a single eigen-basis.

They differ only in their eigenvalues.

“Proof”: Consider $A$ and $B$ with $AB = BA$ and the eigenvector $v$ of $B$ with $\lambda$ an eigenvalue with multiplicity one. We have

$$BAv = ABv = \lambda Av.$$

So $Av$ is also an eigenvector of $B$ with the same eigenvalue. Since $\lambda$ has multiplicity one, $Av$ must be a multiple of $v$, so $v$ is an eigenvector of $A$ as well.






Translation Example continued

All circulants have the Fourier basis as eigen-basis.
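The statement about circulants can be checked numerically. A minimal sketch, assuming cyclic shifts by one and by three pixels as the two circulants (these choices and the names `S1`, `S3`, `F` are illustrative, not from the slides):

```python
import numpy as np

n = 8
S1 = np.roll(np.eye(n), 1, axis=1)          # cyclic shift by one pixel
S3 = np.roll(np.eye(n), 3, axis=1)          # cyclic shift by three pixels
F = np.fft.fft(np.eye(n)) / np.sqrt(n)      # unitary DFT matrix (Fourier basis)

def off_diagonal(M):
    """Return M with its diagonal set to zero."""
    return M - np.diag(np.diag(M))

# Expressed in the Fourier basis, both shifts become diagonal
D1 = F.conj().T @ S1 @ F
D3 = F.conj().T @ S3 @ F

print(np.allclose(S1 @ S3, S3 @ S1))
print(np.allclose(off_diagonal(D1), 0), np.allclose(off_diagonal(D3), 0))
```

The two diagonals `np.diag(D1)` and `np.diag(D3)` hold different unit-length eigenvalues, which is the only way the two shifts differ in this basis.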

Properties (I) and (II) taken together now allow us to state the following:






Any two orthogonal, commuting transformations differ only with respect to the rotation angles in the eigenspaces.

So to apply a transformation you can equivalently perform a set of independent 2-D rotations.

[Figure: an image $x$ and its warped version $y = L x$]








To infer the transformation, given two images $x$ and $y$: Project $x$ and $y$ onto the eigenvectors, then compute the rotation angles!





Extracting sub-space rotations, naive approach




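[Figure: the 2-D projections of $x$ and $y$ in the complex plane (axes Re, Im), at angles $\phi_x$ and $\phi_y$]

In each subspace:

Normalize the 2-D projections to unit norm, then read off the angle between them.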






Extracting rotations by computing angles

To read off the angle, compute the inner product (after normalizing projections to unit-norm).

Formally,

$$\cos(\phi_y - \phi_x) = \cos\phi_y \cos\phi_x + \sin\phi_y \sin\phi_x = (v_R^T y)(v_R^T x) + (v_I^T y)(v_I^T x)$$

Compute the sum over products of filter responses.
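For a cyclic translation the recipe can be tried out in a few lines. This is a sketch under assumptions: the image is a random 1-D signal, the warp is a cyclic shift, and `n`, `k`, `shift` and the helper `unit_projection` are hypothetical names chosen for the example.

```python
import numpy as np

n, k, shift = 16, 2, 3
idx = np.arange(n)
v_R = np.cos(2 * np.pi * k * idx / n)    # cosine feature
v_I = np.sin(2 * np.pi * k * idx / n)    # sine feature (quadrature partner)

rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = np.roll(x, shift)                    # y = L x for a cyclic shift L

def unit_projection(img):
    """Project onto the (v_R, v_I) plane and normalize to unit norm."""
    p = np.array([v_R @ img, v_I @ img])
    return p / np.linalg.norm(p)

px, py = unit_projection(x), unit_projection(y)
cos_delta = px @ py                      # sum over products of filter responses
expected = np.cos(2 * np.pi * k * shift / n)

print(np.isclose(cos_delta, expected))
```

The inner product recovers the cosine of the rotation angle that the shift induces in this frequency's subspace, here $2\pi k \cdot \text{shift} / n$.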





Sub-space rotation detectors




Normalizing to unit norm can be a bad idea, if projections are small:

The aperture problem

Consider the left shift of a horizontal bar.

It is impossible to see the transformation in this case.

This is known as the aperture problem.

Normalizing subspace projections would amount to pretending we could see the transformation!

A second way to get the rotations:

Absorb the rotation into one of the eigenvectors, then try to detect rotation angles.






Extracting rotations by detecting angles

Formally, let the output filter pair $v^\theta_R$, $v^\theta_I$ be the input filter rotated by $\theta$ degrees (in complex notation: $v^\theta = \exp(i\theta)\, v$).

Measure how well the image pair $x$, $y$ conforms with this rotation:

$$r_\theta := \cos(\phi_y - \phi_x - \theta) = \cos(\phi_y)\cos(\phi_x + \theta) + \sin(\phi_y)\sin(\phi_x + \theta) = (v^{\theta\,T}_R y)(v_R^T x) + (v^{\theta\,T}_I y)(v_I^T x)$$

Again we have to sum over products of filter responses.
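A bank of such detectors can be sketched as follows. The 1-D cyclic-shift setting, the sizes, and the function name `detector` are assumptions made for illustration. The sign convention for rotating the output filter is also an assumption, chosen so that the unnormalized response is proportional to $\cos(\phi_y - \phi_x - \theta)$.

```python
import numpy as np

n, k = 16, 1
idx = np.arange(n)
omega = 2 * np.pi * k / n
v_R, v_I = np.cos(omega * idx), np.sin(omega * idx)    # input filter pair

def detector(x, y, theta):
    """Unnormalized response of a mapping unit tuned to angle theta."""
    vt_R = np.cos(theta) * v_R + np.sin(theta) * v_I   # rotated output filters
    vt_I = -np.sin(theta) * v_R + np.cos(theta) * v_I
    return (vt_R @ y) * (v_R @ x) + (vt_I @ y) * (v_I @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(n)
shift = 5
y = np.roll(x, shift)                                  # cyclic shift by 5 pixels

thetas = omega * np.arange(n)                          # one detector per shift
responses = np.array([detector(x, y, t) for t in thetas])
best = int(np.argmax(responses))                       # index of strongest unit

# A constant image has no energy in this subspace: no detector responds
flat = np.ones(n)
flat_responses = np.array([detector(flat, np.roll(flat, shift), t) for t in thetas])

print(best, np.allclose(flat_responses, 0.0))
```

Because the projections are not normalized, the response scales with how much of the image lies in the subspace, so an invisible transformation produces no response rather than a spurious angle.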






For each subspace, we will need several mapping units, each tuned to a different angle, $\theta_i$.

The set of mapping unit responses will now constitute a population code that represents the observed transformation.

A mapping unit is conservative: It fires only if a transform is present and if it is visible in the image pair.

But there is still one problem...

Subspace rotation detector graphical model




But the aperture problem causes another problem:

Take a video showing translations and generate two copies: Low-pass filter each frame in the first; High-pass filter each frame in the second.

Now the transformation will be visible only in some components in the first and in other components in the second video.

These subspace features are content-dependent!


Subspace rotation detector graphical model


The solution:

Let hiddens pool within and pool across subspaces.

This is exactly the factored bilinear model.


Summary: Learning relation-detectors




The cross-correlation model

A hidden variable that computes the sum over products of filter responses can detect rotations, $\theta$, in an invariant subspace.

To reconstruct the transformed output from the input image, it has to pool over multiple 2-dimensional subspaces.

The population code of such hiddens is a good code for image transformations.

Learning requires contrast normalization + keeping the scales of filters roughly the same!


Learning quadrature features




[Figure: graphical model with inputs $x$ ($x_i$), $y$ ($y_j$) and hidden units $z$ ($z_k$)]

We can see the quadrature features, if we outsource the across-subspace pooling into a separate layer.


Rotation “quadrature” filters

Mixed transformations

Quadrature features from natural video




Energy models




$$z_k = \sum_f w_{fk} \left( u_f^T x + v_f^T y \right)^2 = 2 \sum_f w_{fk} \left( u_f^T x \right) \left( v_f^T y \right) + \sum_f w_{fk} \left( u_f^T x \right)^2 + \sum_f w_{fk} \left( v_f^T y \right)^2$$

When we apply energy models to the concatenation of two images, we add square terms in inference.

This may make the rotation detectors more conservative.

Otherwise inference is the same!





Energy models



The energy model

(Adelson and Bergen, 1985): Motion

(Ohzawa, DeAngelis, Freeman; 1990): Disparity

Equivalence to cross-correlation: See, for example, (Fleet et al.; 1994).


Learning energy models on movies

[Figure: projections of the frames in the complex plane (axes Re, Im), at angles $\phi_{x_1}, \dots, \phi_{x_t}$]

What happens when we train energy models on movies?

Hiddens receive all pairs of products between filters applied to frames.

So they detect the repeated application of the same eigenvalue:

$$\Big( \sum_s v_s^T x_s \Big)^2 = \sum_s \big( v_s^T x_s \big)^2 + \sum_{s \neq t} \big( v_s^T x_s \big) \cdot \big( v_t^T x_t \big)$$




Training energy models via gating




We can train a cross-correlation model via the energy mechanism.

But we can do the opposite, too:

Plug in the same data left and right and tie left and right filters.

So we don’t have to use ISA or PoT to train energy models.
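Why tying works can be seen from the inference equation alone. A minimal sketch, assuming a factored gating response of the form "pool the products of filter responses" (the function `gated_response`, the sizes, and the random weights are illustrative): with the same image on both sides and shared filters, the products become squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, F, K = 10, 6, 3                    # hypothetical sizes: pixels, factors, hiddens
U = rng.standard_normal((n, F))       # shared (tied) left/right filters
Wz = rng.standard_normal((K, F))      # pooling weights
x = rng.standard_normal(n)

def gated_response(x_left, y_right, Wx, Wy, Wz):
    """Factored gating: pool products of left and right filter responses."""
    return Wz @ ((Wx.T @ x_left) * (Wy.T @ y_right))

z_gated = gated_response(x, x, U, U, Wz)   # same data left and right, tied filters
z_energy = Wz @ (U.T @ x) ** 2             # pooled squared filter responses

print(np.allclose(z_gated, z_energy))
```

The training procedure itself is not shown here; the sketch only confirms that the tied gating model computes the same hidden activity as an energy model.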





A covariance encoder trained on movies









Learning cross-correlation and energy models




Take-home message, factored model

To learn about transformation, let hidden units pool over products of filter responses (gated feature learning) or pool over squares of sums of filter responses (energy model).


A bag of tricks




Tricks for learning:

Normalize filters during learning, so they grow slowly, and they grow together: Normalize with a running average of the average filter norms.

Connect top-level hiddens locally to the factors.

Probably even better: make them locally overlapping (“Topographic ICA”).

DC-centering and contrast-normalization for each patch.

Plus: Whiten the data before learning, using PCA or ZCA.

Fast learning: large data-sets essential (use GPUs...).
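The preprocessing tricks can be sketched as one function. This is an illustrative pipeline, not the tutorial's code: the function name `preprocess`, the regularizer `eps`, and the synthetic test data are assumptions. It performs per-patch DC-centering, per-patch contrast normalization, and ZCA whitening.

```python
import numpy as np

def preprocess(patches, eps=1e-5):
    """patches: array of shape (num_patches, num_pixels); eps is an assumed regularizer."""
    X = patches - patches.mean(axis=1, keepdims=True)   # DC-centering per patch
    X = X / (X.std(axis=1, keepdims=True) + eps)        # contrast normalization per patch
    X = X - X.mean(axis=0)                              # center each pixel over the data set
    C = X.T @ X / X.shape[0]                            # pixel covariance
    evals, evecs = np.linalg.eigh(C)
    zca = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return X @ zca, zca

rng = np.random.default_rng(0)
mixing = np.eye(16) + 0.5 * np.roll(np.eye(16), 1, axis=1)   # correlate neighbouring pixels
patches = rng.standard_normal((5000, 16)) @ mixing           # synthetic stand-in for image patches

white, zca = preprocess(patches)
cov = white.T @ white / white.shape[0]
cov_eigs = np.linalg.eigvalsh(cov)                           # sorted ascending
print(cov_eigs.round(3))
```

Because DC-centering removes the constant direction from every patch, the whitened covariance has one eigenvalue near zero and is close to the identity on the remaining subspace.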

