Training sparse natural image models with a fast Gibbs sampler of an extended state space

Lucas Theis (1), Jascha Sohl-Dickstein (2), Matthias Bethge (1, 3, 4)

(1) Werner Reichardt Centre for Integrative Neuroscience, Tübingen
(2) Redwood Center for Theoretical Neuroscience, Berkeley
(3) Max Planck Institute for Biological Cybernetics, Tübingen
(4) Bernstein Center for Computational Neuroscience, Tübingen



Introduction

We present an efficient inference and optimization algorithm for maximum likelihood learning in the overcomplete linear model (OLM). Using a persistent variant of expectation maximization, we find that overcomplete representations significantly improve the performance of the linear model on natural images, whereas most previous studies were unable to detect an advantage [1, 2, 3, 4].

Overcomplete linear model

The overcomplete linear model assumes
$$x = As, \qquad p(s) = \prod_i f_i(s_i),$$
where $x \in \mathbb{R}^M$ are the visible variables, $s \in \mathbb{R}^N$ are the latent sources, and $A \in \mathbb{R}^{M \times N}$ with $N \geq M$. Note that we don't assume any additive noise on the visible units.

We use Gaussian scale mixtures (GSMs) to represent the source distributions. This family contains the Laplace, Student-t, Cauchy, and exponential power distributions as special cases.

$$f_i(s_i) = \int_0^\infty g_i(\lambda_i) \, \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, d\lambda_i$$
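As a concrete illustration, the following sketch draws sparse sources from a GSM and forms a noise-free sample $x = As$. The Gamma mixing density, the dimensions, and the random basis are illustrative assumptions, not choices taken from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gsm_sources(N, a=1.5, b=1.5, size=1):
    """Draw sparse sources s_i from a Gaussian scale mixture.

    Each source is generated by drawing a precision lambda_i from a mixing
    density g_i (here a Gamma(a, b) density, an illustrative choice that
    yields Student-t marginals) and then drawing s_i ~ N(0, 1 / lambda_i).
    """
    lam = rng.gamma(shape=a, scale=1.0 / b, size=(size, N))  # precisions
    return rng.normal(0.0, 1.0 / np.sqrt(lam))               # sources

# Sampling from the noise-free overcomplete linear model x = A s:
M, N = 64, 128                                  # e.g. 2x overcomplete 8x8 patches
A = rng.normal(0.0, 1.0, (M, N)) / np.sqrt(N)   # placeholder basis
s = sample_gsm_sources(N)[0]                    # one source vector, shape (N,)
x = A @ s                                       # visible variables (image patch)
```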

Inference

The posterior distribution over the latent source variables is constrained to a linear subspace and can have multiple modes with heavy tails.

[Figure: illustration of the posteriors p(s | x) over the sources (s_1, s_2) and p(z | x) over the auxiliary variables, for basis A and auxiliary basis B.]

Blocked Gibbs sampling

We introduce two sets of auxiliary variables. One set, z, represents the missing visible variables [5].

$$\begin{pmatrix} x \\ z \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix} s, \qquad s = A^+ x + B^+ z$$

The other set, λ, represents the precisions of the GSM source distributions. Our blocked Gibbs sampler alternately samples z and λ from their conditional distributions,
$$p(z \mid x, \lambda) = \mathcal{N}(z; \mu_{z|x}, \Sigma_{z|x}), \qquad p(\lambda \mid x, z) \propto \prod_i \mathcal{N}(s_i; 0, \lambda_i^{-1}) \, g_i(\lambda_i).$$
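A minimal sketch of one Gibbs sweep over (z, λ) is given below. It assumes a Gamma mixing density for every source (a conjugate, illustrative choice) and takes the rows of B to be an orthonormal basis of the null space of A, so that $s = A^+ x + B^+ z$; the dimensions and the random basis are placeholders.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)

def gibbs_sweep(x, z, A, B, a=1.5, b=1.5):
    """One sweep of the blocked Gibbs sampler over (z, lambda) (a sketch).

    Assumes a Gamma(a, b) mixing density g_i for every source; the rows of B
    form an orthonormal basis of the null space of A.
    """
    # Current sources given the completed data point (x, z).
    s = np.linalg.pinv(A) @ x + np.linalg.pinv(B) @ z

    # Sample the precisions: p(lambda | x, z) is Gamma by conjugacy.
    lam = rng.gamma(shape=a + 0.5, scale=1.0 / (b + 0.5 * s**2))

    # Sample the missing visibles: p(z | x, lambda) is Gaussian, since
    # (x, z) = (A s, B s) is jointly Gaussian given s ~ N(0, diag(lambda)^-1).
    C = np.diag(1.0 / lam)                # prior covariance of s given lambda
    Cxx = A @ C @ A.T
    Czx = B @ C @ A.T
    Czz = B @ C @ B.T
    K = Czx @ np.linalg.inv(Cxx)
    mu_z = K @ x                                  # mu_{z|x}
    Sigma_z = Czz - K @ Czx.T                     # Sigma_{z|x}
    Sigma_z = 0.5 * (Sigma_z + Sigma_z.T)         # symmetrize for stability
    z = rng.multivariate_normal(mu_z, Sigma_z)

    return z, lam

# Hypothetical setup:
M, N = 16, 32
A = rng.normal(size=(M, N))
B = null_space(A).T                       # (N - M) x N, orthonormal rows
x = A @ rng.normal(size=N)                # a data point from the model
z, lam = gibbs_sweep(x, np.zeros(N - M), A, B)
```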

Persistent EM

For maximum likelihood learning, we use a Monte Carlo variant of expectation maximization (EM) in which, in each E-step, the data is completed by sampling from the posterior. To further speed up learning, we initialize the Markov chain with the samples of the previous iteration:

1. initialize $z$
2. repeat
3.   $z \sim T(\cdot\,; z, x)$
4.   maximize $\log p(x, z \mid \theta)$

This algorithm is justified, and converges, because each iteration can only increase a lower bound on the log-likelihood:

$$F[q, \theta] = \log p(x \mid \theta) - D_{\mathrm{KL}}\left[\,q(z \mid x) \,\|\, p(z \mid x, \theta)\,\right], \qquad F[Tq, \theta] \geq F[q, \theta]$$
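The structure of this persistent Monte Carlo EM loop can be sketched as follows; `gibbs_sweep` and `grad_log_joint` are hypothetical, model-specific callbacks for the E-step transition and the complete-data gradient, and only the persistent-chain bookkeeping is meant to be illustrative.

```python
import numpy as np

def persistent_em(X, theta, Z0, gibbs_sweep, grad_log_joint,
                  num_iter=1000, lr=1e-2):
    """Monte Carlo EM with a persistent Markov chain (a sketch).

    X              : data matrix, one image patch per row
    theta          : model parameters (e.g. the basis A)
    Z0             : initial states of the auxiliary variables z, one per patch
    gibbs_sweep    : applies one blocked Gibbs transition T(.; z, x) per chain
    grad_log_joint : gradient of the complete-data log-likelihood log p(x, z | theta)

    Both callbacks are hypothetical placeholders; only the persistent-chain
    structure of the loop is shown here.
    """
    Z = Z0.copy()
    for _ in range(num_iter):
        # E-step: complete the data by sampling from (approximately) the
        # posterior, re-using the chain state from the previous iteration.
        Z = gibbs_sweep(X, Z, theta)
        # Partial M-step: gradient step on the complete-data log-likelihood.
        theta = theta + lr * grad_log_joint(X, Z, theta)
    return theta, Z
```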

Likelihood estimation

We use a form of importance sampling (annealed importance sampling) to estimate the likelihood of the model. This yields an unbiased estimator of the likelihood and, by Jensen's inequality, a conservative estimator of the log-likelihood.

$$p(x) = \int q(z \mid x) \, \frac{p(x, z)}{q(z \mid x)} \, dz \approx \frac{1}{N} \sum_n \frac{p(x, z_n)}{q(z_n \mid x)}$$
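In its simplest, non-annealed form, the estimator can be sketched as below; `sample_q`, `log_q`, and `log_joint` are hypothetical callables for the proposal and the joint density.

```python
import numpy as np
from scipy.special import logsumexp

def estimate_log_likelihood(x, sample_q, log_q, log_joint, num_samples=1000):
    """Importance-sampling estimate of log p(x) (simple, non-annealed form).

    sample_q  : draws z ~ q(z | x)                       (hypothetical)
    log_q     : evaluates log q(z | x), called as log_q(z, x)
    log_joint : evaluates log p(x, z)                    (hypothetical)

    The estimator of p(x) is unbiased; taking its logarithm gives a
    conservative (downward-biased) estimate of log p(x) by Jensen's inequality.
    """
    Z = [sample_q(x) for _ in range(num_samples)]
    log_w = np.array([log_joint(x, z) - log_q(z, x) for z in Z])  # log weights
    return logsumexp(log_w) - np.log(num_samples)
```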

Performance of the blocked Gibbs sampler

We compare the performance of the Gibbs sampler to that of Hamiltonian Monte Carlo (HMC) and Metropolis-adjusted Langevin (MALA) sampling.

The upper plots show trace plots of the average posterior energy; the lower plots show autocorrelation functions averaged over several data points.

$$R(\tau) = \frac{\mathbb{E}\left[(z_t - \mu)^\top (z_{t+\tau} - \mu) \mid x\right]}{\mathbb{E}\left[(z_t - \mu)^\top (z_t - \mu) \mid x\right]}$$
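A minimal sketch of estimating $R(\tau)$ from a recorded chain, using the empirical chain mean in place of $\mu$:

```python
import numpy as np

def autocorrelation(Z, max_lag):
    """Estimate R(tau) for a chain of samples Z with shape (T, dim).

    Centers the chain at its empirical mean and averages the inner products
    (z_t - mu)^T (z_{t+tau} - mu) over time steps t; R(0) is normalized to 1.
    """
    Zc = Z - Z.mean(axis=0)                      # center the chain
    var = np.mean(np.sum(Zc * Zc, axis=1))       # normalizer
    R = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        R[tau] = np.mean(np.sum(Zc[:len(Zc) - tau] * Zc[tau:], axis=1)) / var
    return R
```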

Overcompleteness and sparsity

When applied to natural image patches, the model learns highly sparse source distributions and three to four times overcomplete representations.

Model comparison

We compared the performance of the OLM with a product of experts (PoE) model, which represents another generalization of the linear model to overcomplete representations.

$$s = Wx, \qquad p(x) \propto \prod_i f_i(s_i)$$

Here, we use a product of Student-t experts (PoT).
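For concreteness, here is a sketch of the unnormalized PoT log-density with the standard Student-t experts $f_i(s_i) = (1 + s_i^2/2)^{-\alpha_i}$; the filter matrix $W$ and exponents $\alpha$ are placeholders.

```python
import numpy as np

def pot_unnorm_log_density(x, W, alpha):
    """Unnormalized log-density of a product of Student-t (PoT) experts.

    x     : image patch, shape (M,)
    W     : filter matrix, shape (N, M), with N > M for overcomplete models
    alpha : expert shape parameters, shape (N,)

    Each expert contributes f_i(s_i) = (1 + s_i^2 / 2)^(-alpha_i) with s = W x.
    The normalization constant is generally intractable, so only the
    unnormalized density is evaluated here.
    """
    s = W @ x
    return -np.sum(alpha * np.log1p(0.5 * s**2))
```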

Conclusions and remarks

• Efficient maximum likelihood learning in overcomplete linear models is possible and enables us to jointly optimize filters and sources.

• A linear model for natural images can benefit from overcomplete representations if the source distributions are highly leptokurtic.

• The presented algorithm can easily be extended to more powerful models of natural images such as subspace or bilinear models.

Resources

Code for training and evaluating overcomplete linear models:

http://bethgelab.org/code/theis2012d/

References

[1] M. Lewicki and B. A. Olshausen, JOSA A, 1999
[2] P. Berkes, R. Turner, and M. Sahani, NIPS 21, 2007
[3] M. Seeger, JMLR, 2008
[4] D. Zoran and Y. Weiss, NIPS 25, 2012
[5] R.-B. Chen and Y. N. Wu, Computational Statistics & Data Analysis, 2007

Acknowledgements

This study was financially supported through the Bernstein award (BMBF; FKZ: 01GQ0601), the German Research Foundation (DFG; priority program 1527, Sachbeihilfe BE 3848/2-1), and a DFG-NSF collaboration grant (TO 409/8-1).


[Figure: log-likelihood ± SEM in bit/pixel on 8 × 8 and 16 × 16 image patches for the Gaussian model, the GSM, the complete linear model (1x), the OLM (2x–4x), and the PoT (2x–4x) as a function of overcompleteness.]

[Figure: trace plots of the average posterior energy (top) and autocorrelation functions (bottom) as a function of wall-clock time, for a toy model and the image model, comparing MALA, HMC, and the blocked Gibbs sampler.]


[Figure: basis vector norms ‖a_i‖ as a function of the basis coefficient index i, for the Laplace model at 2x overcompleteness and GSM models at 2x, 3x, and 4x overcompleteness.]