
# Tutorial: restricted Boltzmann machines

Chris J. Maddison

March 27, 2014

## 1 restricted Boltzmann machines

A Boltzmann machine is a family of probability distributions over binary vectors $\mathbf{s}$ of length $K$,

$$P(\mathbf{s}) = \frac{1}{Z} \exp\bigg( \sum_{i<j} W_{ij} s_i s_j + \sum_{i} b_i s_i \bigg)$$

where $Z$ is the normalizing constant. A *restricted* Boltzmann machine (RBM) splits the units into a layer of visible units $\mathbf{v}$ and a layer of hidden units $\mathbf{h}$, with connections only between the two layers. Writing $E(\mathbf{v},\mathbf{h}) = \mathbf{h}^T W \mathbf{v} + \mathbf{h}^T \mathbf{b} + \mathbf{v}^T \mathbf{c}$ (note the sign convention used throughout: larger $E$ means higher probability),

$$P(\mathbf{v},\mathbf{h}) = \frac{\exp(E(\mathbf{v},\mathbf{h}))}{Z}$$
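For very small models everything above can be computed by brute force, which is handy for building intuition and for testing. Below is a minimal numpy sketch (not part of the original tutorial; the sizes, seed, and parameter values are illustrative) that enumerates all joint states to get $Z$ exactly, under the tutorial's sign convention $P(\mathbf{v},\mathbf{h}) = \exp(E(\mathbf{v},\mathbf{h}))/Z$.

```python
# A minimal sketch: brute-force Z and P(v, h) for a tiny RBM.
# Convention from the tutorial: E(v, h) = h^T W v + h^T b + v^T c, P = exp(E)/Z.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2          # illustrative sizes
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = rng.normal(scale=0.1, size=n_hidden)    # hidden biases
c = rng.normal(scale=0.1, size=n_visible)   # visible biases

def energy(v, h):
    # "Goodness" E(v, h); no minus sign, matching the tutorial's convention.
    return h @ W @ v + h @ b + v @ c

# Enumerate all 2^(n_visible + n_hidden) joint states to get Z exactly.
all_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_visible)]
all_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_hidden)]
Z = sum(np.exp(energy(v, h)) for v in all_v for h in all_h)

def joint_prob(v, h):
    return np.exp(energy(v, h)) / Z

# Sanity check: probabilities over all joint states sum to 1.
assert np.isclose(sum(joint_prob(v, h) for v in all_v for h in all_h), 1.0)
```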

## 2 gradients

Fit an RBM to a data set of bit vectors $(\mathbf{v}_1, \ldots, \mathbf{v}_N)$ by following the average gradient (with respect to the parameters $W$, $\mathbf{b}$, $\mathbf{c}$) of the log likelihood:

$$\frac{1}{N} \sum_{n=1}^{N} \nabla \log P(\mathbf{v}_n)$$

We need the partial derivatives of

$$\log P(\mathbf{v}_n) = \log\bigg(\sum_{\mathbf{h}} P(\mathbf{v}_n,\mathbf{h})\bigg) = \log\bigg(\sum_{\mathbf{h}} \frac{\exp(E(\mathbf{v}_n,\mathbf{h}))}{Z}\bigg) = \log\bigg(\sum_{\mathbf{h}} \exp(E(\mathbf{v}_n,\mathbf{h}))\bigg) - \log Z$$

We show how to derive the derivative of the first term. Recall,

$$\frac{\partial}{\partial \theta} \log f(\theta) = \frac{\frac{\partial}{\partial \theta} f(\theta)}{f(\theta)} \qquad\qquad \frac{\partial}{\partial \theta} \exp f(\theta) = \exp(f(\theta))\, \frac{\partial}{\partial \theta} f(\theta)$$

So for a parameter $\theta$,

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \log\bigg(\sum_{\mathbf{h}} \exp(E(\mathbf{v}_n,\mathbf{h}))\bigg)
&= \frac{1}{\sum_{\mathbf{h}} \exp(E(\mathbf{v}_n,\mathbf{h}))} \sum_{\mathbf{h}} \frac{\partial}{\partial \theta} \exp(E(\mathbf{v}_n,\mathbf{h})) \\
&= \frac{1}{\sum_{\mathbf{h}'} \exp(E(\mathbf{v}_n,\mathbf{h}'))} \sum_{\mathbf{h}} \exp(E(\mathbf{v}_n,\mathbf{h}))\, \frac{\partial}{\partial \theta} E(\mathbf{v}_n,\mathbf{h}) \\
&= \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n)\, \frac{\partial}{\partial \theta} E(\mathbf{v}_n,\mathbf{h}) \\
&= \mathbb{E}\bigg[\frac{\partial}{\partial \theta} E(\mathbf{v},\mathbf{h}) \,\bigg|\, \mathbf{v} = \mathbf{v}_n\bigg]
\end{aligned}
$$

where the third line uses $P(\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n) = \exp(E(\mathbf{v}_n,\mathbf{h})) / \sum_{\mathbf{h}'} \exp(E(\mathbf{v}_n,\mathbf{h}'))$. A similar trick works for the second term and we get the partial derivative

$$\frac{\partial}{\partial \theta} \log P(\mathbf{v}_n) = \underbrace{\mathbb{E}\bigg[\frac{\partial}{\partial \theta} E(\mathbf{v},\mathbf{h}) \,\bigg|\, \mathbf{v} = \mathbf{v}_n\bigg]}_{\text{positive statistic}} - \underbrace{\mathbb{E}\bigg[\frac{\partial}{\partial \theta} E(\mathbf{v},\mathbf{h})\bigg]}_{\text{negative statistic}}$$


For the RBM,

$$
\begin{aligned}
\frac{\partial}{\partial W_{ij}} \log P(\mathbf{v}_n) &= \mathbb{E}[h_i v_j \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[h_i v_j] \\
\frac{\partial}{\partial b_i} \log P(\mathbf{v}_n) &= \mathbb{E}[h_i \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[h_i] \\
\frac{\partial}{\partial c_j} \log P(\mathbf{v}_n) &= \mathbb{E}[v_j \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[v_j]
\end{aligned}
$$

This is how it corresponds to the notation in the lectures:

$$\mathbb{E}[h_i v_j \mid \mathbf{v} = \mathbf{v}_n] = \langle h_i v_j \rangle_{\text{data}}$$

That is the expected value under the model of the product of hidden unit $i$ and visible unit $j$ when $\mathbf{v}$ is clamped to $\mathbf{v}_n$, and

$$\mathbb{E}[h_i v_j] = \langle h_i v_j \rangle_{\text{model}}$$

is the expected number of times that $h_i$ and $v_j$ are both on if we sample from the model. We can vectorize everything:

$$E(\mathbf{v},\mathbf{h}) = \mathbf{h}^T W \mathbf{v} + \mathbf{h}^T \mathbf{b} + \mathbf{v}^T \mathbf{c}$$

with gradients

$$
\begin{aligned}
\nabla_W \log P(\mathbf{v}_n) &= \mathbb{E}[\mathbf{h}\mathbf{v}^T \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[\mathbf{h}\mathbf{v}^T] \\
\nabla_{\mathbf{b}} \log P(\mathbf{v}_n) &= \mathbb{E}[\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[\mathbf{h}] \\
\nabla_{\mathbf{c}} \log P(\mathbf{v}_n) &= \mathbb{E}[\mathbf{v} \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[\mathbf{v}]
\end{aligned}
$$

Remember: to get the gradient on a batch we have to average the individual gradients!
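For a model small enough to enumerate, these vectorized expectations can be computed exactly, which gives a ground truth to test any approximate learning rule against. A sketch continuing the brute-force snippet above (`exact_gradients` is a hypothetical helper name, not from the tutorial):

```python
# Exact positive and negative statistics for one data vector vn,
# matching the vectorized gradients above. Continues the earlier sketch.
def exact_gradients(vn):
    # Positive statistics: expectations under P(h | v = vn).
    unnorm = np.array([np.exp(energy(vn, h)) for h in all_h])
    p_h_given_v = unnorm / unnorm.sum()
    Eh_given_v = sum(p * h for p, h in zip(p_h_given_v, all_h))

    # Negative statistics: expectations under the full joint P(v, h).
    E_hvT = sum(joint_prob(v, h) * np.outer(h, v) for v in all_v for h in all_h)
    E_h = sum(joint_prob(v, h) * h for v in all_v for h in all_h)
    E_v = sum(joint_prob(v, h) * v for v in all_v for h in all_h)

    grad_W = np.outer(Eh_given_v, vn) - E_hvT   # E[h v^T | v=vn] - E[h v^T]
    grad_b = Eh_given_v - E_h                   # E[h | v=vn]   - E[h]
    grad_c = vn - E_v                           # E[v | v=vn] = vn, minus E[v]
    return grad_W, grad_b, grad_c
```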

## 3 computing gradients & contrastive divergence

In this section we talk about how to compute $\mathbb{E}[h_i v_j \mid \mathbf{v} = \mathbf{v}_n] - \mathbb{E}[h_i v_j]$, or approximations to it. For the positive statistic we are conditioning on $\mathbf{v}_n$, so we can take $v_j = v_{nj}$ out of the expected value:

$$\mathbb{E}[h_i v_j \mid \mathbf{v} = \mathbf{v}_n] = \mathbb{E}[h_i \mid \mathbf{v} = \mathbf{v}_n]\, v_{nj}$$

$\mathbb{E}[h_i \mid \mathbf{v} = \mathbf{v}_n]$ is just the probability that $h_i$ is on when $\mathbf{v}$ is clamped; this is sometimes called the activation:

$$\mathbb{E}[h_i \mid \mathbf{v} = \mathbf{v}_n] = \frac{1}{1 + \exp\big(-\sum_j W_{ij} v_{nj} - b_i\big)}$$

Two quick notes about this:

- $\sigma(x) = 1/(1 + \exp(-x))$ is called the logistic function
- I will use the convention that $\sigma(\mathbf{x})$ of a vector $\mathbf{x}$ is taken component-wise

So we can see that

$$\mathbb{E}[\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n] = \sigma(W\mathbf{v}_n + \mathbf{b})$$
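In code, the component-wise convention makes the activations a one-liner; a minimal sketch, continuing the snippets above:

```python
def sigmoid(x):
    # Logistic function, applied component-wise to arrays.
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activations(vn):
    # E[h | v = vn] = sigma(W vn + b)
    return sigmoid(W @ vn + b)
```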

The negative statistic is the real problem. With $M$ true samples $(\mathbf{v}_m, \mathbf{h}_m)$ from the distribution defined by the RBM, we could approximate

$$\mathbb{E}[h_i v_j] \approx \frac{1}{M} \sum_{m=1}^{M} h_{mi} v_{mj}$$

We can get these samples by initializing $N$ independent Markov chains, one at each data point $\mathbf{v}_n$, and running them until convergence to get samples $(\mathbf{v}_n^{\infty}, \mathbf{h}_n^{\infty})$. Then,

$$\mathbb{E}[h_i v_j] \approx \frac{1}{N} \sum_{n=1}^{N} h_{ni}^{\infty} v_{nj}^{\infty}$$

The type of Markov transition operator used most often is alternating Gibbs sampling:

$$
\begin{aligned}
\mathbf{v}_n^0 &= \mathbf{v}_n \\
\mathbf{h}_n^k &\sim P(\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n^k) \quad \text{for } k \geq 0 \\
\mathbf{v}_n^k &\sim P(\mathbf{v} \mid \mathbf{h} = \mathbf{h}_n^{k-1}) \quad \text{for } k \geq 1
\end{aligned}
$$

In pictures, the chain alternates $\mathbf{v}_n^0 \to \mathbf{h}_n^0 \to \mathbf{v}_n^1 \to \mathbf{h}_n^1 \to \cdots$ (figure omitted). Sampling from $P(\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n^k)$ is easy: compute $\mathbb{E}[\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n^k]$ and sample each bit independently with probability $\mathbb{E}[h_i \mid \mathbf{v} = \mathbf{v}_n^k]$. Similarly for $P(\mathbf{v} \mid \mathbf{h} = \mathbf{h}_n^{k-1})$.
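A sketch of the alternating Gibbs chain, continuing the snippets above (`sample_bernoulli` and `gibbs_chain` are hypothetical helper names, not from the tutorial):

```python
def sample_bernoulli(p):
    # Sample each bit independently with the given probabilities.
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(vn, n_steps):
    # Alternate h^k ~ P(h | v = v^k) and v^k ~ P(v | h = h^{k-1}).
    v = vn.astype(float)
    for _ in range(n_steps):
        h = sample_bernoulli(sigmoid(W @ v + b))    # P(h | v) = sigma(W v + b)
        v = sample_bernoulli(sigmoid(W.T @ h + c))  # P(v | h) = sigma(W^T h + c)
    h = sample_bernoulli(sigmoid(W @ v + b))
    return v, h  # (v^k, h^k) after n_steps alternations
```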

The idea behind contrastive divergence is to run the Markov chain for only one step, get samples $(\mathbf{v}_n^1, \mathbf{h}_n^1)$, and hope that

$$\mathbb{E}[h_i v_j] \approx \frac{1}{N} \sum_{n=1}^{N} h_{ni}^1 v_{nj}^1$$

Because these estimates are often noisy, we use the following smoothed reconstructions in their place in the gradient calculations:

$$
\begin{aligned}
\hat{\mathbf{v}}_n^1 &= \mathbb{E}[\mathbf{v} \mid \mathbf{h} = \mathbf{h}_n^0] = \sigma(W^T \mathbf{h}_n^0 + \mathbf{c}) \\
\hat{\mathbf{h}}_n^1 &= \sigma\big(W\, \mathbb{E}[\mathbf{v} \mid \mathbf{h} = \mathbf{h}_n^0] + \mathbf{b}\big) = \sigma(W \hat{\mathbf{v}}_n^1 + \mathbf{b})
\end{aligned}
$$


In brief, we compute the contrastive divergence gradients on data point $\mathbf{v}_n$ as follows:

$$
\begin{aligned}
\mathbf{h}_n^0 &\sim P(\mathbf{h} \mid \mathbf{v} = \mathbf{v}_n) \\
\hat{\mathbf{v}}_n^1 &= \sigma(W^T \mathbf{h}_n^0 + \mathbf{c}) \\
\hat{\mathbf{h}}_n^1 &= \sigma(W \hat{\mathbf{v}}_n^1 + \mathbf{b})
\end{aligned}
$$

$$
\begin{aligned}
\nabla_W^{\mathrm{CD}} \log P(\mathbf{v}_n) &= \sigma(W\mathbf{v}_n + \mathbf{b})\, \mathbf{v}_n^T - \hat{\mathbf{h}}_n^1 (\hat{\mathbf{v}}_n^1)^T \\
\nabla_{\mathbf{b}}^{\mathrm{CD}} \log P(\mathbf{v}_n) &= \sigma(W\mathbf{v}_n + \mathbf{b}) - \hat{\mathbf{h}}_n^1 \\
\nabla_{\mathbf{c}}^{\mathrm{CD}} \log P(\mathbf{v}_n) &= \mathbf{v}_n - \hat{\mathbf{v}}_n^1
\end{aligned}
$$

To get the gradient on a batch, just average these individual gradients.
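The whole CD-1 recipe is only a few lines. A sketch continuing the snippets above, including a toy batch update (the learning rate and batch are illustrative; note we *add* the averaged gradient since we are ascending the log likelihood):

```python
def cd1_gradients(vn):
    # One step of contrastive divergence with smoothed reconstructions.
    pos_h = sigmoid(W @ vn + b)        # E[h | v = vn]
    h0 = sample_bernoulli(pos_h)       # h^0 ~ P(h | v = vn)
    v1 = sigmoid(W.T @ h0 + c)         # smoothed reconstruction of v
    h1 = sigmoid(W @ v1 + b)           # smoothed reconstruction of h
    grad_W = np.outer(pos_h, vn) - np.outer(h1, v1)
    grad_b = pos_h - h1
    grad_c = vn - v1
    return grad_W, grad_b, grad_c

# Toy usage: average CD-1 gradients over a batch and take a small step.
batch = [np.array([1., 0., 1.]), np.array([0., 1., 1.])]  # illustrative bit vectors
grads = [cd1_gradients(vn) for vn in batch]
lr = 0.1  # illustrative learning rate
W += lr * np.mean([g[0] for g in grads], axis=0)
b += lr * np.mean([g[1] for g in grads], axis=0)
c += lr * np.mean([g[2] for g in grads], axis=0)
```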
