
Tutorial: restricted Boltzmann machines

Chris J. Maddison

March 27, 2014

1 restricted Boltzmann machines

A Boltzmann machine is a family of probability distributions over binary vectors s of length K,

P(s) = \frac{\exp\left( \sum_{1 \le i < j \le K} W_{ij} s_i s_j + \sum_{i=1}^{K} b_i s_i \right)}{Z}

2 gradients

Fit an RBM to a data set of bit vectors (v_1, \ldots, v_N) by following the average gradient (with respect to the parameters W, b, c)

\frac{1}{N} \sum_{n=1}^{N} \nabla \log P(v_n)

We need the partial derivatives of

\log P(v_n) = \log\Big( \sum_h P(v_n, h) \Big) = \log\Big( \sum_h \frac{\exp(E(v_n, h))}{Z} \Big) = \log\Big( \sum_h \exp(E(v_n, h)) \Big) - \log Z

We show how to derive the derivative of the first term. Recall,

\frac{\partial}{\partial \theta} \log f(\theta) = \frac{\frac{\partial}{\partial \theta} f(\theta)}{f(\theta)}
\qquad\qquad
\frac{\partial}{\partial \theta} \exp f(\theta) = \exp(f(\theta)) \frac{\partial}{\partial \theta} f(\theta)

So for parameter \theta,

\frac{\partial}{\partial \theta} \log\Big( \sum_h \exp(E(v_n, h)) \Big)
= \frac{1}{\sum_h \exp(E(v_n, h))} \sum_h \frac{\partial}{\partial \theta} \exp(E(v_n, h))
= \frac{1}{\sum_h \exp(E(v_n, h))} \sum_h \exp(E(v_n, h)) \frac{\partial E(v_n, h)}{\partial \theta}
= \sum_h P(h \mid v = v_n) \frac{\partial E(v_n, h)}{\partial \theta}
= E\Big[ \frac{\partial E(v, h)}{\partial \theta} \,\Big|\, v = v_n \Big]

A similar trick works for the second term and we get the partial derivative

\frac{\partial}{\partial \theta} \log P(v_n) =
\underbrace{E\Big[ \frac{\partial E(v, h)}{\partial \theta} \,\Big|\, v = v_n \Big]}_{\text{positive statistic}}
- \underbrace{E\Big[ \frac{\partial E(v, h)}{\partial \theta} \Big]}_{\text{negative statistic}}


For the RBM

\frac{\partial}{\partial W_{ij}} \log P(v_n) = E[h_i v_j \mid v = v_n] - E[h_i v_j]
\frac{\partial}{\partial b_i} \log P(v_n) = E[h_i \mid v = v_n] - E[h_i]
\frac{\partial}{\partial c_j} \log P(v_n) = E[v_j \mid v = v_n] - E[v_j]

This is how it corresponds to the notation in the lectures:

E[h_i v_j \mid v = v_n] = \langle h_i v_j \rangle_{\text{data}}

That is the expected value under the model of the product of hidden unit i and visible unit j when v is clamped to v_n, and

E[h_i v_j] = \langle h_i v_j \rangle_{\text{model}}

is the expected number of times that h_i and v_j are both on if we sample from the model. We can vectorize everything:

E(v, h) = h^T W v + h^T b + v^T c

with gradients

\nabla_W \log P(v_n) = E[h v^T \mid v = v_n] - E[h v^T]
\nabla_b \log P(v_n) = E[h \mid v = v_n] - E[h]
\nabla_c \log P(v_n) = E[v \mid v = v_n] - E[v]

Remember to get a gradient on a batch we have to average the individual gradients!
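As a rough illustrative sketch (not part of the original tutorial), the vectorized gradients can be written in a few lines of NumPy. The names and shapes below are my own assumptions: W is an H x V weight matrix, b and c are the hidden and visible biases, and the clamped configurations and model samples are assumed to be given already; how to obtain the model samples is the subject of the next section.

```python
import numpy as np

def rbm_gradients(V_data, H_data, V_model, H_model):
    """Average log-likelihood gradients for an RBM with energy
    E(v, h) = h^T W v + h^T b + v^T c.

    V_data, H_data:   (N, V) and (N, H) arrays; visible vectors clamped to the
                      data and hidden configurations (or activations) given them.
    V_model, H_model: (M, V) and (M, H) arrays of samples from the model.
    Returns (dW, db, dc) with the shapes of W (H, V), b (H,), c (V,).
    """
    N, M = V_data.shape[0], V_model.shape[0]
    # Positive statistics: expectations with v clamped to the data.
    pos_W = H_data.T @ V_data / N      # E[h v^T | v = v_n], averaged over n
    pos_b = H_data.mean(axis=0)        # E[h | v = v_n]
    pos_c = V_data.mean(axis=0)        # E[v | v = v_n]
    # Negative statistics: expectations under the model, from samples.
    neg_W = H_model.T @ V_model / M    # E[h v^T]
    neg_b = H_model.mean(axis=0)       # E[h]
    neg_c = V_model.mean(axis=0)       # E[v]
    return pos_W - neg_W, pos_b - neg_b, pos_c - neg_c
```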

3 computing gradients & contrastive divergence

In this section we talk about how to compute E[h_i v_j \mid v = v_n] - E[h_i v_j], or approximations to it. For the positive statistic we are conditioning on v_n, so we can take v_{nj} out of the expected value:

E[h_i \mid v = v_n] \, v_{nj}

E[h_i \mid v = v_n] is just the probability that h_i is on when v is clamped; this is sometimes called the activation:

E[h_i \mid v = v_n] = \frac{1}{1 + \exp\big(-\sum_j W_{ij} v_{nj} - b_i\big)}

Two quick notes about this:


• \sigma(x) = 1/(1 + \exp(-x)) is called the logistic function

• I will use the convention that \sigma(x) of a vector x is taken component-wise

So we can see that

E[h \mid v = v_n] = \sigma(W v_n + b)
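As a small sketch (the function and variable names are my own assumptions, not the tutorial's), the clamped hidden activations \sigma(W v_n + b) can be computed for a whole batch at once:

```python
import numpy as np

def sigmoid(x):
    # Logistic function, applied component-wise.
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activations(W, b, V_batch):
    """E[h | v = v_n] = sigmoid(W v_n + b) for every row v_n of V_batch.

    W: (H, V) weights, b: (H,) hidden biases, V_batch: (N, V) visible vectors.
    Returns an (N, H) array of activation probabilities.
    """
    return sigmoid(V_batch @ W.T + b)
```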

The negative statistic is the real problem. With M true samples (v_m, h_m) from the distribution defined by the RBM, we could approximate

E[h_i v_j] \approx \frac{1}{M} \sum_{m=1}^{M} h_{mi} v_{mj}

We can get these samples by initializing N independent Markov chains, one at each data point v_n, and running them until convergence to obtain (v_n^\infty, h_n^\infty). Then,

E[h_i v_j] \approx \frac{1}{N} \sum_{n=1}^{N} h_{ni}^\infty v_{nj}^\infty
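Given such converged samples, this Monte Carlo estimate is just an average of outer products. A minimal sketch, assuming hypothetical arrays V_inf and H_inf of converged visible and hidden samples (not from the tutorial):

```python
import numpy as np

def negative_statistic(V_inf, H_inf):
    """Estimate of E[h_i v_j] as the (i, j) entry of the averaged outer product.

    V_inf: (N, V) visible samples, H_inf: (N, H) hidden samples after convergence.
    """
    return H_inf.T @ V_inf / V_inf.shape[0]
```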

The type of Markov transition operator used most often is alternating Gibbs:

v_n^0 = v_n
h_n^k \sim P(h \mid v = v_n^k) \quad \text{for } k \ge 0
v_n^k \sim P(v \mid h = h_n^{k-1}) \quad \text{for } k \ge 1

and in pictures:

[Figure: the alternating Gibbs chain v_n^0 \to h_n^0 \to v_n^1 \to h_n^1 \to \cdots]

Sampling from P(h \mid v = v_n^k) is easy: compute E[h \mid v = v_n^k] and sample each bit independently with probability E[h_i \mid v = v_n^k]. Similarly for P(v \mid h = h_n^{k-1}).
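A rough NumPy sketch of this alternating Gibbs operator under the conventions above (W is H x V, b the hidden bias, c the visible bias; all names are assumptions of mine, not the tutorial's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, b, V):
    """Sample each hidden bit independently with probability sigmoid(W v + b)."""
    probs = sigmoid(V @ W.T + b)            # (N, H) activations
    return (rng.random(probs.shape) < probs).astype(float)

def sample_v_given_h(W, c, H):
    """Sample each visible bit independently with probability sigmoid(W^T h + c)."""
    probs = sigmoid(H @ W + c)              # (N, V) activations
    return (rng.random(probs.shape) < probs).astype(float)

def gibbs_chain(W, b, c, V0, steps):
    """Run `steps` alternating Gibbs updates starting from the visible vectors V0."""
    V = V0
    H = sample_h_given_v(W, b, V)           # h^0 given v^0 = data
    for _ in range(steps):
        V = sample_v_given_h(W, c, H)       # v^k given h^{k-1}
        H = sample_h_given_v(W, b, V)       # h^k given v^k
    return V, H
```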

The idea behind contrastive divergence is to run the Markov chain for only one step, get samples (v_n^1, h_n^1), and hope that

E[h_i v_j] \approx \frac{1}{N} \sum_{n=1}^{N} h_{ni}^1 v_{nj}^1

Because these estimates are often noisy, we use the following smoothed reconstructions in their place in gradient calculations:

\bar{v}_n^1 = E[v \mid h = h_n^0] = \sigma(W^T h_n^0 + c)
\bar{h}_n^1 = \sigma(W \, E[v \mid h = h_n^0] + b) = \sigma(W \bar{v}_n^1 + b)


In brief we compute the contrastive divergence gradients on data point v_n as follows:

h_n^0 \sim P(h \mid v = v_n)
\bar{v}_n^1 = \sigma(W^T h_n^0 + c)
\bar{h}_n^1 = \sigma(W \bar{v}_n^1 + b)

\nabla_W^{CD} \log P(v_n) = \sigma(W v_n + b) v_n^T - \bar{h}_n^1 (\bar{v}_n^1)^T
\nabla_b^{CD} \log P(v_n) = \sigma(W v_n + b) - \bar{h}_n^1
\nabla_c^{CD} \log P(v_n) = v_n - \bar{v}_n^1

To get the gradient on a batch, just average these individual gradients.
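Putting the pieces together, here is an illustrative NumPy sketch of the CD-1 gradients for a batch, using the smoothed reconstructions above. The function and variable names, and the step-size in the commented update, are my own assumptions rather than part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradients(W, b, c, V_batch):
    """One-step contrastive divergence gradients, averaged over a batch.

    W: (H, V) weights, b: (H,) hidden biases, c: (V,) visible biases,
    V_batch: (N, V) batch of binary data vectors v_n.
    """
    N = V_batch.shape[0]
    pos_h = sigmoid(V_batch @ W.T + b)              # sigma(W v_n + b), shape (N, H)
    h0 = (rng.random(pos_h.shape) < pos_h) * 1.0    # h_n^0 ~ P(h | v = v_n)
    v1 = sigmoid(h0 @ W + c)                        # smoothed reconstruction of v
    h1 = sigmoid(v1 @ W.T + b)                      # smoothed reconstruction of h
    dW = (pos_h.T @ V_batch - h1.T @ v1) / N        # positive minus negative outer products
    db = (pos_h - h1).mean(axis=0)
    dc = (V_batch - v1).mean(axis=0)
    return dW, db, dc

# Example gradient-ascent update on the average log-likelihood (step size assumed):
# dW, db, dc = cd1_gradients(W, b, c, V_batch)
# W += 0.01 * dW; b += 0.01 * db; c += 0.01 * dc
```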

