
Page 1:

Herding Dynamical Weights

Max Welling
Bren School of Information and Computer Science
UC Irvine

Page 2:

Motivation

• X_i = 1 means that pin i will fall during a bowling round; X_i = 0 means that pin i will remain standing.

• You are given the pairwise probabilities P(X_i, X_j).

• Task: predict the distribution Q(n), n = 0, .., 10, of the total number of pins that will fall.

Stock market analogue: X_i = 1 means that company i defaults. You are interested in the probability of n companies in your portfolio defaulting.

Page 3:

Sneak Preview

Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown)

(Note: herding is a deterministic algorithm, no noise was added)

Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments.

Quiz: which is which [top/bottom]?

-data in random order.

-herding sequence in order received.

Page 4:

Traditional Approach: Hopfield Nets & Boltzmann Machines

w_{ij} is a weight; s_i is a state value (say 0/1).

Energy:

E(s, w) = -\sum_{ij} w_{ij} s_i s_j

Probability of a joint state:

P(s) = \frac{1}{Z(w)} \exp\Big( \sum_{ij} w_{ij} s_i s_j \Big)

Coordinate descent on energy:

S_i \leftarrow \mathbb{I}\Big[ \sum_j W_{ij} S_j > 0 \Big]
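To make this setup concrete, here is a minimal Python sketch (mine, not from the talk) of a small fully visible Boltzmann machine: the energy, the exact Boltzmann distribution by brute-force enumeration (only feasible because n is tiny), and one coordinate-descent sweep. The weight matrix and the size n = 5 are illustrative assumptions.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of binary units; tiny so Z is tractable (assumption)
W = rng.normal(scale=0.5, size=(n, n))  # illustrative symmetric weights with zero diagonal
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

def energy(s, W):
    # E(s, w) = -sum_ij w_ij s_i s_j with 0/1 states
    return -s @ W @ s

def boltzmann_probs(W):
    # exact P(s) = exp(-E(s, w)) / Z(w) by enumerating all 2^n states
    states = np.array(list(itertools.product([0, 1], repeat=W.shape[0])))
    logp = np.array([-energy(s, W) for s in states])
    p = np.exp(logp - logp.max())       # stabilize the exponentiation
    return states, p / p.sum()

def coordinate_descent_sweep(s, W):
    # one sweep of S_i <- I[sum_j W_ij S_j > 0]; never increases the energy
    s = s.copy()
    for i in range(len(s)):
        s[i] = 1 if W[i] @ s > 0 else 0
    return s

s0 = rng.integers(0, 2, size=n)
print("energy before:", energy(s0, W), " after one sweep:", energy(coordinate_descent_sweep(s0, W), W))
states, p = boltzmann_probs(W)
print("most probable joint state:", states[np.argmax(p)])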

Page 5:

Traditional Learning Approach

P(X) \propto \exp\Big( \sum_{ij} W_{ij} X_i X_j + \sum_i \alpha_i X_i \Big)

Maximum-likelihood gradient updates (stepsize \eta):

W_{ij} \leftarrow W_{ij} + \eta \big( \langle X_i X_j \rangle_{data} - \langle X_i X_j \rangle_P \big)

\alpha_i \leftarrow \alpha_i + \eta \big( \langle X_i \rangle_{data} - \langle X_i \rangle_P \big)

Then answer the query:

Q(n) = \sum_S P(S) \, \mathbb{I}\Big[ \sum_i S_i = n \Big], \quad n = 0, .., 10

The model expectations \langle \cdot \rangle_P are intractable: use CD instead!
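As a hedged illustration of these updates (not the talk's code), the sketch below runs maximum-likelihood gradient ascent with the model expectations computed by exact summation over all 2^d states, which only works because d is tiny; the synthetic data, stepsize eta, and iteration count are my own assumptions. It is exactly this enumeration that becomes intractable for realistic d, hence CD.

import itertools
import numpy as np

rng = np.random.default_rng(1)
d, N, eta = 4, 500, 0.05                               # tiny toy sizes; illustrative assumptions
X = rng.integers(0, 2, size=(N, d)).astype(float)      # synthetic stand-in "data"

data_xx = X.T @ X / N                                  # <X_i X_j>_data (diagonal = <X_i>_data)
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

def model_moments(W, alpha):
    # exact <X_i X_j>_P and <X_i>_P by summing over all 2^d states (intractable for large d)
    logp = np.einsum('si,ij,sj->s', states, W, states) + states @ alpha
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return np.einsum('s,si,sj->ij', p, states, states), states.T @ p

W, alpha = np.zeros((d, d)), np.zeros(d)
for _ in range(200):                                   # gradient ascent on the log-likelihood
    model_xx, model_x = model_moments(W, alpha)
    grad_W = data_xx - model_xx
    np.fill_diagonal(grad_W, 0.0)                      # the biases alpha carry the singleton terms
    W += eta * grad_W
    alpha += eta * (np.diag(data_xx) - model_x)

model_xx, model_x = model_moments(W, alpha)
print("max pairwise moment gap :", np.abs((data_xx - model_xx)[~np.eye(d, dtype=bool)]).max())
print("max singleton moment gap:", np.abs(np.diag(data_xx) - model_x).max())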

Page 6:

What’s Wrong With This?

• E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent).

• Slow convergence & local minima (only w/ hidden vars)

• Sampling can get stuck in local modes (slow mixing).

Page 7:

Solution in a Nutshell

Given only the data moments

\langle X_i \rangle_{data}, \quad \langle X_i X_j \rangle_{data}

a Nonlinear Dynamical System turns them directly into samples S whose moments match:

\langle S_i \rangle_S = \langle X_i \rangle_{data}, \quad \langle S_i S_j \rangle_S = \langle X_i X_j \rangle_{data}

so that Q(n) = \frac{1}{|S|} \sum_S \mathbb{I}\Big[ \sum_i S_i = n \Big], \quad n = 0, .., 10, can be read off from sample averages.

(sidestep learning + sampling)

Page 8:

Herding Dynamics

S_i = \mathbb{I}\Big[ \sum_j W_{ij} S_j + \alpha_i > 0 \Big]

W_{ij} \leftarrow W_{ij} + \langle X_i X_j \rangle_{data} - S_i S_j

\alpha_i \leftarrow \alpha_i + \langle X_i \rangle_{data} - S_i

(a small code sketch follows at the end of this slide)

• no stepsize

• no random numbers

• no exponentiation

• no point estimates

[Figure: states S_i and S_j coupled by weight W_{ij}, with bias \alpha_i]
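A minimal sketch of these herding updates, assuming synthetic stand-in data and a fixed number of coordinate-ascent sweeps for the inner maximization (both my choices, not the talk's). The time averages of the generated samples approach the data moments, which is the ergodicity property discussed later.

import numpy as np

rng = np.random.default_rng(2)
d, N, T = 10, 300, 2000                                # sizes are assumptions (10 "pins")
X = rng.integers(0, 2, size=(N, d)).astype(float)      # synthetic stand-in for the data
data_x = X.mean(axis=0)                                # <X_i>_data
data_xx = X.T @ X / N                                  # <X_i X_j>_data
np.fill_diagonal(data_xx, 0.0)

W, alpha = np.zeros((d, d)), np.zeros(d)
S = np.zeros(d)
samples = np.zeros((T, d))

for t in range(T):
    # inner maximization by a few sweeps of S_i = I[sum_j W_ij S_j + alpha_i > 0]
    for _ in range(5):
        for i in range(d):
            S[i] = 1.0 if W[i] @ S + alpha[i] > 0 else 0.0
    samples[t] = S
    # herding updates: no stepsize, no random numbers, no exponentiation
    W += data_xx - np.outer(S, S) * (1 - np.eye(d))
    alpha += data_x - S

off_diag = ~np.eye(d, dtype=bool)
print("max |<S_i>_S - <X_i>_data|        :", np.abs(samples.mean(0) - data_x).max())
print("max |<S_i S_j>_S - <X_i X_j>_data|:", np.abs((samples.T @ samples / T - data_xx)[off_diag]).max())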

Page 9:

Piston Analogy (weights = pistons)

Pistons move up at a constant rate (proportional to observed correlations)

When they get too high, the “fuel” combusts and the piston is pushed down (depression).

“Engine driven by observed correlations”

Page 10:

Herding Dynamics with General Features

S = \arg\max_S \sum_k W_k f_k(S)

W_k \leftarrow W_k + \langle f_k(X) \rangle_{data} - f_k(S)

(see the sketch after the bullet list below)

• no stepsize

• no random numbers

• no exponentiation

• no point estimates
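The same dynamics written for a generic feature map; a sketch under my own assumptions (4 binary variables, singleton plus pairwise-product features, argmax by brute force over all states), not the talk's code.

import itertools
import numpy as np

d = 4                                                  # assumed: 4 binary variables
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

def features(S):
    # assumed feature map f(S): all singletons and pairwise products
    pairs = [S[i] * S[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([S, pairs])

F = np.array([features(S) for S in states])            # f(S) precomputed for every state

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, d)).astype(float)    # synthetic stand-in data
f_data = np.array([features(x) for x in X]).mean(axis=0)   # <f_k(X)>_data

w = np.zeros(F.shape[1])
T, f_avg = 5000, np.zeros_like(f_data)
for _ in range(T):
    s_idx = int(np.argmax(F @ w))                      # S = argmax_S sum_k W_k f_k(S)
    w += f_data - F[s_idx]                             # W_k <- W_k + <f_k(X)>_data - f_k(S)
    f_avg += F[s_idx]
print("max moment gap:", np.abs(f_avg / T - f_data).max())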

Page 11:

Features as New Coordinates

[Figure: feature vectors f(S_1), .., f(S_5) span a polytope in feature space; the weight trajectory w_1, w_2, .., w_t, w_{t+1} circles the data mean \langle f \rangle_{data}]

\langle f \rangle_{data} = \frac{1}{N} \sum_{b=1}^{B} n_b f_b, \quad (n_1, .., n_B) \text{ the counts of the } B \text{ distinct configurations}

If \langle f(X) \rangle_{data} cannot be written exactly as such a finite average over configurations, then the period of the herding sequence is infinite.

thanks to Romain Thibaux

Page 12:

Example

Two features of a scalar X: a linear feature f_1(X) = X and a sinusoidal feature f_2(X).

Weights initialized on a grid; the red ball tracks one weight.

Convergence onto a fractal attractor set with Hausdorff dimension 1.5.
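To get the flavor of this animation one can run two-feature herding over a discretized scalar and record the weight trajectories; everything concrete below (the grid for X, the exact feature forms, the target moments) is my own assumption, since the slide's constants are not recoverable here. Plotting the trajectory tails (e.g. with matplotlib) reveals the attractor set.

import numpy as np

xs = np.linspace(-1.0, 1.0, 400)                       # assumed grid for the scalar X
F = np.stack([xs, np.sin(2 * np.pi * xs)], axis=1)     # assumed features: f_1(X)=X, f_2(X)=sin(2*pi*X)
f_target = np.array([0.1, 0.2])                        # assumed target ("data") moments

def herd(w0, steps):
    # run herding from initial weights w0 and return the visited weight trajectory
    w, traj = np.array(w0, dtype=float), []
    for _ in range(steps):
        s = int(np.argmax(F @ w))                      # maximize sum_k w_k f_k(X) over the grid
        w += f_target - F[s]                           # parameter-free weight update
        traj.append(w.copy())
    return np.array(traj)

# weights initialized on a grid: every start ends up wandering over the same bounded attractor set
starts = [(a, b) for a in np.linspace(-2, 2, 5) for b in np.linspace(-2, 2, 5)]
tails = np.vstack([herd(w0, steps=2000)[-500:] for w0 in starts])
print("attractor bounding box:", tails.min(axis=0), tails.max(axis=0))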

Page 13:

The Tipi Function

Herding performs stepsize-1 gradient updates on G(W):

G(W) = \sum_k W_k \langle f_k \rangle_{data} - \max_S \sum_k W_k f_k(S)

This function is:

• Concave
• Piecewise linear
• Non-positive
• Scale free

The herding updates

W_k \leftarrow W_k + \langle f_k \rangle_{data} - f_k(S), \quad S = \arg\max_S \sum_k W_k f_k(S)

are exactly such gradient steps: coordinate-wise optimization is replaced with full maximization.

The scale-free property implies that the stepsize will not affect the state sequence S.
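A small numerical check (my own toy setup: 3 binary variables with f_k(S) = S_k, synthetic data) that evaluates G(W) by enumeration, verifies the non-positive and scale-free properties, and shows that a herding update is a stepsize-1 step along \nabla G(W) = \langle f \rangle_{data} - f(S*).

import itertools
import numpy as np

d = 3                                                  # assumed toy setup
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
F = states.copy()                                      # assumed features: f_k(S) = S_k
rng = np.random.default_rng(4)
f_data = rng.integers(0, 2, size=(50, d)).mean(axis=0) # <f_k>_data from synthetic data

def G(w):
    # Tipi function: G(W) = sum_k W_k <f_k>_data - max_S sum_k W_k f_k(S)
    return w @ f_data - np.max(F @ w)

def herding_step(w):
    s = int(np.argmax(F @ w))                          # full maximization over states
    return w + (f_data - F[s])                         # gradient of G at w, applied with stepsize 1

w = rng.normal(size=d)
print("non-positive:", G(w) <= 0, " scale free:", np.isclose(G(2 * w), 2 * G(w)))
print("G before / after one herding step:", G(w), G(herding_step(w)))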

Page 14:

Recurrence

Thm: If we can find the optimal state S, then the weights will stay within a compact region.

Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.

Page 15:

Ergodicity

[Figure: a herding trajectory hopping between states s = 1, .., 6, producing the sequence s = [1, 1, 2, 5, 2, ...]]

\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k \rangle_{data}

Thm: If the 2-norm of the weights grows slower than linear, then feature averages over trajectories converge to data averages.

Page 16:

Relation to Maximum Entropy

Maximize H[P] subject to \langle f \rangle_P = \langle f \rangle_{data}

Dual:

Maximize L(W) = \sum_k W_k \langle f_k \rangle_{data} - \log \sum_x \exp\Big( \sum_k W_k f_k(x) \Big)

Tipi function:

G(W) = \lim_{T \to 0} T \, L(W / T)

Herding dynamics satisfies the constraints but does not achieve maximal entropy.
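The step connecting the dual to the Tipi function is a standard zero-temperature (log-sum-exp) limit:

T \, L(W/T) = \sum_k W_k \langle f_k \rangle_{data} - T \log \sum_x \exp\Big( \tfrac{1}{T} \sum_k W_k f_k(x) \Big) \;\to\; \sum_k W_k \langle f_k \rangle_{data} - \max_x \sum_k W_k f_k(x) = G(W) \quad (T \to 0),

since T \log \sum_x \exp(a(x)/T) \to \max_x a(x) as T \to 0.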

Page 17:

Advantages / Disadvantages

• Learning & inference have merged into one dynamical system.
• Fully tractable, although one should monitor whether local maximization is enough to keep the weights finite.
• Very fast: no exponentiation, no random number generation.
• No fudge factors (learning rates, momentum, weight decay, ..).
• Very efficient mixing over all “modes” (attractor set).

• Moments preserved, but what is our “inductive bias”? (i.e. what happens to remaining degrees of freedom?).

Page 18:

Back to Bowling

Data collected by P. Cotton. 10 pins, 298 bowling runs. X=1 means a pin has fallen in two subsequent bowls.

H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities.

P(total nr. pins falling)

Page 19:

More Results

Datasets:
Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6,242)
8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest=600)

Task: given only pairwise probabilities, compute the probability Q(n) of the total nr. of 1’s in a data vector.

Solution: apply herding and compute Q(n) through sample averages.

Error: KL[P_data || P_est]

Task: given only pairwise probabilities, compute the classifier P(Y|X).

Solution: train a logistic regression (LR) classifier on the herding sequence.

Error: fraction of misclassified test cases.

LR is too simple: PL on the herding sequence also gives 0.04. In higher dimensions herding loses its advantage in accuracy.
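A hedged sketch of the first evaluation pipeline (array shapes and the smoothing constant are my own; the placeholder arrays only stand in for the held-out data and the herding sample sequence): estimate Q(n) from row sums and score it with KL[P_data || P_est].

import numpy as np

def q_of_n(binary_matrix, d):
    # Q(n): distribution of the number of 1's per row, n = 0, .., d
    counts = np.bincount(binary_matrix.sum(axis=1).astype(int), minlength=d + 1)
    return counts / counts.sum()

def kl(p, q, eps=1e-8):
    # KL[P_data || P_est], with a small epsilon to avoid log(0)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# usage with placeholder arrays (shapes only; not the talk's data)
rng = np.random.default_rng(5)
d = 10
X_test = rng.integers(0, 2, size=(148, d))             # stand-in for held-out bowling data
S_herd = rng.integers(0, 2, size=(5000, d))            # stand-in for the herding sample sequence
print("KL[P_data || P_est] =", kl(q_of_n(X_test, d), q_of_n(S_herd, d)))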

Page 20:

Conclusions

• Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.

• Model for “neural computation”:
– similar to dynamical synapses
– quasi-random sampling of state space (chaotic?)
– local updates
– efficient (no random numbers, no exponentiation)