
Page 1:

Herding Dynamical Weights

Max Welling
Bren School of Information and Computer Science
UC Irvine

Page 2:

Motivation

• X_i = 1 means that pin i will fall during a bowling round; X_i = 0 means that pin i will remain standing.

• You are given the pairwise probabilities P(X_i, X_j).

• Task: predict the distribution Q(n), n = 0, .., 10, of the total number of pins that will fall.

Stock market analogue: X_i = 1 means that company i defaults. You are interested in the probability of n companies in your portfolio defaulting.

Page 3:

Sneak Preview

Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown)

(Note: herding is a deterministic algorithm, no noise was added)

Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments.

Quiz: which is which [top/bottom]?

-data in random order.

-herding sequence in order received.

Page 4:

Traditional Approach: Hopfield Nets & Boltzmann Machines

w_{ij} is a weight; s_i is a state value (say 0/1).

Energy:

E(s, w) = -\sum_{ij} w_{ij} s_i s_j

Probability of a joint state:

P(s) = \frac{1}{Z(w)} \exp\Big( \sum_{ij} w_{ij} s_i s_j \Big)

Coordinate descent on energy:

S_i \leftarrow \mathbb{I}\Big[ \sum_j W_{ij} S_j > 0 \Big]
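To make this setup concrete, here is a minimal Python sketch (mine, not from the talk) of a small fully visible Boltzmann machine: the energy, the exact Boltzmann distribution by brute-force enumeration (only feasible because n is tiny), and one coordinate-descent sweep. The weight matrix and the size n = 5 are illustrative assumptions.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # number of binary units; tiny so Z is tractable (assumption)
W = rng.normal(scale=0.5, size=(n, n))  # illustrative symmetric weights with zero diagonal
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

def energy(s, W):
    # E(s, w) = -sum_ij w_ij s_i s_j with 0/1 states
    return -s @ W @ s

def boltzmann_probs(W):
    # exact P(s) = exp(-E(s, w)) / Z(w) by enumerating all 2^n states
    states = np.array(list(itertools.product([0, 1], repeat=W.shape[0])))
    logp = np.array([-energy(s, W) for s in states])
    p = np.exp(logp - logp.max())       # stabilize the exponentiation
    return states, p / p.sum()

def coordinate_descent_sweep(s, W):
    # one sweep of S_i <- I[sum_j W_ij S_j > 0]; never increases the energy
    s = s.copy()
    for i in range(len(s)):
        s[i] = 1 if W[i] @ s > 0 else 0
    return s

s0 = rng.integers(0, 2, size=n)
print("energy before:", energy(s0, W), " after one sweep:", energy(coordinate_descent_sweep(s0, W), W))
states, p = boltzmann_probs(W)
print("most probable joint state:", states[np.argmax(p)])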

Page 5:

Traditional Learning Approach

P(X) \propto \exp\Big( \sum_{ij} W_{ij} X_i X_j + \sum_i \alpha_i X_i \Big)

Maximum-likelihood gradient updates (stepsize \eta):

W_{ij} \leftarrow W_{ij} + \eta \big( \langle X_i X_j \rangle_{data} - \langle X_i X_j \rangle_P \big)

\alpha_i \leftarrow \alpha_i + \eta \big( \langle X_i \rangle_{data} - \langle X_i \rangle_P \big)

Then answer the query:

Q(n) = \sum_S P(S) \, \mathbb{I}\Big[ \sum_i S_i = n \Big], \quad n = 0, .., 10

The model expectations \langle \cdot \rangle_P are intractable: use CD instead!
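As a hedged illustration of these updates (not the talk's code), the sketch below runs maximum-likelihood gradient ascent with the model expectations computed by exact summation over all 2^d states, which only works because d is tiny; the synthetic data, stepsize eta, and iteration count are my own assumptions. It is exactly this enumeration that becomes intractable for realistic d, hence CD.

import itertools
import numpy as np

rng = np.random.default_rng(1)
d, N, eta = 4, 500, 0.05                               # tiny toy sizes; illustrative assumptions
X = rng.integers(0, 2, size=(N, d)).astype(float)      # synthetic stand-in "data"

data_xx = X.T @ X / N                                  # <X_i X_j>_data (diagonal = <X_i>_data)
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

def model_moments(W, alpha):
    # exact <X_i X_j>_P and <X_i>_P by summing over all 2^d states (intractable for large d)
    logp = np.einsum('si,ij,sj->s', states, W, states) + states @ alpha
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return np.einsum('s,si,sj->ij', p, states, states), states.T @ p

W, alpha = np.zeros((d, d)), np.zeros(d)
for _ in range(200):                                   # gradient ascent on the log-likelihood
    model_xx, model_x = model_moments(W, alpha)
    grad_W = data_xx - model_xx
    np.fill_diagonal(grad_W, 0.0)                      # the biases alpha carry the singleton terms
    W += eta * grad_W
    alpha += eta * (np.diag(data_xx) - model_x)

model_xx, model_x = model_moments(W, alpha)
print("max pairwise moment gap :", np.abs((data_xx - model_xx)[~np.eye(d, dtype=bool)]).max())
print("max singleton moment gap:", np.abs(np.diag(data_xx) - model_x).max())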

Page 6:

What’s Wrong With This?

• E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent).

• Slow convergence & local minima (only w/ hidden vars)

• Sampling can get stuck in local modes (slow mixing).

Page 7:

Solution in a Nutshell

Given only the data moments

\langle X_i \rangle_{data}, \quad \langle X_i X_j \rangle_{data}

a Nonlinear Dynamical System turns them directly into samples S whose moments match:

\langle S_i \rangle_S = \langle X_i \rangle_{data}, \quad \langle S_i S_j \rangle_S = \langle X_i X_j \rangle_{data}

so that Q(n) = \frac{1}{|S|} \sum_S \mathbb{I}\Big[ \sum_i S_i = n \Big], \quad n = 0, .., 10, can be read off from sample averages.

(sidestep learning + sampling)

Page 8:

Herding Dynamics

S_i = \mathbb{I}\Big[ \sum_j W_{ij} S_j + \alpha_i > 0 \Big]

W_{ij} \leftarrow W_{ij} + \langle X_i X_j \rangle_{data} - S_i S_j

\alpha_i \leftarrow \alpha_i + \langle X_i \rangle_{data} - S_i

(a small code sketch follows at the end of this slide)

• no stepsize

• no random numbers

• no exponentiation

• no point estimates

[Figure: states S_i and S_j coupled by weight W_{ij}, with bias \alpha_i]
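A minimal sketch of these herding updates, assuming synthetic stand-in data and a fixed number of coordinate-ascent sweeps for the inner maximization (both my choices, not the talk's). The time averages of the generated samples approach the data moments, which is the ergodicity property discussed later.

import numpy as np

rng = np.random.default_rng(2)
d, N, T = 10, 300, 2000                                # sizes are assumptions (10 "pins")
X = rng.integers(0, 2, size=(N, d)).astype(float)      # synthetic stand-in for the data
data_x = X.mean(axis=0)                                # <X_i>_data
data_xx = X.T @ X / N                                  # <X_i X_j>_data
np.fill_diagonal(data_xx, 0.0)

W, alpha = np.zeros((d, d)), np.zeros(d)
S = np.zeros(d)
samples = np.zeros((T, d))

for t in range(T):
    # inner maximization by a few sweeps of S_i = I[sum_j W_ij S_j + alpha_i > 0]
    for _ in range(5):
        for i in range(d):
            S[i] = 1.0 if W[i] @ S + alpha[i] > 0 else 0.0
    samples[t] = S
    # herding updates: no stepsize, no random numbers, no exponentiation
    W += data_xx - np.outer(S, S) * (1 - np.eye(d))
    alpha += data_x - S

off_diag = ~np.eye(d, dtype=bool)
print("max |<S_i>_S - <X_i>_data|        :", np.abs(samples.mean(0) - data_x).max())
print("max |<S_i S_j>_S - <X_i X_j>_data|:", np.abs((samples.T @ samples / T - data_xx)[off_diag]).max())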

Page 9:

Piston Analogy (weights = pistons)

Pistons move up at a constant rate (proportional to observed correlations)

When they get too high, the “fuel” combusts and the piston is pushed down (depression).

“Engine driven by observed correlations”

Page 10:

Herding Dynamics with General Features

S = \arg\max_S \sum_k W_k f_k(S)

W_k \leftarrow W_k + \langle f_k(X) \rangle_{data} - f_k(S)

(see the sketch after the bullet list below)

• no stepsize

• no random numbers

• no exponentiation

• no point estimates
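The same dynamics written for a generic feature map; a sketch under my own assumptions (4 binary variables, singleton plus pairwise-product features, argmax by brute force over all states), not the talk's code.

import itertools
import numpy as np

d = 4                                                  # assumed: 4 binary variables
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)

def features(S):
    # assumed feature map f(S): all singletons and pairwise products
    pairs = [S[i] * S[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([S, pairs])

F = np.array([features(S) for S in states])            # f(S) precomputed for every state

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, d)).astype(float)    # synthetic stand-in data
f_data = np.array([features(x) for x in X]).mean(axis=0)   # <f_k(X)>_data

w = np.zeros(F.shape[1])
T, f_avg = 5000, np.zeros_like(f_data)
for _ in range(T):
    s_idx = int(np.argmax(F @ w))                      # S = argmax_S sum_k W_k f_k(S)
    w += f_data - F[s_idx]                             # W_k <- W_k + <f_k(X)>_data - f_k(S)
    f_avg += F[s_idx]
print("max moment gap:", np.abs(f_avg / T - f_data).max())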

Page 11:

Features as New Coordinates

[Figure: feature vectors f(S_1), .., f(S_5) span a polytope in feature space; the weight trajectory w_1, w_2, .., w_t, w_{t+1} circles the data mean \langle f \rangle_{data}]

\langle f \rangle_{data} = \frac{1}{N} \sum_{b=1}^{B} n_b f_b, \quad (n_1, .., n_B) \text{ the counts of the } B \text{ distinct configurations}

If \langle f(X) \rangle_{data} cannot be written exactly as such a finite average over configurations, then the period of the herding sequence is infinite.

thanks to Romain Thibaux

Page 12:

Example

Two features of a scalar X: a linear feature f_1(X) = X and a sinusoidal feature f_2(X).

Weights initialized on a grid; the red ball tracks one weight.

Convergence onto a fractal attractor set with Hausdorff dimension 1.5.
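To get the flavor of this animation one can run two-feature herding over a discretized scalar and record the weight trajectories; everything concrete below (the grid for X, the exact feature forms, the target moments) is my own assumption, since the slide's constants are not recoverable here. Plotting the trajectory tails (e.g. with matplotlib) reveals the attractor set.

import numpy as np

xs = np.linspace(-1.0, 1.0, 400)                       # assumed grid for the scalar X
F = np.stack([xs, np.sin(2 * np.pi * xs)], axis=1)     # assumed features: f_1(X)=X, f_2(X)=sin(2*pi*X)
f_target = np.array([0.1, 0.2])                        # assumed target ("data") moments

def herd(w0, steps):
    # run herding from initial weights w0 and return the visited weight trajectory
    w, traj = np.array(w0, dtype=float), []
    for _ in range(steps):
        s = int(np.argmax(F @ w))                      # maximize sum_k w_k f_k(X) over the grid
        w += f_target - F[s]                           # parameter-free weight update
        traj.append(w.copy())
    return np.array(traj)

# weights initialized on a grid: every start ends up wandering over the same bounded attractor set
starts = [(a, b) for a in np.linspace(-2, 2, 5) for b in np.linspace(-2, 2, 5)]
tails = np.vstack([herd(w0, steps=2000)[-500:] for w0 in starts])
print("attractor bounding box:", tails.min(axis=0), tails.max(axis=0))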

Page 13:

The Tipi Function

Herding performs stepsize-1 gradient updates on G(W):

G(W) = \sum_k W_k \langle f_k \rangle_{data} - \max_S \sum_k W_k f_k(S)

This function is:

• Concave
• Piecewise linear
• Non-positive
• Scale free

The herding updates

W_k \leftarrow W_k + \langle f_k \rangle_{data} - f_k(S), \quad S = \arg\max_S \sum_k W_k f_k(S)

are exactly such gradient steps: coordinate-wise optimization is replaced with full maximization.

The scale-free property implies that the stepsize will not affect the state sequence S.
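A small numerical check (my own toy setup: 3 binary variables with f_k(S) = S_k, synthetic data) that evaluates G(W) by enumeration, verifies the non-positive and scale-free properties, and shows that a herding update is a stepsize-1 step along \nabla G(W) = \langle f \rangle_{data} - f(S*).

import itertools
import numpy as np

d = 3                                                  # assumed toy setup
states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
F = states.copy()                                      # assumed features: f_k(S) = S_k
rng = np.random.default_rng(4)
f_data = rng.integers(0, 2, size=(50, d)).mean(axis=0) # <f_k>_data from synthetic data

def G(w):
    # Tipi function: G(W) = sum_k W_k <f_k>_data - max_S sum_k W_k f_k(S)
    return w @ f_data - np.max(F @ w)

def herding_step(w):
    s = int(np.argmax(F @ w))                          # full maximization over states
    return w + (f_data - F[s])                         # gradient of G at w, applied with stepsize 1

w = rng.normal(size=d)
print("non-positive:", G(w) <= 0, " scale free:", np.isclose(G(2 * w), 2 * G(w)))
print("G before / after one herding step:", G(w), G(herding_step(w)))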

Page 14:

Recurrence

Thm: If we can find the optimal state S, then the weights will stay within a compact region.

Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.

Page 15:

Ergodicity

[Figure: a herding trajectory hopping between states s = 1, .., 6, producing the sequence s = [1, 1, 2, 5, 2, ...]]

\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f_k(s_t) = \langle f_k \rangle_{data}

Thm: If the 2-norm of the weights grows slower than linear, then feature averages over trajectories converge to data averages.

Page 16:

Relation to Maximum Entropy

Maximize H[P] subject to \langle f \rangle_P = \langle f \rangle_{data}

Dual:

Maximize L(W) = \sum_k W_k \langle f_k \rangle_{data} - \log \sum_x \exp\Big( \sum_k W_k f_k(x) \Big)

Tipi function:

G(W) = \lim_{T \to 0} T \, L(W / T)

Herding dynamics satisfies the constraints but does not achieve maximal entropy.
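The step connecting the dual to the Tipi function is a standard zero-temperature (log-sum-exp) limit:

T \, L(W/T) = \sum_k W_k \langle f_k \rangle_{data} - T \log \sum_x \exp\Big( \tfrac{1}{T} \sum_k W_k f_k(x) \Big) \;\to\; \sum_k W_k \langle f_k \rangle_{data} - \max_x \sum_k W_k f_k(x) = G(W) \quad (T \to 0),

since T \log \sum_x \exp(a(x)/T) \to \max_x a(x) as T \to 0.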

Page 17:

Advantages / Disadvantages

• Learning & inference have merged into one dynamical system.
• Fully tractable, although one should monitor whether local maximization is enough to keep the weights finite.
• Very fast: no exponentiation, no random number generation.
• No fudge factors (learning rates, momentum, weight decay, ..).
• Very efficient mixing over all “modes” (attractor set).

• Moments preserved, but what is our “inductive bias”? (i.e. what happens to remaining degrees of freedom?).

Page 18:

Back to Bowling

Data collected by P. Cotton. 10 pins, 298 bowling runs. X=1 means a pin has fallen in two subsequent bowls.

H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities.

P(total nr. pins falling)

Page 19:

More Results

Datasets:
Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6,242)
8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest=600)

Task: given only pairwise probabilities, compute the probability Q(n) of the total nr. of 1’s in a data vector.

Solution: apply herding and compute Q(n) through sample averages.

Error: KL[P_data || P_est]

Task: given only pairwise probabilities, compute the classifier P(Y|X).

Solution: train a logistic regression (LR) classifier on the herding sequence.

Error: fraction of misclassified test cases.

LR is too simple: PL on the herding sequence also gives 0.04. In higher dimensions herding loses its advantage in accuracy.
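A hedged sketch of the first evaluation pipeline (array shapes and the smoothing constant are my own; the placeholder arrays only stand in for the held-out data and the herding sample sequence): estimate Q(n) from row sums and score it with KL[P_data || P_est].

import numpy as np

def q_of_n(binary_matrix, d):
    # Q(n): distribution of the number of 1's per row, n = 0, .., d
    counts = np.bincount(binary_matrix.sum(axis=1).astype(int), minlength=d + 1)
    return counts / counts.sum()

def kl(p, q, eps=1e-8):
    # KL[P_data || P_est], with a small epsilon to avoid log(0)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# usage with placeholder arrays (shapes only; not the talk's data)
rng = np.random.default_rng(5)
d = 10
X_test = rng.integers(0, 2, size=(148, d))             # stand-in for held-out bowling data
S_herd = rng.integers(0, 2, size=(5000, d))            # stand-in for the herding sample sequence
print("KL[P_data || P_est] =", kl(q_of_n(X_test, d), q_of_n(S_herd, d)))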

Page 20:

Conclusions

• Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.

• Model for “neural computation”:
– similar to dynamical synapses
– quasi-random sampling of state space (chaotic?)
– local updates
– efficient (no random numbers, no exponentiation)