TRANSCRIPT
(TENTATIVE) PLAN OF THE COURSE
Introduction
Chapter 1: Basics of statistical mechanics. The Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model
- Hopfield model at low load and solution via log-constrained entropy
- Self-averaging, spurious states, phase diagram
- Hopfield model at high load and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning
- Maximum likelihood
- Rosenblatt and Minsky & Papert perceptrons
Chapter 6: Neural networks for statistical learning and feature discovery
- Supervised Boltzmann machines
- Bayesian equivalence between Hopfield retrieval and Boltzmann learning
Chapter 7: A few remarks on unsupervised learning, “complex” patterns, deep learning
- Unsupervised Boltzmann machines
- Non-Gaussian priors
- Multilayered Boltzmann machines and deep learning
Seminars: Numerical tools for machine learning; non-mean-field neural networks; (bio-)logic gates; maximum entropy approach; Hamilton-Jacobi techniques for mean-field models; …
Hopfield model at low load
Self-consistency equations for the Mattis magnetizations (a straightforward way without SM)
Replace the computation of extensive thermodynamic variables by an average over an ensemble of systems distributed according to the probability distribution of the random variables.
Recalling

$$\langle\sigma_i\rangle = (+1)\,P(+1) + (-1)\,P(-1) = \frac{e^{\beta h_i} - e^{-\beta h_i}}{e^{\beta h_i} + e^{-\beta h_i}} = \tanh(\beta h_i),$$

the Mattis magnetizations read

$$\langle m^\mu\rangle = \frac{1}{N}\sum_{i=1}^{N}\xi_i^\mu\,\langle\sigma_i\rangle,$$

and the internal contribution to the field can be written as

$$h_i^{\mathrm{int}} = \frac{1}{N}\sum_{j=1}^{N}J_{ij}\,\langle\sigma_j\rangle = \sum_{\mu=1}^{P}\xi_i^\mu\,\langle m^\mu\rangle,$$

that is, it scales with $\langle m^\mu\rangle$. Combining the previous expressions,

$$\langle m^\mu\rangle = \frac{1}{N}\sum_{i=1}^{N}\xi_i^\mu\,\tanh\!\Big[\beta\Big(\sum_{\nu=1}^{P}\xi_i^\nu\,\langle m^\nu\rangle + h^{\mathrm{ext}}\Big)\Big].$$
The field acting upon spin i is $h_i = h_i^{\mathrm{ext}} + h_i^{\mathrm{int}}$. Recalling the “Glauber” probability, we get

$$\mathrm{Prob}[\sigma_i] = \frac{1}{2}\big[1 + \sigma_i\tanh(\beta h_i)\big] = \frac{e^{\beta\sigma_i h_i}}{e^{\beta h_i} + e^{-\beta h_i}}.$$
In the TDL, site averages can be replaced by averages over the pattern distribution:

$$\frac{1}{N}\sum_{i=1}^{N}G(\xi_i^\mu) \;\longrightarrow\; \sum_{\xi^\mu}P(\xi^\mu)\,G(\xi^\mu) = \langle G\rangle_\xi.$$
$$\langle m^\mu\rangle = \Big\langle\, \xi^\mu\,\tanh\!\Big[\beta\Big(\sum_{\nu=1}^{P}\xi^\nu\,\langle m^\nu\rangle + h^{\mathrm{ext}}\Big)\Big]\Big\rangle_{\xi}$$
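As a quick numerical illustration (a sketch, not part of the original text): for a single stored pattern and $h^{\mathrm{ext}}=0$, the self-consistency equation reduces to $m = \tanh(\beta m)$, which a short fixed-point iteration solves. The function name and starting point are choices of this sketch.

```python
import math

def mattis_magnetization(beta, m0=0.5, tol=1e-12, max_iter=10000):
    """Solve m = tanh(beta * m) (single-pattern case, h_ext = 0)
    by fixed-point iteration."""
    m = m0
    for _ in range(max_iter):
        m_new = math.tanh(beta * m)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# Below the critical temperature (beta > 1) a non-zero solution appears.
print(mattis_magnetization(0.5))  # ~0: paramagnetic phase, T > 1
print(mattis_magnetization(2.0))  # ~0.96: retrieval phase, T < 1
```

The same damped iteration extends directly to the full vector equation for several condensed patterns.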
Self-averaging of the free energy
The free energy in the TDL only depends on the distribution of the patterns, not on the particular set of patterns sampled from this distribution. This holds for both the low- and the high-storage regime.
$$\mathbb{E}\big(f(\xi)\big) = \frac{1}{2^{P}}\sum_{\{\xi\}} f(\xi)$$

Theorem. In the TDL the free energy evaluated for a given pattern configuration converges in probability to the average value with respect to the pattern sampling:

$$P\big(\,|f_N - \mathbb{E}f_N| > \epsilon\,\big) \;\to\; 0.$$
Examples of non-self-averaging observables: the overlaps q in the SK model (see the Parisi plateau) and in the high-load Hopfield model.

If 2^P ≪ N, self-averaging is a consequence of the law of large numbers; if P grows, large deviations can occur → non-self-averaging quantities.
Signal-to-noise. Does the Hebbian rule actually stabilize the inscribed patterns ξ^μ? Would a network in a state coinciding with a stored pattern, σ_i = ξ_i^1, be dynamically stable? Since σ_i = sgn(h_i), the condition is σ_i h_i > 0 (i = 1, 2, …, N).
$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^{P}\xi_i^\mu\xi_j^\mu, \qquad h_i = \frac{1}{N}\sum_{\substack{j=1 \\ j\neq i}}^{N}\sum_{\mu=1}^{P}\xi_i^\mu\xi_j^\mu\,\sigma_j$$

Stability condition when $\sigma = \xi^1$:

$$\xi_1^1 h_1 = 1 + R, \qquad |R| \approx \sqrt{\frac{P}{N}},$$

with a signal term (unitary in the TDL) and a noise term R, a sum of $(N-1)(P-1) \approx NP$ uncorrelated $\pm 1$ bits.
If P is kept constant as N is made very large, the noise becomes negligible in comparison with the signal→ every pattern is a fixed point
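The signal-plus-noise decomposition is easy to see in a small Monte Carlo experiment (an illustrative sketch; the sizes N, P and the seed are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 2000, 20                        # illustrative sizes
xi = rng.choice([-1, 1], size=(P, N))  # P random binary patterns

# Local fields when the network sits exactly on pattern 1 (sigma = xi^1):
# h_i = (1/N) sum_{j != i} sum_mu xi^mu_i xi^mu_j sigma_j
sigma = xi[0]
h = (xi.T @ (xi @ sigma)) / N - (P / N) * sigma  # subtract the j = i terms

stability = xi[0] * h            # signal (~1) plus noise R
R = stability - (N - 1) / N      # noise only
print(stability.mean())          # close to 1
print(R.std(), np.sqrt(P / N))   # noise magnitude ~ sqrt(P/N)
print((stability > 0).all())     # every site satisfies sigma_i h_i > 0
```

With P/N this small, the empirical noise width sits right at the √(P/N) estimate and no site violates the stability condition.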
$$\xi_1^1 h_1 = \underbrace{\frac{N-1}{N}}_{S} \;+\; \underbrace{\frac{1}{N}\sum_{\substack{j=1 \\ j\neq 1}}^{N}\sum_{\mu=2}^{P}\xi_1^1\,\xi_1^\mu\,\xi_j^\mu\,\xi_j^1}_{R} \;>\; 0$$

S: reinforcement due to the fact that the other spins are retrieving pattern 1 as well; R: slow noise due to the other patterns.
In fact, the patterns are very stable fixed points. Suppose a finite fraction d of the spins is flipped away from one of the patterns at random; then S = 1 − 2d, while R ≈ N^{−1/2} is still negligible, so that h_i = (1 − 2d) ξ_i^1 + R → sgn(h_i) = sgn(ξ_i^1). The network will immediately align itself with the pattern ⇒ the patterns have very large basins of attraction.
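The large basins can be probed numerically; the following sketch (sizes, seed, and the flipped fraction d are illustrative choices, not from the text) flips a fraction d of the spins and performs one parallel sign update:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, d = 2000, 10, 0.2            # flip a fraction d = 0.2 of the spins
xi = rng.choice([-1, 1], size=(P, N))

sigma = xi[0].copy()
flip = rng.random(N) < d           # random set of ~dN sites
sigma[flip] *= -1

# One parallel update with the Hebbian fields h_i = (1/N) sum_j J_ij sigma_j
h = (xi.T @ (xi @ sigma)) / N - (P / N) * sigma
sigma_new = np.sign(h)

overlap = (sigma_new * xi[0]).mean()
print(overlap)  # ~1: the network realigns with the pattern in one sweep
```

The signal 1 − 2d = 0.6 dwarfs the √(P/N) ≈ 0.07 noise, so a single sweep restores essentially perfect overlap.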
[Figure: energy landscape as a function of the Hamming distance from a pattern; the basin of attraction extends up to Hamming distance N/2.]
Spurious states via signal-to-noise. Although the Hebbian kernel has been constructed to guarantee that certain specified patterns are fixed points of the dynamics, the non-linearity of the dynamical process induces additional attractors.
Symmetric 3-mixture (majority rule):

$$\sigma_i = \mathrm{sgn}\big(\xi_i^1 + \xi_i^2 + \xi_i^3\big)$$

Check stability.
Stability is mostly threatened by the sites with the lowest value of the signal (i.e., 0.5). R² ≈ (P − 3)/N ≈ P/N; as N → ∞ with P kept fixed, R → 0 ⇒ the symmetric 3-mixture is stable. It has a rather large basin, but still much reduced compared to the pure patterns (i. stability is guaranteed by signals of magnitude one half only; ii. the initial state must have a large overlap with three patterns rather than one).
For N ≫ 1,

$$\sigma_i h_i = S + R$$

$$S = \sigma_i\sum_{\mu=1}^{3}m^\mu\,\xi_i^\mu = 0.5\,(\xi_i^1 + \xi_i^2 + \xi_i^3)\,\mathrm{sgn}(\xi_i^1 + \xi_i^2 + \xi_i^3)$$

$$R = \frac{1}{N}\sum_{j=1}^{N}\sum_{\mu>3}\sigma_i\,\xi_i^\mu\,\xi_j^\mu\,\sigma_j$$

$$m^\nu = \frac{1}{N}\sum_{i=1}^{N}\sigma_i\,\xi_i^\nu \;\to\; \langle\sigma\,\xi^\nu\rangle_\xi = 0.5, \qquad \text{for } \nu = 1, 2, 3$$
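The value ⟨σ ξ^ν⟩_ξ = 0.5 can be checked by exact enumeration of the 8 equally likely triples (ξ^1, ξ^2, ξ^3); a minimal sketch:

```python
from itertools import product

# Exact average of sigma * xi^1 with sigma = sgn(xi^1 + xi^2 + xi^3),
# enumerating the 8 equally likely triples of binary pattern components.
def sgn(x):
    return 1 if x > 0 else -1

vals = [sgn(x1 + x2 + x3) * x1
        for x1, x2, x3 in product([-1, 1], repeat=3)]
m = sum(vals) / len(vals)
print(m)  # 0.5: the Mattis overlap of the symmetric 3-mixture
```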
Noiseless symmetric spurious states

$$\boldsymbol{m} = m_n\,(1, 1, \ldots, 1, 0, 0, \ldots, 0)$$

The solution is symmetric under permutations of the components, as well as under a change of sign of any number of them: there are overall $2^n\binom{P}{n}$ such states (sign choices × selection of the non-zero components).
$$\boldsymbol{m} = \langle\boldsymbol{\xi}\,\tanh(\beta\,\boldsymbol{\xi}\cdot\boldsymbol{m})\rangle_\xi \;\Rightarrow\; m_n = \frac{1}{n}\,\langle z_n\tanh(\beta m_n z_n)\rangle_\xi$$

In the noiseless limit

$$m_n = \frac{1}{n}\,\langle|z_n|\rangle_\xi = \frac{1}{2^{2k}}\binom{2k}{k}$$
$$E_n = -\frac{1}{2}\,N n\,m_n^2, \qquad E_1 < E_3 < E_5 < \cdots < E_\infty < \cdots < E_6 < E_4 < E_2$$
These mixture states are solutions of the MF equations, but not necessarily attractors → test their stability. Compute the second variation of the energy at each of the stationary points: those that are positive definite correspond to true attractors. All the odd mixtures are attractors (minima); all the even mixtures are unstable (saddle points).
$$z_{n,i} = \sum_{\mu=1}^{n}\xi_i^\mu, \qquad \Pr(z_n = 2k - n) = 2^{-n}\binom{n}{k}$$

(in the closed form for $m_n$, write $n = 2k$ for $n$ even and $n = 2k + 1$ for $n$ odd)
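A short script (illustrative, with a hypothetical helper name) verifies the closed form for m_n against the direct binomial average of |z_n|/n:

```python
from math import comb

def m_noiseless(n):
    """m_n = <|z_n|>/n with z_n the sum of n iid +-1 variables."""
    return sum(abs(2 * k - n) * comb(n, k) for k in range(n + 1)) / (2**n * n)

# Closed form quoted in the text: m_n = 2^{-2k} C(2k, k),
# with n = 2k (even) or n = 2k + 1 (odd).
for k in range(1, 4):
    n = 2 * k + 1
    print(n, m_noiseless(n), comb(2 * k, k) / 4**k)
```

Since m_n decreases with n while n m_n² grows, the energies E_n = −N n m_n²/2 order the odd mixtures as E_1 < E_3 < E_5 < …, in agreement with the text.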
Noiseless Hopfield network, α = 0. The stored patterns, as well as their reversed twins, are absolute minima of an energy that serves as a (Lyapunov) landscape function. Each memory has an enormous basin of attraction. There is also a large number of additional attractors, referred to as spurious states: ≈ 2^n (n ≤ P) symmetric mixture fixed points (even mixtures are unstable, odd mixtures are stable).
Noisy Hopfield network, α = 0. A system endowed with a stochastic dynamical process obeying detailed balance relaxes towards a Boltzmann distribution; the free energy is the potential to inspect for describing the behavior of the system.
- T > 1: noise prevails; there is but a single minimum, at which all overlaps vanish (paramagnetic state).
- T = T_c = 1: second-order phase transition → qualitative modification in the shape of the landscape (recall Landau).
- T < 1: the paramagnetic state becomes unstable (a maximum of F) and 2P “valleys” emerge, each corresponding to a state with a single non-vanishing average overlap.
- T ≪ 1: the valleys deepen and move away from the point m = 0, towards the pure states.
- T < 0.461: the spurious states successively become stable. The pure pattern attractors remain the absolute minima of the landscape all the way down to T = 0; they also always have the largest basins of attraction.
High load and the way to complexity

The main ingredient: random frustration/competition. The Sherrington-Kirkpatrick model is the prototype model for complex systems.

Main differences with classic systems:
- an (at least) extensive (in N) number of free-energy minima
- an emerging ultrametric organization of states
- a dynamical spread over time scales (infinite for N → ∞)
- aging

Def. The Hamiltonian of the mean-field spin-glass model is defined as
$$H_N(\sigma;\xi) = -\frac{1}{N}\sum_{i,j}^{N}\sum_{\mu=1}^{P=\alpha N}\xi_i^\mu\xi_j^\mu\,\sigma_i\sigma_j \;\xrightarrow[\;P\to\infty\;]{}\; -\frac{\sqrt{\alpha}}{\sqrt{N}}\sum_{i,j}^{N}J_{ij}\,\sigma_i\sigma_j, \qquad J_{ij}\sim\mathcal{N}(0,1)$$

$$H_N(\sigma;J) = -\frac{1}{\sqrt{N}}\sum_{i,j}^{N}J_{ij}\,\sigma_i\sigma_j, \qquad J_{ij}\sim\mathcal{N}(0,1),\;\; J_{ii} = 0$$
Wick’s theorem

Let g be a centered Gaussian random variable with variance ν² and let us denote the density function of its distribution by $\gamma(g) = e^{-g^2/2\nu^2}/\sqrt{2\pi\nu^2}$. Given a continuously differentiable function F: ℝ → ℝ, we can formally integrate by parts. Since $\gamma'(g) = -(g/\nu^2)\,\gamma(g)$,

$$\mathbb{E}\big[g\,F(g)\big] = \int g\,F(g)\,\gamma(g)\,dg = \nu^2\int F'(g)\,\gamma(g)\,dg = \nu^2\,\mathbb{E}\big[F'(g)\big],$$

if the limits and the expectations on both sides are finite. This computation can be generalized to Gaussian vectors.
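The identity is easy to check numerically; the sketch below (the test function tanh and the value of ν are arbitrary choices) compares the two sides using a plain midpoint Riemann sum for the Gaussian expectation:

```python
import math

nu = 0.7   # illustrative standard deviation for g ~ N(0, nu^2)
gamma = lambda g: math.exp(-g * g / (2 * nu**2)) / math.sqrt(2 * math.pi * nu**2)

def E(h, lo=-8.0, hi=8.0, n=100000):
    """Expectation over the Gaussian density gamma via a midpoint sum."""
    dg = (hi - lo) / n
    s = 0.0
    for i in range(n):
        g = lo + (i + 0.5) * dg
        s += h(g) * gamma(g) * dg
    return s

F  = math.tanh
dF = lambda t: 1.0 / math.cosh(t)**2

lhs = E(lambda g: g * F(g))   # E[g F(g)]
rhs = nu**2 * E(dF)           # nu^2 E[F'(g)]
print(lhs, rhs)               # the two sides coincide (Gaussian integration by parts)
```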
A new (set of) order parameters: the overlap

$$q_{ab} = \frac{1}{N}\sum_{i=1}^{N}\sigma_i^a\sigma_i^b$$

describes the correlation between spins belonging to two “replicas”, namely two different systems endowed with the same couplings J. Under the replica-symmetry ansatz (which, by the way, is proven to be false),

$$q_{\alpha\beta} = q \;\;(\alpha\neq\beta), \qquad q_{\alpha\beta} = 1 \;\;(\alpha=\beta)$$

$$\Rightarrow\;\lim_{N\to\infty}P(\langle q_{ab}\rangle) = \delta(\langle q_{ab}\rangle - \bar q), \qquad \forall (a,b),\; a\neq b.$$

[Figure: free-energy landscape with metastable states and deep valleys.]

Let us prove that

$$\langle H_N^{SK}(\sigma;J)\rangle = -\frac{\beta N}{4}\big(1 - \langle q_{ab}^2\rangle\big).$$
Each RSB step adds a new set of valleys hierarchically embedded in the existing ones.
The replica trick
$$\log(x) = \lim_{n\to 0}\frac{x^n - 1}{n}$$

$$-\beta\,\mathbb{E}f(\beta) = \lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\log Z_N(\beta,J)$$

$$-\beta\,\mathbb{E}f(\beta) = \lim_{N\to\infty}\lim_{n\to 0}\frac{\mathbb{E}Z_N^n(\beta,J) - 1}{Nn}$$
The trick consists in calculating Z_N^n for n ∈ ℕ and in making an analytic continuation for n → 0. Serious mathematical problems:
- the analytic continuation is not ensured in the TDL
- uniqueness of the limit (Carlson’s theorem does not hold)
- the interchange of the limits
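The n → 0 identity behind the trick is elementary and easy to check numerically (a toy illustration, not part of the derivation):

```python
import math

# log x = lim_{n -> 0} (x^n - 1)/n, checked for decreasing n
x = 5.0
for n in (1e-1, 1e-3, 1e-6):
    print(n, (x**n - 1.0) / n)   # approaches log(5) = 1.6094... as n -> 0
print(math.log(x))
```

The subtlety in the replica trick is not this limit itself, but continuing an expression known only at integer n down to n → 0.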
$$\mathbb{E}Z_N^n(\beta,J) = \exp\Big\{\frac{J\beta^2}{4N}\big(nN^2 - n^2N\big)\Big\}\int\prod_{a<b}dq_{ab}\,\sqrt{\frac{\beta^2 JN}{2\pi}}\;\cdot$$
$$\cdot\;\exp\Big\{-\frac{\beta^2 JN}{2}\sum_{a<b}q_{ab}^2 + N\ln\sum_{\sigma}e^{\,J\beta^2\sum_{a<b}q_{ab}\,\sigma^{(a)}\sigma^{(b)}}\Big\} = \int\prod_{a<b}dq_{ab}\,\sqrt{\frac{2\pi}{\beta^2 JN}}\;e^{-N H_{\mathrm{eff}}(q_{ab})}.$$

$$\lim_{N\to\infty}Z_N^n(\beta,J) = \exp\big(-N\min_q[H_{\mathrm{eff}}(q)]\big)$$

$$H_{\mathrm{eff}}(q) = -\frac{\beta^2 J(N-n)}{4N} + \frac{\beta^2 J}{2}\sum_{a<b}q_{ab}^2 - \log\sum_{\{\sigma\}}\exp\Big(J\beta^2\sum_{a<b}q_{ab}\,\sigma^{(a)}\sigma^{(b)}\Big)$$
RS ansatz:

$$\bar q_{RS} = \int\frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2}\,\tanh^2\!\big(z\beta\sqrt{J\,\bar q_{RS}}\big)$$
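The RS self-consistency equation can be solved by fixed-point iteration, evaluating the Gaussian integral with Gauss-Hermite quadrature; a minimal sketch (the function name, node count, and starting point are choices of this illustration):

```python
import numpy as np

def q_RS(beta, J=1.0, tol=1e-10, max_iter=10000):
    """Solve q = E_z[tanh^2(beta*sqrt(J*q)*z)], z ~ N(0,1),
    by fixed-point iteration with Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(60)   # weight exp(-x^2)
    z = np.sqrt(2.0) * x                         # rescale to N(0,1)
    q = 0.5
    for _ in range(max_iter):
        q_new = np.sum(w * np.tanh(beta * np.sqrt(J * q) * z)**2) / np.sqrt(np.pi)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q

print(q_RS(0.5))  # ~0 above the critical temperature (beta < 1)
print(q_RS(2.0))  # non-zero spin-glass order parameter below it
```

The bifurcation of the non-trivial solution at β = 1 mirrors the T_c = 1 transition of the SK model at the RS level.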
SK model vs. CW model. For comparison, recall the Curie-Weiss partition function:

$$Z_N(\beta,h) = \sum_{\{\sigma\}}\exp\Big[\frac{\beta}{2N}\Big(\sum_{i=1}^{N}\sigma_i\Big)^2 + \beta h\sum_{i=1}^{N}\sigma_i\Big]$$

$$Z_N(\beta,h) = \sqrt{\frac{N\beta}{2\pi}}\int_{-\infty}^{+\infty}dx\,\exp\Big\{-N\Big(\frac{\beta x^2}{2} - \log 2\cosh[\beta(h+x)]\Big)\Big\}$$

To evaluate Z we need to solve a single integral; from its log-derivative we get the mean observables.

$$\lim_{N\to\infty}Z_N(\beta,h) = \exp\big(-N\min_x[F_{\mathrm{eff}}(x)]\big), \qquad F_{\mathrm{eff}}(x) = \frac{\beta x^2}{2} - \log 2\cosh[\beta(h+x)], \qquad \bar x = \tanh[\beta(\bar x + h)]$$
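The saddle-point statement can be tested against the exact partition function, which for Curie-Weiss is a single sum over the total magnetization; a stdlib-only sketch with illustrative parameter values:

```python
from math import comb, cosh, exp, log

beta, h, N = 1.5, 0.1, 400   # illustrative parameters

# Exact log Z_N / N, summing over the total magnetization M = 2k - N
logs = [log(comb(N, k)) + beta * (2 * k - N)**2 / (2 * N) + beta * h * (2 * k - N)
        for k in range(N + 1)]
mx = max(logs)
logZ_per_N = (mx + log(sum(exp(l - mx) for l in logs))) / N  # log-sum-exp

# Saddle point: minimize F_eff(x) = beta x^2/2 - log 2 cosh(beta (h + x))
F = lambda x: beta * x * x / 2 - log(2 * cosh(beta * (h + x)))
F_min = min(F(-2 + 4 * i / 100000) for i in range(100001))

print(logZ_per_N, -F_min)   # agree up to O((log N)/N) finite-size corrections
```

Already at N = 400 the two numbers match to three decimal places, which is the content of the Laplace/saddle-point evaluation above.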
Rigorous contributions from (not an exhaustive list!):
- Guerra: interpolation techniques
- Aizenman: stochastic stability
- Pastur, Shcherbina, Tirozzi: martingales, cavity methods
- Talagrand, Panchenko: concentration of measure
- Bovier: Gaussian processes
- …

“Spin glasses are heaven for mathematicians” — M. Talagrand
Solution at high load via interpolation techniques, at the RS level, assuming one binary pattern and P − 1 Gaussian patterns. Exploit the “universality” of the slow noise.

Hopfield with Gaussian patterns (α > 0):

$$\bar q = \int dM(z)\,\tanh^2\!\left(\frac{\beta\sqrt{\alpha\bar q}\,z}{1 - \beta(1-\bar q)}\right)$$

Hopfield with digital patterns (α > 0):

$$\bar q = \int dM(z)\,\tanh^2\!\left(\beta\bar m + \frac{\beta\sqrt{\alpha\bar q}\,z}{1 - \beta(1-\bar q)}\right)$$
$$\alpha(\lambda,\beta) = \lim_{N\to\infty}\alpha_N(t=1) = \lim_{N\to\infty}\Big[\alpha_N(t=0) + \int_0^1\partial_t\,\alpha_N(t)\,dt\Big]$$

α_N(t = 0) → one-body problem; the t-derivative splits into six contributions, ∂_t α_N(t) = I + II + III + IV + V + VI.
$$\partial_t\alpha_N = \frac{\beta}{2}\,\mathbb{E}\,\omega_t\big[(m - \bar m)^2\big] - \frac{1}{2}\beta\bar m^2 - \frac{\lambda_N\beta}{2}\big\langle(q_{12} - \bar q)(p_{12} - \bar p)\big\rangle_t - \frac{\lambda_N\beta}{2}\,\bar p\,(1 - \bar q)$$
$$\frac{\partial\alpha_N}{\partial t}(t) = \mathbb{E}\,\omega_t\Big[\frac{\beta}{2}\big(m^2 - 2\bar m\,m\big)\Big] + \frac{1}{2N}\big(\beta - B^2 - C\big)\sum_{\mu=1}^{P-1}\mathbb{E}\,\omega_t(z_\mu^2) - \frac{\lambda_N\beta}{2}\langle q_{12}\,p_{12}\rangle_t - \frac{A^2}{2}\big(1 - \langle q_{12}\rangle_t\big) + \frac{\lambda_N B^2}{2}\langle p_{12}\rangle_t$$
$$\lim_{N\to\infty}\big(\langle m\rangle_t - \bar m\big) = 0, \qquad \lim_{N\to\infty}\big(\langle q_{12}\rangle_t - \bar q\big) = 0, \qquad \lim_{N\to\infty}\big(\langle p_{12}\rangle_t - \bar p\big) = 0$$
see F. Guerra “Central limit theorem for fluctuations in the high temperature region” (2002) and references therein for the validity of such limits
$$\alpha_N(t=0) = -\frac{\lambda_N\beta}{2} + \log 2 + \mathbb{E}\log\cosh\Big(\beta\bar m + \sqrt{\lambda_N\beta\bar p}\,\theta\Big) - \frac{\lambda_N}{2}\log\big[1 - \beta(1 - \bar q)\big] + \frac{\lambda_N\beta}{2}\cdot\frac{\bar q}{1 - \beta(1 - \bar q)}.$$
$$Z_N(\beta;t) \doteq Z_N(t) = e^{-\frac{\beta}{2N}\sum_i\sum_\mu(\xi_i^\mu)^2}\sum_{\{\sigma\}}\int dM(z)\,\exp\Big\{\frac{t\beta}{2}Nm^2\Big\}\cdot\exp\big\{(1-t)\,\psi Nm\big\}\;\cdot$$
$$\cdot\;\exp\Big\{\sqrt{t}\,\sqrt{\frac{\beta}{N}}\sum_{i=1}^{N}\sum_{\mu=1}^{P-1}\xi_i^\mu\sigma_i z_\mu\Big\}\cdot\exp\Big\{A\sqrt{1-t}\sum_{i=1}^{N}\eta_i\sigma_i\Big\}\cdot\exp\Big\{B\sqrt{1-t}\sum_{\mu=1}^{P-1}\theta_\mu z_\mu\Big\}\cdot\exp\Big\{(1-t)\,\frac{C}{2}\sum_{\mu=1}^{P-1}z_\mu^2\Big\}$$
Effective fields: ψ mimics the field provided by m; Aη_i mimics the field provided by ξ_i^μ z_μ; Bθ_μ mimics the field provided by ξ_i^μ σ_i; C z_μ² accounts for the Gaussianity of z.
Theorem. The replica-symmetric thermodynamic limit of the free-energy density of the Hopfield neural network with N spins σ_i ∈ {−1, +1} ∀ i = 1, …, N, one binary pattern and a high load of P − 1 real patterns is determined by the minimum value of the following function:
$$f(\lambda,\beta) = -\frac{1}{\beta}\,\alpha(\lambda,\beta)$$

where

$$\alpha(\lambda,\beta) = -\frac{\lambda\beta}{2} + \log 2 + \mathbb{E}\log\cosh\Big(\beta\bar m + \sqrt{\lambda\beta\bar p}\,\theta\Big) - \frac{1}{2}\beta\bar m^2 - \frac{\lambda}{2}\log\big[1 - \beta(1 - \bar q)\big] + \frac{\lambda\beta\bar q}{2\big[1 - \beta(1 - \bar q)\big]} - \frac{\lambda\beta}{2}\,\bar p\,(1 - \bar q).$$
Requiring stationarity of f(λ, β) with respect to the order parameters gives

$$\bar m = \int_{\mathbb{R}}dM(\theta)\,\tanh\!\big(\beta\bar m + \sqrt{\lambda\beta\bar p}\,\theta\big), \qquad \bar q = \int_{\mathbb{R}}dM(\theta)\,\tanh^2\!\big(\beta\bar m + \sqrt{\lambda\beta\bar p}\,\theta\big), \qquad \bar p = \frac{\beta\bar q}{\big[1 - \beta(1 - \bar q)\big]^2}.$$
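These three coupled equations can be solved by damped fixed-point iteration; a sketch (the function name, damping scheme, initialization, and parameter values are choices of this illustration, not of the text):

```python
import numpy as np

def solve_RS(lam, beta, iters=3000, damp=0.5):
    """Damped fixed-point iteration for the RS order parameters (m, q, p)
    of the high-load Hopfield model; dM(theta) is the standard Gaussian
    measure, handled with Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(60)
    theta, W = np.sqrt(2.0) * x, w / np.sqrt(np.pi)
    m, q, p = 0.95, 0.95, 1.0            # start near a retrieval state
    for _ in range(iters):
        field = beta * m + np.sqrt(lam * beta * p) * theta
        t = np.tanh(field)
        m_new = np.sum(W * t)
        q_new = np.sum(W * t**2)
        p_new = beta * q_new / (1.0 - beta * (1.0 - q_new))**2
        m = damp * m + (1 - damp) * m_new
        q = damp * q + (1 - damp) * q_new
        p = damp * p + (1 - damp) * p_new
    return m, q, p

m, q, p = solve_RS(lam=0.01, beta=5.0)
print(m, q, p)   # low load, low temperature: retrieval solution, m close to 1
```

Scanning λ and β with such a solver reproduces the qualitative structure of the phase diagram described below the theorem.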
Left picture: RS amplitudes m of the pure states of the Hopfield model as a function of temperature. From top to bottom: α = 0.000 … 0.125 (Δα = 0.025). Right picture, solid lines: free energies of the pure states. From top to bottom: α = 0.000 … 0.125 (Δα = 0.025). Dashed lines and their solid continuations: free energies of the spin-glass state m = 0 for the same values of α, shown for comparison.
Phase diagram of the Hopfield model. P: paramagnetic phase, m = q = 0. SG: spin-glass phase, m = 0, q > 0. F: pattern-recall phase, where the pure states with m ≠ 0 and q > 0 minimize f. In region M, the pure states are local but not global minima of f. Dashed: the AT instability for the retrieval solution (T_R). Inset: close-up of the low-temperature region.