TRANSCRIPT
Based on summer literature review
School of Physics, Peking University
Ziming Liu
Advisor: Zheng Zhang, UCSB ECE
Langevin-type sampling methods
The big party
[Figure: a word cloud connecting topics across computer science, math, and physics: Robustness, Quantum Field Theory, Statistical Physics, Quantum Information, Information Theory, Tensor Network, Quantum Dynamics, Tensor Analysis, Partial Differential Equations, Complexity, Machine Learning, Hamiltonian Monte Carlo, Renormalization Group, Ising model, Matrix Product State, Tensor Train, Tensorized Neural Network, Schrodinger equation, Kohn-Sham equation, entanglement, Classical-quantum correspondence]
The goal of this talk: Quantum-inspired Hamiltonian Monte Carlo
Overview
1. Introduction to Bayesian models
2. Introduction to Langevin dynamics (1st, 2nd, 3rd order)
3. Hamiltonian Monte Carlo (HMC)
1. Introduction to Bayesian models
Maxwell-Boltzmann distribution
Description: a single-particle system in an isothermal environment (temperature T).
For a state x with energy E(x), the probability density is p(x) ∝ exp(−E(x)/(k_B T)).
Theory
• Maximum entropy principle (subject to a fixed mean energy): canonical ensemble
• Minimum free energy principle for an isothermal system
Examples
• Ideal gas: E = q²/(2m), so p(q) ∝ exp(−q²/(2m k_B T))
• Static isothermal atmosphere: E = U(x), so p(x) ∝ exp(−U(x)/(k_B T))
• Gas in a well: E = U(x) + q²/(2m), so p(x, q) ∝ exp(−(U(x) + q²/(2m))/(k_B T))
Link between the pdf and the energy function.
Bayesian model
What?
p(θ|D) ∝ p(D|θ) p(θ)
posterior ∝ likelihood × prior
θ: model parameters
Link to regression models:
p(θ|D) = exp(−U(θ)), with U(θ) = −log p(D|θ) − log p(θ)
(regression error + regularization term)
θ* = argmin_θ U(θ)  ⟺  θ* = argmax_θ p(θ|D)
Global optimum ⟺ maximum a posteriori (MAP) estimation
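To make this concrete, here is a minimal Python sketch of U(θ) for a toy Bayesian linear regression with a Gaussian likelihood and Gaussian prior (the data, noise level sigma2, and prior scale tau2 are illustrative assumptions, not from the slides): the negative log likelihood is the regression error, the negative log prior is an L2 regularizer, and the MAP estimate is the ridge-regression solution.

```python
import numpy as np

# Toy Bayesian linear regression: y ~ N(X theta, sigma2 I), theta ~ N(0, tau2 I).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)
sigma2, tau2 = 0.1**2, 1.0

def U(theta):
    nll = np.sum((y - X @ theta) ** 2) / (2 * sigma2)   # -log likelihood (regression error)
    nlp = np.sum(theta ** 2) / (2 * tau2)                # -log prior (regularization term)
    return nll + nlp

def grad_U(theta):
    return -X.T @ (y - X @ theta) / sigma2 + theta / tau2

# MAP estimate = argmin U(theta); for this quadratic U it is the ridge solution.
theta_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / tau2, X.T @ y / sigma2)
```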
Bayesian model
Why?
θ* = argmin_θ U(θ)  ⟺  θ* = argmax_θ p(θ|D)
Global optimum ⟺ maximum a posteriori (MAP) estimation
Limitations of MAP:
1. No uncertainty quantification (point estimation)
2. Risk of overfitting
[Figure: MAP point estimate vs. full Bayesian posterior]
Markov Chain Monte Carlo (MCMC)
P_s(θ) = P(θ|D): the steady distribution of the Markov chain equals the posterior distribution of the Bayesian model.
Given a steady distribution, how do we construct a Markov chain?
The stationarity condition is usually simplified to the stronger detailed-balance condition.
[Figure: two three-state chains, one satisfying detailed balance, one with the same steady distribution but no detailed balance]
Detailed balance
[Figure: two states with populations N₁ (or p₁) and N₂ (or p₂) and transition rates Q(1→2) = 1/2, Q(2→1) = 1/2]
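A tiny numerical illustration of the distinction (the 3-state chains are my own toy example): both transition matrices below have the uniform distribution as their steady distribution, but only the first satisfies detailed balance.

```python
import numpy as np

# Two 3-state chains with the same uniform steady distribution.
# P_db satisfies detailed balance; P_cycle (a directed 1->2->3->1 cycle) does not.
P_db = np.array([[0.50, 0.25, 0.25],
                 [0.25, 0.50, 0.25],
                 [0.25, 0.25, 0.50]])
P_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
pi = np.ones(3) / 3

for P in (P_db, P_cycle):
    flux = pi[:, None] * P                        # pi_i * P_ij
    print(np.allclose(pi @ P, pi),                # stationarity: pi P = pi
          np.allclose(flux, flux.T))              # detailed balance: pi_i P_ij = pi_j P_ji
```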
Metropolis-Hastings (MH)
The MH algorithm for sampling from a target distribution p(x), using a transition kernel Q, consists of the following steps:
• For t = 1, 2, ⋯
• Sample y from Q(y|x_t). Think of y as a proposed value of x_{t+1}.
• Compute the acceptance probability
  A(x_t → y) = min(1, p(y) Q(x_t|y) / (p(x_t) Q(y|x_t)))
• With probability A accept the proposed value and set x_{t+1} = y; otherwise, set x_{t+1} = x_t.
Physical picture:
p(x): number of particles at state x
Q(y|x): transition rate from x to y
p(x) Q(y|x): number of particles transiting from x to y
p(x) Q(y|x) A(x → y): number of accepted particles transiting from x to y
Detailed balance: p(x_t) Q(y|x_t) A(x_t → y) = p(y) Q(x_t|y) A(y → x_t)
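A minimal Python sketch of the MH loop (the Gaussian random-walk proposal, step size, and toy Gaussian target are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_p, x0, n_steps=10000, step=0.5):
    """Random-walk Metropolis-Hastings: Gaussian proposal Q(y|x) = N(x, step^2),
    which is symmetric, so the Q terms cancel in the acceptance ratio."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        y = x + step * rng.normal(size=x.shape)        # propose y ~ Q(y|x)
        log_A = min(0.0, log_p(y) - log_p(x))          # log acceptance probability
        if np.log(rng.uniform()) < log_A:              # accept with probability A
            x = y
        samples.append(x.copy())
    return np.array(samples)

# Example target: p(x) ∝ exp(-U(x)) with U(x) = x^2/2 (standard Gaussian).
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=[0.0])
```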
Metropolis algorithm
The Metropolis algorithm for sampling from a target distribution p(x), proposing moves by random walk, consists of the following steps:
• For t = 1, 2, ⋯
• Random walk to y from x_t. Think of y as a proposed value of x_{t+1}.
• Compute the acceptance probability A(x_t → y) = min(1, p(y)/p(x_t)).
• With probability A accept the proposed value and set x_{t+1} = y; otherwise, set x_{t+1} = x_t.
Comment: the random walk is symmetric, so Q(y|x) = Q(x|y).
In thermodynamic models, P(x) ∼ exp(−U(x)/T) (the Boltzmann distribution).
[Figure: random walk on an energy landscape U(x)]
Step size trade-off: large step size gives a low acceptance rate; small step size gives correlated, not independent samples.
2. Introduction to Langevin dynamics
Zoo of Langevin dynamics
• Stochastic Gradient Langevin Dynamics (1st order, general; ~718 citations): Welling, Max, and Yee W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
• Stochastic sampling using Fisher information (1st order, Gaussian; ~207 citations): Ahn, Sungjin, Anoop Korattikara, and Max Welling. "Bayesian posterior sampling via stochastic gradient Fisher scoring." arXiv preprint arXiv:1206.6380 (2012).
• Stochastic Gradient Hamiltonian Monte Carlo (2nd order; ~300 citations): Chen, Tianqi, Emily Fox, and Carlos Guestrin. "Stochastic gradient Hamiltonian Monte Carlo." International Conference on Machine Learning. 2014.
• Stochastic sampling using Nose-Hoover thermostat (3rd order; ~140 citations): Ding, Nan, et al. "Bayesian sampling using stochastic gradient thermostats." Advances in Neural Information Processing Systems. 2014.
1st order Langevin dynamics
(also known as Brownian motion or the Wiener process)
dx = −∇f(x) dt + √(2β⁻¹) dW(t),  steady distribution ρ_s ∝ exp(−βf(x))
f: energy function (Bayesian) / loss function (optimization)
[Figure: a ball of mass m immersed in a medium]
Properties of the medium:
• A heat bath (temperature T) that hits the ball every t₀, saving up its kicks in between and transferring momentum p ∼ exp(−p²/(2T t₀))
• Overdamped (damping coefficient γ large): small relaxation time
Physical intuition (take f(x) = const):
① The ball gains a momentum p from the fluctuating particles around it.
② It travels in the damping medium: m ẍ = −γ ẋ, so ẋ = (p/m) exp(−γt/m) and x = (p/γ)(1 − exp(−γt/m)).
③ Under the overdamped condition, exp(−γ t₀/m) → 0, so after each interval t₀ the total displacement is p/γ ∝ p,
i.e. dx ∝ (1/γ) p with p ∼ exp(−p²/(2T t₀)), hence dx ∝ √T dW(t₀) (with β = 1/T).
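A minimal sketch of simulating this dynamics with an Euler-Maruyama discretization (the step size and toy target are illustrative assumptions; without a Metropolis correction the samples carry an O(dt) discretization bias):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_1st_order(grad_f, x0, beta=1.0, dt=1e-3, n_steps=100000):
    """Euler-Maruyama discretization of dx = -grad f(x) dt + sqrt(2/beta) dW.
    For small dt the samples are approximately drawn from exp(-beta * f(x))."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps,) + x.shape)
    for t in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - grad_f(x) * dt + np.sqrt(2.0 * dt / beta) * noise
        samples[t] = x
    return samples

# Example: f(x) = x^2/2 with beta = 1, so the stationary density is a standard Gaussian.
samples = langevin_1st_order(grad_f=lambda x: x, x0=[0.0])
```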
2nd order Langevin dynamics
dx = p dt
dp = −∇f(x) dt − A p dt + √(2AT) dW
(conservative force −∇f(x), damping force −A p, thermal "force" √(2AT) dW)
Invariant measure: P_s(x, p) ∝ exp(−(p²/2 + f(x))/T)
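A corresponding sketch for the 2nd order dynamics, again using a naive discretization with illustrative parameters (A, T, dt are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_2nd_order(grad_f, x0, A=1.0, T=1.0, dt=1e-2, n_steps=50000):
    """Simple discretization of
       dx = p dt,   dp = -grad f(x) dt - A p dt + sqrt(2 A T) dW.
    For small dt the samples approximately follow exp(-(p^2/2 + f(x))/T)."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)
    xs = np.empty((n_steps,) + x.shape)
    for t in range(n_steps):
        p = p - grad_f(x) * dt - A * p * dt + np.sqrt(2 * A * T * dt) * rng.normal(size=x.shape)
        x = x + p * dt
        xs[t] = x
    return xs

# Example: f(x) = x^2/2 with T = 1 targets a standard Gaussian in x.
xs = langevin_2nd_order(grad_f=lambda x: x, x0=[0.0])
```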
Fokker-Planck equation for 2nd order LD
One-dimensional random walk (invariance principle):
[Figure: a walker at site x hops to x−1 or x+1 with probability 1/2 between times t and t+1; site occupancies φ_{x−1}, φ_x, φ_{x+1}]
∂φ/∂t = ½ (φ_{x−1} + φ_{x+1} − 2φ_x) ≈ ½ ∂²φ/∂x²
Dynamical equations ↔ Fokker-Planck equations
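A quick numerical check of this limit (my own toy setup): iterating the hop rule on a smooth initial profile reproduces the spreading Gaussian solution of the diffusion equation ∂φ/∂t = ½ ∂²φ/∂x².

```python
import numpy as np

# Random-walk master equation: phi_x(t+1) = (phi_{x-1}(t) + phi_{x+1}(t)) / 2.
# Starting from a Gaussian of variance s0, after `steps` iterations the profile
# should match a Gaussian of variance s0 + steps (diffusion constant 1/2).
n, steps, s0 = 401, 200, 25.0
x = np.arange(n) - n // 2
phi = np.exp(-x**2 / (2 * s0)) / np.sqrt(2 * np.pi * s0)     # smooth initial profile
for _ in range(steps):
    phi = 0.5 * (np.roll(phi, 1) + np.roll(phi, -1))          # hop left/right with prob 1/2

target = np.exp(-x**2 / (2 * (s0 + steps))) / np.sqrt(2 * np.pi * (s0 + steps))
print(np.max(np.abs(phi - target)))   # small: the hopping profile matches the spreading Gaussian
```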
3rd order Langevin dynamics (special)
dθ = p dt
dp = −∇U(θ) dt − ζ p dt + √(2AT) dW
dζ = (p²/n − T₀) dt   (thermal term: the thermostat)
When p ↑, p²/n ∼ T > T₀, so ζ ↑, giving more friction on p, so p ↓: a negative feedback loop.
3rd order Langevin dynamics (general)
dq = M⁻¹ p dt
dp = −∇U(q) dt + σ_F √Δt M^{1/2} dW − ζ p dt + σ_A M^{1/2} dW
dζ = (1/μ)(pᵀ M⁻¹ p − N_d k_B T) dt − γ ζ dt + √(2 k_B T γ) dW
Invariant measure: exp(−βU(q)) exp(−β pᵀM⁻¹p/2) exp(−μ(ζ − γ̂)²/2),  where γ̂ = β(σ_F² + σ_A²)/2
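A minimal sketch of a naive discretization of these equations (scalar mass M = I, no gradient noise, i.e. σ_F = 0, and illustrative parameters; published implementations typically use more careful splitting integrators):

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_langevin(grad_U, q0, n_steps=50000, dt=1e-2,
                      mu=1.0, gamma=1.0, sigma_A=1.0, kT=1.0):
    """Naive Euler-Maruyama sketch of the 3rd order (thermostatted) dynamics:
      dq    = p dt
      dp    = -grad U(q) dt - zeta p dt + sigma_A dW
      dzeta = (1/mu)(p.p - N_d kT) dt - gamma zeta dt + sqrt(2 kT gamma) dW
    """
    q = np.array(q0, dtype=float)
    p = np.zeros_like(q)
    zeta = 0.0
    Nd = q.size
    qs = np.empty((n_steps,) + q.shape)
    for t in range(n_steps):
        p += -grad_U(q) * dt - zeta * p * dt + sigma_A * np.sqrt(dt) * rng.normal(size=q.shape)
        q += p * dt
        zeta += ((p @ p - Nd * kT) / mu) * dt - gamma * zeta * dt \
                + np.sqrt(2 * kT * gamma * dt) * rng.normal()
        qs[t] = q
    return qs

qs = adaptive_langevin(grad_U=lambda q: q, q0=[0.0])   # toy target exp(-q^2/2)
```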
Slides from: https://ergodic.org.uk/~bl/Data/Slides/MD4.pdf
3rd order Langevin dynamics: building blocks
dq = M⁻¹ p dt
dp = −∇U(q) dt + σ_F √Δt M^{1/2} dW − ζ p dt + σ_A M^{1/2} dW
dζ = (1/μ)(pᵀ M⁻¹ p − N_d k_B T) dt − γ ζ dt + √(2 k_B T γ) dW
• Hamiltonian dynamics preserves exp(−βU(q)) exp(−β pᵀM⁻¹p/2)
• The thermostat (coupling p and ζ) preserves exp(−β pᵀM⁻¹p/2) exp(−μζ²/2)
• The OU process for ζ preserves exp(−μζ²/2)
• The noise on p shifts exp(−μζ²/2) → exp(−μ(ζ − γ̂)²/2)
Invariant measure: exp(−βU(q)) exp(−β pᵀM⁻¹p/2) exp(−μ(ζ − γ̂)²/2),  where γ̂ = β(σ_F² + σ_A²)/2
3. Hamiltonian Monte Carlo (HMC)
Neal, Radford M. "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.
Betancourt, Michael. "A conceptual introduction to Hamiltonian Monte Carlo." arXiv preprint arXiv:1701.02434 (2017).
2nd order Langevin dynamics
dx = p dt
dp = −∇f(x) dt − A p dt + √(2AT) dW
(conservative force −∇f(x), damping force −A p, thermal "force" √(2AT) dW)
Invariant measure: P_s(x, p) ∝ exp(−(p²/2 + f(x))/T)
Setting A = 0 removes the damping and thermal forces:
dx = p dt
dp = −∇f(x) dt
Hamiltonian dynamics
dx = M⁻¹ p dt   (definition of momentum)
dp = −∇U(x) dt   (momentum theorem; −∇U(x) is the conservative force)
These are the Hamiltonian equations for the Hamiltonian
H(x, p) = ½ pᵀ M⁻¹ p + U(x)   (kinetic energy + potential energy)
Energy conservation: dH = pᵀM⁻¹ dp + dU(x) = −dxᵀ∇U(x) + dU(x) = 0
Steady distribution
Hamiltonian equations:
dx = M⁻¹ p dt
dp = −∇U(x) dt
Steady distribution (q denotes the momentum):
p_s(x, q) ∝ exp(−U(x) − ½ qᵀ M⁻¹ q)
p_s(x) ∝ exp(−U(x)) = p(x|D)
Fokker-Planck equation:
∂_t p + (∂p/∂x)ᵀ (∂H/∂q) − (∂p/∂q)ᵀ (∂H/∂x) = 0
(also known as Liouville's theorem in physics)
Example: 1d spring-mass system
E = ½ k x² + q²/(2m)
[Figure: closed orbit in the (x, q) phase plane]
x = A sin(ωt + φ₀)
q = mωA cos(ωt + φ₀)
(ω = √(k/m))
No ergodicity? The trajectory stays on a single energy level and never explores the others.
Example: 1d spring-mass system interacting with a heat bath
① Momentum resampling (m = k_B T = 1): q ∼ N(0, 1)
② Travel on an energy level for a certain time (L steps)
[Figure: (x, q) phase plane; ① jumps between energy levels by momentum resampling, ② moves along each energy contour]
Maxwell-Boltzmann distribution: p(q) ∝ exp(−q²/(2m k_B T))
Ensemble picture ↔ time picture
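A minimal sketch of this resample-then-travel picture (my own toy parameters; the travel step uses the exact oscillator flow rather than a numerical integrator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Harmonic oscillator with m = k = k_B T = 1, plus periodic momentum resampling
# from N(0, 1).  Hamiltonian flow alone stays on one energy contour; resampling q
# restores ergodicity, so the x-marginal becomes the Boltzmann (standard normal) one.
x, q = 1.0, 0.0
xs = []
for _ in range(20000):
    q = rng.normal()                         # (1) resample momentum from the heat bath
    t = 1.0                                  # (2) travel on the energy level for time t
    x, q = x * np.cos(t) + q * np.sin(t), -x * np.sin(t) + q * np.cos(t)
    xs.append(x)

xs = np.array(xs)
print(xs.mean(), xs.var())                   # ≈ 0 and ≈ 1: the Boltzmann marginal in x
```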
2nd LD & HMC
[Figure: the same (x, q) phase-plane picture; ① momentum refreshment, ② travel along an energy contour]
2nd order LD: continuous scattering. HMC: discrete scattering (the bath saves up its kicks and delivers them all at once).
Algorithm
1. Momentum resampling
2. Hamiltonian dynamics (leapfrog scheme)
3. Metropolis-Hastings accept/reject
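Putting the three steps together, a minimal HMC sketch in Python (the step size, trajectory length, identity mass matrix, and toy Gaussian target are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc(U, grad_U, x0, n_samples=5000, eps=0.1, L=20):
    """One HMC iteration: resample momentum, run L leapfrog steps of size eps,
    then Metropolis-Hastings accept/reject on the total energy H = U(x) + q^2/2."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        q = rng.normal(size=x.shape)                  # 1. momentum resampling (M = I)
        x_new, q_new = x.copy(), q.copy()
        q_new -= 0.5 * eps * grad_U(x_new)            # 2. leapfrog: half step in q
        for i in range(L):
            x_new += eps * q_new                      #    full step in x
            if i < L - 1:
                q_new -= eps * grad_U(x_new)          #    full step in q
        q_new -= 0.5 * eps * grad_U(x_new)            #    final half step in q
        # 3. Metropolis-Hastings: accept with probability min(1, exp(H_old - H_new))
        dH = (U(x_new) + 0.5 * q_new @ q_new) - (U(x) + 0.5 * q @ q)
        if np.log(rng.uniform()) < -dH:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Toy target p(x) ∝ exp(-U(x)) with U(x) = x^2/2 (standard Gaussian).
samples = hmc(U=lambda x: 0.5 * np.sum(x**2), grad_U=lambda x: x, x0=[0.0])
```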
Euler vs leapfrog
Euler's method:
q(t + ε) = q(t) − ε ∂U/∂x(x(t))
x(t + ε) = x(t) + ε q(t)/m
Leapfrog method:
q(t + ε/2) = q(t) − (ε/2) ∂U/∂x(x(t))
x(t + ε) = x(t) + ε q(t + ε/2)/m
q(t + ε) = q(t + ε/2) − (ε/2) ∂U/∂x(x(t + ε))
e.g. U(x) = ½ k x²:
Euler: [x(t+ε); q(t+ε)] = [[1, ε/m], [−kε, 1]] [x(t); q(t)]
det = 1 + kε²/m > 1: not preserving volume!
Leapfrog: [x(t+ε); q(t+ε)] = [[1, 0], [−kε/2, 1]] [[1, ε/m], [0, 1]] [[1, 0], [−kε/2, 1]] [x(t); q(t)]
det = 1: preserving volume!
Euler vs leapfrog
[Figure: phase-space trajectories of the harmonic oscillator]
Euler's method: diverges. Leapfrog: stable.
MCMC & HMC
Random-walk MCMC: the state is the position x, with P(x) ∼ exp(−U(x)/T).
HMC: the state is the position x plus a momentum q, with P(x, q) ∼ exp(−(U(x) + K(q))/T).
[Figure: random walk on the energy landscape U(x) vs. Hamiltonian trajectories]
Hamiltonian dynamics → energy conservation → always accept! (if step size → 0)
MCMC & HMC
Neal, Radford M. "MCMC using Hamiltonian dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.
HMC limitations
① Ill-conditioned distributions: need different masses in different directions
② Multimodal distributions: hard to escape from one mode
③ Discontinuous distributions: large energy gap → low acceptance rate
④ Spiky distributions: large gradients → low acceptance rate
⑤ Large training datasets: expensive gradient computation
HMC variants
• Riemannian HMC: Girolami, Mark, Ben Calderhead, and Siu A. Chin. "Riemannian manifold Hamiltonian Monte Carlo." arXiv preprint arXiv:0907.1100 (2009).
• Magnetic HMC: Tripuraneni, Nilesh, et al. "Magnetic Hamiltonian Monte Carlo." Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
• Wormhole HMC: Lan, Shiwei, Jeffrey Streets, and Babak Shahbaba. "Wormhole Hamiltonian Monte Carlo." Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014.
• Continuously tempered HMC: Graham, Matthew M., and Amos J. Storkey. "Continuously tempered Hamiltonian Monte Carlo." arXiv preprint arXiv:1704.03338 (2017).
• Stochastic Gradient HMC: Chen, Tianqi, Emily Fox, and Carlos Guestrin. "Stochastic gradient Hamiltonian Monte Carlo." International Conference on Machine Learning. 2014.
• Stochastic Gradient Thermostat: Ding, Nan, et al. "Bayesian sampling using stochastic gradient thermostats." Advances in Neural Information Processing Systems. 2014.
• Relativistic Monte Carlo: Lu, Xiaoyu, et al. "Relativistic Monte Carlo." arXiv preprint arXiv:1609.04388 (2016).
• Optics HMC: Afshar, Hadi Mohasel, and Justin Domke. "Reflection, refraction, and Hamiltonian Monte Carlo." Advances in Neural Information Processing Systems. 2015.