
Markov Chain Monte Carlo for Parameter Optimization

Holger Schultheis
Universität Bremen
12.12.2011

Source: cindy.informatik.uni-bremen.de/cosy/teaching/CM_2011/fitting/mcmc.pdf


Topics

1. Concepts
2. Markov Chain Monte Carlo: basics, example, Metropolis and simulated annealing
3. Usher's Algorithm


Probability Distributions

The quantity X is called a random variable if it takes on different values with certain probabilities.

Example: the result of rolling a die.

The probability distribution of a random variable X is any complete description of the probabilistic behavior of X.

Example: if the die is fair, X = 1, 2, 3, 4, 5, 6, each with probability p = 1/6.


Two common ways to characterize probability distributions:

The probability density function (pdf) gives the probability for each possible value of X; it sums (or integrates) to 1.

The cumulative distribution function (cdf) gives the probability that X is less than or equal to a given value.
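For the fair-die example above, both characterizations can be written out in a few lines of Python (a sketch; the slides contain no code):

```python
from fractions import Fraction

# Fair six-sided die: the pdf assigns probability 1/6 to each face.
pdf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """Pr[X <= x], accumulated from the pdf."""
    return sum(p for face, p in pdf.items() if face <= x)
```

Here sum(pdf.values()) equals 1 and cdf(3) equals 1/2, matching the two characterizations.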


Monte Carlo Methods

Playing solitaire with 52 cards: how high is the chance of the solitaire coming out successfully?

Deriving an analytical solution is very difficult. Laying out several solitaire games and checking for solvability is easy (but time consuming).

Monte Carlo methods: selecting a statistical sample to approximate a hard analytical problem by a much simpler problem, trading problem difficulty for computational effort.


Sampling

Monte Carlo requires sampling from a probability distribution, and the distribution may be arbitrarily complex.

Sampling can be a hard task in itself. Several methods are available: rejection sampling, importance sampling, and Markov chains.


Markov Chains

Let

Ω be a (countable) state space, and
X_t be a sequence of random variables on Ω for t = 0, 1, 2, ...

X_t is a Markov chain if

Pr[X_{t+1} = y | X_t = x_t, ..., X_0 = x_0] = Pr[X_{t+1} = y | X_t = x_t]


Markov Chains: Example

A Markov chain: drawing a number from {1, 2, 3} with replacement, with X_t = last number seen at time t.

Not a Markov chain: drawing a number from {1, 2, 3} without replacement, with X_t = last number seen at time t.


Markov Chains: Distributions

Consider a Markov chain with 3 states x_1, x_2, x_3 and transition matrix

T = ( 0    1    0
      0    0.1  0.9
      0.6  0.4  0   )

Assume that the probability distribution over initial states is P_init = (0.5, 0.2, 0.3). On the next iteration the probabilities for the states are P_init · T = (0.18, 0.64, 0.18), and so on. Visiting states in a Markov chain is like sampling states from a certain probability distribution.
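The one-step update above can be reproduced numerically (a sketch using NumPy; the slides give no code):

```python
import numpy as np

# Transition matrix from above: row i is the distribution over next states
# given that the chain is currently in state x_{i+1}.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

P_init = np.array([0.5, 0.2, 0.3])

# State distribution after one step: row vector times matrix.
P_next = P_init @ T
```

P_init @ T indeed yields (0.18, 0.64, 0.18).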


Notation

Ω, the state space.

Pr[X_{t+1} = j | X_t = i] = p_ij, the transition probability, and Pr[X_{t+k} = j | X_t = i] = p_ij^(k).

P = (p_ij), the transition probability matrix.

π = (π_k), a probability distribution over Ω.

ν = (ν_k), a positive vector over Ω such that Σ_{k∈Ω} ν_k < ∞; ν can be turned into a probability distribution by normalizing with Σ_{k∈Ω} ν_k.


Markov Chain Properties

Irreducibility: ∀ i, j ∈ Ω, ∃ k such that p_ij^(k) > 0, i.e., the chain can get from every state to every other state.

Period: the gcd of {k ≥ 1 : p_ii^(k) > 0}, i.e., of the numbers of steps after which the same state can be revisited; the chain is aperiodic if its period is 1. Aperiodicity means the chain does not get trapped in cycles.

Symmetry: ∀ i, j ∈ Ω, p_ij = p_ji.

Reversibility: ∃ π such that π_i p_ij = π_j p_ji; each sequence of states has the same probability as its reverse.

Time-homogeneity: whether the transition probabilities change over time (time-inhomogeneous) or not (time-homogeneous).


Theorem

For an aperiodic, irreducible Markov chain over Ω, the limit lim_{k→∞} p_ij^(k) exists and is independent of i; call it π_j. If Ω is finite, then

Σ_{j∈Ω} π_j = 1   and   Σ_{i∈Ω} π_i p_ij = π_j,

and such a π is the unique solution to xP = x.

π is called the stationary distribution: no matter where we start, after some time we will be in any state j with probability ≈ π_j.
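The theorem can be checked numerically for the three-state chain from the earlier example (a sketch; that chain is irreducible and, thanks to the self-loop in state x_2, aperiodic):

```python
import numpy as np

# Same chain as in the distributions example; T[1, 1] = 0.1 is the
# self-loop that makes it aperiodic.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

def limit_distribution(start, steps=200):
    """Iterate p <- p T; by the theorem this converges to the stationary
    distribution pi, independently of the starting distribution."""
    p = np.asarray(start, dtype=float)
    for _ in range(steps):
        p = p @ T
    return p

# Two very different starting distributions converge to the same pi.
pi_a = limit_distribution([1.0, 0.0, 0.0])
pi_b = limit_distribution([0.0, 0.0, 1.0])
```

The resulting vector satisfies π · T = π, i.e., it solves xP = x as the theorem states.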


Proposition

Assume an irreducible Markov chain with discrete state space Ω, and assume there exist positive numbers π_i, i ∈ Ω, such that Σ_i π_i = 1 and, for all i, j ∈ Ω,

π_i p_ij = π_j p_ji   (detailed balance).

Then π is the stationary distribution.

Detailed balance is a sufficient (but not necessary) condition for π being the stationary distribution. One can construct Markov chains with desired stationary distributions by realizing detailed balance.


Corollaries

Assume an irreducible Markov chain with discrete state space Ω. If there exists a ν such that ν_i p_ij = ν_j p_ji, then the stationary distribution exists and is given by

π_i = ν_i / Σ_j ν_j

For a symmetric, irreducible Markov chain, the stationary distribution is uniform on Ω, i.e., π_i = 1/|Ω| for all i ∈ Ω.


So What?

Parameter fitting is an optimization problem, and optimization can be hard, if not impossible, to achieve analytically.

The Monte Carlo solution: sample possible parameter values and hope to find a good set. Markov chains allow us to sample "smartly" from a space of possible solutions.

Example: Knapsack Problem


The Knapsack Problem

Definition

Given: m items with weights w_i and values v_i, and a knapsack with weight limit b.

Find: the most valuable subset of items that will fit into the knapsack.

Representation

z = (z_1, ..., z_m) ∈ {0, 1}^m, where z_i signals whether we take item i.

Feasible solutions: Ω = {z ∈ {0, 1}^m : Σ_i w_i z_i ≤ b}.

Problem: maximize Σ_i v_i z_i subject to z ∈ Ω.


The Knapsack Problem: MCMC Solution

Uniform sampling using MCMC: given the current X_t = (z_1, ..., z_m), generate X_{t+1} by

choosing J ∈ {1, ..., m} uniformly at random;
flipping z_J, i.e., letting y = (z_1, ..., 1 − z_J, ..., z_m);
if y is feasible, setting X_{t+1} = y, else setting X_{t+1} = X_t.

Comments:

The chain is aperiodic, irreducible, and symmetric ⇒ uniform sampling of all feasible solutions.
How long should we run it?
The search is still somewhat undirected: can we improve on that?
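The three transition steps translate directly into code (a sketch; the tiny instance, its weights, and the limit are made up for illustration):

```python
import random

def chain_step(z, w, b):
    """One transition of the uniform-sampling chain: flip one random bit,
    keep the move only if the result is still feasible."""
    j = random.randrange(len(z))        # choose J uniformly at random
    y = list(z)
    y[j] = 1 - y[j]                     # flip z_J
    if sum(wi * yi for wi, yi in zip(w, y)) <= b:
        return y                        # feasible: X_{t+1} = y
    return list(z)                      # infeasible: X_{t+1} = X_t

# Tiny illustrative instance.
w, b = [2, 3, 4], 5
z = [0, 0, 0]
for _ in range(1000):
    z = chain_step(z, w, b)
```

Because infeasible proposals leave the state unchanged, the chain never leaves Ω.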


The Knapsack Problem: Improved Solution

Idea: generate "good" solutions with higher probability, i.e., sample from a distribution in which "good" solutions have higher probabilities.

For the knapsack problem: π_z = C^{-1} exp(β Σ_i v_i z_i)

Can we define a Markov chain that allows us to sample from this distribution over Ω? The Metropolis algorithm provides a means to construct Markov chains with arbitrary stationary distributions.


Metropolis Algorithm

Given a state space Ω and a target distribution π:

choose Y ∈ Ω according to a Markov chain Q, i.e., Pr(Y = j | X_t = i) = q_ij; Q is called the proposal distribution;
let α = min{1, π_Y / π_i} (the acceptance probability);
accept Y with probability α, i.e., X_{t+1} = Y with probability α and X_{t+1} = X_t otherwise.

The resulting p_ij:

p_ij = q_ij min{1, π_j / π_i}   for i ≠ j
p_ii = 1 − Σ_{j≠i} p_ij
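A minimal Metropolis sampler over a finite state space might look as follows (a sketch; the target values and the three-state ring proposal are illustrative, and only ratios of π enter the acceptance probability):

```python
import random

random.seed(0)  # reproducibility

# Unnormalized target over states {0, 1, 2}; values are illustrative.
pi = {0: 1.0, 1: 2.0, 2: 7.0}

def propose(i):
    """Symmetric proposal Q: step left or right, wrapping around."""
    return (i + random.choice([-1, 1])) % 3

def metropolis_step(i):
    """One Metropolis transition: propose j, accept with prob min(1, pi_j/pi_i)."""
    j = propose(i)
    alpha = min(1.0, pi[j] / pi[i])
    return j if random.random() < alpha else i

# Long-run visit frequencies approximate pi / sum(pi) = (0.1, 0.2, 0.7).
counts = {0: 0, 1: 0, 2: 0}
state = 0
for _ in range(200_000):
    state = metropolis_step(state)
    counts[state] += 1
```

Since the proposal is symmetric, detailed balance holds with respect to π, so the empirical frequencies converge to the normalized target.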


Metropolis Algorithm: Properties

Proposition (Metropolis works): the p_ij's from the Metropolis algorithm satisfy detailed balance w.r.t. π, i.e., π_i p_ij = π_j p_ji ⇒ the Metropolis chain has stationary distribution π.

Remarks:

We only need to know ratios of values of π, i.e., we need to know π only up to a constant of proportionality!
The Metropolis chain might converge to π exponentially slowly (burn-in period).


Metropolis Algorithm: Optimization

Metropolis theoretically works, but it

needs a large β to make "good" states more likely, and
its convergence time may be exponential in β

⇒ try changing β over time: simulated annealing.

For the knapsack problem: α = min{1, exp(β_t Σ_i v_i (y_i − z_i))}, where β_t increases slowly with time (e.g., β_t = log(t) or β_t = (1.001)^t).
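Combining the bit-flip knapsack chain with this time-dependent acceptance probability gives a simulated-annealing sketch (instance data and the logarithmic schedule are illustrative choices, not from the slides):

```python
import math
import random

def anneal_knapsack(w, v, b, steps=20000):
    """Simulated-annealing sketch for the knapsack problem: the bit-flip
    chain from before, but a move from z to y is accepted with probability
    min(1, exp(beta_t * sum_i v_i (y_i - z_i))), with beta_t growing over time."""
    m = len(w)
    z = [0] * m                               # start with the empty knapsack
    best, best_val = list(z), 0
    for t in range(1, steps + 1):
        beta_t = math.log(t + 1)              # slowly increasing schedule
        j = random.randrange(m)               # choose a random bit to flip
        y = list(z)
        y[j] = 1 - y[j]
        if sum(wi * yi for wi, yi in zip(w, y)) > b:
            continue                          # infeasible proposal: stay at z
        dv = sum(vi * (yi - zi) for vi, yi, zi in zip(v, y, z))
        if random.random() < min(1.0, math.exp(beta_t * dv)):
            z = y
            val = sum(vi * zi for vi, zi in zip(v, z))
            if val > best_val:
                best, best_val = list(z), val  # track the best solution seen
    return best, best_val

random.seed(1)
# Tiny illustrative instance: the optimum takes the items with weights
# 2, 3, 5 (total weight 10, total value 15).
w, v, b = [2, 3, 4, 5], [3, 4, 5, 8], 10
best, best_val = anneal_knapsack(w, v, b)
```

Early on, small β_t lets the chain accept downhill moves and escape poor regions; as β_t grows, the chain becomes increasingly greedy.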


Simulated Annealing

General optimization problem: maximize a function G(z) over all feasible solutions Ω. Let Q again be an irreducible, symmetric Markov chain on Ω. Simulated annealing is a Metropolis chain with

p_ij = q_ij min{1, exp(β_t [G(j) − G(i)])}   for i ≠ j
p_ii = 1 − Σ_{j≠i} p_ij

Although the above considerations concerned discrete state spaces, the results apply straightforwardly to continuous state spaces (e.g., subspaces of R^n).


Simulated Annealing: Proposal Distribution

Consider a bimodal target distribution π and a normal proposal distribution Q = N(x_t, σ). How should we choose σ?

If the proposal is too narrow, π is not sampled completely; if it is too wide, the rejection rate is high.

If the rejection rate is low and π is sampled sufficiently, the chain is said to mix well.


Simulated Annealing: Cooling Schedule

The simulated annealing chain is time-inhomogeneous, so the theory for time-homogeneous chains does not apply! However, simulated annealing consists of a sequence of homogeneous chains.

Theoretical result on convergence: if, for each step T_i in the cooling schedule, the homogeneous chain mixes quickly enough, convergence to the global extrema is ensured for the schedule

T_i = (C ln(i + T_0))^{-1}

where C and T_0 are problem dependent. This holds for finite spaces and compact continuous spaces.


Simulated Annealing: Cooling Schedule

Problems: we normally do not know

how well the chain mixes (what the burn-in period is),
the optimal T_0, or
the optimal C,

so we normally cannot ensure optimal results with simulated annealing. But simulated annealing has been found to work well for many hard problems and provides a general approach to solving difficult optimization problems.


Usher’s Algorithm

Let

v ∈ R^n be a vector of parameter values, and
σ a vector of standard deviations.

One swap of the algorithm consists of:

computing the discrepancy between human behavior and model behavior (given the current v);
if the current discrepancy is less than the best found so far, replacing the optimal v with the current v;
generating a new set of parameter values using the normal proposal distributions N(v, σ).

Every k swaps, σ is (slightly) reduced. The algorithm stops after n swaps.
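The loop above can be sketched as follows (a sketch under assumptions: the slides leave open whether new values are proposed around the current or the best v, so proposing around the best vector is a choice of mine; the function name `ushers_fit` and the toy discrepancy function are likewise hypothetical):

```python
import random

def ushers_fit(discrepancy, v0, sigma0, n_swaps=500, k=50, shrink=0.95):
    """Sketch of the fitting loop: track the best parameter vector found,
    propose new values from normal distributions, shrink sigma every k swaps.
    Centering proposals on the best vector is an assumption, not from the slides."""
    v = list(v0)
    sigma = list(sigma0)
    best_v, best_d = list(v), discrepancy(v)
    for swap in range(1, n_swaps + 1):
        d = discrepancy(v)                    # model-vs-data discrepancy for current v
        if d < best_d:
            best_v, best_d = list(v), d       # keep the best parameters so far
        # Propose new parameter values from N(., sigma), one per dimension.
        v = [random.gauss(bi, si) for bi, si in zip(best_v, sigma)]
        if swap % k == 0:
            sigma = [si * shrink for si in sigma]   # every k swaps, reduce sigma slightly
    return best_v, best_d

random.seed(0)
# Toy "discrepancy" with known minimum 0 at (1.0, -2.0) (purely illustrative,
# standing in for the human-vs-model comparison).
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
best_v, best_d = ushers_fit(f, [0.0, 0.0], [1.0, 1.0])
```

The shrinking σ plays the role of the cooling schedule: wide early proposals cover the space, narrow late proposals fine-tune.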


Usher's Algorithm: Parameters

Number of swaps: how thoroughly the parameter space is explored.
Initial σ: rejection-rate / coverage trade-off.
Final σ: amount of fine-tuning.
σ reduction rate: related to burn-in times → convergence and optimality.

There are several interrelations between these parameters.
