Concepts · Markov Chain Monte Carlo · Usher's Algorithm
Markov Chain Monte Carlo for Parameter Optimization
Holger Schultheis
12.12.2011
1 / 27
Topics
1 Concepts
2 Markov Chain Monte Carlo: Basics, Example, Metropolis and Simulated Annealing
3 Usher’s Algorithm
Probability Distributions
The quantity X is called a random variable if it takes on different values with certain probabilities.
Example: Result of rolling a die
The probability distribution of a random variable X is any complete description of the probabilistic behavior of X.
Example: If the die is fair, X takes the values 1, 2, 3, 4, 5, 6, each with probability p = 1/6.
Probability Distributions
Two common ways to characterize probability distributions
Probability density function (pdf(·)). Gives the probability for each possible value of X. Sums / integrates to 1.
Cumulative distribution function (cdf(·)). Gives the probability that X is less than or equal to a certain value.
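For the fair-die example above, both characterizations can be written down directly. A minimal sketch, using Python's `fractions` module for exact arithmetic:

```python
from fractions import Fraction

# pdf and cdf of the fair-die example: the pdf sums to 1,
# the cdf accumulates probability up to each value.
pdf = {x: Fraction(1, 6) for x in range(1, 7)}
cdf = {x: sum(pdf[y] for y in range(1, x + 1)) for x in range(1, 7)}

print(sum(pdf.values()))  # 1
print(cdf[3])             # 1/2, i.e., Pr[X <= 3]
```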
Monte Carlo Methods
Playing Solitaire with 52 cards: How high is the chance of the solitaire coming out successfully?
Deriving an analytical solution is very difficult.
Laying out several solitaire games and checking for solvability is easy (but time consuming).
Monte Carlo Methods: Selecting a statistical sample to approximate a hard analytical problem by a much simpler problem.
Trading problem difficulty for computational effort.
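The trade-off can be made concrete. Checking solitaire solvability is too involved for a short sketch, so the following applies the same Monte Carlo idea to a simpler card question with a known answer; the function name, trial count, and the 5/52 reference value (the exact probability, used only to check the estimate) are illustrative:

```python
import random

def estimate_top5_ace(trials=20_000, seed=0):
    """Monte Carlo estimate of the chance that the ace of spades
    is among the top 5 cards of a shuffled 52-card deck (exact: 5/52)."""
    rng = random.Random(seed)
    deck = list(range(52))        # card 0 stands for the ace of spades
    hits = 0
    for _ in range(trials):
        rng.shuffle(deck)         # lay out one random "game"
        if 0 in deck[:5]:         # check the event of interest
            hits += 1
    return hits / trials

print(estimate_top5_ace())        # close to 5/52 ≈ 0.0962
```

More samples buy a better approximation: problem difficulty is traded for computational effort.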
Sampling
Monte Carlo requires sampling from a probability distribution.
The distribution may be arbitrarily complex.
Sampling can be a hard task in itself. Several methods are available:
rejection sampling, importance sampling, Markov Chains.
Markov Chains
Let
Ω be a (countable) state space
X_t be a sequence of random variables on Ω for t = 0, 1, 2, . . .
X_t is a Markov Chain if
Pr[X_{t+1} = y | X_t = x_t, . . . , X_0 = x_0] = Pr[X_{t+1} = y | X_t = x_t]
Markov Chains: Example
Markov Chain
Drawing a number from {1, 2, 3} with replacement. X_t = last number seen at time t.
Not a Markov Chain
Drawing a number from {1, 2, 3} without replacement. X_t = last number seen at time t.
Markov Chains: Distributions
Markov Chain with 3 states x1, x2, x3 and transition matrix

T = ( 0.0  1.0  0.0
      0.0  0.1  0.9
      0.6  0.4  0.0 )

Assume that the probability of the initial states is P_init = (0.5, 0.2, 0.3).
On the next iteration the probabilities for the states are P_init · T = (0.18, 0.64, 0.18), and so on . . .
Visiting states in a Markov Chain is like sampling states from a certain probability distribution.
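The iteration above can be checked directly. A minimal sketch in plain Python (no libraries assumed):

```python
# Three-state chain from the slide: row i holds the transition
# probabilities out of state x_i.
T = [[0.0, 1.0, 0.0],
     [0.0, 0.1, 0.9],
     [0.6, 0.4, 0.0]]

def step(p, T):
    """One step of the chain: row vector p times transition matrix T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

p = [0.5, 0.2, 0.3]       # P_init
p = step(p, T)
print(p)                  # ≈ (0.18, 0.64, 0.18), matching the slide
```

Iterating `step` repeatedly shows the distribution settling down, which is the sampling view of visiting states.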
Basics · Example · Metropolis & SA
Notation
Ω, the state space.
Pr[X_{t+1} = j | X_t = i] = p_ij, the transition probability,
and Pr[X_{t+k} = j | X_t = i] = p_ij^(k).
P = (p_ij), the transition probability matrix.
π = (π_k), a probability distribution over Ω.
ν = (ν_k), a positive vector over Ω such that Σ_{k∈Ω} ν_k < ∞.
ν can be turned into a probability distribution by normalizing with Σ_{k∈Ω} ν_k.
Markov Chain Properties
irreducibility: ∀ i, j ∈ Ω, ∃ k such that p_ij^(k) > 0
(the chain can get from every state to every other state in some number of steps)
period: gcd of {k ≥ 1 : p_ii^(k) > 0}, i.e., of the numbers of steps after which a state can be revisited; aperiodic if the period is 1
aperiodicity: the chain does not get trapped in cycles
symmetry: ∀ i, j ∈ Ω, p_ij = p_ji
reversibility: ∃ π such that π_i p_ij = π_j p_ji
(each sequence of states has the same probability as its reverse)
time-homogeneity: whether the transition probabilities change over time (time-inhomogeneous) or not (time-homogeneous)
Theorem
For an aperiodic, irreducible Markov chain over Ω, the limit lim_{k→∞} p_ij^(k) exists and is independent of i; call it π_j. If Ω is finite, then

Σ_{j∈Ω} π_j = 1 and Σ_{i∈Ω} π_i p_ij = π_j,

and such a π is the unique solution to xP = x.
π is called a stationary distribution.
No matter where we start, after some time we will be in any state j with probability ≈ π_j.
Proposition
Assume an irreducible Markov chain with discrete state space Ω. Assume there exist positive numbers π_i, i ∈ Ω, such that Σ_i π_i = 1 and ∀ i, j ∈ Ω

π_i p_ij = π_j p_ji (detailed balance).

Then π is the stationary distribution.
Detailed balance is a sufficient (but not necessary) condition for π being the stationary distribution.
One can construct Markov Chains with desired stationary distributions by realizing detailed balance.
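A minimal numerical check of the proposition, using a two-state chain with made-up transition probabilities:

```python
# Two-state chain: P[i][j] = p_ij. The probabilities are chosen
# purely for illustration.
P = [[0.7, 0.3],
     [0.6, 0.4]]

# pi solves detailed balance pi_0 * p_01 = pi_1 * p_10, i.e.
# pi_0 * 0.3 = pi_1 * 0.6, together with pi_0 + pi_1 = 1.
pi = [2 / 3, 1 / 3]
assert abs(pi[0] * P[0][1] - pi[1] * P[1][0]) < 1e-12  # detailed balance

# Stationarity follows: pi P = pi.
new = [pi[0] * P[0][0] + pi[1] * P[1][0],
       pi[0] * P[0][1] + pi[1] * P[1][1]]
print(new)  # equals pi = [2/3, 1/3] up to rounding
```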
Corollaries
Assume an irreducible Markov chain with discrete state space Ω. If ∃ ν such that ν_i p_ij = ν_j p_ji, then the stationary distribution exists and is given by

π_i = ν_i / Σ_j ν_j

For a symmetric, irreducible Markov Chain, the stationary distribution is uniform on Ω, i.e., π_i = 1/|Ω|, ∀ i ∈ Ω.
So What?
Parameter fitting is an optimization problem.
Optimization can be hard, if not impossible, to achieve analytically.
Monte Carlo solution: Sample possible parameter values and hope to find a good set.
Markov Chains allow us to "smartly" sample from a space of possible solutions.
Example: Knapsack Problem
The Knapsack Problem
Definition
Given: m items with weights w_i and values v_i, and a knapsack with weight limit b.
Find: the most valuable subset of items that will fit into the knapsack.
Representation
z = (z_1, . . . , z_m) ∈ {0, 1}^m, where z_i signals whether we take item i.
Feasible solutions: Ω = {z ∈ {0, 1}^m : Σ_i w_i z_i ≤ b}
Problem: Maximize Σ_i v_i z_i subject to z ∈ Ω.
The Knapsack Problem: MCMC Solution
Uniform sampling using MCMC: given the current X_t = (z_1, . . . , z_m), generate X_{t+1} by:
choose J ∈ {1, . . . , m} uniformly at random
flip z_J, i.e., let y = (z_1, . . . , 1 − z_J, . . . , z_m)
if y is feasible, then set X_{t+1} = y, else set X_{t+1} = X_t
Comments:
The chain is aperiodic, irreducible, and symmetric =⇒ uniform sampling of all feasible solutions.
How long should we run it?
The search is still somewhat undirected: can we improve on that?
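The chain above can be sketched as follows; the item weights and the weight limit are made-up illustrative numbers:

```python
import random

def knapsack_chain(w, b, steps, seed=0):
    """Uniform-sampling chain over feasible knapsack solutions:
    flip one random bit, keep the move only if the subset still fits."""
    rng = random.Random(seed)
    m = len(w)
    z = [0] * m                          # start from the empty knapsack
    for _ in range(steps):
        j = rng.randrange(m)             # choose J uniformly at random
        z[j] = 1 - z[j]                  # flip z_J
        if sum(wi * zi for wi, zi in zip(w, z)) > b:
            z[j] = 1 - z[j]              # infeasible: stay at X_t
    return z

w = [3, 4, 5, 2]                         # illustrative weights
z = knapsack_chain(w, b=7, steps=1000)
print(z, "weight:", sum(wi * zi for wi, zi in zip(w, z)))
```

Every state the chain visits is feasible by construction, so long runs sample (approximately uniformly) from Ω.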
The Knapsack Problem: Improved Solution
Idea: Generate "good" solutions with higher probability =⇒ sample from a distribution in which "good" solutions have higher probabilities.
Knapsack Problem: π_z = C^{-1} exp(β Σ_i v_i z_i)
Can we define a Markov Chain that allows us to sample from this distribution over Ω?
The Metropolis algorithm provides a means to construct Markov Chains with arbitrary stationary distributions.
Metropolis Algorithm
Given a state space Ω and a target distribution π:
choose Y ∈ Ω according to a Markov chain Q, i.e., Pr(Y = j | X_t = i) = q_ij
(Q is called the proposal distribution)
let α = min{1, π_Y / π_i} (acceptance probability)
accept Y with probability α, i.e., X_{t+1} = Y with probability α, X_{t+1} = X_t otherwise
Resulting p_ij:
p_ij = q_ij min{1, π_j / π_i}, for i ≠ j
p_ii = 1 − Σ_{j≠i} p_ij
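The algorithm can be sketched generically. The target (π(k) ∝ k on five states) and the nearest-neighbour random-walk proposal below are illustrative choices, not from the slides:

```python
import random

def metropolis_step(x, propose, pi_ratio, rng):
    """One Metropolis step. Only the ratio pi(y)/pi(x) is needed,
    so the target may be known up to a constant; the proposal must
    be symmetric."""
    y = propose(x, rng)                   # draw Y from the proposal chain Q
    alpha = min(1.0, pi_ratio(y, x))      # acceptance probability
    return y if rng.random() < alpha else x

def propose(x, rng):
    """Symmetric random walk on {1,...,5}; stay put at the boundary."""
    y = x + rng.choice([-1, 1])
    return y if 1 <= y <= 5 else x

rng = random.Random(0)
x = 1
counts = [0] * 6
for _ in range(50_000):
    x = metropolis_step(x, propose, lambda y, x: y / x, rng)
    counts[x] += 1
print(counts[1:])  # roughly proportional to 1 : 2 : 3 : 4 : 5
```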
Metropolis Algorithm: Properties
Proposition (Metropolis works):
The p_ij's from the Metropolis algorithm satisfy detailed balance w.r.t. π, i.e., π_i p_ij = π_j p_ji
=⇒ the Metropolis chain has stationary distribution π
Remarks:
we only need to know ratios of values of π
we need to know π only up to a constant of proportionality!
the Metropolis chain might converge to π exponentially slowly (burn-in period)
Metropolis Algorithm: Optimization
Metropolis theoretically works, but
it needs a large β to make "good" states more likely
its convergence time may be exponential in β
=⇒ try changing β over time
Simulated Annealing
for the Knapsack Problem: α = min{1, exp(β_t Σ_i v_i (y_i − z_i))}
β_t increases slowly with time (e.g., β_t = log(t) or β_t = (1.001)^t)
Simulated Annealing
General optimization problem: maximize a function G(z) over all feasible solutions Ω.
Let Q again be an irreducible, symmetric Markov chain on Ω.
Simulated Annealing is a Metropolis chain with
p_ij = q_ij min{1, exp(β_t [G(j) − G(i)])}
p_ii = 1 − Σ_{j≠i} p_ij
Although the above considerations concerned discrete state spaces, the results apply straightforwardly to continuous state spaces (e.g., subspaces of R^n).
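A minimal sketch of this general form, with a made-up one-dimensional objective; the neighbor structure and the schedule β_t = log(t + 1) are illustrative choices:

```python
import math
import random

def anneal(G, neighbors, x0, steps, seed=0):
    """Simulated annealing for maximizing G: a Metropolis chain whose
    acceptance probability min{1, exp(beta_t [G(y) - G(x)])} sharpens
    as beta_t grows."""
    rng = random.Random(seed)
    x, best = x0, x0
    for t in range(1, steps + 1):
        beta = math.log(t + 1)                       # slowly increasing beta_t
        y = rng.choice(neighbors(x))                 # symmetric proposal Q
        alpha = min(1.0, math.exp(beta * (G(y) - G(x))))
        if rng.random() < alpha:
            x = y
        if G(x) > G(best):                           # remember the best state
            best = x
    return best

G = lambda x: -(x - 3) ** 2                          # toy objective, maximum at 3
neighbors = lambda x: [max(0, x - 1), min(10, x + 1)]
print(anneal(G, neighbors, x0=9, steps=2000))        # typically finds 3
```

Uphill moves are always accepted; downhill moves are accepted ever more rarely as β_t grows, so the chain settles near the maximum.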
Simulated Annealing: Proposal Distribution
Bimodal target distribution π
Normal proposal distribution Q = N(x_t, σ)
How to choose σ?
Proposal too narrow −→ π not sampled completely
Proposal too wide −→ rejection rate high
If the rejection rate is low and π is sampled sufficiently, the chain is said to mix well.
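The trade-off can be observed numerically. The sketch below targets a standard normal (an illustrative stand-in for π) and reports the acceptance rate of a random-walk Metropolis chain for three proposal widths:

```python
import math
import random

def acceptance_rate(sigma, steps=20_000, seed=0):
    """Fraction of accepted moves for a Metropolis chain with proposal
    N(x_t, sigma) targeting a standard normal."""
    rng = random.Random(seed)
    log_pi = lambda x: -0.5 * x * x       # log-density up to a constant
    x, accepted = 0.0, 0
    for _ in range(steps):
        y = rng.gauss(x, sigma)
        alpha = math.exp(min(0.0, log_pi(y) - log_pi(x)))
        if rng.random() < alpha:
            x, accepted = y, accepted + 1
    return accepted / steps

for sigma in (0.1, 1.0, 10.0):
    print(sigma, round(acceptance_rate(sigma), 3))
# too narrow: almost everything accepted, but the chain moves slowly;
# too wide: most proposals rejected
```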
Simulated Annealing: Cooling Schedule
The simulated annealing chain is time-inhomogeneous. The theory for time-homogeneous chains does not apply!
However, simulated annealing consists of a sequence of homogeneous chains.
Theoretical results on convergence: if, for each step T_i in the cooling schedule, the homogeneous chain mixes quickly enough, then convergence to the global extrema is ensured for the schedule

T_i = (C ln(i + T_0))^{-1}

where C and T_0 are problem dependent.
This holds for finite spaces and compact continuous spaces.
Simulated Annealing: Cooling Schedule
Problems: We normally do not know
how well the chain mixes (what the burn-in period is)
the optimal T0
the optimal C
We normally cannot ensure optimal results with simulated annealing.
But: Simulated annealing
has been found to work well for many hard problems
provides a general approach to solving difficult optimization problems
Usher’s Algorithm
Let
v ∈ R^n be a vector of parameter values
σ be a vector of standard deviations
One swap of the algorithm consists of:
compute the discrepancy between human behavior and model behavior (given the current v)
if the current discrepancy is less than the best found so far, replace the optimal v with the current v
generate a new set of parameter values using normal proposal distributions N(v, σ)
Every k swaps, σ is (slightly) reduced. The algorithm stops after n swaps.
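A sketch of one possible reading of the algorithm, with proposals drawn around the best parameter vector found so far. The discrepancy function here is a made-up stand-in (squared distance to a fixed "human" vector), since the real discrepancy is model-specific; the shrink factor and all numbers are illustrative:

```python
import random

def ushers_algorithm(discrepancy, v0, sigma0, n_swaps=500, k=50,
                     shrink=0.9, seed=0):
    """One run of the swap loop: score the current v, keep the best,
    propose new values componentwise from normal distributions, and
    shrink sigma every k swaps."""
    rng = random.Random(seed)
    v = list(v0)
    sigma = list(sigma0)
    best_v, best_d = list(v0), discrepancy(v0)
    for swap in range(1, n_swaps + 1):
        d = discrepancy(v)
        if d < best_d:                               # keep the best v so far
            best_v, best_d = list(v), d
        # normal proposals around the best v found so far (one reading
        # of "N(v, sigma)"); componentwise draws
        v = [rng.gauss(vi, si) for vi, si in zip(best_v, sigma)]
        if swap % k == 0:                            # every k swaps, shrink sigma
            sigma = [si * shrink for si in sigma]
    return best_v, best_d

target = [1.0, -2.0]                                 # stand-in for the human data
discrepancy = lambda v: sum((a - b) ** 2 for a, b in zip(v, target))
v_best, d_best = ushers_algorithm(discrepancy, v0=[0.0, 0.0],
                                  sigma0=[1.0, 1.0])
print(v_best, d_best)
```

Shrinking σ mirrors the cooling schedule of simulated annealing: broad exploration first, fine-tuning near the end.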
Usher’s Algorithm: Parameters
number of swaps: How thoroughly parameter space is explored
initial σ: rejection-rate / coverage trade-off
final σ: amount of fine-tuning
σ reduction rate: related to burn-in times −→ convergence and optimality
There are several interrelations between these parameters.