Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Christopher De Sa Kunle Olukotun Christopher Ré {cdesa,kunle,chrismre}@stanford.edu
Stanford
Overview

Asynchronous Gibbs sampling is a popular algorithm that's used in practical ML systems.
(e.g. Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010; etc.)

Question: when and why does it work?

"Folklore" says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does…
…but there's no theoretical guarantee.

Our contributions
1. The "folklore" is not necessarily true.
2. …but it works under reasonable conditions.
What is Gibbs Sampling?

Problem: given a probability distribution, produce samples from it.
• e.g. to do inference in a graphical model

Algorithm: Gibbs sampling
• de facto Markov chain Monte Carlo (MCMC) method for inference
• produces a series of approximate samples that approach the target distribution

Algorithm 1 Gibbs sampling
Require: Variables $x_i$ for $1 \le i \le n$, and distribution $\pi$.
loop
  Choose $s$ by sampling uniformly from $\{1, \ldots, n\}$.
  Re-sample $x_s$ from $P_\pi(x_s \mid x_{\{1,\ldots,n\}\setminus\{s\}})$.
  output $x$
end loop

(Figure: a small graphical model with variables $x_1, \ldots, x_7$, used to illustrate one update.)

Step by step:
• Choose a variable to update at random (in the figure, $x_5$).
• Compute its conditional distribution given the other variables (in the figure, $x_5$ depends on its neighbors $x_4$, $x_6$, $x_7$, and its conditional distribution puts probability 0.7 on one value and 0.3 on the other).
• Update the variable by sampling from its conditional distribution.
• Output the current state as a sample.
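To make the update rule above concrete, here is a minimal sketch of sequential Gibbs sampling (a plain Python rendering of Algorithm 1, not code from the paper). The model is supplied through a hypothetical conditional(i, x) function that returns the conditional distribution of variable i given the rest of the state.

import random

def gibbs_sample(n_vars, conditional, n_steps, init):
    # Sequential Gibbs sampling, following Algorithm 1.
    # conditional(i, x) is assumed to return a dict mapping each value
    # of variable i to its conditional probability given the rest of x.
    x = list(init)
    samples = []
    for _ in range(n_steps):
        s = random.randrange(n_vars)          # choose a variable uniformly
        dist = conditional(s, x)              # P_pi(x_s | x_{-s})
        r, acc = random.random(), 0.0
        for value, prob in dist.items():      # re-sample x_s from its conditional
            acc += prob
            if r <= acc:
                x[s] = value
                break
        samples.append(tuple(x))              # output the current state
    return samples

For a two-variable binary model like the bias example later in the deck, conditional(i, x) would simply renormalize the joint probability table over the two values of variable i.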
Gibbs Sampling: A Practical Perspective

• Pros of Gibbs sampling
  – Easy to implement
  – Updates are sparse → fast on modern CPUs

• Cons of Gibbs sampling
  – Sequential algorithm → can't naively parallelize
  – e.g. on a 64-core machine, running with no parallelism leaves up to 98% of the performance on the table!
Asynchronous Gibbs Sampling

• Run multiple threads in parallel without locks
  – also known as HOGWILD!
  – adapted from a popular technique for stochastic gradient descent (SGD)

• When we read a variable, it could be stale
  – while we re-sample a variable, its adjacent variables can be overwritten by other threads
  – semantics not equivalent to standard (sequential) Gibbs sampling
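A minimal sketch of the lock-free execution model described above: several threads run the same update loop against one shared state vector with no locks, so a thread can read neighbor values that another thread is concurrently overwriting. The thread count and the conditional(i, x) interface are illustrative assumptions; note also that CPython's global interpreter lock means this sketch demonstrates the HOGWILD! semantics rather than the speedup, which in practice comes from native multithreaded implementations.

import random
import threading

def hogwild_gibbs(n_vars, conditional, steps_per_thread, init, n_threads=4):
    # Asynchronous (HOGWILD!) Gibbs sampling sketch: all threads share x
    # without synchronization, so reads may be stale and writes may race.
    x = list(init)                            # shared, deliberately unprotected

    def worker():
        for _ in range(steps_per_thread):
            s = random.randrange(n_vars)
            dist = conditional(s, x)          # may observe stale neighbor values
            r, acc = random.random(), 0.0
            for value, prob in dist.items():
                acc += prob
                if r <= acc:
                    x[s] = value              # racy write, by design
                    break

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x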
Question

Does asynchronous Gibbs sampling work?
…and what does it mean for it to work?

Two desiderata
• want to get accurate estimates → bound the bias
• want to be independent of initial conditions quickly → bound the mixing time
Previous Work

• "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" — Niu et al., NIPS 2011.
  follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015

• "Analyzing Hogwild Parallel Gaussian Gibbs Sampling" — Johnson et al., NIPS 2013.
Bias

• How close are samples to the target distribution?
  – standard measurement: total variation distance

  $\|\mu - \nu\|_{TV} = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|$

• For sequential Gibbs, there is no asymptotic bias:

  $\forall \mu_0, \quad \lim_{t \to \infty} \|P^{(t)}\mu_0 - \pi\|_{TV} = 0$

"Folklore": asynchronous Gibbs is also unbiased.
…but this is not necessarily true!
Simple Bias Example

Target distribution over two binary variables:

  $p(0,1) = p(1,0) = p(1,1) = \tfrac{1}{3}, \qquad p(0,0) = 0.$

(Figure: the four states $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$ with the sequential Gibbs transition probabilities 1/4, 1/2, and 3/4 marked on the edges.)

Now suppose two threads update starting from state $(1,1)$, one re-sampling each variable. Each thread reads the other variable's stale value of 1, so each variable independently becomes 0 or 1 with probability 1/2, and all four states are possible outcomes, each with probability 1/4, including $(0,0)$, which should have zero probability!
Nonzero Asymptotic Bias

(Figure: "Distribution of Sequential vs. Hogwild! Gibbs": the empirical probability of each of the four states under sequential and Hogwild! Gibbs, showing the bias introduced by Hogwild!-Gibbs. A companion figure shows the marginal distribution of each variable, where the two samplers agree much more closely.)

Measured bias (total variation distance):
• sequential: < 0.1% (unbiased)
• asynchronous: 9.8% (biased)
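The flavor of this result can be reproduced with a small simulation of the two-variable example. The schedule below, in which both variables are re-sampled simultaneously from stale reads on every step, is an illustrative extreme (it exaggerates the bias relative to the 9.8% measured above, where collisions are only occasional), but it shows the same effect: sequential Gibbs converges to the target while the asynchronous schedule does not.

import random
from collections import Counter

def cond_prob_one(other):
    # P(x_i = 1 | x_other) for the example: forced to 1 if the other
    # variable is 0 (since p(0,0) = 0), and 1/2 if the other variable is 1.
    return 1.0 if other == 0 else 0.5

def sequential_step(x):
    s = random.randrange(2)
    x[s] = 1 if random.random() < cond_prob_one(x[1 - s]) else 0

def hogwild_step(x):
    # Two "threads" both read the current (stale) state, then both write.
    new0 = 1 if random.random() < cond_prob_one(x[1]) else 0
    new1 = 1 if random.random() < cond_prob_one(x[0]) else 0
    x[0], x[1] = new0, new1

def estimate_distribution(step, n_samples=200000, burn_in=1000):
    x, counts = [1, 1], Counter()
    for t in range(burn_in + n_samples):
        step(x)
        if t >= burn_in:
            counts[tuple(x)] += 1
    return {s: c / n_samples for s, c in counts.items()}

target = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}
for name, step in [("sequential", sequential_step), ("asynchronous", hogwild_step)]:
    est = estimate_distribution(step)
    tv = 0.5 * sum(abs(est.get(s, 0.0) - p) for s, p in target.items())
    print(name, "TV distance from target:", round(tv, 3))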
Are we using the right metric?

• No, total variation distance is too conservative
  – it depends on events that don't matter for inference
  – we usually only care about a small number of variables

• New metric: sparse variation distance

  $\|\mu - \nu\|_{SV(\omega)} = \max_{|A| \le \omega} |\mu(A) - \nu(A)|$

  where $|A|$ is the number of variables on which event $A$ depends

Simple Example: Bias of Asynchronous Gibbs
• Total variation: 9.8%
• Sparse variation ($\omega = 1$): 0.4%
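To make the new metric concrete, here is a brute-force sketch that computes both distances for small discrete distributions. For ω = 1 the sparse variation distance is just the worst discrepancy over single-variable marginals, which is why it can be so much smaller than total variation. The "biased" distribution below is made up for illustration; it is not the measured Hogwild! distribution from the previous slide.

from itertools import combinations, product

def tv_distance(mu, nu, states):
    # Total variation distance: half the L1 distance over all states.
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

def sv_distance(mu, nu, n_vars, omega):
    # Sparse variation distance: the worst event that depends on at most
    # omega variables, i.e. the worst TV distance between marginals on any
    # subset of at most omega variables (brute force, tiny models only).
    def marginal(dist, subset):
        out = {}
        for state, p in dist.items():
            key = tuple(state[i] for i in subset)
            out[key] = out.get(key, 0.0) + p
        return out
    worst = 0.0
    for k in range(1, omega + 1):
        for subset in combinations(range(n_vars), k):
            ma, na = marginal(mu, subset), marginal(nu, subset)
            keys = set(ma) | set(na)
            worst = max(worst, 0.5 * sum(abs(ma.get(t, 0.0) - na.get(t, 0.0)) for t in keys))
    return worst

states = list(product([0, 1], repeat=2))
target = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}
biased = {(0, 1): 0.30, (1, 0): 0.30, (1, 1): 0.35, (0, 0): 0.05}  # illustrative only
print("TV distance:   ", round(tv_distance(target, biased, states), 3))
print("SV(1) distance:", round(sv_distance(target, biased, 2, 1), 3))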
Total Influence Parameter

• An old condition that was used to study the mixing times of spin systems in statistical physics:

  $\alpha = \max_{i \in I} \sum_{j \in I} \max_{(X,Y) \in B_j} \left\| \pi_i(\cdot \mid X_{I \setminus \{i\}}) - \pi_i(\cdot \mid Y_{I \setminus \{i\}}) \right\|_{TV}$

  – $(X,Y) \in B_j$ means $X$ and $Y$ are equal except at variable $j$.
  – $\pi_i(\cdot \mid X_{I \setminus \{i\}})$ is the conditional distribution of variable $i$ given the values of all the other variables in state $X$.
  – Dobrushin's condition holds when $\alpha < 1$.
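For intuition about the definition, total influence can be computed by brute force when the model is small enough to enumerate; the sketch below does exactly that for a model given as an explicit probability table. It is an illustration of the formula above, feasible only for a handful of variables, and not the analysis tool used in the paper. On the two-variable example from the bias slides it returns α = 1/2, so that model satisfies Dobrushin's condition even though Hogwild! sampling of it shows noticeable total-variation bias.

from itertools import product

def total_influence(p, n_vars, values=(0, 1)):
    # Brute-force total influence: alpha = max_i sum_j max over pairs of
    # states that differ only at variable j of the TV distance between the
    # conditional distributions of variable i.  Assumes every conditional
    # has a positive normalizer.
    def conditional(i, state):
        weights = []
        for v in values:
            s = list(state); s[i] = v
            weights.append(p(tuple(s)))
        z = sum(weights)
        return [w / z for w in weights]

    alpha = 0.0
    for i in range(n_vars):
        row = 0.0
        for j in range(n_vars):
            if j == i:
                continue   # the j = i term is always zero
            worst = 0.0
            for x in product(values, repeat=n_vars):
                for vj in values:
                    y = list(x); y[j] = vj
                    pi_x = conditional(i, x)
                    pi_y = conditional(i, tuple(y))
                    worst = max(worst, 0.5 * sum(abs(a - b) for a, b in zip(pi_x, pi_y)))
            row += worst
        alpha = max(alpha, row)
    return alpha

example = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}
print("alpha =", total_influence(lambda s: example[s], 2))   # prints 0.5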
Asymptotic Result

• For any class of distributions with bounded total influence $\alpha = O(1)$
  – big-O notation is over the number of variables $n$

• If $O(n)$ timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
  – measured by $\omega$-sparse variation distance, for fixed $\omega$

• …then asynchronous Gibbs requires only $O(1)$ additional timesteps to achieve the same bias!

(more details, explicit bounds, et cetera in the paper)
Mixing Time

• How long do we need to run until the samples are independent of the initial conditions?

• The mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution.
  – measured in terms of total variation distance
  – feasible to run MCMC if the mixing time is small

"Folklore": asynchronous Gibbs has the same mixing time as sequential Gibbs…
…also not necessarily true!
Mixing Time Example

(Figure: "Mixing of Sequential vs. Hogwild! Gibbs": the empirical estimate of a marginal probability that should be exactly 0.5, plotted against the number of samples (in thousands) for several values of the maximum read delay $\tau$, alongside sequential Gibbs and the true value. Here $\tau$ is the hardware-dependent read staleness parameter of the HOGWILD! execution model.)

• Sequential Gibbs achieves the correct marginal quickly: $t_{mix} = O(n \log n)$.
• Asynchronous Gibbs takes much longer: $t_{mix} = \exp(\Omega(n))$.
Bounding the Mixing Time

Suppose that our target distribution satisfies Dobrushin's condition (total influence $\alpha < 1$).

• Mixing time of sequential Gibbs (known result):

  $t_{mix\text{-}seq}(\epsilon) \le \frac{n}{1 - \alpha} \log\left(\frac{n}{\epsilon}\right)$

• Mixing time of asynchronous Gibbs:

  $t_{mix\text{-}hog}(\epsilon) \le \frac{n + \alpha\tau}{1 - \alpha} \log\left(\frac{n}{\epsilon}\right)$

  ($\tau$ is the hardware-dependent read staleness parameter)

Takeaway message: we can compare the two mixing time bounds with

  $t_{mix\text{-}hog}(\epsilon) \approx \left(1 + \alpha\tau n^{-1}\right) t_{mix\text{-}seq}(\epsilon)$

…they differ by only a negligible factor!
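As a quick sanity check on how negligible that factor is, plugging some illustrative values (these particular numbers are not from the paper) into the comparison above gives:

$n = 10^{6}, \; \alpha = \tfrac{1}{2}, \; \tau = 100: \qquad \frac{t_{mix\text{-}hog}(\epsilon)}{t_{mix\text{-}seq}(\epsilon)} \approx 1 + \alpha \tau n^{-1} = 1 + \frac{0.5 \times 100}{10^{6}} = 1.00005$

so under these assumptions the asynchronous bound exceeds the sequential one by about 0.005%.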
Theory Matches Experiment

(Figure 4: "Estimated t_mix of HOGWILD! Gibbs on Large Ising Model"; axes: mixing time vs. expected delay parameter $\tau^*$; series: estimated and theory. Caption: Comparison of estimated mixing time and theory-predicted (by Equation 2) mixing time as $\tau^*$ increases for a synthetic Ising model graph ($n = 1000$, $\Delta = 3$).)

Directly estimating the (relatively small) dependence of the mixing time on $\tau$ proved to be computationally intractable.
Instead, we use a technique called coupling to the future. We initialize two chains, $X$ and $Y$, by setting all the variables in $X_0$ to 1 and all the variables in $Y_0$ to $-1$. We proceed by simulating a coupling between the two chains, and return the coupling time $T_c$. Our estimate of the mixing time will then be $\hat{t}(\epsilon)$, where $P(T_c \ge \hat{t}(\epsilon)) = \epsilon$.

Statement 2. This experimental estimate is an upper bound for the mixing time. That is, $\hat{t}(\epsilon) \ge t_{mix}(\epsilon)$.

To estimate $\hat{t}(\epsilon)$, we ran 10000 instances of the coupling experiment, and returned the sample estimate of $\hat{t}(1/4)$. To compare across a range of $\tau^*$, we selected the $\tilde{\tau}_{i,t}$ to be independent and identically distributed according to the maximum-entropy distribution supported on $\{0, 1, \ldots, 200\}$ consistent with a particular assignment of $\tau^*$. The resulting estimates are plotted as the blue series in Figure 4. The red line represents the mixing time that would be predicted by naively applying Equation 2 using the estimate of the sequential mixing time as a starting point; we can see that it is a very good match for the experimental results. This experiment shows that, at least for one archetypal model, our theory accurately characterizes the behavior of HOGWILD! Gibbs sampling as the delay parameter $\tau^*$ is changed, and that using HOGWILD!-Gibbs doesn't cause the model to catastrophically fail to mix.
Of course, in order for HOGWILD!-Gibbs to be useful, it must also speed up the execution of Gibbs sampling on some practical models. It is already known that this is the case, as these types of algorithms have been widely implemented in practice (Smola & Narayanamurthy, 2010; Smyth et al., 2009). To further test this, we ran HOGWILD!-Gibbs sampling on a real-world 11 GB Knowledge Base Population dataset (derived from the TAC-KBP challenge) using a machine with a single-socket, 18-core Xeon E7-8890 CPU and 1 TB RAM. As a comparison, we also ran a "multi-model" Gibbs sampler: this consists of multiple threads with a single execution of Gibbs sampling running independently in each thread. This sampler will produce the same number of samples as HOGWILD!-Gibbs, but will require more memory to store multiple copies of the model.

(Figure 5: "Performance of HOGWILD! Gibbs on KBP Dataset"; axes: speedup over single-threaded vs. number of threads (1 to 36); series: HOGWILD! and multi-model. Caption: Speedup of HOGWILD! and multi-model Gibbs sampling on large KBP dataset (11 GB).)

Figure 5 reports the speedup, in terms of wall-clock time, achieved by HOGWILD!-Gibbs on this dataset. On this machine, we get speedups of up to 2.8×, although the program becomes memory-bandwidth bound at around 8 threads, and we see no significant speedup beyond this. With any number of workers, the run time of HOGWILD!-Gibbs is close to that of multi-model Gibbs, which illustrates that the additional cache contention caused by the HOGWILD! updates has little effect on the algorithm's performance.
7. Conclusion

We analyzed HOGWILD!-Gibbs sampling, a heuristic for parallelized MCMC sampling, on discrete-valued graphical models. First, we constructed a statistical model for HOGWILD!-Gibbs by adapting a model already used for the analysis of asynchronous SGD. Next, we illustrated a major issue with HOGWILD!-Gibbs sampling: that it produces biased samples. To address this, we proved that if for some class of models with bounded total influence, only O(n) sequential Gibbs samples are necessary to produce good marginal estimates, then HOGWILD!-Gibbs sampling produces equally good estimates after only O(1) additional steps. Additionally, for models that satisfy Dobrushin's condition ($\alpha < 1$), we proved mixing time bounds for sequential and asynchronous Gibbs sampling that differ by only a factor of $1 + O(n^{-1})$. Finally, we showed that our theory matches experimental results, and that HOGWILD!-Gibbs produces speedups of up to 2.8× on a real dataset.
Conclusion

• Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics
  – sample bias → how close to the target distribution?
  – mixing time → how long do we need to run?

• Showed that asynchronicity can cause problems

• Proved bounds on the effect of asynchronicity
  – using the new sparse variation distance, together with
  – the classical condition of total influence
Thank you!
cdesa@stanford.edu stanford.edu/~cdesa