
Page 1

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling

Christopher De Sa, Kunle Olukotun, Christopher Ré
{cdesa,kunle,chrismre}@stanford.edu

Stanford

Page 2

Overview

Page 3

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.

Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010; …etc.


Page 7

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.

Question: when and why does it work?

“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does

…but there’s no theoretical guarantee.

Our contributions

1. The “folklore” is not necessarily true.
2. …but it works under reasonable conditions.



Page 11

Problem: given a probability distribution, produce samples from it.
• e.g., to do inference in a graphical model

Algorithm: Gibbs sampling
• the de facto Markov chain Monte Carlo (MCMC) method for inference
• produces a series of approximate samples that approach the target distribution

Page 12

What is Gibbs Sampling?

Page 13

What is Gibbs Sampling?

Algorithm 1: Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
  Choose s by sampling uniformly from {1, …, n}.
  Re-sample x_s from P_π(x_s | x_{{1,…,n}∖{s}}).
  Output x.
end loop
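For concreteness, here is a minimal Python sketch of this loop (illustrative, not from the paper; the `conditional` callback, which should return P_π(x_s = v | rest) as a dict of value → probability, is an assumed user-supplied function):

```python
import random

def gibbs_sampling(x0, conditional, num_steps, rng=random):
    """Minimal sequential Gibbs sampler, following Algorithm 1.

    x0          -- initial state, e.g. {"x1": 0, "x2": 1, ...}
    conditional -- assumed callback: (s, x) -> {value: P(x_s = value | rest)}
    """
    x = dict(x0)
    samples = []
    for _ in range(num_steps):
        s = rng.choice(sorted(x))        # choose s uniformly from the variables
        dist = conditional(s, x)         # conditional of x_s given all the others
        values = list(dist)
        x[s] = rng.choices(values, weights=[dist[v] for v in values])[0]
        samples.append(dict(x))          # output the current state x
    return samples
```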


Page 15

What is Gibbs Sampling?

Choose a variable to update at random (here, x5).

[Figure: a graphical model over variables x1 through x7, with x5 highlighted.]


Page 17

What is Gibbs Sampling?

Compute its conditional distribution given the other variables, then update the variable by sampling from its conditional distribution.

[Figure: given its neighbors x4, x6, and x7, the conditional distribution of x5 puts probability 0.7 on one value and 0.3 on the other.]

Page 18

What is Gibbs Sampling?

Output the current state as a sample.

Page 19

Gibbs Sampling: A Practical Perspective


Page 21

Gibbs Sampling: A Practical Perspective

• Pros of Gibbs sampling
  – easy to implement
  – updates are sparse → fast on modern CPUs

• Cons of Gibbs sampling
  – sequential algorithm → can't naively parallelize

e.g., on a 64-core machine, running with no parallelism leaves up to 98% of the performance on the table!

Page 22

Asynchronous Gibbs Sampling

Page 23

Asynchronous Gibbs Sampling

• Run multiple threads in parallel without locks
  – also known as HOGWILD!
  – adapted from a popular technique for stochastic gradient descent (SGD)

• When we read a variable, it could be stale
  – while we re-sample a variable, its adjacent variables can be overwritten by other threads
  – semantics not equivalent to standard (sequential) Gibbs sampling
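A minimal sketch of this execution model (illustrative only, not the paper's implementation: production systems use native threads, and CPython's GIL limits real parallel speedup here, but the lock-free read/write semantics are the same; `conditional` is the same assumed callback as in the sequential sketch above):

```python
import random
import threading

def hogwild_gibbs(x, conditional, steps_per_thread, num_threads=4):
    """Lock-free (HOGWILD!) Gibbs: every thread runs the sequential update
    loop on the same shared state dict with no synchronization, so the
    neighbor values a thread reads may already be stale."""
    names = sorted(x)

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps_per_thread):
            s = rng.choice(names)
            dist = conditional(s, x)   # may observe stale writes from other threads
            values = list(dist)
            x[s] = rng.choices(values, weights=[dist[v] for v in values])[0]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```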



Page 29

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata:
• want to get accurate estimates ⇒ bound the bias
• want to become independent of initial conditions quickly ⇒ bound the mixing time

Page 30

Previous Work

Page 31

Previous Work

• "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (Niu et al., NIPS 2011).
  Follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015.

• "Analyzing Hogwild Parallel Gaussian Gibbs Sampling" (Johnson et al., NIPS 2013).



Page 34

Bias


Page 37

Bias

• How close are the samples to the target distribution?
  – standard measurement: total variation distance

  ‖μ − ν‖_TV = max_{A⊆Ω} |μ(A) − ν(A)|

• For sequential Gibbs, there is no asymptotic bias:

  ∀μ₀: lim_{t→∞} ‖P^(t)μ₀ − π‖_TV = 0

“Folklore”: asynchronous Gibbs is also unbiased. …but this is not necessarily true!


Page 43

Simple Bias Example

Target distribution over two binary variables:

  π(0,1) = π(1,0) = π(1,1) = 1/3, π(0,0) = 0.

[Figure: the sequential Gibbs chain on states (0,0), (0,1), (1,0), (1,1), with transition probabilities 1/4, 1/2, and 3/4 on its edges.]

Two threads update starting at (1,1). Both threads read the same state (1,1), so each re-samples its variable to 0 or 1 with probability 1/2, and all four states, including (0,0), are reached with probability 1/4 each.

(0,0) should have zero probability!
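To see the effect numerically, here is a small simulation sketch (not from the paper): the asynchronous branch models the extreme case where both variables are re-sampled from the same stale state at every step, while the sequential branch updates one variable at a time. π(0,0) = 0, yet the asynchronous chain spends a noticeable fraction of its time at (0,0):

```python
import random
from collections import Counter

rng = random.Random(0)
pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}

def resample(i, state):
    """Draw variable i from its conditional under pi, given the other variable."""
    w = [pi[(v, state[1])] if i == 0 else pi[(state[0], v)] for v in (0, 1)]
    return rng.choices((0, 1), weights=w)[0]

def run(hogwild, steps=100_000):
    x, counts = (1, 1), Counter()
    for _ in range(steps):
        if hogwild:
            # both "threads" read the same stale state x, then write their variable
            x = (resample(0, x), resample(1, x))
        else:
            # sequential Gibbs: pick one variable uniformly and update it in place
            i = rng.choice((0, 1))
            x = (resample(0, x), x[1]) if i == 0 else (x[0], resample(1, x))
        counts[x] += 1
    return {s: c / steps for s, c in counts.items()}

print("sequential   P(0,0):", run(hogwild=False).get((0, 0), 0.0))  # ~0.000
print("asynchronous P(0,0):", run(hogwild=True).get((0, 0), 0.0))   # clearly > 0
```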


Page 45

Nonzero Asymptotic Bias

[Figure: joint distribution (over the four states) and single-variable marginal distribution of sequential vs. Hogwild! Gibbs for the example, showing the bias introduced by Hogwild!-Gibbs.]

Measured bias (total variation distance):
• sequential: < 0.1% (unbiased)
• asynchronous: 9.8% (biased)

Page 46

Are we using the right metric?


Page 49

Are we using the right metric?

• No, total variation distance is too conservative
  – it depends on events that don't matter for inference
  – we usually only care about a small number of variables

• New metric: sparse variation distance

  ‖μ − ν‖_SV(ω) = max_{|A|≤ω} |μ(A) − ν(A)|

  where |A| is the number of variables on which the event A depends

Simple example (bias of asynchronous Gibbs):
• total variation: 9.8%
• sparse variation (ω = 1): 0.4%
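A sketch of how one might compute this metric exhaustively for small discrete distributions (not the paper's code). It uses the fact that the maximum of |μ(A) − ν(A)| over events A depending only on a variable subset S equals the total variation distance between the marginals on S:

```python
from itertools import combinations

def tv(mu, nu):
    """Total variation distance between two dicts mapping states to probabilities."""
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

def marginal(mu, idx):
    """Marginal of a distribution over tuple-valued states onto the variables in idx."""
    out = {}
    for state, p in mu.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def sparse_variation(mu, nu, omega):
    """||mu - nu||_SV(omega): maximize TV over marginals on <= omega variables."""
    n = len(next(iter(mu)))
    return max(tv(marginal(mu, S), marginal(nu, S))
               for k in range(1, omega + 1)
               for S in combinations(range(n), k))
```

Plugging in the target π and an empirical distribution of Hogwild! samples from the two-variable example above should reproduce the kind of gap shown here: large total variation, small SV(1).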

Page 50

Total Influence Parameter


Page 52

Total Influence Parameter

• An old condition used to study mixing times of spin systems:

  α = max_{i∈I} Σ_{j∈I} max_{(X,Y)∈B_j} ‖π_i(·|X_{I∖{i}}) − π_i(·|Y_{I∖{i}})‖_TV

  – (X, Y) ∈ B_j means X and Y are equal except at variable j
  – π_i(·|X_{I∖{i}}) is the conditional distribution of variable i given the values of all the other variables in state X

• Dobrushin's condition holds when α < 1.
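A brute-force sketch of this quantity for small binary models (not from the paper; exponential in n, so it is for illustration only, and it assumes every partial assignment has at least one positive-probability completion):

```python
from itertools import product

def conditional(pi, i, state):
    """pi_i(. | state_{-i}) as a list [P(x_i = 0 | rest), P(x_i = 1 | rest)]."""
    w = [pi.get(state[:i] + (v,) + state[i + 1:], 0.0) for v in (0, 1)]
    z = sum(w)   # assumes some completion has positive probability
    return [p / z for p in w]

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def total_influence(n, pi):
    """alpha = max_i sum_j max_{(X,Y) in B_j} ||pi_i(.|X) - pi_i(.|Y)||_TV."""
    alpha = 0.0
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j == i:
                continue  # X and Y agree off j, so the conditionals of i coincide
            total += max(
                tv(conditional(pi, i, X),
                   conditional(pi, i, X[:j] + (1 - X[j],) + X[j + 1:]))
                for X in product((0, 1), repeat=n))
        alpha = max(alpha, total)
    return alpha

# For the two-variable bias example (pi(0,0) = 0, others 1/3):
pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}
print(total_influence(2, pi))  # 0.5 < 1, so Dobrushin's condition holds
```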


Page 54

Asymptotic Result

• For any class of distributions with bounded total influence α = O(1)
  – big-O notation is over the number of variables n

• If O(n) timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
  – measured by ω-sparse variation distance, for fixed ω

• …then asynchronous Gibbs requires only O(1) additional timesteps to achieve the same bias!

(more details, explicit bounds, et cetera in the paper)


Page 56

Mixing Time


Page 58

Mixing Time

• How long do we need to run until the samples are independent of the initial conditions?

• The mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution
  – measured in terms of total variation distance
  – it is feasible to run MCMC when the mixing time is small

“Folklore”: asynchronous Gibbs has the same mixing time as sequential Gibbs. …also not necessarily true!
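For reference, a standard formalization of the slide's definition (the slide leaves it implicit; by convention one takes ε = 1/4 when unspecified, matching the t̂(1/4) estimate used in the experiments later in the deck):

```latex
% Mixing time: the first time t at which, from the worst-case starting
% distribution \mu_0, the chain is within total variation \epsilon of \pi.
t_{\mathrm{mix}}(\epsilon) = \min \left\{ t :
    \max_{\mu_0} \left\| P^{(t)} \mu_0 - \pi \right\|_{\mathrm{TV}} \le \epsilon \right\}
```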


Page 60

Mixing Time Example

[Figure: "Mixing of Sequential vs Hogwild! Gibbs": empirical estimate of a marginal event probability vs. sample number (thousands), for several values of the maximum read delay τ, compared against sequential Gibbs and the true distribution. τ is a hardware-dependent read staleness parameter of the HOGWILD! execution.]

Sequential Gibbs achieves the correct marginal quickly: t_mix = O(n log n).

Asynchronous Gibbs takes much longer: t_mix = exp(Ω(n)).

Page 61

Bounding the Mixing Time


Page 63

Bounding the Mixing Time

Suppose that our target distribution satisfies Dobrushin's condition (total influence α < 1).

• Mixing time of sequential Gibbs (known result):

  t_mix-seq(ε) ≤ (n / (1 − α)) · log(n / ε)

• Mixing time of asynchronous Gibbs:

  t_mix-hog(ε) ≤ ((n + ατ) / (1 − α)) · log(n / ε)

  (τ is a hardware-dependent read staleness parameter)

Takeaway message: we can compare the two mixing time bounds with

  t_mix-hog(ε) ≈ (1 + ατn⁻¹) · t_mix-seq(ε)

…they differ by only a negligible factor!
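A quick back-of-the-envelope check of that factor, with illustrative values (n = 1000 and maximum delay τ = 200 echo the experiment later in the deck; α = 1/2 and ε = 1/4 are arbitrary choices, not paper-sourced):

```python
from math import log

n, alpha, tau, eps = 1000, 0.5, 200, 0.25   # illustrative values

t_seq = n / (1 - alpha) * log(n / eps)                  # sequential Gibbs bound
t_hog = (n + alpha * tau) / (1 - alpha) * log(n / eps)  # asynchronous Gibbs bound

print(f"sequential bound:   {t_seq:.0f} steps")
print(f"asynchronous bound: {t_hog:.0f} steps")
print(f"ratio: {t_hog / t_seq:.3f}")   # = 1 + alpha*tau/n = 1.1
```

Even at the largest delay used in the experiment, the bound grows by only about 10%.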

Page 64

Theory Matches Experiment

[Figure 4: Comparison of estimated mixing time and theory-predicted (by Equation 2) mixing time as τ* increases, for a synthetic Ising model graph (n = 1000, Δ = 3). Estimated t_mix of HOGWILD! Gibbs (roughly 16500 to 19000 steps) vs. expected delay parameter τ* (0 to 200), estimated vs. theory.]

…the (relatively small) dependence of the mixing time on τ proved to be computationally intractable.

Instead, we use a technique called coupling to the future. We initialize two chains, X and Y, by setting all the variables in X₀ to 1 and all the variables in Y₀ to −1. We proceed by simulating a coupling between the two chains, and return the coupling time T_c. Our estimate of the mixing time will then be t̂(ε), where P(T_c ≥ t̂(ε)) = ε.

Statement 2. This experimental estimate is an upper bound for the mixing time. That is, t̂(ε) ≥ t_mix(ε).
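A sketch of the sequential version of this coupling experiment on a toy Ising ring (the graph, β, and sample counts here are illustrative assumptions; the paper's version additionally injects random read delays τ̃_{i,t} to simulate the HOGWILD! execution):

```python
import random
from math import exp

def p_plus(state, i, neighbors, beta):
    """P(x_i = +1 | rest) for an Ising model on +/-1 spins."""
    field = sum(state[j] for j in neighbors[i])
    return 1.0 / (1.0 + exp(-2.0 * beta * field))

def coupling_time(neighbors, beta, rng, max_steps=10**6):
    """Run X (all +1) and Y (all -1) with shared randomness: the same site s
    and the same uniform u drive both chains, so once they meet they stay
    together. Returns the first time the two chains coincide."""
    n = len(neighbors)
    X, Y = [1] * n, [-1] * n
    for t in range(1, max_steps + 1):
        s = rng.randrange(n)
        u = rng.random()
        X[s] = 1 if u < p_plus(X, s, neighbors, beta) else -1
        Y[s] = 1 if u < p_plus(Y, s, neighbors, beta) else -1
        if X == Y:
            return t
    return max_steps

rng = random.Random(0)
n, beta = 20, 0.3                                   # toy ring of 20 spins
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
times = sorted(coupling_time(neighbors, beta, rng) for _ in range(1000))
print("t_hat(1/4):", times[int(0.75 * len(times))])  # (1 - 1/4)-quantile of T_c
```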

To estimate t̂(ε), we ran 10000 instances of the coupling experiment, and returned the sample estimate of t̂(1/4). To compare across a range of τ*, we selected the τ̃_{i,t} to be independent and identically distributed according to the maximum-entropy distribution supported on {0, 1, …, 200} consistent with a particular assignment of τ*. The resulting estimates are plotted as the blue series in Figure 4. The red line represents the mixing time that would be predicted by naively applying Equation 2 using the estimate of the sequential mixing time as a starting point; we can see that it is a very good match for the experimental results. This experiment shows that, at least for one archetypal model, our theory accurately characterizes the behavior of HOGWILD! Gibbs sampling as the delay parameter τ* is changed, and that using HOGWILD!-Gibbs doesn't cause the model to catastrophically fail to mix.

Of course, in order for HOGWILD!-Gibbs to be useful, it must also speed up the execution of Gibbs sampling on some practical models. It is already known that this is the case, as these types of algorithms have been widely implemented in practice (Smola & Narayanamurthy, 2010; Smyth et al., 2009). To further test this, we ran HOGWILD!-Gibbs sampling on a real-world 11 GB Knowledge Base Population dataset (derived from the TAC-KBP challenge) using a machine with a single-socket, 18-core Xeon E7-8890 CPU and 1 TB RAM. As a comparison, we also ran a "multi-model" Gibbs sampler: this consists of multiple threads with a single execution of Gibbs sampling running independently in each thread. This sampler will produce the same number of samples as HOGWILD!-Gibbs, but will require more memory to store multiple copies of the model.

[Figure 5: Speedup of HOGWILD! and multi-model Gibbs sampling on the large KBP dataset (11 GB): speedup over single-threaded (0 to 3.5×) vs. number of threads (1 to 36).]

Figure 5 reports the speedup, in terms of wall-clock time, achieved by HOGWILD!-Gibbs on this dataset. On this machine, we get speedups of up to 2.8×, although the program becomes memory-bandwidth bound at around 8 threads, and we see no significant speedup beyond this. With any number of workers, the run time of HOGWILD!-Gibbs is close to that of multi-model Gibbs, which illustrates that the additional cache contention caused by the HOGWILD! updates has little effect on the algorithm's performance.

7. Conclusion

We analyzed HOGWILD!-Gibbs sampling, a heuristic for parallelized MCMC sampling, on discrete-valued graphical models. First, we constructed a statistical model for HOGWILD!-Gibbs by adapting a model already used for the analysis of asynchronous SGD. Next, we illustrated a major issue with HOGWILD!-Gibbs sampling: that it produces biased samples. To address this, we proved that if for some class of models with bounded total influence, only O(n) sequential Gibbs samples are necessary to produce good marginal estimates, then HOGWILD!-Gibbs sampling produces equally good estimates after only O(1) additional steps. Additionally, for models that satisfy Dobrushin's condition (α < 1), we proved mixing time bounds for sequential and asynchronous Gibbs sampling that differ by only a factor of 1 + O(n⁻¹). Finally, we showed that our theory matches experimental results, and that HOGWILD!-Gibbs produces speedups up to 2.8× on a real dataset.


Page 66

Conclusion

• Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics
  – sample bias → how close are we to the target distribution?
  – mixing time → how long do we need to run?

• Showed that asynchronicity can cause problems

• Proved bounds on the effect of asynchronicity
  – using the new sparse variation distance, together with
  – the classical condition of total influence

Thank you!

[email protected]
stanford.edu/~cdesa