bayesian evolutionary analysis by sampling trees

71
Engelberg-Titlis Tourismus Computational Evolution Taming the BEAST Bayesian evolutionary analysis by sampling trees A C A C A CC TA C A G A C TTA C A G A CCC T C A C A CC TA C A C A CCC A C A G A C TT A C A G A C TTT C A G A C TTT C A G A CCC T C A G A C TTT C A C A CC TT C A G A CC T T C A C A CC TA C A C A CCC A C A G A C TT T C A G A C TTT C A C A CC TT C A G A CC T T C A C A CC TA C A C A CCC A C A G A C TT T C A G A C TTT C A C A CC TT C A G A CC T Louis du Plessis and Carsten Magnus

Upload: vukhanh

Post on 31-Dec-2016

223 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Bayesian evolutionary analysis by sampling trees

Engelberg-Titlis Tourismus

Computational Evolution

Taming the BEASTBayesian evolut ionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCTTCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

Louis du Plessis and Carsten Magnus

Page 2: Bayesian evolutionary analysis by sampling trees

Engelberg-Titlis Tourismus

Computational Evolution

Taming the BEASTBayesian evolut ionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCTTCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

1. The problem

Louis du Plessis and Carsten Magnus

Page 3: Bayesian evolutionary analysis by sampling trees

Engelberg-Titlis Tourismus

Computational Evolution

Taming the BEASTBayesian evolut ionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCTTCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

1. The problem

2. What goes into a BEAST model?

Louis du Plessis and Carsten Magnus

Page 4: Bayesian evolutionary analysis by sampling trees

Engelberg-Titlis Tourismus

Computational Evolution

Taming the BEASTBayesian evolut ionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCTTCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

1. The problem

2. What goes into a BEAST model?

3. MCMC inference

Louis du Plessis and Carsten Magnus

Page 5: Bayesian evolutionary analysis by sampling trees

Engelberg-Titlis Tourismus

Computational Evolution

Taming the BEASTBayesian evolut ionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCTTCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

1. The problem

2. What goes into a BEAST model?

3. MCMC inference

4. Bayesian evolutionary analysis by sampling trees

Louis du Plessis and Carsten Magnus

Page 6: Bayesian evolutionary analysis by sampling trees

2

Myself

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Re

01 Dec

2013

19 Ju

n 201

4

16 Aug

2014

12 O

ct 20

14

09 Dec

2014

04 Fe

b 201

5

(B) Liberia

41 se

quen

ces

137 s

eque

nces

156 s

eque

nces

159 s

eque

nces

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Re

01 Dec

2013

26 M

ar 20

14

12 Ju

n 201

4

28 Aug

2014

14 Nov

2014

31 Ja

n 201

5

(A) Guinea

47 se

quen

ces

83 se

quen

ces

201 s

eque

nces

236 s

eque

nces

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Re

01 Dec

2013

24 M

ay 20

14

10 Ju

l 201

4

25 Aug

2014

11 O

ct 20

14

26 Nov

2014

(C) Sierra Leone (Southeast)

117 s

eque

nces

185 s

eque

nces

232 s

eque

nces

246 s

eque

nces

Real-time phylodynamics!• Viral data sampled over the

course of an outbreak !

Model biases and limitations!• Simulated data

Page 7: Bayesian evolutionary analysis by sampling trees

3

Everyone else

• Evolution of HIV-1, tuberculosis, leprosy, Influenza etc. • Evolution of drug resistance • Host-parasite co-evolution • Disease spread by migration/travel • Species delimitation • Bird taxonomy • Plant diversity and dispersal • Adaptive radiations and diversification rates • Time calibration using fossils • Ancient DNA • Influence of environmental factors on phylogenetic diversity • Total evidence dating …etc

Page 8: Bayesian evolutionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

4

We all have one thing in common…

All of us use genomic sequencing data to answer questions in BEAST

?BEAST2

Page 9: Bayesian evolutionary analysis by sampling trees

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

4

We all have one thing in common…

All of us use genomic sequencing data to answer questions in BEAST

?BEAST2

Page 10: Bayesian evolutionary analysis by sampling trees

5

Bayesian inference recap

P(model | data) = P(data | model)P(model) P(data)

Page 11: Bayesian evolutionary analysis by sampling trees

5

Bayesian inference recap

Prior

Likelihood

Posterior

P(model | data) = P(data | model)P(model) P(data)

Page 12: Bayesian evolutionary analysis by sampling trees

5

Bayesian inference recap

Prior

Likelihood

Posterior

Marginallikelihood of the data

P(model | data) = P(data | model)P(model) P(data)

Page 13: Bayesian evolutionary analysis by sampling trees

5

Bayesian inference recap

Prior

Likelihood

Posterior

Marginallikelihood of the data

P(model | data) = P(data | model)P(model) P(data)

Data • One or more alignments of sequencing data • Sampled at one or many time points • Timescale of days to millions of years • Realisation of a stochastic process

Page 14: Bayesian evolutionary analysis by sampling trees

5

Bayesian inference recap

Prior

Likelihood

Posterior

Marginallikelihood of the data

P(model | data) = P(data | model)P(model) P(data)

Data • One or more alignments of sequencing data • Sampled at one or many time points • Timescale of days to millions of years • Realisation of a stochastic process

!Model!• Description of the process that generated the data • Parameters are random variables • We may only be interested in some of the parameters • Still need a prior for all the model parameters!

Page 15: Bayesian evolutionary analysis by sampling trees

6

What goes into a BEAST model?P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 16: Bayesian evolutionary analysis by sampling trees

6

What goes into a BEAST model?P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

BEAST2

Page 17: Bayesian evolutionary analysis by sampling trees

6

What goes into a BEAST model?P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

BEAST2

Page 18: Bayesian evolutionary analysis by sampling trees

6

What goes into a BEAST model?P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACACACCTACAGACTTACAGACCCTCACACCTACACACCCACAGACTT

ACAGACTTTCAGACTTTCAGACCCTCAGACTTTCACACCTTCAGACCT

TCACACCTACACACCCACAGACTTTCAGACTTTCACACCTTCAGACCT

BEAST2Estimate posterior distributions!

Page 19: Bayesian evolutionary analysis by sampling trees

7

Demographic model

• Description of the population dynamics • How does the population grow over time?

• Infectious diseases - transmission/recovery • Species - speciation/extinction

• Migration events • Different classes of individuals • Host/vector dynamics

…etc

Page 20: Bayesian evolutionary analysis by sampling trees

7

Demographic model

• Description of the population dynamics • How does the population grow over time?

• Infectious diseases - transmission/recovery • Species - speciation/extinction

• Migration events • Different classes of individuals • Host/vector dynamics

…etc

λ+ ➞➞μ

Page 21: Bayesian evolutionary analysis by sampling trees

7

Demographic model

• Description of the population dynamics • How does the population grow over time?

• Infectious diseases - transmission/recovery • Species - speciation/extinction

• Migration events • Different classes of individuals • Host/vector dynamics

…etc

λ+ ➞➞μ

genealogy

Page 22: Bayesian evolutionary analysis by sampling trees

8

Types of demographic models

t0 z1 z2y1 y2 y3

Birth-death ➞

←Coalescent

Page 23: Bayesian evolutionary analysis by sampling trees

8

Types of demographic models

t0 z1 z2y1 y2 y3

Birth-death ➞

←Coalescent

Coalescent • Effective population size • Conditions on sampling a

small random sample

!Birth-death models!• Speciation/infection rate • Extinction/recovery rate • Explicitly models sampling

Page 24: Bayesian evolutionary analysis by sampling trees

9

Different population dynamics generate different trees

Volz et al. PLoS Comput Biol 2013

Page 25: Bayesian evolutionary analysis by sampling trees

10

The genealogy• What are the ancestral relationships between the sequences

in our dataset? • Only the relationships between the sampled sequences!

genealogy

Page 26: Bayesian evolutionary analysis by sampling trees

10

The genealogy• What are the ancestral relationships between the sequences

in our dataset? • Only the relationships between the sampled sequences!

genealogy

0

5

10

15

20

25

30

num

ber i

nfec

ted

t x x x x xx x x x x0

5

10

15

20

25

30

num

ber r

emov

ed

t0 y1 y2 y3 y4 y5y6 y7 y8 y9 y10

Full tree!generated!by an SIR model

δλ IS R

Page 27: Bayesian evolutionary analysis by sampling trees

10

The genealogy• What are the ancestral relationships between the sequences

in our dataset? • Only the relationships between the sampled sequences!

genealogy

0

5

10

15

20

25

30

num

ber i

nfec

ted

t x x x x xx x x x x0

5

10

15

20

25

30

num

ber r

emov

ed

t0 y1 y2 y3 y4 y5y6 y7 y8 y9 y10

Full tree!generated!by an SIR model

Sampled tree

δλ IS R

Page 28: Bayesian evolutionary analysis by sampling trees

11

The genealogygenealogy

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT

Page 29: Bayesian evolutionary analysis by sampling trees

11

The genealogygenealogy

ACAGACTTT

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT

Page 30: Bayesian evolutionary analysis by sampling trees

11

The genealogygenealogy

ACAGACTTT

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT

AA

C G T

CGT

Page 31: Bayesian evolutionary analysis by sampling trees

12

Site model

• How does one sequence change into another? • Basic substitution model

• Relative rate of substitution from one nucleotide to another

• Or amino acids • Or codons

• Site variation • Invariant sites • Separate site model for each data partition

• Multiple genes/loci • Model 1st, 2nd, 3rd codon positions differently

Page 32: Bayesian evolutionary analysis by sampling trees

13

Site model

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT

Page 33: Bayesian evolutionary analysis by sampling trees

13

Site model

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT}?

Page 34: Bayesian evolutionary analysis by sampling trees

13

Site model

t0 z1 z2y1 y2 y3

TCAGACTTTTCACACCTA

TCAGACTTT}?

molecular clockmodel

Page 35: Bayesian evolutionary analysis by sampling trees

14

Molecular clock

• How long does it take for substitutions to appear? • Scales branch lengths to calendar time • Different branches may have different clock rates

molecular clockmodel

Page 36: Bayesian evolutionary analysis by sampling trees

15

Putting it all togetherP( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P(model | data) = P(data | model)P(model) P(data)

Page 37: Bayesian evolutionary analysis by sampling trees

15

Putting it all togetherP( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 38: Bayesian evolutionary analysis by sampling trees

15

Putting it all togetherP( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( )=P( | )P( )P( )P( )Assume independence

Page 39: Bayesian evolutionary analysis by sampling trees

16

Posterior distribution in BEAST2

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 40: Bayesian evolutionary analysis by sampling trees

16

Posterior distribution in BEAST2

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Priors

Likelihood

Posterior

Marginallikelihood of the data

Tree-prior

Page 41: Bayesian evolutionary analysis by sampling trees

16

Posterior distribution in BEAST2

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Priors

Likelihood

Posterior

Marginallikelihood of the data

Tree-prior

MCMC inference • We want to calculate the posterior • But we cannot easily calculate the

marginal likelihood → use MCMC!

!Markov-chain!• Stochastic process • Jumps between different states • Memoryless

!Monte Carlo algorithm!• Randomized algorithm • Deterministic runtime

(it will finish) • Output may not be correct

(with some small probability)

Page 42: Bayesian evolutionary analysis by sampling trees

17

MCMC inference

!• Markov-chain:

• Stochastic process

• MCMC performs a random walk on the posterior, preferentially sampling high-density areas

• Only need to compare which posterior density is higher

• So we only need the ratio of posteriors and the marginal likelihoods cancel out

Page 43: Bayesian evolutionary analysis by sampling trees

17

MCMC inference

!• Markov-chain:

• Stochastic process

• MCMC performs a random walk on the posterior, preferentially sampling high-density areas

• Only need to compare which posterior density is higher

• So we only need the ratio of posteriors and the marginal likelihoods cancel out

P(model1 | data) = P(model2 | data)

P(data | model1)P(model1) P(data)P(data | model2)P(model2) P(data)

Page 44: Bayesian evolutionary analysis by sampling trees

17

MCMC inference

!• Markov-chain:

• Stochastic process

• MCMC performs a random walk on the posterior, preferentially sampling high-density areas

• Only need to compare which posterior density is higher

• So we only need the ratio of posteriors and the marginal likelihoods cancel out

P(model1 | data) = P(model2 | data)

P(data | model1)P(model1) P(data)P(data | model2)P(model2) P(data)

Page 45: Bayesian evolutionary analysis by sampling trees

18

MCMC robot (courtesy of Paul Lewis)

Page 46: Bayesian evolutionary analysis by sampling trees

19

MCMC robot (courtesy of Paul Lewis)

(R is the ratio between the posterior densities)

Page 47: Bayesian evolutionary analysis by sampling trees

20

Pure random walk (courtesy of Paul Lewis)

Random walk!• Random direction • Gamma distributed step size • Reflection at edgesTarget distribution!• Equal mixture of 3 bivariate

normal hills • Inners contours: 50% • Outer contours: 95%

5000 steps by the robot

Page 48: Bayesian evolutionary analysis by sampling trees

21

Burn in (courtesy of Paul Lewis)

• Using MCMC rules • Quickly finds one of the 3 hills • First few steps are not

representative of the distribution

100 steps by the robot

Page 49: Bayesian evolutionary analysis by sampling trees

22

MCMC approximation (courtesy of Paul Lewis)

How good is the approximation?!• 51.2% of points inside 50%

contours • 93.6% of points inside 95%

contours !The more steps, the better the accuracy

5000 steps by the robot

Page 50: Bayesian evolutionary analysis by sampling trees

23

Summary: MCMC inference

Target distribution!• This is the posterior in BEAST2: !Proposal distribution!• Used to decide where to step to next • The choice only affects the efficiency of the algorithm • Metropolis algorithm: symmetric proposals • Metropolis-Hastings: asymmetric proposals

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 51: Bayesian evolutionary analysis by sampling trees

23

Summary: MCMC inference

Target distribution!• This is the posterior in BEAST2: !Proposal distribution!• Used to decide where to step to next • The choice only affects the efficiency of the algorithm • Metropolis algorithm: symmetric proposals • Metropolis-Hastings: asymmetric proposals

In BEAST and BEAST2 operators are used to propose the next step

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 52: Bayesian evolutionary analysis by sampling trees

24

Marginal distributions

• We only have the joint posterior:

• But we want distributions for each of the parameters we are interested in → marginalize

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P(φ) = ∫θP(φ|θ)P(θ)dθ

Page 53: Bayesian evolutionary analysis by sampling trees

24

Marginal distributions

• We only have the joint posterior:

• But we want distributions for each of the parameters we are interested in → marginalize

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

substitutionmodel

molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

P(φ) = ∫θP(φ|θ)P(θ)dθ

!In practice!!

!!!!!!!!!!

Page 54: Bayesian evolutionary analysis by sampling trees

25

One tree to rule them all?

• As with other parameters the data may support multiple trees equally well

• Thus we should look at all trees in order to take phylogenetic uncertainty into account

• If we are only interested in the epidemiological parameters we can integrate out the tree (but this is very difficult in an ML framework)

• Naturally integrate out the tree in an MCMC framework by inferring the distribution of trees

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Page 55: Bayesian evolutionary analysis by sampling trees

25

One tree to rule them all?

• As with other parameters the data may support multiple trees equally well

• Thus we should look at all trees in order to take phylogenetic uncertainty into account

• If we are only interested in the epidemiological parameters we can integrate out the tree (but this is very difficult in an ML framework)

• Naturally integrate out the tree in an MCMC framework by inferring the distribution of trees

P( | )=P( | )P( | )P( )P( )P( )P( )

geneticsequences

genealogy demographicmodel

site model molecular clockmodel

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

ACAC...TCAC...ACAG...

Can easily integrate out any nuisance parameters!!

!• What is a nuisance parameter depends on the question • In phylogenetics we specifically want the tree, but the substitution model is a

nuisance parameter

Page 56: Bayesian evolutionary analysis by sampling trees

26

What we hope will happen

Page 57: Bayesian evolutionary analysis by sampling trees

26

What we hope will happen

• The MCMC algorithm samples efficiently from high density areas of the posterior distribution

Page 58: Bayesian evolutionary analysis by sampling trees

26

What we hope will happen

• The MCMC algorithm samples efficiently from high density areas of the posterior distribution

• We end up with a good approximation of the posterior distribution in finite time

Page 59: Bayesian evolutionary analysis by sampling trees

26

What we hope will happen

• The MCMC algorithm samples efficiently from high density areas of the posterior distribution

• We end up with a good approximation of the posterior distribution in finite time

• Everything is awesome!

Page 60: Bayesian evolutionary analysis by sampling trees

26

What we hope will happen

• The MCMC algorithm samples efficiently from high density areas of the posterior distribution

• We end up with a good approximation of the posterior distribution in finite time

• Everything is awesome!

Mixing well!😄

Page 61: Bayesian evolutionary analysis by sampling trees

27

What often happens in practice…

Page 62: Bayesian evolutionary analysis by sampling trees

27

What often happens in practice…

• Not so much…

Page 63: Bayesian evolutionary analysis by sampling trees

27

What often happens in practice…

• Not so much…

Not mixing!😭

Page 64: Bayesian evolutionary analysis by sampling trees

27

What often happens in practice…

• Not so much…

Not mixing!😭

What went wrong?!• Recall that MCMC is a Monte Carlo

algorithm — it is not guaranteed to find the correct solution in finite time!

!How can we make it work better?!• Tweak the operators to make good

proposals (increase operator efficiency)

• Fine tune specifics about the MCMC run(sampling frequency, length etc.)

• Poor model choice? • Poor choice of parameterization?

Page 65: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

Page 66: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

Priors?

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

λ ∊[0,∞) μ ∊ [0,∞) ψ ∊ [0,∞)

Page 67: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

Priors?

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

λ ∊[0,∞) μ ∊ [0,∞) ψ ∊ [0,∞)

Infectious diseases!• R = λ/(μ+ψ) — Reproduction number

(is it spreading or not?) • δ = μ + ψ — Becoming noninfectious rate

(1/δ = infectious period) • p = ψ/(μ + ψ) — Sampling proportion !

Page 68: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

Priors?

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

λ ∊[0,∞) μ ∊ [0,∞) ψ ∊ [0,∞)

Infectious diseases!• R = λ/(μ+ψ) — Reproduction number

(is it spreading or not?) • δ = μ + ψ — Becoming noninfectious rate

(1/δ = infectious period) • p = ψ/(μ + ψ) — Sampling proportion !

R ∊[0,∞) δ ∊ [0,∞) p ∊ [0,1]

More natural priors!

Page 69: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

Priors?

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

λ ∊[0,∞) μ ∊ [0,∞) ψ ∊ [0,∞)

Infectious diseases!• R = λ/(μ+ψ) — Reproduction number

(is it spreading or not?) • δ = μ + ψ — Becoming noninfectious rate

(1/δ = infectious period) • p = ψ/(μ + ψ) — Sampling proportion !

R ∊[0,∞) δ ∊ [0,∞) p ∊ [0,1]

More natural priors!

Speciation!• λ - μ — Growth rate • μ/λ — Relative death rate • p = ψ/(μ + ψ) — Sampling proportion !

Page 70: Bayesian evolutionary analysis by sampling trees

28

Model Parameterization: Birth-death process

Priors?

μλ I

ψ

• λ — Birth-rate (speciation / transmission) • μ — Death-rate (extinction / death) • ψ — Sampling rate

λ ∊[0,∞) μ ∊ [0,∞) ψ ∊ [0,∞)

Infectious diseases!• R = λ/(μ+ψ) — Reproduction number

(is it spreading or not?) • δ = μ + ψ — Becoming noninfectious rate

(1/δ = infectious period) • p = ψ/(μ + ψ) — Sampling proportion !

R ∊[0,∞) δ ∊ [0,∞) p ∊ [0,1]

More natural priors!

Speciation!• λ - μ — Growth rate • μ/λ — Relative death rate • p = ψ/(μ + ψ) — Sampling proportion !

(λ - μ) ∊[0,∞) μ/λ ∊ [0,1) p ∊ [0,1]

Smaller parameter space!

Page 71: Bayesian evolutionary analysis by sampling trees

29

Summary• Model consists of a demographic model, a site model and a molecular clock

model

• Combine priors for model parameters with the likelihood to get the posterior

• Using MCMC we can efficiently approximate the joint posterior of the model

• By marginalising we get posterior distributions of parameters of interest while integrating out uncertainty in nuisance parameters

• Need a good set of operators to propose new steps

• Efficient inference may need some fine-tuning! (or it may not converge to the posterior distribution in a reasonable time)