Bayesian Phylogenetic Analysis · hpc.ilri.cgiar.org/beca/training/AdvancedBFX2015/Fredrick_Nindo...

Bayesian Phylogenetic Analysis. Fredrick Nindo, Computational Biology Group, University of Cape Town

Upload: phungkhanh

Post on 25-Mar-2018

Page 1: Bayesian Phylogenetic Analysishpc.ilri.cgiar.org/beca/training/AdvancedBFX2015/Fredrick_Nindo... · Outline ! Introduction to Bayesian phylogenetic analysis ! Markov chain Monte Carlo

Bayesian Phylogenetic Analysis

Fredrick Nindo

Computational Biology Group, University of Cape Town

Outline

• Introduction to Bayesian phylogenetic analysis

• Markov chain Monte Carlo sampling

• Comparison of MrBayes and BEAST

• Strict molecular clocks and relaxed molecular clocks

• Calibrating estimates of rates and divergence times

History of Bayesian theory

• The Reverend Thomas Bayes was born in London in 1702

• He was the son of one of the first Nonconformist ministers to be ordained in England

• He became a Presbyterian minister in the late 1720s, but was well known for his studies of mathematics

• He was elected a Fellow of the Royal Society of London in 1742

• He died in 1761 before his works were published

Bayes Theorem

• Bayes’ Theorem explains how to calculate inverse probabilities

• For example, suppose that boxes contain colored balls as shown below

• Given a box, a ball is chosen uniformly at random

• For example, if a ball is chosen from Box B1, there is a 3/4 chance that it is red

• The inverse problem asks: if a red ball is drawn, how likely is it that it came from Box B1?

[Figure: three boxes, B1, B2 and B3, each holding four colored balls; three of the four balls in B1 are red]

Bayesian Priors

• If a red ball is drawn, how likely is it that it came from Box B1?

• To answer this question, we need a prior distribution for the selection of the box

• The answer will be different if we believe a priori that Box B1 is 10% likely to be the chosen box than if we believe that all three boxes are equally likely

[Figure: boxes B1, B2 and B3, each holding four colored balls]

Bayes formulation

• Bayes’ Theorem states that if the events B1, B2, ... form a complete list of mutually exclusive events with prior probabilities Pr(B1), Pr(B2), ..., and if the likelihood of the event A given event Bi is Pr(A|Bi) for each i, then

$$\Pr(B_i \mid A) = \frac{\Pr(A \mid B_i)\,\Pr(B_i)}{\sum_j \Pr(A \mid B_j)\,\Pr(B_j)}$$

• The posterior probability of Bi given A, written Pr(Bi|A), is proportional to the product of the likelihood Pr(A|Bi) and the prior probability Pr(Bi), where the normalizing constant

$$\Pr(A) = \sum_j \Pr(A \mid B_j)\,\Pr(B_j)$$

is the prior probability of A
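The box example can be worked through numerically. The sketch below applies the formula above with exact fractions. Only the 3/4 chance of red for B1 comes from the slides; the red fractions for B2 and B3 and the non-uniform prior (10% on B1, the rest split evenly) are assumptions made for illustration, and the function name `posterior` is hypothetical.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return Pr(B_i | A) for every box using Bayes' theorem."""
    joint = {box: priors[box] * likelihoods[box] for box in priors}
    evidence = sum(joint.values())  # Pr(A), the normalizing constant
    return {box: value / evidence for box, value in joint.items()}

# Pr(red | box). Only the 3/4 for B1 is from the slides;
# the values for B2 and B3 are assumed for illustration.
p_red = {"B1": Fraction(3, 4), "B2": Fraction(1, 2), "B3": Fraction(1, 4)}

# Two prior beliefs about which box was chosen.
uniform_prior = {"B1": Fraction(1, 3), "B2": Fraction(1, 3), "B3": Fraction(1, 3)}
skeptical_prior = {"B1": Fraction(1, 10), "B2": Fraction(9, 20), "B3": Fraction(9, 20)}

post_uniform = posterior(uniform_prior, p_red)
post_skeptical = posterior(skeptical_prior, p_red)

print(post_uniform["B1"])    # 1/2
print(post_skeptical["B1"])  # 2/11
```

Under these assumed box contents, the same red ball gives B1 a posterior probability of 1/2 with the uniform prior but only 2/11 with the 10% prior, which is the dependence on the prior that the previous slide describes.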

Bayesian theory in Phylogenetics

• In a Bayesian approach to phylogenetics, the boxes are like different tree topologies

• The colored balls are like site patterns, except:
  • there are many more than two colors; and
  • we observe multiple draws from each box

• Additional parameters such as branch lengths and substitution model parameters affect the likelihood, are unknown, and add to the complexity

• A prior distribution is a probability distribution on parameters before any data is observed

• A posterior distribution is a probability distribution on parameters after data is observed

Bayes Rule

• H represents a specific hypothesis

• Pr(H) is called the prior probability of H: the probability assumed before new data, D, became available

• Pr(D|H) is called the conditional probability of seeing D if H is true. It is also called a likelihood function when it is considered as a function of H for fixed D

• Pr(D) is called the marginal probability of D: the a priori probability of observing D under all possible hypotheses. For any complete set of mutually exclusive hypotheses Hi, it is the sum of the products of each hypothesis's prior probability and the corresponding conditional probability: Pr(D) = Σi Pr(D|Hi) Pr(Hi)

• Pr(H|D) is called the posterior probability of H given D
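Combining the terms defined on this slide gives Bayes' rule in its usual form. The display below is the standard statement written with the slide's symbols (H, D, Hi); it is offered as a reading aid and is not transcribed from the original slide.

```latex
\Pr(H \mid D) = \frac{\Pr(D \mid H)\,\Pr(H)}{\Pr(D)},
\qquad
\Pr(D) = \sum_i \Pr(D \mid H_i)\,\Pr(H_i)
```

This has the same structure as the box formula, with the hypothesis H in place of the box Bi and the data D in place of the event A.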

Markov Chain Monte Carlo (MCMC)

• Markov chain Monte Carlo (MCMC) is a method to take (dependent) samples from a distribution

• The distribution need only be known up to a constant of proportionality

• MCMC is especially useful for computation of Bayesian posterior probabilities

• Simple summary statistics from the sample converge to posterior probabilities

• Metropolis-Hastings is a form of MCMC that can use any Markov chain to propose the next item to sample, rejecting proposals with a specified probability

MCMC algorithm

• We want to make inferences on the basis of a posterior distribution p(θ|x)

• We cannot calculate desired quantities analytically, so instead we wish to sample from p(θ|x) and use sample statistics as estimates for the true posterior values; for example, a sample mean is an estimate of an expected value

• But we also may not be able to take a simple random sample of values from the posterior distribution

• A computational method called Markov chain Monte Carlo has proven to be remarkably successful for obtaining dependent samples from probability distributions

• If this is done carefully, sample statistics will converge to the desired posterior values

MCMC tree search strategy

• Markov chain Monte Carlo (MCMC) takes (dependent) samples from a distribution

• The distribution need only be known up to a constant of proportionality, as the algorithm depends only on ratios

• A proposal method is needed that describes a probability distribution for proposing new parameter values given current ones

• In theory, just about any proposal distribution is correct (given an infinite sample size); the art is in designing (and correctly implementing) a method so that feasible sample sizes are adequate

• If a proposal is not accepted, the current value is sampled again

MCMC proposal steps

1. Start at θ_0; set i = 0

2. Propose θ* from the current θ_i

3. Calculate the acceptance probability

4. Generate a random number

5. If accepted, set θ_{i+1} = θ*; if rejected, set θ_{i+1} = θ_i

6. Increment i to i + 1

7. Repeat steps 2 through 6 many times
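The steps above can be sketched as a short random-walk sampler. This is a minimal illustration, not the implementation used by MrBayes or BEAST. It assumes a symmetric proposal, so the acceptance probability reduces to min(1, p(θ*|x) / p(θ_i|x)) and the Hastings correction is omitted. The target (a standard normal known only up to its constant), the step size, and the names `unnormalized_posterior` and `metropolis` are all assumptions chosen for illustration.

```python
import math
import random

def unnormalized_posterior(theta):
    """Target known only up to a constant: a standard normal without 1/sqrt(2*pi)."""
    return math.exp(-0.5 * theta * theta)

def metropolis(n_steps, theta0=0.0, step=1.0, seed=1):
    rng = random.Random(seed)
    chain = [theta0]  # start at theta_0
    for _ in range(n_steps):
        current = chain[-1]
        # Propose a new value from a symmetric window around the current one.
        proposal = current + rng.uniform(-step, step)
        # The acceptance probability depends only on a ratio of target values,
        # so the unknown normalizing constant cancels.
        ratio = unnormalized_posterior(proposal) / unnormalized_posterior(current)
        accept_prob = min(1.0, ratio)
        # Draw a random number; accept the proposal or sample the current value again.
        if rng.random() < accept_prob:
            chain.append(proposal)
        else:
            chain.append(current)
    return chain

chain = metropolis(20000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(round(mean, 2), round(var, 2))
```

For this assumed target the sample mean and variance should approach 0 and 1 as the chain grows, even though the samples are dependent and the density was never normalized.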

MCMC sampling

• The parameter space includes:
  • The tree topology
  • The branch lengths
  • Substitution model parameters

• In practice, we use several MCMC proposals that leave some parameters fixed while changing others

• The result of an MCMC analysis is a sample from the posterior distribution

• Sample statistics are estimates of the corresponding posterior quantities:
  • The sample proportion of a given tree topology converges to the posterior probability of that tree topology
  • The proportion of trees with a given clade converges to the posterior probability of that clade
  • The endpoints of the middle 95% of the sample for the transition/transversion bias κ form an interval estimate for κ
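The interval estimate for κ in the last bullet amounts to sorting the sampled values and reading off the 2.5% and 97.5% points. The sketch below shows one simple way to do that. The κ values are synthetic draws from a gamma distribution used as a stand-in for real MCMC output, and the function name `central_interval` is hypothetical.

```python
import random

def central_interval(samples, mass=0.95):
    """Return the endpoints of the middle `mass` fraction of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    lower = ordered[int(tail * n)]
    upper = ordered[min(n - 1, int((1.0 - tail) * n))]
    return lower, upper

# Synthetic stand-in for sampled kappa values; not real MCMC output.
rng = random.Random(0)
kappa_samples = [rng.gammavariate(20.0, 0.1) for _ in range(5000)]

low, high = central_interval(kappa_samples)
# Fraction of sampled values that fall inside the reported interval.
inside = sum(low <= k <= high for k in kappa_samples) / len(kappa_samples)
print(low, high, inside)
```

Other quantile conventions (for example interpolating between neighbouring order statistics) give slightly different endpoints; the choice here is only the simplest one.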

Summarizing MCMC results

• Trees are logged to a .trees file and continuous parameters to a .log file for each sample

• The proportion of a given tree topology (after burn-in) in these logs is an approximation of the posterior probability of that tree topology

• Trees in the file can be analyzed for tree partitions, from which a consensus tree can be made. The proportion of trees containing a given partition approximates the posterior probability of that partition
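The burn-in and partition-counting steps can be sketched as follows. This is a simplified illustration that treats each sampled tree as the set of its clades (rooted trees) rather than parsing a real .trees file. The toy sample, the 10% burn-in fraction, and the names `clade_support` and `clade` are assumptions made for illustration.

```python
from collections import Counter

def clade_support(sampled_trees, burn_in_fraction=0.1):
    """Fraction of post-burn-in trees that contain each clade."""
    start = int(len(sampled_trees) * burn_in_fraction)
    kept = sampled_trees[start:]  # discard the burn-in samples
    counts = Counter(c for tree in kept for c in tree)
    return {c: count / len(kept) for c, count in counts.items()}

def clade(*taxa):
    """A clade is represented by the set of taxa it contains."""
    return frozenset(taxa)

# Toy sample, made up for illustration: each tree is the set of its clades.
tree_x = {clade("A", "B"), clade("A", "B", "C")}  # ((A,B),C),D
tree_y = {clade("A", "B"), clade("C", "D")}       # (A,B),(C,D)
tree_z = {clade("B", "C"), clade("A", "B", "C")}  # ((B,C),A),D
sampled_trees = [tree_z] * 10 + [tree_x] * 60 + [tree_y] * 30

support = clade_support(sampled_trees, burn_in_fraction=0.1)
# Clades found in more than half of the kept trees would appear
# in a majority-rule consensus tree.
majority = {c for c, p in support.items() if p > 0.5}
for c, p in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(c), round(p, 3))
```

In this toy sample the burn-in removes the ten early trees, so clade (B,C) never enters the summary and the consensus keeps (A,B) and (A,B,C).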

MCMC output

• A consensus tree from an MCMC sample is simply a summary of the posterior distribution of the topology

• Other summaries are possible

• This consensus tree is not an optimal tree according to a criterion such as maximum likelihood or parsimony

MCMC Diagnostics

• Measure the behavior of a single chain

• Visually inspect output traces in Tracer

• Measure autocorrelation within a chain: the effective sample size (ESS)

[Figure: example trace with ESS = 1689]
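The link between autocorrelation and ESS can be illustrated with a simple estimator: ESS = N / (1 + 2 Σ ρ_k), where ρ_k is the lag-k autocorrelation. The sketch below truncates the sum at the first non-positive autocorrelation. This is a common textbook simplification and is not claimed to match the estimator Tracer uses. Both test series are synthetic, and the function names are hypothetical.

```python
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(x, max_lag=200):
    """N / (1 + 2 * sum of autocorrelations), stopping at the first non-positive lag."""
    rho_sum = 0.0
    for lag in range(1, min(max_lag, len(x) - 1) + 1):
        rho = autocorrelation(x, lag)
        if rho <= 0.0:
            break
        rho_sum += rho
    return len(x) / (1.0 + 2.0 * rho_sum)

rng = random.Random(42)

# Independent draws: ESS should be close to the number of samples.
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Strongly autocorrelated series, standing in for a slowly mixing chain.
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.9 * sticky[-1] + rng.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent)
ess_sticky = effective_sample_size(sticky)
print(round(ess_independent), round(ess_sticky))
```

Both series have 2000 values, but the autocorrelated one carries far fewer effectively independent samples, which is why ESS rather than raw chain length is the quantity to check.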

Limitations of MCMC sampling

• MCMC sometimes fails to converge

• Always run several chains with different random number seeds and compare the results

• If the true tree has some very short internal edges, Bayesian inference can mislead

• Different likelihood models can lead to different results
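One simple way to compare independent chains is to check whether they assign similar frequencies to the same clades. The sketch below reports the largest disagreement between two runs. It is an illustrative check only, not the diagnostic reported by any particular program; the toy runs and the names `clade_frequencies` and `max_frequency_gap` are assumptions.

```python
from collections import Counter

def clade_frequencies(trees):
    """Map each clade to the fraction of sampled trees containing it."""
    counts = Counter(c for tree in trees for c in tree)
    return {c: count / len(trees) for c, count in counts.items()}

def max_frequency_gap(run_a, run_b):
    """Largest absolute difference in clade frequency between two runs."""
    freq_a = clade_frequencies(run_a)
    freq_b = clade_frequencies(run_b)
    all_clades = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in all_clades)

# Toy trees, made up for illustration: each tree is the set of its clades.
tree_1 = {frozenset("AB"), frozenset("ABC")}
tree_2 = {frozenset("AB"), frozenset("CD")}

run_1 = [tree_1] * 70 + [tree_2] * 30
run_2 = [tree_1] * 68 + [tree_2] * 32  # agrees closely with run_1
run_3 = [tree_1] * 10 + [tree_2] * 90  # sampled a very different distribution

gap_similar = max_frequency_gap(run_1, run_2)
gap_different = max_frequency_gap(run_1, run_3)
print(round(gap_similar, 3), round(gap_different, 3))
```

A small gap is consistent with the runs having converged on the same distribution; a large gap, as between `run_1` and `run_3`, is a sign that at least one chain has not converged.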

Bayesian Phylogenetic Tree Reconstruction

• A search for a set of plausible trees (weighted by their probability) instead of a single best tree

• The “space” that you search in is limited by prior information and the data

• The posterior distribution of trees can be translated to a probability of any branching event
  • This allows an estimate of uncertainty!

• BUT it incorporates prior beliefs

Bayesian phylogenetics: the one ‘true’ tree?

• Maximum likelihood, maximum parsimony, and distance methods try to get a single tree that best describes the data

• However, their searches are heuristic (they do not look everywhere), and it is difficult to find the “best” tree

• Are we doing a good job by reporting a single tree?

Maximum likelihood vs Bayesian approaches

Maximum likelihood

• Probability: only defined in the context of long-run relative frequencies

• Parameters: fixed and unknown

• Nuisance parameters: optimize them

• Testing: p-values

• Nature of the method: objective

Bayesian

• Probability: describes everything that is uncertain

• Parameters: random

• Nuisance parameters: average over them

• Testing: Bayes’ factors

• Nature of the method: subjective

Advantages of Bayesian Inference of Phylogeny

• We have almost no prior knowledge for the parameters of interest... So why bother doing Bayesian inference?
  • A Bayesian analysis expresses its results as the probability of the hypothesis given the data
  • MCMC is a stochastic algorithm and thus is able to avoid getting stuck in a locally optimal but globally suboptimal solution
  • By sampling a set of plausible trees, MCMC allows estimation of the uncertainty of any branching event

Bayesian Inference of Phylogeny

• Development of Bayesian methods has led to continual improvement in our ability to model and learn about molecular evolution

• Bayesian inference uses likelihood, but requires a prior distribution

• Bayesian inference is computationally intensive, but can be less so than ML plus bootstrapping

• Bayesian inference directly measures items of interest on an easily interpretable probability scale

• Some folks dislike the requirement of specifying a prior distribution

Bayesian analysis software

• A number of computer programs have been developed that implement Bayesian phylogenetic analysis:
  • MrBayes (http://mrbayes.sourceforge.net/), includes BEST (http://www.stat.osu.edu/~dkp/BEST/help/manual_BEST2.3.pdf)
  • BEAST (Bayesian Evolutionary Analysis Sampling Trees) (http://beast.bio.ed.ac.uk/)
  • Coevol (Lartillot and Poujol, 2011)
  • PhyloBayes (Lartillot et al., 2009)
  • RevBayes (http://revbayes.com)

MrBayes

• A Bayesian inference program for phylogenetic inference and model selection

• Developed by John Huelsenbeck and Fredrik Ronquist

• Assisted by Maxim Teslenko, Paul van der Mark, Daniel Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc Suchard

MrBayes

• Phylogenetic inference under a wide range of models

• Two output files:
  • Parameter (.p) files
  • Tree (.t) files

• Unrooted trees
  • Joint estimation of topology, branch lengths, and model parameters

• Rooted, time-calibrated trees
  • Joint estimation of topology, branch rates, branch times, and model parameters, and gene-tree/species-tree inference in BEST

• Data types
  • Discrete characters: binary (0,1) or multi-state (0,1,...,9)
  • DNA: 4-state nucleotide, doublet, or codons
  • Amino acid

Comparison of MrBayes and BEAST

Method/Model/Feature              MrBayes   BEAST
Unrooted trees                    ✓         ✗
Joint est. topology & times       ✓         ✓
Gene-tree/species-tree            ✓         ✓
Dataset partitioning              ✓         ✓
Bayes factors                     ✓         ✓
Morphological data                ✓         ✓
Demography/phylogeography         ✗         ✓
Continuous traits                 ✗         ✓
Graphical user interface (GUI)    ✗         ✓
Final tree summarisation          .con      .mcc

Acknowledgements

Assoc Prof Darren Martin

Dr. Aderito Louis Monjane

Brejnev Muhire

Rebone Meraba

UCT CBIO Crew