
Page 1: Tree building 2

Lecture 4: Tree Building 2

Phylogenetics and Sequence Analysis

Fall 2015

Kurt Wollenberg, PhD
Phylogenetics Specialist
Bioinformatics and Computational Biosciences Branch
Office of Cyber Infrastructure and Computational Biology

Page 2: Tree building 2

Course Organization

• Building a clean sequence
• Collecting homologs
• Aligning your sequences
• Building trees
• Further analysis

Page 3: Tree building 2

Tree building, so far

• Generate a distance tree

• Generate maximum parsimony tree(s)

• Calculate bootstrap support

Page 4: Tree building 2

What’s next?

Statistical Methods for Calculating Trees

•Maximum Likelihood

•Bayesian Phylogenetics

Page 5: Tree building 2

Optimality Criterion: Likelihood

Calculating likelihood

The Data:
S1: AAG
S2: AAA
S3: GGA
S4: AGA

The Tree: [four-taxon tree figure]

L(Tree) = Prob(Data | Tree) = ∏_i Prob(Data(i) | Tree)
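As a concrete illustration of this product over sites, here is a minimal sketch (not from the slides) that computes L(Tree) for the four 3-bp sequences above under the Jukes-Cantor model using Felsenstein's pruning algorithm. The tree shape ((S1,S2),(S3,S4)) and the branch length of 0.1 are illustrative assumptions only.

import numpy as np

STATES = "ACGT"

def jc_matrix(t):
    # Jukes-Cantor transition probabilities for a branch of length t
    # (t in expected substitutions per site).
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def leaf_vector(base):
    # Conditional likelihoods at a tip: 1 for the observed base, 0 otherwise.
    v = np.zeros(4)
    v[STATES.index(base)] = 1.0
    return v

def combine(children):
    # Conditional likelihoods at an internal node, given (vector, branch length)
    # pairs for its children.
    out = np.ones(4)
    for vec, t in children:
        out *= jc_matrix(t) @ vec
    return out

def site_likelihood(column, t=0.1):
    # Likelihood of one alignment column on the assumed tree ((S1,S2),(S3,S4)),
    # with every branch length set to t for simplicity.
    s1, s2, s3, s4 = (leaf_vector(b) for b in column)
    left = combine([(s1, t), (s2, t)])
    right = combine([(s3, t), (s4, t)])
    root = combine([(left, t), (right, t)])
    return float(np.dot(np.full(4, 0.25), root))  # uniform base frequencies at the root

data = ["AAG", "AAA", "GGA", "AGA"]                 # S1..S4 from the slide
columns = ["".join(col) for col in zip(*data)]      # per-site alignment columns
L = np.prod([site_likelihood(c) for c in columns])  # L(Tree) = product over sites
print(L)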

Page 6: Tree building 2

Calculating likelihood: Setting parameters

What values do you use for the substitution model?

Run jModelTest (or ProtTest for protein MSAs)

L(Tree) = Prob(Data | Tree) = ∏_i Prob(Data(i) | Tree)

Optimality Criterion: Likelihood
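jModelTest ranks candidate substitution models by information criteria such as AIC and BIC. A minimal sketch of that arithmetic follows; the log-likelihoods are made-up illustrations, not real jModelTest output, and only the free substitution-model parameters are counted here.

import math

def aic(lnL, k):
    # Akaike Information Criterion: 2k - 2 lnL (lower is better)
    return 2 * k - 2 * lnL

def bic(lnL, k, n):
    # Bayesian Information Criterion: k ln(n) - 2 lnL, with n = number of sites
    return k * math.log(n) - 2 * lnL

# Illustrative values only: (log-likelihood, free substitution-model parameters)
candidates = {"JC69": (-5230.4, 0), "HKY85": (-5101.7, 4), "GTR+G": (-5050.2, 9)}
n_sites = 1200
for name, (lnL, k) in candidates.items():
    print(name, "AIC =", round(aic(lnL, k), 1), "BIC =", round(bic(lnL, k, n_sites), 1))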

Page 7: Tree building 2

Calculating likelihood: jModelTest

Optimality Criterion: Likelihood

Page 8: Tree building 2

Calculating likelihood: jModelTest

Optimality Criterion: Likelihood

Page 9: Tree building 2

Calculating likelihood: jModelTest Results

Optimality Criterion: Likelihood

Page 10: Tree building 2

Substitution models http://www.molecularevolution.org/resources/models/nucleotide

Calculating likelihood: jModelTest Results

Optimality Criterion: Likelihood

Page 11: Tree building 2

Substitution models http://www.molecularevolution.org/resources/models/nucleotide

Calculating likelihood: jModelTest Results

Optimality Criterion: Likelihood

Page 12: Tree building 2

The Gamma Distribution

Mean = kθ (shape parameter k, scale parameter θ); Coefficient of variation = 1/√k
In models of among-site rate variation the mean rate is fixed at 1, so the shape parameter (often written α) alone controls how much rates vary across sites.
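Programs such as GARLi and MrBayes use a discretized gamma (the "+G" in model names) with a small number of equally probable rate categories. A minimal sketch of that discretization, assuming the usual mean-1 parameterization and using category medians (scipy is assumed to be available):

from scipy.stats import gamma

def discrete_gamma_rates(alpha, ncat=4):
    # Median rate of each of ncat equal-probability categories of a
    # Gamma(shape=alpha, scale=1/alpha) distribution, whose mean rate is 1.
    quantiles = [(2 * i + 1) / (2 * ncat) for i in range(ncat)]  # category midpoints
    rates = gamma.ppf(quantiles, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()  # rescale so the category rates average exactly 1

print(discrete_gamma_rates(alpha=0.5))  # small alpha = strong rate variation among sites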

Page 13: Tree building 2

Calculating likelihood: Programs

PAUP* – Commercial, NIH Biowulf, or NIAID HPC

DNA only

PHYLIP – Download, NIH Biowulf, or NIAID HPC

dnaml and proml programs

PAML – Download, NIH Biowulf, or NIAID HPC

RAxML – Download or NIH Biowulf or webserver

PhyML – Download or NIAID HPC or webserver

GARLi – Download or NIAID HPC or webserver

MEGA6 - Download

Generally, the user has more flexibility with a local program, but local programs can hog your computer.

Page 14: Tree building 2

Calculating likelihood: Programs

Input data format: PHYLIP or Nexus, SEQUENTIAL

Output files:
filename.best.tree
filename.best.all.tree
filename.*.log – intermediate run output

GARLi – Genetic Algorithm for Rapid Likelihood Inference

Page 15: Tree building 2

Input data format: PHYLIP or Nexus – SEQUENTIAL, not INTERLEAVED

and now FASTA (see the example below)
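For reference, a sequential PHYLIP file gives each taxon's complete sequence after its name rather than breaking the alignment into blocks. A made-up four-taxon, twelve-site example:

 4 12
S1          ACGTACGTACGT
S2          ACGTACGAACGT
S3          ACGAACGTACGA
S4          ACGTACGTACGA

An interleaved file would instead list all four taxa for the first block of columns, then all four again for the next block, which GARLi will not accept.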

Calculating likelihood: GARLi

Page 16: Tree building 2

Specifying input parameters: garli.conf

www.nescent.org/wg_garli/GARLI_Configuration_Settings
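As a rough sketch only, a garli.conf for a nucleotide GTR+I+G search contains entries along these lines (option names taken from the GARLI configuration documentation linked above; values and filenames are illustrative, and the template shipped with your GARLI version includes many more settings):

[general]
datafname = myalignment.phy
ofprefix = myrun
searchreps = 2
bootstrapreps = 0
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = estimate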

Calculating likelihood: GARLi

Page 17: Tree building 2

Running the program

Calculating likelihood: GARLi

Page 18: Tree building 2

Running the program

Calculating likelihood: GARLi

Page 19: Tree building 2

Running the program

Calculating likelihood: GARLi

Page 20: Tree building 2

The Tree

Calculating likelihood: GARLi

Page 21: Tree building 2

Bootstrap results? GARLi does not calculate a bootstrap consensus! You must use another piece of software for this.

DendroPy: sumtrees – a very efficient Python package for summarizing collections of phylogenetic trees

DendroPy is freely distributed from: https://pythonhosted.org/DendroPy

sumtrees has also been implemented on the NIAID HPC
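If you prefer to script the summary yourself, the DendroPy library can also build the consensus directly. A minimal sketch (DendroPy 4-style API; the input file name is hypothetical) that turns a file of GARLi bootstrap trees into a 50% majority-rule consensus:

import dendropy

# Read the bootstrap replicate trees written by GARLi (file name is hypothetical).
boot_trees = dendropy.TreeList.get(path="myrun.boot.tre", schema="newick")

# 50% majority-rule consensus; clade frequencies become the support values.
consensus = boot_trees.consensus(min_freq=0.5)
print(consensus.as_string(schema="newick"))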

Calculating likelihood: GARLi

Page 22: Tree building 2

Calculating the posterior probability of the evolutionary parameters

Pr(τ, v, θ | D) = [Pr(D | τ, v, θ) × Pr(τ, v, θ)] / Pr(D)

where:
τ = tree topology
v = branch lengths
θ = substitution parameters
D = the data

Bayesian Analysis

Page 23: Tree building 2

What is Bayesian Analysis?

Calculation of the probability of parameters (tree, substitution model) given the data (sequence alignment)

p(θ | D) = (Likelihood × Prior) / Probability of the data

p(θ | D) = p(D | θ) p(θ) / p(D)

Page 24: Tree building 2

Exploring the posterior probability distribution

Posterior probabilities of trees and parameters are approximated using Markov Chain Monte Carlo (MCMC) sampling.

Markov Chain: A statement of the probability of moving from one state to another

Bayesian Analysis

Page 25: Tree building 2

What is MCMC?

Markov chain Monte Carlo

[Diagram: panels labeled "One link in the chain" and "Choosing a link"]

Page 26: Tree building 2

Markov Chain example: Jukes-Cantor, μt = 0.25

      A       C       G       T
A   0.5259  0.158   0.158   0.158
C   0.158   0.5259  0.158   0.158
G   0.158   0.158   0.5259  0.158
T   0.158   0.158   0.158   0.5259
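A quick check that the matrix above follows from the Jukes-Cantor transition probabilities, assuming the parameterization P(same) = 1/4 + (3/4)exp(-4μt) and P(different) = 1/4 - (1/4)exp(-4μt), which is consistent with the values shown:

import math

mut = 0.25  # the value of mu*t used on the slide
p_same = 0.25 + 0.75 * math.exp(-4 * mut)  # probability a base stays the same
p_diff = 0.25 - 0.25 * math.exp(-4 * mut)  # probability of each particular change
print(round(p_same, 4), round(p_diff, 4))  # 0.5259 0.158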

Bayesian Analysis

Page 27: Tree building 2

The posterior probability of a specific tree is approximated by the proportion of MCMC samples in which the Markov chain visits that tree.

Posterior probability distribution is summarized by the clade probabilities.

Exploring the posterior probability distribution

Bayesian Analysis

Page 28: Tree building 2

Using MrBayes
• Input format = Nexus
• Choose a substitution model (jModelTest)
• Check for convergence

Using BEAST
• Input format = XML (made using the BEAUti program)
• Choose a substitution model (jModelTest)
• Check for convergence (using the Tracer program)

Bayesian Analysis

Page 29: Tree building 2

Running MrBayes: Model parameters

MrBayes> lset nst=6 rates=invgamma
MrBayes> showmodel
MrBayes> mcmc ngen=20000 samplefreq=100 printfreq=100 diagnfreq=100 burninfrac=0.25

Bayesian Analysis

Page 30: Tree building 2

Running MrBayes: Setting the Priors

• Generally, the default priors work well
• These are known as "uninformative" priors
• For implementing the Jukes-Cantor model, change statefreqpr to "fixed", for example:
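A Jukes-Cantor setup at the MrBayes prompt looks like this (standard MrBayes 3.2 syntax; nst=1 gives a single substitution rate, and the fixed(equal) prior fixes the base frequencies at 1/4 each):

MrBayes> lset nst=1 rates=equal
MrBayes> prset statefreqpr=fixed(equal)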

Bayesian Analysis

Page 31: Tree building 2

Amino acid substitution models

• Poisson - equal rates, equal state frequencies
• Blosum62
• Dayhoff
• Mtrev, Mtmam - mitochondrial models
• mixed - let MrBayes choose among the many fixed-rate models (see the prset commands below)
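In MrBayes the fixed-rate protein models are chosen through the aamodelpr prior, for example (standard MrBayes syntax):

MrBayes> prset aamodelpr=fixed(poisson)
MrBayes> prset aamodelpr=mixed

With the mixed setting the MCMC jumps among the fixed-rate models and reports their posterior probabilities.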

Running MrBayes: Setting the Priors

Bayesian Analysis

Page 32: Tree building 2

Running MrBayes: General

• burnin - initial portion of the run to discard
   • Generally, 25% of the samples
• samplefreq - how often to sample the Markov chain
   • More frequently for small analyses
   • Less frequently for low-complexity data
• printfreq - how often output is sent to the log file(s)

Bayesian Analysis

Page 33: Tree building 2

Running MrBayes: General

#NEXUS
begin mrbayes;
  set autoclose=yes nowarn=yes;
  execute /pathtodata/InputData.nex;
  lset nst=6 rates=invgamma;
  mcmc stoprule=yes stopval=0.009;
end;

Bayesian Analysis

Page 34: Tree building 2

Running MrBayes: General
Burn-in

Bayesian Analysis

Page 35: Tree building 2

Running MrBayes: Summarizing results

MrBayes> sump (burninfrac=0.25)

MrBayes> sumt (burninfrac=0.25)

Bayesian Analysis

Page 36: Tree building 2

Chain results (excerpt):

      1 -- [-5762.003] (-5753.828) [...6 remote chains...]
   1000 -- (-4832.654) (-4844.806) [...6 remote chains...] -- 0:16:39
   Average standard deviation of split frequencies: 0.143471
   2000 -- (-4748.109) (-4762.679) [...6 remote chains...] -- 0:24:57

*************** [SNIP] ***************

 999000 -- (-4886.847) [-4876.966] [...6 remote chains...] -- 0:00:06
 Average standard deviation of split frequencies: 0.002371
1000000 -- (-4885.621) [-4889.536] [...6 remote chains...] -- 0:00:00
 Average standard deviation of split frequencies: 0.002413

Using MrBayes: Convergence

Bayesian Analysis

Page 37: Tree building 2

Log-probability plot appears stochastic. Overlay plot for both runs (1 = Run number 1; 2 = Run number 2; * = Both runs):

[ASCII overlay plot of log probability (roughly -4882.76 to -4879.06) against generation (100000 to 1000000) for both runs]

Using MrBayes: Convergence

Bayesian Analysis

Page 38: Tree building 2

Using MrBayes: Convergence
Potential Scale Reduction Factor (PSRF) ~1.000

Estimated Sample Size (ESS) > 100

Parameter    Mean      Variance   95% HPD Lower   95% HPD Upper   Median     min ESS   avg ESS   PSRF
TL           1.569934  0.002486   1.469170        1.658322        1.571118   557.38    565.69    1.000
r(A<->C)     0.169762  0.000056   0.155400        0.183553        0.169529   390.57    402.28    1.005
r(A<->G)     0.339080  0.000150   0.314176        0.361233        0.339147   223.59    264.91    1.007
r(A<->T)     0.099197  0.000030   0.089226        0.110209        0.098949   473.14    478.79    0.999
r(C<->G)     0.048546  0.000018   0.040465        0.056966        0.048398   377.39    383.60    1.003
r(C<->T)     0.264055  0.000110   0.244343        0.283464        0.263957   260.59    265.08    1.005
r(G<->T)     0.079359  0.000028   0.070214        0.090797        0.079234   423.74    466.87    0.999
pi(A)        0.241458  0.000036   0.230500        0.252766        0.241097   344.85    373.56    1.000
pi(C)        0.264338  0.000040   0.252421        0.276949        0.264253   322.23    359.75    1.004
pi(G)        0.215574  0.000040   0.203251        0.227854        0.215273   359.47    406.56    1.005
pi(T)        0.278630  0.000044   0.264885        0.290797        0.278650   254.72    339.96    1.000
alpha        0.666901  0.005264   0.521789        0.801975        0.663511   387.86    408.38    0.999
pinvar       0.374432  0.000849   0.314571        0.429393        0.375512   356.44    374.84    0.999

Bayesian Analysis

Page 39: Tree building 2

The Consensus Tree

Bayesian Analysis

Page 40: Tree building 2

The Consensus Tree

Likelihood Analysis

Page 41: Tree building 2

Distance Analysis

The Neighbor-Joining Tree

[Neighbor-joining tree figure; bootstrap support values shown on the branches: 100, 100, 100, 100, 68.2, 98.6, 68.8, 93.4, 59.6, 99.4, 89.8]

Page 42: Tree building 2

Parsimony Analysis

The Bootstrap Consensus Tree

Page 43: Tree building 2

Tree Building, in conclusion

• How to generate trees using distance, parsimony, and likelihood
• How to calculate bootstrap support
• Bayesian exploration of the phylogeny posterior distribution

Always use more than one tree generation algorithm; look for consensus and investigate disagreement.

Where have we been? What have we done?

Page 44: Tree building 2

Visualizing Trees

• FigTree
• Dendroscope

Page 45: Tree building 2

REFERENCES

Inferring Phylogenies. J. Felsenstein. 2004.
A good general reference written in Professor Felsenstein's unique style.

The Phylogenetic Handbook. Edited by P. Lemey, et al. 2009.
A very thorough exploration of theory and practice.

Biological Sequence Analysis. R. Durbin, et al. 1998.
A good introduction to maximum likelihood and hidden Markov models (HMM).

Phylogenetic Trees Made Easy. B. Hall.
1st and 2nd Editions – A PAUP and PHYLIP manual, cookbook style.
3rd Edition – A MEGA4 manual, among other things, cookbook style.
4th Edition – MEGA5 and MrBayes 3.2 manuals, cookbook style.

Page 46: Tree building 2


Seminar Follow-Up Site

For access to past recordings, handouts, and slides, visit this site from the NIH network: http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/

1. Select a Subject Matter
2. Select a Topic

View:
• Seminar Details
• Handout and Reference Docs
• Relevant Links
• Seminar Recording Links

Recommended Browsers:
• IE for Windows
• Safari for Mac (Firefox on a Mac is incompatible with NIH Authentication technology)

Login:
• If prompted to log in, use "NIH\" in front of your username

Page 47: Tree building 2


Retrieving Slides/Handouts

This lecture series

Page 48: Tree building 2


Retrieving Slides/Handouts

This lecture

This file

Page 49: Tree building 2


Questions?

Page 50: Tree building 2


Next

Making Publication-Quality Phylogenetic Tree Figures

Tuesday, 10 November at 1300