Phylogenetic Analyses in R
Klaus Schliep
Universidad de Vigo
Porto, 15–16 July 2013
Outline
Getting started
Data Structures
Distance based methods
Maximum Parsimony
Maximum likelihood
Section 1
Getting started
About
These slides give a short introduction to phylogenetic reconstruction in R. They focus mostly on the packages ape and phangorn. I have to thank Emmanuel Paradis for his work on ape. The slides are produced with literate programming using LaTeX, Beamer, Sweave and R, so all the code and graphics are "real"!
Installation
To install an R package it is good to have administrator rights. Download R from www.cran.r-project.org. You can easily install packages from within R:
> install.packages("phangorn")
> install.packages("phytools")
> install.packages("pegas")
> install.packages("seqLogo")
> q()
Then you can load the packages you need:
> library("phangorn")
> library("seqLogo")
Help
The R homepage provides lots of general documentation, FAQs, etc. There are help pages for all the functions, and most of them contain examples.
> library(help="phangorn")
> help.start()
> ?pml
> help(pml)
> example(pml)
> vignette("Ancestral")
Copying and pasting parts of the code in the examples is a good start. If you prefer reading a book (even though books become outdated quickly):
Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer.
There is a mailing list, stat.ethz.ch/mailman/listinfo/r-sig-phylo, where you can ask questions after browsing through the archive.
Section 2
Data Structures
Data Structures
Reminder:
1. Data in R are made of vector + attribute(s) (and combinations of these). Vector: a series of elements all of the same kind (a list is a vector of pointers).
2. The class is the attribute determining the action of generic functions (plot, summary, etc.)
We will make heavy use of the following 3 data structures:
1. phyDat: sequences (DNA, AA, codons, user defined) in phangorn
2. DNAbin: DNA sequences (ape format)
3. phylo: phylogenetic trees
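The reminder about vectors, attributes and classes can be illustrated with a few lines of base R. This is only a sketch: the class name "edgelen" and its summary method are made up for the example and are not part of ape or phangorn.

```r
# A numeric vector with an extra attribute
x <- c(0.12, 0.05, 0.31)
attr(x, "unit") <- "substitutions per site"

# Assign a class (the name "edgelen" is hypothetical)
class(x) <- "edgelen"

# A method for the generic function summary(), written for the made-up class
summary.edgelen <- function(object, ...) {
  total <- sum(unclass(object))
  cat("Total length:", total, attr(object, "unit"), "\n")
  invisible(total)
}

total <- summary(x)  # dispatches to summary.edgelen because of the class attribute
attributes(x)        # shows the "unit" and "class" attributes
```

The same mechanism is what makes plot() or summary() behave differently for objects of class phylo, DNAbin or phyDat.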
Class phylo
This class represents phylogenetic trees. Tip labels may be replicated, and node labels are optional (they may be absent). Input:
1. read.tree: Newick files
2. read.nexus: NEXUS files
If the file contains several trees, these two functions return an object of class multiPhylo, which is a list of trees of class phylo. You can write objects of class phylo using write.tree or write.nexus.
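A minimal sketch of reading and writing trees, assuming ape is installed. To keep it self-contained the Newick strings are passed through the text argument of read.tree instead of a file; the taxon names are chosen for illustration only.

```r
library(ape)

# One tree in Newick format, passed as text instead of a file name
tree <- read.tree(text = "((human,chimp),(gorilla,orangutan));")
class(tree)      # "phylo"
tree$tip.label   # tip labels in the order they were read

# Several Newick strings give an object of class multiPhylo
trees <- read.tree(text = c("((A,B),(C,D));", "((A,C),(B,D));"))
class(trees)     # "multiPhylo"
length(trees)

# Without a file argument, write.tree() returns the Newick string
newick <- write.tree(tree)
newick
```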
Plotting trees
ape has great plotting capabilities.
> help(plot.phylo)
A simple example:
> tree <- rtree(10)
> par(mfrow=c(2,2), mar=rep(0,4))
> plot(tree)
> plot(tree, type="fan")
> plot(tree, type="unrooted")
> plot(tree, type="cladogram")
Plotting trees
[Figure: the same random 10-tip tree (tips t1 to t10) drawn in four panels: default phylogram, fan, unrooted and cladogram.]
Transforming trees
There are many functions in ape and phangorn to transform trees (i.e. objects of class phylo):
> root(tree, outgroup)
> drop.tip(tree, "t1")
> extract.clade(phy, 1)
> bind.tree(tree1, tree2)
> unroot(tree)
> multi2di(tree)
> di2multi(tree)
> nni(tree)
> rSPR(tree)
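The calls above are schematic. A runnable sketch using only ape functions might look as follows; the small tree, its edge lengths and the tolerance value are assumptions chosen for illustration.

```r
library(ape)

# Small rooted example tree with edge lengths
tree <- read.tree(text = "(((t1:1,t2:1):1,t3:2):1,(t4:1,t5:1):2);")

rerooted  <- root(tree, outgroup = "t5", resolve.root = TRUE)  # re-root on t5
pruned    <- drop.tip(tree, "t1")        # remove a single tip
unrooted  <- unroot(tree)                # remove the root
collapsed <- di2multi(tree, tol = 1.5)   # collapse internal edges shorter than tol
resolved  <- multi2di(collapsed)         # resolve multifurcations at random

Ntip(pruned)
is.rooted(unrooted)
Nnode(collapsed)
```

nni and rSPR from phangorn are used in the same way and return rearranged trees.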
Class phyDat
The starting point for phylogenetic reconstruction is a sequence alignment. ape can call clustal, tcoffee and muscle, and phyloch can call mafft, prank and gblocks. More frequently you will just read in an alignment:
> align1 <- read.phyDat("myfile")
phangorn (phyDat) and ape (DNAbin) use different formats to represent alignments, but it is easy to convert between them.
> align2 <- read.dna("myfile") # ape format
> align3 <- as.phyDat(align2) # converted to phangorn format
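Because "myfile" is only a placeholder, here is a self-contained sketch of the conversion in both directions. It assumes ape and phangorn are installed and uses the woodmouse alignment that ships with ape.

```r
library(ape)
library(phangorn)

data(woodmouse)                # example alignment from ape
class(woodmouse)               # "DNAbin"

woody <- as.phyDat(woodmouse)  # ape -> phangorn
class(woody)                   # "phyDat"

back <- as.DNAbin(woody)       # phangorn -> ape
class(back)                    # "DNAbin"
```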
Section 3
Distance based methods
Distance based methods
Distance methods take a distance or dissimilarity matrix as input.

Ultrametric    Additive
upgma (a)      fastme.ols
wpgma (a)      fastme.bal
               nj
               UNJ (a)
               bionj

(a) in phangorn, the rest in ape.
- Fast methods, O(n^2) or O(n^3) → big data sets can be analysed.
- Distances can be calculated for different kinds of data.
- In phylogenetics often used to compute starting trees for ML, MP or inside species tree methods.
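As a sketch of how the functions in the table are used, assuming ape and phangorn are installed: a distance matrix is computed once and then passed to the different tree-building functions. The Laurasiatherian alignment ships with phangorn.

```r
library(ape)
library(phangorn)

data(Laurasiatherian)            # example alignment from phangorn
dm <- dist.ml(Laurasiatherian)   # pairwise distances under the default model

tree_upgma <- upgma(dm)          # ultrametric tree (phangorn)
tree_nj    <- nj(dm)             # additive tree (ape)
tree_bionj <- bionj(dm)          # additive tree (ape)

is.ultrametric(tree_upgma)
is.rooted(tree_nj)
```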
Distance based methods
> set.seed(1)
> bs <- bootstrap.phyDat(Laurasiatherian, FUN = function(x)nj(dist.ml(x)), bs=100)
> class(bs) <- 'multiPhylo'
> cnet = consensusNet(bs, .3)
> plot(cnet, show.tip.label=FALSE, show.nodes=TRUE)
Consensus network
Section 4
Maximum Parsimony
Maximum parsimony
In contrast to the distance methods, (maximum) parsimony uses sequence alignments as input. The target is to minimize an optimality criterion, i.e. a score assigned to a tree, given the data. For the parsimony method the score is the minimal number of substitutions needed to account for the data on a phylogeny.
> data(Laurasiatherian)
> tree = nj(dist.ml(Laurasiatherian))
> parsimony(tree, Laurasiatherian)
[1] 9776
> tree2 = optim.parsimony(tree, Laurasiatherian,
trace=FALSE, rearrangement="SPR")
> parsimony(tree2, Laurasiatherian)
[1] 9713
> tree3 = pratchet(Laurasiatherian, rearrangement="SPR", trace=0)
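To make the parsimony score concrete, here is a base-R sketch of Fitch's algorithm for a single site on a four-taxon tree. It is an illustration written for these notes, not the implementation used by phangorn; the function names and the example bases are assumptions.

```r
# Combine the state sets of two child nodes (Fitch's rule for one site)
fitch_join <- function(left, right) {
  common <- intersect(left$set, right$set)
  if (length(common) > 0) {
    # the children share at least one state: no additional change
    list(set = common, cost = left$cost + right$cost)
  } else {
    # no shared state: take the union and count one substitution
    list(set = union(left$set, right$set), cost = left$cost + right$cost + 1)
  }
}

# A tip is a single observed state with cost 0
tip <- function(state) list(set = state, cost = 0)

# Tree ((human,chimp),(gorilla,orangutan)) with bases a, a, g, t at one site
hc   <- fitch_join(tip("a"), tip("a"))
go   <- fitch_join(tip("g"), tip("t"))
root <- fitch_join(hc, go)
root$cost   # 2 substitutions are needed for this site
```

The parsimony score of an alignment is the sum of these per-site costs.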
Branch and bound
Normally it is not possible to evaluate an optimality criterion for all trees, as there are just too many trees.
> sapply(3:10, howmanytrees, FALSE)
[1] 1 3 15 105 945 10395
[7] 135135 2027025
> howmanytrees(20, FALSE)
[1] 2.216431e+20
For small datasets it is possible to find all most parsimonious trees using a branch and bound algorithm. For datasets with more than 10 taxa this can take a long time and depends strongly on how tree-like the data are.
> besttree <- bab(subset(Laurasiatherian,1:10), trace=0)
> parsimony(besttree, Laurasiatherian)
[1] 2695
Ancestral reconstruction
To reconstruct ancestral sequences we first load some data and reconstruct a tree:
> primates = read.phyDat("primates.dna")
> tree = pratchet(primates, trace=0)
> tree = acctran(tree, primates)
> parsimony(tree, primates)
[1] 746
In parsimony analysis the edge lengths represent the observed number of changes. Reconstructing ancestral states therefore also defines the edge lengths of a tree. However, there can exist several equally parsimonious reconstructions, or states can be ambiguous, and therefore edge lengths can differ between methods (e.g. "MPR" or "ACCTRAN").
> anc.acctran = ancestral.pars(tree, primates, "ACCTRAN")
> anc.mpr = ancestral.pars(tree, primates, "MPR")
Ancestral reconstruction
> seqLogo( t(subset(anc.mpr, getRoot(tree), 1:20)[[1]]), ic.scale=FALSE)
[Figure: sequence logo of the MPR reconstruction at the root for positions 1 to 20; x-axis: Position, y-axis: Probability.]
Ancestral reconstruction MPR
> plotAnc(tree, anc.mpr, 17)
> title("MPR")
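[Figure: primate tree (Mouse, Bovine, Lemur, Tarsier, Squir Monk, Jpn Macaq, Rhesus Mac, Crab-E.Mac, BarbMacaq, Gibbon, Orang, Gorilla, Chimp, Human) with the MPR reconstruction of site 17 shown at the nodes; legend: a, c, g, t; title: MPR.]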
Ancestral reconstruction ACCTRAN
> plotAnc(tree, anc.acctran, 17)
> title("ACCTRAN")
[Figure: the same primate tree with the ACCTRAN reconstruction of site 17 shown at the nodes; legend: a, c, g, t; title: ACCTRAN.]
Section 5
Maximum likelihood
Maximum Likelihood
"[In 1961] I had visions of evolutionary tree estimation being much the same [than linkage estimation] but with the addition of the need to estimate the form of the tree itself, surely a fatal complexity: my intuition was that there would be insufficient data for the task."
—A.W.F. Edwards (2009)
Phylogenetic likelihood is the probability $f(x \mid \theta, \tau)$ of observing an alignment $x$ given a model of (nucleotide) substitution with parameters $\theta$ and phylogenetic tree $\tau$.
\[
L(\theta, \tau, x) = \prod_{i=1}^{N} f(x_i \mid \theta, \tau)
\]
where $N$ is the number of sites in the alignment. It is common to maximise the log-likelihood function
\[
\ell(\theta, \tau, x) = \sum_{i=1}^{N} \log f(x_i \mid \theta, \tau)
\]
which also maximises $L(\theta, \tau, x)$.
Applications in phylogenetics
Felsenstein (1981) introduced the pruning algorithm, which made the computation of the likelihood feasible. Let nodes $j$ and $k$ have a direct ancestor $h$. We can compute the conditional likelihood
\[
L_h(x_h) = \left( \sum_{x_j} L_j(x_j)\, p_{x_j, x_h}(t_j) \right) \times \left( \sum_{x_k} L_k(x_k)\, p_{x_k, x_h}(t_k) \right)
\]
The likelihood of the tree is evaluated by traversing the tree in postorder fashion from the tips towards the root. For unrooted trees, a root can be chosen arbitrarily as our models are time-reversible. We get the likelihood of a site $x_i$ if we weight the conditional likelihood of the root node $r$ with the base composition $\pi$ and sum over the states at the root:
\[
f(x_i \mid \theta, \tau) = \sum_{x_r} \pi_{x_r} L_r(x_r)
\]
These formulas can be adapted to estimate ancestral sequences.
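The recursion can be sketched in base R for a single site. This is an illustration under stated assumptions, not phangorn's implementation: it assumes the Jukes-Cantor (JC69) model, the tree ((human,chimp),(gorilla,orangutan)) with observed bases a, a, g, t, and an arbitrary branch length of 0.1 on every edge. All function names are made up for the example.

```r
bases <- c("a", "c", "g", "t")

# JC69 transition probability matrix for a branch of length t
jc69 <- function(t) {
  P <- matrix(0.25 - 0.25 * exp(-4 * t / 3), 4, 4,
              dimnames = list(bases, bases))
  diag(P) <- 0.25 + 0.75 * exp(-4 * t / 3)
  P
}

# Conditional likelihood at a tip: 1 for the observed base, 0 otherwise
tip_lik <- function(b) as.numeric(bases == b)

# Conditional likelihood at a parent node from its two children j and k
prune <- function(Lj, tj, Lk, tk) {
  as.vector(jc69(tj) %*% Lj) * as.vector(jc69(tk) %*% Lk)
}

t_all  <- 0.1                                   # assumed branch length
L_hc   <- prune(tip_lik("a"), t_all, tip_lik("a"), t_all)
L_go   <- prune(tip_lik("g"), t_all, tip_lik("t"), t_all)
L_root <- prune(L_hc, t_all, L_go, t_all)

# Site likelihood: weight the root states by the base frequencies (1/4 under JC69)
site_lik <- sum(0.25 * L_root)
site_lik
```

The likelihood of the alignment is the product of such site likelihoods; in practice their logarithms are summed.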
ML in phylogenetics
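[Figure, built up over several slides: four-taxon tree with tips human, chimp, gorilla, orangutan and internal nodes labelled 5, 6, 7.]
Observed bases at one site (human, chimp, gorilla, orangutan): a a g t
Conditional likelihoods at the tips, in the order a|c|g|t:
1|0|0|0 1|0|0|0 0|0|1|0 0|0|0|1
Conditional likelihoods at the internal nodes, in the order a|c|g|t:
0.000988|0.000031|0.000595|0.000744
0.027161|0.000559|0.016240|0.000559
0.923613|0.000168|0.000168|0.000169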
Finding the best topology
A binary unrooted tree with 4 tips has 5 edges and 3 distinct topologies. Here are the general formulas for binary unrooted trees with n tips:
- 2n − 3 edges
- (2n − 5)!! = 1 × 3 × 5 × · · · × (2n − 5) topologies
Rooted binary trees have 2n − 2 edges and (2n − 3)!! topologies. A function exists for this:
> howmanytrees(4, rooted=FALSE)
[1] 3
> howmanytrees(10, rooted=FALSE)
[1] 2027025
> howmanytrees(20, rooted=FALSE)
[1] 2.216431e+20
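The counts can be reproduced in base R from the double-factorial formulas, (2n − 5)!! for unrooted and (2n − 3)!! for rooted binary trees with n tips. The helper names are made up for this sketch.

```r
# Double factorial of an odd number m: 1 * 3 * 5 * ... * m
odd_double_factorial <- function(m) prod(seq(1, m, by = 2))

n_unrooted <- function(n) odd_double_factorial(2 * n - 5)  # needs n >= 3 tips
n_rooted   <- function(n) odd_double_factorial(2 * n - 3)  # needs n >= 2 tips

n_unrooted(4)    # 3
n_unrooted(10)   # 2027025
n_rooted(4)      # 15
```

A rooted tree with n tips corresponds to an unrooted tree with n + 1 tips, which is why n_rooted(4) equals n_unrooted(5).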
Finding the best trees
The strategy of evaluating the likelihood criterion for all trees in order to find the best tree topology is in most cases highly impractical. Instead, local tree rearrangements are used to search locally within the tree space. The idea behind such a heuristic is to use a starting tree and search locally for improved scores (parsimony, maximum likelihood, least squares), until no further rearrangement leads to a tree with a better score.
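Nearest neighbor interchange
For any internal edge of a binary tree there exist three different ways to connect its four subtrees, one of which is the current tree.
[Figure: the three arrangements of the subtrees A, B, C, D around an internal edge: (A,B | C,D), (A,C | B,D) and (A,D | B,C).]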
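A small sketch with phangorn's nni function, assuming ape and phangorn are installed. A four-taxon tree has a single internal edge, so the two alternative arrangements described above are its only NNI neighbours.

```r
library(ape)
library(phangorn)

# Unrooted four-taxon tree with one internal edge separating (A,B) from (C,D)
tree <- read.tree(text = "((A,B),C,D);")

neighbours <- nni(tree)   # all trees one NNI move away from the current tree
length(neighbours)
write.tree(neighbours)    # Newick strings of the rearranged trees
```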
Modelling rate variation
We assume that the substitution rate varies between different sites (intron vs. exon, codon positions, etc.). Two approaches are commonly used:
- define different partitions
- model rate variation with different rate categories, with a (discrete) Γ distribution and/or a proportion of invariable sites
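As an illustration of the discrete Γ approach, the following base-R sketch computes the relative rates of k equally probable categories, using the mean of each category of a gamma distribution with mean 1. This is one common construction and is shown under that assumption; the function name is made up and it is not claimed to be the code used inside phangorn.

```r
# Relative rates of k equally probable categories of a gamma distribution
# with mean 1 (shape = rate), using the mean of each category
discrete_gamma_rates <- function(shape, k = 4) {
  # category boundaries: quantiles at 0, 1/k, ..., 1
  cuts <- qgamma(seq(0, 1, length.out = k + 1), shape = shape, rate = shape)
  # the mean within each category, via the gamma cdf with shape + 1
  k * diff(pgamma(cuts, shape = shape + 1, rate = shape))
}

rates <- discrete_gamma_rates(shape = 0.5, k = 4)
rates         # increasing rates, from slowly to fast evolving sites
mean(rates)   # the rates average to 1
```

A small shape parameter gives strongly unequal rates, while a large shape parameter gives rates close to 1 for all categories.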
Comparing trees and models
The phylogenetic likelihood allows us to compare many different models or trees. There is often a bias vs. variance trade-off. Simple models are easy to interpret but can often be biased.
[Figure: mean squared error (MSE), variance and squared bias (Bias^2) plotted against the number of parameters.]
Comparing trees and models
The phylogenetic likelihood allows us to compare many different models or trees.
- If two models are nested, that is, one model can be described as a special case of the other, then we can directly compare their likelihoods under their ML parameter estimates for a fixed tree using a likelihood ratio test (LRT).
- For non-nested models we can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC):
\[
\mathrm{AIC} = -2\,\ell(\theta, \tau, x) + 2\,df, \qquad \mathrm{BIC} = -2\,\ell(\theta, \tau, x) + \ln(n)\,df
\]
where $\ell$ is the maximised log-likelihood, $df$ is the number of parameters of the model and $n$ the number of sites.
- Or use the Shimodaira-Hasegawa test or similar bootstrap approaches.
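The criteria are simple arithmetic on the log-likelihood. The sketch below uses AIC = −2ℓ + 2 df and BIC = −2ℓ + ln(n) df with the log-likelihoods and parameter counts of the JC and GTR+G+I rows from the model selection table later in these slides. The alignment length n = 3179 is an assumption (the length of the Laurasiatherian alignment); it is not stated on the slides.

```r
# Log-likelihoods and numbers of parameters from the model selection table
loglik <- c(JC = -54303.67, "GTR+G+I" = -44624.02)
df     <- c(JC = 91,        "GTR+G+I" = 101)
n_sites <- 3179   # assumption: number of sites in the alignment

aic <- -2 * loglik + 2 * df
bic <- -2 * loglik + log(n_sites) * df
round(aic, 2)
round(bic, 2)

# Likelihood ratio test for the nested pair JC vs. GTR+G+I
lrt_stat <- unname(2 * (loglik["GTR+G+I"] - loglik["JC"]))
p_value  <- pchisq(lrt_stat, df = unname(df["GTR+G+I"] - df["JC"]),
                   lower.tail = FALSE)
```

Smaller AIC or BIC values indicate the preferred model; here both criteria favour GTR+G+I.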
Detection of molecular adaptation
We look at each triplet of nucleotides and assume that only one nucleotide can be replaced at a time. Then we can distinguish between nucleotide substitutions that result in the same amino acid (synonymous substitutions) or a different amino acid (non-synonymous substitutions). The ratio dN/dS of non-synonymous to synonymous substitutions can be an indication of the kind of selective pressure acting on the codon site. Under negative selection, we expect that non-synonymous substitutions will accumulate more slowly than synonymous ones. Under positive or diversifying selection, we expect more amino acid changing replacements.
Applications with phangorn
The two main functions are pml to set up the model and optim.pml for optimising parameters and the tree with ML. Example session for the Jukes-Cantor, GTR and GTR+Γ+I models:
> data(Laurasiatherian)
> tr <- nj(dist.ml(Laurasiatherian))
> m0 <- pml(tr, Laurasiatherian)
> m.jc69 <- optim.pml(m0, optNni=TRUE)
> m.gtr <- optim.pml(m0, optNni=TRUE, model="GTR")
> m.gtr.G.I <- optim.pml(update(m.gtr, k=4), model=
"GTR", optNni=TRUE, optGamma=TRUE, optInv=TRUE)
By default, only the edge lengths are optimized. Currently phangorn only supports NNI tree rearrangements (equivalent to PhyML vers. 2).
There exist several useful generic functions like update, anova or AIC for objects of class pml.
> methods(class="pml")
[1] anova.pml logLik.pml plot.pml print.pml
[5] update.pml vcov.pml
For example, as the models are nested, we can compare them with a likelihood ratio test:
> anova(m.jc69, m.gtr, m.gtr.G.I)
Likelihood Ratio Test Table
Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1 -54113 91
2 -50603 99 8 7020 < 2.2e-16 ***
3 -44527 101 2 12151 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Partition models
pmlPart(global ~ local, object, model)

global    local
bf        bf
Q         Q
inv       inv
shape     shape
edge      edge
          rate
nni

Each component can be used only once in the formula.
Partition models
There are two different ways to set up partition models.
1. Setting up partition models for different genes.
> fit1 <- pml(tree, g1)
> fit2 <- pml(tree, g2)
> fit3 <- pml(tree, g3)
> fit4 <- pml(tree, g4)
> genePart <- pmlPart(Q + bf ~ edge,
list(fit1, fit2, fit3, fit4), optRooted=TRUE)
> trees <- lapply(genePart$fits, function(x)x$tree)
> class(trees) <- "multiPhylo"
> densiTree(trees, type="phylogram", col="red")
where g1, g2, g3 and g4 are objects of class phyDat.
ML in phylogenetics
[Figure: densiTree plot of the gene trees; tips: Scer, Spar, Smik, Skud, Sbay, Scas, Sklu, Calb.]
Partition models
2. Partitioning via a weight matrix.
> woody <- phyDat(woodmouse)
> tree <- nj(dist.ml(woody))
> fit <- pml(tree, woody)
> w <- attr(woody, "index")
> weight <- table(w, rep(c(1,2,3), length=length(w)))
> codonPart <- pmlPart(edge ~ rate, fit,
model=c("JC", "JC", "GTR"), weight=weight)
Model / tree comparison
Alternatively, we can use the Shimodaira-Hasegawa test to check for differences between models:
> SH.test(m.jc69, m.gtr, m.gtr.G.I)
Trees ln L Diff ln L p-value
[1,] 1 -54112.74 9585.685 0.0000
[2,] 2 -50602.74 6075.683 0.0000
[3,] 3 -44527.06 0.000 0.5911
Model selection
Two possibilities:
- ape: phymltest
> write.phyDat(woody, "woody.phy")
> out <- phymltest("woody.phy", execname =
"~/phyml")
- phangorn: modelTest
> mt <- modelTest(Laurasiatherian, model=c("JC",
"F81", "HKY", "GTR"))
modelTest also works for amino acid models, similar to ProtTest.
> mt <- modelTest(myAAData, model=c("WAG", "JTT",
"LG","Dayhoff"))
Model Selection
     Model      df     logLik        AIC        BIC
 1   JC       91.00  -54303.67  108789.35  109341.20
 2   JC+I     92.00  -50673.32  101530.63  102088.55
 3   JC+G     92.00  -48684.10   97552.19   98110.11
 4   JC+G+I   93.00  -48605.03   97396.06   97960.05
 5   F81      94.00  -54212.64  108613.27  109183.32
 6   F81+I    95.00  -50549.53  101289.06  101865.17
 7   F81+G    95.00  -48500.49   97190.99   97767.10
 8   F81+G+I  96.00  -48416.26   97024.51   97606.69
 9   HKY      95.00  -51275.86  102741.72  103317.83
10   HKY+I    96.00  -47451.73   95095.45   95677.63
11   HKY+G    96.00  -44893.11   89978.23   90560.40
12   HKY+G+I  97.00  -44770.18   89734.36   90322.60
13   GTR      99.00  -50759.89  101717.79  102318.16
14   GTR+I   100.00  -47081.77   94363.55   94969.98
15   GTR+G   100.00  -44759.49   89718.99   90325.42
16   GTR+G+I 101.00  -44624.02   89450.04   90062.54
Bootstrap
> bs <- bootstrap.pml(m.gtr, bs=100, optNni=TRUE)
> plotBS(m.gtr$tree, bs, type="phylo", bs.adj=c(.5,0))
[Figure: ML tree (GTR model) of the Laurasiatherian data with bootstrap support values, in percent, on the edges.]
Codon Models
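\[
q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
\pi_j & \text{for synonymous transversion} \\
\pi_j \kappa & \text{for synonymous transition} \\
\pi_j \omega & \text{for non-synonymous transversion} \\
\pi_j \omega \kappa & \text{for non-synonymous transition}
\end{cases}
\]
or, if we leave out $\pi_j$ (the frequency of codon $j$):
\[
q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ in more than one position} \\
1 & \text{for synonymous transversion} \\
\kappa & \text{for synonymous transition} \\
\omega & \text{for non-synonymous transversion} \\
\omega \kappa & \text{for non-synonymous transition}
\end{cases}
\]
where $\omega$ is the dN/dS ratio, $\kappa$ the transition/transversion ratio and $\pi_j$ the equilibrium frequency of codon $j$.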
Codon Models
> (dat <- phyDat(as.character(yeast), "CODON"))
> tree <- nj(dist.ml(yeast))
> fit <- pml(tree, dat)
> ctr <- pml.control(trace=0)
> fit0 <- optim.pml(fit, control = ctr)
> fit1 <- optim.pml(fit0, model="codon1", control=ctr)
> fit2 <- optim.pml(fit0, model="codon2", control=ctr)
> fit3 <- optim.pml(fit0, model="codon3", control=ctr)
Model     κ      ω
codon0    1      1
codon1    free   free
codon2    1      free
codon3    free   1

Additionally, the equilibrium frequencies of the codons πj can be estimated by setting the parameter optBf=TRUE.
Codon Models
> anova(fit0, fit2, fit1)
Likelihood Ratio Test Table
Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1 -1054762 13
2 -648282 14 1 812961 < 2.2e-16 ***
3 -642807 15 1 10949 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(fit0, fit3, fit1)
Likelihood Ratio Test Table
Log lik. Df Df change Diff log lik. Pr(>|Chi|)
1 -1054762 13
2 -708674 14 1 692176 < 2.2e-16 ***
3 -642807 15 1 131735 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1