phylogenetic inference
Phylogenetic Inference
• Involves an attempt to estimate the evolutionary history of a collection of organisms (taxa) or a family of genes
• Two major components
  – Estimation of the evolutionary tree (branching order)
  – Using estimated trees (phylogenies) as an analytical framework for further evolutionary study
• Traditional role: systematics and classification
Example 1: Closest living relatives of humans
[Figure: two alternative trees for humans, chimpanzees, bonobos, gorillas, and orangutans, each drawn against a time axis in millions of years ago (MYA).]
Pre-molecular view (morphology): time axis from 15-30 MYA to 0
Emerging picture from mtDNA, most nuclear genes, DNA/DNA hybridization: time axis from 14 MYA to 0
Example 2: Who are whales related to?
Morphological data suggest that whales are a “sister clade” to extant artiodactylans, but molecular data suggest strongly that whales and hippos are more closely related to each other than hippos are to other artiodactylans
Morphology
Mt and nuclear DNA sequences, SINEs, LINEs
Other interesting applications
Forensics: Transmission of HIV by Florida dentist
Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people:
[Figure: tree with tips labeled DENTIST, Patients A to G, and local controls 2, 3, 9, and 35. Groups of patients are marked "Yes" or "No".]
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
From Ou et al. (1992) and Page & Holmes (1998), redrawn by Caro-Beth Stewart
Other interesting applications
Studying dynamics of microbial communities:
Sequence 16S rDNA to identify and quantify microbes in soil before and after pesticide exposure. Many of the microbes are previously unknown, so the gene sequences are studied phylogenetically to follow changes in community composition.
[Figure: tree combining known sequences from a database with novel microbial sequences.]
Other interesting applications
Predicting evolution of influenza viruses
Lineages with many mutations in one set of positively selected codons were usually the ones that led to successful strains in subsequent seasons
Other interesting applications
Predicting functions of uncharacterized genes
Use “character-mapping” to infer functions based on parsimonious reconstructions
Many situations where similarity-based methods are inadequate, e.g.:
Other interesting applications
• Drug discovery: predicting natural ligands for cell surface receptors that are potential drug targets (e.g., G-protein coupled receptors)
G-protein-coupled receptors are a pharmacologically important protein family with approximately 450 genes identified to date. Pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants, and antihypertensives. The functions of many of these proteins are unknown, and determining ligands and signaling pathways is time-consuming and expensive. This difficulty motivates the search for a computational method which can predict ligand and second messenger with high reliability. Classifying this family of proteins helps us classify drugs, a technique which might be called "evolutionary pharmacology”… A computational method based on evolutionary tree reconstruction and employing an accepted-mutation stepmatrix can predict the ligand selectivities and intracellular signaling pathways of uncharacterized receptors, given only the amino acid sequence of the receptor. This dramatically increases the efficiency of functional characterization of new receptors. (http://www.cis.upenn.edu/~krice/receptor.html)
• Vaccine development—engineer vaccines to confer immunity against multiple virus populations by targeting their inferred common ancestors
Common Phylogenetic Tree Terminology
[Figure: rooted tree with terminal nodes A, B, C, D, and E, labeled with the terms below.]
• Ancestral node or ROOT of the tree
• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)
• Branches (edges) and lineages
• Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:
[Figure: three trees of taxa A to E, with a polytomy (multifurcation) and a bifurcation marked.]
• Completely unresolved or "star" phylogeny
• Partially resolved phylogeny
• Fully resolved, bifurcating phylogeny (binary tree)
C-B Stewart, NHGRI lecture, 12/5/00
Three possible unrooted trees for four taxa (A, B, C, D)
Tree 1: A and B on one side of the internal branch, C and D on the other
Tree 2: A and C on one side of the internal branch, B and D on the other
Tree 3: A and D on one side of the internal branch, B and C on the other
Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the optimal tree for the phylogenetic method of choice (no guarantee that optimality = truth).
The number of unrooted trees increases in a greater than exponential manner with number of taxa
(2N - 5)!! = # unrooted trees for N taxa
[Figure: example unrooted trees for 3, 4, 5, and 6 taxa.]
Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: an unrooted tree of A, B, C, and D with a root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree with the root at a different position, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 (A and B on one side, C and D on the other) with root positions numbered 1 to 5, and the corresponding rooted trees 1a, 1b, 1c, 1d, and 1e.]
These trees show five different evolutionary relationships among the taxa
[Figure: rooted tree 1a redrawn several times with its branches rotated around the nodes.]
All of these rearrangements show the same evolutionary relationships between the taxa
There are two major ways to root trees:
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can be either species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
[Figure: tree of A, B, C, and D with branch lengths 10, 2, 3, 5, and 2; a second panel marks the outgroup.]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
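The midpoint arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `midpoint_on_path` is hypothetical, and the path from A to D is assumed to consist of the three branches of length 10, 3, and 5 used in the sum above.

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path described by its branch lengths, in order.

    Returns (branch_index, offset): the midpoint falls on branch `branch_index`,
    `offset` units from the start of that branch.
    """
    target = sum(branch_lengths) / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= target:
            return index, target - walked
        walked += length
    raise ValueError("path has no branches")


# Assumed path A -> D: terminal branch of A (10), internal branch (3), terminal branch of D (5).
path_a_to_d = [10, 3, 5]
print("d(A,D) =", sum(path_a_to_d))
print("midpoint (branch index, offset):", midpoint_on_path(path_a_to_d))
```

Under these assumptions the root falls on the branch leading to A, 9 units from the tip and 1 unit from the internal node.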
Each unrooted tree theoretically can be rooted anywhere along any of its branches:
# unrooted trees x # roots = # rooted trees

# Taxa   # Unrooted trees   # Roots   # Rooted trees
3        1                  3         3
4        3                  5         15
5        15                 7         105
6        105                9         945
7        945                11        10,395
8        10,395             13        135,135
9        135,135            15        2,027,025
...
30       ~8.69 x 10^36      57        ~4.95 x 10^38

(2N - 3)!! = # rooted trees for N taxa
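The tree counts follow directly from the double-factorial formulas. A minimal sketch, with hypothetical helper names, that reproduces the counts for any number of taxa:

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 (defined as 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted(n_taxa):
    """Number of unrooted bifurcating trees: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def count_rooted(n_taxa):
    """Number of rooted bifurcating trees: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 6, 7, 8, 9, 30):
    roots = 2 * n - 3  # one possible root position per branch of the unrooted tree
    print(n, count_unrooted(n), roots, count_rooted(n))
```

Each unrooted tree has 2N - 3 branches, which is why multiplying the unrooted count by the number of root positions gives the rooted count.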
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG

Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Above the diagonal, Example 1: Uncorrected “p” distance (= observed percent sequence difference)
Below the diagonal, Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
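Both kinds of distance can be computed from a pair of aligned sequences. The sketch below uses the Species A, B, and C sequences from the alignment and the standard Kimura 2-parameter formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function name is hypothetical.

```python
import math

PURINES = {"A", "G"}


def p_and_k2p(seq1, seq2):
    """Return (uncorrected p distance, Kimura 2-parameter distance) for two aligned sequences."""
    n_sites = len(seq1)
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    p_distance = (transitions + transversions) / n_sites
    P = transitions / n_sites
    Q = transversions / n_sites
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return p_distance, k2p


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
species_c = "TTCACTAGACCTGTGGTCCA"
for name, other in (("A-B", species_b), ("A-C", species_c)):
    p, k2p = p_and_k2p(species_a, other)
    print(name, round(p, 2), round(k2p, 2))
```

The corrected distance is always at least as large as the uncorrected one, because it accounts for multiple substitutions at the same site.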
Similarity vs. Evolutionary Relationship:
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (an historical fact)
Two taxa can be most similar without being most closely related:
[Figure: tree of Taxon A, Taxon B, Taxon C, and Taxon D with branch lengths 1, 1, 1, 6, 3, and 5.]
C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).
Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship:
• Shared ancestral characters (‘symplesiomorphies’)
• Shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character):
• Convergent events (in either related or unrelated entities)
• Parallel events (in related entities)
• Reversals (in related entities)
[Figure: example trees with nucleotide states at the tips illustrating each type of similarity.]
METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: d(a, b) ≥ 0 (Non-negativity)
Property 2: d(a, b) = d(b, a) (Symmetry)
Property 3: d(a, b) = 0 if and only if a = b (Distinctness)
and...
Property 4: d(a, c) ≤ d(a, b) + d(b, c) (Triangle inequality)
[Figure: triangle a, b, c with side lengths 6, 9, and 5.]
ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
Property 5: d(a, b) ≤ maximum [d(a, c), d(b, c)]
This implies that the two largest distances are equal, so that they define an isosceles triangle:
[Figure: triangle a, b, c with side lengths 4, 6, and 6, and the corresponding tree.]
Similarity = Relationship if the distances are ultrametric!
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.
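The five properties can be checked mechanically. The sketch below is one possible check, with hypothetical function names; it stores each distance once per unordered pair, so symmetry holds by construction. Which side of each triangle belongs to which pair of taxa is an assumption, since only the side lengths (6, 9, 5 and 4, 6, 6) are recoverable from the figures.

```python
from itertools import permutations


def dist(d, x, y):
    """Distance lookup; keys are unordered pairs, so d(x, y) = d(y, x) automatically."""
    return 0 if x == y else d[frozenset((x, y))]


def is_metric(d, taxa):
    for a, b, c in permutations(taxa, 3):
        if dist(d, a, b) <= 0:  # distinct taxa must be a positive distance apart
            return False
        if dist(d, a, c) > dist(d, a, b) + dist(d, b, c):  # triangle inequality
            return False
    return True


def is_ultrametric(d, taxa):
    if not is_metric(d, taxa):
        return False
    return all(
        dist(d, a, b) <= max(dist(d, a, c), dist(d, b, c))
        for a, b, c in permutations(taxa, 3)
    )


taxa = ["a", "b", "c"]
metric_only = {frozenset("ab"): 6, frozenset("ac"): 9, frozenset("bc"): 5}
clock_like = {frozenset("ab"): 4, frozenset("ac"): 6, frozenset("bc"): 6}
print(is_metric(metric_only, taxa), is_ultrametric(metric_only, taxa))
print(is_metric(clock_like, taxa), is_ultrametric(clock_like, taxa))
```

The first triangle satisfies the triangle inequality but not Property 5, because its largest side (9) exceeds both of the others.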
General strategy for estimating a phylogeny
1. Get data
2. Select an optimality criterion (e.g., parsimony, least-squares distance, maximum likelihood)
3. Choose a search strategy (e.g., stepwise addition with branch swapping, branch-and-bound)
4. Evaluate optimality criterion for each tree visited during search, always keeping track of best tree(s) found
Parsimony (optimality criterion)
• In general: choose the tree requiring the smallest number of (possibly weighted) character-state changes (= steps)
• Assume character independence; can calculate length required by each character and sum over characters to get total tree length
Parsimony variants used for molecular data
• Fitch parsimony (unordered/nonadditive): Each change counts 1 step, regardless of the nature of this change
• Transversion parsimony: changes between a purine (A or G) and a pyrimidine (C or T) (“transversions”) count 1, changes between two purines or between two pyrimidines (“transitions”) count 0
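• Generalized parsimony: User specifies cost of each type of change
[Figure: the four bases A, C, G, and T connected by arrows, with one class of change costing 1 step and the other costing 3 steps.]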
Calculating tree lengths under parsimony using “brute force”
• For each character:
  – Consider every possible ancestral state reconstruction
  – Count total cost required for each of these reconstructions
  – Sum over all characters
[Figure: a four-taxon tree with tip states G, A, C, and C, drawn 16 times, once for each assignment of A, C, G, or T to the two internal nodes. Each reconstruction is scored branch by branch under the two cost matrices below.]
Branch-by-branch costs of the 16 reconstructions:
equal: 1+0+0+1+1 = 3    tv4: 1+0+0+4+4 = 9
equal: 1+1+1+1+1 = 5    tv4: 4+4+4+4+4 = 20
equal: 0+1+1+1+1 = 4    tv4: 0+1+1+4+4 = 10
equal: 1+1+1+1+1 = 5    tv4: 4+4+4+4+4 = 20
equal: 1+0+1+0+0 = 2    tv4: 1+0+4+0+0 = 5
equal: 1+1+0+0+0 = 2    tv4: 4+4+0+0+0 = 8
equal: 0+1+1+0+0 = 2    tv4: 0+1+4+0+0 = 5
equal: 1+1+1+0+0 = 3    tv4: 4+4+1+0+0 = 9
equal: 1+0+1+1+1 = 4    tv4: 1+0+1+4+4 = 10
equal: 1+1+1+1+1 = 5    tv4: 4+4+4+4+4 = 20
equal: 0+1+0+1+1 = 3    tv4: 0+1+0+4+4 = 9
equal: 1+1+1+1+1 = 5    tv4: 4+4+4+4+4 = 20
equal: 1+0+1+1+1 = 4    tv4: 1+0+4+1+1 = 7
equal: 1+1+1+1+1 = 5    tv4: 4+4+1+1+1 = 11
equal: 0+1+1+1+1 = 4    tv4: 0+1+4+1+1 = 7
equal: 1+1+0+1+1 = 4    tv4: 4+4+0+1+1 = 10
Cost matrices (rows and columns in the order A, C, G, T):
equal =
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
tv4 =
0 4 1 4
4 0 4 1
1 4 0 4
4 1 4 0
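The brute-force calculation can be sketched as a loop over all 16 internal-state assignments. This is a minimal illustration rather than a general implementation: it assumes the four-taxon tree pairs the tips G and A under one internal node and the two C tips under the other, and the function names are hypothetical.

```python
from itertools import product

STATES = "ACGT"
PURINES = {"A", "G"}


def equal_cost(x, y):
    """Every change costs 1 step."""
    return 0 if x == y else 1


def tv4_cost(x, y):
    """Transitions cost 1 step, transversions cost 4 steps."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 4


def brute_force_length(tips, cost):
    """Minimum cost over all internal-state assignments for the tree ((w, x), (y, z))."""
    w, x, y, z = tips
    best = None
    for node1, node2 in product(STATES, repeat=2):
        total = (cost(node1, w) + cost(node1, x)    # two branches below node 1
                 + cost(node1, node2)               # internal branch
                 + cost(node2, y) + cost(node2, z))  # two branches below node 2
        if best is None or total < best:
            best = total
    return best


print("equal:", brute_force_length("GACC", equal_cost))
print("tv4:  ", brute_force_length("GACC", tv4_cost))
```

The minimum over the 16 reconstructions is the length of this character on the tree. For a full data set the same calculation is repeated for every character and the minima are summed.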
Calculating tree lengths using dynamic programming
• Analogous to pairwise alignment: determine implications of each possible state assignment at one level (node) for length at next level (parent node)
[Figure: the four-taxon tree with tip states G, A, C, and C, internal nodes 1 and 2, and root node 3. Each node carries a cost vector over the states A, C, G, T, computed here under the tv4 cost matrix.]
Tips: cost 0 for the observed state and ∞ for every other state.
Node 1 (children G and A):
A: min(∞,∞,1,∞) + min(0,∞,∞,∞) = 1 + 0 = 1
C: min(∞,∞,4,∞) + min(4,∞,∞,∞) = 4 + 4 = 8
G: min(∞,∞,0,∞) + min(1,∞,∞,∞) = 0 + 1 = 1
T: min(∞,∞,4,∞) + min(4,∞,∞,∞) = 4 + 4 = 8
Node 2 (children C and C):
A: min(∞,4,∞,∞) + min(∞,4,∞,∞) = 4 + 4 = 8
C: min(∞,0,∞,∞) + min(∞,0,∞,∞) = 0 + 0 = 0
G: min(∞,4,∞,∞) + min(∞,4,∞,∞) = 4 + 4 = 8
T: min(∞,1,∞,∞) + min(∞,1,∞,∞) = 1 + 1 = 2
Node 3 (root; children node 1 and node 2):
A: min(1,12,2,12) + min(8,4,9,6) = 1 + 4 = 5
C: min(5,8,5,9) + min(12,0,12,3) = 5 + 0 = 5
G: min(2,12,1,12) + min(9,4,8,6) = 1 + 4 = 5
T: min(5,9,5,8) + min(12,1,12,2) = 5 + 1 = 6
Tree length = minimum over the root states = 5
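The dynamic-programming pass can be sketched as follows. This is an illustration under the same assumptions as the worked numbers: the tv4 cost matrix, and a tree in which node 1 joins the tips G and A and node 2 joins the two C tips. The function names are hypothetical.

```python
INF = float("inf")
STATES = "ACGT"
PURINES = {"A", "G"}


def tv4_cost(x, y):
    """Transitions cost 1 step, transversions cost 4 steps."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 4


def tip_vector(observed):
    """Cost 0 for the observed state, infinity for every other state."""
    return {s: (0 if s == observed else INF) for s in STATES}


def combine(left, right, cost):
    """Cost vector of a parent node given the cost vectors of its two children."""
    return {
        s: min(cost(s, t) + left[t] for t in STATES)
           + min(cost(s, t) + right[t] for t in STATES)
        for s in STATES
    }


node1 = combine(tip_vector("G"), tip_vector("A"), tv4_cost)
node2 = combine(tip_vector("C"), tip_vector("C"), tv4_cost)
root = combine(node1, node2, tv4_cost)
print("node 1:", node1)
print("node 2:", node2)
print("root:  ", root)
print("tree length:", min(root.values()))
```

Each node is visited once, so the work grows linearly with the number of taxa instead of exponentially with the number of internal nodes.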
Faster algorithms for special cases
• Farris (1970) algorithm for ordered characters
• Fitch (1971) algorithm for unordered characters
• Assign “state sets” to terminal taxa based on observed data, and initialize tree length to 0
• Traverse tree from tips to root; for each node consider state sets of two immediate descendants (children)
  – If child state sets have a nonempty intersection, new state set equals this intersection
  – Otherwise, make new state set equal to the union of the two child state sets, and add 1 to the tree length
Example of tree length calculation using Fitch optimization
[Figure: the four-taxon tree with tips {G}:0, {A}:0, {C}:0, and {C}:0, filled in one node at a time.]
Node 1: {G} and {A} do not intersect, so the state set is {A,G} and the length is 1
Node 2: {C} and {C} intersect, so the state set is {C} and the length is 0
Node 3 (root): {A,G} and {C} do not intersect, so the state set is {A,C,G} and the tree length is 2
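The Fitch pass translates almost line for line into a short recursive function. A minimal sketch, assuming the tree is written as nested pairs with the observed states at the leaves (the function name and tree encoding are illustrative choices):

```python
def fitch(tree):
    """Return (state_set, length) for a binary tree given as nested 2-tuples of state strings."""
    if isinstance(tree, str):
        return {tree}, 0  # terminal taxon: its observed state, no steps yet
    left, right = tree
    left_set, left_length = fitch(left)
    right_set, right_length = fitch(right)
    common = left_set & right_set
    if common:
        # Nonempty intersection: keep it, no extra step.
        return common, left_length + right_length
    # Empty intersection: take the union and add one step.
    return left_set | right_set, left_length + right_length + 1


state_set, length = fitch((("G", "A"), ("C", "C")))
print(sorted(state_set), length)
```

For this character the result matches the equal-weights length found by brute force, which is expected because Fitch optimization assumes every change costs one step.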
Searching for trees
• Generation of all possible trees
1. Generate all 3 trees for first 4 taxa:
[Figure: the 3-taxon tree of A, B, and C, with taxon D attached to each of its three branches in turn.]
2. Generate all 15 trees for first 5 taxa:
[Figure: one 4-taxon tree with taxon E attached to each of its five branches in turn (likewise for each of the other two 4-taxon trees).]
3. Full search tree:
[Figure: the 3-taxon tree, the three 4-taxon trees derived from it, and the fifteen 5-taxon trees derived from those.]
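The stepwise generation of all trees can be sketched by attaching each new taxon to every branch of every tree built so far. In this illustration the first taxon is held at the root as a reference tip, so each unrooted tree is written as a nested pair; that encoding and the function names are assumptions made for the example.

```python
def insert_everywhere(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)  # attach to the branch directly above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def all_unrooted_trees(taxa):
    """List every unrooted bifurcating tree, written as (first_taxon, rest_of_tree)."""
    first, second, *rest = taxa
    subtrees = [second]
    for taxon in rest:
        subtrees = [bigger
                    for subtree in subtrees
                    for bigger in insert_everywhere(subtree, taxon)]
    return [(first, subtree) for subtree in subtrees]


for n in (3, 4, 5, 6):
    print(n, "taxa:", len(all_unrooted_trees("ABCDEF"[:n])), "trees")
```

Every added taxon multiplies the number of trees by the number of available branches (3, then 5, then 7, and so on), which is where the (2N - 5)!! count comes from.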
Searching for trees
Branch and bound algorithm:
The branch-and-bound algorithm for exact solution of the problem of finding an optimal parsimony tree. The search tree is the same as for exhaustive search, with tree lengths for a hypothetical data set shown in boldface type. If a tree lying at a node of this search tree has a length that exceeds the current upper bound on the optimal tree length, this path of the search tree is terminated (indicated by a cross-bar), and the algorithm backtracks and takes the next available path. When a tip of the search tree is reached (i.e., when we arrive at a tree containing the full set of taxa), the tree is either optimal (and hence retained) or suboptimal (and rejected). When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most-parsimonious trees will have been identified. Asterisks indicate points at which the current upper bound is reduced. See text for additional explanation. Circled numbers represent the order in which phylogenetic trees are visited in the search tree.
[Figure: the search tree from the exhaustive search, annotated with tree lengths, cross-bars where paths are terminated, asterisks where the bound is reduced, and circled numbers 1 to 34 giving the visiting order.]
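A compact sketch of branch and bound under Fitch parsimony is given below. It is an illustration, not the algorithm as implemented in any particular program: the five 20-site sequences are the Species A to E alignment shown earlier, the first taxon is held at the root as a reference tip, the initial bound is simply infinity, and all function names are hypothetical. The pruning step relies on the fact that adding a taxon can never shorten a tree.

```python
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def insert_everywhere(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_everywhere(left, taxon):
            yield (new_left, right)
        for new_right in insert_everywhere(right, taxon):
            yield (left, new_right)


def tree_length(tree, alignment):
    """Fitch parsimony length of a nested-tuple tree, summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        def walk(node):
            if isinstance(node, str):
                return {alignment[node][site]}, 0
            (ls, ll), (rs, rl) = walk(node[0]), walk(node[1])
            common = ls & rs
            return (common, ll + rl) if common else (ls | rs, ll + rl + 1)
        total += walk(tree)[1]
    return total


def branch_and_bound(alignment):
    """Return (best_length, list_of_best_trees) using a depth-first bounded search."""
    taxa = list(alignment)
    best = {"length": float("inf"), "trees": []}

    def extend(subtree, next_index):
        tree = (taxa[0], subtree)
        length = tree_length(tree, alignment)
        if length > best["length"]:
            return  # bound exceeded: abandon this path of the search tree
        if next_index == len(taxa):
            if length < best["length"]:
                best["length"], best["trees"] = length, [tree]  # bound is reduced
            else:
                best["trees"].append(tree)  # ties with the current best
            return
        for bigger in insert_everywhere(subtree, taxa[next_index]):
            extend(bigger, next_index + 1)

    extend(taxa[1], 2)
    return best["length"], best["trees"]


length, trees = branch_and_bound(ALIGNMENT)
print("shortest length:", length)
for tree in trees:
    print(tree)
```

In practice the initial bound is usually taken from a quick heuristic search, so that pruning starts on the first path instead of after the first complete tree.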
Searching for trees
Heuristic search methods
A greedy stepwise-addition search applied to the example used for branch-and-bound. The best 4-taxon tree is determined by evaluating the lengths of the three trees obtained by joining taxon D to tree 1 containing only the first three taxa. Taxa E and F are then connected to the five and seven possible locations, respectively, on trees 4 and 9, with only the shortest trees found during each step being used for the next step. In this example, the 233-step tree obtained is not a global optimum. Circled numbers indicate the order in which phylogenetic trees are evaluated in the stepwise-addition search.
[Figure: the stepwise-addition search tree, annotated with tree lengths and circled numbers 1 to 16 giving the evaluation order; the 233-step tree is marked with an asterisk.]
Searching for trees
Heuristic search methods continued
Nearest neighbor interchange:
[Figure: a 6-taxon tree (taxa 1 to 6).]
All possible NNIs on 6-taxon tree:
[Figure: the six trees obtained by swapping subtrees across the internal branches of the starting tree.]
Searching for trees
Heuristic search methods continued
Subtree pruning regrafting:
[Figure: a 6-taxon tree with pruning points x, y, and z, and for each pruned subtree the candidate regrafting branches labeled a, b, c, and d.]
Trees resulting from SPR:
[Figure: the rearranged trees, labeled by pruning point and regrafting branch: x.a, x.b, x.c, x.d, y.a, y.b, z.a, z.b, z.c, and z.d.]
Searching for trees
Heuristic search methods continued
Tree bisection-reconnection:
[Figure: a 6-taxon tree bisected at branch x and, in a second panel, at branch y. The branches of the resulting subtrees (labeled r, s, t, u, v, and w) are candidate reconnection points, and each rearranged tree is annotated with its reconnection distances (0, 1, or 2).]
Star-decomposition search
[Figure: starting from a star tree of taxa 1 to 5 (Step 1), alternative pairs of taxa are joined (Step 2), and the process is repeated on the remaining unresolved node (Step 3).]
Other search strategies
• These “hill-climbing” methods work well for up to 20-30 taxa. For larger numbers of taxa, they are highly prone to entrapment in local optima. Therefore, additional strategies may be necessary:
  – Random restart (random trees, stepwise addition with random addition sequences)
  – Other optimization (meta)heuristics: iterated local search (restart after random perturbations); simulated annealing and other stochastic optimization methods
  – Genetic algorithms and other population-based approaches
Overview of maximum likelihood as used in phylogenetics
• Overall goal: Find a tree topology (and associated parameter estimates) that maximizes the probability of obtaining the observed data, given a model of evolution
Likelihood(hypothesis) ∝ Prob(data | hypothesis)
Likelihood(tree, model) = k Prob(observed sequences | tree, model)
[not Prob(tree | data, model)]
Computing the likelihood of a single tree
Aligned sequences for taxa (1) to (4), sites 1 to N, with site j singled out:
       1         j         N
(1)    C…GGACA…  C  …GTTTA…C
(2)    C…AGACA…  C  …CTCTA…C
(3)    C…GGATA…  A  …GTTAA…C
(4)    C…GGATA…  G  …CCTAG…C
[Figure: tree with tips (1), (2), (3), and (4) and internal nodes (5) and (6). At site j the tips carry the states C, C, A, and G.]
The likelihood at site j sums over every assignment of states (x, y) to the internal nodes (5) and (6):
$$L_j = \mathrm{Prob}(C, C, A, G;\ A, A) + \mathrm{Prob}(C, C, A, G;\ A, C) + \cdots + \mathrm{Prob}(C, C, A, G;\ T, T)$$
But use the Felsenstein (1981) pruning algorithm
Over all N sites:
$$L = L_1 L_2 \cdots L_N = \prod_{j=1}^{N} L_j$$
$$\ln L = \ln L_1 + \ln L_2 + \cdots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$
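The pruning calculation can be sketched for a small tree as below. Several choices are assumptions made only for illustration: the Jukes-Cantor model (the simplest of the substitution models), a rooted version of the four-taxon tree that pairs taxa (1) and (2), made-up branch lengths, and a toy alignment built from the five-site fragments visible in the sequences above. With a reversible model the position of the root does not change the likelihood.

```python
import math

STATES = "ACGT"


def jc_prob(i, j, v):
    """Jukes-Cantor probability of ending in state j after branch length v, starting in i."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods at `node` for one site.

    A node is either a tip name (str) or a pair ((child, length), (child, length)).
    `site` maps tip names to observed states.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == site[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, v in node:
        below = partial(child, site)
        for s in STATES:
            result[s] *= sum(jc_prob(s, t, v) * below[t] for t in STATES)
    return result


def site_likelihood(tree, site):
    """Likelihood of one site: conditional likelihoods at the root weighted by base frequencies."""
    root = partial(tree, site)
    return sum(0.25 * root[s] for s in STATES)


def log_likelihood(tree, alignment):
    """ln L = sum over sites of ln L_j."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {name: seq[j] for name, seq in alignment.items()}))
        for j in range(n_sites)
    )


# Hypothetical branch lengths; topology ((1,2),(3,4)) is an assumption.
TREE = (
    ((("1", 0.1), ("2", 0.1)), 0.05),
    ((("3", 0.1), ("4", 0.1)), 0.05),
)
ALIGNMENT = {"1": "GGACA", "2": "AGACA", "3": "GGATA", "4": "GGATA"}

print("L_j for C, C, A, G:", site_likelihood(TREE, {"1": "C", "2": "C", "3": "A", "4": "G"}))
print("ln L for the toy alignment:", log_likelihood(TREE, ALIGNMENT))
```

The recursion computes each internal sum once per node, which avoids enumerating every combination of internal states explicitly.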
Finding the maximum-likelihood tree (in principle)
• Evaluate the likelihood of each possible tree for a given collection of taxa.
• Choose the tree topology which maximizes the likelihood over all possible trees.
Probability calculations require…
• An explicit model of substitution that specifies change probabilities for a given branch length:
$$
Q = \begin{pmatrix}
\pi_A r_{AA} & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\
\pi_A r_{CA} & \pi_C r_{CC} & \pi_G r_{CG} & \pi_T r_{CT} \\
\pi_A r_{GA} & \pi_C r_{GC} & \pi_G r_{GG} & \pi_T r_{GT} \\
\pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \pi_T r_{TT}
\end{pmatrix}
$$
Jukes-Cantor; Kimura 2-parameter; Hasegawa-Kishino-Yano (HKY); Felsenstein 1981, 1984; General time-reversible
• An estimate of optimal branch lengths in units of expected amount of change (ν = rate x time)
$$P(\nu) = e^{Q\nu}$$
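The relationship between the rate matrix and the change probabilities can be illustrated numerically. The sketch below builds Q from base frequencies and relative rates, then evaluates the matrix exponential with a truncated Taylor series. Two conventions here are assumptions rather than something stated on the slide: the diagonal is set so that each row sums to zero, and Q is rescaled so that a branch length of 1 corresponds to one expected substitution per site. The function names are hypothetical.

```python
import math


def build_q(freqs, rates):
    """Rate matrix with off-diagonal entries pi_j * r_ij, rows summing to zero, mean rate 1."""
    n = len(freqs)
    q = [[freqs[j] * rates[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def transition_matrix(q, v, terms=40):
    """P(v) = exp(Q v), approximated by the first `terms` terms of the Taylor series."""
    n = len(q)
    step = [[x * v for x in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, step)]  # (Qv)^k / k!
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result


def jc_closed_form(v, same):
    """Known closed-form Jukes-Cantor probabilities, used here as a check."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


jc_q = build_q([0.25] * 4, [[1.0] * 4 for _ in range(4)])
p = transition_matrix(jc_q, 0.5)
print("numerical:  ", p[0][0], p[0][1])
print("closed form:", jc_closed_form(0.5, True), jc_closed_form(0.5, False))
```

A series expansion is adequate for short branches on a 4 x 4 matrix; production software typically uses an eigendecomposition of Q instead.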
A Family of Reversible Substitution Models
[Figure: nested hierarchy of models, from the most general to the most restricted. Each step applies one restriction.]
• GTR (general time-reversible)
• SYM: GTR with equal base frequencies
• TrN (Tamura-Nei): GTR with 3 substitution types (transversions, 2 transition classes)
• K3ST (Kimura 3-subst. type): SYM with 3 substitution types (transitions, 2 transversion classes)
• HKY85 (Hasegawa-Kishino-Yano) and F84 (Felsenstein): TrN with 2 substitution types (transitions vs. transversions)
• K2P (Kimura 2-parameter): K3ST with 2 substitution types (transitions vs. transversions), or HKY85 with equal base frequencies
• F81 (Felsenstein): HKY85 with a single substitution type
• JC (Jukes-Cantor): F81 with equal base frequencies, or K2P with a single substitution type
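E.g., transition probabilities for HKY and F84:
$$
P_{ij}(\nu) =
\begin{cases}
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} + \left( \dfrac{\Pi_j - \pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i = j) \\[2ex]
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu\nu} - \left( \dfrac{\pi_j}{\Pi_j} \right) e^{-\mu\nu A} & (i \neq j,\ \text{transition}) \\[2ex]
\pi_j \left( 1 - e^{-\mu\nu} \right) & (i \neq j,\ \text{transversion})
\end{cases}
$$
Here Π_j is the combined frequency of the purines (π_A + π_G) when j is a purine, or of the pyrimidines (π_C + π_T) when j is a pyrimidine.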
The Relevance of Branch Lengths
[Figure: the site pattern C C A A A A A A A A mapped onto trees with different branch lengths, with the states A and C shown at the nodes.]
Concerns about statistical properties and suitability of models (assumptions)
Consistency
If an estimator converges to the true value of a parameter as the amount of data increases toward infinity, the estimator is consistent.
Two levels of maximization
• Nei (1987)
  – “…the likelihood computed in this method is conditional for each topology, so it is not clear whether or not the topology showing the highest likelihood has the highest probability of being the true topology…”
• Yang (1996)
  – “Literally it is a maximum maximum likelihood method… The failure to recognize the complexity of the problem has caused much controversy … Felsenstein (1973, 1978) referred to the regularity conditions of Wald (1949) for a proof of …consistency. These conditions would include the continuity and differentiability of the likelihood function with respect to the topology parameter. These concepts are not defined.”
“Likelihood” is consistent.
• Two proofs:
  – Chang (1996) in Mathematical Biosciences
  – Rogers (1997) in Systematic Biology
These proofs establish that the probability that the true tree has a higher likelihood than any other possible tree approaches one as the number of sites (characters) increases toward infinity. Chang called his proof a “customized variant of the fundamental consistency result of Wald.”
When does maximum likelihood work better than parsimony?
• When you’re in the “Felsenstein Zone” (Felsenstein, 1978)
[Figure: four-taxon tree of A, B, C, and D.]
In the Felsenstein Zone
Substitution rates:
     A   C   G   T
A    -   5   6   2
C    5   -   3   8
G    6   3   -   1
T    2   8   1   -
Base frequencies: A=0.1, C=0.2, G=0.3, T=0.4
[Figure: four-taxon tree of A, B, C, and D with branch lengths 0.1, 0.1, 0.1, 0.8, and 0.8.]
[Figure: proportion correct (0 to 1) against sequence length (0 to 10,000) for parsimony and ML-GTR.]
The long-branch attraction (LBA) problem
[Figure: the true phylogeny of taxa 1, 2, 3, and 4, with taxa 1 and 4 drawn at the top and taxa 2 and 3 at the bottom.]
Site patterns on the true phylogeny:
Pattern type                    1   4   2   3   Steps
I = Uninformative (constant)    A   A   A   A   zero changes required on any tree
II = Uninformative              A   G   A   A   one change required on any tree
III = Uninformative             C   G   A   A   two changes required on any tree
IV = Misinformative             G   G   A   A   two changes required on true tree
[Figure: alternative tree that groups taxa 1 and 4 (both G) apart from taxa 2 and 3 (both A).]
… but this tree needs only one step
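The step counts for the four pattern types can be reproduced with Fitch counting. The sketch assumes that the true tree pairs taxon 1 with taxon 2 and taxon 3 with taxon 4, so that the two taxa sharing G in pattern IV are not sisters, and that the alternative tree pairs taxon 1 with taxon 4. Both topologies and the names used are assumptions for illustration.

```python
def fitch_steps(tree, states):
    """Number of Fitch parsimony steps for one site on a nested-tuple tree of taxon names."""
    def walk(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (ls, ll), (rs, rl) = walk(node[0]), walk(node[1])
        common = ls & rs
        return (common, ll + rl) if common else (ls | rs, ll + rl + 1)
    return walk(tree)[1]


TRUE_TREE = (("1", "2"), ("3", "4"))        # assumed true phylogeny
ATTRACTED_TREE = (("1", "4"), ("2", "3"))   # taxa 1 and 4 grouped together

# States listed for taxa 1, 2, 3, 4.
PATTERNS = {"I": "AAAA", "II": "AAAG", "III": "CAAG", "IV": "GAAG"}

for name, pattern in PATTERNS.items():
    states = dict(zip("1234", pattern))
    print(name, fitch_steps(TRUE_TREE, states), fitch_steps(ATTRACTED_TREE, states))
```

Only pattern IV distinguishes the two trees, and it favors the wrong one, which is why parsimony is pulled toward grouping the two divergent taxa when such sites are common.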
When do both methods fail?
• When there is insufficient phylogenetic signal...
[Figure: four-taxon tree of taxa 1, 2, 3, and 4.]
When does parsimony work “better” When does parsimony work “better” than maximum likelihood?than maximum likelihood?
When does parsimony work “better” When does parsimony work “better” than maximum likelihood?than maximum likelihood?
• When you’re in the Inverse-Felsenstein (“Farris”) zoneWhen you’re in the Inverse-Felsenstein (“Farris”) zone
AA
BB
CC
DD
(Siddall, 1998)(Siddall, 1998)
Siddall (1998) parameter space
[Figure: four-taxon tree with branches labeled a and b, and a plot of pa against pb (each axis from 0 to 0.75) divided into three regions: "Both methods do poorly", "Parsimony has higher accuracy than likelihood" and "Both methods do well"]
Parsimony vs. likelihood in the Inverse-Felsenstein Zone
[Figure: accuracy (0 to 1) against sequence length (20, 100, 1,000, 10,000, 100,000) for Parsimony and ML/JC; the branches of the simulation tree are labeled 15% and 67.5% (expected differences/site)]
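The branch labels in this figure are expected proportions of differing sites, not substitutions per site. Assuming the Jukes-Cantor model (suggested by the ML/JC legend), a minimal sketch of the conversion between the two scales follows; the function names are hypothetical.

```python
import math


def jc_distance(p):
    """Substitutions/site implied by an expected proportion p of differing sites (JC69)."""
    if not 0 <= p < 0.75:
        raise ValueError("under Jukes-Cantor, p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_proportion(d):
    """Expected proportion of differing sites after d substitutions/site (JC69)."""
    return 0.75 * (1 - math.exp(-4 * d / 3))


for p in (0.15, 0.675):
    print(p, round(jc_distance(p), 3))
```

Under this assumption the 15% branches correspond to roughly 0.17 substitutions/site and the 67.5% branches to roughly 1.73, which is close to the saturation limit of 75% differences.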
Why does parsimony do so well in the Inverse-Felsenstein zone?
[Figure: four-taxon trees with states A, C and G at the tips, contrasting a true synapomorphy with apparent synapomorphies that are actually due to misinterpreted homoplasy]
[Figure: two plots with tick marks running from 0 to 0.7 and from 0 to 1, drawn for trees whose branches are labeled x and y (first plot) and p and q (second plot).
First plot: proportion of parsimony-informative sites for which ancestral states are correctly reconstructed and interpreted as synapomorphies.
Second plot: proportion of parsimony-informative sites that are interpreted as synapomorphies but are actually misinterpreted homoplasies.]
Parsimony vs. likelihood in the Felsenstein Zone
[Figure: accuracy (0 to 1) against sequence length (20, 100, 1,000, 10,000, 100,000) for Parsimony and ML/JC; the branches of the simulation tree are labeled 15% and 67.5% (expected differences/site)]
From the Farris Zone to the Felsenstein Zone
[Figure: series of four-taxon trees of A, B, C and D running from the Farris zone to the Felsenstein zone]
External branches = 0.5 or 0.05 substitutions/site, Jukes-Cantor model of nucleotide substitution
Simulation results:
[Figure: two panels, Parsimony and Likelihood (ML/JC), each plotting accuracy (0 to 1.0) against length of internal branch (d), from 0.05 in the Farris zone through 0 to 0.05 in the Felsenstein zone, for 100, 1,000 and 10,000 sites]
Maximum likelihood models are oversimplifications of reality. If I assume the wrong model, won’t my results be meaningless?
• Not necessarily (maximum likelihood is pretty robust)
Returning to earlier example...
Substitution rates:
     A    C    G    T
A    -    5    6    2
C    5    -    3    8
G    6    3    -    1
T    2    8    1    -
Base frequencies: A=0.1 C=0.2 G=0.3 T=0.4
[Figure: four-taxon tree of A, B, C and D with branch lengths 0.1, 0.1, 0.1, 0.8 and 0.8]
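One way to read these numbers is as the relative rates and equilibrium frequencies of a general time-reversible (GTR) model. That reading is an assumption, since the slide only says "substitution rates". Under it, a minimal sketch of the instantaneous rate matrix, scaled to one expected substitution per unit time, looks like this (`gtr_rate_matrix` is a hypothetical name):

```python
BASES = "ACGT"

# Values copied from the slide; interpreting them as GTR parameters is an assumption.
RATES = {("A", "C"): 5, ("A", "G"): 6, ("A", "T"): 2,
         ("C", "G"): 3, ("C", "T"): 8, ("G", "T"): 1}
FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}


def gtr_rate_matrix(rates, freqs):
    """Build Q with q[i][j] = rate(i, j) * freq(j) for i != j, rows summing to zero."""
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                rate = rates[(i, j)] if (i, j) in rates else rates[(j, i)]
                q[i][j] = rate * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Rescale so that the mean substitution rate at equilibrium equals 1.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mean_rate
    return q


Q = gtr_rate_matrix(RATES, FREQS)
for i in BASES:
    print(i, [round(Q[i][j], 3) for j in BASES])
```

Simpler models such as JC, K2P or HKY constrain some of these rates or frequencies to be equal, so fitting them to data generated under the matrix above violates their assumptions.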
Performance of ML when its model is violated (one example)
[Figure: vertical axis from 0 to 1 against Sequence Length (100, 1000, 10000) for parsimony, ML-JC, ML-K2P, ML-HKY and ML-GTR]
Performance of ML when its model is violated (another example)
Modeling among-site rate variation with a gamma distribution...
[Figure: Frequency (0 to 0.08) against Rate (0 to 2) for gamma distributions with shape parameter α = 0.5, 2, 50 and 200]
… can also estimate a proportion of “invariable” sites (pinv)
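The effect of the shape parameter can be seen by evaluating the gamma density directly. The sketch below assumes the usual phylogenetic convention that the distribution is scaled to a mean rate of 1 (shape α, rate α), which the slide does not state; `gamma_rate_density` is a hypothetical name.

```python
import math


def gamma_rate_density(r, alpha):
    """Density at rate r > 0 of a gamma distribution with shape alpha and mean 1."""
    # Work on the log scale: Gamma(alpha) overflows a float for alpha = 200.
    log_density = (alpha * math.log(alpha) + (alpha - 1.0) * math.log(r)
                   - alpha * r - math.lgamma(alpha))
    return math.exp(log_density)


for alpha in (0.5, 2, 50, 200):
    densities = [round(gamma_rate_density(r, alpha), 3) for r in (0.25, 1.0, 1.75)]
    print(f"alpha={alpha}: density at r=0.25, 1.0, 1.75 -> {densities}")
```

With the mean fixed at 1 the variance is 1/α, so small α (such as 0.5) gives strong rate heterogeneity with many nearly invariant sites, while large α (such as 200) approaches equal rates across sites.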
Performance of ML when its model is violated (another example)
[Figure: grid of panels plotting Proportion Correct (0 to 1) against Sequence Length (100, 1000, 10000, 100000). Column headings: Tree; α = 0.5, pinv = 0.5; α = 1.0, pinv = 0.5; α = 1.0, pinv = 0.2. Methods compared: GTRig, HKYig, GTRg, HKYg, GTRi, HKYi, GTRer, HKYer and parsimony]
[Figure: three panels titled “MODERATE”–Felsenstein zone (α = 1.0, pinv = 0.5), “MODERATE”–Inverse-Felsenstein zone and “MODERATE”–Equal branch lengths, each with a vertical axis from 0 to 1 against sequence lengths of 100, 1000, 10000 and 100000. Methods compared: JCer, JC+G, JC+I, JC+I+G, GTRer, GTR+G, GTR+I, GTR+I+G and parsimony]
Extension to more taxa...
[Figure: Proportion Correct (0 to 1) against Sequence Length (200, 1000, 10000) for HKY+I+Γ, HKY+Γ, HKY+I, HKYer and parsimony]
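Distance methods
[Figure: unrooted four-taxon tree in which A and B attach by branches v1 and v2, C and D attach by branches v4 and v5, and v3 is the internal branch]
"Input" distance matrix:
      A     B     C     D
A     -    dAB   dAC   dAD
B    dBA    -    dBC   dBD
C    dCA   dCB    -    dCD
D    dDA   dDB   dDC    -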
Distances are "additive" if, e.g.:
pAB = v1 + v2 = dAB
pAC = v1 + v3 + v4 = dAC
pAD = v1 + v3 + v5 = dAD
pBC = v2 + v3 + v4 = dBC
pBD = v2 + v3 + v5 = dBD
pCD = v4 + v5 = dCD
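The six path-length equations above can be written as a small lookup table. The sketch below uses hypothetical branch lengths and a hypothetical helper name (`fits_tree_exactly`) to show a distance matrix that is additive on this tree and one that is not.

```python
# Branches on the path between each pair of tips, copied from the equations above.
PATHS = {
    ("A", "B"): ("v1", "v2"),
    ("A", "C"): ("v1", "v3", "v4"),
    ("A", "D"): ("v1", "v3", "v5"),
    ("B", "C"): ("v2", "v3", "v4"),
    ("B", "D"): ("v2", "v3", "v5"),
    ("C", "D"): ("v4", "v5"),
}


def path_lengths(v):
    """Return the path-length distance p_ij for every pair of tips."""
    return {pair: sum(v[name] for name in path) for pair, path in PATHS.items()}


def fits_tree_exactly(d, v, tol=1e-9):
    """True if every input distance d_ij equals the path length p_ij for branch lengths v."""
    p = path_lengths(v)
    return all(abs(p[pair] - d[pair]) <= tol for pair in PATHS)


# Hypothetical branch lengths, for illustration only.
v = {"v1": 0.10, "v2": 0.20, "v3": 0.05, "v4": 0.30, "v5": 0.15}
d_additive = path_lengths(v)
d_noisy = dict(d_additive)
d_noisy[("A", "C")] += 0.02      # perturb one distance, as estimation error would

print(fits_tree_exactly(d_additive, v))  # True
print(fits_tree_exactly(d_noisy, v))     # False
```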
Distances in general will not be additive, so choose optimal tree according to one of the following criteria (objective functions):
"Goodness-of-fit": minimize $\sum_{i<j} w_{ij}\,|p_{ij} - d_{ij}|^{r}$
Typically, r = 2 (least-squares) and $w_{ij} = 1/d_{ij}^{2}$ ("Fitch-Margoliash" method)
"Minimum-evolution": minimize $\sum_{k=1}^{\#\text{branches}} v_k$ or $\sum_{k=1}^{\#\text{branches}} |v_k|$
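Both criteria are short sums once the path lengths and branch lengths of a candidate tree are known. A minimal sketch follows, with hypothetical function names and illustrative numbers that are not taken from any data set.

```python
def goodness_of_fit(p, d, r=2, weight=lambda dij: 1.0 / dij ** 2):
    """Sum over pairs of w_ij * |p_ij - d_ij|**r.

    The defaults (r = 2, w_ij = 1/d_ij**2) correspond to the Fitch-Margoliash
    weighting; pass weight=lambda dij: 1.0 for unweighted least squares.
    """
    return sum(weight(d[pair]) * abs(p[pair] - d[pair]) ** r for pair in d)


def minimum_evolution(v, absolute=False):
    """Sum of branch lengths, or of their absolute values if absolute=True."""
    return sum(abs(length) if absolute else length for length in v.values())


# Illustrative values only: input distances d, and the path lengths p and
# branch lengths v of one candidate tree.
d = {("A", "B"): 0.30, ("C", "D"): 0.50}
p = {("A", "B"): 0.33, ("C", "D"): 0.50}
v = {"v1": 0.10, "v2": 0.23, "v3": -0.01, "v4": 0.30, "v5": 0.20}

print(goodness_of_fit(p, d))
print(minimum_evolution(v), minimum_evolution(v, absolute=True))
```

The two minimum-evolution variants differ only when a fitted branch length is negative, which least-squares fitting can produce unless branch lengths are constrained.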
Neighbor joining:
A fast approximation to full searching under the minimum-evolution criterion using star-decomposition with iteratively updated branch lengths
Uses the relationship:
$d_{AX} = (d_{AB} + d_{AC} - d_{BC})/2$
(etc.)
[Figure: three-taxon star tree joining A, B and C at internal node X]
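The "(etc.)" refers to the analogous formulas for the other two branches of the three-taxon star. A minimal sketch with a hypothetical function name and illustrative distances follows; it covers only this three-point relationship, not the full neighbor-joining algorithm.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths from A, B and C to the internal node X of a three-taxon star."""
    d_ax = (d_ab + d_ac - d_bc) / 2
    d_bx = (d_ab + d_bc - d_ac) / 2
    d_cx = (d_ac + d_bc - d_ab) / 2
    return d_ax, d_bx, d_cx


# Illustrative pairwise distances.
d_ax, d_bx, d_cx = three_point_branch_lengths(d_ab=0.3, d_ac=0.5, d_bc=0.6)
print(round(d_ax, 3), round(d_bx, 3), round(d_cx, 3))  # 0.1 0.2 0.4
```

Each pairwise distance is recovered as the sum of the two branches on its path, for example d_AX + d_BX = d_AB.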
Bayesian Inference in Phylogenetics
• Uses Bayes formula:
$\Pr(\theta \mid D) = \dfrac{\Pr(D \mid \theta)\,\Pr(\theta)}{\Pr(D)} \propto \Pr(D \mid \theta)\,\Pr(\theta) \propto L(\theta)\,\Pr(\theta)$
• Calculation involves integrating over all tree topologies and model-parameter values, subject to assumed prior distribution on parameters
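For a handful of topologies the normalization in Bayes formula can be done directly. The sketch below is a simplification: it treats each topology's log-likelihood as already integrated over (or fixed at) its model parameters, and the numbers and function name are hypothetical.

```python
import math


def posterior_probabilities(log_likelihoods, priors):
    """Posterior probability of each hypothesis from log-likelihoods and priors."""
    log_joint = {t: log_likelihoods[t] + math.log(priors[t]) for t in log_likelihoods}
    # Subtract the maximum before exponentiating so the sums do not underflow.
    largest = max(log_joint.values())
    unnormalized = {t: math.exp(value - largest) for t, value in log_joint.items()}
    evidence = sum(unnormalized.values())    # plays the role of Pr(D), up to a constant
    return {t: value / evidence for t, value in unnormalized.items()}


# Hypothetical log-likelihoods for the three unrooted four-taxon topologies.
log_likelihoods = {"((A,B),(C,D))": -1000.0, "((A,C),(B,D))": -1002.0, "((A,D),(B,C))": -1005.0}
uniform_prior = {t: 1.0 / 3.0 for t in log_likelihoods}

for tree, prob in posterior_probabilities(log_likelihoods, uniform_prior).items():
    print(tree, round(prob, 4))
```

With more taxa the number of topologies, together with the continuous parameters, makes this direct sum impractical.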
Bayesian Inference in Phylogenetics
• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state θ to another state θ* with probability min(r,1) where
$r = \dfrac{\Pr(\theta^{*} \mid D)\,\Pr(\theta \mid \theta^{*})}{\Pr(\theta \mid D)\,\Pr(\theta^{*} \mid \theta)}$
– Sample chain at regular intervals to approximate posterior distribution
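A toy version of this sampler, for a single parameter rather than a tree, is sketched below. Everything specific in it is an assumption made for illustration: the data (100 differing sites out of 1,000), the Jukes-Cantor likelihood for one pairwise distance, the exponential prior, and the proposal width.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate=10.0):
    """Unnormalized log posterior of a JC69 distance d (substitutions/site)."""
    if d <= 0:
        return -math.inf
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))   # probability that a site differs
    if p <= 0.0:
        return -math.inf
    log_lik = n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
    log_prior = math.log(prior_rate) - prior_rate * d   # exponential prior (assumed)
    return log_lik + log_prior


def metropolis_hastings(n_sites, n_diff, n_steps=20000, step=0.02, thin=10, seed=1):
    """Sample d from its posterior; returns every `thin`-th state of the chain."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff)
    samples = []
    for i in range(n_steps):
        # Uniform step reflected at zero: a symmetric proposal, so the proposal
        # ratio in r equals 1 and r reduces to the ratio of posteriors.
        proposal = abs(d + rng.uniform(-step, step))
        candidate = log_posterior(proposal, n_sites, n_diff)
        log_r = candidate - current
        if log_r >= 0 or rng.random() < math.exp(log_r):   # accept with prob min(r, 1)
            d, current = proposal, candidate
        if i % thin == 0:                                   # sample at regular intervals
            samples.append(d)
    return samples


samples = metropolis_hastings(n_sites=1000, n_diff=100)
kept = samples[200:]                                        # discard early samples as burn-in
print(round(sum(kept) / len(kept), 3))
```

Only the ratio of posteriors is needed, so the normalizing constant Pr(D) never has to be computed. In phylogenetic MCMC the state also includes the tree topology and the other model parameters, and the proposals are correspondingly more elaborate.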
Bayesian Inference in Phylogenetics
• To approximate this posterior density (complicated multidimensional integral) we use Markov chain Monte Carlo (MCMC)
– Simulated Markov chain in which transition probabilities are assigned such that the stationary distribution of the chain is the posterior density of interest
– E.g., Metropolis-Hastings algorithm: Accept a proposed move from one state to another with probability min(r,1) where