algorithms in computational biology
Department of Mathematics & Computer Science Algorithms in Computational Biology 11
Algorithms in Computational Biology
Building Phylogenetic Trees
Phylogeny
• All organisms on Earth had a common ancestor
  • Evidence from morphological, biochemical, and gene sequence data
• Phylogeny
  • The history of organismal lineages as they change through time
• Phylogenetic tree
  • A tree showing the evolutionary relationships among various biological species
  • All living organisms today, from the smallest microbe to the largest plants and animals, are connected by the passage of genes along the branches of the phylogenetic tree
Phylogenetic Tree of Life
Inferring Phylogenies
• Traditionally
  • Use morphological characters (both from living and fossilized organisms)
• 1962
  • Zuckerkandl & Pauling showed that molecular sequences can be used to infer phylogenies
  • Assumes current sequences descended from some common ancestral gene in a common ancestral species
Major Tree Building Algorithms
• Distance based
• Parsimony
• Maximum likelihood
Orthologue vs Paralogue
• Both of them are homologous genes (homologues)
• Orthologues are a set of genes that diverged from a common ancestor through speciation
  • Homologous genes from different species
• Paralogues are a set of genes that diverged from a common ancestor through gene duplication
  • Homologous genes from the same species
A Tree of Orthologues
A tree of orthologues based on a set of alpha hemoglobins
A Tree of Paralogues
Background on Trees
• Nodes and edges
  • Internal nodes: unobserved ancestors
  • Edge length
    • On average, corresponds to an evolutionary time period
  • Variations
    • Different proteins can change at different rates
    • The same sequence can evolve much faster in some organisms than in others
• Root of a phylogenetic tree
  • Ultimate ancestor of all species
  • Some algorithms provide the location of the root, while others don't
Counting and Labeling Trees
• Counting
  • For a rooted tree with n leaves
    • As we move up the tree, the edges coalesce as each new node is reached
    • In addition to the n leaves, there are n-1 nodes (internal nodes plus root node)
    • A total of 2n-1 nodes
    • There will be 2n-2 edges (discounting the edge above the root node)
  • For an unrooted tree with n leaves
    • Total number of nodes = 2n-2
    • Total number of edges = 2n-3
• Labeling (for rooted tree)
  • Label the leaves using 1 to n
  • Label the branch nodes using n+1 to 2n-2
  • Label the root using 2n-1
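The counts and the labeling scheme above can be checked with a few lines of Python. This is only a sketch: it assumes strictly binary trees, and the function names `tree_sizes` and `rooted_labels` are made up for illustration.

```python
def tree_sizes(n_leaves, rooted=True):
    """Return (number of nodes, number of edges) for a binary tree."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    if rooted:
        # n leaves + (n - 1) internal nodes including the root
        return 2 * n_leaves - 1, 2 * n_leaves - 2
    # removing the root merges its two edges into one
    return 2 * n_leaves - 2, 2 * n_leaves - 3


def rooted_labels(n_leaves):
    """Label ranges for a rooted tree: leaves, branch nodes, root."""
    return {
        "leaves": range(1, n_leaves + 1),                        # 1 .. n
        "branch_nodes": range(n_leaves + 1, 2 * n_leaves - 1),   # n+1 .. 2n-2
        "root": 2 * n_leaves - 1,
    }


nodes, edges = tree_sizes(4)
print(nodes, edges)
print(list(rooted_labels(4)["branch_nodes"]))
```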
Rooting an Unrooted Tree
[Figure: an unrooted tree on leaves 1, 2, 3 and the rooted trees obtained by placing a root on each of its edges]
How Many Possible Topologies?
# of leaves | Ways to add nth leaf | # of edges in the sub-tree | # of un-rooted trees
4           | 3                    | 5                          | 3
5           | 5                    | 7                          | 3x5
6           | 7                    | 9                          | 3x5x7
7           | 9                    | 11                         | 3x5x7x9
…           | …                    | …                          | …
n           | 2n-5                 | 2n-3                       | 3x5x7x9x…x(2n-5) = (2n-5)!!
# of rooted trees: (2n-3)!!
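To get a feel for how fast these counts grow, the double factorials in the table can be evaluated directly. The helper names below are illustrative and not from the slides.

```python
def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 (returns 1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_unrooted(n_leaves):
    """Number of unrooted binary topologies on n >= 3 labeled leaves."""
    return double_factorial(2 * n_leaves - 5)


def count_rooted(n_leaves):
    """Number of rooted binary topologies on n >= 2 labeled leaves."""
    return double_factorial(2 * n_leaves - 3)


for n in (4, 5, 10, 20):
    print(n, count_unrooted(n), count_rooted(n))
```

Even for 10 leaves there are already about two million unrooted topologies, which is why the later slides turn to branch and bound instead of exhaustive search.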
Making a Tree from Pairwise Distances
• Distance measure
  • First find f, the fraction of sites that differ between two sequences, presupposing an alignment of the two sequences
  • The fraction of differences expected by chance (by random substitution) is about 3/4
  • Jukes-Cantor distance (estimated number of substitutions per site):
    $$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4f}{3}\right)$$
• Clustering methods
  • UPGMA
  • Neighbor-joining
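A minimal sketch of the distance computation, assuming two already aligned sequences of equal length and the natural logarithm. The function names are illustrative, and gap handling is deliberately left out.

```python
import math


def fraction_differences(seq_a, seq_b):
    """Fraction f of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal length, non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor(f):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4f/3)."""
    if f >= 0.75:
        # at or beyond the fraction expected by chance the distance diverges
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


f = fraction_differences("ACGTACGT", "ACGTACGA")
print(f, jukes_cantor(f))
```

The distance is always at least f, because it corrects for sites that changed more than once.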
Unweighted Pair Group Method Using Arithmetic Average (UPGMA)
[Sokal & Michener, 1958]
Overview
1. Cluster the sequences
2. Amalgamate two clusters at each stage, create a new node on a tree
3. Assemble the tree upwards, each node being added above the others
4. The edge length is determined by the difference in the heights of the nodes at the top and bottom of an edge
Distance Measure Used in UPGMA
Distance between two clusters Ci and Cj is the average distance between pairs of sequences, one from each cluster:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
Distance between two clusters Ck and Cl, if Ck is the union of two clusters Ci and Cj:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
Algorithm: UPGMA
Initialization:
  Assign each sequence i to its own cluster Ci
  Define one leaf of T for each sequence, and place it at height zero
Iteration:
  Determine the two clusters i, j for which dij is minimal (if there are ties, pick one randomly)
  Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l using the arithmetic average
  Define a node k with daughter nodes i and j, and place it at height dij/2
  Add k to the current clusters and remove i and j
Termination:
  When only two clusters i, j remain, place the root at height dij/2
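The steps above can be sketched in Python as follows. This is one possible reading of the algorithm, not a reference implementation. It assumes distances are given as a dict keyed by `frozenset({a, b})`, represents each new node as the tuple of its two daughters, and breaks ties by taking the first minimal pair instead of a random one. The name `upgma` and the data layout are illustrative choices.

```python
from itertools import combinations


def upgma(names, d):
    """Return (root, height) where root is a nested tuple of leaf names
    and height maps every node to its height in the tree."""
    dist = dict(d)                          # working copy of the distances
    size = {name: 1 for name in names}      # current clusters and their sizes
    height = {name: 0.0 for name in names}  # leaves sit at height zero
    while len(size) > 1:
        # pick the closest pair of current clusters
        i, j = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        d_ij = dist[frozenset((i, j))]
        n_i, n_j = size.pop(i), size.pop(j)
        k = (i, j)                          # new node, named by its daughters
        for l in size:
            # size-weighted arithmetic average of the two old distances
            dist[frozenset((k, l))] = (
                dist[frozenset((i, l))] * n_i + dist[frozenset((j, l))] * n_j
            ) / (n_i + n_j)
        size[k] = n_i + n_j
        height[k] = d_ij / 2
    (root,) = size
    return root, height


d = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
root, height = upgma(["A", "B", "C"], d)
print(root, height[root])
```

The length of the edge from a node to its parent is `height[parent] - height[node]`. The final merge of the last two clusters plays the role of the termination step.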
An Example
Cont’
Molecular Clock Assumption in UPGMA
• UPGMA produces a rooted tree
  • Edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  • The sum of times down a path to the leaves from any node is the same, whatever the path
• The distances dij are said to be ultrametric if, for any triplet of sequences xi, xj, xk, the distances dij, djk, dik are either all equal, or two are equal and the remaining one is smaller
  • True for a tree with a molecular clock
• Implied additivity
  • The edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
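The ultrametric condition can be tested directly on a distance table. The sketch below assumes distances stored in a dict keyed by `frozenset({a, b})`. The name `is_ultrametric` and the tolerance `tol` are illustrative. "All equal, or two equal and the third smaller" is the same as saying that the two largest of the three distances are equal.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Check the three-point condition for every triplet of sequences."""
    for i, j, k in combinations(names, 3):
        low, mid, high = sorted(
            (d[frozenset((i, j))], d[frozenset((j, k))], d[frozenset((i, k))])
        )
        if abs(high - mid) > tol:   # the two largest distances must agree
            return False
    return True


clock_like = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
print(is_ultrametric(["A", "B", "C"], clock_like))
```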
Molecular Clocks
• Mutations may build up in any given stretch of DNA at a reliable rate
• If the rate of mutation of a gene is reliable, this gene can be used as a molecular clock
• This gene can be a powerful tool for estimating the dates of lineage-splitting events.
Example
The entire length of DNA of a gene changes at a rate of approximately one base per 25 million years
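As a small worked example of using such a clock: if two present-day sequences differ at some number of bases, both lineages have been accumulating changes since the split, so the elapsed time is half of what the raw count suggests. The function name and the count of four differences below are hypothetical, and the sketch ignores repeated changes at the same site.

```python
def divergence_time_my(n_differences, my_per_change=25.0):
    """Estimated time since two lineages split, in millions of years.

    Assumes a constant rate of one change per `my_per_change` million years
    in each lineage, so the differences accumulate twice as fast between
    the two lineages as within one.
    """
    return n_differences * my_per_change / 2


print(divergence_time_my(4))
```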
What If Molecular Clock Property Fails?
[Figure: a tree with leaves 1, 2, 3, 4 (left) and the tree reconstructed from it by UPGMA (right)]
A tree that is reconstructed incorrectly by UPGMA (right)
Additivity
• Given a tree, its edge lengths are additive
  • If the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
• Built-in assumption in UPGMA
Test for Additivity
• For every set of four leaves 1, 2, 3, and 4, two of the three sums d12 + d34, d13 + d24, and d14 + d23 must be equal and no smaller than the third
[Figure: a tree with four leaves labeled 1, 2, 3, 4]
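A sketch of this four-point test in Python, assuming distances are stored in a dict keyed by `frozenset({a, b})`. The name `is_additive` and the tolerance are illustrative. The usage example takes its values from the d table on the "Closest Pairs of Leaves" slide.

```python
from itertools import combinations


def is_additive(names, d, tol=1e-9):
    """Four-point condition: for every quartet, the two largest of the
    three pairwise sums must be equal."""
    def dist(a, b):
        return d[frozenset((a, b))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted((
            dist(a, b) + dist(c, e),
            dist(a, c) + dist(b, e),
            dist(a, e) + dist(b, c),
        ))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


d = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
    frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
}
print(is_additive([1, 2, 3, 4], d))
```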
Joining a Pair of Neighboring Leaves
[Figure: a tree with leaves i, j, m and internal node k]
dim = dik + dkm
djm = djk + dkm
dij = dik + djk
Hence dkm = 0.5(dim + djm - dij)
Node k joins leaf nodes i and j
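Closest Pairs of Leaves Are Not Necessarily Neighboring Leaves
[Figure: a tree with leaves 1, 2, 3, 4 in which leaves 1 and 3 are neighbors and leaves 2 and 4 are neighbors; the edges to leaves 1 and 2 and the internal edge have length 0.1, and the edges to leaves 3 and 4 have length 0.4]
d Table
      1     2     3     4
1
2    0.3
3    0.5   0.6
4    0.6   0.5   0.9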
Compensation for Long Edges
$$D_{ij} = d_{ij} - (r_i + r_j), \quad \text{where } r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$$
and |L| is the size of the set L of leaves
D Table
      1      2      3      4
1
2   -1.1
3   -1.2   -1.1
4   -1.1   -1.2   -1.1
r1 = 0.7, r2 = 0.7, r3 = 1, r4 = 1
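The r values and the D table can be reproduced from the d table with a short script. The dict-of-`frozenset` layout and the name `corrected_distances` are illustrative choices, not part of the slides.

```python
from itertools import combinations


def corrected_distances(names, d):
    """Return (r, D) with r_i = sum_k d_ik / (|L| - 2) and
    D_ij = d_ij - (r_i + r_j)."""
    n = len(names)
    r = {
        i: sum(d[frozenset((i, k))] for k in names if k != i) / (n - 2)
        for i in names
    }
    D = {
        frozenset((i, j)): d[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(names, 2)
    }
    return r, D


d = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
    frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
}
r, D = corrected_distances([1, 2, 3, 4], d)
for pair in sorted(D, key=sorted):
    print(sorted(pair), round(D[pair], 3))
```

Although leaves 1 and 2 are the closest pair in d, the smallest D values belong to the true neighbor pairs (1, 3) and (2, 4).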
Algorithm: Neighbor-Joining
Initialization:
  Define T to be the set of leaf nodes, one for each given sequence, and put L = T.
Iteration:
  Pick a pair i, j in L for which Dij is minimal.
  Define a new node k and set dkm = 0.5(dim + djm - dij), for all m in L.
  Add k to T with edges of lengths dik = 0.5(dij + ri - rj), djk = dij - dik, joining k to i and j, respectively.
  Remove i and j from L and add k.
Termination:
  When L consists of two leaves i and j, add the remaining edge between i and j, with length dij.
Produces an unrooted tree
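A compact sketch of the procedure above, under some assumptions: distances are a dict keyed by `frozenset({a, b})`, new internal nodes get made-up names `node0`, `node1`, ... (so leaf names must not look like that), ties are broken by taking the first minimal pair, and the tree is returned as a plain edge list. The function name is illustrative.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """Return the unrooted tree as a list of (node, node, length) edges."""
    dist = dict(d)
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        n = len(L)
        r = {i: sum(dist[frozenset((i, m))] for m in L if m != i) / (n - 2)
             for i in L}
        # minimise D_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(L, 2),
                   key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ij = dist[frozenset((i, j))]
        k = f"node{next_id}"
        next_id += 1
        for m in L:
            if m not in (i, j):
                dist[frozenset((k, m))] = 0.5 * (
                    dist[frozenset((i, m))] + dist[frozenset((j, m))] - d_ij
                )
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, dist[frozenset((i, j))]))
    return edges


d = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
    frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
}
for a, b, length in neighbor_joining([1, 2, 3, 4], d):
    print(a, b, round(length, 3))
```

On the d table from the "Closest Pairs of Leaves" slide this recovers edge lengths 0.1 for leaves 1 and 2 and 0.4 for leaves 3 and 4. The order in which the edges are reported depends on how ties are broken.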
Rooting Trees
• Outgroup
  • Species known to be more distantly related to each of the remaining species than they are to each other
• Find the root by adding an outgroup
  • The point in the tree where the edge to the outgroup joins is expected to be the best root candidate
• In the absence of a convenient outgroup, methods are quite ad hoc
  • E.g. picking the midpoint of the longest chain of consecutive edges, if the deviation from a molecular clock is not too great
Assumptions Used by UPGMA and Neighbor-Joining
• UPGMA (molecular clock with implied additivity)
  • The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  • The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
  • The distance from an internal node to a leaf node will always be the same no matter what path is taken
• Neighbor-Joining
  • It is possible for the molecular clock property to fail but for additivity to hold
  • Assumes additivity only
Parsimony
• Among the most widely used tree building algorithms
  • It works by finding the tree that can explain the observed sequences with a minimum number of substitutions
• Two components to the algorithm
  1. The computation of a cost for a given tree T
  2. A search through all trees, to find the overall minimum of this cost
Notations Used in Weighted Parsimony
• Sk(a) denotes the minimal cost for the assignment of a to node k
• S(a, b): cost for each substitution of a by b
Algorithm: Weighted Parsimony
Compute the minimum cost at site u
[Sankoff & Cedergren 1983]
Initialization:
  Set k = 2n - 1, the number of the root node
Recursion: Compute Sk(a) for all a as follows:
  If k is a leaf node:
    Set Sk(a) = 0 for a = xuk, Sk(a) = ∞ otherwise
  If k is not a leaf node:
    Compute Si(b), Sj(b) for all b at the daughter nodes i, j and define Sk(a) = minb(Si(b) + S(a, b)) + minb(Sj(b) + S(a, b)).
Termination:
  Minimal cost of tree = mina S2n-1(a)
Weighted parsimony reduces to traditional parsimony if S(a, a) = 0 for all a, S(a, b) = 1 for all a ≠ b
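The recursion above translates almost line by line into a recursive function. The sketch below handles a single site and assumes a tree encoded as nested 2-tuples with one-character strings as leaves. The names `sankoff`, `weighted_parsimony_cost`, and `unit_cost` and the tree encoding are illustrative, not from the slides.

```python
import math


def sankoff(tree, alphabet, cost):
    """Return {a: S_k(a)} for the node at the top of `tree` (one site)."""
    if isinstance(tree, str):
        # leaf: cost 0 for the observed residue, infinity otherwise
        return {a: (0.0 if a == tree else math.inf) for a in alphabet}
    left, right = tree
    s_i = sankoff(left, alphabet, cost)
    s_j = sankoff(right, alphabet, cost)
    return {
        a: min(s_i[b] + cost(a, b) for b in alphabet)
        + min(s_j[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony_cost(tree, alphabet, cost):
    """Minimal cost of the tree: the minimum over a of S_root(a)."""
    return min(sankoff(tree, alphabet, cost).values())


def unit_cost(a, b):
    """S(a, a) = 0 and S(a, b) = 1 for a != b (traditional parsimony)."""
    return 0.0 if a == b else 1.0


tree = (("A", "B"), ("A", "B"))
print(weighted_parsimony_cost(tree, "AB", unit_cost))
```

Passing a different `cost` function, for example one that charges more for some substitutions than for others, gives the weighted variant without changing the recursion.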
Algorithm: Traditional Parsimony [Fitch 1971]
Initialization:
  Set C = 0 and k = 2n - 1
Recursion: to obtain the set Rk
  If k is a leaf node:
    Set Rk = {xuk}
  If k is not a leaf node:
    Compute Ri, Rj for the daughter nodes i, j of k, and set
    Rk = Ri ∩ Rj if this intersection is not empty, or else
    Rk = Ri ∪ Rj and increment C
Termination:
  Minimal cost of the tree = C
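A minimal sketch of this set-based recursion for one site, assuming the tree is given as nested 2-tuples with one-character strings as leaves. The function name `fitch` and the encoding are illustrative. Instead of a global counter C, the cost is returned alongside the set.

```python
def fitch(tree):
    """Return (R_k, C): the residue set at the top node and the number of
    substitutions counted in the subtree below it."""
    if isinstance(tree, str):
        return {tree}, 0                  # leaf: R_k = {x_k}, no cost
    left, right = tree
    r_i, c_i = fitch(left)
    r_j, c_j = fitch(right)
    common = r_i & r_j
    if common:
        return common, c_i + c_j          # non-empty intersection
    return r_i | r_j, c_i + c_j + 1       # union, one more substitution


root_set, cost = fitch((("A", "B"), ("A", "B")))
print(sorted(root_set), cost)
```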
Parsimony Example
[Figure: a tree whose leaves carry residues A and B, with the sets Rk (such as {A, B}) shown at its internal nodes, followed by candidate assignments of residues to the internal nodes]
Minimum cost = 2
Obtained by traditional parsimony
Cont’
[Figure: a tree with residues A and B assigned to its nodes]
Minimum cost tree: not obtained by traditional parsimony
Enumeration of Unrooted Trees
• Enumerate all unrooted trees by an array [i3] [i5] [i7] [i9] … [i2n-5]
  • Take the unrooted tree with 3 sequences x1, x2, and x3, and add an edge for x4 on the edge labeled by i3. Since the new edge divides the preexisting edge in two, the total number of edges is now 3 + 2 = 5. The value of i5 determines which of these five edges x5 is added to.
• Think of [i3] [i5] [i7] [i9] … [i2n-5] as an odometer …
Counting Trees (Cont’)
• Counting complete trees
  • The rightmost index advances until it reaches 2n-5
  • The next-to-rightmost index clicks forward by 1 when the rightmost index goes back to 1
  • The second-to-rightmost index clicks forward by 1 when the next-to-rightmost index goes back to 1 after reaching 2n-7
  • And so on and so forth …
• Counting both complete and incomplete trees
  • Add 0 to each array index, meaning that there is no edge of the order specified by the counter
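The odometer for complete trees can be sketched with `itertools.product`, which also advances its rightmost position fastest. The function name is illustrative, and the sketch only enumerates counter settings. It does not build the trees themselves and does not include the 0 entries used for incomplete trees.

```python
from itertools import product


def odometer_settings(n_leaves):
    """Yield every setting (i3, i5, ..., i_{2n-5}) for complete trees.

    The counter i_m runs from 1 to m, the number of edges available
    when the corresponding leaf is added.
    """
    ranges = [range(1, m + 1) for m in range(3, 2 * n_leaves - 4, 2)]
    return product(*ranges)


settings = list(odometer_settings(5))
print(len(settings))      # one setting per unrooted tree on 5 leaves
print(settings[:6])       # the rightmost counter moves first
```

The number of settings equals (2n-5)!!, matching the count of unrooted trees.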
Selecting Labeled Branching Patterns by Branch and Bound
• Starts from the odometer setting [1][0][0]…[0]
• Let the smallest cost so far for a complete tree be C
• Branch and bound
  • Adding more leaves can only increase the cost
  • No point branching out if the current cost is larger than the minimum cost so far
• Implementation trick
  • Whenever the cost of our current subtree T is more than C, we know that T is not part of the optimal tree
  • If all the counters to the right of a given non-zero counter are 0, instead of advancing them all to ‘1’ we can click the rightmost non-zero counter one forward
An Example of Branch-and-Bound
[Figure: odometer settings 3…70000, 3…71111, and 3…80000]
Skip 3…70001 to 3…7(2n-11)(2n-9)(2n-7)(2n-5) and go directly to 3…80000 if the cost of 3…70000 is higher than the minimum cost found so far
Assessing the Trees: the Bootstrap
• Bootstrapping (sampling with replacement)
  • Given a dataset consisting of an alignment of sequences, generate an artificial dataset by picking columns from the alignment at random with replacement
  • Generate a large number (on the order of thousands) of artificial alignment datasets
  • For each artificially generated dataset, build a tree
• Assessing phylogenetic features
  • Find the frequency of each phylogenetic feature that appears in the thousands of trees generated above
  • The higher the frequency, the more confidence we have in a phylogenetic feature
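The resampling step and the frequency count can be sketched as below. The alignment is assumed to be a list of equal-length strings, one per sequence. The tree-building step is left out, and "features" are assumed to be any hashable description of a clade or split. All function names are illustrative.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate artificial alignments of the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


def feature_frequencies(feature_sets):
    """Fraction of trees in which each feature appears."""
    counts = Counter(f for features in feature_sets for f in set(features))
    return {f: c / len(feature_sets) for f, c in counts.items()}


alignment = ["ACGT", "ACGA", "TCGA"]
for replicate in bootstrap_replicates(alignment, n_replicates=3):
    print(replicate)
```

Each replicate would then be passed to a tree-building method, and `feature_frequencies` applied to the features extracted from the resulting trees.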
Describe a New Hampshire Standard Tree
Tree file representation of the above rooted tree, starting at the beginning of the file:
(B,(A,C,E),D);
(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
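To show how such a string maps onto a tree, here is a small recursive-descent parser for the subset of the format used in the two examples above (names, optional branch lengths, nesting, terminating semicolon). It is a sketch only: it assumes no whitespace, quoting, or comments, and the function names and the `(name, length, children)` tuple layout are illustrative.

```python
def parse_newick(text):
    """Parse a New Hampshire string into nested (name, length, children)."""
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                          # skip "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1                  # another sibling follows
                    continue
                if text[pos] != ")":
                    raise ValueError(f"expected ')' at position {pos}")
                pos += 1                      # skip ")"
                break
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]                # empty for unnamed nodes
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("tree must end with ';'")
    return root


def leaf_names(node):
    """List the leaf names of a parsed tree from left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);")
print(leaf_names(tree))
```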
Visualize Trees: Phylip DrawTree
Visualize Trees: Cladogram
Visualize Trees: Phenogram
Visualize Trees: Curve-O-Gram
Visualize Trees: Eurogram
Programs to Build Phylogenetic Trees
• PAUP
  • Includes parsimony, maximum likelihood, and distance methods
• Phylip
  • Includes parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees
• MrBayes
  • Bayesian estimation of phylogeny
  • Uses a simulation technique called Markov chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees
• Notung
  • Incorporates duplication/loss parsimony into phylogenetic tasks
• ……