algorithms in computational biology

Department of Mathematics & Computer Science Algorithms in Computational Biology 1 Algorithms in Computational Biology Building Phylogenetic Trees

Upload: gary

Post on 04-Jan-2016


TRANSCRIPT

Page 1: Algorithms in Computational Biology

Department of Mathematics & Computer Science Algorithms in Computational Biology 11

Algorithms in Computational Biology

Building Phylogenetic Trees

Page 2: Algorithms in Computational Biology

Phylogeny

• All organisms on Earth had a common ancestor
  • Evidence from morphological, biochemical, and gene sequence data
• Phylogeny
  • The history of organismal lineages as they change through time
• Phylogenetic tree
  • A tree showing the evolutionary relationships among various biological species
  • All living organisms today, from the smallest microbe to the largest plants and animals, are connected by the passage of genes along the branches of the phylogenetic tree

Page 3: Algorithms in Computational Biology


Phylogenetic Tree of Life

Page 4: Algorithms in Computational Biology

Inferring Phylogenies

• Traditionally
  • Use morphological characters (from both living and fossilized organisms)
• 1962
  • Zuckerkandl & Pauling showed that molecular sequences can be used to infer phylogenies
  • Assumes that current sequences descended from a common ancestral gene in a common ancestral species

Page 5: Algorithms in Computational Biology

Major Tree Building Algorithms

• Distance based
• Parsimony
• Maximum likelihood

Page 6: Algorithms in Computational Biology

Orthologue vs Paralogue

• Both are homologous genes (homologues)
• Orthologues are a set of genes that diverged from a common ancestor through speciation
  • Homologous genes from different species
• Paralogues are a set of genes that diverged from a common ancestor through gene duplication
  • Homologous genes from the same species

Page 7: Algorithms in Computational Biology

A Tree of Orthologues

[Figure: a tree of orthologues based on a set of alpha hemoglobins]

Page 8: Algorithms in Computational Biology


A Tree of Paralogues

Page 9: Algorithms in Computational Biology

Background on Trees

• Nodes and edges
  • Internal nodes: unobserved ancestors
  • Edge length
    • On average, corresponds to an evolutionary time period
    • Variations
      • Different proteins can change at different rates
      • The same sequence can evolve much faster in some organisms than in others
• Root of a phylogenetic tree
  • Ultimate ancestor of all species
  • Some algorithms provide the location of the root, while others don't

Page 10: Algorithms in Computational Biology

Counting and Labeling Trees

• Counting:
  • For a rooted tree with n leaves
    • As we move up the tree, the edges coalesce as each new node is reached
    • In addition to the n leaves, there are n-1 nodes (internal nodes plus the root node)
    • A total of 2n-1 nodes
    • There will be 2n-2 edges (discounting the edge above the root node)
  • For an unrooted tree with n leaves
    • Total number of nodes = 2n-2
    • Total number of edges = 2n-3
• Labeling (for a rooted tree)
  • Label the leaves using 1 to n
  • Label the branch nodes using n+1 to 2n-2
  • Label the root using 2n-1

Page 11: Algorithms in Computational Biology

Rooting an Unrooted Tree

[Figure: an unrooted tree on leaves 1, 2, 3 and rooted versions of it]

Page 12: Algorithms in Computational Biology

How Many Possible Topologies?

# of leaves   Ways to add nth leaf   # of edges in the sub-tree   # of unrooted trees
4             3                      5                            3
5             5                      7                            3x5
6             7                      9                            3x5x7
7             9                      11                           3x5x7x9
…             …                      …                            …
n             2n-5                   2n-3                         3x5x7x9x…x(2n-5)

# of unrooted trees: (2n-5)!!
# of rooted trees: (2n-3)!!
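The double-factorial counts can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not from the slides.

```python
def double_factorial(m: int) -> int:
    """Return m * (m - 2) * (m - 4) * ... down to 1 (taken as 1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary topologies on n >= 3 labeled leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary topologies on n >= 2 labeled leaves: (2n-3)!!"""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in range(4, 11):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rapid growth of these numbers is why exhaustive search over topologies is only feasible for small n.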

Page 13: Algorithms in Computational Biology

Making a Tree from Pairwise Distances

• Distance measure
  • First find f, the fraction of differences between two sequences, presupposing an alignment of the two sequences
  • The fraction of differences expected by chance (by random substitution) is about 3/4
  • Jukes-Cantor distance (odds ratio)
• Clustering methods
  • UPGMA
  • Neighbor-joining

$$ d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4f}{3}\right) $$
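As a rough illustration of the Jukes-Cantor distance, here is a small Python sketch. It assumes the two sequences are already aligned, have equal length, and contain no gaps; the function names are hypothetical.

```python
import math


def fraction_of_differences(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(f: float) -> float:
    """Jukes-Cantor distance d = -(3/4) * log(1 - 4f/3), defined for 0 <= f < 3/4."""
    if not 0.0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


if __name__ == "__main__":
    f = fraction_of_differences("ACGTACGT", "ACGAACGA")
    print(f, jukes_cantor_distance(f))
```

The distance grows without bound as f approaches 3/4, the fraction of differences expected by chance.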

Page 14: Algorithms in Computational Biology

Unweighted Pair Group Method Using Arithmetic Average (UPGMA) [Sokal & Michener, 1958]

Overview

1. Cluster the sequences
2. Amalgamate two clusters at each stage, creating a new node on the tree
3. Assemble the tree upwards, each node being added above the others
4. Each edge length is determined by the difference in the heights of the nodes at the top and bottom of the edge

Page 15: Algorithms in Computational Biology

Distance Measure Used in UPGMA

Distance b/w two clusters Ci and Cj is the average distance between pairs of sequences, one from each cluster:

$$ d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} $$

Distance b/w two clusters Ck and Cl, if Ck is the union of two clusters Ci and Cj:

$$ d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} $$

Page 16: Algorithms in Computational Biology

Algorithm: UPGMA

Initialization
  Assign each sequence i to its own cluster Ci
  Define one leaf of T for each sequence, and place it at height zero

Iteration
  Determine the two clusters i, j for which dij is minimal (if there are ties, pick one randomly)
  Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l using the arithmetic average
  Define a node k with daughter nodes i and j, and place it at height dij/2
  Add k to the current clusters and remove i and j

Termination
  When only two clusters i, j remain, place the root at height dij/2
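The UPGMA steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: distances are passed as a dict keyed by `frozenset({a, b})`, ties are broken by taking the first minimal pair rather than a random one, and a node is represented as a tuple `(height, left, right)` with leaves as plain strings.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree by UPGMA.

    names: list of leaf labels (strings).
    dist:  dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns a nested tuple (height, left_subtree, right_subtree); leaves are labels.
    """
    d = dict(dist)  # working copy, extended with cluster-to-cluster distances
    # cluster key -> (subtree, number of leaves in the cluster)
    clusters = {name: (name, 1) for name in names}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first one found on ties).
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        k = (i, j)  # key for the merged cluster
        # Size-weighted arithmetic average of distances to every other cluster.
        for l in clusters:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * n_i + d[frozenset((j, l))] * n_j
            ) / (n_i + n_j)
        # The new node sits at height d_ij / 2 above the leaves.
        clusters[k] = ((d_ij / 2, tree_i, tree_j), n_i + n_j)
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    dist = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 6.0,
    }
    print(upgma(["A", "B", "C"], dist))
```

Edge lengths are not stored explicitly; each one is the difference between the heights of the nodes at its two ends.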

Page 17: Algorithms in Computational Biology


An Example

Page 18: Algorithms in Computational Biology


Cont’

Page 19: Algorithms in Computational Biology

Molecular Clock Assumption in UPGMA

• UPGMA produces a rooted tree
  • Edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  • The sum of times down a path to the leaves from any node is the same, whatever the path
• The distances dij are said to be ultrametric if, for any triplet of sequences xi, xj, xk, the distances dij, djk, dik are either all equal, or two are equal and the remaining one is smaller
  • True for a tree with a molecular clock
• Implied additivity
  • The edge lengths are said to be additive if the distance b/w any pair of leaves is the sum of the lengths of the edges on the path connecting them
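The ultrametric condition can be tested directly on a distance table. The sketch below assumes distances are stored in a dict keyed by `frozenset({a, b})` and compares values with a small tolerance; both choices are illustrative.

```python
import math
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition for every triplet of leaves.

    For each triplet the two largest of the three pairwise distances must be
    equal; this covers both "all equal" and "two equal, one smaller".
    """
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted(
            (dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))])
        )
        if not math.isclose(middle, largest, abs_tol=tol):
            return False
    return True


if __name__ == "__main__":
    clock_like = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 6.0,
    }
    print(is_ultrametric(["A", "B", "C"], clock_like))
```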

Page 20: Algorithms in Computational Biology


Molecular Clocks

• Mutations may build up in any given stretch of DNA at a reliable rate

• If the rate of mutation of a gene is reliable, this gene can be used as a molecular clock

• This gene can be a powerful tool for estimating the dates of lineage-splitting events.

Page 21: Algorithms in Computational Biology

Example

The entire length of DNA of a gene changes at a rate of approximately one base per 25 million years

Page 22: Algorithms in Computational Biology

What If Molecular Clock Property Fails?

[Figure: a tree with leaves 1, 2, 3, 4 (left) that is reconstructed incorrectly by UPGMA (right)]

Page 23: Algorithms in Computational Biology

Additivity

• Given a tree, its edge lengths are additive
  • If the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
• Built-in assumption in UPGMA

Page 24: Algorithms in Computational Biology

Test for Additivity

• For every set of four leaves 1, 2, 3 and 4, two of the three sums d12 + d34, d13 + d24 and d14 + d23 must be equal and larger than the third.

[Figure: unrooted tree with four leaves 1, 2, 3, 4]
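The four-point test lends itself to a direct check. The Python sketch below is illustrative only: it assumes a dict of distances keyed by `frozenset({a, b})`, and it accepts the degenerate case where all three sums are equal.

```python
import math
from itertools import combinations


def is_additive(names, dist, tol=1e-9):
    """Check the four-point condition for every quartet of leaves."""

    def d(a, b):
        return dist[frozenset((a, b))]

    for a, b, c, e in combinations(names, 4):
        sums = sorted((d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)))
        # The two largest sums must coincide; the smallest may be anything below.
        if not math.isclose(sums[1], sums[2], abs_tol=tol):
            return False
    return True


if __name__ == "__main__":
    table = {
        frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
        frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
    }
    print(is_additive([1, 2, 3, 4], table))
```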

Page 25: Algorithms in Computational Biology

Joining a Pair of Neighboring Leaves

[Figure: node k joins leaf nodes i and j; m is another leaf]

dim = dik + dkm
djm = djk + dkm
dij = dik + djk
dkm = 0.5(dim + djm – dij)

Node k joins leaf nodes i and j

Page 26: Algorithms in Computational Biology

Closest Pairs of Leaves Are not Necessarily Neighboring Leaves

[Figure: unrooted tree on leaves 1, 2, 3, 4 with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]

d Table

     1     2     3     4
1
2   0.3
3   0.5   0.6
4   0.6   0.5   0.9

Page 27: Algorithms in Computational Biology

Compensation for Long Edges

$$ D_{ij} = d_{ij} - (r_i + r_j), \qquad \text{where } r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik} $$

|L| is the size of the set L of leaves

r1 = 0.7, r2 = 0.7, r3 = 1, r4 = 1

D Table

     1      2      3      4
1
2   -1.1
3   -1.2   -1.1
4   -1.1   -1.2   -1.1
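The r values and the D table can be reproduced from the d table with a few lines of Python. This is a sketch assuming the distances are stored in a dict keyed by `frozenset({i, j})`; the function name is illustrative.

```python
from itertools import combinations


def nj_corrected_distances(names, dist):
    """Return (r, D) with r_i = sum_k d_ik / (|L| - 2) and D_ij = d_ij - (r_i + r_j)."""
    n = len(names)
    r = {
        i: sum(dist[frozenset((i, k))] for k in names if k != i) / (n - 2)
        for i in names
    }
    D = {
        frozenset((i, j)): dist[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(names, 2)
    }
    return r, D


if __name__ == "__main__":
    d_table = {
        frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
        frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
    }
    r, D = nj_corrected_distances([1, 2, 3, 4], d_table)
    print({i: round(v, 3) for i, v in r.items()})
    print({tuple(sorted(p)): round(v, 3) for p, v in D.items()})
```

In the d table the closest pair is (1, 2), whereas the smallest D values belong to the pairs (1, 3) and (2, 4).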

Page 28: Algorithms in Computational Biology

Algorithm: Neighbor-Joining

Initialization:
  Define T to be the set of leaf nodes, one for each given sequence, and put L = T.

Iteration:
  Pick a pair i, j in L for which Dij is minimal.
  Define a new node k and set dkm = 0.5(dim + djm – dij), for all m in L.
  Add k to T with edges of lengths dik = 0.5(dij + ri – rj), djk = dij – dik, joining k to i and j, respectively.
  Remove i and j from L and add k.

Termination:
  When L consists of two leaves i and j, add the remaining edge between i and j, with length dij.

Produces an unrooted tree
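A compact Python sketch of neighbor-joining follows. Several details are assumptions made for illustration: distances live in a dict keyed by `frozenset({a, b})`, the tree is returned as a list of `(node, node, length)` edges, internal nodes get hypothetical names `node0`, `node1`, ..., and ties in Dij are broken by taking the first minimal pair.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (u, v, length) of the unrooted neighbor-joining tree."""
    d = dict(dist)      # working copy, extended as internal nodes are created
    active = list(names)  # the set L of the slides
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {
            i: sum(d[frozenset((i, k))] for k in active if k != i) / (n - 2)
            for i in active
        }
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(active, 2),
            key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        k = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        d_jk = d_ij - d_ik
        edges.append((i, k, d_ik))
        edges.append((j, k, d_jk))
        active.remove(i)
        active.remove(j)
        for m in active:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            )
        active.append(k)
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    d_table = {
        frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
        frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
    }
    for edge in neighbor_joining([1, 2, 3, 4], d_table):
        print(edge)
```

On the four-leaf d table from the earlier slide, this sketch should recover leaf edges of about 0.1, 0.1, 0.4 and 0.4 and an internal edge of about 0.1, up to floating-point rounding.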

Page 29: Algorithms in Computational Biology

Rooting Trees

• Outgroup
  • Species known to be more distantly related to each of the remaining species than they are to each other
• Find the root by adding an outgroup
  • The point in the tree where the edge to the outgroup joins is expected to be the best root candidate
• In the absence of a convenient outgroup, methods are quite ad hoc
  • E.g. picking the midpoint of the longest chain of consecutive edges, if the deviation from a molecular clock is not too great

Page 30: Algorithms in Computational Biology

Assumptions Used by UPGMA and Neighbor-Joining

• UPGMA (molecular clock with implied additivity)
  • The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate
  • The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
  • The distance from an internal node to a leaf node will always be the same no matter what path is taken
• Neighbor-Joining
  • It is possible for the molecular clock property to fail but for additivity to hold
  • Assumes additivity only

Page 31: Algorithms in Computational Biology

Parsimony

• One of the most widely used tree building methods
• It works by finding the tree that can explain the observed sequences with a minimum # of substitutions
• Two components to the algorithm
  1. The computation of a cost for a given tree T
  2. A search through all trees, to find the overall minimum of this cost

Page 32: Algorithms in Computational Biology


Notations Used in Weighted Parsimony

• Sk(a) denotes the minimal cost for the assignment of a to node k

• S(a, b): cost for each substitution of a by b

Page 33: Algorithms in Computational Biology

Algorithm: Weighted Parsimony [Sankoff & Cedergren 1983]

Compute the minimum cost at site u

Initialization:
  Set k = 2n – 1, the number of the root node

Recursion: Compute Sk(a) for all a as follows:
  If k is a leaf node:
    Set Sk(a) = 0 for a = xuk (the residue at site u of leaf k), Sk(a) = ∞ otherwise
  If k is not a leaf node:
    Compute Si(b), Sj(b) for all b at the daughter nodes i, j and define
    Sk(a) = minb(Si(b) + S(a, b)) + minb(Sj(b) + S(a, b))

Termination:
  Minimal cost of tree = mina S2n-1(a)

Weighted parsimony reduces to traditional parsimony if S(a, a) = 0 for all a, and S(a, b) = 1 for all a ≠ b
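The weighted parsimony recursion for a single site can be sketched in Python as below. The representation is an assumption made for illustration: a leaf is the one-character residue observed at the site, an internal node is a `(left, right)` tuple, and `cost(a, b)` plays the role of S(a, b).

```python
import math


def sankoff_cost(tree, alphabet, cost):
    """Minimal weighted-parsimony cost of a rooted binary tree at one site.

    tree:     leaf = one-character string; internal node = (left, right) tuple.
    alphabet: iterable of possible residues.
    cost:     function cost(a, b) giving the substitution cost S(a, b).
    """
    alphabet = list(alphabet)

    def S(node):
        if isinstance(node, str):
            # Leaf: cost 0 for the observed residue, infinity for anything else.
            return {a: (0.0 if a == node else math.inf) for a in alphabet}
        left, right = node
        S_i, S_j = S(left), S(right)
        return {
            a: min(S_i[b] + cost(a, b) for b in alphabet)
            + min(S_j[b] + cost(a, b) for b in alphabet)
            for a in alphabet
        }

    return min(S(tree).values())


if __name__ == "__main__":
    unit = lambda a, b: 0 if a == b else 1
    print(sankoff_cost((("A", "B"), ("A", "B")), "AB", unit))
```

With the unit cost shown in the usage example, the result is the traditional parsimony cost.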

Page 34: Algorithms in Computational Biology

Algorithm: Traditional Parsimony [Fitch 1971]

Initialization:
  Set C = 0 and k = 2n – 1

Recursion: to obtain the set Rk
  If k is a leaf node:
    Set Rk = {xuk}
  If k is not a leaf node:
    Compute Ri, Rj for the daughter nodes i, j of k, and set
    Rk = Ri ∩ Rj if this intersection is not empty, or else
    Rk = Ri ∪ Rj and increment C

Termination:
  Minimal cost of the tree = C
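A short Python sketch of this set-based recursion, for one site, is given below. As an illustrative assumption, a leaf is the one-character residue at the site and an internal node is a `(left, right)` tuple.

```python
def fitch_cost(tree):
    """Return (C, R_root): the traditional parsimony cost and the root set.

    tree: leaf = one-character string; internal node = (left, right) tuple.
    """
    cost = 0

    def R(node):
        nonlocal cost
        if isinstance(node, str):
            return {node}
        left, right = node
        r_i, r_j = R(left), R(right)
        common = r_i & r_j
        if common:
            return common
        # Empty intersection: at least one substitution is needed below this node.
        cost += 1
        return r_i | r_j

    root_set = R(tree)
    return cost, root_set


if __name__ == "__main__":
    print(fitch_cost((("A", "B"), ("A", "B"))))
```

This computes only the cost and the candidate sets; recovering an actual assignment of residues to internal nodes requires a separate traceback pass.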

Page 35: Algorithms in Computational Biology

Parsimony Example

[Figure: trees with leaves labeled A and B and internal candidate sets {A, B}; minimum cost = 2, obtained by traditional parsimony]

Page 36: Algorithms in Computational Biology

Cont'

[Figure: minimum cost tree not obtained by traditional parsimony]

Page 37: Algorithms in Computational Biology

Enumeration of Unrooted Trees

• Enumerate all unrooted trees by an array [i3] [i5] [i7] [i9] … [i2n-5]
  • Take the unrooted tree with 3 sequences x1, x2 and x3, and add an edge for x4 on the edge labeled by i3. Since the new edge divides the preexisting edge in two, the total number of edges is now 3 + 2 = 5. The value of i5 determines which of these five edges x5 is added to.
• Think of [i3] [i5] [i7] [i9] … [i2n-5] as an odometer …

Page 38: Algorithms in Computational Biology

Counting Trees (Cont')

• Counting complete trees
  • The rightmost index advances until it reaches 2n-5
  • The next-to-rightmost index clicks forward by 1 when the rightmost index goes back to 1
  • The index to its left clicks forward by 1 when the next-to-rightmost index has reached 2n-7 and goes back to 1
  • And so on and so forth …
• Counting both complete and incomplete trees
  • Add 0 to each array index, meaning that there is no edge of the order specified by the counter
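The odometer over complete trees can be sketched with `itertools.product`, which also advances the rightmost position fastest. The function name is illustrative, and the sketch only enumerates counter settings; it does not build the trees themselves.

```python
from itertools import product


def odometer_settings(n):
    """Iterate over all complete settings (i3, i5, ..., i_{2n-5}) for n >= 4 leaves.

    The counter i_m takes the values 1..m: it selects one of the m edges
    present when the next leaf is attached.
    """
    if n < 4:
        raise ValueError("need at least 4 leaves")
    ranges = [range(1, m + 1) for m in range(3, 2 * n - 4, 2)]
    return product(*ranges)


if __name__ == "__main__":
    settings = list(odometer_settings(5))
    print(len(settings), settings[:3], settings[-1])
```

The number of settings equals (2n-5)!!, the number of unrooted trees on n leaves.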

Page 39: Algorithms in Computational Biology

Selecting Labeled Branching Patterns by Branch and Bound

• Start from the odometer setting [1][0][0]…[0]
• Let the smallest cost so far for a complete tree be C
• Branch and bound
  • Adding more leaves can only increase the cost
  • No point branching out if the current cost is larger than the minimum cost so far
• Implementation trick
  • Whenever the cost of our current subtree T is more than C, we know that T is not part of the optimal tree
  • If all the counters to the right of a given non-zero counter are 0, instead of advancing them all to ‘1’ we can click the rightmost non-zero counter one forward

Page 40: Algorithms in Computational Biology

An Example of Branch-and-Bound

3 … 7 0 0 0 0
……
3 … 7 1 1 1 1
……
3 … 8 0 0 0 0

Skip 3…70001 to 3…7(2n-11)(2n-9)(2n-7)(2n-5) and go directly to 3…80000 if the cost of 3…70000 is higher than the minimum cost found so far

Page 41: Algorithms in Computational Biology

Assessing the Trees: the Bootstrap

• Bootstrapping (sampling with replacement)
  • Given a dataset consisting of an alignment of sequences, generate an artificial dataset by picking columns from the alignment at random with replacement
  • Generate a large number (on the order of thousands) of artificial alignment datasets
  • For each artificially generated dataset, build a tree
• Assessing phylogenetic features
  • Find the frequency of each phylogenetic feature that appears in the thousands of trees generated above
  • The higher the frequency, the more confidence we have in a phylogenetic feature
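The column-resampling step can be sketched in Python as follows. The alignment is assumed to be a list of equal-length strings (one row per sequence); the function names and the fixed seed are illustrative choices. Tree building and feature counting are left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: list of equal-length strings, one per sequence.
    rng:       a random.Random instance.
    """
    length = len(alignment[0])
    # Draw `length` column indices at random, with replacement.
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates artificial alignments of the same shape."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    aln = ["ACGT", "ACGA", "TCGA"]
    for replicate in bootstrap_replicates(aln, n_replicates=3):
        print(replicate)
```

Each replicate would then be passed to a tree-building method, and the frequency of a feature across the resulting trees taken as its bootstrap support.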

Page 42: Algorithms in Computational Biology

Describe a New Hampshire Standard Tree

Tree file representation of the above rooted tree, starting at the beginning of the file:

(B,(A,C,E),D);
(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
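To make the format concrete, here is a small Python sketch that writes such strings from a nested structure. The input representation is an assumption for illustration: a leaf is a name string, and an internal node is a list of `(child, branch_length)` pairs.

```python
def to_newick(node, with_lengths=True):
    """Serialize a tree to a New Hampshire (Newick) string.

    node: leaf name (str) or a list of (child, branch_length) pairs.
    """

    def fmt(n):
        if isinstance(n, str):
            return n
        parts = []
        for child, length in n:
            text = fmt(child)
            if with_lengths:
                text += f":{length}"
            parts.append(text)
        return "(" + ",".join(parts) + ")"

    return fmt(node) + ";"


if __name__ == "__main__":
    tree = [("B", 6.0), ([("A", 5.0), ("C", 3.0), ("E", 4.0)], 5.0), ("D", 11.0)]
    print(to_newick(tree, with_lengths=False))
    print(to_newick(tree))
```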

Page 43: Algorithms in Computational Biology

Visualize Trees: Phylip DrawTree

Page 44: Algorithms in Computational Biology

Visualize Trees: Cladogram

Page 45: Algorithms in Computational Biology

Visualize Trees: Phenogram

Page 46: Algorithms in Computational Biology

Visualize Trees: Curve-O-Gram

Page 47: Algorithms in Computational Biology

Visualize Trees: Eurogram

Page 48: Algorithms in Computational Biology

Programs to Build Phylogenetic Trees

• PAUP
  • Includes parsimony, maximum likelihood, and distance methods
• Phylip
  • Includes parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees
• MrBayes
  • Bayesian estimation of phylogeny
  • Uses a simulation technique called Markov chain Monte Carlo (MCMC) to approximate the posterior probabilities of trees
• NoTung
  • Incorporates duplication/loss parsimony into phylogenetic tasks
• ……