Phylogenetic Tree Construction

(Lecture for CS397-CXZ Algorithms in Bioinformatics)

April 2, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Introduction

• Phylogenetic tree: a tree with sequences as leaves that reflects their evolutionary relationships

• Formal properties

– Binary

– Rooted or unrooted

– Edge length reflects the amount of evolutionary divergence

• Construction methods (all related to clustering)

– Similarity/distance based (bottom-up construction)

– Maximum parsimony (search for the right tree)

– Probabilistic models (modeling a tree)

Similarity-based Methods

• Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

– Essentially average-link clustering

– Height of node Ck = ½ dij, where dij is the distance between the two children of Ck

• Desirable properties of a tree

– Molecular clock (edge lengths): all leaves below the same node are at an equal distance from it (the tree shows time)

– Additivity: edge lengths are additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them (the tree shows “changes”)

• UPGMA guarantees the “molecular clock” property, but not necessarily “additivity”.
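The UPGMA procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the size-weighted update of cluster distances (which keeps each cluster distance equal to the average over all leaf pairs) are choices made here for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-link clustering. Returns (tree as nested tuples, merge heights)."""
    # clusters: id -> (subtree, number of leaves); d: distances keyed by id pairs
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    heights = []
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # closest pair of clusters
        i, j = sorted(pair)
        dij = d.pop(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        heights.append(dij / 2)               # node height = 1/2 * d_ij
        for m in clusters:
            dim = d.pop(frozenset((i, m)))
            djm = d.pop(frozenset((j, m)))
            # size-weighted mean = average distance over all leaf pairs
            d[frozenset((next_id, m))] = (ni * dim + nj * djm) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, heights

# Distance matrix of the four-sequence example used for Neighbor-Joining
D = [[0, 8, 7, 12],
     [8, 0, 9, 14],
     [7, 9, 0, 11],
     [12, 14, 11, 0]]
tree, heights = upgma(["A", "B", "C", "D"], D)
print(tree)      # ('D', ('B', ('A', 'C')))
print(heights)
```

On this matrix UPGMA joins A and C first, because dAC = 7 is the smallest distance, whereas Neighbor-Joining pairs A with B. The merge heights never decrease, which is the molecular clock property, but the resulting tree does not reproduce the input distances.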

Neighbor-Joining

• Adjust the distances

– Dij = dij – (ri + rj), where ri is the average distance from i to all other nodes

– For additive distances, the pair (i, j) with minimum Dij is guaranteed to be a pair of neighbors

• Alternative cluster distance function

– Suppose i and j are a pair of neighbors; replace them with a new node k

– Define dkm = ½ (dim + djm – dij) for any other node m

– This guarantees additivity

• Finally, the edge lengths for joining k to i and j are dik = ½ (dij + ri – rj) and djk = dij – dik

• Used in ClustalW

$r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$, where L is the current set of nodes

Neighbor-Joining: Example

Original (true) tree: [figure: rooted tree with leaves A, B, C, D and edge lengths 2, 3, 5, 3, 1, 6]

Original distance matrix, with r for each row

Sequence    A     B     C     D      r
A           0     8     7    12   13.5
B           8     0     9    14   15.5
C           7     9     0    11   13.5
D          12    14    11     0   18.5

Adjusted distance matrix, e.g. DAB = 8 – (13.5 + 15.5) = –21

Sequence    A     B     C     D
A           0   -21   -20   -20
B         -21     0   -20   -20
C         -20   -20     0   -21
D         -20   -20   -21     0

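Neighbor-Joining: Example (cont.)

A and B (adjusted distance –21) are joined through a new node F:

dAF = (8 – (15.5 – 13.5))/2 = 3

dBF = (8 + (15.5 – 13.5))/2 = 5

dFC = (dAC + dBC – dAB)/2 = 4

Intermediate distance matrix, with r for each row

Node    F     C     D     r
F       0     4     9    13
C       4     0    11    15
D       9    11     0    20

Adjusted distance matrix, e.g. DFC = 4 – (13 + 15) = –24

Node    F     C     D
F       0   -24   -24
C     -24     0   -24
D     -24   -24     0

Final tree: [figure: leaves A, B, C, D and internal node F; edge lengths A–F = 3, B–F = 5, C = 3, D = 8, internal edge = 1; distances dFC = 4, dFD = 9, dCD = 11; root marker labeled 6]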
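The Neighbor-Joining steps above can be traced with a short Python sketch. It follows the formulas as given on the slides (ri as the row sum divided by |L| – 2, Dij = dij – (ri + rj), and the two update rules); the function name `neighbor_joining`, the internal node labels `N0`, `N1`, and the rule of taking the first pair when adjusted distances tie are assumptions made for this illustration.

```python
def neighbor_joining(names, dist):
    """Return a list of (node, node, edge_length) for an unrooted tree."""
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i = (1 / (|L| - 2)) * sum_k d_ik
        r = {i: sum(d[i, k] for k in nodes) / (n - 2) for i in nodes}
        pairs = [(a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]]
        # adjusted distance D_ij = d_ij - (r_i + r_j); ties: first pair wins
        i, j = min(pairs, key=lambda p: d[p] - (r[p[0]] + r[p[1]]))
        k = f"N{new_id}"                      # hypothetical internal label
        new_id += 1
        dik = 0.5 * (d[i, j] + r[i] - r[j])
        djk = d[i, j] - dik
        edges += [(i, k, dik), (j, k, djk)]
        rest = [m for m in nodes if m not in (i, j)]
        for m in rest:
            d[k, m] = d[m, k] = 0.5 * (d[i, m] + d[j, m] - d[i, j])
        d[k, k] = 0.0
        nodes = rest + [k]
    a, b = nodes
    edges.append((a, b, d[a, b]))             # last two nodes share one edge
    return edges

M = [[0, 8, 7, 12],
     [8, 0, 9, 14],
     [7, 9, 0, 11],
     [12, 14, 11, 0]]
for edge in neighbor_joining(["A", "B", "C", "D"], M):
    print(edge)
```

Here `N0` plays the role of node F in the example: A and B attach to it with lengths 3 and 5, C and D attach to the second internal node with lengths 3 and 8, and the internal edge has length 1.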

Maximum parsimony principle: the most accurate phylogenetic tree is the one that requires the fewest changes in the genetic code.

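Maximum Parsimony

Site    1  2  3  4  5  6  7  8  9 10
1 -     A  G  G  G  T  A  A  C  T  G
2 -     A  C  G  A  T  T  A  T  T  A
3 -     A  T  A  A  T  T  G  T  C  T
4 -     A  A  T  G  T  T  G  T  C  G

[Figure: the three possible unrooted trees for sequences 1–4: ((1,2),(3,4)), ((1,3),(2,4)), ((1,4),(2,3))]

Site 1: every tree needs 0 changes.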

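Site 2 (1 - G, 2 - C, 3 - T, 4 - A): all four sequences differ, so every tree needs 3 changes.

Running scores for sites 1–2: 0 3 for each of the three trees.

[Figure: the three trees with the site-2 characters at the leaves and inferred characters at the internal nodes]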

Site 3: every tree needs 2 changes.

Running scores for sites 1–3: 0 3 2 for each of the three trees.

Informative site = discriminative site: a site at which the candidate trees require different numbers of changes
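A common working test for an informative site is that at least two different characters each occur in at least two sequences. The sketch below applies that test to the four-sequence alignment from the Maximum Parsimony example; the function name `informative_sites` is a label chosen for this illustration.

```python
from collections import Counter

def informative_sites(seqs):
    """Return 1-based positions of parsimony-informative columns."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # informative: at least two characters, each seen in >= 2 sequences
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

alignment = ["AGGGTAACTG",   # sequence 1
             "ACGATTATTA",   # sequence 2
             "ATAATTGTCT",   # sequence 3
             "AATGTTGTCG"]   # sequence 4
print(informative_sites(alignment))  # [4, 7, 9]
```

Only sites 4, 7 and 9 pass the test. Every other site costs the same number of changes on all three trees, so it cannot change their ranking.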

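Site 4 (1 - G, 2 - A, 3 - A, 4 - G): the tree that groups 1 with 4 and 2 with 3 needs 1 change; the other two trees need 2 changes each.

Running scores for sites 1–4: 0 3 2 2, 0 3 2 1, and 0 3 2 2.

[Figure: the three trees with the site-4 characters G and A at the leaves and inferred characters at the internal nodes]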


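Changes required at each site, and in total, for each tree

Tree             1  2  3  4  5  6  7  8  9 10   Total
((1,2),(3,4))    0  3  2  2  0  1  1  1  1  3    14
((1,4),(2,3))    0  3  2  1  0  1  2  1  2  3    15
((1,3),(2,4))    0  3  2  2  0  1  2  1  2  3    16

Maximum parsimony selects the tree with the fewest changes (14).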
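The per-site change counts can be computed mechanically. The slides do not name a counting procedure, so the sketch below uses Fitch's small-parsimony algorithm as an assumed stand-in; it roots each four-leaf tree on its internal edge, which does not affect the count. To keep the example short it scores only sites 4, 7 and 9, the three sites at which the trees differ in the table above. The names `fitch_score` and `tree_score` are hypothetical.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site on a binary tree (nested tuples)."""
    def walk(node):
        if not isinstance(node, tuple):           # leaf: its observed character
            return {states[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        common = ls & rs
        if common:                                # children agree: no new change
            return common, lc + rc
        return ls | rs, lc + rc + 1               # disagree: one more change
    return walk(tree)[1]

alignment = {1: "AGGGTAACTG", 2: "ACGATTATTA",
             3: "ATAATTGTCT", 4: "AATGTTGTCG"}
trees = {"((1,2),(3,4))": ((1, 2), (3, 4)),
         "((1,3),(2,4))": ((1, 3), (2, 4)),
         "((1,4),(2,3))": ((1, 4), (2, 3))}

def tree_score(tree, sites):
    """Sum of Fitch scores over the given 1-based site positions."""
    return sum(fitch_score(tree, {k: s[site - 1] for k, s in alignment.items()})
               for site in sites)

scores = {name: tree_score(t, [4, 7, 9]) for name, t in trees.items()}
print(scores)  # {'((1,2),(3,4))': 4, '((1,3),(2,4))': 6, '((1,4),(2,3))': 5}
```

The ordering matches the totals in the table: ((1,2),(3,4)) needs the fewest changes, then ((1,4),(2,3)), then ((1,3),(2,4)).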

Probabilistic Approaches

• Basic idea:

– Tree = generative probabilistic model, e.g., an n-leaf tree defines a model p(X1, …, Xn)

– Data: sequences {s1, …, sn}

– Choose the tree according to

• Maximum likelihood: p(Data|Tree)

• Maximum a posteriori (Bayesian): p(Tree|Data)

• Models evolution more directly

• Computationally expensive

Detailed View of Probabilistic Models

[Figure: tree with root x5; x5 has children x3 (edge t3) and x4 (edge t4); x4 has children x1 (edge t1) and x2 (edge t2)]

The tree above defines the following probabilistic model:

$p(x_1, \ldots, x_5 \mid T, t) = p(x_1 \mid x_4, t_1)\, p(x_2 \mid x_4, t_2)\, p(x_3 \mid x_5, t_3)\, p(x_4 \mid x_5, t_4)\, p(x_5)$

Basic evolution model: p(x|y,t) = probability of x arising from an ancestral sequence y over an edge of length t

Decompose the sequence (“Independence Assumption”):

$p(x \mid y, t) = \prod_u p(x_u \mid y_u, t)$

Decompose the time (“Markovian Assumption”):

$p(a \mid c, s + t) = \sum_b p(b \mid c, s)\, p(a \mid b, t)$

“Primitive Evolution Model”: p(a|b,t)

- Nucleotides: Jukes-Cantor model

- Amino acids: PAM

The Jukes-Cantor model

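Rate matrix (rows and columns ordered A, C, G, T):

$R = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$

Substitution matrix:

$S(t) = \begin{pmatrix} r_t & s_t & s_t & s_t \\ s_t & r_t & s_t & s_t \\ s_t & s_t & r_t & s_t \\ s_t & s_t & s_t & r_t \end{pmatrix}$

Short time: $S(\epsilon) \approx I + R\epsilon$

$S(t + \epsilon) = S(t)\, S(\epsilon) \approx S(t)(I + R\epsilon)$

$S'(t) = S(t)\, R$

$r' = -3\alpha r + 3\alpha s, \qquad s' = -\alpha s + \alpha r$

Solutions: $r_t = (1 + 3e^{-4\alpha t})/4$, $s_t = (1 - e^{-4\alpha t})/4$.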
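The Jukes-Cantor solution can be checked numerically. The sketch below assumes the closed form r_t = (1 + 3e^{-4αt})/4 and s_t = (1 - e^{-4αt})/4 with a single substitution rate α (the parameter `alpha`, defaulting to 1.0 purely for illustration). It verifies that the probabilities out of a nucleotide sum to 1 and that the model satisfies the Markovian assumption p(a|c,s+t) = Σ_b p(b|c,s) p(a|b,t). The names `jc_prob` and `two_step` are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """p(a | b, t) under Jukes-Cantor: r_t if a == b, otherwise s_t."""
    e = math.exp(-4.0 * alpha * t)
    return (1 + 3 * e) / 4 if a == b else (1 - e) / 4

def two_step(a, c, s, t, alpha=1.0):
    """Sum over the intermediate nucleotide b after time s, then time t."""
    return sum(jc_prob(b, c, s, alpha) * jc_prob(a, b, t, alpha) for b in BASES)

# Probabilities of all outcomes from one ancestor sum to 1
total = sum(jc_prob(a, "A", 0.3) for a in BASES)

# Evolving for s and then t equals evolving for s + t in one step
max_gap = max(abs(two_step(a, c, 0.2, 0.5) - jc_prob(a, c, 0.2 + 0.5))
              for a in BASES for c in BASES)
print(total, max_gap)
```

At t = 0 the function returns 1 for a == b and 0 otherwise, and as t grows every entry approaches 1/4, the uniform stationary distribution of the model.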

Computing the Likelihood

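[Figure: tree with root x5; x5 has children x3 (edge t3) and x4 (edge t4); x4 has children x1 (edge t1) and x2 (edge t2)]

With parents known:

$p(x_1, \ldots, x_5 \mid T, t) = p(x_1 \mid x_4, t_1)\, p(x_2 \mid x_4, t_2)\, p(x_3 \mid x_5, t_3)\, p(x_4 \mid x_5, t_4)\, p(x_5)$

But we don’t know the parents…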

Handling the Hidden Nodes

• We must sum over all the hidden ancestral nodes

• Felsenstein’s algorithm for likelihood: compute the sum in a bottom-up fashion

– Start from the leaves

– Compute the parent node based on its child nodes

$p(x^1_u, x^2_u \mid T, t_1, t_2) = \sum_a q_a\, p(x^1_u \mid a, t_1)\, p(x^2_u \mid a, t_2)$

where $x^i_u$ is site u of sequence i and $q_a$ is the prior probability of residue a at the parent
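The bottom-up sum can be illustrated for a single site on the five-node tree used earlier (root x5 with children x3 and x4; x4 with children x1 and x2). The sketch below is only an illustration under stated assumptions: it uses the Jukes-Cantor model with rate α = 1 as the primitive evolution model, a uniform prior q_a = 1/4 at the root, and hypothetical function names. It also compares the result with a brute-force sum over both hidden nodes.

```python
import math
from itertools import product

BASES = "ACGT"

def jc(child, parent, t, alpha=1.0):
    """p(child | parent, t) under Jukes-Cantor (assumed rate alpha)."""
    e = math.exp(-4.0 * alpha * t)
    return (1 + 3 * e) / 4 if child == parent else (1 - e) / 4

def leaf(x):
    """Conditional likelihood vector of an observed leaf."""
    return {a: 1.0 if a == x else 0.0 for a in BASES}

def combine(left, t_left, right, t_right):
    """p(all leaves below a node | node = a), from its two children."""
    return {a: sum(jc(b, a, t_left) * left[b] for b in BASES)
               * sum(jc(c, a, t_right) * right[c] for c in BASES)
            for a in BASES}

def site_likelihood(x1, x2, x3, t1, t2, t3, t4):
    p4 = combine(leaf(x1), t1, leaf(x2), t2)      # hidden node x4
    p5 = combine(leaf(x3), t3, p4, t4)            # hidden root x5
    return sum(0.25 * p5[a] for a in BASES)       # uniform prior q_a = 1/4

def brute_force(x1, x2, x3, t1, t2, t3, t4):
    """Explicit sum over both hidden nodes of the 'parents known' product."""
    return sum(jc(x1, x4, t1) * jc(x2, x4, t2) * jc(x3, x5, t3)
               * jc(x4, x5, t4) * 0.25
               for x4, x5 in product(BASES, repeat=2))

times = (0.1, 0.2, 0.3, 0.15)
fast = site_likelihood("A", "A", "G", *times)
slow = brute_force("A", "A", "G", *times)
print(fast, slow)
```

Under the independence assumption, the likelihood of whole sequences is the product of such per-site values. The bottom-up pass visits each node once, while the brute-force sum grows exponentially with the number of hidden nodes.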

Maximizing the Likelihood

• Easy for a small number of sequences

• Generally complex for a large number of sequences

• Many solutions:

– EM

– Gradient descent

– Sampling

• Metropolis sampling

– Accept a new tree if P(new-tree) >= P(old-tree)

– Accept a new tree with probability P(new-tree)/P(old-tree) if P(new-tree) < P(old-tree)
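The acceptance rule can be sketched directly. In the example below the "trees" are just three labels with made-up unnormalized scores, and new candidates are proposed uniformly at random (a symmetric proposal). Both are assumptions made for illustration; a real sampler would propose rearrangements of an actual tree and score them with the likelihood or posterior.

```python
import random

def metropolis_accept(p_new, p_old, rng=random):
    """Accept if p_new >= p_old; otherwise accept with prob. p_new / p_old."""
    if p_new >= p_old:
        return True
    return rng.random() < p_new / p_old

def sample(weights, steps, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = {s: 0 for s in states}
    for _ in range(steps):
        proposal = rng.choice(states)             # symmetric proposal
        if metropolis_accept(weights[proposal], weights[current], rng):
            current = proposal
        visits[current] += 1
    return {s: v / steps for s, v in visits.items()}

# Hypothetical unnormalized scores for three candidate trees
toy = {"tree1": 1.0, "tree2": 2.0, "tree3": 1.0}
freq = sample(toy, 20000)
print(freq)
```

With these scores the chain should spend roughly half of its time in `tree2` and a quarter in each of the others, since the visit frequencies approach the normalized scores.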

More realistic evolutionary models

• Allowing different rates at different sites

– Using a prior (e.g., gamma) to regularize the different rates

– Hidden Markov models

• Evolutionary models with gaps

– Tree HMMs
