Building Phylogenies Distance-Based Methods

Upload: douglas-jones

Post on 22-Dec-2015


Page 1: Building Phylogenies Distance-Based Methods. Methods Distance-based Parsimony Maximum likelihood

Building Phylogenies

Distance-Based Methods

Page 2

Methods

• Distance-based
• Parsimony
• Maximum likelihood

Page 3

Distance Matrices

     a    b    c    d
a    0
b    6    0
c    7    3    0
d   14   10    9    0

[Figure: tree with leaves a, b, c, d drawn against a scale from 0 to 8]

Distance matrix is additive if there is a tree that fits it exactly
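One standard way to test additivity, which the slide does not state, is the four-point condition: for every four taxa, the two largest of the three pairwise sums D_ab + D_cd, D_ac + D_bd, D_ad + D_bc must be equal. A minimal Python sketch follows; the function name `is_additive` and the tolerance are illustrative choices, not part of the slides.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix (list of lists)."""
    for a, b, c, d in combinations(range(len(D)), 4):
        sums = sorted([D[a][b] + D[c][d],
                       D[a][c] + D[b][d],
                       D[a][d] + D[b][c]])
        # The two largest sums must coincide for the quartet to fit a tree.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Matrix from the slide, taxa ordered a, b, c, d
D = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
print(is_additive(D))  # True: the sums are 15, 17, 17
```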

Page 4

Ultrametric Matrices

     a    b    c    d
a    0
b    2    0
c    6    6    0
d   10   10   10    0

[Figure: tree with leaves a, b, c, d drawn against a scale from 0 to 5]

Additive + molecular clock assumption
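A common way to test for ultrametricity, not given on the slide, is the three-point condition: in every triple of taxa, the two largest of the three distances are equal. The sketch below applies it to the two matrices shown so far; `is_ultrametric` is a name chosen for illustration.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """Three-point condition check on a symmetric distance matrix."""
    for i, j, k in combinations(range(len(D)), 3):
        smallest, middle, largest = sorted([D[i][j], D[i][k], D[j][k]])
        if abs(largest - middle) > tol:
            return False
    return True

# Ultrametric matrix from this slide (taxa a, b, c, d)
U = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
# Additive matrix from the previous slide
A = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
print(is_ultrametric(U))  # True
print(is_ultrametric(A))  # False: the triple a, b, c has distances 3, 6, 7
```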

Page 5

Methods

• Fitch-Margoliash
• UPGMA
• Neighbor-joining
• Many others

Page 6

Least squares trees

• Minimize

$Q = \sum_{i \neq j} w_{ij} (D_{ij} - d_{ij})^2$

over all trees

• Choice of weights $w_{ij}$:
  – Uniform: $w_{ij} = 1$
  – Fitch-Margoliash: $w_{ij} = 1/D_{ij}^2$
  – Others . . .
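To make the criterion concrete, here is a small sketch that evaluates Q for one candidate tree. It assumes D holds the observed distances and d the path lengths between leaves in the tree, which is the usual reading of these symbols. The function name, the `weighting` argument, and the perturbed observation are all hypothetical.

```python
def least_squares_q(D, d, weighting="uniform"):
    """Weighted squared error between observed (D) and tree (d) distances."""
    if weighting not in ("uniform", "fitch-margoliash"):
        raise ValueError("unknown weighting: " + weighting)
    n = len(D)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ordered pairs with i != j, so each pair counts twice
            w = 1.0 if weighting == "uniform" else 1.0 / D[i][j] ** 2
            q += w * (D[i][j] - d[i][j]) ** 2
    return q

# Tree distances: the additive matrix from the earlier slide (a, b, c, d)
tree_d = [[0, 6, 7, 14],
          [6, 0, 3, 10],
          [7, 3, 0, 9],
          [14, 10, 9, 0]]
# Hypothetical observations: same values except D_ab = 6.5
observed = [row[:] for row in tree_d]
observed[0][1] = observed[1][0] = 6.5

print(least_squares_q(observed, tree_d))  # 0.5
print(least_squares_q(observed, tree_d, "fitch-margoliash"))
```

A tree search would evaluate this quantity for many candidate topologies and branch lengths; only the scoring step is shown here.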

Page 7

Sarich's (1969) immunological distances

Page 8

Least squares tree for Sarich’s data

Page 9

Clustering Methods

• E.g., UPGMA and Neighbor-Joining
• A cluster is a set of taxa
• Interspecies distances translate into intercluster distances
• Clusters are repeatedly merged
  – “Closest” clusters merged first
  – Distances are recomputed after merging

Page 10

UPGMA

• Unweighted pair group method using arithmetic averages

• The distance between clusters Ci and Cj is

$D_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\ q \in C_j} D_{pq}$

• After merging Ci and Cj to create cluster Ck, define the distance from k to every other cluster r as

$D_{kr} = \frac{|C_i|\, D_{ir} + |C_j|\, D_{jr}}{|C_i| + |C_j|}$

Page 11

UPGMA: Initialization

1. Assign each sequence i to its own cluster Ci

2. Define one leaf (tip) of the tree for each sequence and place it at height 0

Page 12

UPGMA: Iteration

Repeat until only two clusters remain:

1. Choose the two clusters i and j with smallest Dij
2. Create a new cluster k, where Ck = Ci ∪ Cj
3. Compute Dkr for all r
4. Define a new node k with children i and j, and place it at height Dij/2
5. Add k to the current clusters and delete i and j

Let i and j be the remaining clusters. Place the root at height Dij/2.
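The initialization and iteration steps can be sketched in Python as follows. This is one possible arrangement, not a reference implementation: the function name `upgma` and the choice to return a list of merges are assumptions made for illustration. The last pass through the loop merges the two remaining clusters, which is the same as placing the root at height Dij/2.

```python
from itertools import combinations

def upgma(D, names):
    """UPGMA sketch: returns a list of (leaves in the new cluster, node height)."""
    # Initialization: every sequence is its own cluster (a leaf at height 0).
    members = {i: (names[i],) for i in range(len(names))}
    dist = {frozenset((i, j)): float(D[i][j])
            for i, j in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(members) > 1:
        # Step 1: choose the two clusters with the smallest distance.
        i, j = min(combinations(members, 2), key=lambda p: dist[frozenset(p)])
        d_ij = dist[frozenset((i, j))]
        # Step 2: the new cluster k is the union of Ci and Cj.
        k, next_id = next_id, next_id + 1
        ni, nj = len(members[i]), len(members[j])
        # Step 3: size-weighted average distance from k to every other cluster r.
        for r in members:
            if r not in (i, j):
                dist[frozenset((k, r))] = (
                    ni * dist[frozenset((i, r))] + nj * dist[frozenset((j, r))]
                ) / (ni + nj)
        # Steps 4 and 5: node k sits at height Dij/2 and replaces i and j.
        merges.append((tuple(sorted(members[i] + members[j])), d_ij / 2))
        members[k] = members.pop(i) + members.pop(j)
    return merges

# Ultrametric matrix from the earlier slide (taxa a, b, c, d)
U = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
for leaves, height in upgma(U, ["a", "b", "c", "d"]):
    print(leaves, height)
# ('a', 'b') 1.0
# ('a', 'b', 'c') 3.0
# ('a', 'b', 'c', 'd') 5.0
```

The final height of 5.0 agrees with the 0 to 5 scale of the tree drawn for that matrix.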

Page 13

UPGMA Example

Page 17

UPGMA tree for Sarich’s data

Page 18

A pitfall of UPGMA

• The algorithm produces an ultrametric tree: the distance from the root to any leaf is the same

• UPGMA assumes a constant molecular clock: all species accumulate mutations (evolve) at the same rate.

Page 19

UPGMA fails when the molecular clock assumption doesn’t hold

Page 20

Neighbor Joining

• Saitou and Nei, Molecular Biology and Evolution 4 (1987)
• Idea: Find a pair of leaves that are close to each other but far from other leaves
  – Implicitly finds a pair of neighboring leaves
• Advantages:
  – Works well for additive matrices and also for nonadditive ones
  – Does not have the molecular clock assumption

Page 21

Long branches must be handled carefully!

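[Figure: tree with three branches of length 0.1 and two branches of length 0.4]

Two of the leaves are closer to each other than to either of the other two. The obvious approach (merging the closest pair first) produces incorrect clusters!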

Page 22

Compensating for long edges

Introduce “correction terms”:

$u_i = \frac{1}{n-2} \sum_{j \neq i} D_{ij}$ (average dist. to other taxa)

“Corrected” distances:

$D'_{ij} = D_{ij} - u_i - u_j$

Distances are reduced for pairs that are far away from all other species: they may be close to each other.
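The effect of the correction can be checked on a small example. The sketch below assumes a hypothetical four-leaf tree ((A:0.1, B:0.4), (C:0.1, D:0.4)) with an internal edge of length 0.1. The branch lengths match the long-branch figure, but the leaf labels and topology are an assumption, since they are not recoverable from the transcript. Each correction term u_i is the sum of distances from i to all other taxa divided by n - 2.

```python
from itertools import combinations

def correction_terms(D):
    """u_i = (sum of distances from i to all other taxa) / (n - 2)."""
    n = len(D)
    return [sum(D[i][j] for j in range(n) if j != i) / (n - 2) for i in range(n)]

def corrected_distances(D):
    """Corrected distance D_ij - u_i - u_j for every pair i < j."""
    u = correction_terms(D)
    return {(i, j): D[i][j] - u[i] - u[j]
            for i, j in combinations(range(len(D)), 2)}

# Leaf order A, B, C, D; path lengths in the assumed tree
D = [[0.0, 0.5, 0.3, 0.6],
     [0.5, 0.0, 0.6, 0.9],
     [0.3, 0.6, 0.0, 0.5],
     [0.6, 0.9, 0.5, 0.0]]

raw_closest = min(combinations(range(4), 2), key=lambda p: D[p[0]][p[1]])
corrected = corrected_distances(D)
corrected_closest = min(corrected, key=corrected.get)

print(raw_closest)        # (0, 2): A and C, which are not neighbors
print(corrected_closest)  # one of the true neighbor pairs, A-B or C-D
```

In this example the raw minimum pairs the two short-branch leaves, while the corrected minimum falls on a pair that shares a parent.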

Page 23

Neighbor-joining

Repeat the following until only two leaves remain:

1. Choose i, j such that $D_{ij} - u_i - u_j$ is minimum

2. Define a new leaf k whose distances to i and j are

$d_{ik} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_i - u_j)$

$d_{jk} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_j - u_i)$

3. Compute the distance from k to every other leaf r

$D_{kr} = \frac{1}{2}(D_{ir} + D_{jr} - D_{ij})$

4. Delete i and j

Connect the 2 remaining leaves by a branch of length Dij
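The loop above can be sketched in Python as follows. This is an illustrative arrangement rather than a reference implementation: the function name, the `node0`, `node1`, ... labels for new internal nodes, and the edge-list return value are all assumptions. The correction terms u are recomputed on every pass with the current number of leaves n.

```python
from itertools import combinations

def neighbor_joining(D, names):
    """NJ sketch: returns a list of edges (node, node, branch length)."""
    nodes = list(names)
    dist = {a: {b: float(D[i][j]) for j, b in enumerate(names)}
            for i, a in enumerate(names)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        u = {a: sum(dist[a][b] for b in nodes if b != a) / (n - 2) for a in nodes}
        # Step 1: pair with the smallest corrected distance D_ij - u_i - u_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[p[0]][p[1]] - u[p[0]] - u[p[1]])
        d_ij = dist[i][j]
        # Step 2: new node k and its branch lengths to i and j.
        k = "node%d" % next_id  # hypothetical label; must not clash with names
        next_id += 1
        edges.append((i, k, 0.5 * d_ij + 0.5 * (u[i] - u[j])))
        edges.append((j, k, 0.5 * d_ij + 0.5 * (u[j] - u[i])))
        # Step 3: distance from k to every other leaf r.
        dist[k] = {k: 0.0}
        for r in nodes:
            if r not in (i, j):
                dist[k][r] = dist[r][k] = 0.5 * (dist[i][r] + dist[j][r] - d_ij)
        # Step 4: delete i and j; k takes their place.
        nodes = [r for r in nodes if r not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, dist[a][b]))  # connect the two remaining leaves
    return edges

# Additive matrix from the earlier slide (taxa a, b, c, d)
D = [[0, 6, 7, 14],
     [6, 0, 3, 10],
     [7, 3, 0, 9],
     [14, 10, 9, 0]]
for edge in neighbor_joining(D, ["a", "b", "c", "d"]):
    print(edge)
# ('a', 'node0', 5.0)
# ('b', 'node0', 1.0)
# ('c', 'node1', 1.0)
# ('d', 'node1', 8.0)
# ('node0', 'node1', 1.0)
```

Because the input matrix is additive, the path lengths in the resulting tree reproduce it exactly, for example a to d is 5 + 1 + 8 = 14.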

Page 24

NJ tree for Sarich’s data

Page 25

Computing distance matrices

• Based on sequence alignment
• Various possibilities:
  – Distance = average number of differences
  – Try different PAM matrices; distance = index of matrix that gives highest score
  – Feng and Doolittle: Based on alignment scores; roughly the ratio to the max possible score (see text)
• Read, e.g., PHYLIP documentation: http://evolution.genetics.washington.edu/phylip/general.html
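As an illustration of the first option, the sketch below computes the fraction of differing sites between aligned sequences and assembles a distance matrix. The function names, the toy sequences, and the decision to skip columns containing a gap character "-" are assumptions made for this example; real tools offer several gap-handling conventions.

```python
def p_distance(s, t):
    """Fraction of aligned, ungapped positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(s, t) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances."""
    return [[p_distance(s, t) for t in seqs] for s in seqs]

# Hypothetical aligned sequences
seqs = ["ACGTACGT",
        "ACGTACGA",
        "ACGAACGA"]
for row in distance_matrix(seqs):
    print(row)
# [0.0, 0.125, 0.25]
# [0.125, 0.0, 0.125]
# [0.25, 0.125, 0.0]
```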

Page 26

Distance correction

• The amount of evolutionary change is not linearly related to time

• Over a long period of time, a series of substitutions may bring us back to where we started

• Percentage difference may underestimate evolutionary time

Page 27

Jukes-Cantor Model

Page 28

Correcting for multiple substitutions in the JC model

$t = -\frac{3}{4} \ln\left(1 - \frac{4}{3} d\right)$
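A short numerical sketch of the Jukes-Cantor correction, t = -(3/4) ln(1 - (4/3) d). It assumes d is the observed fraction of differing sites and t is the corrected distance, which is the usual reading of the formula; the function name is illustrative. The correction is only defined for d below 3/4, the value at which two sequences look unrelated under this model.

```python
import math

def jukes_cantor(d):
    """Corrected distance t = -(3/4) * ln(1 - (4/3) * d)."""
    if not 0 <= d < 0.75:
        raise ValueError("observed difference must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)

for d in (0.05, 0.3, 0.6):
    print(d, round(jukes_cantor(d), 4))
```

For small d the corrected value stays close to d, and it grows increasingly faster than d as d approaches 0.75. This is the sense in which percentage difference underestimates evolutionary time.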

Page 29

Many other models!