Phylogenetic trees

Sushmita Roy
BMI/CS 576

www.biostat.wisc.edu/bmi576/
sroy@biostat.wisc.edu

Sep 23rd, 2014


Key concepts in this section

• What are phylogenies or phylogenetic trees?
  – Terminology such as extant, ancestral, branch point, branch length, orthologs, paralogs
• Why build phylogenetic trees?
• Algorithms to build phylogenetic trees
  – Distance-based methods
  – Parsimony methods
    • Minimize the number of changes
  – Probabilistic methods
    • Find the tree that best explains the data using probabilistic models


Readings

• Chapter 7
  – Sections 7.1-7.5


What are phylogenetic trees?

• A tree that describes evolutionary relationships among entities
  – Species, genes, strains
• This relationship is called a “phylogeny”
• Leaves represent extant (current-day) species
• Internal nodes represent ancestral species
• Phylogenetics:
  – The task of inferring the phylogenetic tree from observations of existing organisms


Why phylogenetic trees?

• Inform multiple sequence alignments
• Identify signatures of sequence conservation
• Understand how organisms are related
  – Do humans share a more recent common ancestor with chimpanzees or with gorillas?
• Ask how closely organisms are related
  – Humans and chimpanzees shared a common ancestor about 5 million years ago (mya)
• Understand how specific functions/traits have evolved
  – What made us human?
• Conjecture the fate of specific regions of the genome
  – Will the human Y chromosome disappear?


From http://tellapallet.com/tree_of_life.htm

The tree of life aims to represent the phylogeny of all species on Earth


Tracing the evolution of the Ebola virus

• Ebola virus: a lethal human pathogen, fatality rate 78%
• Ebola is spreading now in Africa
  – Until recently the largest known outbreak was in 1976 (318 cases)
  – This year’s outbreak was reported in Feb 2014
  – As of 19 Aug 2014, 1229 deaths have been reported
    • Largest known outbreak in history
• Key questions
  – Where did the pathogen come from?
  – How is it evolving?
• In a 2014 Science paper, researchers reported a whole-genome sequence alignment of 78 Ebola virus samples


Phylogenetic tree of the Ebola virus

Gire et al., Science 2014

Three recent outbreaks from the same ancestor


Insights gained from sequence comparison

• “Genetic similarity across the sequenced 2014 samples suggests a single transmission from the natural reservoir, followed by human-to-human transmission during the outbreak”

• “..data suggest that the Sierra Leone outbreak stemmed from the introduction of two genetically distinct viruses from Guinea around the same time…”

• “..the catalog of 395 mutations, including 50 fixed nonsynonymous changes with 8 at positions with high levels of conservation across ebola viruses, provides a starting point for such studies”

Gire et al., Science 2014


Phylogenetic tree basics

• Leaves represent the entities (genes, species, individuals/strains) being compared
  – The term taxon (plural: taxa) refers to these entities when they represent species and broader classifications of organisms
  – For example, if the taxa are species, the tree is a species tree
• Internal nodes are ancestral units
• Phylogenetic trees can be rooted or unrooted
  – The root represents the common ancestor
• In a rooted tree, the path from the root to a node represents an evolutionary path
  – Gives directionality to evolutionary time
• An unrooted tree specifies relationships among taxa, but not the direction of descent from a common ancestor


Tree basics

[Figure: an unrooted tree (nodes 1-8) and a rooted tree (nodes 1-9), with a branch, a leaf node, an internal node, and a branch length labeled.]

• Leaf node: extant
• Internal node: ancestral
  – For a species tree, internal nodes represent speciation events
• Each tree topology represents a different evolutionary history
• Branch length describes the evolutionary divergence between two nodes


Orthologs and paralogs

• Orthologs:
  – Two sequences in two species that have a common ancestor
  – Diverged due to a speciation event
  – Used to create a “species tree”
• Paralogs:
  – Two sequences in the same species that arose from a gene duplication event
  – Captured in a “gene tree”


Tree counting

• A rooted tree with n leaf nodes has
  – n-1 internal nodes
  – 2n-2 edges/branches
• An unrooted tree with n leaf nodes has
  – n-2 internal nodes
  – 2n-3 edges/branches
  – A root can be added to any of these branches to give 2n-3 rooted trees for any unrooted tree
• E.g. for n=3 there is one unrooted tree and three rooted trees


Tree counting

[Figure: an unrooted tree with leaves 1, 2, 3; the possible positions for the root, one on each branch; and the three resulting rooted trees.]


Tree counting

• Instead of adding a root, we could add a branch for the (n+1)th taxon

[Figure: the unrooted tree with leaves 1, 2, 3 and the unrooted trees obtained by attaching a fourth leaf, 4, to one of its branches.]


Tree counting

• A tree with 3 leaves can be grown in (2*3)-3=3 ways to make a tree with 4 leaves
• Each tree with 4 leaves can be grown in (2*4)-3=5 ways to make a tree with 5 leaves
  – So we have 3*5 trees with 5 leaves
• Each tree with 5 leaves can be grown in (2*5)-3=7 ways to make a tree with 6 leaves
  – So we have 3*5*7 trees with 6 leaves
• In general, for n leaves we can have
  – (1)*(3)*(5)*...*(2n-5) unrooted trees


Tree counting

• This grows very fast
  – For n=10, we have about 2 million unrooted trees
  – For n=20, we have about 2.2*10^20
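The counting argument can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the lecture. It multiplies the number of branches available at each growth step, then multiplies by 2n-3 root positions to count rooted trees.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1  # exactly one unrooted tree on 3 leaves
    # Growing from k to k+1 leaves: the new leaf can attach to any of 2k-3 branches.
    for k in range(3, n):
        count *= 2 * k - 3
    return count


def num_rooted_trees(n: int) -> int:
    """A root can be placed on any of the 2n-3 branches of each unrooted tree."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Python integers have arbitrary precision, so the n=20 count is computed exactly rather than as a floating-point approximation.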


Constructing phylogenetic trees

• Phylogenetic tree construction
  – Given observations of n taxonomic units, infer the tree that best describes the evolutionary relationships among the units
• Three types of methods
  – Distance-based methods
  – Parsimony methods
  – Probabilistic approaches


Distance-based methods for phylogenetic tree reconstruction

• Given an n x n distance matrix for n units, construct the tree for these n units
• Algorithms
  – UPGMA
  – Neighbor joining
• Assume additivity and sometimes a “molecular clock”
• Additivity means that adding up the branch lengths along the path connecting two nodes gives the distance between them


Defining distance between sequences

• Fractional alignment mismatch for two sequences i and j
  – p_ij = m_ij / L_ij
    • Gives an estimate of changes per site
  – m_ij: number of mismatches
  – L_ij: number of aligned positions compared
  – Assumes that each site has changed at most once
    • Underestimates the distance between sequences
  – Assumes all sequences change at the same rate
• Jukes-Cantor distance
  – The simplest evolutionary distance d_ij between sequences i and j, where p_ij is the fractional mismatch:
    $d_{ij} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p_{ij}\right)$
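Both distances are easy to compute from a pairwise alignment. The sketch below assumes the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The function names, the choice to skip gapped columns, and the two toy sequences are assumptions made for illustration.

```python
import math


def p_distance(seq_i: str, seq_j: str) -> float:
    """Fraction of compared alignment columns at which the two sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap ('-') in either sequence are not compared.
    pairs = [(a, b) for a, b in zip(seq_i, seq_j) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for a mismatch fraction p."""
    if p >= 0.75:
        return math.inf  # the logarithm is undefined once p reaches 3/4
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (hypothetical data): 2 mismatches in 10 columns.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, reflecting that the raw mismatch fraction underestimates the number of changes.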


UPGMA algorithm for phylogenetic tree reconstruction

• UPGMA: Unweighted pair group method using arithmetic averages
• Represent all sequences as the leaf nodes of a tree
• Merge the two closest nodes at a time to create a new node in the tree
  – Set the new node at a height determined by the nodes being merged
  – Recompute the distance between the new node and all other nodes
• Leaf nodes have one sequence
• Intermediate nodes have multiple sequences
• We will call the sequences associated with an intermediate node i the cluster C_i
• Need to compute
  – Distance between two clusters of sequences
  – Height


Computing distance between clusters

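• Let i and j be two nodes
• Let C_i be the cluster of sequences for node i
• Let C_j be the cluster of sequences for node j
• |C_i|, |C_j|: number of sequences in C_i and C_j
• Distance between nodes i and j: the average distance between pairs of sequences, one from each cluster
  $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$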


Computing distance from a new node

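• Let k be a new node to be created from merging i and j
• Let C_i be the cluster of sequences for node i
• Let C_j be the cluster of sequences for node j
• Distance d_kl between nodes k and l, l != i and l != j
  $d_{kl} = \frac{1}{|C_k||C_l|} \sum_{p \in C_k,\, q \in C_l} d_{pq}$, where $C_k = C_i \cup C_j$
• This is equal to
  $d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|}$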


UPGMA algorithm

• Input
  – n sequences
  – Distance matrix for all pairs of n sequences, d_ij
• Output
  – Tree T
• Initialization
  – Assign each sequence i to its own cluster C_i
  – Define one leaf of T for each sequence, placed at height 0
• Iterate until only two clusters remain
  – Find the two clusters C_i and C_j that have the smallest d_ij
  – Define a new cluster C_k = C_i U C_j
  – Define a node k with daughters i and j, and place it at height d_ij/2
  – Add k to the set of clusters; remove i and j from it
  – Compute the distance d_kl from k to every remaining cluster l
• Terminate
  – When only two clusters C_i and C_j remain, place the root at height d_ij/2
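The steps above can be sketched in Python. This is an illustrative sketch, not the course's reference code: the function name, the nested-tuple tree representation, and the tie-breaking rule (first minimum found) are assumptions. Cluster distances are updated with the size-weighted average d_kl = (|C_i| d_il + |C_j| d_jl) / (|C_i| + |C_j|), and the loop runs until a single cluster remains, which is equivalent to placing the root when two clusters are left. The input is the five-sequence matrix from the UPGMA example.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA sketch. `dist` maps frozenset({a, b}) to the distance between leaves a and b.

    Returns (tree, height): tree is a nested tuple of leaf names, height is the
    height of the root (leaves sit at height 0).
    """
    # Each cluster is keyed by its subtree; the value is (number of leaves, node height).
    clusters = {name: (1, 0.0) for name in names}
    d = {frozenset(pair): float(dist[frozenset(pair)]) for pair in combinations(names, 2)}

    while len(clusters) > 1:
        # Find the two clusters with the smallest distance (ties: first pair found).
        i, j = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        size_i, _ = clusters.pop(i)
        size_j, _ = clusters.pop(j)
        k = (i, j)  # new node whose daughters are i and j
        # Size-weighted average gives the distance from k to every other cluster.
        for other in clusters:
            d[frozenset((k, other))] = (
                size_i * d[frozenset((i, other))] + size_j * d[frozenset((j, other))]
            ) / (size_i + size_j)
        clusters[k] = (size_i + size_j, d_ij / 2)

    tree, (_, height) = next(iter(clusters.items()))
    return tree, height


names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 3,
    ("B", "C"): 3, ("B", "D"): 8, ("B", "E"): 8,
    ("C", "D"): 8, ("C", "E"): 8, ("D", "E"): 5,
}
dist = {frozenset(pair): value for pair, value in upper.items()}
tree, height = upgma(names, dist)
print(tree, height)
```

Scanning all pairs at every step keeps the sketch short; a practical implementation would maintain the distance matrix more efficiently.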


UPGMA example

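Initial state:

     A  B  C  D  E
A    0  8  8  5  3
B       0  3  8  8
C          0  8  8
D             0  5
E                0

After one merge (A and E joined at height 3/2 = 1.5):

     AE  B  C  D
AE    0  8  8  5
B        0  3  8
C           0  8
D              0

Example calculation: d(AE, B) = (d(A, B) + d(E, B)) / 2 = (8 + 8) / 2 = 8

[Figures: the tree over leaves A, E, D, B, C in the initial state and after one merge, drawn against an axis marked 1 to 4.]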


UPGMA example (cont.)

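After two merges (B and C joined at height 3/2 = 1.5):

     AE  BC  D
AE    0   8  5
BC        0  8
D            0

After three merges (AE and D joined at height 5/2 = 2.5):

      AED  BC
AED     0   8
BC          0

Final state: AED and BC joined at the root, at height 8/2 = 4

[Figures: the tree over leaves A, E, D, B, C after two merges, after three merges, and in the final state, drawn against an axis marked 1 to 4.]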


UPGMA relies on the molecular clock assumption

• Sequences diverge at the same rate at different points in the phylogeny
• Distance from any leaf to the root is the same
• If this is true, the distances are said to have the “ultrametric” property
• This assumption is rarely true in practice


The molecular clock assumption & ultrametric data

• Ultrametric data: for any triplet of sequences i, j, k, the three pairwise distances d_ij, d_jk, d_ik are either all equal, or two are equal and the remaining one is smaller

     A  B  C  D  E
A    0  8  8  5  3
B       0  3  8  8
C          0  8  8
D             0  5
E                0

[Figure: the tree over leaves A, E, D, B, C drawn against an axis marked 1 to 4.]
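The triplet condition can be tested directly on a distance matrix. A minimal sketch follows; the function name and the tolerance parameter are assumptions. For each triplet it sorts the three distances and checks that the two largest are equal. The second matrix is a hypothetical perturbation in which d(A, D) is changed from 5 to 6.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Three-point condition: in every triplet, the two largest distances are equal."""
    for i, j, k in combinations(names, 3):
        _, middle, largest = sorted(
            [d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))]]
        )
        if abs(largest - middle) > tol:
            return False
    return True


names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 8, ("A", "C"): 8, ("A", "D"): 5, ("A", "E"): 3,
    ("B", "C"): 3, ("B", "D"): 8, ("B", "E"): 8,
    ("C", "D"): 8, ("C", "E"): 8, ("D", "E"): 5,
}
clock_like = {frozenset(pair): value for pair, value in upper.items()}

# Hypothetical perturbation: the triplet (A, D, E) now has distances 6, 3, 5.
perturbed = dict(clock_like)
perturbed[frozenset(("A", "D"))] = 6

print(is_ultrametric(names, clock_like), is_ultrametric(names, perturbed))
```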


Problems with the molecular clock assumption

[Figure: the actual tree over leaves 1, 2, 3, 4 compared with the tree constructed by UPGMA.]


Neighbor joining

• The assumption about the ultrametric property is too strong
  – Most sequences diverge at different rates
• A more relaxed requirement is that of additivity
  – The distance between a pair of species/nodes is equal to the sum of the branch lengths on the path connecting them
• Uses a similar idea to UPGMA to construct trees
  – That is, it considers pairs of nodes and joins them
• Produces unrooted trees


How to select nodes for joining?

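• Given all pairwise distances for n sequences
• Let d_ij denote the distance between nodes i and j
• Should we select the node pair with the smallest d_ij?

[Figure: a four-leaf tree over A, B, C, D with two branches of length 0.4 and three branches of length 0.1.]

For this tree, selecting the pair with the smallest d_ij will give us an incorrect tree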


Selecting nodes to join

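• Neighbor joining requires us to correct the distance to account for distances from all other nodes
• The corrected distance is denoted as D_ij
  $D_{ij} = d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$
• r_i: average distance from all other leaves
• L: the set of leaves; |L|: the number of leaves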
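A small numeric check shows why the correction matters. The sketch assumes the usual neighbor-joining definitions D_ij = d_ij - (r_i + r_j) with r_i = (sum of d_ik over the other leaves k) / (|L| - 2). The four-leaf tree is a hypothetical example chosen for illustration: A and B are neighbors with branch lengths 0.1 and 0.4, C and D are neighbors with branch lengths 0.1 and 0.4, and the internal branch has length 0.1.

```python
from itertools import combinations

# Path lengths between leaves in the hypothetical tree described above.
leaves = ["A", "B", "C", "D"]
d = {
    frozenset("AB"): 0.5, frozenset("AC"): 0.3, frozenset("AD"): 0.6,
    frozenset("BC"): 0.6, frozenset("BD"): 0.9, frozenset("CD"): 0.5,
}


def corrected_distances(leaves, d):
    """Return D_ij = d_ij - (r_i + r_j) for every pair of leaves."""
    n = len(leaves)
    # r_i: sum of distances from i to all other leaves, divided by |L| - 2.
    r = {
        i: sum(d[frozenset((i, k))] for k in leaves if k != i) / (n - 2)
        for i in leaves
    }
    return {
        frozenset((i, j)): d[frozenset((i, j))] - r[i] - r[j]
        for i, j in combinations(leaves, 2)
    }


D = corrected_distances(leaves, d)
closest_raw = min(d, key=d.get)        # pair with the smallest d_ij
closest_corrected = min(D, key=D.get)  # pair with the smallest D_ij
print(sorted(closest_raw), sorted(closest_corrected))
```

In this example the smallest raw distance is between A and C, which are not neighbors, while the smallest corrected distance belongs to a true neighbor pair (A, B and C, D tie).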


Defining the distance to a new node

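[Figure: nodes i and j are joined at a new node k; m is another existing node, and the distance d_km is to be determined.]

Given d_ij, d_im, d_jm, how do we calculate the distance from an existing node m to the new node k?

Under additivity, d_im = d_ik + d_km, d_jm = d_jk + d_km, and d_ij = d_ik + d_jk, so
$d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right)$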


Updating distances in neighbor joining

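• Calculate the distance from a leaf to its parent node so that we take into account the distance to all other leaves
  $d_{ik} = \frac{1}{2}\left(d_{ij} + r_i - r_j\right)$ and $d_{jk} = d_{ij} - d_{ik}$
  where
  $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$
  and L is the set of leaves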


Algorithm for NJ

• Initialization
  – Let T be the set of leaf nodes
  – L = T
  – Estimate r_i for all i in L
  – Estimate D_ij
• Iteration
  – Pick a pair i, j from L such that D_ij is smallest
  – Define a new node k
  – Estimate d_ik and d_jk; add an edge between k and i, and between k and j
  – Estimate d_km for all other nodes m in L
  – Add k to T and to L; remove i and j from L
  – Re-estimate r_m and D_mn for all nodes m, n in L
• Terminate
  – If L has two nodes, add the edge between these two


An example with neighbor joining

• Consider 5 sequences: A, B, C, D, E
• Distance matrix

     B   C   D   E
A    5   4   9   8
B        5  10   9
C            7   6
D                7

• Let us infer the tree using the neighbor joining algorithm
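One way to work through this example is with a short Python sketch of neighbor joining. It assumes the standard update rules: D_ij = d_ij - (r_i + r_j), d_ik = (d_ij + r_i - r_j) / 2, d_jk = d_ij - d_ik, and d_km = (d_im + d_jm - d_ij) / 2. The function name, the edge-list output, and the internal node labels (`node0`, `node1`, ...) are hypothetical choices for illustration.

```python
from itertools import combinations


def neighbor_joining(leaves, dist):
    """Neighbor-joining sketch.

    `dist` maps frozenset({a, b}) to a distance. Returns a list of edges
    (node, node, branch_length) describing an unrooted tree.
    """
    d = {pair: float(value) for pair, value in dist.items()}
    active = list(leaves)  # the set L of nodes still to be joined
    edges = []
    next_id = 0

    while len(active) > 2:
        n = len(active)
        # r_i: sum of distances to the other active nodes, divided by |L| - 2.
        r = {
            i: sum(d[frozenset((i, m))] for m in active if m != i) / (n - 2)
            for i in active
        }
        # Pick the pair with the smallest corrected distance D_ij.
        i, j = min(
            combinations(active, 2),
            key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        k = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Replace i and j by k, with distances from k to every remaining node.
        active = [m for m in active if m not in (i, j)]
        for m in active:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij
            )
        active.append(k)

    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


leaves = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 9, ("A", "E"): 8,
    ("B", "C"): 5, ("B", "D"): 10, ("B", "E"): 9,
    ("C", "D"): 7, ("C", "E"): 6, ("D", "E"): 7,
}
dist = {frozenset(pair): value for pair, value in upper.items()}
edges = neighbor_joining(leaves, dist)
for u, v, length in edges:
    print(u, v, length)
```

Because this matrix is additive, the path lengths in the returned tree reproduce the input distances; which internal label a node receives depends on how ties in D_ij are broken.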


Can we check for additivity?

• Check for additivity: for four leaves i, j, k, l with distances d_ij, d_ik, d_il, d_jk, d_jl, d_kl
• The three sums of two distances
  – d_ij + d_kl
  – d_ik + d_jl
  – d_il + d_jk
• should be such that two of these are equal, and larger than (or equal to) the third

[Figure: four leaves i, j, k, l shown with the three ways of pairing them.]
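The four-leaf check is straightforward to code. A minimal sketch follows; the function name and tolerance are assumptions. For every quartet it forms the three pairwise sums and checks that the two largest are equal. The first matrix is the five-sequence neighbor joining example; the second is a hypothetical perturbation with d(A, D) changed from 9 to 8.

```python
from itertools import combinations


def is_additive(names, d, tol=1e-9):
    """Four-point condition: in every quartet, the two largest pairwise sums are equal."""
    def dist(a, b):
        return d[frozenset((a, b))]

    for i, j, k, l in combinations(names, 4):
        sums = sorted([
            dist(i, j) + dist(k, l),
            dist(i, k) + dist(j, l),
            dist(i, l) + dist(j, k),
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


names = ["A", "B", "C", "D", "E"]
upper = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 9, ("A", "E"): 8,
    ("B", "C"): 5, ("B", "D"): 10, ("B", "E"): 9,
    ("C", "D"): 7, ("C", "E"): 6, ("D", "E"): 7,
}
additive = {frozenset(pair): value for pair, value in upper.items()}

# Hypothetical perturbation: quartet (A, B, C, D) now has sums 12, 14, 13.
perturbed = dict(additive)
perturbed[frozenset(("A", "D"))] = 8

print(is_additive(names, additive), is_additive(names, perturbed))
```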


Comparing NJ and UPGMA

• UPGMA
  – Rooted tree
  – Assumptions: molecular clock (ultrametric distances) and additivity
• NJ
  – Unrooted tree
  – Assumption: additivity


Rooting a tree

• An unrooted tree can be converted to a rooted tree using an outgroup species
• Outgroup: a species known to be more distantly related to each of the other species than they are to one another
• Find the branch at which the outgroup joins the rest of the tree
• That branch gives the position of the root

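[Figure: an unrooted tree (nodes 1-8) with leaf 1 as the outgroup and the candidate root marked, and the resulting rooted tree with the outgroup branching off at the root.]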