new methods for estimating species trees from gene trees

New methods for estimating species trees from gene trees Tandy Warnow March 12, 2012



Page 1: New methods for estimating species trees from gene trees

New methods for estimating species trees

from gene trees

Tandy Warnow

March 12, 2012

Page 2: New methods for estimating species trees from gene trees

Orangutan Gorilla Chimpanzee Human

From the Tree of Life Website, University of Arizona

Phylogeny (evolutionary tree)

Page 3: New methods for estimating species trees from gene trees

DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving from the ancestral sequence AAGACTT, with time points -3 mil yrs, -2 mil yrs, -1 mil yrs, and today. Intermediate sequences: AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT. Present-day sequences: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]

Page 4: New methods for estimating species trees from gene trees

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Page 5: New methods for estimating species trees from gene trees

Phase 1: Multiple Sequence Alignment

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Page 6: New methods for estimating species trees from gene trees

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

[Figure: tree with leaves S1, S4, S2, S3]

Page 7: New methods for estimating species trees from gene trees

Progress on Gene Tree and Alignment Estimation

• Statistical performance of phylogeny estimation methods

• Co-estimation of alignments and trees (SATé)

• “Alignment-free” phylogeny estimation (DACTAL)

• Phylogenetic analysis and alignment of NGS data (SEPP)

• Taxon identification of short reads from same gene (metagenomic analysis) (TIPP)

Tomorrow’s talk will cover SATé, SEPP, and TIPP

Page 8: New methods for estimating species trees from gene trees

Single gene vs. multi-gene analyses

• Most methods analyze single genes (or other genomic regions). These produce estimated “gene trees”.

• But species trees are estimated using multiple genes.

Page 9: New methods for estimating species trees from gene trees

Multi-gene analyses

After alignment of each gene dataset:

• Combined analysis: Concatenate (“combine”) alignments for different genes, and run phylogeny estimation methods

• Supertree: Compute a tree on each gene alignment, then combine the gene trees

Page 10: New methods for estimating species trees from gene trees

Not all genes present in all species

gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA
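As a rough illustration of what a combined analysis has to do when genes are missing from some species, the sketch below concatenates the three gene alignments and pads absent genes with "?". It assumes the sequences on this slide pair with the species labels in the order they are listed; the function name `concatenate` and the choice of "?" as the missing-data symbol are assumptions made for this example only.

```python
# Gene alignments as read from the slide (species -> aligned sequence).
genes = {
    "gene 1": {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
               "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    "gene 2": {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
               "S7": "GCTAAACCTC"},
    "gene 3": {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
               "S7": "CATTCATACC", "S8": "TAGTGATGCA"},
}


def concatenate(genes, missing="?"):
    """Build one row per species; genes a species lacks are filled with `missing`."""
    species = sorted({s for aln in genes.values() for s in aln})
    rows = {s: "" for s in species}
    for name in sorted(genes):
        aln = genes[name]
        width = len(next(iter(aln.values())))  # all rows of one gene share a width
        for s in species:
            rows[s] += aln.get(s, missing * width)
    return rows


supermatrix = concatenate(genes)
for s, row in supermatrix.items():
    print(s, row)
```

Species S5, for example, ends up with data only in the middle block, which is the situation the slide title describes.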

Page 11: New methods for estimating species trees from gene trees

Two competing approaches

[Figure: sequence data for gene 1, gene 2, ..., gene k across the species. One approach analyzes each gene separately and then applies a supertree method; the other runs a combined analysis.]

Page 12: New methods for estimating species trees from gene trees

Constructing trees from subtrees

Let T|A denote the induced subtree of T on the leafset A

[Figure: a tree T on leaves a, b, c, d, e, f, and the induced subtree T|{a,c,d,f}]

Question: given induced subtrees of T for many subsets of taxa -- can you produce the tree T?
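A minimal sketch of the induced-subtree operation T|A, assuming a rooted tree written as nested tuples of leaf names. The tree used below is a hypothetical rooted example on the same six leaves, not necessarily the tree drawn on the slide, and `restrict` is a name chosen for this illustration.

```python
def restrict(tree, keep):
    """Return the subtree induced by the leaves in `keep`, or None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # suppress a node left with a single child
    return tuple(kids)


# Hypothetical rooted tree on leaves a..f.
T = ((("a", "b"), "c"), ("f", ("d", "e")))
print(restrict(T, {"a", "c", "d", "f"}))
```

The supertree question on this slide is the inverse problem: recover T when only several such restrictions are known.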

Page 13: New methods for estimating species trees from gene trees

Supertree estimation

Challenges:

• Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)

• Estimated subtrees have error

Advantages:

• Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)

• Can use different types of data for each gene

Page 14: New methods for estimating species trees from gene trees

Many Supertree Methods

• MRP
• weighted MRP
• MRF
• MRD
• Robinson-Foulds Supertrees
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• QMC
• Q-imputation
• SDM
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

MRP: Matrix Representation with Parsimony (most commonly used and most accurate)

Page 15: New methods for estimating species trees from gene trees

Quantifying topological error

[Figure: True Tree and Estimated Tree, each on leaves a, b, c, d, e, f]

• False positive (FP): $b \in B(T_{est}) - B(T_{true})$

• False negative (FN): $b \in B(T_{true}) - B(T_{est})$
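The two error counts can be sketched with plain set differences over bipartitions. The example below assumes single-letter leaf names and writes each bipartition as a string such as "ab|cdef"; both trees are hypothetical and `split` and `fp_fn` are names invented for this illustration.

```python
def split(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])


def fp_fn(true_splits, est_splits):
    """Return (number of false positives, number of false negatives)."""
    true_set = {split(s) for s in true_splits}
    est_set = {split(s) for s in est_splits}
    false_pos = est_set - true_set   # in the estimate, not in the true tree
    false_neg = true_set - est_set   # in the true tree, missing from the estimate
    return len(false_pos), len(false_neg)


# Hypothetical example: the estimate swaps leaves c and d.
true_tree = ["ab|cdef", "abc|def", "abcf|de"]
est_tree = ["ab|cdef", "abd|cef", "abdf|ce"]
print(fp_fn(true_tree, est_tree))
```

Dividing each count by the number of internal edges of the corresponding tree gives the FP and FN rates.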

Page 16: New methods for estimating species trees from gene trees

FN rate of MRP vs. combined analysis

[Figure: plot against Scaffold Density (%)]

Page 17: New methods for estimating species trees from gene trees

SuperFine-boosting: improves accuracy of MRP

[Figure: plot against Scaffold Density (%)]

(Swenson et al., Syst. Biol. 2012)

Page 18: New methods for estimating species trees from gene trees

SuperFine

• First, construct a supertree with low false positives

The Strict Consensus

• Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP)

Quartet Max Cut

Page 19: New methods for estimating species trees from gene trees

Obtaining a supertree with low FP

The Strict Consensus Merger (SCM)

SCM of two trees

• Computes the strict consensus on the common leaf set

• Then superimposes the two trees, contracting more edges in the presence of “collisions”
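The first SCM step, the strict consensus on the common leaf set, can be sketched as follows. This is only that first step, not the full merger, and it assumes each unrooted tree is given by its leaf set plus one side of every internal bipartition. The two trees and the helper names are hypothetical.

```python
def norm(side_a, side_b):
    """Canonical form of a bipartition as an unordered pair of leaf sets."""
    return frozenset([frozenset(side_a), frozenset(side_b)])


def restrict_splits(leaves, sides, common):
    """Restrict every bipartition to `common`; drop those that become trivial."""
    out = set()
    for side in sides:
        a = set(side) & common
        b = (leaves - set(side)) & common
        if len(a) >= 2 and len(b) >= 2:
            out.add(norm(a, b))
    return out


def strict_consensus_on_common(leaves1, sides1, leaves2, sides2):
    """Bipartitions of the common leaf set that both restricted trees share."""
    common = leaves1 & leaves2
    return (restrict_splits(leaves1, sides1, common)
            & restrict_splits(leaves2, sides2, common))


# Hypothetical source trees sharing the leaves a, b, c, d.
tree1 = (set("abcdefg"), ["ab", "abcd", "fg"])
tree2 = (set("abcdhij"), ["ab", "abcd", "ij"])
tree3 = (set("abcdhij"), ["ac", "abcd", "ij"])
print(strict_consensus_on_common(*tree1, *tree2))
```

When the two trees disagree on the common leaves, as `tree1` and `tree3` do, the consensus loses that edge, which is one reason SCM output tends to be poorly resolved.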

Page 20: New methods for estimating species trees from gene trees

Strict Consensus Merger (SCM)

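[Figure: SCM example. Two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their strict consensus on the common leaves a, b, c, d into a supertree on leaves a through j.]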

Page 21: New methods for estimating species trees from gene trees

Performance of SCM

• Low false positive (FP) rate (estimated supertree has few false edges)

• High false negative (FN) rate (estimated supertree is missing many true edges)

Page 22: New methods for estimating species trees from gene trees

Theoretical results for SCM

• SCM can be computed in polynomial time

• For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem

• All splits in the SCM “appear” in at least one source tree (and project onto each source tree)

Page 23: New methods for estimating species trees from gene trees

Resolving a single polytomy, v, using MRP

• Step 1: Reduce each source tree to a tree on leafset, {1,2,...,d} where d=degree(v)

• Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}

• Step 3: Replace the star tree at v by tree t

Page 24: New methods for estimating species trees from gene trees

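Part 1 of SuperFine

[Figure: SCM example. Two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their strict consensus on the common leaves a, b, c, d into a supertree on leaves a through j.]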

Page 25: New methods for estimating species trees from gene trees

Part 2 of SuperFine

[Figure: the SCM tree on leaves a through j with its polytomy; the subtrees around the polytomy are numbered 1 to 6, and the leaves of the two source trees are relabelled with those numbers.]

Page 26: New methods for estimating species trees from gene trees

Theorem

Given
– a set of source trees,
– SCM tree T,
– and a polytomy in T,

after relabelling and reducing, each source tree has at most one leaf with each label.

Page 27: New methods for estimating species trees from gene trees

Step 2: Apply MRP to the collection of reduced source trees

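[Figure: MRP applied to the two reduced source trees, one on labels 1, 2, 3, 4 and one on labels 1, 4, 5, 6, produces a tree on labels 1 to 6.]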

Page 28: New methods for estimating species trees from gene trees

Replace polytomy using tree from MRP

[Figure: the polytomy in the SCM tree is replaced by the MRP tree on labels 1 to 6, giving a more resolved supertree on leaves a through j.]

Page 29: New methods for estimating species trees from gene trees

SuperFine-boosting: improves accuracy of MRP

[Figure: plot against Scaffold Density (%)]

(Swenson et al., Syst. Biol. 2012)

Page 30: New methods for estimating species trees from gene trees

SuperFine is also much faster

MRP 8-12 sec. SuperFine 2-3 sec.

[Figure: three panels, each plotted against Scaffold Density (%)]

Page 31: New methods for estimating species trees from gene trees

Limitations of Supertree Methods

• Traditional supertree methods assume that the true gene trees match the true species tree.

• This is known to be unrealistic in some situations, due to processes such as
– Deep coalescence (“incomplete lineage sorting”)
– Gene duplication and loss
– Horizontal gene transfer

Page 32: New methods for estimating species trees from gene trees

Multiple populations/species

Present

Past

Courtesy James Degnan

Page 33: New methods for estimating species trees from gene trees

Gene tree in a species tree

Courtesy James Degnan

Page 34: New methods for estimating species trees from gene trees

Deep Coalescence

• Population-level process, also called “Incomplete Lineage Sorting”

• Gene trees can differ from species trees due to short times between speciation events (population size also impacts this probability)

• Causes difficulty in estimating some species trees (such as human-chimp-gorilla)

Page 35: New methods for estimating species trees from gene trees

Orangutan Gorilla Chimpanzee Human

From the Tree of Life Website, University of Arizona

Phylogeny (evolutionary tree)

Page 36: New methods for estimating species trees from gene trees

MDC Problem

• MDC (minimize deep coalescence) problem:

– given set of true gene trees, find the species tree that implies the fewest deep coalescence events

• Posed by Wayne Maddison, Syst Biol 1997

Page 37: New methods for estimating species trees from gene trees

Counting deep coalescences


Page 38: New methods for estimating species trees from gene trees

Extra Lineages XL(T,t)

• T is the species tree

• t is the gene tree

• XL(T,t): the number of extra lineages, under the best embedding of t into T

Page 39: New methods for estimating species trees from gene trees

Two MDC problems

Score pair of trees:

• Input: rooted binary gene tree t and species tree T

• Output: XL(T,t)

Find best species tree:

• Input: set X of rooted, binary gene trees on set S

• Output: species tree T on S that minimizes $XL(T,X) = \sum_{t \in X} XL(T,t)$.

Page 40: New methods for estimating species trees from gene trees

Limitations of methods for MDC

Current methods typically assume

• input gene trees are correct, binary, rooted trees containing all the taxa

But

• Estimated gene trees are usually partially incorrect, are often unrooted, and may not be complete.

• Assuming all gene tree incompatibility is due to deep coalescence is likely problematic.

Page 41: New methods for estimating species trees from gene trees

Minimizing Deep Coalescence (MDC)

• Than and Nakhleh (PLoS Comp Biol 2009): algorithms for MDC which assume all gene trees are correct, rooted, binary trees.

• Yu, Warnow, and Nakhleh (RECOMB 2011 and J Comp Biol 2011) extend T&N 2009 to handle estimated gene trees that are unrooted and have errors.

• Bayzid and Warnow (J Comp Biol, in press) extend T&N 2009 to handle incomplete gene trees.

Page 42: New methods for estimating species trees from gene trees

Search: main results in T&N 2009

• Theorem: Let X be a set of k rooted binary gene trees on taxon set S, and let C be a set of subsets of the taxon set. Then a species tree T that optimizes MDC with $Clusters(T) \subseteq C$ can be found in time that is polynomial in |C|, n, and k, where n = |S|.

• Exact MDC: Let C be all possible subsets of S

• “Heuristic” MDC: Let C be the set of “clusters” of the input gene trees (where a cluster is the set of leaves below a node in a tree)

Page 43: New methods for estimating species trees from gene trees

T&N 2009: B-maximal clusters and $k_B(t)$

T is a species tree, and t is a gene tree, both rooted and binary

Definitions

• B is a cluster of T

• Y is a B-maximal cluster in t if (i) Y is a cluster of t, (ii) $Y \subseteq B$, and (iii) $Y \not\subset Z$ for any other cluster Z of t such that $Z \subseteq B$.

• $k_B(t)$ is the number of B-maximal clusters in t

Page 44: New methods for estimating species trees from gene trees

Calculating XL(T,t)

Lemma (T&N 2009): Let T be a binary species tree and t be a binary rooted gene tree. Then for an optimal embedding of t into T:

– $k_B(t)$ is the number of lineages on the edge “above” the subtree for B in T

– $XL(T,t) = \sum_B [k_B(t) - 1]$, where B ranges over the clusters of T.
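The lemma translates almost directly into code. The sketch below assumes rooted trees written as nested tuples of leaf names, with the gene tree and species tree on the same leaf set; it is a naive, unoptimized illustration and the function names are chosen for this example.

```python
def leaves(tree):
    """Leaf set below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of a rooted tree, including singletons and the root cluster."""
    out = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            out |= clusters(child)
    return out


def k(B, t):
    """k_B(t): number of B-maximal clusters of t (topmost clusters inside B)."""
    if leaves(t) <= B:
        return 1
    if isinstance(t, str):
        return 0
    return sum(k(B, child) for child in t)


def extra_lineages(T, t):
    """XL(T,t) as the sum over clusters B of T of k_B(t) - 1."""
    return sum(k(B, t) - 1 for B in clusters(T))


species = (("A", "B"), ("C", "D"))
print(extra_lineages(species, (("A", "C"), ("B", "D"))))
```

Singleton clusters and the root cluster each contribute zero here, so including them in the sum does not change the total.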

Page 45: New methods for estimating species trees from gene trees

Calculating XL(T,X)

Define $Cost_B(t) = k_B(t) - 1$, and therefore

$XL(T,t) = \sum_B Cost_B(t)$

Given set X of gene trees, define

$XL(T,X) = \sum_t XL(T,t)$

$= \sum_t \sum_B Cost_B(t)$

$= \sum_B \sum_t Cost_B(t)$

$= \sum_B w(B)$

where $w(B) = \sum_t Cost_B(t)$

Page 46: New methods for estimating species trees from gene trees

Graph Algorithm for MDC

Graph G(X):

• Vertex set: each vertex v corresponds to a non-trivial subset $S(v) \subset S$, where S(v) is the cluster of T below node v

• Edges: (v,w) present iff clusters S(v) and S(w) can co-exist as clusters in a tree

• Vertex weight: $Weight(v) = \sum_t Cost_{S(v)}(t)$

Theorem: $\exists$ T, a binary rooted tree on S such that XL(T,X) = W, iff $\exists$ an (n-2)-clique in G(X) of weight W, where |S| = n.

Hence, MDC can be solved by finding an (n-2)-clique of minimum total weight in G(X).
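For a handful of taxa the clique formulation can be checked by brute force. The sketch below is not the dynamic programming algorithm of T&N 2009; it simply enumerates every set of n-2 pairwise compatible clusters. It assumes that "non-trivial" means clusters of size 2 to n-1, that two clusters can co-exist exactly when they are disjoint or nested, and that gene trees are rooted nested tuples. All names and the gene trees are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def k(B, t):
    """Number of B-maximal clusters in gene tree t."""
    if leaves(t) <= B:
        return 1
    if isinstance(t, str):
        return 0
    return sum(k(B, child) for child in t)


def compatible(A, B):
    """Two clusters can co-exist in a rooted tree iff disjoint or nested."""
    return A.isdisjoint(B) or A <= B or B <= A


def mdc_brute_force(gene_trees):
    """Return (minimum weight, clusters of an optimal species tree)."""
    S = sorted(leaves(gene_trees[0]))
    n = len(S)
    vertices = [frozenset(c) for r in range(2, n) for c in combinations(S, r)]
    weight = {B: sum(k(B, t) - 1 for t in gene_trees) for B in vertices}
    best = None
    for clique in combinations(vertices, n - 2):
        if all(compatible(A, B) for A, B in combinations(clique, 2)):
            w = sum(weight[B] for B in clique)
            if best is None or w < best[0]:
                best = (w, set(clique))
    return best


X = [((("A", "B"), "C"), "D"), ((("A", "B"), "C"), "D"), (("A", "B"), ("C", "D"))]
print(mdc_brute_force(X))
```

The enumeration is exponential in n, which is why the structure of G(X) matters for anything beyond toy inputs.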

Page 47: New methods for estimating species trees from gene trees

T&N algorithm for MDC

• Because of the structure of the graph, we can find a min cost max clique (of size n-2) in polynomial time (in the size of the graph), using dynamic programming. But the graph has $2^n$ vertices!

• However, if we constrain the set C of permitted clusters for the species tree, we can find an optimal constrained solution in $O(|C|^2 nk)$ time (the “heuristic” algorithm in T&N 2009).

Page 48: New methods for estimating species trees from gene trees

Yu, Warnow and Nakhleh (2011)

• Allows for error in estimated gene trees.

• RECOMB 2011 and J Comp Biol 2011

Page 49: New methods for estimating species trees from gene trees

Yu, Warnow and Nakhleh (2011)

Modify gene trees to reduce false positive error:

• Unroot trees

• Use bootstrap (or other statistical techniques) to identify the edges that are potentially incorrect

• Contract the low support edges

Result: estimated gene trees that are likely to be unrooted contractions of the true gene tree.
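The contraction step can be sketched as follows, assuming a simple tree encoding in which a leaf is a string and an internal node is a pair (support of the edge above it, list of children). The encoding, the threshold of 50, and the function name are assumptions for illustration; the sketch keeps a rooted representation and leaves the unrooting step aside.

```python
def contract(node, threshold):
    """Contract every internal edge whose support is below `threshold`."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = contract(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            new_children.extend(child[1])  # splice grandchildren into this node
        else:
            new_children.append(child)
    return (support, new_children)


# Hypothetical gene tree with bootstrap supports; the root's support is unused.
tree = (100, [(40, ["a", "b"]), (90, ["c", (30, ["d", "e"])])])
print(contract(tree, 50))
```

The output has polytomies where the weak edges used to be, so it makes fewer claims than the original estimate.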

Page 50: New methods for estimating species trees from gene trees

New MDC problem

• Input: set $X = \{t_1, t_2, \ldots, t_k\}$ of incompletely resolved, unrooted gene trees.

• Output: set $X' = \{t'_1, t'_2, \ldots, t'_k\}$ (such that each $t'_i$ is a resolved, rooted version of $t_i$, $i = 1, 2, \ldots, k$) and species tree T that minimizes XL(T,X’).

In other words, we treat $t_i$ as a constraint on the true gene tree for gene i.

Page 51: New methods for estimating species trees from gene trees

Search: main theoretical result in T&N 2009

• Theorem: Let X be a set of k rooted binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with $Clusters(T) \subseteq C$ can be found in $O(|C|^2 nk)$ time, where |S| = n.

Page 52: New methods for estimating species trees from gene trees

Search: main theoretical result in YWN 2011

• Theorem: Let X be a set of k unrooted and not necessarily binary gene trees on taxon set S, and let C be a set of clusters on the taxon set. Then a species tree T that optimizes MDC with $Clusters(T) \subseteq C$ can be found in $O(|C|^2 nk)$ time, where |S| = n.

Page 53: New methods for estimating species trees from gene trees

Scoring: main theoretical result

• Theorem: Let t be an unrooted and not necessarily binary gene tree, and let T be a rooted binary species tree, both on S. Then a rooted refinement t* of t that minimizes XL(T,t*) can be found in $O(n^2)$ time, where |S| = n.

Note: brute-force is exponential, even if t is rooted and the maximum degree in t is low

Page 54: New methods for estimating species trees from gene trees

Simplest case: t is rooted

• Input: rooted tree t, not necessarily binary, and binary rooted species tree T

• Output: refinement t* of t, minimizing XL(T,t*)

Recall that $XL(T,t^*) = \sum_B [k_B(t^*) - 1]$

Page 55: New methods for estimating species trees from gene trees

Refining rooted tree t

Def.: $F_B(t)$ denotes the number of nodes in t that have at least one B-maximal child.

Lemma: If t’ is a binary refinement of t, then $F_B(t) \le k_B(t')$.

Theorem: For all rooted trees t, there exists t*, a binary refinement of t, such that for all clusters B of T, $k_B(t^*) = F_B(t)$.

Page 56: New methods for estimating species trees from gene trees

Computing t*

• Algorithm: Refine around each high degree node v in t using the subtree of T defined by the LCAs in T of the children of v.

• Order in which you visit each high degree node does not impact the output

• Can be computed in $O(n^2)$ time

Page 57: New methods for estimating species trees from gene trees

Proof of optimality

Recall: $F_B(t)$ denotes the number of nodes in t that have at least one B-maximal child.

Theorem: The tree t* produced by the algorithm satisfies $k_B(t^*) = F_B(t)$ for every cluster B of T. Hence, t* is optimal.

Proof: Algorithm is locally optimal.

Page 58: New methods for estimating species trees from gene trees

Finding the best species tree, given rooted non-binary trees

• Same basic graph-theoretic approach and DP algorithms work

• Same graph G(X), but redefine $Cost_B(t) = F_B(t) - 1$ and keep $weight(v) = \sum_t Cost_{S(v)}(t)$

Page 59: New methods for estimating species trees from gene trees

General case: t unrooted, non-binary

Input: unrooted, non-binary gene tree t and rooted binary species tree T

Output: rooted, binary tree t* refining t such that XL(T,t*) is minimized

Clearly this is solvable in $O(n^3)$ time.

Better $O(n^2)$ algorithm: find root, then refine optimally.

Page 60: New methods for estimating species trees from gene trees

Summary of YWN 2011

• Extends all results from Than and Nakhleh 2009 to partially resolved, unrooted gene trees

• Suggests contraction of low support edges and suppression of root before species tree estimation

• Gives polynomial time DP algorithm for constrained search for species tree (using only clusters from input gene trees)

Page 61: New methods for estimating species trees from gene trees

Related results

• Yang and Warnow (RECOMB-CG 2011 and BMC Bioinformatics 2011) show that the constrained version of the polynomial time DP algorithm in YWN 2011 produces trees of comparable accuracy to BUCKy, a statistically-based method for species tree estimation under ILS.

• Bayzid and Warnow (in press, J Comp Biol) extend T&N 2009 to incomplete gene trees

Page 62: New methods for estimating species trees from gene trees
Page 63: New methods for estimating species trees from gene trees

Discussion

• SuperFine is a fast method to “boost” the accuracy of supertree methods, and produces highly accurate species trees quickly when no ILS occurs. (Data not shown: SuperFine also gives good results in the presence of ILS!)

• In the presence of ILS, statistically-based methods give the best results, but can only be run on small datasets.

• Acknowledging error in gene trees improves species tree estimation.

Page 64: New methods for estimating species trees from gene trees

Acknowledgments

• Funding: Microsoft Research New England, National Science Foundation, and the Guggenheim Foundation

• Collaborators: Luay Nakhleh and Yun Yu (MDC), Shel Swenson, Randy Linder, and Rahul Suri (SuperFine)

Page 65: New methods for estimating species trees from gene trees

Part I: SuperFine

• Nelesen, Suri, Linder, and Warnow

• Accepted for publication, subject to revision, Systematic Biology

Note: SuperFine is the supertree method used in the DACTAL software (Nelesen et al., submitted)

Page 66: New methods for estimating species trees from gene trees

Step 1: Encode each source tree as a collection of reduced source trees on {1,2,...,d}

[Figure: the two source trees and their reduced versions, with leaves relabelled by numbers from 1 to 6.]

Page 67: New methods for estimating species trees from gene trees

Part 2 of SuperFine

[Figure: the SCM tree on leaves a through j with its polytomy; the subtrees around the polytomy are numbered 1 to 6, and the leaves of the two source trees are relabelled with those numbers.]

Page 68: New methods for estimating species trees from gene trees

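Recall Lemma

[Figure: SCM example. Two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their strict consensus on the common leaves a, b, c, d into a supertree on leaves a through j.]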

Page 69: New methods for estimating species trees from gene trees

Replace polytomy using tree from MRP

[Figure: the polytomy in the SCM tree is replaced by the MRP tree on labels 1 to 6, giving a more resolved supertree on leaves a through j.]

Page 70: New methods for estimating species trees from gene trees

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Page 71: New methods for estimating species trees from gene trees

Neighbor Joining’s sequence length requirement is exponential!

• Atteson: Let T be a General Markov model tree on n leaves. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length at least $O(\ln n \, e^{O(n)})$.

Page 72: New methods for estimating species trees from gene trees

Chordal graph algorithms yield phylogeny estimation from polynomial length sequences

• Theorem (Warnow et al., SODA 2001): DCM1-NJ correct with high probability given sequences of length $O(\ln n \, e^{O(\ln n)})$

• Simulation study from Nakhleh et al. ISMB 2001

[Figure: Error Rate (0 to 0.8) against No. Taxa (0 to 1600) for NJ and DCM1-NJ]

Page 73: New methods for estimating species trees from gene trees

SATé-1 and SATé-2 (“Next” SATé), on 1000 leaf models

Page 74: New methods for estimating species trees from gene trees

DACTAL more accurate than all standard methods, and much faster than SATé

Average results on 3 large RNA datasets (6K to 28K)

CRW: Comparative RNA database, structural alignments

3 datasets with 6,323 to 27,643 sequences

Reference trees: 75% RAxML bootstrap trees

DACTAL (shown in red) run for 5 iterations starting from FT(Part)

SATé-1 fails on the largest dataset

SATé-2 runs but is not more accurate than DACTAL, and takes longer

Page 75: New methods for estimating species trees from gene trees

Markov Model of Site Evolution

Simplest (Jukes-Cantor):

• The model tree T is binary and has substitution probabilities p(e) on each edge e.

• The state at the root is randomly drawn from {A,C,T,G} (nucleotides)

• If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.

• The evolutionary process is Markovian.

More complex models (such as the General Markov model) are also considered, often with little change to the theory.
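The Jukes-Cantor description above can be turned into a small simulation. The sketch assumes a tree encoded as nested lists of (substitution probability, subtree) pairs with leaf names as strings; the encoding, the probabilities, the sequence length, and the function names are all hypothetical choices for illustration.

```python
import random

STATES = "ACGT"


def evolve(seq, p, rng):
    """Evolve `seq` along one edge: each site changes with probability p,
    moving to one of the three other states with equal probability."""
    out = []
    for s in seq:
        if rng.random() < p:
            out.append(rng.choice([x for x in STATES if x != s]))
        else:
            out.append(s)
    return "".join(out)


def simulate(tree, seq, rng):
    """Push `seq` down the tree; return {leaf name: sequence at that leaf}."""
    if isinstance(tree, str):
        return {tree: seq}
    result = {}
    for p, child in tree:
        # Each edge uses only the parent's state, which makes the process Markovian.
        result.update(simulate(child, evolve(seq, p, rng), rng))
    return result


rng = random.Random(0)
root = "".join(rng.choice(STATES) for _ in range(20))  # random root state
tree = [(0.1, [(0.05, "S1"), (0.05, "S2")]), (0.1, [(0.05, "S3"), (0.05, "S4")])]
leaf_seqs = simulate(tree, root, rng)
for name, seq in sorted(leaf_seqs.items()):
    print(name, seq)
```

Sites evolve independently and identically here, which is the usual reading of the model, and no insertions or deletions occur, so all leaf sequences keep the root's length.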

Page 76: New methods for estimating species trees from gene trees

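Recall Lemma

[Figure: SCM example. Two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their strict consensus on the common leaves a, b, c, d into a supertree on leaves a through j.]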

Page 77: New methods for estimating species trees from gene trees

Step 1: Encode each source tree as a collection of reduced source trees on {1,2,...,d}

[Figure: the two source trees and their reduced versions, with leaves relabelled by numbers from 1 to 6.]

Page 78: New methods for estimating species trees from gene trees

Bipartitions and refinement

B(T) denotes the set of non-trivial bipartitions (splits) of T

T refines T’ (T’ ≤ T) if $B(T') \subseteq B(T)$

[Figure: trees T and T’ on leaves a, b, c, d, e, f]

T: B(T) = {ab|cdef, abc|def, abcf|de}

T’: B(T’) = {ab|cdef, abc|def}
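A short sketch of how B(T) and the refinement test might be computed. It assumes an unrooted tree is written as a nested tuple whose outermost tuple is an arbitrary internal node; the two nested tuples below were chosen so that their bipartition sets match the ones listed on this slide, and the function names are invented for the example.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of an unrooted tree given as a nested tuple."""
    everything = leaves(tree)
    out = set()

    def visit(node, is_top):
        if isinstance(node, str):
            return
        below = leaves(node)
        # The edge above a non-top internal node splits `below` from the rest.
        if not is_top and 2 <= len(below) <= len(everything) - 2:
            out.add(frozenset([below, everything - below]))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return out


def refines(T, T_prime):
    """T refines T' iff B(T') is a subset of B(T)."""
    return bipartitions(T_prime) <= bipartitions(T)


T = (("a", "b"), "c", ("f", ("d", "e")))
T_prime = (("a", "b"), "c", ("f", "d", "e"))
print(refines(T, T_prime), refines(T_prime, T))
```

Contracting the edge that separates d and e from the rest of T is what turns T into T’ in this encoding.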

Page 79: New methods for estimating species trees from gene trees

Displays and compatibility

• T displays T’ if T’ ≤ T|L(T’)

• T displays a set of trees if it displays every tree in that set.

• A set S of trees is compatible if there exists a tree T such that T displays S

In general, determining whether a set of trees is compatible is NP-hard

Page 80: New methods for estimating species trees from gene trees

Matrix representation with parsimony (MRP)

First, encode each edge of each source tree as a partial binary character

Then, analyze this matrix of partial binary characters (the matrix representation) using maximum parsimony (MP)

If used with exact solutions to MP, MRP is an exact algorithm for Tree Compatibility
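The encoding step of MRP can be sketched as below. Each source tree is assumed to be given as its leaf set plus one side of every internal edge; taxa on that side get 1, the other taxa of that source tree get 0, and taxa absent from the source tree get "?". The input format, the 0/1 orientation, and the function name are assumptions for illustration, and the parsimony analysis of the resulting matrix is not shown.

```python
def matrix_representation(source_trees):
    """Return {taxon: string of partial binary characters}, one column per edge."""
    taxa = sorted(set().union(*(leaf_set for leaf_set, _ in source_trees)))
    rows = {x: "" for x in taxa}
    for leaf_set, sides in source_trees:
        for side in sides:
            for x in taxa:
                if x not in leaf_set:
                    rows[x] += "?"   # taxon missing from this source tree
                elif x in side:
                    rows[x] += "1"
                else:
                    rows[x] += "0"
    return rows


# Hypothetical source trees: ab|cd on {a,b,c,d} and bc|de on {b,c,d,e}.
sources = [(set("abcd"), [set("ab")]), (set("bcde"), [set("bc")])]
matrix = matrix_representation(sources)
for taxon, row in matrix.items():
    print(taxon, row)
```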

Page 81: New methods for estimating species trees from gene trees

Maximum Parsimony (Hamming distance Steiner Tree)

• Input: Set S of n aligned sequences of length k

• Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T

such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.
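The quantity being minimized is easy to evaluate for one fixed, fully labeled tree. The sketch below computes that sum of Hamming distances over the edges; it does not search over trees or internal labelings, which is the hard part of maximum parsimony. The node names, sequences, and function names are hypothetical.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def tree_length(edges, label):
    """Sum of H(i,j) over the edges (i,j) of a fully labeled tree."""
    return sum(hamming(label[i], label[j]) for i, j in edges)


# Hypothetical tree: leaves A, B attach to internal node u; C, D attach to v.
label = {"A": "ACT", "B": "ACA", "C": "GTT", "D": "GTA", "u": "ACT", "v": "GTT"}
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
print(tree_length(edges, label))
```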

Page 82: New methods for estimating species trees from gene trees

Lemma: SCM splits project onto source trees

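[Figure: SCM example. Two source trees, one on leaves a, b, c, d, e, f, g and one on leaves a, b, c, d, h, i, j, are merged through their strict consensus on the common leaves a, b, c, d into a supertree on leaves a through j.]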

Page 83: New methods for estimating species trees from gene trees

Finding optimal root

Color all edges of the gene tree in a B-maximal subtree, for some cluster B in T.

Theorem: the optimal rooted refinement of t can be obtained by rooting t at any node that is incident to at least one uncolored edge (and there will be at least one). Furthermore, such a node can be found in $O(n^2)$ time.

Page 84: New methods for estimating species trees from gene trees

Graph algorithm

• For each non-trivial subset B of S, find the best rooted version t’ of each gene tree t, and define $Cost_B(t) = F_B(t') - 1$.

• Find (n-2)-clique of minimum total weight in the new G(X), with $weight(v) = \sum_t Cost_{S(v)}(t)$.

Page 85: New methods for estimating species trees from gene trees

Main results in Than and Nakhleh, 2009

• Gives polynomial time algorithm to compute XL(T,X), where T is a binary rooted species tree and X is a set of binary rooted gene trees

• Gives exact DP algorithm for finding optimal MDC species tree for input set of binary rooted gene trees

• Gives exact DP polynomial time solution for constructing optimal MDC species tree when all its bipartitions constrained to come from a user-specified set.

All results require input gene trees be binary, rooted trees.

Analysis assumes input trees are 100% correct.

Page 86: New methods for estimating species trees from gene trees

SuperFine: new supertree method

• Step 1: construct a supertree with low false positives (unresolved)

• Step 2: Refine the tree to reduce false negatives by resolving each high degree node (“polytomy”) using a “base” supertree method (e.g., MRP) applied to recoded source trees.

Quartet Max Cut

Page 87: New methods for estimating species trees from gene trees

Main results of Than and Nakhleh, 2009

• Gives polynomial time algorithm to compute XL(T,X), where T is a binary rooted species tree and X is a set of binary rooted gene trees

• Gives exact (DP) algorithm for finding optimal MDC species tree for input set of binary rooted gene trees, by finding (n-2)-clique of minimum weight in an exponentially large graph.

• Gives exact (DP) polynomial time algorithm for constrained version of MDC problem, in which the species tree bipartitions must come from a user-provided input set.

All results require input gene trees be binary, rooted trees.

Analysis assumes input trees are 100% correct.

Page 88: New methods for estimating species trees from gene trees

Scoring a pair of trees

Recall: $F_B(t)$ denotes the number of nodes in t that have at least one B-maximal child.

Corollary: Let t be a rooted gene tree, T a rooted, binary species tree, and t* an optimal refinement of t. Then

$XL(T,t^*) = \sum_B [F_B(t) - 1]$ as B ranges over the clusters of T.
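The corollary gives a way to score a non-binary gene tree without building the refinement. The sketch below assumes rooted nested-tuple trees on the same leaf set, and it lets B range over the clusters of T other than the full leaf set, since the root of t has no parent; that restriction is an assumption of this illustration, as are the function names.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    out = {leaves(tree)}
    if not isinstance(tree, str):
        for child in tree:
            out |= clusters(child)
    return out


def F(B, t):
    """F_B(t): number of nodes of t with at least one B-maximal child."""
    count = 0

    def visit(node):
        # Only called on nodes whose cluster is NOT inside B.
        nonlocal count
        if isinstance(node, str):
            return
        has_maximal_child = False
        for child in node:
            if leaves(child) <= B:
                has_maximal_child = True  # topmost cluster inside B
            else:
                visit(child)
        if has_maximal_child:
            count += 1

    if not leaves(t) <= B:
        visit(t)
    return count


def score_optimal_refinement(T, t):
    """XL(T,t*) for an optimal refinement t* of t, via F_B(t) - 1."""
    everything = leaves(T)
    return sum(F(B, t) - 1 for B in clusters(T) if B != everything)


species = (("A", "B"), ("C", "D"))
print(score_optimal_refinement(species, ("A", "B", "C", "D")))
```

A completely unresolved gene tree scores zero against any species tree, because it can always be refined to match.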