SuperFine, Enabling Large-Scale Phylogenetic Estimation

Shel Swenson, University of Southern California and Georgia Institute of Technology

Upload: laban

Post on 23-Feb-2016



Page 1: SuperFine, Enabling Large-Scale Phylogenetic Estimation

SuperFine, Enabling Large-Scale Phylogenetic Estimation

Shel Swenson

University of Southern California and Georgia Institute of Technology

Page 2

Phylogeny (evolutionary tree)

[Figure: an evolutionary tree relating Orangutan, Gorilla, Chimpanzee, and Human. Images (1-3) from the Tree of Life website, University of Arizona.]

“Nothing in Biology makes sense except in the light of evolution” – Dobzhansky

Page 3

Tree of Life, Importance to Biology

• Biomedical applications
• Mechanisms of evolution
• Tracking ancient migrations
• Protein structure and function
• Drug design

[Figure: the Tree of Life with a "We are here" marker. Image credits: 1) Nature Reviews (Genetics), 2) Howard Hughes Medical Institute (BioInteractive), 3) 1000 Genomes Project.]

Page 4

DNA sequence evolution (idealized)

[Figure: a tree drawn against a time axis running from -3 million yrs through -2 million yrs and -1 million yrs to today. The ancestral sequence AAGACTT at the root evolves along the branches into the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]

Page 5

Phylogeny Problem

U = AGATTA, V = AGACTA, W = TGGACA, X = TGCGACT, Y = AGGTCA

[Figure: a tree on the leaves U, V, W, X, Y, to be estimated from the sequences.]

Page 6

Two basic approaches for tree estimation on multi-gene datasets

• Apply phylogeny estimation methods to concatenated (“combined”) sequence alignments for different genes

• Compute trees on individual genes and apply a supertree method

This talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems

Page 7

Using multiple genes

gene 1
S1  TCTAATGGAA
S2  GCTAAGGGAA
S3  TCTAAGGGAA
S4  TCTAACGGAA
S7  TCTAATGGAC
S8  TATAACGGAA

gene 2
S4  GGTAACCCTC
S5  GCTAAACCTC
S6  GGTGACCATC
S7  GCTAAACCTC

gene 3
S1  TATTGATACA
S3  TCTTGATACC
S4  TAGTGATGCA
S7  CATTCATACC
S8  TAGTGATGCA

Page 8

Concatenation

    gene 1      gene 2      gene 3
S1  TCTAATGGAA  ??????????  TATTGATACA
S2  GCTAAGGGAA  ??????????  ??????????
S3  TCTAAGGGAA  ??????????  TCTTGATACC
S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5  ??????????  GCTAAACCTC  ??????????
S6  ??????????  GGTGACCATC  ??????????
S7  TCTAATGGAC  GCTAAACCTC  CATTCATACC
S8  TATAACGGAA  ??????????  TAGTGATGCA
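A minimal Python sketch of how such a concatenated matrix could be assembled, padding each gene a species lacks with "?" characters. The function name `concatenate` and the pairing of sequences with species labels are assumptions made for illustration (the sequences are taken in the order they appear on the slide); this is not code from the talk.

```python
# Per-gene alignments: species label -> aligned sequence.
# The label-to-sequence pairing is assumed from the slide's ordering.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "CATTCATACC", "S8": "TAGTGATGCA"}


def concatenate(genes, missing="?"):
    """Join per-gene alignments into one row per species.

    A species absent from a gene gets a run of `missing` characters
    as long as that gene's alignment.
    """
    taxa = sorted(set().union(*genes))  # union of the species labels
    rows = {}
    for taxon in taxa:
        parts = []
        for gene in genes:
            width = len(next(iter(gene.values())))  # alignment length
            parts.append(gene.get(taxon, missing * width))
        rows[taxon] = "".join(parts)
    return rows


supermatrix = concatenate([gene1, gene2, gene3])
for taxon, row in supermatrix.items():
    print(taxon, row)
```

In this toy matrix 9 of the 24 gene blocks are missing, which illustrates how quickly missing data accumulates when genes cover different species.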

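Page 9

Two competing approaches

[Figure: a species-by-gene data matrix with columns gene 1, gene 2, ..., gene k. Concatenation analyzes all genes together as one combined alignment; the alternative is to analyze each gene separately and combine the resulting gene trees with a supertree method.]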

Page 10

Why use supertree methods?

• Missing data

• Large dataset sizes

• Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)

• Unavailable sequence data (only trees)

Page 11

Many Supertree Methods

• MRP
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

MRP: Matrix Representation with Parsimony (most commonly used and among the most accurate)
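As background, the "matrix representation" half of MRP is commonly described as follows: every internal edge of every source tree becomes one binary character, coded 1 for taxa on one side of the edge, 0 for taxa on the other side, and ? for taxa absent from that source tree. The sketch below builds only that matrix; the parsimony search over it is left to an external tool. The input format (a leaf set plus one side of each internal edge) and the name `mrp_matrix` are assumptions made for illustration.

```python
def mrp_matrix(source_trees):
    """Encode source trees as a 0/1/? character matrix.

    source_trees: list of (leaves, clades) pairs, where `leaves` is the
    set of taxa in the tree and `clades` lists, for each internal edge,
    the set of taxa on one side of that edge.
    Returns a dict mapping each taxon to its row of characters.
    """
    all_taxa = sorted(set().union(*(leaves for leaves, _ in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for leaves, clades in source_trees:
        for clade in clades:  # one matrix column per internal edge
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two hypothetical source trees with overlapping leaf sets.
tree1 = ({"a", "b", "c", "d"}, [{"a", "b"}])
tree2 = ({"b", "c", "d", "e", "f"}, [{"b", "c"}, {"e", "f"}])

matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

The number of columns grows with the total number of internal edges across the source trees.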

Page 12

Quantifying Error

FN: false negative (missing edge)

FP: false positive (incorrect edge)

[Figure: a true tree and an estimated tree with one FN edge and one FP edge marked; 50% error rate.]
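A small sketch of how FN and FP rates can be computed once each tree is written as its set of non-trivial bipartitions. The helper names `bip` and `error_rates`, the toy trees, and the choice of denominators (true edges for FN, estimated edges for FP) are assumptions for illustration rather than details given in the talk.

```python
def bip(side_a, side_b):
    """An unordered bipartition A|B; each side is a string of one-letter taxa."""
    return frozenset([frozenset(side_a), frozenset(side_b)])


def error_rates(true_bips, est_bips):
    """Return (FN rate, FP rate) for an estimated tree.

    FN: bipartitions of the true tree missing from the estimate.
    FP: bipartitions of the estimate that are not in the true tree.
    """
    false_negatives = true_bips - est_bips
    false_positives = est_bips - true_bips
    fn_rate = len(false_negatives) / len(true_bips) if true_bips else 0.0
    fp_rate = len(false_positives) / len(est_bips) if est_bips else 0.0
    return fn_rate, fp_rate


# Hypothetical five-taxon example: the trees share one of two internal edges.
true_tree = {bip("ab", "cde"), bip("abc", "de")}
estimated_tree = {bip("ab", "cde"), bip("abd", "ce")}

fn_rate, fp_rate = error_rates(true_tree, estimated_tree)
print(fn_rate, fp_rate)  # 0.5 0.5
```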

Page 13

FN Rate: MRP vs. Concatenation

[Figure: FN rate (%) against scaffold density (%) for MRP and concatenation.]

Concatenation is not always an option. We need better supertree methods.

Page 14

FN Rate: SuperFine vs. MRP and Concatenation

[Figure: FN rate (%) against scaffold density (%) for MRP, SuperFine, and concatenation.]

Page 15

Running Time: SuperFine vs. MRP

(Concatenation is much slower)

MRP: 8-12 sec. SuperFine: 2-3 sec.

[Figure: three panels of running time (minutes) against scaffold density (%) for MRP and SuperFine.]

Page 16

Idea behind SuperFine

1. Construct a supertree with low false positive rate

2. Reduce false negatives by resolving areas of uncertainty using a supertree method

Quartet Max Cut

(Swenson et al., Systematic Biology, 2011)

Page 17

Bipartitions and refinement

Let B(T) denote the set of (non-trivial) bipartitions induced by the edges of T.

T refines T’ (T’ ≤ T) if B(T) ⊇ B(T’), that is, if T contains every bipartition of T’.

T: B(T) = {ab|cdef, abc|def, abcd|ef}

T’: B(T’) = {ab|cdef, abc|def}

[Figure: trees T and T’ on the leaves a, b, c, d, e, f. T’ contains a polytomy; T is a refinement of T’.]
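The definition can be checked mechanically. The sketch below assumes trees are written as nested Python tuples of leaf names (a representation chosen here for illustration, not one used in the talk), computes the non-trivial bipartitions, and tests refinement as a superset relation on the slide's two example trees.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions induced by the edges of an unrooted tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def refines(t, t_prime):
    """T refines T' when B(T) contains every bipartition in B(T')."""
    return bipartitions(t) >= bipartitions(t_prime)


# The slide's example: T is fully resolved, T' has a polytomy on d, e, f.
T = (("a", "b"), "c", ("d", ("e", "f")))
T_prime = (("a", "b"), "c", ("d", "e", "f"))

print(len(bipartitions(T)), len(bipartitions(T_prime)))  # 3 2
print(refines(T, T_prime), refines(T_prime, T))          # True False
```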

Page 18

Idea behind SuperFine

1. Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)

2. Reduce FN by resolving each polytomy using a supertree method

Quartet Max Cut

Page 19

Strict Consensus Merger (SCM)

[Figure: two source trees, one on the leaves a, b, c, d, e, f, g and one on the leaves a, b, c, d, h, i, j, are merged through the tree on their shared leaves a, b, c, d into a single tree on a through j.]

Page 20

Property of SCM: Bipartitions in SCM tree correspond to bipartitions in the source trees

[Figure: the same two source trees, on a, b, c, d, e, f, g and on a, b, c, d, h, i, j, shown next to the SCM tree on a through j.]

Swenson, Ph.D. Thesis, 2009
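One way to read this property, stated here as an interpretation rather than the thesis's exact wording: when a bipartition of the merged tree is restricted to the leaf set of a source tree, it is either trivial (fewer than two leaves on a side) or a bipartition of that source tree. The sketch below checks this condition on a hypothetical example; the trees, the merged tree, and the names `bip` and `respects_sources` are all assumptions for illustration.

```python
def bip(side_a, side_b):
    """An unordered bipartition A|B; each side is a string of one-letter taxa."""
    return frozenset([frozenset(side_a), frozenset(side_b)])


def respects_sources(merged_bips, sources):
    """Check the restriction property for every source tree.

    sources: list of (leafset, bipartitions) pairs.
    """
    for leafset, source_bips in sources:
        for bipartition in merged_bips:
            sides = [side & leafset for side in bipartition]
            if min(len(side) for side in sides) < 2:
                continue  # trivial once restricted to this source tree
            if frozenset(sides) not in source_bips:
                return False
    return True


# Hypothetical source trees that disagree about c, d, and e.
t1 = (frozenset("abcdef"), {bip("ab", "cdef"), bip("cd", "abef"), bip("ef", "abcd")})
t2 = (frozenset("abcdeg"), {bip("ab", "cdeg"), bip("ce", "abdg"), bip("dg", "abce")})

# A merged tree that keeps only the uncontested edges (polytomy in the middle).
merged = {bip("ab", "cdefg"), bip("dg", "abcef"), bip("ef", "abcdg")}
print(respects_sources(merged, [t1, t2]))  # True

# Adding the contested edge cd|abefg breaks the property for t2.
print(respects_sources(merged | {bip("cd", "abefg")}, [t1, t2]))  # False
```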

Page 21

Performance of SCM

• Low false positive (FP) rate (estimated supertree has few false edges)

• High false negative (FN) rate (estimated supertree is missing many true edges)

• Runs in polynomial time (in the number of source trees and total number of species)

Page 22

Idea behind SuperFine

1. Construct a supertree with low FP using SCM

2. Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP)

Quartet Max Cut

Page 23

Resolving a single polytomy, v

• Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d=degree(v)

• Step 2: Apply MRP to the collection of reduced trees, to produce a tree t on leafset {1,2,...,d}

• Step 3: Replace the star tree at v by tree t
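Step 1 can be sketched on bipartition sets. The version below assumes the d subtrees hanging off the polytomy v are given as a mapping from labels 1..d to their taxa, relabels each side of every source-tree bipartition, and keeps only the splits that remain non-trivial on the labels. The example trees and the name `reduce_source_tree` are hypothetical, and the rule for dropping edges relies on the SCM property; this is not the authors' implementation.

```python
def bip(side_a, side_b):
    """An unordered bipartition A|B; each side is a string of one-letter taxa."""
    return frozenset([frozenset(side_a), frozenset(side_b)])


def reduce_source_tree(source_bips, components):
    """Re-encode a source tree on the labels {1, ..., d} of a polytomy.

    components: label -> taxa in the subtree with that label.
    Returns the non-trivial bipartitions of the reduced tree.
    """
    label_of = {taxon: label
                for label, taxa in components.items() for taxon in taxa}
    reduced = set()
    for bipartition in source_bips:
        side_a, side_b = (frozenset(label_of[t] for t in side)
                          for side in bipartition)
        if side_a & side_b:
            continue  # a label on both sides: the edge lies inside one subtree
        if len(side_a) < 2 or len(side_b) < 2:
            continue  # trivial on the labels
        reduced.add(frozenset([side_a, side_b]))
    return reduced


# Hypothetical polytomy of degree d = 4 and two conflicting source trees.
components = {1: "ab", 2: "c", 3: "dg", 4: "ef"}
t1 = {bip("ab", "cdef"), bip("cd", "abef"), bip("ef", "abcd")}
t2 = {bip("ab", "cdeg"), bip("ce", "abdg"), bip("dg", "abce")}

reduced_1 = reduce_source_tree(t1, components)  # the single split 23|14
reduced_2 = reduce_source_tree(t2, components)  # the single split 24|13
print(reduced_1, reduced_2)
```

The two reduced trees have only four leaves each and disagree with one another; in Step 2 the supertree method is applied to small inputs of this kind instead of to the full source trees.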

Page 24

Back to Our Example

[Figure: the SCM tree on a through j with one polytomy. The subtrees around the polytomy are numbered 1 to 6, and each leaf of the two source trees is relabeled with the number of the subtree that contains it.]

Page 25

Where We Use the Property

[Figure: the SCM tree on a through j next to the two source trees, with the source-tree leaves relabeled 1 to 6 according to the subtrees around the polytomy.]

Page 26

Step 1: Reduce each source tree to a tree on the set {1,2,...,d}

[Figure: the two source trees, on a, b, c, d, e, f, g and on a, b, c, d, h, i, j, with their leaves replaced by the labels 1 to 6.]

Page 27

Step 2: Apply MRP to the collection of reduced trees

[Figure: the two reduced trees, one on the labels 1, 2, 3, 4 and one on the labels 1, 4, 5, 6, are given to MRP, which returns a tree on the labels 1 to 6.]

Page 28

Replace polytomy using tree from MRP

[Figure: the MRP tree on the labels 1 to 6 replaces the polytomy in the SCM tree, giving a refined tree on a through j.]

Page 29

FN Rate: SuperFine vs. MRP and Concatenation

[Figure: FN rate (%) against scaffold density (%) for MRP, SuperFine, and concatenation.]

Page 30

Running Time: SuperFine vs. MRP

(Concatenation is much slower)

MRP: 8-12 sec. SuperFine: 2-3 sec.

[Figure: three panels of running time (minutes) against scaffold density (%) for MRP and SuperFine.]

Page 31

SuperFine: Boosting supertree methods

• SuperFine+MRP vs. MRP (Swenson et al. 2011)
  – SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time.
  – Speed-up results from the re-encoding of source trees as smaller trees.

• SuperFine+QMC vs. QMC (quartet-based)
  – QMC (Snir 2008): polynomial time, but infeasible for 500+ taxa
  – SuperFine+QMC runs where QMC cannot (Swenson et al. 2010)

• SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012)
  – SuperFine+MRL: faster and more accurate, with similar likelihood scores

DACTAL (Nelesen et al. 2012): boosting concatenation methods; uses SuperFine in its divide-and-conquer strategy