phylogeny estimation: why it is "hard", and how to design methods with good performance

Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas at Austin

Upload: dustin-flowers

Post on 03-Jan-2016


TRANSCRIPT

Page 1: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Phylogeny Estimation: Why It Is "Hard", and

How to Design Methods with Good Performance

Tandy Warnow

Department of Computer Sciences

University of Texas at Austin

Page 2: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

The real title:

Phylogeny Estimation:

Why it is “Hard”

but not

how to design methods with good performance -

talk to me separately about this, no time in this lecture!

Page 3: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

This talk

• Intro to phylogenetic estimation (using some terms to be defined later: polynomial time and NP-hard)

• Computational problems and what it means to solve them exactly

• Computational problems, and what it means to “solve them” heuristically

Page 4: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Orangutan Gorilla Chimpanzee Human

From the Tree of Life Website, University of Arizona

Phylogeny (evolutionary tree)

Page 5: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Evolutionary History

From AToL website

Helps us

– predict gene function

– develop drugs and vaccines

– understand disease spread

– understand human origins

Tree of Life

Phylogenetics: estimating evolutionary histories

Page 6: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving along a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today. Ancestral sequence: AAGACTT. Intermediate sequences: TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT. Present-day sequences: TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT.]

Page 7: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Phylogeny Problem

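[Figure: five present-day sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, next to the unknown tree with leaves U, V, W, X, Y that is to be estimated.]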

Page 8: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

How can we infer evolution?

Page 9: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Two types of phylogenetic reconstruction methods

1. Heuristics for hard optimization problems (Maximum Parsimony and Maximum Likelihood)

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]

2. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.

Page 10: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum Parsimony

• Input: Set S of n aligned sequences of length k

• Output:

– A phylogenetic tree T leaf-labeled by sequences in S

– additional sequences of length k labeling the internal nodes of T

such that the total number of changes is minimized
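The "total number of changes" can be made concrete with a minimal sketch: once every node of a tree carries a sequence, the cost of an edge is the number of positions at which its two endpoint sequences differ, and the tree's cost is the sum over edges. The function names and the small labeled tree below are illustrative assumptions, not taken from the slides' software.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length k")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges: list[tuple[str, str]]) -> int:
    """Total number of changes on a tree whose nodes are all labeled.

    Each edge is given as the pair of sequences at its two endpoints.
    """
    return sum(hamming(u, v) for u, v in edges)


# Hypothetical fully labeled tree on leaves ACT, ACA, GTT, GTA,
# with internal nodes labeled ACA and GTA.
labeled_edges = [
    ("ACT", "ACA"),  # leaf ACT to internal node ACA
    ("ACA", "ACA"),  # leaf ACA to internal node ACA
    ("ACA", "GTA"),  # the internal edge
    ("GTT", "GTA"),  # leaf GTT to internal node GTA
    ("GTA", "GTA"),  # leaf GTA to internal node GTA
]
print(tree_length(labeled_edges))  # 4
```

Maximum parsimony asks for the tree and internal labels that make this sum as small as possible.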

Page 11: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum parsimony (example)

• Input: Four sequences

– ACT

– ACA

– GTT

– GTA

• Question: which of the three trees has the best MP score?

Page 12: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum Parsimony

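[Figure: the three possible unrooted trees on the four sequences. One groups ACT with GTT and ACA with GTA, one groups ACT with GTA and ACA with GTT, and one groups ACT with ACA and GTT with GTA.]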

Page 13: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum Parsimony

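[Figure: the three trees, each shown with internal node labels and edge costs]

• ACT and GTT on one side, ACA and GTA on the other; internal nodes GTT and GTA; nonzero edge costs 2, 1, 2; MP score = 5

• ACT and GTA on one side, ACA and GTT on the other; internal nodes ACT and ACA as drawn; nonzero edge costs 3, 1, 3; total = 7 (labeling both internal nodes ACA instead gives 6, which is this tree's MP score)

• ACT and ACA on one side, GTT and GTA on the other; internal nodes ACA and GTA; nonzero edge costs 1, 2, 1; MP score = 4

Optimal MP tree: the third tree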

Page 14: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum Parsimony: computational complexity

[Figure: the optimal MP tree, grouping ACT with ACA and GTT with GTA, with internal nodes ACA and GTA and edge costs 1, 2, 1; MP score = 4]

But how do we find the best tree?

Optimal labeling can be computed in linear time O(nk)
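The slides do not name the algorithm behind the O(nk) bound; one standard way to score a fixed tree in that time is Fitch's algorithm, sketched here under that assumption. The tree is written as nested 2-tuples (an unrooted four-leaf tree is rooted on its internal edge, which does not change the score), and the names `fitch_score` and `quartets` are illustrative.

```python
def fitch_score(tree, k: int) -> int:
    """Minimum number of changes on a fixed rooted binary tree.

    `tree` is a nested 2-tuple whose leaves are aligned sequences of
    length k. Each site is processed independently, bottom-up.
    """
    total = 0
    for site in range(k):
        changes = 0

        def state_set(node):
            nonlocal changes
            if isinstance(node, str):       # leaf: its observed character
                return {node[site]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:                      # children can agree: no change
                return common
            changes += 1                    # children must differ: one change
            return left | right

        state_set(tree)
        total += changes
    return total


# The three possible trees on the four example sequences.
quartets = {
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
}
for name, tree in quartets.items():
    print(name, fitch_score(tree, 3))
# ACT,GTT | ACA,GTA 5
# ACT,GTA | ACA,GTT 6
# ACT,ACA | GTT,GTA 4
```

Note that the second tree scores 6 when its internal nodes are labeled optimally, one less than the total of 7 obtained from the particular labeling drawn on the example slide. Scoring one tree is cheap; the hard part is choosing among all possible trees.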

Page 15: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Exhaustive Search

For every tree on the set of sequences, do:

• Score each tree (compute optimal sequences for each internal node, and record the score)

• Keep track of the tree with the best score

How expensive is this?


Page 17: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Don’t try “exhaustive search”

• Number of (unrooted) binary trees on n leaves is (2n-5)!! = (2n-5) x (2n-7) x (2n-9) x … x 3

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would need more than 10^2846 millennia to find the best tree

#leaves #trees

4 3

5 15

6 105

7 945

8 10395

9 135135

10 2027025

20 2.2 x 10^20

100 1.7 x 10^182

1000 1.9 x 10^2860
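The counts in this table follow directly from the (2n-5)!! formula. A minimal sketch (the function name is an assumption) that reproduces the small entries exactly and reports only the order of magnitude for the large ones, since Python integers have no size limit:

```python
def count_unrooted_binary_trees(n: int) -> int:
    """(2n-5)!! = 3 x 5 x ... x (2n-5): unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):   # odd numbers 3, 5, ..., 2n-5
        count *= factor
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, count_unrooted_binary_trees(n))

for n in (20, 100, 1000):
    digits = len(str(count_unrooted_binary_trees(n)))
    print(n, "about 10^%d" % (digits - 1))
```

Even at a thousand trees per second, the 1000-leaf count leaves exhaustive search hopeless by thousands of orders of magnitude.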

Page 18: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Maximum Parsimony: computational complexity

[Figure: the optimal MP tree, grouping ACT with ACA and GTT with GTA, with internal nodes ACA and GTA and edge costs 1, 2, 1; MP score = 4]

Finding the optimal MP tree is NP-hard

Optimal labeling can be computed in linear time O(nk)

Page 19: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

NP-hard(ness)

• What does this mean?

• What are the consequences for a problem being NP-hard?

• What kind of methods are used to “solve” NP-hard problems?

• How should you interpret the output of a software program, when the problem is NP-hard?

Page 20: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

“Real” problem: your brother’s birthday party

• Your brother is turning 10 and you need to arrange his birthday party

• He wants all his friends to come

• But some of them hate each other

Your objective: have as few parties as you can, but invite everyone to at least one party (while not having people who hate each other at the same party)

Page 21: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Your brother’s party

• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben

• Sally and Alice hate each other, also Henry and Sally, Henry and Tommy, Alice and Jimmy, Ben and Sally, and Ben and Henry.

Page 22: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Graph representation of your brother’s friends

Graph has vertices and edges

• Vertices = your brother’s friends

• Edges between vertices indicate they hate each other


Page 24: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Coloring vertices to assign friends to parties

• Given graph G with vertices and edges

• Assign colors to the vertices so that no edge connects vertices of the same color, using a minimum number of colors

• Vertices = your brother’s friends

• Edges between vertices indicate they hate each other

• Colors = parties

Page 25: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Assigning friends to parties: graph coloring!

• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben

• We can’t do this with two parties. Why?

• What about three?

Page 26: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Your brother’s parties

Solution: three parties!

• Sally, Tommy, and Jimmy

• Henry and Alice

• Ben
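The party assignment can be checked mechanically. The sketch below encodes the friends and the pairs who hate each other exactly as listed earlier, verifies the three-party solution, and finds the fewest parties by trying every assignment. Two parties fail because Sally, Henry, and Ben all hate one another, so each needs a separate party. The function and variable names are illustrative.

```python
from itertools import product

friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = [
    ("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
    ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry"),
]


def is_legal(party_of: dict) -> bool:
    """True if no two people who hate each other share a party."""
    return all(party_of[a] != party_of[b] for a, b in hates)


def fewest_parties() -> int:
    """Smallest number of parties, found by trying every assignment."""
    for num_parties in range(1, len(friends) + 1):
        for assignment in product(range(num_parties), repeat=len(friends)):
            if is_legal(dict(zip(friends, assignment))):
                return num_parties
    raise AssertionError("unreachable: one party per friend is always legal")


slide_solution = {"Sally": 0, "Tommy": 0, "Jimmy": 0,
                  "Henry": 1, "Alice": 1, "Ben": 2}
print(is_legal(slide_solution))  # True
print(fewest_parties())          # 3
```

Trying every assignment is fine for six friends but scales exponentially with the number of friends.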

Page 27: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

What is the minimum number of colors that a graph needs?

Remember: no edge between vertices of the same color!

Page 28: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

A graph that needs 3 colors

Page 29: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

2-colored graph

Page 30: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

A computational problem

• 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

Page 31: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Can we 2-color this graph?

Page 32: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Can we 2-color these graphs?

Page 33: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Solving 2-colorability

• 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

• Greedy Algorithm. Start with one vertex and make it red, then make all its neighbors blue, make their uncolored neighbors red, and keep going (starting again from any vertex not yet reached). If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored; if you are ever forced to give two adjacent nodes the same color, it cannot.

• Running time: O(n+m), where n is the number of vertices and m is the number of edges.
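A minimal sketch of this greedy procedure, written as a breadth-first traversal over an adjacency-list graph. The representation and the names `two_color`, `square`, and `triangle` are assumptions for illustration. Each vertex and edge is handled a constant number of times, which gives the O(n+m) bound.

```python
from collections import deque
from typing import Optional


def two_color(adjacency: dict) -> Optional[dict]:
    """Return a red/blue coloring with no monochromatic edge, or None."""
    color = {}
    for start in adjacency:                 # restart in each component
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in color:          # forced: opposite of u's color
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:  # conflict: not 2-colorable
                    return None
    return color


square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
print(two_color(square))    # {'a': 'red', 'b': 'blue', 'd': 'blue', 'c': 'red'}
print(two_color(triangle))  # None
```

The greedy choice is safe here because once one vertex is colored, every other color in its component is forced.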


Page 36: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

Page 37: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

A 3-colored graph

Page 38: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Can you 3-color these graphs?

Page 39: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

How about this graph?

Page 40: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Testing 3-colorability

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

The “greedy algorithm” will work correctly in some, but not all cases.

Page 41: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Exhaustive search for 3-colorability

• Look at all possible vertex colorings

• See if any is “legal” (no edge between vertices of the same color)

Problem: there are 3^n vertex colorings of a graph on n vertices

Question to students: how many vertex colorings are there for a graph with 10 vertices? 20 vertices? 100 vertices?
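The exhaustive search can be sketched in a few lines, which also answers the counting question. The function name and the two small test graphs are illustrative assumptions.

```python
from itertools import product


def three_color_exhaustive(vertices, edges):
    """Try all 3^n colorings; return the first legal one, or None."""
    for colors in product(("red", "blue", "green"), repeat=len(vertices)):
        coloring = dict(zip(vertices, colors))
        if all(coloring[u] != coloring[v] for u, v in edges):
            return coloring
    return None


triangle_edges = [(1, 2), (2, 3), (1, 3)]
k4_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(three_color_exhaustive([1, 2, 3], triangle_edges))  # {1: 'red', 2: 'blue', 3: 'green'}
print(three_color_exhaustive([1, 2, 3, 4], k4_edges))     # None

print(3 ** 10)             # 59049
print(3 ** 20)             # 3486784401
print(len(str(3 ** 100)))  # 48 (3^100 has 48 digits, roughly 5 x 10^47)
```

The search is correct on every graph, but the number of colorings it may have to examine triples with each added vertex.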


Page 43: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

• This problem is NP-hard.

• What does this mean?

Page 44: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

• Some decision problems can be solved in polynomial time:

– Can graph G be 2-colored?

– Does graph G have a 3-clique (three vertices that are all adjacent)?

• Some decision problems seem to not be solvable in polynomial time:

– Can graph G be 3-colored?

– What is the size of the largest clique in the graph G?

Page 45: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

P vs. NP, continued

• The “big” question in theoretical computer science is:

– Is it possible to solve an NP-hard problem in polynomial time?

• If the answer is “yes” for an NP-hard problem such as 3-colorability, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Page 46: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Minimum coloring

• Since 3-colorability is NP-hard, finding the minimum number of colors for a graph is NP-hard.

• That means the problem will be very hard on some graphs -- even if others can be easy.

• So if your brother has a lot of friends, arranging the minimum number of parties could take you a very very very very very long time.

• So forget solving this problem exactly!

Page 47: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Solving NP-hard optimization problems (like min coloring)

Options:

– Solve the problem exactly (but use lots of time on some inputs)

– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)
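To make the second option concrete, here is a sketch of one simple heuristic for min coloring (chosen for illustration; the slides do not single out a specific one): visit the vertices in some order and give each the smallest color not used by its already-colored neighbors. It is fast, but the number of colors it uses depends on the visiting order. The names and the four-vertex path are assumptions.

```python
def greedy_coloring(adjacency: dict, order: list) -> dict:
    """Color vertices in the given order with the smallest free color."""
    color = {}
    for v in order:
        used = {color[u] for u in adjacency[v] if u in color}
        c = 0
        while c in used:        # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color


# A path a - b - c - d needs only 2 colors.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

lucky = greedy_coloring(path, ["a", "b", "c", "d"])
unlucky = greedy_coloring(path, ["a", "d", "b", "c"])
print(len(set(lucky.values())))    # 2
print(len(set(unlucky.values())))  # 3
```

Both runs return legal colorings, yet only one is optimal, and nothing in the output says which.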

Page 48: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Phylogeny estimation is NP-hard, so

• Most methods that are used for maximum parsimony (or maximum likelihood) are heuristics that are not guaranteed to solve the problems exactly.

• Even the best methods can take a very long time (months or more) on some inputs, without being guaranteed to solve their problems well.

• You do not know how poor the solution is.

Page 49: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Hill-climbing for phylogeny estimation

Start with some tree and score it

Repeat

Change the tree slightly, and see if the new tree has a better score.

until no neighbor of your best tree has a better score (i.e., stop at a local optimum)

Return the best tree you found
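The loop above is generic, so a minimal sketch can show it on a toy landscape instead of on trees. Here a "solution" is just a position in a hypothetical list of costs (lower is better) and its neighbors are the adjacent positions; for phylogeny estimation the solutions would be trees and the score would be, for example, the MP score. All names are illustrative.

```python
def hill_climb(start, score, neighbors):
    """Move to a better-scoring neighbor until none exists (lower is better)."""
    best, best_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            candidate_score = score(candidate)
            if candidate_score < best_score:
                best, best_score = candidate, candidate_score
                improved = True
                break               # restart the search from the new best
    return best, best_score


# Toy cost landscape: a local optimum at index 1, the global optimum at index 5.
costs = [5, 3, 4, 6, 2, 0, 1]


def adjacent(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]


print(hill_climb(0, lambda i: costs[i], adjacent))  # (1, 3): stuck in a local optimum
print(hill_climb(6, lambda i: costs[i], adjacent))  # (5, 0): reaches the global optimum
```

Which optimum the search reaches depends entirely on where it starts, and the search itself cannot tell the two outcomes apart.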

Page 50: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Exploring “tree space”

• Tree space moves:

– Short range move: Nearest neighbor interchange (NNI): swap two subtrees on the two sides of an internal edge.

– Long range move: Bisection and reconnection: cut the tree into two subtrees along an edge, and then rejoin the two subtrees to form a different tree.
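A sketch of the short range move, under the assumption that the tree is written as nested 2-tuples rooted on the internal edge being rearranged: if subtrees A and B hang on one side of that edge and C and D on the other, the two NNI neighbors are obtained by swapping B with C or B with D. The function name is illustrative, and the long range move is not shown.

```python
def nni_neighbors(tree):
    """Two trees reachable by NNI across the edge separating (A, B) from (C, D)."""
    (a, b), (c, d) = tree
    return [
        ((a, c), (b, d)),   # swap B and C across the internal edge
        ((a, d), (b, c)),   # swap B and D across the internal edge
    ]


start = (("ACT", "ACA"), ("GTT", "GTA"))
for neighbor in nni_neighbors(start):
    print(neighbor)
# (('ACT', 'GTT'), ('ACA', 'GTA'))
# (('ACT', 'GTA'), ('ACA', 'GTT'))
```

On four leaves the two NNI neighbors are simply the other two possible trees; on larger trees A, B, C, and D are whole subtrees and every internal edge offers two such moves.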

Page 51: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

“Solving” NP-hard phylogenetic estimation problems

Two problems:

1. Getting “stuck” in local optima

2. Taking too long to get to good solutions

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]

Page 52: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Problems with current techniques for MP

Performance of TNT with time

[Figure: average MP score above optimal, shown as a percentage of the optimal (y-axis, 0 to 0.2), plotted against hours of analysis (x-axis, 0 to 24)]

Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.

Page 53: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Observations

• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.

Page 54: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

If a problem is NP-hard

• Some inputs you can solve correctly and quickly, using simple algorithms.

• Some inputs you can solve correctly but it will take a long time.

• Some algorithms will give incorrect answers on some inputs.

• You may not know if your answer is correct or not.

Page 55: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Lessons

• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.

• Therefore we still need better heuristics.

• Biologists should be cautious in believing that the trees found are actually “optimal”.

Page 56: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution

Page 57: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Phylolab, U. Texas

Please visit us at http://www.cs.utexas.edu/users/phylo/

Page 58: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

Acknowledgements

• Funding: NSF and the David and Lucile Packard Foundation

• Collaborators and students: Bernard Moret, Luay Nakhleh, Usman Roshan, and Tiffani Williams

Page 59: Phylogeny Estimation:  Why It Is "Hard", and  How to Design Methods with Good Performance

General comments for NP-hard optimization problems

• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.

• You may not know when you have an optimal solution, if you use a heuristic.

• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?