species trees & constraint programming: recent progress and new challenges

Species Trees & Constraint Programming: recent progress and new challenges By Patrick Prosser Presented by Chris Unsworth at CP06


Page 1: Species Trees & Constraint Programming: recent progress and new challenges

Species Trees & Constraint Programming:

recent progress and new challenges

By Patrick Prosser

Presented by Chris Unsworth at CP06

Page 2: Species Trees & Constraint Programming: recent progress and new challenges

Outline

• Tree of life (what’s that then?)
• Previous work (conventional and CP model)
• What’s new? (enhanced model, new problems)
• Conclusions (what have I told you!?)
• Future work (will this never end?)

Page 3: Species Trees & Constraint Programming: recent progress and new challenges

Tree of life

• A central goal of systematics

• construct the tree of life

• a tree that represents the relationship between all living things

• The leaf nodes of the tree are species

• The interior nodes are hypothesized species

• extinct, where species diverged

Page 5: Species Trees & Constraint Programming: recent progress and new challenges

Not to be confused with this

Page 6: Species Trees & Constraint Programming: recent progress and new challenges

Not to be confused with this

Page 7: Species Trees & Constraint Programming: recent progress and new challenges

Not to be confused with this either

Page 8: Species Trees & Constraint Programming: recent progress and new challenges

Something like this

Page 15: Species Trees & Constraint Programming: recent progress and new challenges

To date, biologists have cataloged about 1.7 million species, yet estimates of the total number of species range from 4 to 100 million.

“Of the 1.7 million species identified only about 80,000 species have been placed in the tree of life”

E. Pennisi “Modernizing the Tree of Life” Science 300:1692-1697 2003

Page 16: Species Trees & Constraint Programming: recent progress and new challenges

Properties of a Species Tree

• We have a set of leaf nodes, each labelled with a species
• the interior nodes have no labels (maybe)
• each interior node has 2 children and one parent (maybe/ideally)
  – a bifurcating tree (maybe/ideally)

Note: recently there have been requirements that
• interior nodes have divergence dates
• leaf nodes correspond to other trees (such as a leaf “cats”)
• trees might not bifurcate

Page 17: Species Trees & Constraint Programming: recent progress and new challenges

Super Trees

• We are given two trees, T1 and T2

• S1 and S2 are the sets of leaves for T1 and T2 respectively
  – remember, leaves are species!

• S1 and S2 have a non-empty intersection
  – some species appear in both trees

• We want to combine T1 and T2
  – respecting the relationships in T1 and T2
  – form a “super tree”

Page 18: Species Trees & Constraint Programming: recent progress and new challenges

combine superTree

Page 19: Species Trees & Constraint Programming: recent progress and new challenges

Overlap is highlighted in the trees and the superTree

Page 20: Species Trees & Constraint Programming: recent progress and new challenges

Overlap is leaves “a” and “f”

A simple wee example

Page 21: Species Trees & Constraint Programming: recent progress and new challenges

Most Recent Common Ancestors (mrca)

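[Figure: a tree on leaves a, b and c, with a and b as siblings]

mrca(a,b) > mrca(a,c), mrca(a,b) > mrca(b,c), mrca(a,c) = mrca(b,c), where > means "further from the root"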

We have 3 species, a, b, and c

Species a and b are more closely relatedto each other than they are to c

The most recent common ancestor of a and bis further from the root than the most recent common ancestor of a and c (and b and c)

[Figure labels: mrca(a,b) is the parent of a and b; the root is mrca(a,c) = mrca(b,c)]

ab | c: a is closer to b than to c

NOTE: mrca(x,y) = mrca(y,x)
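The three-species example can be checked with a few lines of Python. This is only a sketch: the `parent` map and the node names `ab` and `root` are made up for illustration and do not come from the slides.

```python
# Hypothetical toy tree for the example: leaves a, b, c.
# Each node maps to its parent; the root maps to None.
parent = {"root": None, "ab": "root", "a": "ab", "b": "ab", "c": "root"}

def ancestors(node):
    """Path from node up to the root, starting with node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    """Number of edges between node and the root."""
    return len(ancestors(node)) - 1

def mrca(x, y):
    """First ancestor of x that is also an ancestor of y."""
    seen = set(ancestors(y))
    for node in ancestors(x):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

print(mrca("a", "b"), depth(mrca("a", "b")))  # ab 1
print(mrca("a", "c"), depth(mrca("a", "c")))  # root 0
```

Here mrca(a,b) is deeper than mrca(a,c), and mrca(a,c) and mrca(b,c) are the same node, which is the ab | c situation.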

Page 22: Species Trees & Constraint Programming: recent progress and new challenges

Most Recent Common Ancestors (mrca)

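[Figure: the same tree on leaves a, b and c; mrca(a,b) is the parent of a and b, and the root is mrca(a,c) = mrca(b,c)]

mrca(a,b) > mrca(a,c), mrca(a,b) > mrca(b,c), mrca(a,c) = mrca(b,c)

Note: think of mrca(x,y) as having an integer value, its “depth”, so that the relations above compare depths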

Page 23: Species Trees & Constraint Programming: recent progress and new challenges

Ultrametric relationship

Given 3 leaf nodes labelled a, b, and c there are only 4 possible situations

Three triples:
ab | c
ac | b
bc | a

One fan:
(a,b,c)

Page 27: Species Trees & Constraint Programming: recent progress and new challenges

[Figure: the four trees on leaves a, b and c: ab|c, ac|b, bc|a and the fan (a,b,c)]

That’s all that there can be, for 3 leaves

Page 28: Species Trees & Constraint Programming: recent progress and new challenges

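[Figure: the four trees on leaves a, b and c: ab|c, ac|b, bc|a and the fan (a,b,c)]

Another view

A space made up of triangles

[Figure: a triangle with vertices a, b and c]

Given any three vertices the triangle is either isosceles or equilateral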

Page 29: Species Trees & Constraint Programming: recent progress and new challenges

Ultrametric relationship

Given 3 leaf nodes labelled a, b, and c there are only 4 possible situations

We can represent this using primitive constraints

(D[a,c] = D[b,c] ∧ D[a,b] > D[a,c]) ∨
(D[a,b] = D[b,c] ∧ D[a,c] > D[a,b]) ∨
(D[a,b] = D[a,c] ∧ D[b,c] > D[a,c]) ∨
(D[a,b] = D[a,c] = D[b,c])

Where D[i,j] is a constrained integer variable representing the depth in the tree of the most recent common ancestor of the ith and jth species

Page 30: Species Trees & Constraint Programming: recent progress and new challenges

Ultrametric constraint

Therefore the ultrametric constraint is as follows

(D[a,c] = D[b,c] ∧ D[a,b] > D[a,c]) ∨
(D[a,b] = D[b,c] ∧ D[a,c] > D[a,b]) ∨
(D[a,b] = D[a,c] ∧ D[b,c] > D[a,c]) ∨
(D[a,b] = D[a,c] = D[b,c])

Constraint acting between leaf nodes/species a, b, and c
Where D[x,y] is depth in tree of mrca(x,y)
D[x,y] can also be thought of as distance

Page 31: Species Trees & Constraint Programming: recent progress and new challenges

How it goes (part 1)

1. Take 2 species trees T1 and T2
2. Use the “breakUp” algorithm (Ng & Wormald 1996) on T1 then T2
   - This produces a set of triples and fans
3. Use the “oneTree” algorithm (Ng & Wormald 1996)
   - Generates a superTree or fails

This is the “conventional” (non-CP) approach

There are different versions of oneTree and breakUp, from Semple and Steel (I think), that treat fans differently (they ignore them)

oneTree is essentially the algorithm of Aho, Sagiv, Szymanski and Ullman in SIAM J. Comput. 1981

Conventional technology (circa 1981)
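For readers who have not met the 1981 algorithm, here is a rough Python sketch of its triples-only form (often called BUILD). It is not oneTree itself: fans are ignored here, and the function name `build` and the nested-tuple output are choices made for this illustration.

```python
def build(leaves, triples):
    """Triples-only BUILD sketch. A triple (x, y, z) means xy|z.
    Returns a leaf label, a nested tuple of subtrees, or None if incompatible."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Only triples whose three leaves are all still present matter here.
    relevant = [t for t in triples if set(t) <= leaves]
    # Simple union-find: x and y of every triple xy|z must stay together.
    link = {leaf: leaf for leaf in leaves}

    def find(v):
        while link[v] != v:
            v = link[v]
        return v

    for x, y, _ in relevant:
        link[find(x)] = find(y)
    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), set()).add(leaf)
    if len(groups) == 1:
        return None  # everything is glued together: no split is possible
    children = []
    for group in sorted(groups.values(), key=min):  # sorted for stable output
        subtree = build(group, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)

# T1 contributes ab|c, T2 contributes ac|d; the overlap is a and c.
print(build("abcd", [("a", "b", "c"), ("a", "c", "d")]))
# ((('a', 'b'), 'c'), 'd')
```

Two contradictory triples such as ab|c and ac|b glue all three leaves into one group, so the sketch returns None, which corresponds to "or fails" in step 3.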

Page 32: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

[Figure: example tree with leaves A, B, C, D, E, F and G]

1. Find deepest interior node
2. Get its descendants (leaf nodes)
3. Get a cousin or uncle leaf node
4. Generate a triple or fan
5. Delete one of the leaves from step 2
6. Take the other leaf from step 2 and replace its parent with that leaf
7. Go to 1 unless we are at the root with degree 2
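The seven steps can be turned into a short Python sketch. It follows the steps as listed on this slide, not the published Ng & Wormald algorithm, so details such as how a fan is handled (here: emit a fan for the first three children and delete one of them) are assumptions. The nested-list tree encoding and the name `break_up` are also made up for illustration.

```python
import copy

def leaves(t):
    """All leaf labels below t (a leaf is a string, an interior node a list)."""
    return [t] if isinstance(t, str) else [l for c in t for l in leaves(c)]

def deepest(t, parent=None, d=0):
    """Return (depth, node, parent) for a deepest interior node."""
    best = (d, t, parent)
    for c in t:
        if not isinstance(c, str):
            cand = deepest(c, t, d + 1)
            if cand[0] > best[0]:
                best = cand
    return best

def break_up(tree):
    """Return a list of ("triple", x, y, z), meaning xy|z, and ("fan", x, y, z)."""
    tree = copy.deepcopy(tree)  # the input tree is left untouched
    out = []
    while True:
        _, node, parent = deepest(tree)      # step 1 (children are all leaves)
        if len(node) >= 3:                   # assumed fan handling
            out.append(("fan", node[0], node[1], node[2]))
            node.pop(0)                      # step 5
            continue
        if parent is None:
            break                            # step 7: root with degree 2
        x, y = node                          # step 2
        # step 3: any leaf below a sibling of node is a cousin or uncle
        z = next(l for c in parent if c is not node for l in leaves(c))
        out.append(("triple", x, y, z))      # step 4
        # steps 5 and 6: drop x and let y take the place of its parent
        parent[[c is node for c in parent].index(True)] = y
    return out

print(break_up([[["A", "B"], "C"], ["D", "E"]]))
# [('triple', 'A', 'B', 'C'), ('triple', 'B', 'C', 'D'), ('triple', 'D', 'E', 'C')]
```

Each pass removes one leaf, so the loop stops after at most n - 2 triples or fans for a tree with n leaves.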

Page 33: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

[Figure: the example tree with leaves A to G; A and B are the children of a deepest interior node]

Generate triple AB|C
This is the constraint D[A,C] = D[B,C] < D[A,B]

Page 34: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

[Figure: the tree after A has been removed, leaving leaves B, C, D, E, F and G; D and E are the children of a deepest interior node]

Generate triple DE|C
This is the constraint D[D,C] = D[E,C] < D[D,E]

Page 35: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

Generate fan BCE
This is the constraint D[B,C] = D[B,E] = D[C,E]

[Figure: the tree after D has been removed, leaving leaves B, C, E, F and G; a deepest interior node is marked]

Page 36: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

Generate triple FG|E
This is the constraint D[E,F] = D[E,G] < D[F,G]

[Figure: the tree reduced to leaves E, F and G; a deepest interior node is marked]

Page 37: Species Trees & Constraint Programming: recent progress and new challenges

breakUp generates constraints!

[Figure: the tree reduced to leaves E and G]

Done

The triples and fans can be viewed as constraints that break the ultrametric disjunctions

Page 38: Species Trees & Constraint Programming: recent progress and new challenges

The 1st CP approach

Page 39: Species Trees & Constraint Programming: recent progress and new challenges

How it goes (part 2)

This is the CP approach proposed by Gent, Prosser, Smith & Wei in CP03 (a great great paper, go read it)

1. Generate an n by n array of constrained integer variables
2. For all 0 ≤ i < j < k < n post the ultrametric constraint
   - Yes, we have a cubic number of constraints
   - Yes, we have a quadratic number of variables
   - This gives us an “ultrametric matrix”
3. Use breakUp on trees T1 and T2 to produce triples and fans
4. Post the triples and fans as constraints, breaking disjunctions
5. Find a first solution
6. Convert the ultrametric matrix to an ultrametric tree

The algorithm for converting an ultrametric matrix to an ultrametric tree is given by Dan Gusfield

CP approach (circa 2003)
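To make steps 1 to 5 concrete, here is a brute-force Python sketch of the same model. It is not the Claire/Choco code and does no constraint propagation: it simply tries every assignment to the D variables. The domain 1..n-1 for each D value and the name `first_solution` are assumptions made for this illustration.

```python
from itertools import combinations, product

def first_solution(species, triples):
    """triples: (x, y, z) meaning xy|z. Returns {(x, y): depth} or None."""
    n = len(species)
    pairs = list(combinations(species, 2))  # a quadratic number of variables

    def consistent(D):
        def d(x, y):
            return D[(x, y)] if (x, y) in D else D[(y, x)]
        # Ultrametric constraint on every triple of species (cubic many):
        # the two smallest of the three depths must be equal.
        for x, y, z in combinations(species, 3):
            lo, mid, _ = sorted((d(x, y), d(x, z), d(y, z)))
            if lo != mid:
                return False
        # Triples from breakUp pick one disjunct: D[x,z] = D[y,z] < D[x,y].
        return all(d(x, z) == d(y, z) < d(x, y) for x, y, z in triples)

    for values in product(range(1, n), repeat=len(pairs)):
        D = dict(zip(pairs, values))
        if consistent(D):
            return D
    return None

print(first_solution("abcd", [("a", "b", "c"), ("a", "c", "d")]))
# {('a', 'b'): 3, ('a', 'c'): 2, ('a', 'd'): 1, ('b', 'c'): 2, ('b', 'd'): 1, ('c', 'd'): 1}
```

Exhaustive search like this only works for a handful of species; the point of the CP model is that propagation over the ultrametric constraints prunes most of this space.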

Page 40: Species Trees & Constraint Programming: recent progress and new challenges

The key here is that we have an array of variables representing distances, and this space must be ultrametric

Page 41: Species Trees & Constraint Programming: recent progress and new challenges

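[Figure: a min ultrametric tree. The root is labelled 3. Its children are an interior node labelled 4, with leaves B and C, and an interior node labelled 5, with leaf D and an interior node labelled 8 that has leaves A and E]

    A  B  C  D  E
A   0  3  3  5  8
B      0  4  3  3
C         0  3  3
D            0  5
E               0

A min ultrametric tree and its min ultrametric matrix

As we go down a branch, values on interior nodes increase

Matrix value is the value of the most recent common ancestor of two leaf nodes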

Matrix is symmetric
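The matrix on this slide can be checked and turned back into its tree with a short Python sketch. The values are read from the slide as D[A,B] = 3, D[A,C] = 3, D[A,D] = 5, D[A,E] = 8, D[B,C] = 4, D[B,D] = 3, D[B,E] = 3, D[C,D] = 3, D[C,E] = 3 and D[D,E] = 5. The recursive grouping below is a simple illustration, not Gusfield's algorithm, and the names `is_ultrametric` and `to_tree` are made up.

```python
from itertools import combinations

# Upper triangle of the slide's matrix.
UPPER = {
    ("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 5, ("A", "E"): 8,
    ("B", "C"): 4, ("B", "D"): 3, ("B", "E"): 3,
    ("C", "D"): 3, ("C", "E"): 3,
    ("D", "E"): 5,
}

def d(x, y):
    """Symmetric lookup of the matrix value for two different leaves."""
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]

def is_ultrametric(labels):
    """Every triple must have its two smallest values equal."""
    for x, y, z in combinations(labels, 3):
        lo, mid, _ = sorted((d(x, y), d(x, z), d(y, z)))
        if lo != mid:
            return False
    return True

def to_tree(labels):
    """Return a leaf label, or (value, [children]) for an interior node.
    Assumes the matrix restricted to labels is ultrametric."""
    if len(labels) == 1:
        return labels[0]
    m = min(d(x, y) for x, y in combinations(labels, 2))  # value of this node
    groups = []
    for x in labels:
        for g in groups:
            if d(x, g[0]) > m:  # mrca is deeper than this node: same child
                g.append(x)
                break
        else:
            groups.append([x])
    return (m, [to_tree(g) for g in groups])

print(is_ultrametric("ABCDE"))  # True
print(to_tree(list("ABCDE")))
# (3, [(5, [(8, ['A', 'E']), 'D']), (4, ['B', 'C'])])
```

The output has root value 3, a child with value 4 over B and C, and a child with value 5 over D and the node with value 8 over A and E.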

Page 42: Species Trees & Constraint Programming: recent progress and new challenges

The state of play in 2003

• Coded up in Claire & Choco
• more a “proof of concept” than a useful tool
• small data sets only

Page 43: Species Trees & Constraint Programming: recent progress and new challenges

Two species trees of sea birds from the CP03 paper

Page 44: Species Trees & Constraint Programming: recent progress and new challenges

Resultant superTree
On the left by oneTree and on the right by CP model

Page 45: Species Trees & Constraint Programming: recent progress and new challenges

What’s new

1. Reimplemented in Java & JChoco (so faster)
2. More robust (thanks to Pierre Flener’s help)
3. Can now deal with larger trees (about 70 species)
4. Can generate all solutions up to symmetry
5. Can handle divergence dates on interior nodes
6. Reimplemented breakUp & oneTree in Java
7. All code available on the web

2006

Page 47: Species Trees & Constraint Programming: recent progress and new challenges

Bigger Trees

Attempted to reconstruct the supertree in Kennedy & Page’s “Seabird supertrees: Combining partial estimates of Procellariiform phylogeny” in “The Auk: A Quarterly Journal of Ornithology” 119:88-108 2002

• 7 trees of seabirds (A through G)
• Varying in size from 14 to 90 species

Page 48: Species Trees & Constraint Programming: recent progress and new challenges

From the paper

Table shows on the diagonal the size of each tree, A through G
A table entry is the size of the combined tree
A table entry is in () if the trees are incompatible
A table entry is – if the trees are too big for the CP model

The only compatible trees are A, B, D and F
The resultant supertree has 69 species
This takes 20 seconds to produce

Page 50: Species Trees & Constraint Programming: recent progress and new challenges

A “lifted” representation

(D[a,c] = D[b,c] ∧ D[a,b] > D[a,c]) ∨
(D[a,b] = D[b,c] ∧ D[a,c] > D[a,b]) ∨
(D[a,b] = D[a,c] ∧ D[b,c] > D[a,c]) ∨
(D[a,b] = D[a,c] = D[b,c])

Rather than instantiate the “D” variables why not just break the disjunctions?

P[a,b,c] = 1 ⇔ (D[a,c] = D[b,c] ∧ D[a,b] > D[a,c])
P[a,b,c] = 2 ⇔ (D[a,b] = D[b,c] ∧ D[a,c] > D[a,b])
P[a,b,c] = 3 ⇔ (D[a,b] = D[a,c] ∧ D[b,c] > D[a,c])
P[a,b,c] = 4 ⇔ (D[a,b] = D[a,c] = D[b,c])

Now the decision variables are P[i,j,k]

And yes, we have a cubic number of P variables

Page 51: Species Trees & Constraint Programming: recent progress and new challenges

A “lifted” representation

Rather than instantiate the “D” variables why not just break the disjunctions?

P[a,b,c] = 1 ⇔ (D[a,c] = D[b,c] ∧ D[a,b] > D[a,c])
P[a,b,c] = 2 ⇔ (D[a,b] = D[b,c] ∧ D[a,c] > D[a,b])
P[a,b,c] = 3 ⇔ (D[a,b] = D[a,c] ∧ D[b,c] > D[a,c])
P[a,b,c] = 4 ⇔ (D[a,b] = D[a,c] = D[b,c])

Now the decision variables are P[i,j,k]

Now we can:
1. Enumerate all solutions eliminating value symmetries
2. Allow ranges of values on interior nodes of trees

- input and output!
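A tiny Python experiment shows why deciding on P removes value symmetries. For three species and an assumed domain of 1..3 for each D value, many D assignments describe the same tree shape, while P only ever takes one of four values. The function names `p_value` and `count` are made up for this sketch.

```python
from itertools import product

def p_value(d_ab, d_ac, d_bc):
    """1: ab|c, 2: ac|b, 3: bc|a, 4: fan; None if the depths are not ultrametric."""
    if d_ac == d_bc < d_ab:
        return 1
    if d_ab == d_bc < d_ac:
        return 2
    if d_ab == d_ac < d_bc:
        return 3
    if d_ab == d_ac == d_bc:
        return 4
    return None

def count(domain):
    """Return (number of ultrametric D assignments, number of distinct P values)."""
    d_solutions = [v for v in product(domain, repeat=3) if p_value(*v) is not None]
    p_solutions = {p_value(*v) for v in d_solutions}
    return len(d_solutions), len(p_solutions)

print(count(range(1, 4)))  # (12, 4)
```

Twelve different D assignments collapse to four P values, one per tree shape; the remaining freedom in D is what shows up as ranges of values on interior nodes.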

Page 52: Species Trees & Constraint Programming: recent progress and new challenges

Ranked Trees

A new problem where input trees have ancestral divergence dates on interior nodes

A new “conventional” technique is the RANKED TREE algorithm

Page 53: Species Trees & Constraint Programming: recent progress and new challenges

Ranked Trees using “lifted” CP model

A new problem where input trees have ancestral divergence dates on interior nodes

We do this in the “lifted” model by merely

1. reading in divergence dates for pairs of species and posting these as constraints into the “D” variables

2. Then solve using the disjunction breaking “P” variables

3. Interior nodes retain range values

4. In addition can enumerate all solutions eliminating value symmetries

Page 54: Species Trees & Constraint Programming: recent progress and new challenges

Two trees of cats. Ranks (divergence information) on interior nodes. Common species in boxes

Page 55: Species Trees & Constraint Programming: recent progress and new challenges

Two ranked cats trees on left, and on the right one of the ranked supertrees

NOTE: range of values [6..9] on mrca(PTE,LTI)

Page 56: Species Trees & Constraint Programming: recent progress and new challenges

7 of the 17 solutions have ranges on interior nodes
Without the “lifted” representation we get 30 solutions (some redundant)

Page 57: Species Trees & Constraint Programming: recent progress and new challenges

Is this a 1st?

We think so (or at least Patrick thinks so)

1. enumerate all solutions for ranked supertrees
2. remove value symmetries

Page 58: Species Trees & Constraint Programming: recent progress and new challenges

What next?

Reduce the size of the model with a specialised ultrametric constraint
 - over 3 variables
 - over 3 variables plus the P decision variable
 - over an entire n by n array

Improve propagation of ultrametric constraint
 - Bound GAC
 - GAC

New application
 - Identify common features (back bone) of all supertrees
 - Address nested taxa
 - combine all we have

Already underway with Neil Moore

Page 59: Species Trees & Constraint Programming: recent progress and new challenges

Conclusion

• presented a new (non-conventional) way of addressing the supertree problem
• constraint model has been shown to be versatile
  • enumerate all solutions removing symmetries
  • address divergence dates on interior nodes
  • enumerate all solutions for ranked trees
• model is bulky/large
  • we are working on this
• future extensions
  • find the backbone of forest of supertrees
  • address nested taxa

Page 60: Species Trees & Constraint Programming: recent progress and new challenges

I did it all on my own

NO WAY!

Page 61: Species Trees & Constraint Programming: recent progress and new challenges

Thanks for helping

• Pierre Flener
• Xavier Lorca
• Rod Page
• Mike Steel
• Charles Semple
• Chris Unsworth
• Neil Moore
• Christine Wu Wei
• Barbara Smith
• Ian Gent

Page 62: Species Trees & Constraint Programming: recent progress and new challenges

Any questions?