a new recombination lower bound and the minimum perfect phylogenetic forest problem yufeng wu and...

31
A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON’07 July 16, 2007

Upload: tyrone-cockroft

Post on 31-Mar-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

A New Recombination Lower Bound and The Minimum Perfect

Phylogenetic Forest Problem

Yufeng Wu and Dan Gusfield

UC Davis

COCOON’07 July 16, 2007

Page 2: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

2

Recombination

• Recombination: one of the principle genetic forces shaping sequence variations within species

• Two equal length sequences generate a third new equal length sequence during meiosis.

110001111111001

000110000001111

Prefix

Suffix

11000 0000001111

Breakpoint

Page 3: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

3

Ancestral Recombination Graph (ARG)

10 01 0011

10

00

00

01

1 0 0 1

1 1

Network, not tree!

Assumption: at most one mutation per site

Mutations

Recombination

Page 4: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

4

A Min ARG for Kreitman’s data

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

ARG created by SHRUB

Page 5: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

5

Minimizing Recombination• Given enough recombinations, any set of

sequences can be trivially derived on an ARG.

• Problem: Given a set of sequences M, construct an ARG that derives M using one mutation per site, and the minimum number of recombinations (Rmin).

• NP-hard (Wang, et al 2001, Semple et al.).– Efficiently computed Lower bounds on Rmin

exist.

Page 6: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

History Bound (Myers & Griffiths 2003)

000

100

010

011

111

Iterate the following operations1. Remove a column with a single 0 or 12. Remove a duplicate row3. Remove any row

History bound: the minimum number of type-3 operations needed to reduce the matrix to empty

000

100

010

011

00

10

01

01

Empty.

One type-3 operation

00

10

01

M

Page 7: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

7

Graphical interpretation of history bound (HistB)

• Each operation corresponds to an operation that decomposes the optimal, but unknown ARG.

• Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when decomposing the optimal ARG, the number of recombination nodes => number of corresponding type-3 operations.

• However, not all type-3 operations correspond to removing a recombination node.• Since the optimal ARG is unknown, the history bound is the

minimum number of type-3 operations needed to make the matrix empty.

Page 8: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

8

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 00101g: 00101

2p s

a: 00010b: 10010c: 00100d: 10100

e: 01100

f: 00101g: 00101

Operations on M correspond to operations on the optimal ARG

M

Page 9: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

9

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 00101

2p s

a: 00010b: 10010c: 00100d: 10100

e: 01100

f: 00101

12345

Type-2 operation

Page 10: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

10

4

1

3

a: 001

b: 101

d: 110

c: 010

e: 010

f: 010

2p s

a: 001b: 101c: 010d: 110

e: 010

f: 010

134

Type-1 operations

Page 11: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

11

4

1

3

a: 001

b: 101

d: 110

c: 010

2p s

a: 001b: 101c: 010d: 110

134

Type-2 operations

Page 12: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

12

4

1

3

a: 001

b: 101 c: 010

a: 001b: 101c: 010

134

Type-3 operation

Then three more Type-1 operations fully reduce M and the ARG.

Page 13: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

13

History bound

• Initially required trying all n! permutations of the rows to choose the type-3 operations.

• The bound can be computed by DP in O(2n) time (Bafna, Bansal).

• On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them).

• There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this paper comes out of an attempt to find a simple static definition.

Page 14: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

14

Why a static definition matters

• We want a definition of what is being computed, independent of how it is computed, so that we can reason about it and find alternative ways to compute or approximate it.

• For example, with no static definition of the history bound, we don’t know how to formulate an integer linear program to compute it.

Page 15: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

15

00000

1

2

4

3

510100

1000001011

00010

01010

12345sites

Site mutations on edgesThe tree derives the set M:1010010000010110101000010 starting from 00000

Only one mutation per siteallowed.

Perfect Phylogeny

Page 16: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

Intro. to Forest Bound: Decompose an Optimal ARG to A Forest of Trees,

removing recombination edges

An ARG with three recombinations

After removing recombination edges, four trees result.

The number of trees is precisely the number of recombinations plus one

Page 17: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

17

Idea behind the Forest Bound (FB)

Each tree created in this way contains at mostone occurrence of any site, and each site occursin at most one of the trees. So the trees form aforest of related perfect phylogenies.

Page 18: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

18

Forest Bound

Given a set of sequences M, partition M intothe fewest subsets so that each subset of sequences can be derived on a tree, whereeach site occurs at most once in the forest oftrees. The number of trees, minus one, is a validlower bound on Rmin.

Page 19: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

22

Comparing the Forest Bound (FB) to:

• History Bound (HistB)

• Optimal Haplotype Bound (OhapB): The currently best lower bound that can be computed in practice for biological data.

• Theorem: On any data, OhapB <= FB <= HistB On some data, OhapB < FB < HistBThus the FB is the highest lower bound with a static

definition.

Page 20: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

23

First, define the Haplotype Lower Bound (S. Myers, 2003)

• Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 <= minimum number of recombinations needed (folklore)

• Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound.

• Generally Rh is really bad bound, often negative, when used on large intervals, but Very Good when used as local bounds in the Composite Method. Myers implemented the method in a program called RecMin, which was a huge advance, generally three times higher than the prior best lower bound method.

• The composite method can be used with any lower bound method and the better the initial lower bounds, the better the composite result.

Page 21: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

24

Then, the Subset Bound (Myers)

• Let S be a subset of sites, and Rh(S) be the haplotype bound computed on the sequences restricted to S. Rh(S) is a valid lower bound on Rmin.

• Optimal Haplotype Bound (OhapB) is the maximum Rh(S) over all subsets of sites.

• Practical computation of OhapB via ILP was studied in (SWG 2005) and exploited in the program Hapbound. Hapbound gives provably better bounds than RecMin.

Page 22: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

25

Now, the Optimal Haplotype Bound (OhapB)

• OhapB is the maximum haplotype bound over any subset of columns.

• NP-hard (Bafna & Bansal, 2005)– Efficiently computed in

practice (Song, Wu, Gusfield 2005)

0 0 0

0 0 1

1 0 0

1 1 1

Rh = 4 – 3 – 1

= 0

11

01

10

00Rh = 4 – 2 – 1

= 1

Page 23: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

Forest Bound (FB) is Higher than Haplotype Bound (Rh)

10001 00010

00100

1101101101

s2

s1s4

Steiner nodes

Rh = number distinct rows – number distinct columns – 1 = 5 -- 5 --1 = --1

s5

s3

FB = 2

Page 24: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

27

FB >= Rh

FB obtained from all the data M is >= FB obtained from a subset of the columns, so assume all columns in M are distinct.

FB = # trees in the FB -- 1 = # nodes -- # edges -- 1 in the forest= # leaves + # Steiner nodes -- # columns -- 1= # rows + # Steiner nodes -- # columns -- 1>= # distinct rows -- # columns-- 1 = Rh

Page 25: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

28

Forest Bound is Higher than Optimal Haplotype Bound

F(Ms) Rh(Ms) (i.e. the optimal haplotype bound).

s2 s3 s5

Optimal subset of columns Mss2 s3 s5

Input matrix M

Page 26: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

Number of Trees after Taking Subset of Columns

s3

s1

s2

s5

s6s4

Minimum forest with 3 trees for entire data

s3

A legal forest for the subset data!

s2, s3, s5.

s2

s5

s3s2

s5

A legal forest with 3 trees for the subset

Cleanup

Also, taking subsets can not increase the number of trees, and so FB(M) FB(Ms).

So, FB(M) FB(Ms) Rh(Ms), so FB OhapB

Page 27: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

30

FB <= HistB

The decomposition of the optimal ARG, directedby the operations of computing the history bound, creates a forest of HistB + 1 trees, where each site occurs at most once, in at most one tree. So FB <= HistB.

Page 28: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

31

Computing the Forest Bound is NP-Hard

• Optimal haplotype bound is quite good, but NP-hard to compute.

• If the forest bound can be efficiently computable, we do not need to use optimal haplotype bound at all.

• Unfortunately, the forest bound is NP-hard to compute.

• Reduction from Exact-cover-by-3 sets.

Page 29: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

NP-hardness Proof for the Minimum Perfect Phylogenetic Forest Problem

{1,3,5}

{1,2,4}

{2,4,6}

3-SetsBinary sequences on a hypercube

Sequences corresponding to the same set form a perfect phylogeny with a single novel sequence (not in input)

Two sequences from different sets are far apart, and would need two many mutations to connect, thus can not belong to the same tree.

Sequences corresponding to same element in two sets need same mutation and thus can not be both chosen.

Page 30: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

33

Integer Programming Formulation for the Forest Bound

• For sequences with m sites, consider the hypercube all possible 2m sequences.

• Minimizing F is equivalent to reducing the number of Steiner nodes in the forests.

• We also need to ensure the edge linking two nodes in a tree is only labeled with columns that do not appear in other trees.

• Can easily incorporate the missing data in the input.• The IP formulation has exponential size, but practical

when the number of columns is relatively small.

Page 31: A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007

34

Empirical Results

• On random generated dataset with 15 rows and 7 columns, FB > OhapB on 10% of the data. On more biological meaningful data (generated with simulation program ms), however, OhapB= FB more often.

• On dataset generated by ms with missing entries, FB is more often outperforms an approximate optimal Rh bound:– 30 rows and 7 columns and 30% missing entries: FB was

strictly larger in 8% of the data.– When the level of missing entries is lower, the approx.

OhapB matches the FB more often.