WABI 2005

Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event

Yun S. Song, Yufeng Wu and Dan Gusfield

University of California, Davis

Post on 20-Dec-2015



Haplotyping Problem

• Diploid organisms have two (not necessarily identical) copies of each chromosome.

• A single copy is a haplotype: a vector of Single Nucleotide Polymorphisms (SNPs).

• SNP: a site at which two types of nucleotides occur frequently, encoded as 0 or 1.

• The mixed description is the genotype, a vector over 0, 1, 2
– If both haplotypes are 0, the genotype is 0
– If both haplotypes are 1, the genotype is 1
– If one is 0 and the other is 1, the genotype is 2

Haplotypes and Genotypes

Two haplotypes per individual; merging the haplotypes gives the genotype for the individual.

Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0

• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
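The merge rule and the size of the HI search space can be illustrated with a minimal Python sketch. The function names `merge` and `explanations` are illustrative assumptions and do not come from the talk.

```python
from itertools import product

def merge(h1, h2):
    """Genotype of two haplotypes: 0/1 where they agree, 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]

def explanations(genotype):
    """Yield every unordered haplotype pair that merges to the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # Homozygous everywhere: both haplotypes equal the genotype.
        yield list(genotype), list(genotype)
        return
    # Fix the first heterozygous site to 0 in h1 so each unordered pair appears once.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield h1, h2
```

A genotype with k heterozygous sites has 2^(k-1) such explanations, so the genotype 2 1 2 1 0 0 1 2 0 from the slide above has four candidate haplotype pairs.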

Perfect Phylogeny Haplotyping (PPH)

• Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution

• Gusfield (2002) introduced the PPH problem

• PPH: find HI solutions that fit into a perfect phylogeny

• Nice results for PPH, including a linear-time algorithm

The Perfect Phylogeny Model for Haplotypes

[Figure: tree rooted at the ancestral sequence 00000, with site mutations 1 to 5 on the edges and the extant sequences at the leaves.]

The tree derives the set M: 10100, 10000, 01011, 01010, 00010

Assume at most 1 mutation at each site
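Whether a set of binary sequences fits this model can be checked directly. The sketch below assumes the all-zero ancestral sequence shown on the slide and uses the standard condition that, for every pair of sites, the sets of sequences carrying a 1 must be disjoint or nested. The function name is an illustrative assumption.

```python
def has_rooted_perfect_phylogeny(rows):
    """True if the binary rows fit a perfect phylogeny rooted at the all-zero sequence."""
    n_sites = len(rows[0])
    # For each site, the set of row indices that carry the mutated state 1.
    carriers = [
        frozenset(i for i, r in enumerate(rows) if r[j] == "1")
        for j in range(n_sites)
    ]
    for a in range(n_sites):
        for b in range(a + 1, n_sites):
            x, y = carriers[a], carriers[b]
            # Overlapping but non-nested carrier sets need a second mutation somewhere.
            if x & y and not (x <= y or y <= x):
                return False
    return True

M = ["10100", "10000", "01011", "01010", "00010"]
```

With the set M from the slide the check succeeds; a set such as 000, 010, 101, 110 fails it because sites 1 and 2 overlap without nesting.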

PPH Example

[Figure: genotypes, the inferred haplotypes, and the perfect phylogeny they fit.]

Imperfect Phylogeny Haplotyping (IPPH): Extending PPH

• Real biological data often has no PPH solution.

• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic)

• Our approach: IPPH with an explicit genetic model that allows a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination

• Goal: extend the usage of PPH
– Real data may be a small perturbation from PPH
– Haplotype blocks: low recombination or homoplasy

Back/Recurrent Mutation for Haplotypes

Data: 000, 010, 101, 110

[Figure: tree rooted at 000 with edges labelled 2, 1, 3, 1 and internal sequences 010 and 100; site 1 appears on two edges.]

More than one mutation at a site

Recombinations: Single Crossover

• Recombination is one of the principal genetic forces shaping genetic variation

• Two equal-length sequences generate a third sequence of the same length

Parent 1 (prefix):  11000|1111111001
Parent 2 (suffix):  00011|0000001111
Recombinant:        11000|0000001111

The breakpoint (|) falls after site 5.
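A single crossover is a prefix of one parent joined to the suffix of the other. The following minimal sketch uses the two parent sequences from the slide; the function and parameter names are illustrative assumptions.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Return the recombinant that takes the first `breakpoint` sites from
    prefix_parent and the remaining sites from suffix_parent."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]

child = single_crossover("110001111111001", "000110000001111", 5)
```

Here `child` has the same length as its parents, with sites 1 to 5 copied from the first parent and sites 6 to 15 from the second.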

IPPH (Imperfect Phylogeny Haplotyping) Problems

• Small deviation from PPH

• H-1 IPPH problem
– Find a tree that allows exactly one site to mutate twice
– The rest of the sites can mutate at most once
– Derive haplotypes for the given genotypes

• R-1 IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes

Number of Minimum Recombinations for Haplotypes

Frequency of the minimum number of recombinations (Rmin) for small rho (scaled recombination rate). 20 sequences, 30 sites, 500 simulations.

Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%

Haplotyping with One Homoplasy

Genotype
     s1  s2  s3
a    0   2   0
b    1   2   2

Haplotype
     s1  s2  s3
a1   0   0   0
a2   0   1   0
b1   1   0   1
b2   1   1   0

[Figure: 1-homoplasy tree rooted at 000 with leaves a1, b2, a2, b1, edges labelled 2, 1, 3, 1 and internal sequences 010 and 100.]

More than one mutation at site 1

Algorithm for H1-IPPH

• For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to the next site
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report the result

• Assume only one PPH solution for M-{s}

• But how to find solutions with 1 homoplasy at s efficiently?
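The outer loop of this procedure can be sketched as below. This is only a skeleton: `solve_pph` and `extend_with_homoplasy` are hypothetical callbacks standing in for a PPH solver and for the one-homoplasy check, neither of which is specified at this point in the talk. Each is assumed to return None on failure.

```python
def h1_ipph(genotypes, solve_pph, extend_with_homoplasy):
    """Try each site s as the single homoplasy site; return (s, solution) or None."""
    n_sites = len(genotypes[0])
    for s in range(n_sites):
        # M - {s}: the genotype data with column s removed.
        reduced = [g[:s] + g[s + 1:] for g in genotypes]
        pph = solve_pph(reduced)              # hypothetical PPH solver
        if pph is None:
            continue                          # no PPH solution, move to next site
        column = [g[s] for g in genotypes]
        result = extend_with_homoplasy(pph, column, s)  # hypothetical check
        if result is not None:
            return s, result                  # stop and report
    return None
```

Passing the two steps in as functions keeps the loop independent of how PPH and the homoplasy check are implemented.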

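Example

[Figure: M is split at site i3 into M-{i3} and {i3}; PPH on M-{i3} gives the haplotypes Mh-{i3}, shown next to the haplotype column h{i3}. Rows r2, r2', s2, s2' are marked.]

• Assume Mh-{i3} is fixed.
• Haplotypes for the same genotype must pair up.
• Two ways to pair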

Combine Mh-{i3} with h{i3}

• 4 ways to try pairing i3.

• Exponential number in general, even for one PPH solution

• Need a polynomial-time method to avoid trying all the pairings

[Figure: combining Mh-{i3} with h{i3} yields candidate haplotype matrices Mh1, Mh2, ...]
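The growth in the number of pairings can be seen by enumerating them by brute force. The sketch below assumes the haplotype pairs for M-{s} are fixed and, purely for simplicity, appends the values for site s at the end of each haplotype string. The function name `extensions` is an illustrative assumption; this is the exhaustive enumeration the talk wants to avoid, not the polynomial-time method.

```python
from itertools import product

def extensions(pairs, column):
    """Yield every way to extend fixed haplotype pairs with site s.

    pairs  -- list of (h1, h2) strings, one pair per genotype, for M-{s}
    column -- genotype values (0, 1 or 2) at site s, one per genotype
    """
    het = [i for i, g in enumerate(column) if g == 2]
    # Each genotype that is 2 at site s can give the 1 to either haplotype.
    for choice in product([0, 1], repeat=len(het)):
        bit = dict(zip(het, choice))
        out = []
        for i, ((h1, h2), g) in enumerate(zip(pairs, column)):
            if g == 2:
                out.append((h1 + str(bit[i]), h2 + str(1 - bit[i])))
            else:
                out.append((h1 + str(g), h2 + str(g)))
        yield out
```

With k genotypes heterozygous at s there are 2^k extensions, which matches the 4 ways on the slide when two genotypes are heterozygous at i3.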

Move to Trees

Convert the perfect phylogeny tree from the PPH solution to an unrooted tree

1 Homoplasy: from T to Tr, Ts

[Figure: tree T with a recurrent mutation at site s (two edges labelled s), leaf sets L1, L2 and O1, O2; split Ts separating L1, L2 from O1, O2; tree Tr.]

s induces a split Ts

Deleting s induces tree Tr

From Tr, Ts to T

Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts

[Figure: tree Tr and split Ts separating L from O; tree T with two edges labelled s, leaf sets L1 and L - L1 on one side and O1, O2 on the other.]

1. Pick one side of the partition from Ts

2. Pick the leaves of Tr corresponding to the chosen partition side

3. Check whether the selected leaves fit into two subtrees
– May need to refine a non-binary vertex before picking a subtree
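Step 3 can be illustrated with a brute-force check on an unrooted tree given as an adjacency dictionary. This is a sketch under stated assumptions: a "subtree" is taken to be the set of leaves on one side of an edge, non-binary vertices are not refined, and the function names are illustrative. It is not the graph-coloring method used by the authors.

```python
from itertools import combinations

def edge_sides(adj):
    """Leaf sets cut off by each edge of an unrooted tree (both directions)."""
    leaves = {n for n, nb in adj.items() if len(nb) == 1}
    sides = set()
    for u in adj:
        for v in adj[u]:
            # Collect the leaves reachable from v without crossing back to u.
            seen, stack, side = {u, v}, [v], set()
            while stack:
                x = stack.pop()
                if x in leaves:
                    side.add(x)
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            sides.add(frozenset(side))
    return sides

def fits_two_subtrees(adj, chosen):
    """True if `chosen` is the disjoint union of the leaf sets of two subtrees."""
    chosen = frozenset(chosen)
    inside = [c for c in edge_sides(adj) if c and c <= chosen]
    return any(a | b == chosen and not (a & b)
               for a, b in combinations(inside, 2))
```

The check is quadratic in the number of edge sides, which is acceptable for illustration but would not scale to the refinement of non-binary vertices mentioned above.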

s2 can pair with r2’

Solution

Algorithms and Results

• Efficient graph-coloring based method to select two subtrees (skipped)

• Implemented in C++

• Simulated data generated with the program ms

• Compared with PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds

• Can identify the homoplasy site with high accuracy: >95% in simulation

Algorithm for R1-IPPH

[Figure: M is split into ML and MR; the PPH solutions give trees TL and TR.]

Split M by cutting between two sites

Build a perfect phylogeny for each of the two partitions

SPR: subtree-prune-regraft operation

The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
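The first step, trying every cut between two adjacent sites, can be sketched as follows. Only the splitting is shown; solving PPH on each part and testing the SPR distance are not. The function name `column_splits` is an illustrative assumption.

```python
def column_splits(M):
    """Yield (cut, ML, MR) for every cut between two adjacent sites of M.

    M is a list of equal-length genotype rows (strings or lists);
    ML holds sites before the cut and MR the sites after it.
    """
    n_sites = len(M[0])
    for cut in range(1, n_sites):
        yield cut, [r[:cut] for r in M], [r[cut:] for r in M]
```

A matrix with m sites has m - 1 candidate cuts, so this step adds only a linear factor to the overall running time.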

Algorithm for R1-IPPH

• The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.

• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time. (not in paper)

Conclusions

• Contributions
– Assuming a bounded number of PPH solutions
1. Polynomial-time algorithm for the H1-IPPH problem
2. Polynomial-time algorithm for the R1-IPPH problem
3. Possible extension to more than 1 homoplasy event

• Open problems
– Haplotyping with more than 1 recombination efficiently
– Remove the assumption that the number of PPH solutions for M-{s} is bounded

Thank you

• Questions?