pure parsimony

RECOMB SNPs Workshop/Jan 28, 2007 How Accurate is Pure Parsimony Haplotype Inferencing? Sharlee Climer Department of Computer Science and Engineering Department of Biology Washington University in Saint Louis [email protected] www.climer.us Joint work with Weixiong Zhang and Gerold Jaeger


Page 1: Pure Parsimony

RECOMB SNPs Workshop/Jan 28, 2007

How Accurate is Pure Parsimony Haplotype Inferencing?

Sharlee Climer

Department of Computer Science and Engineering
Department of Biology
Washington University in Saint Louis

[email protected]

Joint work with Weixiong Zhang and Gerold Jaeger

Page 2: Pure Parsimony

Pure Parsimony

• Pure Parsimony Haplotype Inferencing (PPHI)
– Find the smallest set of unique haplotypes that can resolve a set of genotypes

• Suggested by Earl Hubbell in 2000

• Cast as an Integer Linear Program (IP) by Dan Gusfield [CPM’03]

• Great research interest
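
To make the PPHI objective concrete, here is a minimal brute-force sketch in Python. It is not the IP formulation cited on the slide; it simply enumerates candidate sets and is only practical for toy inputs. The genotype encoding (0 and 1 for homozygous sites, 2 for a heterozygous site) and the function names `resolving_pairs` and `min_haplotype_sets` are assumptions made for illustration.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """All unordered haplotype pairs that resolve one genotype.

    Assumed encoding: 0 or 1 = homozygous site, 2 = heterozygous site.
    """
    het = [i for i, site in enumerate(genotype) if site == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b  # complementary alleles
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


def min_haplotype_sets(genotypes):
    """Every smallest haplotype set that resolves all genotypes."""
    pairs = [resolving_pairs(g) for g in genotypes]
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    for k in range(len(candidates) + 1):
        optimal = []
        for subset in combinations(candidates, k):
            chosen = set(subset)
            # A genotype is resolved if one of its pairs lies inside the set.
            if all(any(p <= chosen for p in ps) for ps in pairs):
                optimal.append(chosen)
        if optimal:
            return optimal  # first k with a feasible set is the minimum
    return []


genotypes = [(0, 1), (2, 1), (2, 2)]
solutions = min_haplotype_sets(genotypes)
print(len(solutions[0]), len(solutions))  # prints "3 2"
```

Even this three-genotype toy has two distinct optimal solutions of size 3, which illustrates why a parsimony optimum need not be unique.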

Page 3: Pure Parsimony

Overview

• Biological forces

• Haplotypes with low frequency

• Define haplotype classes

• Data sets

• Characteristics of real data

Page 4: Pure Parsimony

Biological forces

Page 5: Pure Parsimony

Biological forces

Page 6: Pure Parsimony

Biological forces

Page 7: Pure Parsimony

Biological forces

Page 8: Pure Parsimony

Biological forces

Page 9: Pure Parsimony

Biological forces

Page 10: Pure Parsimony

Biological forces

Page 11: Pure Parsimony

Biological forces

• Relatively few unique haplotypes

• Subset of haplotypes with low frequency

• Problems for PPHI
– Large number of optimal solutions
– True biological solution might not be parsimonious

• What are structural characteristics of optimal solutions?

Page 12: Pure Parsimony

Classes of haplotypes

• Set of possible haplotypes is exponentially large

• Partition similar to Traveling Salesman Problem

• Backbone haplotypes
– Appear in every optimal solution

• Fat haplotypes
– Do not appear in any optimal solution

• Fluid haplotypes
– Appear in some, but not all, optimal solutions
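
The three classes can be illustrated on a toy instance by enumerating every optimal solution and comparing them. The sketch below is a brute-force illustration under stated assumptions, not the method used in the talk: genotypes are encoded with 0/1 for homozygous sites and 2 for heterozygous sites, the candidate pool is restricted to haplotypes that can resolve at least one genotype, and the names `resolving_pairs` and `classify` are hypothetical.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """Unordered haplotype pairs that resolve a genotype (2 = heterozygous)."""
    het = [i for i, site in enumerate(genotype) if site == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


def classify(genotypes):
    """Split candidate haplotypes into (backbone, fluid, fat) sets."""
    pairs = [resolving_pairs(g) for g in genotypes]
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    optimal = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(p <= chosen for p in ps) for ps in pairs):
                optimal.append(chosen)
        if optimal:
            break  # smallest feasible size reached; all optima collected
    backbone = set.intersection(*optimal)  # in every optimal solution
    used = set.union(*optimal)
    fluid = used - backbone                # in some, but not all
    fat = set(candidates) - used           # in none
    return backbone, fluid, fat


genotypes = [(0, 0, 0), (1, 1, 0), (2, 2, 0), (2, 2, 1)]
backbone, fluid, fat = classify(genotypes)
print(sorted(backbone))  # prints "[(0, 0, 0), (1, 1, 0)]"
print(sorted(fat))       # prints "[(0, 1, 0), (1, 0, 0)]"
print(len(fluid))        # prints "4"
```

Here the first two genotypes force two haplotypes that also resolve the third genotype for free, so its alternative resolution is fat. The fourth genotype has two equally cheap resolutions, so all four of its haplotypes are fluid.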

Page 13: Pure Parsimony

Backbone haplotypes

• Implicit backbones
– All haplotypes that resolve unambiguous genotypes

• Explicit backbones
– Can be identified by solving at most one IP for each haplotype in the solution that isn’t an implicit backbone
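
One plausible reading of the explicit-backbone test is: take any optimal solution, forbid one of its non-implicit haplotypes, and re-solve; if the optimum grows or the instance becomes infeasible, that haplotype is a backbone. The Python sketch below follows that reading, which is an interpretation of the slide rather than the authors' stated procedure. A brute-force search stands in for the IP solver, "unambiguous" is taken to mean at most one heterozygous site, and the names `optimum`, `implicit_backbones`, and `backbones` are hypothetical.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """Unordered haplotype pairs that resolve a genotype (2 = heterozygous)."""
    het = [i for i, site in enumerate(genotype) if site == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


def optimum(genotypes, forbidden=frozenset()):
    """(size, one smallest resolving set), or (None, None) if infeasible."""
    pairs = [{p for p in resolving_pairs(g) if not (p & forbidden)}
             for g in genotypes]
    if any(not ps for ps in pairs):
        return None, None
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(any(p <= chosen for p in ps) for ps in pairs):
                return k, chosen
    return None, None


def implicit_backbones(genotypes):
    """Haplotypes forced by genotypes with at most one heterozygous site."""
    forced = set()
    for g in genotypes:
        if sum(site == 2 for site in g) <= 1:
            (pair,) = resolving_pairs(g)  # exactly one resolution exists
            forced |= pair
    return forced


def backbones(genotypes):
    """Return (implicit, explicit) backbones; one re-solve per candidate."""
    best, solution = optimum(genotypes)
    implicit = implicit_backbones(genotypes)
    explicit = set()
    for h in solution - implicit:
        size, _ = optimum(genotypes, forbidden=frozenset([h]))
        if size is None or size > best:  # cannot do as well without h
            explicit.add(h)
    return implicit, explicit


implicit, explicit = backbones([(0, 0), (2, 2)])
print(implicit, explicit)  # prints "{(0, 0)} {(1, 1)}"
```

Because a backbone appears in every optimal solution, it must appear in whichever optimal solution is found first, so only the haplotypes of that one solution need to be tested.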

Page 14: Pure Parsimony

Backbone haplotypes

[Figure: the haplotypes of one solution (h3, h7, h15, h27, h39, h50, h55, h79, h91), four of them marked "bb" (backbone)]

Page 15: Pure Parsimony

Backbone graph

Page 16: Pure Parsimony

Backbone graph

Page 17: Pure Parsimony

An optimal solution

Page 18: Pure Parsimony

Low frequency haplotype

Page 19: Pure Parsimony

Low frequency haplotype

Page 20: Pure Parsimony

Low frequency haplotype

Page 21: Pure Parsimony

Low frequency haplotype

Page 22: Pure Parsimony

Data sets

• 7 true haplotype data sets
– Orzack et al. [Genetics, 2003]
  • 80 genotypes
  • 9 sites
  • ApoE
– Andres et al. [Genet. Epi., in press]
  • 6 sets of complete data
  • 39 genotypes
  • 5 to 47 sites
  • KLK13 and KLK14

Page 23: Pure Parsimony

Data sets

• HapMap data [Nature 2003, 2005]

– Phase unknown
– Random instance generator
– 20 unique genotypes
– 20 sites
– Three populations
  • CEU
  • YRI
  • JPT+CHB
– 22 chromosomes

Page 24: Pure Parsimony

Size of haplotype backbone

[Chart: percentage of haplotypes that are backbones (y-axis, 0 to 1.2); x-axis labels: BF, HG, BV, ceu2 through ceu20, yri3 through yri21, jpt+chb1 through jpt+chb22. Legend: Implicit backbones, hBBTotal]

Page 25: Pure Parsimony

Number of fluid haplotypes in each solution

[Chart: y-axis 0 to 20; x-axis numbered 1 to 75]

Page 26: Pure Parsimony

Number of optimal solutions

[Chart: y-axis 1 to 1000, logarithmic ticks; x-axis numbered 1 to 76]

Page 27: Pure Parsimony

Number of fluid haplotypes and solutions

[Chart: left y-axis "Number of fluid haplotypes required" (0 to 20); right y-axis "Number of solutions" (0 to 1200); x-axis numbered 1 to 76. Legend: # fluid haplotypes, # of solutions]

Page 28: Pure Parsimony

Biological correctness

Data set   # gen.   # sites   # BB hap.   # fluid hap.   # opt. sols.   Avg. distance to real
A          30       9         15          0              1              8
B          10       5         7           0              1              0
C          18       17        9           3              16             7.5
D          10       8         6           1              4              2.5
E          23       26        9           7              >1000          4.33
F          26       22        12          5              630            28.24
G          35       47        12          16             >1000          10.95

Page 29: Pure Parsimony

Biological correctness

Data set   Parsimony # of haplotypes   True # of haplotypes
A          15                          17
B          7                           7
C          12                          12
D          7                           7
E          16                          16
F          17                          18
G          28                          32

Page 30: Pure Parsimony

Biological correctness

• Accuracy of backbone haplotypes

• Two data sets (F and G) had errors
– One parsimony backbone haplotype not in real solution

Page 31: Pure Parsimony

Number of solutions vs. number of genotypes

[Chart: left y-axis "number of haplotypes" (0 to 18); right y-axis "number of optimal solutions" (0 to 700). Legend: # of haplotypes, # of solutions]

Page 32: Pure Parsimony

Conclusions

• Biological forces tend to minimize cardinality, but also create low frequency haplotypes

• Low frequency in unique genotypes might not be low frequency in full set

• Low frequency haplotypes
– Large number of optimal solutions
– True solution not necessarily parsimonious
– Combinatorial nature can lead to errors in backbones

• Parsimony combined with other biological clues