
1

Haplotyping Algorithms

Qunyuan Zhang

Division of Statistical Genomics

GEMS Course M21-621

Computational Statistical Genetics

Mar. 29, 2012

https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt

2

Questions

WHAT is haplotype?

WHY study haplotype?

WHY use algorithms for haplotyping?

HOW ? (Data, Hypotheses, Algorithms)

3

WHAT is Haplotype?

A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype."

In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project.

From http://en.wikipedia.org/wiki/Haplotype

4

Haplotype = Genotype of Haploid

Haplotypes: Ab//aB => Genotype: Aa Bb

Haplotypes: AB//ab => Genotype: Aa Bb

Haplotype C G + Haplotype T A => Genotype: CT GA

5

WHY Study Haplotype?

An efficient way to represent genetic variation/polymorphism; useful in genomics, population genetics, and genetic epidemiology

Population evolution

LD analysis

Missing genotype imputation

IBD estimation

Tag marker (SNP) selection

Multi-locus linkage & association

6

WHY Use Algorithms in Haplotyping?

Most current molecular genotyping techniques mix DNA from the two homologous chromosomes and only provide the diploid genotype (a mixture of two haplotypes):

genotype (AaBb) => ? => haplotype (Ab//aB or AB//ab)

Some molecular techniques can measure haplotypes directly, but they are expensive (money, labor, time ...), especially for genome-wide studies.

So, at least for now, we need algorithms ...

7

Ambiguity of Haplotype

Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown.

Genotype      Haplotypes
AA BB         AB//AB
Aa bb         Ab//ab
Aa Bb         Ab//aB or AB//ab
Aa Bb Cc      ABC//abc, ABc//abC, AbC//aBc or Abc//aBC
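The genotype-to-haplotype table above can be generated mechanically. The following is a minimal sketch, assuming genotypes are written as two allele letters per locus (for example "AaBb"); the function name `haplotype_pairs` is hypothetical and not part of the lecture.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with an unphased genotype.

    genotype: string with two allele letters per locus, e.g. "AaBb".
    Returns a sorted list of (haplotype, haplotype) tuples.
    """
    # Split "AaBb" into per-locus allele pairs: [("A", "a"), ("B", "b")]
    loci = [(genotype[i], genotype[i + 1]) for i in range(0, len(genotype), 2)]
    pairs = set()
    # At every locus, choose which allele goes on the first haplotype.
    for choice in product((0, 1), repeat=len(loci)):
        h1 = "".join(locus[c] for locus, c in zip(loci, choice))
        h2 = "".join(locus[1 - c] for locus, c in zip(loci, choice))
        # Sorting the pair makes AB//ab and ab//AB the same entry.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(haplotype_pairs("AaBb"))    # [('AB', 'ab'), ('Ab', 'aB')]
print(len(haplotype_pairs("AaBbCc")))  # 4
```

A genotype with L heterozygous loci yields 2^(L-1) unordered pairs, which is why the triple heterozygote has four.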

8

Rule-based Approaches (Parsimony & Phylogeny)

Search an optimal set of haplotypes that satisfies some specific rules

9

Parsimony Approaches

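Clark's Algorithm

1. List all unambiguous haplotypes (from individuals with at most one heterozygous locus)

2. Resolve ambiguous individuals one by one using the listed haplotypes

3. If an individual is only half-resolved (one listed haplotype fits), add the new complementary haplotype to the list

4. Repeat 2 & 3

5. Stop when no more individuals can be resolved

Example, starting from the haplotype list ABC, abc:

AaBbCC => ABC//abC (abC is added to the list)

AABbCc => ABC//Abc (Abc is added to the list)

Haplotype list: ABC, abc, abC, Abc

Continue ... until no more individuals can be resolved

Parsimony rules: maximum resolution of genotypes and/or minimum set of haplotypes

Clark, 1990, Mol. Biol. Evol., 7(2): 111-122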
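The steps of Clark's algorithm can be sketched in a few lines. This is an illustrative simplification, not Clark's published implementation: genotypes are assumed to be strings with two allele letters per locus, individuals are tried in input order, the first listed haplotype that fits is accepted, and identical genotype strings share one result. The names `parse`, `complement` and `clark` are hypothetical.

```python
def parse(genotype):
    """'AaBbCC' -> [('A', 'a'), ('B', 'b'), ('C', 'C')]"""
    return [(genotype[i], genotype[i + 1]) for i in range(0, len(genotype), 2)]


def complement(loci, hap):
    """Return the haplotype that pairs with `hap` to give the genotype, or None."""
    other = []
    for (x, y), h in zip(loci, hap):
        if h == x:
            other.append(y)
        elif h == y:
            other.append(x)
        else:
            return None  # hap is incompatible with this genotype
    return "".join(other)


def clark(genotypes):
    known = []      # haplotype list, in order of discovery
    resolved = {}   # genotype -> (haplotype, haplotype)
    pending = []
    # Step 1: unambiguous individuals (at most one heterozygous locus).
    for g in genotypes:
        loci = parse(g)
        if sum(x != y for x, y in loci) <= 1:
            h1 = "".join(x for x, _ in loci)
            h2 = complement(loci, h1)
            resolved[g] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
        else:
            pending.append(g)
    # Steps 2-5: resolve ambiguous individuals until no progress is made.
    progress = True
    while pending and progress:
        progress = False
        for g in list(pending):
            loci = parse(g)
            for h in known:
                mate = complement(loci, h)
                if mate is not None:
                    resolved[g] = (h, mate)
                    if mate not in known:
                        known.append(mate)  # half-resolved: add new haplotype
                    pending.remove(g)
                    progress = True
                    break
    return resolved, known, pending


resolved, known, unresolved = clark(["AABBCC", "aabbcc", "AaBbCC", "AABbCc"])
print(resolved["AaBbCC"])  # ('ABC', 'abC')
print(resolved["AABbCc"])  # ('ABC', 'Abc')
```

The result depends on the order in which individuals and haplotypes are tried, which is one reason the parsimony rules (maximum resolution, minimum haplotype set) are needed to choose among runs.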

10

Phylogeny Approaches

Given a set of genotypes, find a set of explaining haplotypes that defines a perfect phylogeny.

Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination; infinite-sites mutation, i.e. each site mutates at most once)

D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. in Comput. Mol. Biology, p166–175.
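For biallelic sites, a set of haplotypes fits a perfect phylogeny exactly when no pair of sites shows all four allele combinations (the four-gamete test). The sketch below only checks that condition for a candidate haplotype set; it is not Gusfield's PPH algorithm, which has to find such a set from unphased genotypes. The function name is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test for equal-length haplotype strings with biallelic sites."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        # Distinct allele combinations observed at sites i and j.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False  # all four gametes present: needs recombination or a repeat mutation
    return True


print(admits_perfect_phylogeny(["AB", "ab", "Ab"]))        # True
print(admits_perfect_phylogeny(["AB", "ab", "Ab", "aB"]))  # False
```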

11

Probability-based Approaches (EM & MCMC)

Calculate probability of haplotype, conditional on genotypes. Pr(H|G)=?

12

Data Structure for Haplotyping

Genotypes (subjects × loci):

Subjects (1,2,3…)   Loci (A,B,C…)
1                   G1,A  G1,B  G1,C  …
2                   G2,A  G2,B  G2,C  …
3                   G3,A  G3,B  G3,C  …
…                   …     …     …     …

Haplotypes

Gene/haplotype frequencies (HWE, LD)

Linkage (between loci A, B, C)

Genetic relationship (between subjects)

13

HWE & LD

Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD)

HWE: random combination of alleles from the same locus
Under HWE, allele freq. determines genotype freq.
HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)

Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD)

LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele freq. determines haplotype freq.
LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
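The HWE and LE product rules can be checked numerically. This is a small sketch with hypothetical function names; the LD coefficient D = Pr(AB) - Pr(A)*Pr(B) is the standard two-locus measure and is added here only for illustration (it does not appear on the slide).

```python
from itertools import product


def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}


def le_haplotype_freqs(allele_freqs):
    """Haplotype frequencies under linkage equilibrium.

    allele_freqs: one dict per locus, mapping allele -> frequency.
    """
    freqs = {}
    for alleles in product(*(d.items() for d in allele_freqs)):
        hap = "".join(a for a, _ in alleles)
        p = 1.0
        for _, f in alleles:
            p *= f  # LE: multiply allele frequencies across loci
        freqs[hap] = p
    return freqs


def ld_coefficient(hap_freqs):
    """D = Pr(AB) - Pr(A)*Pr(B) for two biallelic loci (alleles A/a and B/b)."""
    p_A = hap_freqs["AB"] + hap_freqs["Ab"]
    p_B = hap_freqs["AB"] + hap_freqs["aB"]
    return hap_freqs["AB"] - p_A * p_B


print(hwe_genotype_freqs(0.5))  # {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}
le = le_haplotype_freqs([{"A": 0.5, "a": 0.5}, {"B": 0.5, "b": 0.5}])
print(ld_coefficient(le))       # 0.0
```

With the same allele frequencies but haplotype frequencies AB=0.01, Ab=0.49, aB=0.49, ab=0.01, D is about -0.24, i.e. strong LD.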

14

Genetic Relationship (R) & Linkage (r)

AaBb

AABB

AaBb

AB//ab or aB//Ab

AB//ab

(if r=0) AB//ab

(if r>0) AB//ab, Ab//aB

Recombination rate (r)

r =0, complete Linkage

0< r <0.5, incomplete Linkage

r =0.5, no Linkage

AaBb

AaBb

AABB aabb

15

Haplotyping & Conditional Probability

AaBB: Pr(AB//aB)=1

AABb: Pr(AB//Ab)=1

AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5

AABB, aabb, AABB, aabb, AABB, AABb, aabb

AaBB, aabb, AABB, AABB, AABB, AABB, aabb

aabb, AABB, AABB, AABB, AaBb, AABB,aabb

aabb, AABB, AABB, aabb, AABB, aabb, AABB …

Pr(AB//ab)=Pr(Ab//aB)=0.5 ?

HWE or HWD?

LD or LE?

P(H|G, R, r)=?

P(H|G)=?

16

EM Algorithm

for unrelated individuals

Pr(H|G,F)=?

Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927

Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)

Pr(AB)=0.25, Pr(Ab)=0.25, Pr(aB)=0.25, Pr(ab)=0.25

OR

Pr(AB)=0.01, Pr(Ab)=0.49, Pr(aB)=0.49, Pr(ab)=0.01

AaBb: Pr(AB//ab)=? Pr(Ab//aB)=?

17

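Likelihood: L(G|F)

Haplotypes:

$$H = (H_1, H_2, \ldots, H_i, \ldots, H_h)$$

Haplotype Frequencies:

$$F = (f_1, f_2, \ldots, f_i, \ldots, f_h), \qquad \text{constraint: } \sum_{i=1}^{h} f_i = 1$$

Genotypes:

$$G = (G_1, G_2, \ldots, G_k, \ldots, G_g)$$

Haplotype-Genotype compatibility index of the k-th individual:

$$c_{ab,k} = \begin{cases} 1 & \text{if } (H_a // H_b) \Rightarrow G_k \\ 0 & \text{if } (H_a // H_b) \nRightarrow G_k \end{cases}$$

Probability of the k-th individual's G given F & HWE:

$$\Pr(G_k \mid F) = \sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a f_b$$

Joint Likelihood of G given F:

$$L(G \mid F) = \prod_{k=1}^{g} \Pr(G_k \mid F) = \prod_{k=1}^{g} \sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a f_b$$

F=? => Max. L(G|F)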

18

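EM Algorithm: Maximum Likelihood Estimation of Haplotype Freq.

Lagrange multiplier

$$\max\{q(x)\} = ? \quad \text{subject to } g(x) = c$$

$$Q(x, \lambda) = q(x) + \lambda\,(g(x) - c), \qquad \frac{\partial Q}{\partial x} = 0, \quad \frac{\partial Q}{\partial \lambda} = 0$$

Applied to haplotype frequencies:

$$q(F) = \log(L(G \mid F)) = \sum_{k=1}^{g} \log\Big(\sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a f_b\Big)$$

$$g(F) = \sum_{i=1}^{h} f_i - 1 = 0$$

$$Q(F, \lambda) = \sum_{k=1}^{g} \log\Big(\sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a f_b\Big) + \lambda\Big(\sum_{i=1}^{h} f_i - 1\Big)$$

Partial Derivative Equations

$$\frac{\partial Q}{\partial f_i} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$

$$\Rightarrow\quad f_i = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{i,ab}\, c_{ab,k}\, f_a f_b}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a f_b}$$

c=1 if (a,b)=>G, or c=0

z=1 if i in (a,b), or z=0 (z counts the copies of H_i in the pair, so z=2 when a=b=i; this keeps the frequencies summing to 1)

EM Recursion

$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{i,ab}\, c_{ab,k}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a^{(t)} f_b^{(t)}}$$

$$F^{(0)} \rightarrow \Pr(H_{a,b} \mid G, F^{(0)}) \rightarrow F^{(1)} \rightarrow \Pr(H_{a,b} \mid G, F^{(1)}) \rightarrow \cdots \rightarrow F^{(t)} \rightarrow \Pr(H_{a,b} \mid G, F^{(t)}) \rightarrow F^{(t+1)} \rightarrow \cdots$$

Prior => Expectation => Maximization => E => M => …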
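The EM recursion for unrelated individuals can be sketched as follows. This is a minimal illustration under stated assumptions, not the HAPLO software: genotypes are strings with two allele letters per locus, the starting point F(0) is uniform, and all compatible ordered pairs (a, b) are enumerated explicitly, which is only feasible for a handful of loci. Function names and the `n_iter`/`tol` parameters are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (a, b) with compatibility index c = 1."""
    loci = [(genotype[i], genotype[i + 1]) for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product((0, 1), repeat=len(loci)):
        a = "".join(locus[c] for locus, c in zip(loci, choice))
        b = "".join(locus[1 - c] for locus, c in zip(loci, choice))
        pairs.add((a, b))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    pairs_by_ind = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_by_ind for pair in pairs for h in pair})
    freqs = {h: 1.0 / len(haps) for h in haps}  # F(0): uniform start
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in pairs_by_ind:
            # E-step: posterior weight of each compatible pair given F(t).
            weights = [freqs[a] * freqs[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                # Each pair contributes one copy of a and one copy of b
                # (two copies of the same haplotype when a == b).
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2g chromosomes.
        new_freqs = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freqs[h] - freqs[h]) for h in haps)
        freqs = new_freqs
        if delta < tol:
            break
    return freqs


data = ["AABB"] * 4 + ["aabb"] * 4 + ["AaBb"] * 2
for hap, f in em_haplotype_freqs(data).items():
    print(hap, round(f, 4))
```

In this toy data set the double heterozygotes are pulled toward AB//ab because AB and ab are common among the unambiguous individuals, so the estimates for Ab and aB shrink toward zero.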

19

Posterior Probability of Haplotype

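$$\Pr(H_{a,b} \mid G_k, F) = \frac{\Pr(H_{a,b} \mid G_k) \cdot \Pr(H_{a,b} \mid F)}{\sum_{(a,b)} \Pr(H_{a,b} \mid G_k) \cdot \Pr(H_{a,b} \mid F)}$$

$$\Pr(H_{a,b} \mid F) = \Pr(H_a) \cdot \Pr(H_b) = f_a \cdot f_b$$

Example

$$G: DdEe$$

$$H: \; H_1 = DE, \; H_2 = De, \; H_3 = dE, \; H_4 = de$$

Prior Prob.

$$\Pr(H_{1,4} \mid G) = \Pr(DE//de \mid DdEe) = 0.5$$

$$\Pr(H_{2,3} \mid G) = \Pr(De//dE \mid DdEe) = 0.5$$

$$F: \; f_1, f_2, f_3, f_4 = 0.4, 0.1, 0.1, 0.4$$

Posterior Prob.

$$\Pr(H_{1,4} \mid G, F) = \frac{\Pr(H_{1,4} \mid G) \cdot f_1 \cdot f_4}{\Pr(H_{1,4} \mid G) \cdot f_1 \cdot f_4 + \Pr(H_{2,3} \mid G) \cdot f_2 \cdot f_3} = \frac{0.5 \cdot 0.4 \cdot 0.4}{0.5 \cdot 0.4 \cdot 0.4 + 0.5 \cdot 0.1 \cdot 0.1} = \frac{0.08}{0.08 + 0.005} = 0.9412$$

$$\Pr(H_{2,3} \mid G, F) = 0.0588$$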
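The posterior calculation in the DdEe example is short enough to verify in code. This is a sketch with a hypothetical helper `phase_posterior`; it assumes every compatible pair has the same prior probability given the genotype, so the prior cancels and only the product of haplotype frequencies matters.

```python
def phase_posterior(pairs, freqs):
    """Posterior probability of each compatible haplotype pair.

    pairs: haplotype pairs compatible with one genotype (equal prior each)
    freqs: dict mapping haplotype -> population frequency
    """
    weights = {pair: freqs[pair[0]] * freqs[pair[1]] for pair in pairs}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# DdEe example with F = (0.4, 0.1, 0.1, 0.4)
F = {"DE": 0.4, "De": 0.1, "dE": 0.1, "de": 0.4}
post = phase_posterior([("DE", "de"), ("De", "dE")], F)
print(round(post[("DE", "de")], 4))  # 0.9412
print(round(post[("De", "dE")], 4))  # 0.0588
```

The same function answers the AaBb question posed for the two frequency sets on the EM slide: with all four haplotype frequencies at 0.25 both phases get 0.5, while with AB=0.01, Ab=0.49, aB=0.49, ab=0.01 the AB//ab phase gets roughly 0.0004.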

20

Limitation of EM Algorithm

For a diploid (2n) organism, a genotype with L heterozygous markers may involve 2^L possible haplotypes, so EM is impractical for large L

Only suitable for a small number of loci, 2~12

When L=20, 2^L=1,048,576 … a large space of F

Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …

21

MCMC

Markov Chain Monte Carlo Algorithm

for unrelated individuals

by sampling from Pr(H|G,F)

Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)

22

Markov Chain

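$$
\begin{aligned}
H^{(0)} &= \big(H_{G_1}^{(0)}, H_{G_2}^{(0)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
&\downarrow \; \Pr(H_{G_1} \mid G_1, H_{-G_1}) \\
&\phantom{=} \big(H_{G_1}^{(1)}, H_{G_2}^{(0)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
&\downarrow \; \Pr(H_{G_2} \mid G_2, H_{-G_2}) \\
&\phantom{=} \big(H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
&\;\;\vdots \\
H^{(1)} &= \big(H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}\big) \\
&\downarrow \; \Pr(H_{G_1} \mid G_1, H_{-G_1}) \\
&\phantom{=} \big(H_{G_1}^{(2)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}\big) \\
&\;\;\vdots \\
H^{(t)} &= \big(H_{G_1}^{(t)}, H_{G_2}^{(t)}, \ldots, H_{G_k}^{(t)}, \ldots, H_{G_g}^{(t)}\big) \\
&\;\;\vdots
\end{aligned}
$$

MCMC Estimation

Random sampling based on Pr(H_Gk | G_k, H_-Gk), where H_-Gk denotes the haplotypes of all individuals except the k-th

Repeat many times

After getting close to the stationary distribution of P(H|G)

Collect samples

Average over samples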
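The sampling scheme above can be illustrated with a simplified Gibbs sampler. Note the substitution: PHASE uses a coalescent-based conditional distribution, whereas this sketch uses plain haplotype counts plus a small pseudo-count `alpha` as a stand-in for Pr(H_Gk | G_k, H_-Gk). All names and default parameter values are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def unordered_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AaBb'."""
    loci = [(genotype[i], genotype[i + 1]) for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product((0, 1), repeat=len(loci)):
        a = "".join(locus[c] for locus, c in zip(loci, choice))
        b = "".join(locus[1 - c] for locus, c in zip(loci, choice))
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


def gibbs_phase(genotypes, n_sweeps=500, burn_in=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [unordered_pairs(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]       # H(0): random phases
    counts = Counter(h for pair in state for h in pair)  # haplotype counts
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            for h in state[k]:
                counts[h] -= 1                           # remove H_Gk, leaving H_-Gk
            # Conditional weight of each pair given everyone else's haplotypes.
            weights = [(counts[a] + alpha) * (counts[b] + alpha) for a, b in opts]
            state[k] = rng.choices(opts, weights=weights)[0]
            for h in state[k]:
                counts[h] += 1                           # put the new pair back
            if sweep >= burn_in:
                tallies[k][state[k]] += 1                # collect samples
    kept = n_sweeps - burn_in
    # Average over samples: empirical posterior of each individual's phase.
    return [{pair: n / kept for pair, n in t.items()} for t in tallies]


posteriors = gibbs_phase(["AABB"] * 5 + ["aabb"] * 5 + ["AaBb"])
print(posteriors[-1])
```

The early sweeps are discarded as burn-in so that the collected samples come from a chain that is close to its stationary distribution.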

23

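Transition Probability Pr(H_Gk | G_k, H_-Gk)

Given the current haplotypes H(t) of L loci for all G:

List the distinct haplotypes H = (H_1, H_2, ..., H_m) and their counts n = (n_1, n_2, ..., n_m)

Pick G_k and remove H_Gk from H

For each listed haplotype H_i:
if G_k is not compatible with H_i, then p_i = 0
if G_k is compatible with H_i, then G_k => (H_i, H_i'); check whether the complementary haplotype H_i' is in the list, and set p_i from the counts n and M accordingly

Finally get p = (p_1, p_2, ..., p_m)

Either construct the haplotype pair from the list, choosing H_i with probability proportional to p_i, or choose the phase randomly

Add the newly constructed haplotype to list H (giving H(t+1) for G_k), pick G_k+1 …

Coalescent hypothesis, mutation rate, M haplotypes

Subsetting loci, reducing time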

24

EM vs. MCMC

EM

Search F, Max. L(G|F)
Haplo. freq. => Haplo. construction
Maximum likelihood approach
“Analytical” posterior distribution
Fewer loci
Convergence: local maximum

MCMC

Sample from Pr(H|G,F)
Haplo. construction => Haplo. freq.
Sampling approach
“Empirical” posterior distribution
More loci
Better convergence: whole parameter space (more computer time)

25

EM Algorithm

for family data

(no recombination, r=0)

Pr(H{fam.}|G,R,F)=?

Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)

Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)

O’Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)

26

Haplotype Configuration of Family

Genotypes

Parents: AaBb, AaBb; Child: AaBb

Possible Haplotype Configurations

Parents: AB//ab, AB//ab; Child: AB//ab

Parents: Ab//aB, Ab//aB; Child: Ab//aB

Parents: AB//ab, AB//ab; Child: Ab//aB (recombinant; as r=0 or nearly 0, impossible or very low prob., ignored)
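The non-recombinant configurations of a trio can be enumerated directly. The sketch below assumes r=0, so the child receives one intact haplotype from each parent; it treats the child's pair as unordered (parental origin is not tracked). The function names are hypothetical.

```python
from itertools import product


def phasings(genotype):
    """Unordered haplotype pairs compatible with a genotype such as 'AaBb'."""
    loci = [(genotype[i], genotype[i + 1]) for i in range(0, len(genotype), 2)]
    pairs = set()
    for choice in product((0, 1), repeat=len(loci)):
        a = "".join(locus[c] for locus, c in zip(loci, choice))
        b = "".join(locus[1 - c] for locus, c in zip(loci, choice))
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


def trio_configurations(father, mother, child):
    """Haplotype configurations (father, mother, child) assuming r = 0."""
    child_options = phasings(child)
    configs = []
    for f_pair, m_pair in product(phasings(father), phasings(mother)):
        child_pairs = set()
        # The child gets one whole haplotype from each parent (no recombination).
        for paternal, maternal in product(f_pair, m_pair):
            pair = tuple(sorted((paternal, maternal)))
            if pair in child_options:
                child_pairs.add(pair)
        for c_pair in sorted(child_pairs):
            configs.append((f_pair, m_pair, c_pair))
    return configs


for config in trio_configurations("AaBb", "AaBb", "AaBb"):
    print(config)
# (('AB', 'ab'), ('AB', 'ab'), ('AB', 'ab'))
# (('Ab', 'aB'), ('Ab', 'aB'), ('Ab', 'aB'))
```

For three double heterozygotes only two configurations survive, matching the slide: mixed parental phases cannot produce an AaBb child without recombination.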

27

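EM Algorithm: Haplotype Freq. Estimation using Nuclear Families

Unrelated Indv.

$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z_{i,ab}\, c_{ab,k}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c_{ab,k}\, f_a^{(t)} f_b^{(t)}}$$

Nuclear Families

$$f_i^{(t+1)} = \frac{1}{4 N_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} z_{i,a_1 b_1 a_2 b_2}\, c_{a_1 b_1 a_2 b_2,fam}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} c_{a_1 b_1 a_2 b_2,fam}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}$$

Here (a_1, b_1) and (a_2, b_2) index the haplotype pairs of the two parents, and z counts the copies of H_i among the four parental haplotypes.

Tips:

Only use parents to calculate haplotype freq. (f)

Use parents' + children's info to determine compatibility (c)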

28

EM Algorithm: Haplotype Freq. Estimation for General Pedigrees

$$f_i^{(t+1)} = \frac{1}{2 N_f} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1, b_1, \ldots, a_n, b_n} z_{i,a_1 b_1 \ldots a_n b_n}\, c_{a_1 b_1 \ldots a_n b_n,fam}\, f_{a_1}^{(t)} f_{b_1}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}{\sum_{a_1, b_1, \ldots, a_n, b_n} c_{a_1 b_1 \ldots a_n b_n,fam}\, f_{a_1}^{(t)} f_{b_1}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}$$

Here n is the number of founders in the family, (a_j, b_j) indexes the haplotype pair of the j-th founder, N_f is the total number of founders over all families, and z counts the copies of H_i among the founder haplotypes.

Tips:

Only use founders to calculate haplotype freq. (f)

Use all members (founders & non-founders) to determine compatibility (c)

Discard the cases with too small probabilities to save time

29

Posterior Probability of Haplotype Configuration

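Nuclear Family

$$\Pr(H_{fam,k} \mid G_{fam}, F) = \frac{\Pr(H_{fam,k} \mid G_{fam}) \cdot \Pr(H_{parents,k} \mid F)}{\sum_{\text{all configs}} \Pr(H_{fam,k} \mid G_{fam}) \cdot \Pr(H_{parents,k} \mid F)}$$

$$\Pr(H_{parents} \mid F) = \Pr(H_{a_1}) \cdot \Pr(H_{b_1}) \cdot \Pr(H_{a_2}) \cdot \Pr(H_{b_2}) = f_{a_1} \cdot f_{b_1} \cdot f_{a_2} \cdot f_{b_2}$$

(Dad: a_1, b_1; Mom: a_2, b_2)

General Family

$$\Pr(H_{fam,k} \mid G_{fam}, F) = \frac{\Pr(H_{fam,k} \mid G_{fam}) \cdot \Pr(H_{founders,k} \mid F)}{\sum_{\text{all configs}} \Pr(H_{fam,k} \mid G_{fam}) \cdot \Pr(H_{founders,k} \mid F)}$$

$$\Pr(H_{founders} \mid F) = \prod_{j=1}^{N_{founders}} \Pr(H_{a_j}) \cdot \Pr(H_{b_j}) = \prod_{j=1}^{N_{founders}} f_{a_j} \cdot f_{b_j}$$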

30

A Middle Summary … Subject-oriented Algorithms

Loci: A, B, C

Joint Prob. / Likelihood is computed subject by subject:

indiv. by indiv. (unrelated)

family by family (r=0)

Large/General Pedigree & Allowing Recombination (r>0)?

31

Next … Locus-oriented Algorithm (Lander-Green)

For Large/General Pedigree Data & Allowing Recombination (r>0)

A pedigree; loci: A, B, C

Joint Prob. / Likelihood is computed locus by locus

32

Inheritance Vector (V) of a pedigree

Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367

Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)

Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)

Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)

33

Inheritance Vector & Haplotype

5: AaBb

1101 AB//ab 1101

1101 Ab//aB 1111

34

Lander-Green Algorithm

One pedigree; loci A, B, C, …

Hidden states (inheritance vectors): V_A => V_B => V_C => …

Transition Prob. = f(r): Pr(V_B|V_A), Pr(V_C|V_B), …, Pr(V_t+1|V_t)

Emission Prob.: Pr(G_A|V_A), Pr(G_B|V_B), Pr(G_C|V_C)

Observations (genotypes): G_A, G_B, G_C
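The hidden Markov structure above can be sketched with a transition function and a forward pass. The assumptions are labeled: inheritance vectors are bit strings with one bit per meiosis, each bit switches between adjacent loci independently with probability r (the same r for all meioses), the first locus has a uniform prior over vectors, and the emission probabilities Pr(G|V) are supplied by the caller rather than derived from a pedigree. Function names are hypothetical.

```python
from itertools import product


def transition_prob(v_from, v_to, r):
    """Pr(V_t+1 | V_t): every meiosis bit flips independently with probability r."""
    d = sum(a != b for a, b in zip(v_from, v_to))  # number of recombinant meioses
    n = len(v_from)
    return (r ** d) * ((1 - r) ** (n - d))


def forward_likelihood(emissions, r_between):
    """Likelihood of the genotypes at all loci, summed over inheritance vectors.

    emissions: one dict per locus mapping inheritance vector -> Pr(G_locus | V)
    r_between: recombination rates between adjacent loci (len(emissions) - 1 values)
    """
    vectors = list(emissions[0])
    # First locus: uniform prior over inheritance vectors times emission.
    alpha = {v: emissions[0][v] / len(vectors) for v in vectors}
    for emit, r in zip(emissions[1:], r_between):
        alpha = {
            v: emit[v] * sum(alpha[u] * transition_prob(u, v, r) for u in vectors)
            for v in vectors
        }
    return sum(alpha.values())


vectors = ["".join(bits) for bits in product("01", repeat=2)]
print(sum(transition_prob("00", v, 0.1) for v in vectors))  # sums to 1 (up to rounding)
print(transition_prob("1101", "1111", 0.0))                 # 0.0: no switch possible when r = 0
```

The cost grows with the number of inheritance vectors (2 to the number of meioses) but only linearly with the number of loci, which is why the locus-by-locus approach suits many markers on moderately sized pedigrees.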

35

Lander-Green Algorithm Based (or Similar) Approaches

Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Viterbi algorithm, the best haplotype configuration

Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
MCMC: Annealing & Metropolis Process

Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)
Allowing LD & Marker Cluster/Block

Haplotyping

based on sequencing data

(can be done for individual subject with no population data)

36

Rationale

37

Bansal et al., 2008, Genome Res., 18(8): 1336–1346

Data Structure

38

Algorithms

39

ML

Or MCMC when H space is huge

Prob(sequence | haplotype)

40

Terms of the formula:

observed sequence

haplotype

indicator: =1 if the observed sequence X matches the assumed haplotype, =0 otherwise (for the j-th variant site of the i-th fragment)

sequencing/mapping error
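A probability of observed fragments given an assumed haplotype can be sketched from the ingredients named on this slide (match indicator, sequencing/mapping error). The formulation below is an assumption and not necessarily the exact expression from Bansal et al.: sites err independently with a single error rate, alleles are coded 0/1, and each fragment is equally likely to come from either of the two complementary haplotypes. All names are hypothetical.

```python
def fragment_likelihood(fragment, haplotype, error_rate):
    """Pr(fragment | haplotype) over the variant sites the fragment covers.

    fragment: dict mapping variant-site index -> observed allele (0 or 1)
    haplotype: list of alleles (0 or 1), one per variant site
    """
    p = 1.0
    for site, allele in fragment.items():
        match = 1 if haplotype[site] == allele else 0  # indicator: 1 if match, else 0
        p *= match * (1 - error_rate) + (1 - match) * error_rate
    return p


def reads_likelihood(fragments, hap1, error_rate):
    """Pr(all fragments | haplotype pair), the pair being hap1 and its complement."""
    hap2 = [1 - a for a in hap1]
    total = 1.0
    for frag in fragments:
        # Each fragment comes from either chromosome with probability 1/2.
        total *= 0.5 * fragment_likelihood(frag, hap1, error_rate) \
               + 0.5 * fragment_likelihood(frag, hap2, error_rate)
    return total


fragments = [{0: 0, 1: 0}, {1: 0, 2: 1}, {0: 1, 1: 1}]
good = reads_likelihood(fragments, [0, 0, 1], 0.01)
bad = reads_likelihood(fragments, [0, 1, 1], 0.01)
print(good > bad)  # True: phase 001//110 explains the overlapping fragments better
```

Because overlapping fragments link neighboring heterozygous sites, a single sequenced individual carries phase information, which is why this approach needs no population data.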

Markov Chain

41

Sampling H from .

42

Practices

(1) If a child’s genotype at 4 loci is AaBbCcDD, list all possible haplotype pairs of the child and calculate the probability of each pair, given no extra information.

(2) If you know that his/her father’s genotype is also AaBbCcDD and the mother’s is AaBbCCDD, list all possible haplotype configurations of the family and calculate the probability of each configuration. (Assume recombination rate r=0.)

(3) If you know the population haplotype frequencies below: ABCD (0.2), ABcD (0.1), AbcD (0.1), aBCD (0.1), aBcD (0.2), abcD (0.3), calculate the posterior probabilities in (1).

Within a week, send your answers to (E-mail: qunyuan@wustl.edu)
