haplotyping algorithm
Haplotyping Algorithm. Qunyuan Zhang, Division of Statistical Genomics. GEMS Course M21-621 Computational Statistical Genetics, Mar. 6, 2008.
1
Haplotyping Algorithm
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
Mar. 6, 2008
2
Haplotyping…
Using molecular and/or mathematical techniques to measure or infer the haplotypes of a subject (or a set of subjects), given a set of genetic markers/loci (number of loci L≥2)
3
Questions
WHAT is haplotype?
WHY study haplotype?
WHY use algorithm in haplotyping?
HOW ? (Data, Hypotheses, Algorithms)
4
WHAT is Haplotype?
A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. A haplotype may refer to as few as two loci or to an entire chromosome, depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype."
In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project.
From http://en.wikipedia.org/wiki/Haplotype
5
Haplotype = Genotype of Haploid
Genotype: Aa Bb; Haplotypes: Ab//aB
Genotype: Aa Bb; Haplotypes: AB//ab
Genotype: CT GA; Haplotypes: C G and T A
6
WHY Study Haplotype?
An efficient way to represent genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology
Population evolution
LD analysis
Missing genotype imputation
IBD estimation
Tag marker (SNP) selection
Multi-locus linkage & association
…
7
WHY use algorithm in haplotyping?
Most current molecular genotyping techniques mix DNA from the two homologous chromosomes and therefore provide only diploid genotypes (a mixture of two haplotypes)
genotypes => ? => haplotypes
Some molecular techniques can measure haplotypes directly, but they are expensive (money, labor, time …), especially for genome-wide studies.
So, at least for now, we need algorithms …
8
Ambiguity of Haplotype
Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown
Genotype coding: 0 = aa, bb or cc; 1 = AA, BB or CC; 2 = Aa, Bb or Cc
Haplotype coding: 0 = a, b or c; 1 = A, B or C
Genotype (code) => possible haplotype pairs (code)
AA BB (1 1) => AB//AB (11//11)
Aa bb (2 0) => Ab//ab (10//00)
Aa Bb (2 2) => Ab//aB or AB//ab (10//01 or 11//00)
Aa Bb Cc (2 2 2) => ABC//abc, ABc//abC, AbC//aBc or Abc//aBC (111//000, 110//001, 101//010 or 100//011)
CT GA => CG//TA or CA//TG
CT GA GC => CGG//TAC, CGC//TAG, CAG//TGC or TGG//CAC
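The enumeration on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the 0/1/2 genotype coding used above; the function name `haplotype_pairs` is hypothetical. A genotype with n heterozygous loci (n≥1) yields 2^(n-1) unordered haplotype pairs.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with a coded genotype.

    genotype: sequence with one entry per locus, 0 (aa), 1 (AA) or 2 (Aa).
    Returns a list of (hap1, hap2) strings of 0/1 alleles, hap1 >= hap2.
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Try every assignment of alleles to the first haplotype at heterozygous loci.
    for phase in product("01", repeat=len(het)):
        h1 = [str(g) for g in genotype]   # homozygous loci copy the genotype code
        h2 = list(h1)
        for locus, allele in zip(het, phase):
            h1[locus] = allele
            h2[locus] = "1" if allele == "0" else "0"
        a, b = "".join(h1), "".join(h2)
        pairs.add((max(a, b), min(a, b)))  # unordered: 10//01 equals 01//10
    return sorted(pairs, reverse=True)

print(haplotype_pairs([2, 2]))     # Aa Bb
print(haplotype_pairs([2, 2, 2]))  # Aa Bb Cc
```

For Aa Bb Cc the four pairs are 111//000, 110//001, 101//010 and 100//011.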
9
Rule-based Approaches (Parsimony & Phylogeny)
Search for an optimal set of haplotypes that satisfies specific rules
10
Parsimony Approaches
Clark's Algorithm
1. List all unambiguous haplotypes
2. Resolve ambiguous individuals one by one using the listed haplotypes
3. If an individual is only half-resolved, add the new haplotype to the list
4. Repeat steps 2 & 3
5. Stop when no further individual can be resolved
Example: haplotype list ABC, abc; abC and Abc are added as individuals are resolved
AaBbCC => ABC//abC
AABbCc => ABC//Abc
Continue … until no one can be resolved
Parsimony rules: maximum resolution of genotypes and/or minimum set of haplotypes
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
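The steps listed on this slide can be sketched as follows. This is a minimal illustration rather than Clark's published implementation; it assumes the 0/1/2 genotype coding (A=1, a=0), and the names `complement` and `clark` are hypothetical. As with any greedy rule, the result can depend on the order in which individuals and haplotypes are tried.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for g, allele in zip(genotype, hap):
        if g == 2:                    # heterozygous: partner carries the other allele
            other.append("1" if allele == "0" else "0")
        elif str(g) == allele:        # homozygous: partner carries the same allele
            other.append(allele)
        else:                         # hap cannot be part of this genotype
            return None
    return "".join(other)

def clark(genotypes):
    known = []       # haplotype list, in order of discovery
    resolved = {}    # individual index -> (hap1, hap2)
    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for k, g in enumerate(genotypes):
        if sum(x == 2 for x in g) <= 1:
            h1 = "".join("1" if x == 2 else str(x) for x in g)
            h2 = complement(g, h1)
            resolved[k] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Steps 2-5: keep resolving until a full pass makes no progress.
    progress = True
    while progress:
        progress = False
        for k, g in enumerate(genotypes):
            if k in resolved:
                continue
            for h in known:
                partner = complement(g, h)
                if partner is not None:
                    resolved[k] = (h, partner)
                    if partner not in known:
                        known.append(partner)   # half-resolved: add new haplotype
                    progress = True
                    break
    return resolved, known

# AABBCC, aabbcc, AaBbCC, AABbCc
resolved, known = clark([(1, 1, 1), (0, 0, 0), (2, 2, 1), (1, 2, 2)])
print(resolved[2], resolved[3], known)
```

With the slide's example, AaBbCC resolves to 111//001 (ABC//abC) and AABbCc to 111//100 (ABC//Abc).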
11
Phylogeny Approaches
D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. In Comput. Mol. Biology, p166–175.
Given a set of genotypes, find a set of explaining haplotypes that defines a perfect phylogeny.
Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination, infinite-site mutation)
12
Probability-based Approaches (EM & MCMC)
Calculate probability of haplotype, conditional on genotypes. Pr(H|G)=?
13
[Diagram: population processes and epidemiologic data]
Mutation, Selection, Admixture, Drift (gene frequencies)
Linkage, Recombination, LD (haplotype frequencies)
HWE (genotype frequencies)
Epidemiologic data (sample): Genotype, Phenotype, Environment Factors
Pr(P | G, E) = ?
Haplotype: Pr(H | G) = ?
14
Data Structure for Haplotyping
Genotypes: subjects (1, 2, 3, …) in rows, loci (A, B, C, …) in columns
G1,A G1,B G1,C …
G2,A G2,B G2,C …
G3,A G3,B G3,C …
… … … …
Between subjects: genetic relationship; gene/haplotype frequencies (HWE, LD)
Between loci (A, B, C): linkage, haplotypes
15
HWE & LD
Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD)
HWE: random combination of alleles at the same locus
Under HWE, allele frequencies determine genotype frequencies
HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD)
LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele frequencies determine haplotype frequencies
LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
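As a small numeric illustration of the two equilibrium rules, the sketch below computes genotype frequencies under HWE and a haplotype frequency under LE. The function names are hypothetical, and the coefficient D = Pr(AB) − Pr(A)·Pr(B) is a standard LD measure added here for illustration; it does not appear on the slide.

```python
def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under LE: product of the allele frequencies."""
    prob = 1.0
    for p in allele_freqs:
        prob *= p
    return prob

def ld_coefficient(p_AB, p_A, p_B):
    """D = Pr(AB) - Pr(A)*Pr(B); D = 0 under linkage equilibrium."""
    return p_AB - p_A * p_B

print(hwe_genotype_freqs(0.6))
print(le_haplotype_freq([0.5, 0.4, 0.2]))
print(ld_coefficient(0.4, 0.5, 0.5))
```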
16
Genetic Relationship (R) & Linkage (r)
[Pedigree diagrams; recoverable labels]
Genotypes: AaBb, AABB, AaBb; AABB, aabb, AaBb, AaBb
Haplotypes: AB//ab or aB//Ab; AB//ab; (if r=0) AB//ab; (if r>0) AB//ab, Ab//aB
Recombination rate (r)
r = 0: complete linkage
0 < r < 0.5: incomplete linkage
r = 0.5: no linkage
17
Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1
AABb: Pr(AB//Ab)=1
AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5
Sample genotypes:
AABB, aabb, AABB, aabb, AABB, AABb, aabb
AaBB, aabb, AABB, AABB, AABB, AABB, aabb
aabb, AABB, AABB, AABB, AaBb, AABB, aabb
aabb, AABB, AABB, aabb, AABB, aabb, AABB …
Pr(AB//ab)=Pr(Ab//aB)=0.5 ?
HWE or HWD?
LD or LE?
P(H|G, R, r)=?
P(H|G)=?
18
EM Algorithm
for unrelated individuals
Pr(Ha,b|G,F)=?
Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927
Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)
LD: Pr(ABC)≠Pr(A)*Pr(B)*Pr(C)
19
Likelihood: L(G|F)
Haplotypes: $H = (H_1, H_2, \ldots, H_h)$
Haplotype frequencies: $F = (f_1, f_2, \ldots, f_h)$
Genotypes: $G = (G_1, G_2, \ldots, G_g)$
Joint likelihood of G given F:
$$L(G|F) = \prod_{k=1}^{g} \Pr(G_k|F)$$
Probability of the k-th individual's G given F & HWE:
$$\Pr(G_k|F) = \sum_{a=1}^{h} \sum_{b=1}^{h} c_{kab}\, f_a f_b$$
Haplotype-genotype compatibility index of the k-th individual:
$$c_{kab} = \begin{cases} 1 & \text{if } H_a//H_b = G_k \\ 0 & \text{if } H_a//H_b \neq G_k \end{cases}$$
Constraint: $\sum_{i=1}^{h} f_i = 1$
F=? => Max. L(G|F)
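To make the likelihood concrete, here is a minimal sketch that evaluates log L(G|F) for coded genotypes. It assumes the 0/1/2 genotype coding from the ambiguity slide and haplotypes written as 0/1 strings; the function names are hypothetical. The double sum runs over ordered pairs (a, b), so a heterozygous pair is counted twice, as HWE requires.

```python
import math

def compatible(genotype, ha, hb):
    """c_kab: 1 if ha//hb gives the coded genotype (0=aa, 1=AA, 2=Aa), else 0."""
    for g, x, y in zip(genotype, ha, hb):
        if g == 2:
            if x == y:               # heterozygous locus needs two different alleles
                return 0
        elif not (x == y == str(g)):  # homozygous locus needs two copies of allele g
            return 0
    return 1

def log_likelihood(genotypes, haplotypes, freqs):
    """log L(G|F) = sum_k log( sum_a sum_b c_kab * f_a * f_b )."""
    total = 0.0
    for g in genotypes:
        p = sum(compatible(g, ha, hb) * fa * fb
                for ha, fa in zip(haplotypes, freqs)
                for hb, fb in zip(haplotypes, freqs))
        total += math.log(p)  # raises ValueError if no listed pair explains g
    return total

haps = ["11", "10", "01", "00"]
F = [0.4, 0.1, 0.1, 0.4]
print(log_likelihood([(2, 2), (1, 1)], haps, F))
```

For the genotype (2, 2) the inner sum is 2·0.4·0.4 + 2·0.1·0.1 = 0.34.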
20
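EM Algorithm
Maximum Likelihood Estimation of Haplotype Freq.
Lagrange multiplier: to find the $x$ that maximizes $q(x)$ subject to $g(x) = c$, set
$$Q(x, \lambda) = q(x) + \lambda\,(g(x) - c), \qquad \frac{\partial Q}{\partial x} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$
Applied to haplotype frequencies:
$$g(F) = \sum_{i=1}^{h} f_i = 1$$
$$q(F) = \log L(G|F) = \sum_{k=1}^{g} \log\Big(\sum_{a=1}^{h}\sum_{b=1}^{h} c_{kab}\, f_a f_b\Big)$$
$$Q(F, \lambda) = \sum_{k=1}^{g} \log\Big(\sum_{a=1}^{h}\sum_{b=1}^{h} c_{kab}\, f_a f_b\Big) + \lambda\Big(\sum_{i=1}^{h} f_i - 1\Big)$$
Partial derivative equations:
$$\frac{\partial Q}{\partial f_i} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$
Solution:
$$f_i = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h}\sum_{b=1}^{h} z_{iab}\, c_{kab}\, f_a f_b}{\sum_{a=1}^{h}\sum_{b=1}^{h} c_{kab}\, f_a f_b}$$
EM recursion:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h}\sum_{b=1}^{h} z_{iab}\, c_{kab}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h}\sum_{b=1}^{h} c_{kab}\, f_a^{(t)} f_b^{(t)}}$$
where $z_{iab}$ counts the copies of $H_i$ in the pair $H_a//H_b$ (0, 1 or 2).
Iteration (E … M E M …):
$$F^{(0)}\ (\text{prior}) \rightarrow \Pr(H_{a,b}|G, F^{(0)}) \rightarrow F^{(1)} \rightarrow \Pr(H_{a,b}|G, F^{(1)}) \rightarrow \cdots \rightarrow F^{(t)} \rightarrow \Pr(H_{a,b}|G, F^{(t)}) \rightarrow F^{(t+1)} \rightarrow \cdots$$
E (Expectation): compute Pr(H|G,F); M (Maximization): update F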
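The EM recursion for unrelated individuals can be sketched as below. This is an illustrative implementation under stated assumptions, not the HAPLO software: genotypes use the 0/1/2 coding, all 2^L haplotypes are enumerated (so it is only usable for a few loci), F(0) is uniform, and the names `partner` and `em_haplotype_freqs` are hypothetical.

```python
from itertools import product

def partner(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    out = []
    for g, x in zip(genotype, hap):
        if g == 2:
            out.append("1" if x == "0" else "0")
        elif str(g) == x:
            out.append(x)
        else:
            return None
    return "".join(out)

def em_haplotype_freqs(genotypes, n_iter=50):
    n_loci = len(genotypes[0])
    haps = ["".join(bits) for bits in product("01", repeat=n_loci)]
    freqs = {h: 1.0 / len(haps) for h in haps}          # F(0): uniform start
    # Ordered pairs (a, b) with c_kab = 1, precomputed for every individual.
    compat = []
    for g in genotypes:
        compat.append([(a, partner(g, a)) for a in haps if partner(g, a) is not None])
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for plist in compat:
            weights = [freqs[a] * freqs[b] for a, b in plist]
            total = sum(weights)                         # Pr(G_k | F)
            for (a, b), w in zip(plist, weights):
                counts[a] += w / total                   # E-step: expected copies
                counts[b] += w / total
        # M-step: divide expected counts by the 2g chromosomes in the sample.
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freqs

sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(2, 2)] * 2
print(em_haplotype_freqs(sample))
```

In this toy sample the double heterozygotes are pulled toward the AB//ab phase because AB and ab are already common among the unambiguous individuals.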
21
Posterior Probability of Haplotype
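$$\Pr(H_{a,b}|G_k, F) = \frac{\Pr(H_{a,b}|G_k)\,\Pr(H_{a,b}|F)}{\sum_{(a,b)} \Pr(H_{a,b}|G_k)\,\Pr(H_{a,b}|F)}, \qquad \Pr(H_{a,b}|F) = \Pr(H_a)\,\Pr(H_b) = f_a f_b$$
Example
G: DdEe
H: $H_1 = DE,\ H_2 = De,\ H_3 = dE,\ H_4 = de$
Prior Prob.:
$\Pr(H_{1,4}|G) = \Pr(DE//de \mid DdEe) = 0.5$
$\Pr(H_{2,3}|G) = \Pr(De//dE \mid DdEe) = 0.5$
F: $f_1 = 0.4,\ f_2 = 0.1,\ f_3 = 0.1,\ f_4 = 0.4$
Posterior Prob.:
$$\Pr(H_{1,4}|G, F) = \frac{\Pr(H_{1,4}|G)\, f_1 f_4}{\Pr(H_{1,4}|G)\, f_1 f_4 + \Pr(H_{2,3}|G)\, f_2 f_3} = \frac{0.5 \cdot 0.4 \cdot 0.4}{0.5 \cdot 0.4 \cdot 0.4 + 0.5 \cdot 0.1 \cdot 0.1} = \frac{0.08}{0.08 + 0.005} = 0.9412$$
$$\Pr(H_{2,3}|G, F) = 0.0588$$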
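The DdEe example on this slide can be checked with a few lines of Python. The function name `posterior_pair_probs` is hypothetical; the numbers are the ones given on the slide.

```python
def posterior_pair_probs(prior, pair_freq):
    """Combine Pr(pair | G) with f_a * f_b and renormalize over all pairs."""
    joint = {pair: prior[pair] * pair_freq[pair] for pair in prior}
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}

f = {"DE": 0.4, "De": 0.1, "dE": 0.1, "de": 0.4}      # haplotype frequencies F
prior = {("DE", "de"): 0.5, ("De", "dE"): 0.5}        # Pr(H | G) for G = DdEe
pair_freq = {(a, b): f[a] * f[b] for a, b in prior}   # Pr(H_a) * Pr(H_b)
post = posterior_pair_probs(prior, pair_freq)
print({pair: round(p, 4) for pair, p in post.items()})
```

The posterior for DE//de is 0.08 / 0.085 ≈ 0.9412, and for De//dE it is ≈ 0.0588.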
22
Limitation of EM Algorithm
For a diploid (2n) organism, a genotype with L heterozygous markers may involve 2^L possible haplotypes, so EM is impractical for large L
Only suitable for a small number of loci, 2~12
When L=20, 2^L=1,048,576 … a large space of F
Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …
23
MCMC (Markov Chain Monte Carlo) Algorithm
for unrelated individuals
by sampling from Pr(H|G,F)
Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)
24
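Markov Chain
$$
\begin{aligned}
& \big(H_{G_1}^{(0)}, H_{G_2}^{(0)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
\xrightarrow{\Pr(H_{G_1}|G_1, H_{-G_1})}\ & \big(H_{G_1}^{(1)}, H_{G_2}^{(0)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
\xrightarrow{\Pr(H_{G_2}|G_2, H_{-G_2})}\ & \big(H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(0)}, \ldots, H_{G_g}^{(0)}\big) \\
\cdots\ & \\
\xrightarrow{\Pr(H_{G_g}|G_g, H_{-G_g})}\ & \big(H_{G_1}^{(1)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}\big) \\
\xrightarrow{\Pr(H_{G_1}|G_1, H_{-G_1})}\ & \big(H_{G_1}^{(2)}, H_{G_2}^{(1)}, \ldots, H_{G_k}^{(1)}, \ldots, H_{G_g}^{(1)}\big) \\
\cdots\ & \\
& \big(H_{G_1}^{(t)}, H_{G_2}^{(t)}, \ldots, H_{G_k}^{(t)}, \ldots, H_{G_g}^{(t)}\big) \\
\cdots\ & \\
& \big(H_{G_1}^{(t+N)}, H_{G_2}^{(t+N)}, \ldots, H_{G_k}^{(t+N)}, \ldots, H_{G_g}^{(t+N)}\big)
\end{aligned}
$$
Here $H_{G_k}$ is the haplotype pair of individual k and $H_{-G_k}$ denotes the haplotypes of all other individuals.
MCMC Estimation
Random sampling based on $\Pr(H_{G_k}|G_k, H_{-G_k})$
Repeat many times
After getting close to the stationary distribution Pr(H|G):
Collect samples
Average over samples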
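The sampling loop described on this slide can be illustrated with a deliberately simplified Gibbs sampler. This is not the PHASE algorithm: instead of a coalescent-based transition probability it assumes each candidate pair is weighted by (n_a + alpha)(n_b + alpha), where n counts the haplotypes currently assigned to the other individuals and `alpha` is a pseudo-count chosen here for illustration. Function names, the 0/1/2 coding, and the default settings are all assumptions.

```python
import random
from collections import Counter

def compatible_pairs(genotype):
    """Unordered haplotype pairs (0/1 strings) compatible with a 0/1/2 genotype."""
    pairs = [("", "")]
    seen_het = False
    for g in genotype:
        nxt = []
        for h1, h2 in pairs:
            if g != 2:
                nxt.append((h1 + str(g), h2 + str(g)))
            else:
                nxt.append((h1 + "1", h2 + "0"))
                if seen_het:          # first het locus only fixes the pair's labelling
                    nxt.append((h1 + "0", h2 + "1"))
        if g == 2:
            seen_het = True
        pairs = nxt
    return pairs

def gibbs_phase(genotypes, n_sweeps=200, burn_in=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]        # H(0): random phases
    counts = Counter(h for pair in state for h in pair)
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            counts.subtract(state[k])                     # leave individual k out
            weights = [(counts[a] + alpha) * (counts[b] + alpha) for a, b in opts]
            state[k] = rng.choices(opts, weights=weights)[0]
            counts.update(state[k])
            if sweep >= burn_in:                          # collect after burn-in
                tallies[k][state[k]] += 1
    kept = n_sweeps - burn_in
    # Average over samples: empirical posterior of each individual's pair.
    return [{pair: n / kept for pair, n in t.items()} for t in tallies]

sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(2, 2)] * 2
posterior = gibbs_phase(sample)
print(posterior[-1])
```

The returned dictionaries approximate Pr(H|G) per individual by the fraction of retained sweeps in which each pair was sampled.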
25
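Transition Probability $\Pr(H_{G_k}|G_k, H_{-G_k})$
Given H of L loci for all G:
list: $H = (H_1, H_2, \ldots, H_m)$
count: $n = (n_1, n_2, \ldots, n_m)$
pick $G_k$
remove $H_{a,b}^{G_k\,(t)}$ from H
for each $H_i$: if $H_i$ is not compatible with $G_k$, then $p_i = 0$; if it is compatible, then take the complement $H_j' = G_k - H_i$ and check:
if $H_j'$ is in H, then $p_i$ is computed from $n_i/M$ and $n_j/M$ …
if $H_j'$ is not in H, then $p_i$ is computed from $n_i/M$ …
finally get $p = (p_1, p_2, \ldots, p_m)$
for $H_i$ in H: construct the haplotype pair with prob. $p_i / \sum_i p_i$
for haplotypes not in H: randomly choose the phase with prob. …
Add the newly constructed haplotype $H_{a,b}^{G_k\,(t+1)}$ to list H, pick $G_{k+1}$ …
Coalescent hypothesis, mutation rate, M haplotypes
Subsetting loci, reducing time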
26
EM vs. MCMC
EM:
Search F, Max. L(G|F)
Haplo. freq. => Haplo. construction
Maximum likelihood approach
"Analytical" posterior distribution
Fewer loci
Convergence: local maximum
MCMC:
Sample from Pr(H|G,F)
Haplo. construction => Haplo. freq.
Sampling approach
"Empirical" posterior distribution
More loci
Better convergence: whole parameter space (more computer time)
27
EM Algorithm for family data
(no recombination, r=0)
Pr(Ha,b{fam.}|G,R,F)=?
Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)
Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)
O'Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)
28
Haplotype Configuration of Family
Genotypes
Parents: AaBb, AaBb; child: AaBb
Possible Haplotype Configurations
Parents: AB//ab, AB//ab; child: AB//ab
Parents: Ab//aB, Ab//aB; child: Ab//aB
Parents: AB//ab, AB//ab; child: Ab//aB (recombinant; as r=0 or nearly 0, impossible or very low prob., ignored)
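A small sketch can enumerate the non-recombinant configurations for a parent-parent-child trio. It assumes genotypes are written as lists like ["Aa", "Bb"], that the child receives one intact haplotype from each parent (r = 0), and that parental origin of the child's haplotypes is ignored when listing configurations. The function names are hypothetical.

```python
def pairs_for(genotype):
    """Unordered haplotype pairs for a genotype written like ["Aa", "Bb"]."""
    pairs = [("", "")]
    seen_het = False
    for locus in genotype:
        x, y = locus[0], locus[1]
        nxt = []
        for h1, h2 in pairs:
            nxt.append((h1 + x, h2 + y))
            if x != y and seen_het:   # later het loci can be in either phase
                nxt.append((h1 + y, h2 + x))
        if x != y:
            seen_het = True
        pairs = nxt
    return pairs

def trio_configurations(father, mother, child):
    """(father pair, mother pair, child pair) triples that need no recombination."""
    child_pairs = {frozenset(p) for p in pairs_for(child)}
    configs = set()
    for f_pair in pairs_for(father):
        for m_pair in pairs_for(mother):
            for hf in set(f_pair):            # haplotype transmitted by the father
                for hm in set(m_pair):        # haplotype transmitted by the mother
                    if frozenset((hf, hm)) in child_pairs:
                        configs.add((f_pair, m_pair, tuple(sorted((hf, hm)))))
    return sorted(configs)

g = ["Aa", "Bb"]
for config in trio_configurations(g, g, g):
    print(config)
```

For three AaBb genotypes this yields the two non-recombinant configurations listed on the slide; the mixed case (one parent AB//ab, the other Ab//aB) cannot produce an AaBb child without recombination and is therefore absent.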
29
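EM Algorithm
Haplotype Freq. Estimation using Nuclear Families
Unrelated Indv.:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h}\sum_{b=1}^{h} z_{iab}\, c_{kab}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h}\sum_{b=1}^{h} c_{kab}\, f_a^{(t)} f_b^{(t)}}$$
Nuclear Families:
$$f_i^{(t+1)} = \frac{1}{4 N_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1=1}^{h}\sum_{b_1=1}^{h}\sum_{a_2=1}^{h}\sum_{b_2=1}^{h} z_{i\,a_1 b_1 a_2 b_2}\; c_{fam\,a_1 b_1 a_2 b_2}\; f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}{\sum_{a_1=1}^{h}\sum_{b_1=1}^{h}\sum_{a_2=1}^{h}\sum_{b_2=1}^{h} c_{fam\,a_1 b_1 a_2 b_2}\; f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}$$
Tips:
Only use parents to calculate haplotype freq. (f)
Use parents' + children's info to determine compatibility (c)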
30
EM Algorithm
Haplotype Freq. Estimation for General Pedigrees
$$f_i^{(t+1)} = \frac{1}{2n} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1, b_1, \ldots, a_n, b_n} z_{i\,a_1 b_1 \ldots a_n b_n}\; c_{fam\,a_1 b_1 \ldots a_n b_n}\; f_{a_1}^{(t)} f_{b_1}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}{\sum_{a_1, b_1, \ldots, a_n, b_n} c_{fam\,a_1 b_1 \ldots a_n b_n}\; f_{a_1}^{(t)} f_{b_1}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}$$
where $(a_j, b_j)$ index the haplotype pair of the j-th founder of a family, and the normalizing constant $2n$ is the total number of founder haplotypes over all families.
Tips:
Only use founders to calculate haplotype freq. (f)
Use all members (founders & non-founders) to determine compatibility (c)
Discard the cases with too small probabilities to save time
31
Posterior Probability of Haplotype Configuration
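Nuclear Family:
$$\Pr(H_{a,b}^{(fam.)}|G_{fam}, F) = \frac{\Pr(H_{a,b}^{(fam.)}|G_{fam})\,\Pr(H_{parents}|F)}{\sum_{\text{all configs}} \Pr(H_{a,b}^{(fam.)}|G_{fam})\,\Pr(H_{parents}|F)}$$
$$\Pr(H_{parents}|F) = \Pr(H_{a_1})\,\Pr(H_{b_1})\,\Pr(H_{a_2})\,\Pr(H_{b_2}) = f_{a_1} f_{b_1} f_{a_2} f_{b_2} \qquad (a_1, b_1: \text{Dad};\ a_2, b_2: \text{Mom})$$
General Family:
$$\Pr(H_{a,b}^{(fam.)}|G_{fam}, F) = \frac{\Pr(H_{a,b}^{(fam.)}|G_{fam})\,\Pr(H_{founders}|F)}{\sum_{\text{all configs}} \Pr(H_{a,b}^{(fam.)}|G_{fam})\,\Pr(H_{founders}|F)}$$
$$\Pr(H_{founders}|F) = \prod_{j=1}^{N_{founders}} \Pr(H_{a_j})\,\Pr(H_{b_j}) = \prod_{j=1}^{N_{founders}} f_{a_j} f_{b_j}$$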
32
A Middle Summary … Subject-oriented Algorithms
Joint prob. / likelihood across loci (A, B, C), computed
indiv. by indiv. (unrelated)
family by family (r=0)
Large/General Pedigree & Allowing Recombination (r>0) ?
33
Next … Locus-oriented Algorithm (Lander-Green)
Joint prob. / likelihood computed locus by locus (A, B, C, …) for a pedigree
For Large/General Pedigree Data & Allowing Recombination (r>0)
34
Inheritance Vector (V) of a pedigree
Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
35
Inheritance Vector & Haplotype
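Individual 5: AaBb
Inheritance vector at locus A = 1101, at locus B = 1101: AB//ab
Inheritance vector at locus A = 1101, at locus B = 1111: Ab//aB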
36
Lander-Green Algorithm
One pedigree; loci A, B, C, …
Hidden states (inheritance vectors): VA → VB → VC → …
Transition prob. = f(r): Pr(VB|VA), Pr(VC|VB), …, Pr(Vt+1|Vt)
Observations (genotypes): GA, GB, GC, …
Emission prob.: Pr(GA|VA), Pr(GB|VB), Pr(GC|VC)
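The hidden Markov structure on this slide can be sketched with a forward pass over inheritance vectors. This is only an outline under stated assumptions: an inheritance vector is taken to be a tuple of 0/1 bits, one per meiosis; each bit is assumed to switch between adjacent loci independently with probability r; the prior over vectors at the first locus is uniform; and the emission probabilities Pr(G|V) are supplied by the caller, since computing them from pedigree genotypes is beyond this sketch. The function names are hypothetical.

```python
from itertools import product

def transition_prob(v_from, v_to, r):
    """Pr(V_{t+1} | V_t): each meiosis bit flips independently with probability r."""
    flips = sum(a != b for a, b in zip(v_from, v_to))
    return (r ** flips) * ((1 - r) ** (len(v_from) - flips))

def pedigree_likelihood(emissions, recomb_rates, n_meioses):
    """Forward algorithm over loci.

    emissions[t][v] = Pr(G_t | V_t = v) for every inheritance vector v;
    recomb_rates[t] = recombination rate between locus t and locus t + 1.
    """
    states = list(product((0, 1), repeat=n_meioses))
    # Uniform prior over inheritance vectors at the first locus.
    forward = {v: emissions[0][v] / len(states) for v in states}
    for t, r in enumerate(recomb_rates, start=1):
        forward = {
            v: emissions[t][v] * sum(forward[u] * transition_prob(u, v, r) for u in states)
            for v in states
        }
    return sum(forward.values())

states = list(product((0, 1), repeat=2))
only_00 = {v: (1.0 if v == (0, 0) else 0.0) for v in states}
print(pedigree_likelihood([only_00, only_00], [0.1], n_meioses=2))
```

The number of states grows as 2 to the power of the number of meioses, so the cost is linear in the number of loci but exponential in pedigree size.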
37
Lander-Green Algorithm Based (or Similar) Approaches
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER): Viterbi algorithm, the best haplotype configuration
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2): MCMC, annealing & Metropolis process
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN): allowing LD & marker cluster/block
38
Practices
(1) If a child's genotype at 4 loci is AaBbCcDD, list all possible haplotype pairs of the child and calculate the probability of each pair.
(2) If you know his/her father’s genotype is also AaBbCcDD and mother is AaBbCCDD, list all possible haplotype configurations of his/her family, calculate the probability of each configuration. (Assume recombination rate r=0)
(3) Randomly assign a frequency to each haplotype in (1), say, f(ABCD)=0.4, f(abcD)=0.2, …, etc. Make sure the frequencies sum to 1. Take these as the true haplotype frequencies in the population and recalculate the (posterior) probabilities in (1) and (2).
Within a week, send your answers to (E-mail: [email protected])