Linkage Disequilibrium Mapping and HaploBlock
Gideon Greenspan [email protected]
Guest lecture for course: Computational Genetics (236608)


Page 1: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

1

[Figure: Bayesian network with block variables C1, C2, C3, ancestor alleles A1…A7 and observed alleles H1…H7]

Linkage Disequilibrium Mapping and HaploBlock

Gideon Greenspan [email protected]

Guest lecture for course: Computational Genetics (236608)

Page 2: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

2

Part 1: LD Mapping

• Basic LD Mapping
  – χ-squared test for individual SNPs
• Mapping with Haplotypes
  – Population phenomena
• Haplotyping
  – Clark algorithm
  – EM algorithm

Page 3: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

3

Linkage Disequilibrium

• LD = Another word for ‘correlation’
  – Correlation between markers in a population
• Random recombination destroys correlation
  – Close markers may have high LD
  – Above 1 Mb, LD disappears

Page 4: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

4

LD Mapping: The Basics

• Take set of unrelated individuals
  – Ideally from a small, inbred population
• Measure markers at high resolution
  – Single Nucleotide Polymorphisms are ideal
• Test marker–disease correlations
  – Non-parametric disease model
  – Suitable (in theory) for low penetrance

Page 5: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

5

LD Mapping in Action

[Figure: example marker data; only the labels "3 1 10 1 2" and "1 3 11 1 1" survive]

Page 6: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

6

Chi-Squared Test

Observed Counts

        Case     Control   ∑
A       69       236       305
a       31       264       295
∑       100      500       600

Expected Counts

        Case     Control   ∑
A       50.83    254.17    305
a       49.17    245.83    295
∑       100      500       600

$\chi^2 = \sum \frac{(o - e)^2}{e} = 15.85$

1 degree of freedom ⇒ p-value = 0.0001
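The test on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the lecturer's code: the function name `chi_squared_2x2` is made up for illustration, and the p-value uses the identity that the upper tail of a χ² distribution with 1 degree of freedom equals erfc(√(χ²/2)).

```python
from math import erfc, sqrt

def chi_squared_2x2(observed):
    """observed[i][j] is the count for allele i (A, a) in group j (case, control)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of allele and disease status
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    # Upper-tail probability of chi-squared with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_squared_2x2([[69, 236], [31, 264]])
print(round(chi2, 2))  # 15.85
```

The computed p-value is slightly below the rounded 0.0001 quoted on the slide.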

Page 7: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

7

SNPs

• Single base pair which exhibits variation
  – Caused by point mutations during meiosis
  – Variation almost always biallelic
• dbSNP contains ~ 4.3×10^6 SNPs
  – Over 1 SNP per 1,000 base pairs
  – About half with minor allele frequency > 20%
  – This number is still growing rapidly!

Page 8: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

8

LD Mapping in Context

Identify chromosome (10^8 bp)
Linkage analysis (10^6~10^7 bp)
Identify genes (10^5~10^6 bp)
Resequencing (10^0 bp)

Page 9: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

9

False Positives

• Causes of spurious LD
  – Population structure
    • Migration and admixture
    • Preferential mating
  – Phenotypic site interaction
    • Disease epistasis
• Key problem: too many SNP tests
  – Bonferroni correction
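The Bonferroni correction divides the desired overall significance level by the number of tests. A minimal sketch follows; the helper names and the figures of 0.05 and 1,000 SNPs are illustrative assumptions, not values from the lecture.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise error rate at most alpha."""
    return alpha / num_tests

def is_significant(p_value, alpha, num_tests):
    """True if a single test's p-value survives the Bonferroni correction."""
    return p_value < bonferroni_threshold(alpha, num_tests)

# Hypothetical study: 1,000 SNPs tested at an overall level of 0.05
threshold = bonferroni_threshold(0.05, 1000)
```

Under these assumed numbers a SNP with p-value 0.0001, as in the chi-squared example, would no longer count as significant, which illustrates why too many SNP tests is a key problem.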

Page 10: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

10

Haplotypes

[Figure: a chromosome with SNPs separated by non-variable regions; the SNP alleles A G T C T give Haplotype = AGTCT]

Generally, only a few of the 2^loci possible haplotypes cover >90% of a population, due to bottleneck effects and genetic drift.

Page 11: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

11

Bottleneck Effects

[Figure: population bottlenecks, labelled 10^6 years and 10^5 years]

Page 12: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

12

Genetic Drift

[Plot: allele frequency (0 to 1000) against generation (0 to 1000)]
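Drift of this kind can be simulated with a Wright-Fisher style model, in which each generation is formed by sampling alleles at random from the previous one. The sketch below is an assumption about how such a plot could be produced, not the lecturer's method; `wright_fisher` and its parameters are hypothetical names.

```python
import random

def wright_fisher(pop_size, initial_count, generations, seed=0):
    """Return the allele count per generation under pure random drift."""
    rng = random.Random(seed)
    counts = [initial_count]
    for _ in range(generations):
        p = counts[-1] / pop_size
        # Each of the pop_size allele copies is drawn independently
        # from the previous generation's allele frequency.
        counts.append(sum(rng.random() < p for _ in range(pop_size)))
    return counts

trajectory = wright_fisher(pop_size=100, initial_count=50, generations=200)
```

Once the count reaches 0 or the full population size it stays there, which is how drift removes haplotype diversity over time.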

Page 13: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

13

LD Mapping with Haplotypes

• Obtain haplotypes for a genomic region
  – Treat haplotype as correlated allele
• Advantage: fewer tests
  – Reduced false positive rate
• Disadvantage: ignores recombination
  – Different haplotypes could contain target
• Best: consider partial haplotypes…

Page 14: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

14

The Haplotyping Problem

Variable loci:          1     2     3     4     5
Maternal chromosome:    A     G     T     C     T
Paternal chromosome:    T     C     T     A     G
Observed genotypes:     A/T   C/G   T/T   A/C   G/T

The maternal and paternal rows are the hidden haplotypes.

Page 15: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

15

Why is it hard?

• A series of joint measurements containing h heterozygous loci can be divided 2^(h−1) ways (we don’t care which is maternal or paternal).

Genotype: A/T C/G T/T A/C G/T (h = 4, so 8 candidate pairs)

ACTAG / TGTCT
AGTAG / TCTCT
ACTAT / TGTCG
ACTCG / TGTAT
ACTCT / TGTAG
AGTAT / TCTCG
AGTCG / TCTAT
AGTCT / TCTAG   Correct!
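The count of 2^(h−1) can be checked by enumerating the candidate pairs directly. This is a small sketch under the assumption that genotypes are written as on the slide (`A/T C/G …`); the function name `haplotype_pairs` is hypothetical.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List every unordered haplotype pair consistent with a genotype string."""
    loci = [tuple(locus.split("/")) for locus in genotype.split()]
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = []
    # Fix the orientation of the first heterozygous locus so that
    # [h1, h2] and [h2, h1] are not counted twice.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        first = "".join(l[choice.get(i, 0)] for i, l in enumerate(loci))
        second = "".join(l[1 - choice.get(i, 0)] for i, l in enumerate(loci))
        pairs.append((first, second))
    return pairs

pairs = haplotype_pairs("A/T C/G T/T A/C G/T")
print(len(pairs))  # 8
```

The pair marked correct on the slide, AGTCT with TCTAG, is one of the eight returned.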

Page 16: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

16

Why is it approachable?

• Many of the haplotypes appear many times.
• Data for many individuals allows inference.

Genotypes: A/T C/G T/T A/C G/T and A/A C/G C/T A/C T/T

Solution 1: ACTAT / TGTCG and ACCCT / AGTAT (4 haplotypes)
Solution 2: TCTAG / AGTCT and ACCAT / AGTCT (3 haplotypes)

Solution 2 seems ‘better’ since it uses fewer haplotypes.

Page 17: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

17

Formalization 1

• Assume all loci biallelic (realistic).
• Individuals numbered 1…n
• Loci numbered 1…l
• Possible alleles B={0,1}
• Possible haplotypes H=B^l
• Possible locus observations L={[B,B]}
• Possible genotypes G=L^l
• Possible haplotype pairs D={[H,H]}

Page 18: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

18

Formalization 2

• Given a true haplotype pair [h1,h2] ∈ D, G(h1,h2) ∈ G is the genotype observed.
• Given an observed genotype g ∈ G, D(g) ⊆ D is set of possible haplotype pairs.
• Problem input: (g1,…,gn) where gi ∈ G
• Problem output: (d1,…,dn) where di ∈ D(gi)

Page 19: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

19

Clark’s Algorithm

1. Initialize set S to {}.
2. For genotypes gi with a single possibility [h1,h2], assign di=[h1,h2] and add h1,h2 to S.
3. For genotypes gi with a possibility containing a member h1 ∈ S and another haplotype h2, assign di=[h1,h2] and add h2 to S.
4. Repeat step 3 until all haplotypes are assigned or we add nothing new to S.
5. Assign any remaining di arbitrarily.
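The five steps above can be sketched in Python as follows. This is an illustrative reading of the steps rather than Clark's published implementation: the function names are hypothetical, genotypes are assumed to be strings such as `A/T C/G T/T`, and ties are broken by enumeration order.

```python
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype string."""
    loci = [tuple(locus.split("/")) for locus in genotype.split()]
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        first = "".join(l[choice.get(i, 0)] for i, l in enumerate(loci))
        second = "".join(l[1 - choice.get(i, 0)] for i, l in enumerate(loci))
        pairs.append((first, second))
    return pairs

def clark(genotypes):
    candidates = [haplotype_pairs(g) for g in genotypes]
    assigned = [None] * len(genotypes)
    known = set()                                  # step 1: S = {}
    for i, cand in enumerate(candidates):          # step 2: unambiguous genotypes
        if len(cand) == 1:
            assigned[i] = cand[0]
            known.update(cand[0])
    progress = True
    while progress:                                # steps 3 and 4
        progress = False
        for i, cand in enumerate(candidates):
            if assigned[i] is not None:
                continue
            for h1, h2 in cand:
                if h1 in known or h2 in known:
                    assigned[i] = (h1, h2)
                    known.update((h1, h2))
                    progress = True
                    break
    for i, cand in enumerate(candidates):          # step 5: arbitrary choice
        if assigned[i] is None:
            assigned[i] = cand[0]
    return assigned

genotypes = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
result = clark(genotypes)
```

Because the first matching candidate is taken, changing the input order or the enumeration order can change the answer, which is the order dependence the lecture points out.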

Page 20: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

20

Clark: Run

Genotype                  Assigned haplotype pair
A/T C/G T/T A/C G/T       TGTAG / ACTCT
A/A C/G C/T A/C T/T       AGCAT / ACTCT
A/A C/G T/T C/C T/T       AGTCT / ACTCT
A/T C/C C/T A/A G/T       ACTAT / TCCAG
A/T C/G C/C A/A G/T       AGCAT / TCCAG

6 haplotypes used

Page 21: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

21

Clark: Rerun (same input)

Genotype                  Assigned haplotype pair
A/T C/G T/T A/C G/T       AGTCT / TCTAG
A/A C/G C/T A/C T/T       AGTCT / ACCAT
A/A C/G T/T C/C T/T       AGTCT / ACTCT
A/T C/C C/T A/A G/T       TCTAG / ACCAT
A/T C/G C/C A/A G/T       TGCAG / ACCAT

5 haplotypes used

Page 22: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

22

Clark: Comments

• Implementation is very fast, O(l·n^2)
• Total failure if no starting point.
• Blind haplotyping of ‘orphans’ at end.
• Arbitrary selections based on input order.
  – Try multiple orderings, select best results.
• Or formulate choices as integer program
  – Solve approximately by linear relaxation.

Page 23: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

23

Likelihood: Bayes Network

Pr(HM=h|θH) = Pr(HP=h|θH) = θh
D = [hm,hp]
G = G(d)

Nodes HM, HP: |HM| = |HP| = 2^l
Node D: |D| = 2^l + 2^l×(2^l − 1)/2
Node G: |G| = 3^l
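The three state-space sizes can be verified by brute force for small l. The sketch below uses the biallelic alphabet {0,1} from the formalization slides; `count_spaces` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement, product

def count_spaces(l):
    """Return (|H|, |D|, |G|) for l biallelic loci by explicit enumeration."""
    haplotypes = ["".join(bits) for bits in product("01", repeat=l)]
    # Unordered pairs, including a haplotype paired with itself
    pairs = list(combinations_with_replacement(haplotypes, 2))
    # A genotype records the unordered allele pair seen at each locus
    genotypes = {
        tuple(tuple(sorted(ab)) for ab in zip(h1, h2)) for h1, h2 in pairs
    }
    return len(haplotypes), len(pairs), len(genotypes)

print(count_spaces(3))  # (8, 36, 27)
```

For l = 3 this gives 8 haplotypes, 8 + 8×7/2 = 36 pairs and 3^3 = 27 genotypes, matching the formulas.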

Page 24: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

24

EM Algorithm

1. Initialize θH^0 to uniform distribution with small random perturbations.
2. E step: For each d ∈ D, calculate Exp(d|θH^i) using observed data G.
3. M step: Calculate θH^(i+1) from Exp(d|θH^i).
4. Repeat 2 and 3 until θH converges.
5. Assign output di by maximum likelihood.
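A compact sketch of these EM steps is given below. It makes several simplifying assumptions that are not in the lecture: the start is exactly uniform (no random perturbation), only haplotypes compatible with some genotype are tracked, and a fixed iteration count replaces a convergence test. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype string."""
    loci = [tuple(locus.split("/")) for locus in genotype.split()]
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        first = "".join(l[choice.get(i, 0)] for i, l in enumerate(loci))
        second = "".join(l[1 - choice.get(i, 0)] for i, l in enumerate(loci))
        pairs.append((first, second))
    return pairs

def em_haplotypes(genotypes, iterations=50):
    candidates = [haplotype_pairs(g) for g in genotypes]
    haps = sorted({h for cand in candidates for pair in cand for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}     # step 1 (uniform start)
    for _ in range(iterations):
        counts = defaultdict(float)
        for cand in candidates:                    # E step
            # Within one genotype every pair has the same ordering factor,
            # so theta[a] * theta[b] is enough as a relative weight.
            weights = [theta[a] * theta[b] for a, b in cand]
            total = sum(weights)
            for (a, b), w in zip(cand, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M step: expected haplotype counts over 2n chromosomes
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    phased = [max(cand, key=lambda p: theta[p[0]] * theta[p[1]])
              for cand in candidates]              # step 5
    return theta, phased

genotypes = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
theta, phased = em_haplotypes(genotypes)
```

Since EM only finds a local optimum, a practical version would restart from several perturbed initial distributions and keep the run with the highest likelihood.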

Page 25: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

25

EM: Iterations (same input)

[Plot: estimated haplotype frequency against iteration (0 to 10), one curve per candidate haplotype]

Data labels recovered from the plot, by curve (iterations 0 to 10):

0.037 0.138 0.155 0.150 0.134 0.121 0.113 0.106 0.101 0.100 0.100
0.037 0.072 0.102 0.150 0.193 0.225 0.260 0.289 0.298 0.300 0.300
0.037 0.136 0.204 0.266 0.279 0.287 0.294 0.299 0.300 0.300 (one label missing)
0.037 0.081 0.141 0.182 0.196 0.199 0.200 0.200 0.200 0.200 0.200
0.040 0.024 0.025 0.028 0.033 0.047 0.073 0.095 0.100 0.100 0.100

Unplaced labels: 0.040, 0.245

Legend (candidate haplotypes): TGCAG TGCAT TGTCG TGTCT TGTAG TGTAT TCCAG TCCAT TCTCG TCTCT TCTAG TCTAT AGCCT AGCAG AGCAT AGTCG AGTCT AGTAG AGTAT ACCCT ACCAG ACCAT ACTCG ACTCT ACTAG ACTAT

Page 26: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

26

EM: Output = Clark rerun

Genotype                  Haplotype pair chosen under θH
A/T C/G T/T A/C G/T       AGTCT / TCTAG
A/A C/G C/T A/C T/T       AGTCT / ACCAT
A/A C/G T/T C/C T/T       AGTCT / ACTCT
A/T C/C C/T A/A G/T       TCTAG / ACCAT
A/T C/G C/C A/A G/T       TGCAG / ACCAT

Page 27: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

27

EM: Comments

• Problem of EM entering local optima
  – Try multiple runs, select best result.
• Generally far better results than Clark.
• Exponential complexity O(2^l·n) due to |H|
  – Ignore impossible haplotypes for O(1.5^l·n)
• Still infeasible for >30 loci
  – Hierarchical EM, MCMC methods.

Page 28: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

28

Part 2: HaploBlock

• Haplotype blocks
• Statistical model
• Model inference
• Model criterion
• Applications
  – Haplotyping
  – Block-based LD mapping

Page 29: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

29

Recombination Hotspots

Page 30: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

30

Haplotype Blocks

  1   GAACTGC   ATTCGACTG   CCAGTAGC
  2   ACGTACA   GATGAGCTG   CCAGTAGC
  …
 99   ACGTACA   AACCGAGGT   TGTACTAA
100   GAACTGC   GATGAGCTG   TGTGCTAA

Annotations:
• Recombination hotspot separates blocks
• Few block variants due to bottlenecks, drift
• Mutation hotspot

Page 31: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

31

Bayesian Network Model

[Figure: a single haplotype block, with C → A1, A2, A3 and each Aj → Hj]

Values of variable C are 1…q, denoting index of block’s haplotype.
Pr(C = c) is frequency of haplotype c.

Values of variable Aj are A,C,G,T,–, denoting allele at site j of haplotype. Example: A1 A2 A3 = CTA for C = 2.
Pr(aj | c) is deterministic.

Values of variable Hj are A,C,G,T,–, denoting allele at site j observed after possible haplotype mutations.
Pr(hj | aj) is cumulative mutation rate.

Page 32: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

32

Bayesian Network Model

Pr(c, a1, a2, a3, h1, h2, h3) = Pr(c) × Pr(a1 | c) × Pr(a2 | c) × Pr(a3 | c) × Pr(h1 | a1) × Pr(h2 | a2) × Pr(h3 | a3)

[Figure: a single haplotype block, with C → A1, A2, A3 and each Aj → Hj]
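The factorization for one block can be written out directly. The sketch below is illustrative only: the ancestor table, the frequencies and the mutation rate are invented, the gap symbol "–" is left out so the alphabet is just A, C, G, T, and a mutation is assumed to pick one of the three other bases uniformly.

```python
from itertools import product

def joint_probability(c, observed, ancestors, freqs, mu):
    """Pr(c, a, h) for one block, where a is the ancestor sequence fixed by c."""
    p = freqs[c]                       # Pr(c)
    for a, h in zip(ancestors[c], observed):
        # Pr(a_j | c) = 1 for the deterministic ancestor allele;
        # Pr(h_j | a_j) is (1 - mu) if unchanged, mu / 3 otherwise (assumed).
        p *= (1 - mu) if a == h else mu / 3
    return p

# Hypothetical block with q = 3 ancestor haplotypes; C = 2 maps to CTA
ancestors = {1: "GAT", 2: "CTA", 3: "CTT"}
freqs = {1: 0.5, 2: 0.3, 3: 0.2}
mu = 0.03

p_exact = joint_probability(2, "CTA", ancestors, freqs, mu)
# Summing over every c and every observed sequence should give 1
total = sum(
    joint_probability(c, "".join(h), ancestors, freqs, mu)
    for c in ancestors
    for h in product("ACGT", repeat=3)
)
```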

Page 33: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

33

Bayesian Network Model

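[Figure: Bayesian network over several haplotype blocks, with block variables C1, C2, C3, ancestor alleles A1…A7 and observed alleles H1…H7. Labels: "recombination hotspot", "haplotype block".]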

Page 34: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

34

Data Likelihood

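• For haplotypes H, likelihood is:

$$\Pr(H) = \prod_{h \in H} \sum_{c_1} \cdots \sum_{c_b} \sum_{a_1} \cdots \sum_{a_l} \Pr(c_1) \prod_{k=2}^{b} \Pr(c_k \mid c_{k-1}) \prod_{k=1}^{b} \prod_{j=s_k}^{e_k} \Pr(a_j \mid c_k) \Pr(h_j \mid a_j)$$

Here b appears to be the number of blocks, l the number of sites, and s_k and e_k the first and last site of block k.

[Figure: two-block example, with C1, C2, ancestor alleles A1…A5 and observed alleles H1…H5]

But we can calculate this efficiently using a suitable elimination ordering!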
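One suitable elimination ordering is the usual left-to-right forward pass over the chain of block variables. The sketch below compares it with brute-force summation for a single haplotype. It is an illustration under assumptions, not the HaploBlock implementation: the transition tables and mutation rate are invented, the ancestor sequences are borrowed from the haplotype-block example, and mutations are assumed uniform over the other three bases.

```python
from itertools import product
from math import isclose, prod

def emission(segment, ancestor, mu):
    """Pr(observed segment | ancestor) with independent per-site mutation (assumed)."""
    return prod((1 - mu) if h == a else mu / 3 for h, a in zip(segment, ancestor))

def split(haplotype, blocks):
    """Cut a haplotype into one segment per block."""
    segments, pos = [], 0
    for ancestors in blocks:
        width = len(ancestors[0])
        segments.append(haplotype[pos:pos + width])
        pos += width
    return segments

def forward_likelihood(haplotype, blocks, initial, transitions, mu):
    """Sum over all block paths in time linear in the number of blocks."""
    alpha = None
    for k, (segment, ancestors) in enumerate(zip(split(haplotype, blocks), blocks)):
        emit = [emission(segment, a, mu) for a in ancestors]
        if k == 0:
            alpha = [initial[c] * emit[c] for c in range(len(ancestors))]
        else:
            t = transitions[k - 1]
            alpha = [emit[c] * sum(alpha[p] * t[p][c] for p in range(len(alpha)))
                     for c in range(len(ancestors))]
    return sum(alpha)

def brute_force_likelihood(haplotype, blocks, initial, transitions, mu):
    """Same quantity by enumerating every path c_1..c_b (exponential in b)."""
    segments = split(haplotype, blocks)
    total = 0.0
    for path in product(*[range(len(b)) for b in blocks]):
        p = initial[path[0]]
        for k in range(1, len(path)):
            p *= transitions[k - 1][path[k - 1]][path[k]]
        for k, c in enumerate(path):
            p *= emission(segments[k], blocks[k][c], mu)
        total += p
    return total

# Hypothetical three-block model
blocks = [["GAACTGC", "ACGTACA"],
          ["ATTCGACTG", "GATGAGCTG", "AACCGAGGT"],
          ["CCAGTAGC", "TGTACTAA"]]
initial = [0.5, 0.5]
transitions = [[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
               [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]]
h = "GAACTGC" + "GATGAGCTG" + "TGTGCTAA"
fast = forward_likelihood(h, blocks, initial, transitions, 0.01)
slow = brute_force_likelihood(h, blocks, initial, transitions, 0.01)
```

The likelihood of a whole data set is the product of this quantity over all haplotypes, which in practice would be accumulated as a sum of logarithms.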

Page 35: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

35

Data Criterion

• Maximum Likelihood leads to over-fitting
  – No hotspots, no mutations, many ancestors
  – Need to consider model complexity
  – Min DL(H,M) = DL(M) − log2 Pr(H|M)
• DL(M) considers variable elements only
  – Ancestor block sequences
  – Markov chain parameters

[Figure: two-block example, with C1, C2, ancestor alleles A1…A5 and observed alleles H1…H5]

Page 36: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

36

Model Search

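[Figure: four versions of a model over sites A1…A5 / H1…H5, illustrating the search moves. Start: two blocks C1, C2 with 4 and 3 ancestors. "nudge": two blocks with 4 and 3 ancestors. "add/remove": a single block C1 with 5 ancestors. "ancestors": two blocks with 4 and 2 ancestors.]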

Page 37: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

37

Hotspot Addition – Initial Scan

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 38: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

38

Hotspot Addition – Minima 1

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 39: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

39

Hotspot Addition – Pass 2

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 40: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

40

Hotspot Addition – Minima 2

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 41: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

41

Hotspot Addition – Pass 3

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 42: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

42

Hotspot Addition – Minima 3

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 43: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

43

Hotspot Addition – Pass 4

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 44: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

44

Hotspot Addition – Summary

[Plot: MDL score (3500 to 6000) against hotspot position (0 to 90)]

Page 45: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

45

EM Modifications

• Deterministic distribution for Aj
• Mutation rate constraints for Hj
• Infeasible calculations along whole model

1. Fix zero mutation rate to cluster sequences.
2. Extract ancestor alleles by maximum likelihood.
3. Constrained EM for mutation rates.
4. EM for Markov chain.

[Figure: each step is drawn on a single block Ck with ancestor alleles As…Ae and observed alleles Hs…He]

Page 46: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

46

Model for Haplotyping

[Figure: two copies of the block model, (C1…C3, A1…A7, H1…H7) and (C'1…C'3, A'1…A'7, H'1…H'7), together with genotype nodes G1…G7]

• Learn model directly from genotypes
• Haplotype pair: choose most likely under model

Page 47: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

47

Haplotyping Results

Site pairwise error rate   C21a    C21b     C21c    C21d    C21e     ACE
Clark                      .0548   .0251    .0280   .0329   .0234    .0381
HAPLOTYPER                 .0224   failed   .0204   .0077   failed   .0102
PHASE                      .0669   .0403    .0655   .0262   .0183    .0419
Hierarchical EM            .0095   .0042    .0009   .0047   .0083    .0152
HaploBlock                 .0047   .0020    .0005   .0014   .0048    .0098
Improvement factor         2x      2x       2x      3x      2x       =

C21x data: 20 haplotypes, 100 SNPs over ≤ 35kb, Patil et al. (2001)
ACE data: 22 haplotypes, 52 SNPs over 24kb, Rieder et al. (1999)

Average shown for 10 random pairings of true haplotypes

Page 48: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

48

Model for LD Mapping

• Learn model from marker data
• Mapping: try making phenotype dependent on each block

[Figure: block model (C1…C3, A1…A7, H1…H7) with an added phenotype node P]

Page 49: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

49

LD Mapping Results

Resequencing required   5q31 haplos   5q31 genos   Chr 21
BLADE                   144 kb        –            107 kb
No Blocks               131 kb        105 kb       33 kb
HaploBlock              40 kb         37 kb        24 kb
Improvement factor      3x            3x           1.4x

5q31 data: 258 haplotypes, 98 SNPs over 464kb, Daly et al. (2001)
Chr 21 data: 20 haplotypes, 5 sets of 200 SNPs, Patil et al. (2001)

Average shown for 5 random selections of target SNP

Page 50: C1 C2 C3 Linkage Disequilibrium 1 Mapping and HaploBlock

50

HaploBlock: Comments

• Our model boils down to an HMM
  – Calculations have linear complexity
  – Forward/backward probability caching
• Better to infer multiple models
  – Prevent getting stuck in local minima
  – Account for uncertainty of block identification
  – Use Gibbs-style iterations on hotspots
  – Take ‘average’ result over set of models