haplotype blocks an overview a. polanski department of statistics rice university
Haplotype Blocks
An Overview
A. Polanski
Department of Statistics
Rice University
Key Papers
1. N. Patil et al., (2001), Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21, Science, vol. 294, pp. 1719-1723
2. N. Wang et al., (2002), Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination and Mutation, Am. J. Hum. Genet., vol. 71, pp. 1227-1234.
3. K. Zhang et al., (2002), A Dynamic Programming Algorithm for Haplotype Block Partitioning, PNAS, vol. 99, pp. 7335-7339
Supplementary Papers
1. R. Hudson, N. Kaplan, (1985), Statistical Properties of the Number of Recombination Events in the History of a Sample of DNA Sequences, Genetics, vol. 111, pp. 147-164
2. R. Hudson, (2002), Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation, Bioinformatics, vol. 18, pp. 337-338
3. D. Reich et al., (2001), Linkage Disequilibrium in the Human Genome, Nature, vol. 411, pp. 199-204
What are Haplotype Blocks?
Haplotype block = a sequence of contiguous markers on DNA, homogeneous according to some criterion
Markers = Single Nucleotide Polymorphisms (SNPs)
Data (Patil et al. 2001)
Chromosome 21
Physically separated the two copies of chromosome 21 using a rodent-human somatic cell hybrid technique
Sample of 20 copies of chromosome 21 (32397439 bases)
Found: 35989 SNPs
Fig. 2 from (Patil et al. 2001)
[Figure: data matrix of SNP alleles coded 0/1, with 20 rows (one per chromosome copy) and one column per SNP; SNP no. i, i = 1, 2, …, 35989]
Problems
How do we determine boundaries between blocks?
1. Average value of the standardized coefficient of linkage disequilibrium is greater than some threshold (Wang et al. 2002, Reich et al. 2001)
2. Infer sites in the sample of DNA sequences where recombination events happened in the past history (Wang et al. 2002, Hudson, 2002)
3. Chromosome coverage – minimum number of SNPs to account for majority of haplotypes (Patil et al. 2001, Zhang et al. 2002)
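Criterion 1 relies on a standardized linkage disequilibrium coefficient. A minimal sketch of one common choice, |D'|, for a single pair of biallelic SNPs coded 0/1 is given below. The function name `d_prime` and the 0/1 input vectors are assumptions made for illustration; the cited papers may use a different standardization or averaging scheme.

```python
from typing import Sequence


def d_prime(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Return |D'| for two biallelic SNPs coded 0/1 on the same chromosomes."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(snp_b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(1 for x, y in zip(snp_a, snp_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:                            # at least one SNP is monomorphic
        return 0.0
    return abs(d) / d_max
```

A block criterion of this type would then average such values over the SNP pairs in a candidate region and compare the average with a chosen threshold.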
What evolutionary forces are responsible for haplotype block formation?
• Mutation
• Genetic drift
• Recombination
• Recombination hot spots
Methods
Method 1 (Wang et al. 2002)
Infer sites in the sample of DNA sequences where recombination events happened in the past history
Three gamete condition
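Consider a pair of SNPs, SNP1 and SNP2. If there was no recombination between SNP1 and SNP2 (and each SNP arose from a single mutation), they must satisfy the three-gamete condition: at most three of the four possible gametes (two-SNP haplotypes) are observed in the sample.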
[Figure: example with SNP1 (alleles A/G) and SNP2 (alleles C/T) in which only three gametes, AC, GC and GT, are observed]
Four gamete test (Hudson and Kaplan, 1985)
If we see all four gametes at SNP1 and SNP2
SNP1 SNP2
A    C
G    C
G    T
A    T
Then there must have been a recombination event between these sites in their past history. This is the four-gamete test (4GT).
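The test above can be sketched in a few lines. This is a minimal illustration, assuming each chromosome is given as a string of 0/1 alleles (as in the data matrix) and that each SNP is biallelic; the function names `gametes` and `four_gamete_test` are hypothetical.

```python
from typing import Sequence, Set, Tuple


def gametes(haplotypes: Sequence[str], i: int, j: int) -> Set[Tuple[str, str]]:
    """Distinct two-SNP haplotypes (gametes) observed at SNP columns i and j."""
    return {(h[i], h[j]) for h in haplotypes}


def four_gamete_test(haplotypes: Sequence[str], i: int, j: int) -> bool:
    """True if all four gametes occur, i.e. a past recombination is inferred."""
    return len(gametes(haplotypes, i, j)) == 4


three = ["00", "10", "11"]          # only three gametes: no recombination needed
four = three + ["01"]               # the fourth gamete appears
print(four_gamete_test(three, 0, 1), four_gamete_test(four, 0, 1))
```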
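Array of pairwise 4GT test results (Hudson and Kaplan, 1985)
$$D = [d_{ij}], \qquad d_{ij} = \begin{cases} 0, & \text{if fewer than 4 gametes are observed at SNPs } i \text{ and } j \\ 1, & \text{if all 4 gametes are observed at SNPs } i \text{ and } j \end{cases}$$
What is the minimal number of recombinations that could explain the observed data? Statistic FR (Hudson and Kaplan, 1985)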
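A hedged sketch of how the array D and a lower bound on the number of recombinations might be computed follows. The bound counts the largest set of non-overlapping SNP intervals that each fail the 4GT, which is the idea behind the Hudson and Kaplan procedure; whether this coincides exactly with the statistic called FR on the slide is an assumption. Function names and the 0/1 string input are illustrative.

```python
from typing import List, Sequence


def four_gamete_matrix(haplotypes: Sequence[str]) -> List[List[int]]:
    """d[i][j] = 1 if SNPs i and j show all four gametes, else 0."""
    m = len(haplotypes[0])
    d = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if len({(h[i], h[j]) for h in haplotypes}) == 4:
                d[i][j] = d[j][i] = 1
    return d


def min_recombinations(d: List[List[int]]) -> int:
    """Largest number of non-overlapping intervals (i, j) with d[i][j] = 1."""
    m = len(d)
    # Store each incompatible pair as (right end, left end) so that sorting
    # orders the intervals by their right end point.
    intervals = sorted((j, i) for i in range(m) for j in range(i + 1, m) if d[i][j])
    count, last_right = 0, 0
    for right, left in intervals:
        if left >= last_right:      # intervals may share an end point
            count += 1
            last_right = right
    return count
```

For example, the sample `["000", "011", "101", "110"]` fails the 4GT for every pair of its three SNPs, and the sketch reports two recombinations, one in each gap between adjacent SNPs.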
Fig. 1 from Wang et al., 2002: matrix D with the SNPs partitioned into Block 1, Block 2 and Block 3
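One simple way to read blocks off the matrix D is to scan left to right and close a block as soon as the next SNP fails the 4GT with any SNP already in it. The sketch below follows that rule; it is an illustrative assumption and may differ in detail from the block definition used by Wang et al.

```python
from typing import List, Tuple


def blocks_from_matrix(d: List[List[int]]) -> List[Tuple[int, int]]:
    """Split SNPs 0..m-1 into consecutive runs with no four-gamete pair inside."""
    m = len(d)
    blocks, start = [], 0
    for end in range(1, m + 1):
        # Close the current block at the last SNP, or when SNP `end`
        # shows four gametes with some SNP already in the block.
        if end == m or any(d[k][end] for k in range(start, end)):
            blocks.append((start, end - 1))
            start = end
    return blocks


# Hypothetical 4 x 4 matrix: SNP 2 conflicts with SNP 0, nothing else conflicts.
example = [[0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(blocks_from_matrix(example))
```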
Wang et al., 2002 - Study
• R. Hudson’s program for simulating genealogies with mutation, drift and recombination under various demographic scenarios
• Study of dependence of average lengths of blocks on different factors
• Comparison of simulation results to data from Patil et al., 2001
Dependence of average lengths of blocks on recombination frequency
… on sample size
… on mutation intensity
Comparison to data from Patil et al. 2001
• Compute distribution of haplotype block lengths in the data from Patil et al. 2001
• Try to tune the simulation parameters, including the recombination parameter R, to obtain a similar distribution in the simulations
… Failed
Try a mixture of two different recombination frequencies - better
Method 2 (Patil et al. 2001)
Chromosome coverage – minimum number of SNPs to account for majority of haplotypes
Fig. 2 from (Patil et al. 2001)
Problem formulation
Define block boundaries to minimize the number of SNPs that distinguish at least α percent of the haplotypes in each block
Common haplotypes
Haplotypes that are represented more than once in the block
Condition
Common haplotypes must constitute at least α = 80 percent of all haplotypes in the block
Blocks that do not satisfy this are not allowed
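The condition can be checked directly from the data matrix. The sketch below assumes chromosomes are strings of 0/1 alleles with no missing data (the real data set does contain missing calls, which this sketch ignores), and uses hypothetical names `common_coverage` and `is_allowed_block`.

```python
from collections import Counter
from typing import Sequence


def common_coverage(haplotypes: Sequence[str], start: int, end: int) -> float:
    """Fraction of chromosomes whose haplotype on SNPs start..end occurs more than once."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    common = sum(c for c in counts.values() if c > 1)
    return common / len(haplotypes)


def is_allowed_block(haplotypes: Sequence[str], start: int, end: int,
                     alpha: float = 0.8) -> bool:
    """True if common haplotypes make up at least a fraction alpha of the sample."""
    return common_coverage(haplotypes, start, end) >= alpha


sample = ["0101", "0101", "0101", "0101", "1100"]
print(common_coverage(sample, 0, 3), is_allowed_block(sample, 0, 3))
```

Here four of the five chromosomes share one haplotype, so the coverage is 0.8 and the block just meets the threshold.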
Fragment of Fig. 2 from Patil et al., 2001
Notation
• B – block, defined as a set of consecutive SNP numbers,
e.g., B = {45, 46, …, 50}, or B = {i, i+1, …, j}
• L(B) – length of the block (number of SNPs)
• f(B) – minimum number of SNPs required to distinguish common haplotypes
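For small blocks, f(B) can be found by exhaustive search over subsets of SNPs. The sketch below is only an illustration: the search is exponential in L(B), the requirement of at least one SNP per block is an assumption of this sketch, and the names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence, Set


def common_haplotypes(haplotypes: Sequence[str], start: int, end: int) -> Set[str]:
    """Haplotypes on SNPs start..end that occur more than once in the sample."""
    counts = Counter(h[start:end + 1] for h in haplotypes)
    return {hap for hap, c in counts.items() if c > 1}


def f(haplotypes: Sequence[str], start: int, end: int) -> int:
    """Fewest SNPs in start..end whose alleles tell all common haplotypes apart."""
    common = common_haplotypes(haplotypes, start, end)
    length = end - start + 1                      # this is L(B)
    for size in range(1, length + 1):             # assumption: at least one SNP
        for cols in combinations(range(length), size):
            patterns = {tuple(h[c] for c in cols) for h in common}
            if len(patterns) == len(common):      # every common haplotype is unique
                return size
    return length


sample = ["000", "000", "011", "011", "101", "101"]
print(f(sample, 0, 2))
```

In this sample no single SNP separates the three common haplotypes, but SNPs 0 and 1 together do, so two SNPs are needed for a block of length three.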
Greedy Solution
[Figure: data matrix with Start and End markers delimiting the current block]
0. Fix Start and set End = Start
1. Increment End
2. Compute the ratio L(B)/f(B) for the block B = Start, …, End
…….
3. Stop at the End where the ratio is maximal
4. Go to 0, with Start moved to the first SNP after the chosen block
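The left-to-right procedure on this slide can be sketched as follows. This is a simplified illustration under several assumptions: chromosomes are 0/1 strings without missing data, f(B) is found by exhaustive search and is at least one, a single-SNP block is accepted even if it misses the coverage threshold, and ties in the ratio keep the shorter block. It is not claimed to reproduce the procedure of Patil et al. in detail.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def coverage(haps: Sequence[str], start: int, end: int) -> float:
    """Fraction of chromosomes whose haplotype on SNPs start..end occurs more than once."""
    counts = Counter(h[start:end + 1] for h in haps)
    return sum(c for c in counts.values() if c > 1) / len(haps)


def f(haps: Sequence[str], start: int, end: int) -> int:
    """Fewest SNPs (at least one, by assumption) that separate the common haplotypes."""
    counts = Counter(h[start:end + 1] for h in haps)
    common = [hap for hap, c in counts.items() if c > 1]
    length = end - start + 1
    for size in range(1, length + 1):
        for cols in combinations(range(length), size):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return size
    return length


def greedy_partition(haps: Sequence[str], alpha: float = 0.8) -> List[Tuple[int, int]]:
    """Grow each block from Start and keep the End with the largest L(B)/f(B)."""
    m = len(haps[0])
    blocks, start = [], 0
    while start < m:
        best_end, best_ratio = start, 0.0
        for end in range(start, m):
            if coverage(haps, start, end) < alpha:
                break          # coverage cannot increase as the block grows
            ratio = (end - start + 1) / f(haps, start, end)
            if ratio > best_ratio:
                best_end, best_ratio = end, ratio
        blocks.append((start, best_end))
        start = best_end + 1
    return blocks


sample = ["0000", "0000", "0011", "0011", "1100", "1100"]
print(greedy_partition(sample))
```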
Results
• 4563 representative SNPs (13%)
• 4135 blocks
Method 3 (Zhang et al. 2002)
Solves the same problem of 80% chromosome coverage, but with dynamic programming, which finds an optimal partition instead of a greedy one
Dynamic programming solution
[Figure: data matrix showing the optimal partition of SNPs 1, 2, …, i into blocks B1(i), B2(i), B3(i), ……]
Assume that for all i = 1, 2, …, j-1 we know the optimal block partition B_1(i), B_2(i), …, B_{K_i}(i) that minimizes:
$$S_i = \sum_{k=1}^{K_i} f[B_k(i)]$$
Bellman’s equation
$$S_j = \min_{i = 1, \ldots, j} \{ S_{i-1} + f(i, i+1, \ldots, j) \}$$
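The recursion can be sketched as below. The assumptions of this illustration are: S_0 = 0, the minimum is taken only over blocks that satisfy the coverage condition, single-SNP blocks are always accepted so that a partition exists, f is found by exhaustive search (at least one SNP per block), and ties between partitions of equal cost are broken arbitrarily. Names are hypothetical and the sketch is not the implementation of Zhang et al.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def coverage(haps: Sequence[str], start: int, end: int) -> float:
    """Fraction of chromosomes whose haplotype on SNPs start..end occurs more than once."""
    counts = Counter(h[start:end + 1] for h in haps)
    return sum(c for c in counts.values() if c > 1) / len(haps)


def f(haps: Sequence[str], start: int, end: int) -> int:
    """Fewest SNPs (at least one, by assumption) that separate the common haplotypes."""
    counts = Counter(h[start:end + 1] for h in haps)
    common = [hap for hap, c in counts.items() if c > 1]
    length = end - start + 1
    for size in range(1, length + 1):
        for cols in combinations(range(length), size):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return size
    return length


def dp_partition(haps: Sequence[str], alpha: float = 0.8) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (S_m, blocks) using S_j = min over i of S_{i-1} + f(i, ..., j)."""
    m = len(haps[0])
    s = [0] + [float("inf")] * m       # s[j]: best cost for the first j SNPs
    back = [0] * (m + 1)               # back[j]: 0-based start of the last block
    for j in range(1, m + 1):
        for i in range(j, 0, -1):      # last block covers SNPs i..j (1-based)
            if i < j and coverage(haps, i - 1, j - 1) < alpha:
                break                  # longer blocks ending at j also fail
            cost = s[i - 1] + f(haps, i - 1, j - 1)
            if cost < s[j]:
                s[j], back[j] = cost, i - 1
    blocks, j = [], m
    while j > 0:                       # walk the back-pointers to recover blocks
        blocks.append((back[j], j - 1))
        j = back[j]
    return s[m], blocks[::-1]


sample = ["0000", "0000", "0011", "0011", "1100", "1100"]
print(dp_partition(sample))
```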
Results
• 3582 representative SNPs (compared to 4563 from greedy algorithm)
• 2575 blocks (compared to 4135 blocks from greedy algorithm)
Conclusions
• Studying haplotype block partitions is important for
1. Constructing haplotype maps for genetic traits
2. Understanding recombination in the human genome
What to expect
• Many more papers in this area appearing in scientific journals