Post on 20-Dec-2015
![Page 1: Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Ion Mandoiu CSE Department, University of Connecticut Joint work with Justin](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d485503460f94a241a5/html5/thumbnails/1.jpg)
Genotype Error Detection using
Hidden Markov Models of
Haplotype Diversity
Ion Mandoiu
CSE Department, University of Connecticut
Joint work with Justin Kennedy and Bogdan Pasaniuc
Outline
Introduction
Likelihood Sensitivity Approach to Error Detection
HMM-Based Algorithms
Experimental Results
Conclusion
Single Nucleotide Polymorphisms
Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
High density in the human genome: 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
Haplotypes and Genotypes
Diploids: two homologous copies of each chromosome
  • One inherited from the mother and one from the father
Haplotype: description of SNP alleles on a chromosome
  • 0/1 vector: 0 for major allele, 1 for minor
Genotype: description of alleles on both chromosomes
  • 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele; 2 if the chromosomes contain different alleles
Example:
  011100110 + 001000010   (two haplotypes per individual)
  021200210               (genotype)
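The 0/1/2 coding above can be sketched in a few lines of Python. This is only an illustration of the encoding on the slide; the function name `genotype_from_haplotypes` is made up for this example.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    genotype = []
    for a, b in zip(h1, h2):
        if a == b:
            genotype.append(a)    # homozygous: 0 (major) or 1 (minor)
        else:
            genotype.append("2")  # heterozygous: the two alleles differ
    return "".join(genotype)


# The two haplotypes from the slide give the genotype shown there.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```

The mapping is many-to-one: several haplotype pairs can explain the same genotype, which is why phasing is needed in the rest of the talk.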
Why SNP Genotypes?
Identification and fine mapping of disease-related genes
  • Methods: linkage analysis, allele-sharing, association studies
  • Genotype data: large pedigrees, sibling pairs, trios, unrelated individuals
Genotyping Errors
A real problem despite advances in genotyping technology
  • [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
Error types
  • Systematic errors (e.g., assay failure) detected by departure from HWE [Hosking et al. 2004]
  • For pedigree data, some errors are detected as Mendelian inconsistencies (MIs)
  • Undetected errors
    • E.g., if mother/father/child are all heterozygous, any error is Mendelian consistent
    • Only ~30% detectable as MIs for trios [Gordon et al. 1999]
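The all-heterozygous example can be checked with a small sketch. It assumes a single biallelic SNP with the 0/1/2 genotype coding used earlier in the talk; the helper names are hypothetical and the code only illustrates the Mendelian consistency rule, not any published tool.

```python
from itertools import product

# 0/1/2 genotype coding: 0 = two major alleles, 1 = two minor, 2 = heterozygous
ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}


def mendelian_consistent(mother: int, father: int, child: int) -> bool:
    """True if the child can receive one allele from each parent."""
    child_alleles = tuple(sorted(ALLELES[child]))
    return any(
        tuple(sorted((a, b))) == child_alleles
        for a, b in product(ALLELES[mother], ALLELES[father])
    )


def detectable_changes(mother: int, father: int, child: int):
    """Count single-genotype changes that create a Mendelian inconsistency."""
    trio = [mother, father, child]
    detected = total = 0
    for i in range(3):
        for g in (0, 1, 2):
            if g == trio[i]:
                continue
            changed = trio.copy()
            changed[i] = g
            total += 1
            if not mendelian_consistent(*changed):
                detected += 1
    return detected, total


print(detectable_changes(2, 2, 2))  # all heterozygous
print(detectable_changes(0, 0, 0))  # all homozygous for the major allele
```

For the all-heterozygous trio none of the six possible single-genotype changes is detected, while for the all-homozygous trio four of six are.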
Effects of Undetected Genotyping Errors
Even low error levels can have large effects for some study designs (e.g. rare alleles, haplotype-based)
Errors as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04]
1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
Related Work
Improved genotype calling algorithms [Di et al. 05, Rabbee & Speed 06, Nicolae et al. 06]
Explicit modeling in analysis methods [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]
  • Computationally complex
Separate error detection step [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]
  • Detected errors can be retyped, imputed, or ignored in downstream analyses
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Original trio T:
  Mother  0 1 2 1 0 2
  Father  0 2 2 1 0 2
  Child   0 2 2 1 0 2
Best phasing of T:
  Mother  h1 = 0 1 1 1 0 0   h2 = 0 1 0 1 0 1
  Father  h3 = 0 0 0 1 0 1   h4 = 0 1 1 1 0 0
  Child   h1 = 0 1 1 1 0 0   h3 = 0 0 0 1 0 1
Likelihood of best phasing for original trio T:
  $L(T) = \max\, p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$
Modified trio T', with the genotype under test marked "?". Best phasing of T':
  Mother  h'1 = 0 1 0 1 0 1   h'2 = 0 1 1 1 0 0
  Father  h'3 = 0 0 0 1 0 0   h'4 = 0 1 1 1 0 1
  Child   h'1 = 0 1 0 1 0 1   h'3 = 0 0 0 1 0 0
Likelihood of best phasing for modified trio T':
  $L(T') = \max\, p(h'_1)\,p(h'_2)\,p(h'_3)\,p(h'_4)$
Large change in likelihood suggests likely error
  • Flag genotype as an error if L(T')/L(T) > R, where R is the detection threshold (e.g., R = 10^4)
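A brute-force sketch of the likelihood ratio test may help make the idea concrete. It enumerates all quadruples of haplotypes from a small frequency table, which is only feasible for toy inputs. The haplotype frequencies below are invented for illustration, the function names are hypothetical, and treating the tested genotype as a wildcard `?` is an assumption about how the modified trio T' is formed.

```python
from itertools import product


def compatible(h_a: str, h_b: str, genotype: str) -> bool:
    """True if haplotypes h_a and h_b explain the genotype; '?' matches anything."""
    for a, b, g in zip(h_a, h_b, genotype):
        if g == "?":
            continue
        expected = a if a == b else "2"
        if g != expected:
            return False
    return True


def best_phasing_likelihood(mother, father, child, freq):
    """max p(h1)p(h2)p(h3)p(h4) with mother = h1+h2, father = h3+h4, child = h1+h3."""
    best = 0.0
    for h1, h2, h3, h4 in product(freq, repeat=4):
        if (compatible(h1, h2, mother) and compatible(h3, h4, father)
                and compatible(h1, h3, child)):
            best = max(best, freq[h1] * freq[h2] * freq[h3] * freq[h4])
    return best


# Hypothetical haplotype frequencies (not taken from the slides).
freq = {"011100": 0.40, "010101": 0.30, "000101": 0.01,
        "000100": 0.19, "011101": 0.10}

# Trio from the slide; in T' the child's third genotype is under test.
L_T = best_phasing_likelihood("012102", "022102", "022102", freq)
L_T_mod = best_phasing_likelihood("012102", "022102", "02?102", freq)
ratio = L_T_mod / L_T
print(L_T, L_T_mod, ratio)
```

With these made-up frequencies the ratio comes out at about 4.75, far below a threshold such as R = 10^4, so this genotype would not be flagged. Because T' is less constrained than T, L(T') can never be smaller than L(T).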
Implementation in FAMHAP [Becker et al. 06]
Window-based algorithm
  • For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)
  • Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
  • Flag genotype as an error if L(T')/L(T) > R for at least one window
  Mother  …201012 1 02210...
  Father  …201202 2 10211...
  Child   …000120 2 21021...
Limitations of FAMHAP Implementation
Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values
False positives caused by nearby errors (due to the use of multiple short windows)
Our approach:
  • HMM model of haplotype diversity: all haplotypes are represented and there is no need for short windows
  • Alternate likelihood functions: scalable runtime
HMM Model
Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel & Shamir 05]
Unlike [Scheet & Stephens 06], recombination rates are not modeled explicitly
Block-free model, paths with high transition probability correspond to “founder” haplotypes
(Figure from Rastas et al. 07)
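As a rough illustration of how such a model assigns a probability to a haplotype, the sketch below runs the standard forward algorithm on a toy HMM with K = 2 states per SNP. The parameter layout (`init`, `trans`, `emit`), the two "founder" haplotypes, and all numeric values are assumptions made for this example, not parameters from the talk.

```python
from itertools import product


def haplotype_probability(hap, init, trans, emit):
    """Total probability that the HMM emits `hap`, summed over all state paths.

    init[q]        : probability of starting in state q at the first SNP
    trans[j][p][q] : probability of moving from state p at SNP j to q at SNP j+1
    emit[j][q][a]  : probability that state q at SNP j emits allele a (0 or 1)
    """
    K = len(init)
    forward = [init[q] * emit[0][q][hap[0]] for q in range(K)]
    for j in range(1, len(hap)):
        forward = [
            emit[j][q][hap[j]]
            * sum(forward[p] * trans[j - 1][p][q] for p in range(K))
            for q in range(K)
        ]
    return sum(forward)


# Hypothetical toy model: two founders, 3 SNPs, 1% emission noise,
# and a 10% chance of switching founder between consecutive SNPs.
founders = [(0, 1, 1), (1, 0, 0)]
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]] for _ in range(2)]
emit = [[[0.99, 0.01] if founders[q][j] == 0 else [0.01, 0.99]
         for q in range(2)] for j in range(3)]

p_founder = haplotype_probability((0, 1, 1), init, trans, emit)
p_mosaic = haplotype_probability((0, 1, 0), init, trans, emit)
total = sum(haplotype_probability(h, init, trans, emit)
            for h in product((0, 1), repeat=3))
print(p_founder, p_mosaic, total)
```

A haplotype that follows one founder throughout gets a much higher probability than a mosaic that requires a switch, and the probabilities of all eight haplotypes sum to 1.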
HMM Training
Previous works use EM training of HMM based on unrelated genotype data
Our 2-step algorithm exploits pedigree info
  • Step 1: infer haplotypes using pedigree-aware algorithm based on entropy minimization
  • Step 2: train HMM based on inferred haplotypes, using Baum-Welch
Complexity of Computing Maximum Phasing Probability
• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ unless ZPP=NP, where f is the number of founders
• For trios, hard to approximate within $O(f^{1/4-\varepsilon})$
• Reductions from the clique problem
Alternate Likelihood Functions
• Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio
• Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes
• Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
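The difference between a Viterbi-style likelihood (best set of paths) and a total probability (sum over all paths) can be illustrated on a simplified case. The sketch below handles one unrelated individual, so it uses 2 paths rather than the 4 paths needed for a trio, and it enumerates everything by brute force on a toy HMM. The model parameters and function names are assumptions for this example only.

```python
from itertools import product

# Hypothetical toy HMM: K = 2 founder states per SNP, N = 3 SNPs.
FOUNDERS = [(0, 1, 1), (1, 0, 0)]
INIT = [0.5, 0.5]
TRANS = [[[0.9, 0.1], [0.1, 0.9]] for _ in range(2)]   # TRANS[j][p][q]
EMIT = [[[0.99, 0.01] if FOUNDERS[q][j] == 0 else [0.01, 0.99]
         for q in range(2)] for j in range(3)]          # EMIT[j][q][allele]


def path_prob(path, hap):
    """Joint probability of following `path` and emitting `hap`."""
    p = INIT[path[0]] * EMIT[0][path[0]][hap[0]]
    for j in range(1, len(hap)):
        p *= TRANS[j - 1][path[j - 1]][path[j]] * EMIT[j][path[j]][hap[j]]
    return p


def explains(h1, h2, genotype):
    """True if the haplotype pair yields the 0/1/2 genotype."""
    return all((a if a == b else 2) == g for a, b, g in zip(h1, h2, genotype))


def viterbi_and_total(genotype):
    """Return (max over path pairs, sum over path pairs) for one genotype."""
    n, K = len(genotype), len(INIT)
    haps = list(product((0, 1), repeat=n))
    paths = list(product(range(K), repeat=n))
    best = total = 0.0
    for h1, h2 in product(haps, repeat=2):       # ordered haplotype pairs
        if not explains(h1, h2, genotype):
            continue
        for p1, p2 in product(paths, repeat=2):  # ordered path pairs
            p = path_prob(p1, h1) * path_prob(p2, h2)
            best = max(best, p)
            total += p
    return best, total


best, total = viterbi_and_total((2, 2, 2))
print(best, total)
```

The maximum can never exceed the sum, and summing the total probability over all 27 possible genotypes gives 1, which is a convenient sanity check for a sketch like this.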
Efficient Computation of Viterbi Probability for Trios
For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi's algorithm in time $O(NK^8)$
$K^3$ speed-up by factoring common terms:
$$V(j+1; q_1, q_2, q_3, q_4) = E(j+1; q_1, q_2, q_3, q_4) \cdot \max_{q'_4 \in Q_j} \{ Pre_3(j; q_1, q_2, q_3, q'_4)\, \gamma(q'_4, q_4) \}$$
Where:
• $E(j+1; q_1, q_2, q_3, q_4)$ = maximum probability of emitting SNP genotypes at locus j+1 from states $(q_1, q_2, q_3, q_4)$
• $\gamma(q'_4, q_4)$ = transition probability
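The factoring idea can be sketched for a single locus step. A direct maximization over all 4-tuples of previous states costs K^8 operations per locus, while eliminating one previous state at a time costs 4·K^5. In the sketch below, the intermediate table after three eliminations plays the role of Pre in the recurrence; that reading of Pre, the random test values, and the function names are assumptions, and the emission factor E is left out because it is applied after the maximization.

```python
import math
import random
from itertools import product


def step_bruteforce(V, gamma, K):
    """For every new 4-tuple q, maximize over all old 4-tuples p: K^8 work."""
    states = range(K)
    out = {}
    for q in product(states, repeat=4):
        out[q] = max(
            V[p] * gamma[p[0]][q[0]] * gamma[p[1]][q[1]]
            * gamma[p[2]][q[2]] * gamma[p[3]][q[3]]
            for p in product(states, repeat=4)
        )
    return out


def step_factored(V, gamma, K):
    """Eliminate one old state per pass: four passes of K^5 work each."""
    states = range(K)
    cur = dict(V)
    for pos in range(4):
        nxt = {}
        # After this pass, coordinates 0..pos hold new states,
        # coordinates pos+1..3 still hold old states.
        for t in product(states, repeat=4):
            nxt[t] = max(
                cur[t[:pos] + (p,) + t[pos + 1:]] * gamma[p][t[pos]]
                for p in states
            )
        cur = nxt
    return cur


K = 3
rng = random.Random(0)
states = range(K)
# Arbitrary non-negative values; they need not be normalized for this check.
V = {p: rng.random() for p in product(states, repeat=4)}
gamma = [[rng.random() for _ in states] for _ in states]

slow = step_bruteforce(V, gamma, K)
fast = step_factored(V, gamma, K)
print(all(math.isclose(slow[q], fast[q]) for q in slow))
```

The two results agree because all factors are non-negative, so the maximum over a product of independent choices can be taken one coordinate at a time.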
Overall Runtimes
Viterbi probability
  • Likelihoods of all 3N modified trios can be computed within $O(NK^5)$ time using forward-backward algorithm
  • Overall runtime for M trios: $O(MNK^5)$
Probability of Viterbi haplotypes
  • Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms
  • Overall runtime: $O(M(NK^5 + N^2K))$
Total trio probability
  • Similar pre-computation speed-up & forward-backward algorithm
  • Overall runtime: $O(MNK^5)$
Datasets
Real dataset [Becker et al. 2006]
  • 35 SNP loci on chromosome 16 covering a region of 91 kb
  • 551 trios
Synthetic datasets
  • 35 SNPs, 30-551 trios
  • Preserved missing data pattern of real dataset
  • Haplotypes assigned to trios based on frequencies inferred from real dataset
  • 1% error rate, four error insertion models
    • Random allele
    • Random genotype
    • Heterozygous-to-homozygous
    • Homozygous-to-heterozygous
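The slide only names the four error insertion models, so the sketch below is one plausible reading of each and should be treated as an assumption: "random allele" flips one randomly chosen allele, "random genotype" substitutes one of the other two genotypes, and the last two models convert between heterozygous and homozygous calls. Genotypes use the 0/1/2 coding from earlier; all function names are hypothetical.

```python
import random

TO_ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}
TO_GENOTYPE = {(0, 0): 0, (1, 1): 1, (0, 1): 2}


def random_allele(g, rng):
    """Assumed model: flip one of the two alleles, chosen at random."""
    alleles = list(TO_ALLELES[g])
    i = rng.randrange(2)
    alleles[i] = 1 - alleles[i]
    return TO_GENOTYPE[tuple(sorted(alleles))]


def random_genotype(g, rng):
    """Assumed model: replace the genotype with one of the other two."""
    return rng.choice([x for x in (0, 1, 2) if x != g])


def het_to_hom(g, rng):
    """Assumed model: a heterozygous call becomes a random homozygous call."""
    return rng.choice((0, 1)) if g == 2 else g


def hom_to_het(g, rng):
    """Assumed model: a homozygous call becomes heterozygous."""
    return 2 if g in (0, 1) else g


def insert_errors(genotypes, model, rate, rng):
    """Apply `model` independently to each genotype with probability `rate`."""
    return [model(g, rng) if rng.random() < rate else g for g in genotypes]


rng = random.Random(1)
clean = [0, 2, 2, 1, 0, 2] * 50
noisy = insert_errors(clean, random_allele, 0.01, rng)
print(sum(a != b for a, b in zip(clean, noisy)), "genotypes changed")
```

Under this reading, the last two models only affect genotypes of the matching type, so the effective number of inserted errors depends on how many heterozygous or homozygous calls the data contain.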
Experimental Setup
Two strategies for handling MIs
  • Set all three individuals to unknown prior to error detection, or
  • Set child only to unknown (preserving parents’ original data)
Two testing strategies
  • Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1
  • Simultaneously test three SNP genotypes at the same locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3
Comparison with FAMHAP (Random Allele Errors)
[Figure: sensitivity (0 to 1) vs. FP rate (log scale, 0.0001 to 1) for TrioProb-1, ViterbiHaps-1, ViterbiProb-1, FAMHAP-1, TrioProb-3, ViterbiHaps-3, ViterbiProb-3, FAMHAP-3]
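Each point on a sensitivity vs. FP rate curve corresponds to one detection threshold R. The sketch below shows how such points could be computed when the true error positions are known, as in the synthetic datasets. The likelihood ratios and labels are invented for illustration, and the function name is hypothetical.

```python
def sensitivity_and_fp_rate(ratios, is_error, threshold):
    """Flag a genotype when its likelihood ratio exceeds `threshold`."""
    tp = fp = n_err = n_ok = 0
    for ratio, err in zip(ratios, is_error):
        flagged = ratio > threshold
        if err:
            n_err += 1
            tp += flagged   # true error that was flagged
        else:
            n_ok += 1
            fp += flagged   # correct genotype that was flagged
    return tp / n_err, fp / n_ok


# Hypothetical likelihood ratios and ground truth (True = inserted error).
ratios = [2e5, 3e4, 40.0, 1.2, 1.0, 5e3, 1.0, 15.0]
is_error = [True, True, True, False, False, False, False, False]

for threshold in (1e2, 1e3, 1e4):
    sens, fpr = sensitivity_and_fp_rate(ratios, is_error, threshold)
    print(threshold, sens, fpr)
```

Raising the threshold can only lower or keep both values, which is why each method traces a monotone curve as R varies.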
Children vs. Parents (Random Allele Errors)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.025) for TrioProb-1-P and TrioProb-1-C]
Error Model Comparison (TrioProb-1 Parents)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.025) for four error models at 1% error rate: Random Allele, Random Genotype, Heterozygous-to-Homozygous, Homozygous-to-Heterozygous]
TrioProb-1 Results on Real Dataset
[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
  • 23 SNP genotypes were identified as true errors
  • 41*3-23=100 resequenced SNP genotypes agree with original calls
Predictive value for R=10^4 is between 18/26=69% and 24/26=92%, compared to 23/41=56% for FAMHAP-3

| Threshold | Group | Total Signals | True Positives | False Positives | Unknown |
|---|---|---|---|---|---|
| 2 | Parents | 80 | 9 | 2 | 69 |
| 2 | Children | 27 | 11 | 3 | 13 |
| 2 | Total | 107 | 20 | 5 | 82 |
| 3 | Parents | 15 | 9 | 1 | 5 |
| 3 | Children | 21 | 10 | 3 | 8 |
| 3 | Total | 36 | 19 | 4 | 13 |
| 4 | Parents | 9 | 8 | 1 | 0 |
| 4 | Children | 17 | 10 | 1 | 6 |
| 4 | Total | 26 | 18 | 2 | 6 |
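The predictive value range follows directly from the threshold-4 totals: the lower bound treats every "unknown" signal as a false positive, and the upper bound treats every one as a true positive. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def predictive_value_bounds(true_pos, false_pos, unknown):
    """Bounds on the predictive value when some signals were never verified."""
    total = true_pos + false_pos + unknown
    lower = true_pos / total               # all unknowns are false positives
    upper = (true_pos + unknown) / total   # all unknowns are true positives
    return lower, upper


# Totals at threshold 4: 26 signals = 18 true + 2 false + 6 unknown.
low, high = predictive_value_bounds(18, 2, 6)
print(f"TrioProb-1: {low:.0%} to {high:.0%}")   # 69% to 92%
print(f"FAMHAP-3:   {23 / 41:.0%}")             # 56%
```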
Pedigree Info vs. Sample Size Effect
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.03) for 551-TrioProb-1-T, 129-TrioProb-1-T, 30-TrioProb-1-T, 551-Unrelated-ViterbiProb-1]
Unrelated vs. Trio Likelihood Sensitivity
[Figure: two panels, "Unrelated ViterbiProb-1 Likelihood ratios (children)" and "Trio ViterbiProb-1 Likelihood ratios (children)". Each panel has a log-scale vertical axis (1 to 100000), horizontal bins from 0.1 to 4.9 plus ">5", and two series: "no error" and "error"]
Combining Likelihood Functions (Children, Random Allele Model)
[Figure: vertical axis 0.5 to 1, horizontal axis 0 to 0.01; series: FAMHAP, Unrelated, Duo, Trio, MinUT, MinDT, MinUDT, Majority, MinUD]
Combining Likelihood Functions (Parents, Random Allele Model)
[Figure: vertical axis 0.5 to 1, horizontal axis 0 to 0.01; series: FAMHAP, Unrelated, Duo, Trio, MinUT, MinDT, MinUDT, Majority, MinUD]
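The slides list the combined detectors only by name, so the following sketch rests on a guess about what the names mean: "MinUT" is taken to be the minimum of the Unrelated and Trio likelihood ratios (and likewise for the other "Min" rules), and "Majority" is taken to flag a genotype when at least two of the three ratios exceed the threshold. These definitions, the function name, and the example values are all assumptions.

```python
def flag_error(ratios, rule, threshold):
    """ratios: likelihood ratios keyed by 'U' (unrelated), 'D' (duo), 'T' (trio)."""
    if rule == "Majority":
        # Assumed: at least two of the three ratios exceed the threshold.
        return sum(r > threshold for r in ratios.values()) >= 2
    # Assumed: a "Min" rule flags only if every listed ratio exceeds the threshold.
    members = {"MinUT": "UT", "MinDT": "DT", "MinUD": "UD", "MinUDT": "UDT"}[rule]
    return min(ratios[m] for m in members) > threshold


ratios = {"U": 3e4, "D": 8e3, "T": 2e5}   # hypothetical values
for rule in ("MinUT", "MinDT", "MinUD", "MinUDT", "Majority"):
    print(rule, flag_error(ratios, rule, 1e4))
```

Under this reading a "Min" rule is stricter than any of its members alone, which would trade some sensitivity for a lower FP rate.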
Conclusion
Proposed efficient methods for error detection in trio genotype data based on an HMM model of haplotype diversity
  • Significantly improved detection accuracy compared to FAMHAP
  • High sensitivity even for very low FP rates
  • Runtime linear in #SNPs and #trios
Ongoing work
  • Iterative error detection
    • Fix MIs using likelihood before error detection
    • Correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training)
  • Integration with genotype calling algorithms
    • Combine low level intensity data with haplotype-based likelihoods
    • Most useful when less pedigree info is available (unrelated, sibling pairs w/o parent genotypes, parents in trios)
  • Locus specific thresholds, p-values
    • Via simulations similar to [Douglas et al. 00]
Questions?