aug2013 illumina platinum genomes
© 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Platinum Genomes: Identifying variants
using a large pedigree
Michael A. Eberle
GIAB August, 2013
2
Platinum Genome project: Improving technology & tools
Create a catalogue of highly accurate whole-genome variant calls within a well-characterized pedigree
– SNPs, indels & CNVs
– Including highly confident reference positions
– Provide direct supporting evidence for every variant call
Develop a framework to assess variant callers
Provide a path to improve variant callers by supplying better truth data for measuring sensitivity and precision
– Modifying the SNP filters to maximize accuracy
[Figure: Truth and Test call sets compared, with regions labelled Correct, FP and FN]
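The Correct / FP / FN comparison above can be sketched in a few lines. This is a minimal illustration, not the project's actual assessment framework: the function name `assess_calls`, the tuple layout `(chrom, pos, ref, alt)` and the toy variants are all assumptions made for the example.

```python
def assess_calls(truth, test):
    """Compare a test call set against a truth set.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples
    (a hypothetical representation chosen for this sketch).
    """
    truth, test = set(truth), set(test)
    tp = len(truth & test)   # "Correct": present in both sets
    fp = len(test - truth)   # called by the test set but absent from truth
    fn = len(truth - test)   # present in truth but missed by the test set
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Toy data, invented for illustration only.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 300, "G", "A"), ("chr1", 400, "T", "C")}
test = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
        ("chr1", 300, "G", "A"), ("chr1", 500, "A", "T")}

result = assess_calls(truth, test)
print(result)
```

Exact tuple matching is the simplest possible comparison; indels in particular usually need normalisation before two call sets can be compared this way.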
3
NIST GIAB – Pedigree analysis
[Pedigree figure]
Grandparents: 12889 12890 12891 12892
Parents: 12877 12878
Children: 12879 12880 12881 12882 12883 12884 12885 12887 12886 12888 12893
All 17 members sequenced to at least 50x depth (PCR-Free protocol)
Variants are called across the pedigree using different software & technology
Inheritance information provides high-confidence, direct validation of variant calls
Analysis of SNPs in the parents and 11 children
4
Pedigree Analysis – Using haplotypes to detect conflicts
[Figure: haplotype sequences for the parents and children. Four distinct haplotypes appear (ACAGTA, ATCTGA, GTCGTC, GCATTA); one copy reads ACATTA.]
With a sufficiently large pedigree, all four possible inheritance patterns will be observed and most of the genotypes can be phased into haplotypes
5
Using haplotypes to detect conflicts
[Figure: the same parent and child haplotypes (ACAGTA, ATCTGA, GTCGTC, GCATTA), with the single ACATTA copy annotated "Error at this sample/position".]
Individual genotype (GT) accuracy is assessed using surrounding genotype calls across the pedigree
Genotypes are parsimoniously phased to minimize the number of conflicts across the pedigree
Facilitates assigning conflicts to a sample, imputing missing data and correcting errors
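The idea of assigning a conflict to one sample and position can be illustrated with the sequences from the figure. This sketch assumes the haplotypes are already phased and that each child's inherited pair is known; the labels A to D, the child names and the function `find_conflicts` are hypothetical. The real analysis works from unphased genotype calls.

```python
# Parental haplotypes as shown in the figure; the labels A-D are assumed.
PARENTAL = {"A": "ACAGTA", "B": "ATCTGA", "C": "GTCGTC", "D": "GCATTA"}


def find_conflicts(parental, inherited, observed):
    """Return (sample, haplotype, position, expected, observed) mismatches.

    inherited: sample -> pair of parental haplotype labels it carries
    observed:  sample -> pair of observed haplotype sequences, same order
    Positions are 0-based offsets into the haplotype string.
    """
    conflicts = []
    for sample, labels in inherited.items():
        for label, seq in zip(labels, observed[sample]):
            expected = parental[label]
            for pos, (exp_base, obs_base) in enumerate(zip(expected, seq)):
                if exp_base != obs_base:
                    conflicts.append((sample, label, pos, exp_base, obs_base))
    return conflicts


# Hypothetical children; child2 carries the ACATTA copy from the figure.
inherited = {"child1": ("A", "C"), "child2": ("A", "D"), "child3": ("B", "D")}
observed = {
    "child1": ("ACAGTA", "GTCGTC"),
    "child2": ("ACATTA", "GCATTA"),
    "child3": ("ATCTGA", "GCATTA"),
}

conflicts = find_conflicts(PARENTAL, inherited, observed)
print(conflicts)  # [('child2', 'A', 3, 'G', 'T')]
```

Because every other copy of haplotype A agrees, the single mismatch is most simply explained as an error in one sample rather than in the shared haplotype.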
6
First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children
Variants called across the pedigree using multiple callers
– E.g. GATK, Cortex, Isaac & CGI for SNPs
Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– 3^13 (~1.6M) possible genotype combinations
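The count of 16 follows from the four parental haplotypes each carrying either the reference or the alternate allele (2^4 = 16), while 13 samples with three possible biallelic genotypes give 3^13 unconstrained combinations. The sketch below enumerates the consistent patterns for one assumed inheritance state. The child labels are borrowed from the sample labels on the read-count slide later in this deck purely as an example; the haplotype names A to D and the 0/1/2 alt-allele-count encoding are assumptions.

```python
from itertools import product

# Assumed inheritance state at one locus: father carries A and B, mother
# carries C and D, and each child carries one haplotype from each parent.
CHILDREN = "CB DA CB DB DA CB CA DB CB CA DA".split()


def consistent_patterns(children):
    """Enumerate genotype vectors (alt-allele counts for father, mother and
    each child) that are consistent with the given inheritance state."""
    patterns = set()
    for alleles in product((0, 1), repeat=4):      # ref/alt on A, B, C, D
        alt = dict(zip("ABCD", alleles))
        parents = (alt["A"] + alt["B"], alt["C"] + alt["D"])
        kids = tuple(alt[h1] + alt[h2] for h1, h2 in children)
        patterns.add(parents + kids)
    return patterns


patterns = consistent_patterns(CHILDREN)
total = 3 ** 13                                    # unconstrained combinations
print(len(patterns), total)                        # 16 1594323
```

A random genotyping error therefore has only about a 16 in 1.6 million chance of landing on another consistent pattern.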
Analysis of variant calls within the pedigree structure
7
Homozygous positions (GATK)
– ~2.6B positions identified as homozygous reference across the pedigree
SNPs (GATK, Cortex, Isaac & CGI)
– ~4.7M positions where SNPs agree with transmission of parental chromosomes
– >95% (4.5M) called consistent with transmission by multiple algorithms/technologies
– >98% (4.6M) with supporting evidence from other call sets (i.e. same variant called in at least one of the samples)
Indels (GATK, Cortex & CGI)
– ~640k indels consistent with transmission of parental chromosomes
– Events range in size from 1 to 350bp
CNVs (BreakDancer & Grouper)
– ~772 CNVs - mostly deletions though a couple of duplications
– Events range from 1kb to 322kb though still refining break points
Current state
8
CNVs
9
Incorporating larger variants
SNPs and small indels work well because the genotypes are highly accurate
– A single genotyping error in any of the 13 samples will almost never be consistent with the haplotype transmission
Developing approaches for other variant types that have lower calling accuracy
– Many CNV callers do not provide GT information
– Accuracy is too low to use pedigree-consistency
10
Incorporating CNVs into this framework
Make breakpoint calls within each sample using BreakDancer & Grouper
Identify regions of overlap between samples (keeping singletons)
Corroborate based on read counts within the putative CNV events
Refine to breakpoint resolution
[Figure: breakpoint calls per sample (NA12877 to NA12882 shown) with the resulting test regions marked]
• Count the uniquely aligned reads within the defined break points for the test regions for each sample & identify events where the read counts are consistent with a deletion or duplication
• For internally-consistent events, follow up with targeted analysis to identify bp resolution of events
• On average ~150x depth for every event
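The step of identifying regions of overlap between samples while keeping singletons might look like the following. This is only a sketch under assumptions: the slides do not say how overlap is defined, so a plain union-merge of overlapping intervals is used here, and the function name and coordinates are invented for illustration.

```python
def merge_candidate_regions(calls):
    """Merge overlapping per-sample CNV calls into test regions.

    calls: list of (sample, start, end) on a single chromosome.
    Returns a list of (start, end, sorted sample names). A call that
    overlaps nothing else is kept as a singleton region.
    """
    regions = []
    for sample, start, end in sorted(calls, key=lambda c: (c[1], c[2])):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it and record the sample.
            prev_start, prev_end, samples = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), samples | {sample})
        else:
            regions.append((start, end, {sample}))
    return [(s, e, sorted(names)) for s, e, names in regions]


# Hypothetical breakpoint calls (coordinates are made up).
calls = [
    ("NA12877", 1000, 9500),
    ("NA12879", 1200, 9400),
    ("NA12881", 900, 9600),
    ("NA12878", 50000, 52000),   # singleton, retained as its own region
]

regions = merge_candidate_regions(calls)
for region in regions:
    print(region)
```

Each merged region would then be tested in every sample by counting uniquely aligned reads, whether or not a caller reported an event in that sample.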
11
[Figure: read counts per sample (axis 0 to 2000) with inferred copy number (0, 1, 2). Samples are labelled by haplotype pair: AB CD CB DA CB DB DA CB CA DB CB CA DA]
Using read counts to confirm deletions – 8.5kb deletion
Best solution (copy number per haplotype): A=0 ; B=1 ; C=1 ; D=1
All samples with haplotype A (six samples) are consistent with haploid based on read counts
[Figure legend: Diploid, Haploid, Zero-ploid]
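One way to arrive at a best solution of this kind is to try every assignment of copy number to the four parental haplotypes and keep the one whose predicted per-sample ploidy best explains the read counts. The sketch below does this by least squares. The haplotype pairs are the sample labels from the plot, but the read counts are invented (the transcript does not include them), and the fitting method is an assumption rather than the authors' procedure.

```python
from itertools import product

SAMPLES = "AB CD CB DA CB DB DA CB CA DB CB CA DA".split()
# Hypothetical read counts, one per sample, invented for illustration.
READ_COUNTS = [760, 1510, 1480, 740, 1530, 1490, 770,
               1500, 730, 1520, 1470, 750, 765]


def best_solution(samples, counts):
    """Search copy-number assignments (0 = deleted, 1 = present) for the
    haplotypes A-D and return (error, assignment, reads per copy)."""
    best = None
    for assignment in product((0, 1), repeat=4):
        per_hap = dict(zip("ABCD", assignment))
        ploidy = [per_hap[h1] + per_hap[h2] for h1, h2 in samples]
        # Least-squares estimate of the read count contributed by one copy.
        denom = sum(p * p for p in ploidy)
        depth = sum(p * c for p, c in zip(ploidy, counts)) / denom if denom else 0.0
        error = sum((c - depth * p) ** 2 for p, c in zip(ploidy, counts))
        if best is None or error < best[0]:
            best = (error, per_hap, depth)
    return best


error, per_hap, depth = best_solution(SAMPLES, READ_COUNTS)
print(per_hap)
```

Restricting each haplotype to 0 or 1 copies models deletions only; allowing 2 copies for duplications would need a fixed depth scale, since doubling every copy number and halving the depth fits equally well.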
12
Breakdown of 772 “accurate” CNVs (1kb to 322kb in size)
[Venn diagram: BreakDancer and Grouper call sets, with region counts 266, 408 and 98 (total 772)]
13
Assembling breakpoints for the 772 CNVs
– Reassessing the “failed” calls where applicable
Incorporating different calling algorithms / methods
– E.g. SNP inheritance can help identify CNVs that are missed by other methods
– Including mate pair data (~2kb insert size)
Working on different methods to improve our catalogue of ~30bp to 2kb events & incorporating different callers
Assigning error modes for “failed” SNPs
– Many look like cell line mutations & alignment errors
Comparing our call set to other datasets to assess accuracy and completeness
– Other GIAB call sets
– Fosmid data (Jaffe & Kidd)
Next steps
14
Illumina: Morten Kallberg, Xiaoyu Chen, Han-Yu Chuang, Phil Tedder, Sean Humphray, Elliott Margulies, David Bentley
Oxford: Zamin Iqbal, Gil McVean
These data and more are available at www.platinumgenomes.org
Acknowledgements