the 1000 genomes project gil mcvean department of statistics, oxford
Post on 19-Dec-2015
214 views
TRANSCRIPT
The 1000 Genomes Project
Gil McVeanDepartment of Statistics, Oxford
What is the 1000 Genomes Project?
• A catalogue of all types of genetic variation, including rare variants (c. 1% frequency) obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest
• A large international collaboration– UK, USA, China, Germany
• An exploration of the use of next-generation technologies for population-scale genome sequencing
• A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies
Samples for the main project
UKFIN
TSIESP
CEU
JPTCHB
CHS
DAI
KVTGMB
GHN
YRI
MLW
LWK
Major population groups comprised of subpopulations of c. 100 each
MXL
ASW
newCMB
PRO
Population-scale genome sequencing
Haplotypes2x
10x
Pilot experiments
• Pilot 1– Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and
CHB+JPT
• Pilot 2– High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)
• Pilot 3– Exons from 1000 genes to 20x in c. 1000 samples (largely European)
Complete!
The 1000G Low Coverage Pilot
• 185 individuals from 4 populations– CEU (63), CHB (30), JPT (30), YRI (62)
Population Technology N Individuals Mapped Bases (billions)
Mean Coverage / Individual
CEU SLX 52 482 3.09SOLiD 30 240 2.66454 18 132 2.45
CHB SLX 30 234 2.60JPT SLX 28 227 2.70
454 2 9.6 1.60YRI SLX 60 594 3.30
SOLiD 5 20.6 1.38454 2 10.8 1.80
Combined 185 1,884 3.52
Even still, at lot of data isn’t much
• In the Pilot 1 sample 1 tera-basepairs leaves the CEU with…– 6% of genotypes with 0 reads– 16% of genotypes with < 2 reads– 29% of genotypes with < 3 reads
ftp.1000genomes.ebi.ac.uk
www.1000genomes.org
Pilot release expected Nov/Dec 2009
ftp-trace.ncbi.nih.gov/1000genomes/ftp
What has the project already generated?
Over 9 millions novel SNPs
• Total 17.2 M SNPs called
• Previously ~12M SNPs “known” (dbSNP 129)
– 7.9M confirmed– 9.2M novel
4.84
1.09
0.78
0.48
2.80 5.65
1.54
CEU YRI
CHB+JPT
0.50
0.38
0.29 0.26
2.20 4.38
1.35
CEU YRI
CHB+JPT
Total SNPs Novel SNPs
Le Quang
A near complete record of common SNPs
Durbin, Le Quang
0.88
0.90
0.92
0.94
0.96
0.98
1.00
HomRef Het HomNonRef Average
CEU
JPT
CHB
YRI
A set of accurate genotypes
• This is about where simulations suggest we should be with 2-4x on 60 samples
• Note this quality is much much better than if calls were made marginally
Durbin, Le Quang
Many novel indels and larger structural variants
Ref-free in
serti
ons
Ref-assi
sted in
serti
ons
Ref-free deletions
Ref-assi
sted deletions
0
500
1000
1500
2000
2500
3000
Calls>50bpDGV1000g release
Zam Iqbal
Up to 50kb
Novel sequence from de novo assembly
Some interesting biology - variation in SNP density
Some more interesting biology – high Fst SNPs
Ryan Hernandez, Adam Auton
Even more interesting biology – loss of function mutations
Daniel MacArthur
A robust and modular pipeline for analysis of population-scale sequence data
An efficient format for storing aligned reads and a set of tools to manipulate and view the files
• SAM/BAM format for storing (aligned) reads
Bioinformatics (2009) http://samtools.sourceforge.net
An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files
www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2
Using the 1000G data now
IMPUTE
Genotypes in additional samples from standard product
Reference panel(1000G)
Imputation
… 11101010101011 …… 00111110000111 …… 11110000011101 …… 00101011100101 … … 1.2..1.0.0..22…
… 11220110200122 … Imputed genotypes
Imputation performance across SNP types from P1 (CEU) from Affy 500k
Annotation # SNPs Info measure
All 414,321 0.780
MAF < 5% 102,000 0.543
MAF > 5% 312,321 0.857
UCSC Genes 6,628 0.736
Depth < 100 3,153 (0.7%) 0.611
SimpRpts 25,625 0.607
SimpRpts + Depth < 100 1,652 (6.5%) 0.671
SegDups 24,301 0.686
SegDups + Depth < 100 665 (2.7%) 0.388
Jonathan Marchini
Looking forward...
• Already have data generated for c. 200 more Europeans– Data generation largely complete by mid 2010
• Much work still to be done on accurate inference of all types of variation from NGS data
• Data already proven useful for a number of projects – please use it
Thanks to the many...