the 1000 genomes project gil mcvean department of statistics, oxford

The 1000 Genomes Project

Gil McVeanDepartment of Statistics, Oxford

What is the 1000 Genomes Project?

• A catalogue of all types of genetic variation, including rare variants (c. 1% frequency) obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest

• A large international collaboration– UK, USA, China, Germany

• An exploration of the use of next-generation technologies for population-scale genome sequencing

• A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies

Samples for the main project

UKFIN

TSIESP

CEU

JPTCHB

CHS

DAI

KVTGMB

GHN

YRI

MLW

LWK

Major population groups comprised of subpopulations of c. 100 each

MXL

ASW

newCMB

PRO

Population-scale genome sequencing

Haplotypes2x

10x

Pilot experiments

• Pilot 1– Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and

CHB+JPT

• Pilot 2– High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)

• Pilot 3– Exons from 1000 genes to 20x in c. 1000 samples (largely European)

Complete!

The 1000G Low Coverage Pilot

• 185 individuals from 4 populations– CEU (63), CHB (30), JPT (30), YRI (62)

Population Technology N Individuals Mapped Bases (billions)

Mean Coverage / Individual

CEU SLX 52 482 3.09SOLiD 30 240 2.66454 18 132 2.45

CHB SLX 30 234 2.60JPT SLX 28 227 2.70

454 2 9.6 1.60YRI SLX 60 594 3.30

SOLiD 5 20.6 1.38454 2 10.8 1.80

Combined 185 1,884 3.52

Even still, at lot of data isn’t much

• In the Pilot 1 sample 1 tera-basepairs leaves the CEU with…– 6% of genotypes with 0 reads– 16% of genotypes with < 2 reads– 29% of genotypes with < 3 reads

ftp.1000genomes.ebi.ac.uk

www.1000genomes.org

Pilot release expected Nov/Dec 2009

ftp-trace.ncbi.nih.gov/1000genomes/ftp

What has the project already generated?

Over 9 millions novel SNPs

• Total 17.2 M SNPs called

• Previously ~12M SNPs “known” (dbSNP 129)

– 7.9M confirmed– 9.2M novel

4.84

1.09

0.78

0.48

2.80 5.65

1.54

CEU YRI

CHB+JPT

0.50

0.38

0.29 0.26

2.20 4.38

1.35

CEU YRI

CHB+JPT

Total SNPs Novel SNPs

Le Quang

A near complete record of common SNPs

Durbin, Le Quang

0.88

0.90

0.92

0.94

0.96

0.98

1.00

HomRef Het HomNonRef Average

CEU

JPT

CHB

YRI

A set of accurate genotypes

• This is about where simulations suggest we should be with 2-4x on 60 samples

• Note this quality is much much better than if calls were made marginally

Durbin, Le Quang

Many novel indels and larger structural variants

Ref-free in

serti

ons

Ref-assi

sted in

serti

ons

Ref-free deletions

Ref-assi

sted deletions

0

500

1000

1500

2000

2500

3000

Calls>50bpDGV1000g release

Zam Iqbal

Up to 50kb

Novel sequence from de novo assembly

Some interesting biology - variation in SNP density

Some more interesting biology – high Fst SNPs

Ryan Hernandez, Adam Auton

Even more interesting biology – loss of function mutations

Daniel MacArthur

A robust and modular pipeline for analysis of population-scale sequence data

An efficient format for storing aligned reads and a set of tools to manipulate and view the files

• SAM/BAM format for storing (aligned) reads

Bioinformatics (2009) http://samtools.sourceforge.net

An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files

www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

Using the 1000G data now

IMPUTE

Genotypes in additional samples from standard product

Reference panel(1000G)

Imputation

… 11101010101011 …… 00111110000111 …… 11110000011101 …… 00101011100101 … … 1.2..1.0.0..22…

… 11220110200122 … Imputed genotypes

Imputation performance across SNP types from P1 (CEU) from Affy 500k

Annotation # SNPs Info measure

All 414,321 0.780

MAF < 5% 102,000 0.543

MAF > 5% 312,321 0.857

UCSC Genes 6,628 0.736

Depth < 100 3,153 (0.7%) 0.611

SimpRpts 25,625 0.607

SimpRpts + Depth < 100 1,652 (6.5%) 0.671

SegDups 24,301 0.686

SegDups + Depth < 100 665 (2.7%) 0.388

Jonathan Marchini

Looking forward...

• Already have data generated for c. 200 more Europeans– Data generation largely complete by mid 2010

• Much work still to be done on accurate inference of all types of variation from NGS data

• Data already proven useful for a number of projects – please use it

Thanks to the many...

the 1000 genomes project gil mcvean department of statistics, oxford

Documents