george church thu 27-apr-2006 9:30-11 broad-mpg thanks to: new sequencing technologies & diploid...

George Church Thu 27-Apr-2006 9:30-11 Broad-MPG

Thanks to:

New Sequencing Technologies & Diploid Personal Genomes

NHGRI Seq Tech
2004: Agencourt, 454, Microchip
2005: Nanofluidics, Network, VisiGen, Affymetrix, Helicos, Solexa-Lynx

http://doegenomestolife.org/

‘Next Generation’ Technology Development

Multi-molecule           Our role
Affymetrix               Software
Gorfinkel                Polony to Capillary
454 LifeSci              Paired ends, emulsion
Lynx/Solexa              Multiplexing & polony
Agencourt                Seq by Ligation (SbL)

Single molecules
Helicos Biosci           SAB, cleavable fluors
Pacific Biosci           -
Agilent                  Nanopores
Visigen Biotech          -
Complete Genomics        SbL

Sequencing components

1. Applications & goals
2. Cost, accuracy, continuity goals
3. Source, consent, ELSI
4. Sample prep
5. Technology development, deployment, scaling
6. Software: data acquisition to interpretation
7. Human interface, education

Sequencing applications

1. Environment (genetic): maternal, allergens, microbes
2. Small mutations: whole genome vs targeted
3. DNA copy number & rearrangements (paired ends)
4. Exons conserved &/or mutable regions
5. Haplotype: LD &/or causative combinations in cis
6. RNA Digital Analysis of Gene Expression (by counting)
7. RNA splicing (that arrays can’t handle)
8. Proteomics: MS, Ab, aptamers
9. Metabolomics: MS, Ab, aptamers
10. Microbial evolution resequencing (needs consensus accuracy)
11. Cancer resequencing
12. Gene synthesis by sequencing (needs raw accuracy)
13. DNA methylation

Why single chromosome sequencing? (or single cell or single particle?)

(1) When we only have one cell as in Preimplantation Genetic Diagnosis (PGD) or environmental samples

(2) Sequence relations >100 kbp (haplotypes)

(3) Prioritizing or pooling (rare) species based on an initial DNA screen

(4) Anything relating 2 or more chromosomes (in a cell or virus)

(5) Cell-cell interactions (e.g. predator-prey, symbionts, commensals, parasites, etc)

Zhang et al. Nature Genet. Mar 2006

Method #1: ‘in situ’ haplotyping

Sequencing/genotyping on single human chromosomes

153 Mbp

Method #2: Chromosome dilution library

QC: Reverse-FISH of amplicons

Sequencing/genotyping on single human chromosomes

Amplicon 19

Amplicon 6q

Single chromosome molecule sequencing

• How?
– Isothermal Strand Displacement Amplification from a single chromosome (Ploning)
– Shotgun sequencing on the amplicon

• Challenges
– Non-specific amplification competes with a single template molecule
– Amplicons have high-order DNA structures, which create issues in sequencing library construction

Reduce chimeras when cloning from SDA Plones

Single cell chromosome molecule sequencing

Phi-29 debranching

S1 nuclease digestion

DNA pol I nick translation

From 19% to 6%

Single cell chromosome molecule sequencing

Chromosome #            #1           #2
# Good seq reads        7,166        10,660
Average length (bp)     769.4        676.6
Total length (bp)       5,513,520    7,212,556
# unknown seqs          12           10
# vectors               23           44
# other seqs            74           2
% genome sampled        63%          67%

Plone amplification errors: < 1.7×10^-5

Ploning & sequencing 2.5 Mbp molecules

In vitro paired tag libraries

Bead polonies via emulsion PCR

Monolayer gel immobilization

Enrich amplified beads

SOFTWARE

Images → Tag Sequences

Tag Sequences → Genome

SBE or SBL sequencing

Epifluorescence & Flow Cell

Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum, Wang, Zhang, Mitra, Church (2005) Science 309:1728.

Integrated Polony Sequencing Pipeline (open source hardware, software, wetware)

Paired-end libraries

[Diagram: paired-end library construction. Labels as extracted: L, M, R; + ligate; dilute, ligate; amplify; shear or Nla III digest; select; hRCA; digest (Mme I); ligate; amplify; ePCR]

Shendure, Porreca, et al. (2005) Science 309: 1728
Margulies et al. (2005) Nature 437: 376.

Distribution of Distances Between Mate-Paired Tags

[Plot: frequency vs. distance (bp), with axis marks at 1.0 kb and 2.0 kb. Annotations: 980 ± 96 bp; 10.7 bp FT]

[Diagram: amplicon on an ePCR bead with 3’ and 5’ ends marked, carrying Tag 1 and Tag 2; segment labels: 7 bp, 6 bp, 7 bp, 6 bp]

Each yields 6 to 7 bp of contiguous sequence

34 bp new sequence per 135 bp amplicon

4 positions for paired-end anchor 'primers'

L M R

ACUCAUC…(3’)…TAGAGT????????????????TGAGTAG…(5’)

5’-Cy5-nnnnAnnnn-3’
5’-Cy3-nnnnGnnnn-3’
5’-TR-nnnnCnnnn-3’
5’-Cy3+Cy5-nnnnTnnnn-3’

5'PO4

Sequencing by Ligation (SBL) with fluorescent combinatorial 9-mers

Excitation (nm)    Emission (nm)
647                700
555                605
572                630
555                700

Shendure, Porreca, et al. (2005) Science 309:1728
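The four probe pools above pair each dye with one central base (Cy5 with A, Cy3 with G, TR with C, Cy3+Cy5 with T), so a base call at the queried position reduces to picking the brightest channel. The following is a minimal sketch of that idea, not the pipeline's actual base caller: the per-channel intensity dictionary, the function names, and the example values are all hypothetical.

```python
# Dye-to-base assignments are taken from the four probe pools on this slide.
# Representing a bead's signal as a dict of channel intensities is an assumption.
DYE_TO_BASE = {"Cy5": "A", "Cy3": "G", "TR": "C", "Cy3+Cy5": "T"}


def call_base(intensities):
    """Return the base whose probe channel has the highest intensity."""
    brightest = max(intensities, key=intensities.get)
    return DYE_TO_BASE[brightest]


def call_tag(cycles):
    """Call one base per ligation cycle and join the calls in cycle order."""
    return "".join(call_base(cycle) for cycle in cycles)


# Hypothetical intensities for one bead over three ligation cycles.
example_cycles = [
    {"Cy5": 900, "Cy3": 40, "TR": 55, "Cy3+Cy5": 30},
    {"Cy5": 35, "Cy3": 60, "TR": 870, "Cy3+Cy5": 25},
    {"Cy5": 20, "Cy3": 45, "TR": 50, "Cy3+Cy5": 910},
]
print(call_tag(example_cycles))  # ACT
```

A real caller would also need background correction and a quality score per call; this sketch only shows the color-to-base lookup.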

Automation Schematic

HPLC autosampler (96 wells)

syringe pump

microscope & xyz stage

flow-cell

temperature control

Off the Shelf Instrumentation

$140,000

Mitra

Porreca

Shendure

Image Collection & Data Processing

514 raster positions x 4 images per cycle

26 cycles of sequencing

2 additional image sets for object-finding algorithms

54996 images (1000 x 1000, 14-bit)

Porreca et al.

100 GBytes, 5M reads, $500 per run
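The roughly 100 GByte figure can be checked against the image count on this slide. The sketch below assumes each 14-bit pixel is stored in a 16-bit (2-byte) word, which the slide does not state.

```python
# Image count and dimensions are from the slide.
IMAGES = 54_996
PIXELS_PER_IMAGE = 1000 * 1000
BYTES_PER_PIXEL = 2  # assumption: 14-bit pixels stored in 16-bit words

total_bytes = IMAGES * PIXELS_PER_IMAGE * BYTES_PER_PIXEL
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.1f} GB")  # 110.0 GB
```

Under that storage assumption the raw images come to about 110 GB per run, in line with the quoted 100 GBytes.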

Open Source Readmapper

v1.0 (Shendure, Porreca et al)

• Hash all the reads (n)
• Scan genome (m), and for each window:
– Does current window exist in hash?
– If so, move downstream, scan d positions & test hash for membership

m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads

v2.0 (Gary Gao, Sasha Wait)

• Hash all possible reads from genome (m)
• Scan the reads (n), and for each:
– Does it occur in the hash?
– If so, does the second exist?
– If so, take union (k)

n * k = 10 hours, 1 node, 1.6e6 reads
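The second strategy (hash the genome, then scan the reads) can be sketched in a few lines. This is an illustrative reading of the bullets, not the actual open source readmapper: it matches exact, forward-strand tags only, and the function names, the distance window, and the toy genome are hypothetical. The slide's final step ("take union") is interpreted here as keeping placements where both tags hit within the allowed distance.

```python
from collections import defaultdict


def index_genome(genome, tag_len):
    """Hash every tag-length window of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - tag_len + 1):
        index[genome[pos:pos + tag_len]].append(pos)
    return index


def map_pair(index, tag1, tag2, min_dist, max_dist):
    """Return (pos1, pos2) placements where both tags hit and the gap fits."""
    hits1 = index.get(tag1, [])
    if not hits1:          # first tag absent: no need to test the second
        return []
    hits2 = index.get(tag2, [])
    return [(p1, p2) for p1 in hits1 for p2 in hits2
            if min_dist <= p2 - p1 <= max_dist]


# Toy genome: tag1 starts at position 1, tag2 at position 18.
genome = "AACGTGCA" + "T" * 10 + "GGATCCAGAA"
index = index_genome(genome, 6)
print(map_pair(index, "ACGTGC", "GGATCC", 10, 20))  # [(1, 18)]
```

Building the index costs one pass over the genome; each read pair then needs only two hash lookups plus a comparison of the candidate positions, which is where the n * k term comes from.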

Error quantitation

Median raw:

Polony = 3E-3 (99.7%)

454 raw = 4E-2 (96%)

Shendure, Porreca et al, 2005

6X consensus < 3E-7 [>Q65, 99.99997%]
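The quality and accuracy figures follow from the standard Phred relation Q = -10 log10(error rate). A short sketch that reproduces the conversion for the three error rates on this slide (the function name is illustrative):

```python
import math


def phred(error_rate):
    """Convert a per-base error rate into a Phred quality score."""
    return -10 * math.log10(error_rate)


rates = [("Polony raw", 3e-3), ("454 raw", 4e-2), ("6X consensus", 3e-7)]
for label, rate in rates:
    accuracy = 100 * (1 - rate)
    print(f"{label}: Q{phred(rate):.1f}, accuracy {accuracy:.5f}%")
```

An error rate of 3E-7 gives a score a little above Q65, which matches the ">Q65" annotation.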

Cost vs consensus error rate

[Plot: axis ticks 0 to 18 and 1E-8 to 1E-2; series: SBL $/kb, ABI $/kb, 454 $/kb]

Column labels as extracted: ABI, 454, Polony; dates: Sep05, Feb 06

$/kb @4E-5      $7      $9      0.8     0.07
$/3e9 @1X       3M      300K    $30K
Paired ends     yes     no      yes
Device $        300K    500K    140K

Consensus error rate        Total errors (E. coli)    Total errors (Human)
1E-4 Bermuda/HapMap         500                       600,000
4E-5 454 @40X               200                       240,000
3E-7 Polony-SbL @6X         0                         1800
1E-8 Goal for 2006          0                         60
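The human column is simply the consensus error rate multiplied by the number of bases called. The sketch below assumes a diploid human genome of 6e9 bp; that size is not stated on the slide but is implied by 600,000 errors at 1E-4.

```python
# Assumption: 6e9 bp diploid human genome, inferred from 600,000 / 1E-4.
HUMAN_DIPLOID_BP = 6e9


def expected_errors(error_rate, genome_bp):
    """Expected number of wrong consensus calls across a genome."""
    return error_rate * genome_bp


rates = {
    "Bermuda/HapMap": 1e-4,
    "454 @40X": 4e-5,
    "Polony-SbL @6X": 3e-7,
    "Goal for 2006": 1e-8,
}
for label, rate in rates.items():
    print(label, round(expected_errors(rate, HUMAN_DIPLOID_BP)))
```

Even at 3E-7, a whole diploid genome would still carry on the order of 1800 false calls, which is why the 1E-8 goal matters for variant discovery.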

Goal of genotyping & resequencing: Discovery of variants. E.g. cancer somatic mutations ~1E-6 (or lab evolved cells)

Why low error rates?

Also, effectively reduce (sub)genome target size by enrichment for exons or common SNPs to reduce cost & # false positives.

Columns: Position, Type, Gene, Location, ABI Confirm, Comments

Position     Type     Gene     Location        Comments
986,334      T > G    ompF     Promoter -10    Only in evolved strain
985,797      T > G    ompF     Glu > Ala       Only in evolved strain
931,960      Δ8 bp    lrp      frameshift      Only in evolved strain
3,957,960    C > T    ppiC     5' UTR          MG1655 heterogeneity
-3274        T > C    cI       Glu > Glu       red heterogeneity
-9846        T > C    ORF61    Lys > Gly       red heterogeneity

Mutation Discovery in Engineered/Evolved E.coli

Shendure, Porreca, et al. (2005) Science 309:1728

[Plot: doubling time (hr), 0 to 8, vs. # of passages, 0 to 150; series: Q1, Q3, Q2-1, Q2-2, EcNR1]

Sequence monitoring of evolution (optimize small molecule synthesis/transport)

Sequence trp-

Reppas, Lin & Church

• Glu-117 → Ala (in the pore)

• Charged residue known to affect pore size and selectivity

• Promoter mutation at position (-12)

• Makes -10 box more consensus-like

-12 -11 -10 -9 -8 -7 -6

AAAGAT

CAAGAT

Can increase import & export capability simultaneously

ompF - non-specific transport channel

3 independent lines of Trp/Tyr co-culture frozen.

OmpF: 42 R->G, L, C; 113 D->V; 117 E->A
Promoter: -12 A->C, -35 C->A
Lrp: 1bp deletion, 9bp deletion, 8bp deletion, IS2 insertion, R->L in DBD.

Heterogeneity within each time-point reflecting colony heterogeneity.

Co-evolution of mutual biosensors, sequenced across time & within each time-point

[Figure: proximal tag placement vs. distal tag placement, genome positions 1200000 to 1216001, with 1,206k to 1,210k marked. Points are pairs with incorrect distance: red = same strand, black = opposite strand]

Mixture of wild & 2kb Inversion (pin)

Using paired ends, rearrangement & copy-number detection is >1000X easier than point mutation detection (6X vs 6000X)
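Flagging discordant mate pairs of this kind can be sketched as a distance test plus a strand comparison. The expected distance (980 ± 96 bp) comes from the mate-pair distance distribution shown earlier; the 3-standard-deviation tolerance, the function name, and the example coordinates are assumptions for illustration only.

```python
EXPECTED_MEAN = 980   # bp, from the mate-paired tag distance distribution
EXPECTED_SD = 96      # bp
TOLERANCE_SDS = 3     # assumption: accept distances within 3 SD of the mean


def classify_pair(pos1, strand1, pos2, strand2):
    """Label a mapped tag pair using the same categories as the figure."""
    distance = abs(pos2 - pos1)
    if abs(distance - EXPECTED_MEAN) <= TOLERANCE_SDS * EXPECTED_SD:
        return "expected distance"
    if strand1 == strand2:
        return "incorrect distance, same strand"
    return "incorrect distance, opposite strand"


# Hypothetical placements near the 1,206k to 1,210k region.
print(classify_pair(1_206_000, "+", 1_206_950, "-"))
print(classify_pair(1_206_000, "+", 1_209_500, "+"))
```

Because every pair spanning a breakpoint is informative, a handful of consistent discordant pairs is enough to call a rearrangement, which is the intuition behind the 6X vs 6000X comparison.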

Human Diplome Sequencing Strategies

Open source hardware, software, wetware

• 1M Causative Genome Changes, CGCs (10X MIP pool $20)
• Exons & conserved 3% (6X $9K)
• Diplome chromosome dilution shotgun (0.01X $300)
• 40K RNA diplome (10X MIP pool $20)
• Strand displacement amplification (ploning)
• Polony sequencing, 7E8 pixels
• Chip Genotyping/Haplotyping
• Personal Genome Project (ELSI)

Padlock, Molecular Inversion Probes (MIPs)

Causative Genomic Changes (CGCs, e.g. conserved 3%) (not restricted to Single Nucleotides or Polymorphisms >1%)

Hardenbol .. Landegren Davis et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol. 2003 21:673-8.

“10,000 targeted SNPs genotyped in a single tube assay.” Genome Res. 2005 15:269

Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human Genetic Disease. Genome Biol. 4: R72. (CG to CA, TG)

[Diagram: molecular inversion probe annealed to genomic DNA (CGCATG). Labels: alternative alleles; universal primers; L; R; optional multiplex tag]

MIPs for VDJ Polonies

http://www.infobiogen.fr/services/chromcancer/Genes/TCRBID24.html

Over the whole field of human T-cells

1 TRAC + 2 TRBC primers

cDNA

47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos

or

47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos

In situ RCA or PCR for each T-cell

Polony sequencing of tag &/or gap fill (e.g. 18 to 33 bp in CDR3) (two tags per cell sufficient?)
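The two probe-set sizes are products of segment counts summed over the alpha and beta chains. The sketch below reproduces both totals; it reads the second V-segment count (46) as beta-chain V segments, named TRBV here, which is an assumption about the slide's labeling.

```python
# Segment counts from the slide. TRBV is an assumed name for the 46 V
# segments that are paired with TRBJ and TRBC.
TRAV, TRAJ, TRAC = 47, 50, 1
TRBV, TRBJ, TRBC = 46, 13, 2

# Design 1: one probe per V-J combination on each chain.
vj_design = TRAV * TRAJ + TRBV * TRBJ

# Design 2: one probe per V-constant combination on each chain.
vc_design = TRAV * TRAC + TRBV * TRBC

print(vj_design, vc_design)  # 2948 139
```

Anchoring the second arm in the constant region instead of the J segment cuts the probe set roughly 20-fold, at the cost of a longer gap to fill across CDR3.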


‘Next Generation’ Technology Development

Multi-molecule           Our role
Affymetrix               Software
Gorfinkel                Polony to Capillary
454 LifeSci              Paired ends, emulsion
Lynx/Solexa              Multiplexing & polony
Agencourt                Seq by Ligation (SbL)

Single molecules
Helicos Biosci           SAB, cleavable fluors
Pacific Biosci           -
Agilent                  Nanopores
Visigen Biotech          -
Complete Genomics        SbL

Human subjects consent

“Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.”

YRI = Yoruba, Ibadan, Nigeria
JPT = Japan, Tokyo
CHB = China (Han), Beijing
CEU = CEPH (N&W Europe), Utah

http://www.hapmap.org/downloads/elsi/CEPH_Reconsent_Form.pdf

Is anonymity in genomics realistic? http://arep.med.harvard.edu/PGP/Anon.htm

(1) Re-identification after “de-identification” using other public data. Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998).

(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the U. Washington Medical Center: 4000 files (names, conditions, etc.; 2000).

(3) Combination of surnames from genotype with geographical info. An anonymous sperm donor was traced on the internet in 2005 by his 15-year-old son, who used his own Y chromosome genealogy to access surname relations.

(4) Inferring phenotype from genotype. Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known & the list is growing.

(5) Unexpected self-identification. An example of this at Celera undermined confidence in the investigators. Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.

(6) A tiny amount of DNA data in the public domain with a name leverages the rest. This would allow the vast amount of DNA data in the HapMap (or other study) to be identified. This can happen, for example, in court cases even if the suspect is acquitted.

(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.

"Open-source" Personal Genome Project (PGP)

• Harvard Medical School IRB Human Subjects protocol submitted Sep-2004, approved Aug-2005 renewed Feb-2006.

• Start with 3 highly-informed individuals consenting to non-anonymous genomes & extensive phenotypes (medical records, imaging, omics).

• Cell lines in Coriell NIGMS Repository

Church GM (2005) The Personal Genome Project. Nature Molecular Systems Biology doi:10.1038/msb4100040

Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource. N Engl J Med. 10;353(19):2074-7.

It is likely that less-privileged citizens ‘might be’ less likely to volunteer & will be more likely to volunteer due to higher financial risk. These same people ‘might be’ even less likely to volunteer if the data might become public. These same folks might be especially impacted socially if identifying (genome and/or phenome) data were to get out after they were assured that it would not.

Discussion: Ascertainment bias vs. risk of disclosure without consent.

Five categories:

1) Withdrawal from studies due to new information on risks (all data destroyed).

2) Highest security (possibly higher than the original study): encryption, aggressive de-identification, only expert access with IRB-approval of each person, not whole teams. Consent form clearly states the risks (see previous slides).

3) Medium security, similar to current practice, but consented as above. IRB approval for teams to download de-identified data.

4) Open-PGP-type security. Click-through agreement. IRB-approval only for data collection, not for data reading.

5) Fully open. No IRB approval; full web access e.g. subject initiated.

Proposal for multi-tiered (re)consent of subjects in genomic studies
