George Church Thu 27-Apr-2006 9:30-11 Broad-MPG
Thanks to:
New Sequencing Technologies & Diploid Personal Genomes
NHGRI Seq Tech 2004: Agencourt, 454, Microchip, 2005: Nanofluidics, Network, VisiGen Affymetrix, Helicos, Solexa-Lynx
‘Next Generation’ Technology Development
Multi-molecule | Our role
Affymetrix | Software
Gorfinkel | Polony to Capillary
454 LifeSci | Paired ends, emulsion
Lynx/Solexa | Multiplexing & polony
Agencourt | Seq by Ligation (SbL)
Single molecules |
Helicos Biosci | SAB, cleavable fluors
Pacific Biosci | -
Agilent | Nanopores
Visigen Biotech | -
Complete Genomics | SbL
Sequencing components
1. Applications & goals
2. Cost, accuracy, continuity goals
3. Source, consent, ELSI
4. Sample prep
5. Technology development, deployment, scaling
6. Software: data acquisition to interpretation
7. Human interface, education
Sequencing applications
1. Environment (genetic): maternal, allergens, microbes
2. Small mutations: whole genome vs targeted
3. DNA copy number & rearrangements (paired ends)
4. Exons conserved &/or mutable regions
5. Haplotype: LD &/or causative combinations in cis
6. RNA Digital Analysis of Gene Expression (by counting)
7. RNA splicing (that arrays can’t handle)
8. Proteomics: MS, Ab, aptamers
9. Metabolomics: MS, Ab, aptamers
10. Microbial evolution resequencing (needs consensus accuracy)
11. Cancer resequencing
12. Gene synthesis by sequencing (needs raw accuracy)
13. DNA methylation
Why single chromosome sequencing? (or single cell or single particle?)
(1) When we only have one cell as in Preimplantation Genetic Diagnosis (PGD) or environmental samples
(2) Sequence relations >100 kbp (haplotypes)
(3) Prioritizing or pooling (rare) species based on an initial DNA screen
(4) Anything relating 2 or more chromosomes (in a cell or virus)
(5) Cell-cell interactions (e.g. predator-prey, symbionts, commensals, parasites, etc)
Zhang et al. Nature Genet. Mar 2006
Method #1: ‘in situ’ haplotyping
Sequencing/genotyping on single human chromosomes
153Mbp
Method#2: Chromosome dilution library QC: Reverse-FISH of amplicons
Sequencing/genotyping on single human chromosomes
Amplicon 19
Amplicon 6q
Single chromosome molecule sequencing
• How?
– Isothermal Strand Displacement Amplification from a single chromosome (Ploning)
– Shotgun sequencing on the amplicon
• Challenges
– Non-specific amplification competes with a single template molecule
– Amplicons have high-order DNA structures, which create issues in sequencing library construction
Reduce chimeras when cloning from SDA Plones
Single cell chromosome molecule sequencing
Phi-29 debranching
S1 nuclease digestion
DNA pol I nick translation
Chimeras: from 19% to 6%
Single cell chromosome molecule sequencing
Chromosome # | #1 | #2
# Good seq reads | 7,166 | 10,660
Average length (bp) | 769.4 | 676.6
Total length (bp) | 5,513,520 | 7,212,556
# unknown seqs | 12 | 10
# vectors | 23 | 44
# other seqs | 74 | 2
% genome sampled | 63% | 67%
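As a quick consistency check on the table above, the total length reported for each chromosome should equal the number of good reads times the average read length. The sketch below only re-derives that one row; the function name is illustrative and not from the talk.

```python
def total_length_bp(good_reads, average_length_bp):
    """Total sequenced length implied by a read count and a mean read length."""
    return round(good_reads * average_length_bp)

# Values copied from the table for chromosomes #1 and #2.
chromosome_1 = total_length_bp(7166, 769.4)
chromosome_2 = total_length_bp(10660, 676.6)

if __name__ == "__main__":
    print(chromosome_1, chromosome_2)
```

Both products agree with the "Total length (bp)" row, so the three rows are internally consistent.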
Plone amplification errors: < 1.7×10^-5
Ploning & sequencing 2.5 Mbp molecules
In vitro paired tag libraries
Bead polonies via emulsion PCR
Monolayer gel immobilization
Enrich amplified beads
SOFTWARE
Images → Tag Sequences
Tag Sequences → Genome
SBE or SBL sequencing
Epifluorescence & Flow Cell
Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum, Wang, Zhang, Mitra, Church (2005) Science 309:1728.
Integrated Polony Sequencing Pipeline (open source hardware, software, wetware)
Paired-end libraries
[Figure: library construction workflow. Step labels: + ligate; dilute, ligate; amplify; shear or Nla III digest; select; hRCA; digest; Mme I; ligate; amplify; ePCR. Construct regions labeled L, M, R.]
Shendure, Porreca, et al. (2005) Science 309: 1728
Margulies et al. (2005) Nature 437: 376.
Distribution of Distances Between Mate-Paired Tags
[Figure: histogram of frequency vs. distance (bp), axis marks at 1.0 kb and 2.0 kb; peak at 980 ± 96 bp; inset label: 10.7 bp FT]
3’5’ Tag 1 ePCR bead
7 bp 6 bp 7 bp 6 bp
Tag 2
Each yields 6 to 7 bp of contiguous sequence
34 bp new sequence per 135 bp amplicon
4 positions for paired-end anchor 'primers'
L M R
ACUCAUC…(3’)…TAGAGT????????????????TGAGTAG…(5’)
5’-Cy5-nnnnAnnnn-3’
5’-Cy3-nnnnGnnnn-3’
5’-TR-nnnnCnnnn-3’
5’-Cy3+Cy5-nnnnTnnnn-3’
5'PO4
Sequencing by Ligation (SBL) with fluorescent combinatorial 9-mers
Excitation / Emission (nm)
647 / 700
555 / 605
572 / 630
555 / 700
Shendure, Porreca, et al. (2005) Science 309:1728
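The nonamer scheme on this slide ties each fluorophore to the base at the query position of the ligated probe (Cy5 to A, Cy3 to G, TR to C, Cy3+Cy5 to T). A minimal sketch of how one base per ligation cycle could be called from four channel intensities follows. The brightest-channel rule, the function names, and the intensity format are assumptions for illustration, not the pipeline's actual base caller, and the sketch reports the probe base without deciding whether the template or its complement is the convention.

```python
# Fluorophore-to-base mapping as listed on the slide (query position of the nonamer).
PROBE_BASE = {"Cy5": "A", "Cy3": "G", "TR": "C", "Cy3+Cy5": "T"}

def call_base(intensities):
    """Return the probe base for the brightest channel of one ligation cycle.

    `intensities` maps channel name to a measured intensity (hypothetical format).
    """
    brightest = max(intensities, key=intensities.get)
    return PROBE_BASE[brightest]

def call_read(cycles):
    """Concatenate one base call per ligation cycle into a read."""
    return "".join(call_base(cycle) for cycle in cycles)

if __name__ == "__main__":
    example = [
        {"Cy5": 0.9, "Cy3": 0.1, "TR": 0.0, "Cy3+Cy5": 0.2},
        {"Cy5": 0.1, "Cy3": 0.2, "TR": 0.8, "Cy3+Cy5": 0.1},
    ]
    print(call_read(example))
```

A real caller would also need per-channel normalization and a quality score; those are left out here.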
Automation Schematic
[Figure components: HPLC autosampler (96 wells); syringe pump; flow-cell; temperature control; microscope & xyz stage]
Off the Shelf Instrumentation
$140,000
Mitra
Porreca
Shendure
Image Collection & Data Processing
514 raster positions x 4 images per cycle
26 cycles of sequencing
2 additional image sets for object-finding algorithms
54996 images (1000 x 1000, 14-bit)
Porreca et al.
100 GBytes
5M reads
$500 run
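The image counts and the roughly 100 GByte figure on this slide can be related with a little arithmetic. The sketch below assumes each 14-bit pixel is stored as a 16-bit (2-byte) word, which the slide does not state. It computes the images from the 26 sequencing cycles alone and the raw data volume of the stated 54,996 images. It does not try to reproduce how the object-finding image sets bring the total to 54,996.

```python
RASTER_POSITIONS = 514          # from the slide
IMAGES_PER_POSITION = 4         # images per raster position per cycle (slide)
SEQUENCING_CYCLES = 26          # from the slide
TOTAL_IMAGES = 54996            # total stated on the slide
PIXELS_PER_IMAGE = 1000 * 1000  # 1000 x 1000 images (slide)
BYTES_PER_PIXEL = 2             # assumption: 14-bit pixels stored as 16-bit words

def sequencing_images():
    """Images acquired during the sequencing cycles only."""
    return RASTER_POSITIONS * IMAGES_PER_POSITION * SEQUENCING_CYCLES

def total_bytes():
    """Uncompressed size of all images under the 2-bytes-per-pixel assumption."""
    return TOTAL_IMAGES * PIXELS_PER_IMAGE * BYTES_PER_PIXEL

if __name__ == "__main__":
    print(sequencing_images())      # images from the 26 cycles
    print(total_bytes() / 1e9)      # size in GB (decimal)
```

Under that storage assumption the run comes to about 110 GB, in line with the slide's round figure of 100 GBytes.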
Open Source Readmapper
v1.0 (Shendure, Porreca et al)
• Hash all the reads (n)
• Scan genome (m), and for each window:
– Does current window exist in hash?
– If so, move downstream, scan d positions & test hash for membership
m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads
v2.0 (Gary Gao, Sasha Wait)
• Hash all possible reads from genome (m)
• Scan the reads (n), and for each:
– Does it occur in the hash?
– If so, does the second exist?
– If so, take union (k)
n * k = 10 hours, 1 node, 1.6e6 reads
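The strategy that hashes the genome and then scans the reads can be sketched in a few lines. This is a toy illustration under stated assumptions, not the open-source read mapper itself: it uses exact matching only, ignores the reverse strand and sequencing errors, and the function names and the distance window are hypothetical.

```python
from collections import defaultdict

def build_index(genome, tag_len):
    """Hash every tag_len-mer of the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - tag_len + 1):
        index[genome[pos:pos + tag_len]].append(pos)
    return index

def map_pairs(index, pairs, min_dist, max_dist):
    """For each (tag1, tag2) pair, keep placements whose separation is in range."""
    placements = {}
    for tag1, tag2 in pairs:
        hits1 = index.get(tag1, [])
        if not hits1:
            continue  # first tag absent: no need to look up the second
        hits2 = index.get(tag2, [])
        good = [(p1, p2) for p1 in hits1 for p2 in hits2
                if min_dist <= p2 - p1 <= max_dist]
        if good:
            placements[(tag1, tag2)] = good
    return placements

if __name__ == "__main__":
    genome = "AAAACCCCGGGGTTTTACGT"
    index = build_index(genome, tag_len=4)
    print(map_pairs(index, [("AAAA", "TTTT"), ("AAAA", "GGGA")], 8, 16))
```

Indexing costs one pass over the genome, and each read pair then costs two hash lookups plus a comparison over its candidate placements, which is the n * k shape quoted on the slide.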
Error quantitation
Median raw
Polony = 3E-3 (99.7%)
454 raw = 4E-2 (96%)
Shendure, Porreca et al, 2005
6X consensus <3E-7 [>Q65, 99.99997%]
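The Q value and accuracy percentages quoted here follow from the standard Phred relation Q = -10 log10(p), where p is the per-base error probability. A small sketch with illustrative function names:

```python
import math

def phred(error_rate):
    """Phred quality score for a per-base error probability."""
    return -10.0 * math.log10(error_rate)

def accuracy_percent(error_rate):
    """Per-base accuracy as a percentage."""
    return 100.0 * (1.0 - error_rate)

if __name__ == "__main__":
    for label, rate in [("Polony raw", 3e-3), ("454 raw", 4e-2), ("6X consensus", 3e-7)]:
        print(label, round(phred(rate), 1), accuracy_percent(rate))
```

An error rate of 3E-7 gives Q of about 65.2, matching the ">Q65" on the slide, while the raw rates of 3E-3 and 4E-2 correspond to roughly Q25 and Q14.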
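Cost vs consensus error rate
[Figure: cost curves vs. consensus error rate (horizontal axis 1E-8 to 1E-2, vertical axis 0 to 18); series: SBL $/kb, ABI $/kb, 454 $/kb]
 | ABI | 454 | Polony
Date | Sep05 | Sep05 | Sep05, Feb 06
$/kb @4E-5 | $7 | $9 | 0.8 (Sep05), 0.07 (Feb 06)
$/3e9 @1X | 3M | 300K | $30K
Paired ends | yes | no | yes
Device $ | 300K | 500K | 140K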
Consensus error rate | | Total errors (E.coli) | Total errors (Human)
1E-4 | Bermuda/Hapmap | 500 | 600,000
4E-5 | 454 @40X | 200 | 240,000
3E-7 | Polony-SbL @6X | 0 | 1800
1E-8 | Goal for 2006 | 0 | 60
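The error totals in this table are consistent with multiplying each consensus error rate by a genome size. The human column is reproduced exactly with 6e9 bp (a diploid genome), and the E. coli column is approximated with about 4.6e6 bp. Both genome sizes are assumptions used for this check, not figures given on the slide.

```python
HUMAN_DIPLOID_BP = 6_000_000_000   # assumption: diploid human genome
E_COLI_BP = 4_600_000              # assumption: approximate E. coli genome

def expected_errors(consensus_error_rate, genome_bp):
    """Expected number of erroneous consensus bases across a genome."""
    return consensus_error_rate * genome_bp

if __name__ == "__main__":
    for label, rate in [("Bermuda/Hapmap", 1e-4), ("454 @40X", 4e-5),
                        ("Polony-SbL @6X", 3e-7), ("Goal for 2006", 1e-8)]:
        print(label,
              round(expected_errors(rate, E_COLI_BP), 1),
              round(expected_errors(rate, HUMAN_DIPLOID_BP)))
```

The E. coli values on the slide appear to be rounded (for example about 460 expected errors shown as 500), and the zeros for the last two rows correspond to expectations of roughly one error or fewer per genome.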
Why low error rates?
Goal of genotyping & resequencing: discovery of variants, e.g. cancer somatic mutations ~1E-6 (or lab evolved cells)
Also, effectively reduce (sub)genome target size by enrichment for exons or common SNPs to reduce cost & # false positives.
Position | Type | Gene | Location | ABI Confirm | Comments
986,334 | T > G | ompF | Promoter -10 | | Only in evolved strain
985,797 | T > G | ompF | Glu > Ala | | Only in evolved strain
931,960 | Δ8 bp | lrp | frameshift | | Only in evolved strain
3,957,960 | C > T | ppiC | 5' UTR | | MG1655 heterogeneity
-3274 | T > C | cI | Glu > Glu | | red heterogeneity
-9846 | T > C | ORF61 | Lys > Gly | | red heterogeneity
Mutation Discovery in Engineered/Evolved E.coli
Shendure, Porreca, et al. (2005) Science 309:1728
[Figure: doubling time (hr, axis 0 to 8) vs. # of passages (axis 0 to 150); series: Q1, Q3, Q2-1, Q2-2, EcNR1]
Sequence monitoring of evolution (optimize small molecule synthesis/transport)
Sequence trp-
Reppas, Lin & Church
• Glu-117 → Ala (in the pore)
• Charged residue known to affect pore size and selectivity
• Promoter mutation at position (-12)
• Makes -10 box more consensus-like
-12 -11 -10 -9 -8 -7 -6
AAAGAT
CAAGAT
Can increase import & export capability simultaneously
ompF - non-specific transport channel
3 independent lines of Trp/Tyr co-culture frozen.
OmpF: 42 R->G, L, C; 113 D->V; 117 E->A
Promoter: -12 A->C, -35 C->A
Lrp: 1bp deletion, 9bp deletion, 8bp deletion, IS2 insertion, R->L in DBD.
Heterogeneity within each time-point reflecting colony heterogeneity.
Co-evolution of mutual biosensors
sequenced across time & within each time-point
Mixture of wild & 2kb Inversion (pin)
[Figure: proximal tag placement vs. distal tag placement over positions 1200000 to 1216001, with marks at 1,206k and 1,210k; pairs at incorrect distance are plotted, red = same strand, black = opposite strand]
Using paired ends, rearrangement & copy-number detection is >1000X easier than point mutation detection (6X vs 6000X)
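One way to picture how paired ends expose rearrangements: each mate pair is checked against the expected tag separation, and pairs at an incorrect distance are then split by whether the two tags map to the same or to opposite strands, as in the figure legend. The sketch below uses the 980 ± 96 bp separation reported earlier in the talk; the three-standard-deviation tolerance, the function name, and the strand encoding are assumptions for illustration only.

```python
def classify_pair(pos1, strand1, pos2, strand2, mean=980.0, sd=96.0, n_sd=3.0):
    """Label a mate pair by tag separation and strand relationship.

    Positions are genome coordinates of the two tags; strands are '+' or '-'.
    The tolerance of n_sd standard deviations is an illustrative choice.
    """
    distance = abs(pos2 - pos1)
    if abs(distance - mean) <= n_sd * sd:
        return "expected distance"
    if strand1 == strand2:
        return "incorrect distance, same strand"
    return "incorrect distance, opposite strand"

if __name__ == "__main__":
    print(classify_pair(1206000, "+", 1207000, "+"))
    print(classify_pair(1206000, "+", 1210000, "-"))
```

A cluster of pairs with the same aberrant label around one locus is the signal of interest, which is why far less coverage is needed than for calling a point mutation.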
Human Diplome Sequencing Strategies
Open source hardware, software, wetware
1M Causative Genome Changes (CGCs) (10X MIP pool $20)
Strand displacement amplification (ploning)
Polony sequencing 7E8 pixels
Chip Genotyping/Haplotyping
Exons & conserved 3% (6X $9K)
Diplome chromosome dilution shotgun (0.01X $300)
40K RNA diplome (10X MIP pool $20)
Personal Genome Project (ELSI)
Padlock, Molecular Inversion Probes (MIPs)
Causative Genomic Changes (CGCs, e.g. conserved 3%) (not restricted to Single Nucleotides or Polymorphisms >1%)
Hardenbol .. Landegren Davis et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol. 2003 21:673-8 . “10,000 targeted SNPs genotyped in a single tube assay.” Genome Res. 2005 15:269
Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human Genetic Disease. Genome Biol. 4: R72. (CG to CA, TG)
CGCATG
[Figure labels: genomic DNA; alternative alleles; universal primers (L, R); optional multiplex tag]
MIPs for VDJ Polonies
http://www.infobiogen.fr/services/chromcancer/Genes/TCRBID24.html
Over the whole field of human T-cells
1 TRAC + 2 TRBC primers
cDNA
47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos
or
47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos
In situ RCA or PCR for each T-cell
Polony sequencing of tag &/or gap fill (e.g. 18 to 33 bp in CDR3) (two tags per cell sufficient?)
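The two oligo counts above come from pairing each V segment with either every J segment or every constant-region primer, summed over the alpha and beta chains. The sketch below reproduces both totals, taking the 46-segment term as the beta-chain V segments paired with TRBJ or TRBC. The function and variable names are illustrative.

```python
def mip_oligo_count(chain_designs):
    """Sum of (V segments x partner segments) over the T-cell receptor chains."""
    return sum(v_segments * partners for v_segments, partners in chain_designs)

# (number of V segments, number of partner segments) for alpha and beta chains.
v_to_j_design = [(47, 50), (46, 13)]   # V paired with every J segment
v_to_c_design = [(47, 1), (46, 2)]     # V paired with constant-region primers

if __name__ == "__main__":
    print(mip_oligo_count(v_to_j_design))
    print(mip_oligo_count(v_to_c_design))
```

Anchoring the probes in the constant regions cuts the pool by about twentyfold, at the cost of a longer gap to fill across the J segment and CDR3.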
Human subjects consent
“Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.”
YRI = Yoruba, Ibadan, Nigeria
JPT = Japan, Tokyo
CHB = China (Han) Beijing
CEU = CEPH (N&W Europe) Utah
http://www.hapmap.org/downloads/elsi/CEPH_Reconsent_Form.pdf
Is anonymity in genomics realistic? http://arep.med.harvard.edu/PGP/Anon.htm
(1) Re-identification after “de-identification” using other public data. Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998).
(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the U. Washington Medical Center -- 4000 files (names, conditions, etc, 2000).
(3) Combination of surnames from genotype with geographical info. An anonymous sperm donor was traced on the internet in 2005 by his 15 year old son, who used his own Y chromosome genealogy to access surname relations.
(4) Inferring phenotype from genotype. Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known & the list is growing.
(5) Unexpected self-identification. An example of this at Celera undermined confidence in the investigators. Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.
(6) A tiny amount of DNA data in the public domain with a name leverages the rest. This would allow the vast amount of DNA data in the HapMap (or other study) to be identified. This can happen for example in court cases even if the suspect is acquitted.
(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.
"Open-source" Personal Genome Project (PGP)
• Harvard Medical School IRB Human Subjects protocol submitted Sep-2004, approved Aug-2005 renewed Feb-2006.
• Start with 3 highly-informed individuals consenting to non-anonymous genomes & extensive phenotypes (medical records, imaging, omics).
• Cell lines in Coriell NIGMS Repository
Church GM (2005) The Personal Genome Project. Nature Molecular Systems Biology doi:10.1038/msb4100040
Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource. N Engl J Med. 10;353(19):2074-7.
It is likely that less-privileged citizens ‘might be’ less likely to volunteer & will be more likely to volunteer due to higher financial risk. These same people ‘might be’ even less likely to volunteer if the data might become public. These same folks might be especially impacted socially if identifying (genome and/or phenome) data were to get out after they were assured that it would not.
Discussion: Ascertainment bias vs. risk of disclosure without consent.
Five categories:
1) Withdrawal from studies due to new information on risks (all data destroyed).
2) Highest security (possibly higher than the original study): encryption, aggressive de-identification, only expert access with IRB-approval of each person, not whole teams. Consent form clearly states the risks (see previous slides).
3) Medium security, similar to current practice, but consented as above. IRB approval for teams to download de-identified data.
4) Open-PGP-type security. Click-through agreement. IRB-approval only for data collection, not for data reading.
5) Fully open. No IRB approval; full web access e.g. subject initiated.
Proposal for multi-tiered (re)consent of subjects in genomic studies