Noam Shental
Department of Computer Science, The Open University of Israel
Identification of rare alleles and their carriers
using compressed se(que)nsing
The Task:
Identification of rare alleles and their carriers
Input
• Genes ("regions") of interest
• Large set of DNA samples
Output
• List of mutation loci (rare allele at position #1, rare allele at position #2).
• List of carriers
We try to perform this task with minimal resources
Motivation #1: Nationwide Carrier Screen
for known risk alleles
Example: Rare recessive genetic diseases
Genotype    Phenotype
Normal      Healthy
Carrier     Healthy!
Affected    Sick
Nationwide carrier screen
Large scale carrier screen (rates vary across ethnic groups)
Genetic Disorder          Carrier rate
Tay-Sachs                 1:25
Cystic Fibrosis           1:30
Familial Dysautonomia     1:30
Usher Syndrome            1:40
Canavan                   1:40
Glycogen Storage          1:71
Fanconi Anemia C          1:80
Niemann-Pick              1:80
Mucolipidosis type 4      1:100
Bloom                     1:102
Nemaline Myopathy         1:108
Specific mutations
Genetic Disorder          Carrier rate
Tay-Sachs                 1:25
HEXA gene on chromosome 15; over 100 mutations are known
TS (1277)        3.50%
TS (1421)        0.30%
TS (F304/305)    0.10%
TS (G269)        0.10%
TS (R170Q)       0.20%
Motivation #2: de novo SNP identification
A. Known genes
Tay-Sachs carriers: find new mutations
HEXA gene on chromosome 15; over 100 mutations are known
Motivation #2: de novo SNP identification
B. Association Studies
Cases vs. Controls
• "discoveries" - variations that are less prevalent in the control group.
• p-value
Identification of rare alleles and their carriers
using compressed se(que)nsing
Joint work with
Amnon Amir
Department of Physics of Complex Systems, Weizmann Institute of Science
Or Zuk
Broad Institute of MIT and Harvard
Nucleic Acids Research, Aug 2010, doi:10.1093/nar/gkq675
Looking for collaborations
Outline
• Motivation
• Overview of the approach
• Introduction:
compressed sensing (CS) and its correspondence to our problem
next generation sequencing technology (NGST)
• Model:
modeling NGST
• Simulation results:
"pure" in silico simulations
simulations based on experimental data
• Prior work
• Current research
• Conclusions
Specific mutations - notation
"A": Reference genome …AGCGTTCT…
"B": Single-nucleotide polymorphisms (SNPs) …AGTGTTCT…
"B": Insertions/Deletions (InDels) …AGGTTCT
Carrier test screen: amplify a sample of DNA and then test
Genotype:                                  "AA"   "AB"
Fraction of B's out of tested alleles:     0      1/2
Naïve approach - one test per individual
Collect DNA samples
Apply 9 independent tests
Genotype:                                AA  AB   AA  AA  AA  AB   AA  AA  AA
Fraction of B's out of tested alleles:   0   1/2  0   0   0   1/2  0   0   0
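The naïve scheme can be sketched in a few lines of Python. This is only an illustration: the genotype order is an assumption chosen to match the listed fractions, and `fraction_of_b` is a hypothetical helper, not part of any published package.

```python
from fractions import Fraction

# Nine samples, one test each. The genotype order is an assumption made
# to match the fractions listed on the slide; "B" marks the rare allele.
genotypes = ["AA", "AB", "AA", "AA", "AA", "AB", "AA", "AA", "AA"]

def fraction_of_b(genotype):
    """Fraction of B alleles among the two tested alleles of one individual."""
    return Fraction(genotype.count("B"), len(genotype))

# One independent test per individual: 9 samples require 9 tests.
fractions = [fraction_of_b(g) for g in genotypes]

# A carrier is any individual whose test shows a non-zero fraction of B's.
carriers = [i for i, f in enumerate(fractions) if f > 0]
```

Here the number of tests equals the number of samples, which is the cost the pooled design tries to avoid.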
Compressed sensing based group testing
[Figure: samples are pooled and measured with Next Generation Sequencing Technology; compressed sensing is then used to infer/reconstruct the carriers from a few tests instead of 9]
Example
[Plot: fraction of B's vs. # pools]
Take home message
• Generic approach that puts together sequencing and CS for identifying rare
alleles and their carriers.
• Much higher efficiency than the naïve approach - may significantly improve
cost effectiveness in future Association Studies, and in screening large DNA
cohorts for specific risk alleles.
• In our approach experimental costs (both sample preparation and direct
sequencing costs) are proportional to the number of pools and NOT to the
number of samples
The CS problem
Task: Infer the entries of a vector x with a minimal number of "operations"
$x = (x_1, x_2, \ldots, x_N)^T$
Assumption: x is sparse
Example: $x = (53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0)^T$, $m_i \in \{-1, +1\}^N$, $y_i = m_i \cdot x = 351$
The CS problem (cont)
How is it done?
We can select a vector $m_i$ as we wish, and get a "measurement": $y_i = m_i \cdot x$
We can repeat that k times, and get k "measurements": $y = M x$
where $M$ is the $k \times N$ matrix whose rows are $m_1, m_2, \ldots, m_k$
Problem: $k \ll N$: under-determined system
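The measurement step can be sketched as plain dot products. In this illustration the sign patterns in `M` are hypothetical (chosen only for the example), and the ordering of the entries of `x` is an assumption; only the three non-zero values come from the slide.

```python
def measure(M, x):
    """Return y = M x: one dot product per measurement vector m_i (row of M)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Sparse signal with N = 10 entries, three of them non-zero.
x = [53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0]

# k = 3 hypothetical measurement vectors with entries in {-1, +1}.
M = [
    [1, -1, 1, 1, 1, 1, -1, 1, -1, 1],
    [-1, 1, 1, -1, -1, 1, 1, 1, 1, -1],
    [1, 1, -1, 1, 1, -1, -1, 1, 1, 1],
]

# k = 3 measurements for N = 10 unknowns: an under-determined system.
y = measure(M, x)
```

With only three numbers in `y`, ordinary linear algebra cannot pin down ten unknowns; the sparsity assumption is what makes recovery possible.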
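The CS problem (cont)
Solution: CS breakthrough #1:
If M obeys certain properties it is effectively invertible: $"M^{-1}" y = x$
Example: Bernoulli Matrix
A $3 \times 10$ matrix $M$ with entries in $\{-1, +1\}$, or with entries in $\{0, 1\}$:
$M = \begin{pmatrix} 1&0&0&1&0&1&0&0&1&0 \\ 0&1&0&1&1&0&1&1&0&0 \\ 1&1&1&0&1&1&0&1&0&1 \end{pmatrix}$
Easy to create a "suitable" matrix M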
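Creating such a matrix takes only a few lines. The sketch below is an assumption-laden illustration: `bernoulli_matrix` is a hypothetical helper, each entry is drawn independently with equal probability, and the seed is fixed only to make the example repeatable.

```python
import random

def bernoulli_matrix(k, n, values=(-1, 1), seed=0):
    """Draw a k x n matrix whose entries are chosen independently and
    uniformly from `values` (a two-valued Bernoulli design)."""
    rng = random.Random(seed)
    return [[rng.choice(values) for _ in range(n)] for _ in range(k)]

# Sign matrix with entries in {-1, +1}.
M_sign = bernoulli_matrix(3, 10)

# Pooling matrix with entries in {0, 1}: entry (i, j) = 1 means
# sample j is included in pool i.
M_pool = bernoulli_matrix(3, 10, values=(0, 1))
```

For pooling, the {0, 1} form is the natural one, since a sample is either in a pool or not.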
The CS problem (cont)
Solution: CS breakthrough #2:
Solving the following optimization yields the correct solution:
$x^* = \arg\min_x \|x\|_1 \ \text{s.t.} \ M x = y$
$x^* = x$ with probability almost 1
Many off-the-shelf algorithms exist; we applied the GPSR algorithm
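To make the optimization concrete, here is a toy sketch that finds the minimum-L1 solution by exhaustive search. This is not GPSR: brute force is swapped in because it is easy to verify on a tiny problem. The 5 x 9 pooling matrix is hypothetical, entries of x are restricted to {0, 1, 2}, and raw allele counts are used instead of fractions so that equality is exact.

```python
from itertools import product

def measure(M, x):
    """y = M x, one dot product per pool."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def min_l1_solution(M, y, levels=(0, 1, 2)):
    """Exhaustively search levels^N for the minimum-L1 vector with M x == y.
    Feasible only for tiny N; stands in for an L1 solver such as GPSR."""
    n = len(M[0])
    best = None
    for x in product(levels, repeat=n):
        if measure(M, x) == y and (best is None or sum(x) < sum(best)):
            best = x
    return best

# Hypothetical design: 5 pools over 9 samples, each sample in exactly 2 pools.
M = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 1, 0],
]
x_true = (0, 0, 0, 1, 0, 0, 0, 0, 1)   # two heterozygous carriers
y = measure(M, x_true)                  # 5 pooled measurements instead of 9 tests
x_hat = min_l1_solution(M, y)
```

In this example five pooled measurements are enough to recover both carriers exactly; real designs would rely on a proper solver and must cope with noise.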
Rare allele identification in a CS framework
$y_i = \frac{1}{2} m_i \cdot x$
$m_i = \frac{1}{5}(1, 0, 1, 1, 1, 0, 0, 0, 1)$ (5 = # individuals in the pool)
$x = (0, 0, 0, 1, 0, 0, 0, 0, 1)^T$ (# rare alleles per individual)
Genotypes: AA AA AA AB AA AA AA AA AB
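The ideal pooled measurement can be sketched as follows. This is an illustration under stated assumptions: the pool membership and the vector x are taken in the order transcribed from the slide, `pooled_fraction` is a hypothetical helper, and equal amounts of DNA per individual are assumed.

```python
from fractions import Fraction

def pooled_fraction(pool, x):
    """Ideal fraction of B alleles in one pool:
    (1/2) * (1/n) * sum of x over pool members, where n is the number of
    individuals in the pool and each individual contributes two alleles."""
    n = sum(pool)                                   # individuals in the pool
    rare = sum(p * x_j for p, x_j in zip(pool, x))  # rare alleles in the pool
    return Fraction(rare, 2 * n)

pool = [1, 0, 1, 1, 1, 0, 0, 0, 1]   # 1 = sample is in the pool (5 members)
x = [0, 0, 0, 1, 0, 0, 0, 0, 1]      # number of rare alleles per individual

y_i = pooled_fraction(pool, x)
```

With this ordering both carriers fall inside the pool, so 2 of the 10 pooled alleles are B.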
Measuring device – NGST
Roche/454, Illumina Solexa, Applied Biosystems SOLiD, Helicos
Time: a few days. Price: a few thousand $
Constantly improving (exponentially!)
NGST output
Output: "reads" (each line = one "read")
Illumina: a few million reads per lane
454: almost 1 million
Coverage: # reads per location
NGST – targeted sequencing
We measure the number of reads containing B out of
the total number of reads. Here: 1/16
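The read-counting measurement can be sketched like this. The reads below are synthetic and assumed to be already aligned to the targeted region; the reference and variant strings are borrowed from the notation slide, and `rare_allele_fraction` is a hypothetical helper.

```python
from fractions import Fraction

reference = "AGCGTTCT"   # allele "A"
variant = "AGTGTTCT"     # allele "B": SNP at position 2 (0-based), C -> T

# Sixteen synthetic, already-aligned reads covering the locus; one carries B.
reads = [reference] * 15 + [variant]

def rare_allele_fraction(reads, position, rare_base):
    """Number of reads showing the rare base at `position`,
    divided by the total number of reads covering it."""
    hits = sum(1 for read in reads if read[position] == rare_base)
    return Fraction(hits, len(reads))

fraction_b = rare_allele_fraction(reads, position=2, rare_base="T")
```

This reproduces the 1/16 of the slide; in practice the count is noisy because coverage is finite and reads contain errors.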
Model formulation
Parts of this modeling appeared in S. Prabhu & I. Pe'er, Genome Research July 09
Ideal measurement - the fraction of "B" reads: $y_i = \frac{1}{2} m_i \cdot x$
NGST measurement:
1. Sampling noise: finite number of reads from each site - r
$z_i \sim \mathrm{Binomial}(r, y_i)$, estimated frequency: $\hat{y}_i = z_i / r$
r is itself a random variable, with parameters $\left(\frac{\text{total \# reads}}{\text{\# loci}}, 1\right)$
2. Technical errors:
$e_r$ - read errors: 0.5-1%
DNA preparation errors
$x^* = \arg\min_{x \in \{0,1,2\}^N} \|x\|_1 \ \text{s.t.} \ \left\| \frac{1}{2} M x - \frac{z/r - e_r}{1 - 2 e_r} \right\|_2 \le \epsilon$
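The two noise sources and the read-error correction can be sketched in a small simulation. Several things here are assumptions: read errors are modeled as symmetric A/B flips with probability e_r (which is what the correction term (z/r - e_r)/(1 - 2 e_r) inverts), the coverage r is held fixed instead of being random, DNA preparation errors are ignored, and the numeric values are arbitrary.

```python
import random

def simulate_pool(y, r, e_r, rng):
    """Draw z, the number of reads showing B out of r reads.
    Assumption: a read error flips A <-> B with probability e_r, so the
    observed B probability is y*(1 - e_r) + (1 - y)*e_r."""
    p_obs = y * (1 - e_r) + (1 - y) * e_r
    return sum(1 for _ in range(r) if rng.random() < p_obs)

def corrected_frequency(z, r, e_r):
    """Invert the read-error model: (z/r - e_r) / (1 - 2*e_r)."""
    return (z / r - e_r) / (1 - 2 * e_r)

rng = random.Random(0)   # fixed seed, only for repeatability
y_true = 0.1             # ideal fraction of B reads in the pool
r = 4000                 # coverage, fixed here for simplicity
e_r = 0.01               # read error rate (1%)

z = simulate_pool(y_true, r, e_r, rng)
y_hat = corrected_frequency(z, r, e_r)
```

The corrected estimate fluctuates around the ideal value, with the spread shrinking as the coverage r grows.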
Unique properties of this application
1. Measurement noise is pool dependent: $y_i = \frac{1}{2} m_i \cdot x$, $z_i \sim \mathrm{Binomial}(r, y_i)$
2. The matrix M is known up to noise: DNA preparation errors
3. Potential constraints on the matrix M - sparseness (potential technical problems, total amount of DNA):
$M = \begin{pmatrix} 1&0&0&1&0&1&0&0&1&0 \\ 0&1&0&1&1&0&1&1&0&0 \\ 1&1&1&0&1&1&0&1&0&1 \end{pmatrix}$
"Pure" in silico simulations
Matlab package available at
www.broadinstitute.org/mpg/comseq/
Results
In the paper:
• 3-70 fold more efficient than the naïve approach
• dependence on the number of targeted loci
• homozygous rare allele case (BB)
• combination with barcodes
• coverage vs. number of pools trade-off
• dependence on the three noise factors
• dependence on the number of samples per pool.
Simulations based on experimental data
Objective: validate our model.
Results
In the paper:
• Carrier identification based on pooled experimental data (Out et al., Hum. Mutat., 2009)
• Decoding mixtures based on the 1000 Genomes Pilot 3 project.
Prior work
"Pooled DNA" experiments (e.g. GWAS): allele frequency, not carriers
Group testing in general
• Drug screening
• Streaming algorithms
• Communications
(D. Du and F.K. Hwang; A.C. Gilbert and M.J. Strauss)
Group testing for rare allele detection
Genome Research, July 09:
S. Prabhu and I. Pe'er, "Overlapping pools for high-throughput targeted resequencing" - single carrier
Y. Erlich, G.J. Hannon et al., "DNA Sudoku - harnessing high-throughput sequencing for multiplexed specimen analysis" - pooling based on the Chinese Remainder Theorem and barcoding.
Y. Erlich et al., "Compressed Genotyping", IEEE Information Theory, 2010
Y. Erlich, NS, Amnon Amir, Or Zuk, "Compressed Sensing Approach for High Throughput Carrier Screen", Allerton 2009
Current work – Dor Yeshorim
In collaboration with Y. Erlich, Whitehead Institute, MIT.
3000 DNA samples
• 41 samples with known genotype
• Pool these 41 samples and sequence
[Figure labels: BLOOM, CF1152, CF3849L, CF508, CN285, FD696R, GS, TS1277]
Current work – Sorghum bicolor
Submitted proposal with
Eyal Fridman, Faculty of Agriculture, HUJI
Zhanguo Xin, USDA-ARS
Yaniv Erlich, Whitehead Institute, MIT
Collaborators:
Rivka Elbaum, Faculty of Agriculture, HUJI
Noam Shomron, TAU
Or Zuk, Broad Institute
“Efficient allele mining in a model sequenced crop via the compressed
sequencing approach”
Current work – Sorghum bicolor
Eyal Fridman, Rivka Elbaum
Validation phase: mutant detection - the COMT as a target gene:
• 800 samples
• 2 known loci in the COMT gene; each in a single sample
Expected resources:
• A single Illumina lane
• # of pools
[Figure: # pools needed to reach 95% zero error simulations]
Current work – Sorghum bicolor
Exploration phase: de novo SNP identification.
• Candidate genes involved in silica and water homeostasis
[Rivka Elbaum, HUJI]
• Target 120nt in 7 aquaporin NIP-like genes.
• 6400 samples
Pooling design - The same pools may be used to seek variation in any other gene
Expected resources: 2 Illumina lanes
Current work – Sorghum bicolor
Suggested workflow
Samples: create pools once and for all!
Target gene #1: order primers for gene #1, then amplify the pools
Target gene #2: order primers for gene #2, then amplify the pools
Costs are proportional to the number of pools!
• sample preparation
• direct sequencing costs
Conclusions
• Generic approach that puts together sequencing and CS for identifying rare
alleles and their carriers.
• Much higher efficiency than the naïve approach - may significantly improve
cost effectiveness in future Association Studies, and in screening large DNA
cohorts for specific risk alleles.
• The method naturally deals with all possible scenarios of multiple carriers and
heterozygous or homozygous rare alleles.
• Seamlessly combined with barcodes
• In our approach experimental costs (both sample preparation and direct
sequencing costs) are proportional to the number of pools and NOT to the
number of samples
Acknowledgements
Amnon Amir
Department of Physics of Complex Systems, Weizmann Institute of Science
Or Zuk
Broad Institute of MIT and Harvard
Yaniv Erlich
Whitehead Institute, MIT