identification of rare alleles and their carriers -...

Noam Shental

Department of Computer Science, The Open University of Israel

[email protected]


Identification of rare alleles and their carriers

using compressed se(que)nsing


The Task:

Input

• Genes ('regions') of interest
• Large set of DNA samples

Output

• List of mutation loci
• List of carriers

We try to perform this task with minimal resources

Motivation #1: Nationwide Carrier Screen

for known risk alleles

Example: Rare recessive genetic diseases

Genotype    Phenotype
Normal      Healthy
Carrier     Healthy!
Affected    Sick

Nationwide carrier screen

http://upload.wikimedia.org/wikipedia/commons/3/3e/Autorecessive.svg

Large scale carrier screen (rates vary across ethnic groups)

Genetic Disorder          Carrier rate
Tay-Sachs                 1:25
Cystic Fibrosis           1:30
Familial Dysautonomia     1:30
Usher Syndrome            1:40
Canavan                   1:40
Glycogen Storage          1:71
Fanconi Anemia C          1:80
Niemann-Pick              1:80
Mucolipidosis type 4      1:100
Bloom                     1:102
Nemaline Myopathy         1:108

Genetic Disorder          Carrier rate
Tay-Sachs                 1:25

Specific mutations

HEXA gene on chromosome 15; over 100 mutations are known

TS (1277)        3.50%
TS (1421)        0.30%
TS (F304/305)    0.10%
TS (G269)        0.10%
TS (R170Q)       0.20%

Motivation #2: de novo SNP identification

A. known genes

HEXA gene on chromosome 15; over 100 mutations are known

Tay-Sachs carriers – find new mutations

Motivation #2: de novo SNP identification

B. Association Studies

Controls    Cases

• 'discoveries' - variations that are less prevalent in the control groups.

• p - value

The Task:

Input

• Genes ('regions') of interest
• Large set of DNA samples

Output

• List of mutation loci
• List of carriers

We try to perform this task with minimal resources


using compressed se(que)nsing

Joint work with

Amnon Amir

Department of Physics of Complex Systems, Weizmann Institute of Science

Or Zuk

Broad Institute of MIT and Harvard

Nucleic Acids Research, Aug 2010, doi:10.1093/nar/gkq675

Looking for collaborations

Outline

• Motivation
• Overview of the approach
• Introduction:
    compressed sensing (CS) and its correspondence to our problem
    next generation sequencing technology (NGST)
• Model:
    modeling NGST
• Simulation results:
    'pure' simulations
    simulations based on experimental data
• Prior work
• Current research
• Conclusions

Specific mutations - notation

Reference genome:                         …AGCGTTCT…   "A"
Single-nucleotide polymorphisms (SNPs):   …AGTGTTCT…   "B"
Insertions/Deletions (InDels):            …AGGTTCT     "B"

Carrier test screen: Amplify a sample of DNA and then test

Genotype:                                "AA"   "AB"
Fraction of B's out of tested alleles:   0      1/2

Naïve approach – one test per individual

Collect DNA samples and apply 9 independent tests:

Genotype:                                AA   AB    AA   AA   AA   AB    AA   AA   AA
Fraction of B's out of tested alleles:   0    1/2   0    0    0    1/2   0    0    0

Compressed sensing based group testing

[Figure: samples are combined into pools; Next Generation Sequencing Technology measures the fraction of B's in each pool; compressed sensing is used to infer/reconstruct the carriers. A few tests instead of 9.]

Example

Take home message

• Generic approach that puts together sequencing and CS for identifying rare alleles and their carriers.
• Much higher efficiency than the naïve approach - may significantly improve cost effectiveness in future association studies, and in screening large DNA cohorts for specific risk alleles.
• In our approach experimental costs (both sample preparation and direct sequencing costs) are proportional to the number of pools and NOT to the number of samples.


The CS problem

Task: Infer the entries of a vector $x$ with a minimal number of "operations"

$x = (x_1, x_2, \dots, x_N)^T$

Assumption: $x$ is sparse, e.g. $x = (53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0)^T$

How is it done?

We can select a vector $m_i$ as we wish, e.g. with entries $\pm 1$, and get a "measurement":

$y_i = m_i \cdot x$ (in the example, $y_i = 351$)

The CS problem (cont)

We can repeat that $k$ times, and get $k$ "measurements":

$y = Mx$, where $y$ has $k$ entries and $M$ is the $k \times N$ matrix whose rows are $m_1, m_2, \dots, m_k$

Problem: $k \ll N$: under-determined system
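As a small illustration of the measurement step, the sketch below computes k = 3 measurements y = Mx of a sparse vector with N = 10 entries. The nonzero values of x are the ones shown on the slide; the ±1 rows of M are hypothetical choices made for this example (the first row is picked so that its measurement comes out close to the 351 on the slide), and the helper name `measure` is ours.

```python
# Sparse signal: N = 10 entries, only 3 are nonzero (values from the slide).
x = [53.4, 0, 0, 0, 0, 345, 0, 0, 47.1, 0]

# Hypothetical measurement vectors m_i with +/-1 entries (assumed for illustration).
M = [
    [1, -1, 1, 1, -1, 1, 1, 1, -1, -1],
    [-1, 1, 1, -1, 1, 1, -1, 1, 1, 1],
    [1, 1, -1, 1, 1, -1, 1, -1, 1, -1],
]


def measure(M, x):
    """Return y = M x, one inner product m_i . x per row of M."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


y = measure(M, x)
k, N = len(M), len(x)
print(f"k = {k} measurements of N = {N} unknowns:", [round(v, 1) for v in y])
```

With only 3 equations and 10 unknowns, many vectors x reproduce the same y; sparsity is what singles out the intended one.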


Solution: CS breakthrough #1:

If $M$ obeys certain properties it is effectively invertible: $x = \text{"}M^{-1}\text{"}\, y$

Example: Bernoulli matrix, with random $\pm 1$ entries or random 0/1 entries, e.g.

M =
1 0 0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1 0 0
1 1 1 0 1 1 0 1 0 1

Easy to create a "suitable" matrix M
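To show how little is needed to create such a matrix, here is a minimal sketch that draws a Bernoulli matrix with independent, equally likely entries. The function name `bernoulli_matrix`, its parameters, and the 3 x 10 size are assumptions made for this example; the slides do not specify how the matrices were generated.

```python
import random


def bernoulli_matrix(k, n, values=(-1, 1), seed=None):
    """Return a k x n matrix whose entries are drawn independently and
    uniformly from `values`, e.g. (-1, 1) or (0, 1)."""
    rng = random.Random(seed)
    return [[rng.choice(values) for _ in range(n)] for _ in range(k)]


# A +/-1 matrix and a 0/1 matrix; in the 0/1 case, row i can be read as
# "which of the n samples go into pool i".
M_sign = bernoulli_matrix(3, 10, values=(-1, 1), seed=0)
M_pool = bernoulli_matrix(3, 10, values=(0, 1), seed=0)
for row in M_pool:
    print("".join(str(v) for v in row))
```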


Solution: CS breakthrough #2:

Solving the following optimization yields the correct solution:

$x^* = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad Mx = y$

$x^* = x$ with probability almost 1

Many off-the-shelf algorithms exist; we applied the GPSR algorithm
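The talk used GPSR; the sketch below is not GPSR and not the authors' code. It uses a simpler, swapped-in method, iterative soft thresholding (ISTA), on the closely related penalized problem min over x of 0.5*||Mx - y||^2 + lam*||x||_1, only to make the idea of l1-based reconstruction concrete. The toy matrix, `lam`, `n_iter`, and all function names are assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for the l1 penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def objective(M, x, y, lam):
    """0.5 * ||M x - y||_2^2 + lam * ||x||_1"""
    residual = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(M, y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(v) for v in x)


def ista(M, y, lam=0.01, n_iter=2000):
    """Iterative soft thresholding for min_x 0.5*||Mx - y||^2 + lam*||x||_1."""
    n = len(M[0])
    # The squared Frobenius norm bounds the largest eigenvalue of M^T M,
    # so 1/L is a safe (conservative) step size.
    L = sum(a * a for row in M for a in row)
    x = [0.0] * n
    for _ in range(n_iter):
        residual = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(M, y)]
        grad = [sum(M[i][j] * residual[i] for i in range(len(M))) for j in range(n)]
        x = [soft_threshold(xj - gj / L, lam / L) for xj, gj in zip(x, grad)]
    return x


# Toy problem (hypothetical): 4 measurements of a 6-entry vector with one nonzero.
M = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
x_true = [0, 0, 1, 0, 0, 0]
y = [sum(a * b for a, b in zip(row, x_true)) for row in M]
x_hat = ista(M, y)
print([round(v, 2) for v in x_hat])
```

The conservative step size trades speed for simplicity; a practical solver would use a tighter step or an accelerated scheme.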


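Rare allele identification in a CS framework

$x$ = # rare alleles per individual:

Genotype:   AA   AA   AA   AB   AA   AA   AA   AA   AB
x:          0    0    0    1    0    0    0    0    1

$m_i$ = individuals in the pool: $m_i = \frac{1}{5}(1, 0, 1, 1, 1, 0, 0, 0, 1)$

$y_i = \frac{1}{2}\, m_i \cdot x = \frac{1+1}{5 \cdot 2} = \frac{1}{5}$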
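The pooled measurement in this example can be checked by direct computation. The sketch below takes the slide's genotype vector and pool membership and evaluates the ideal fraction of B alleles, y = (1/2) m · x, assuming each individual contributes an equal amount of DNA to the pool. The names `in_pool` and `pooled_fraction` are ours.

```python
from fractions import Fraction

# Rare-allele counts per individual (0 = AA, 1 = AB), as in the slide's example.
x = [0, 0, 0, 1, 0, 0, 0, 0, 1]

# Pool membership: 1 if the individual's DNA is in the pool (5 individuals here).
in_pool = [1, 0, 1, 1, 1, 0, 0, 0, 1]


def pooled_fraction(in_pool, x):
    """Ideal fraction of B alleles in a pool: y = (1/2) * m . x,
    where m = in_pool / (pool size) and each individual carries 2 alleles."""
    pool_size = sum(in_pool)
    rare_alleles = sum(m_j * x_j for m_j, x_j in zip(in_pool, x))
    return Fraction(rare_alleles, 2 * pool_size)


y = pooled_fraction(in_pool, x)
print(y)  # 1/5: two B alleles among the 10 alleles in the pool
```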


Measuring device – NGST

• Roche/454
• Illumina Solexa
• Applied Biosystems SOLiD
• Helicos

Time: a few days. Price: a few thousand $

Constantly improving (exponentially!)

NGST output

output: "reads" (each line in the figure = one "read")

Illumina: a few million reads per lane
454: almost 1 million

coverage: # reads per location

NGST – targeted sequencing

We measure the number of reads containing B out of the total number of reads. Here: 1/16


Parts of this modeling appeared in S. Prabhu & I. Pe'er, Genome Research, July 2009

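Model formulation

Ideal measurement - the fraction of "B" reads: $y_i = \frac{1}{2}\, m_i \cdot x$

NGST measurement:

1. Sampling noise: finite number of reads from each site, $r$. $r$ is itself a random variable, parameterized by (total # reads / # loci, 1).
   $z_i \sim \mathrm{Binomial}(r, y_i)$; estimated frequency: $\hat{y}_i = z_i / r$

2. Technical errors:
   read errors $e_r$: 0.5-1%
   DNA preparation errors

Reconstruction:

$x^* = \arg\min_{x \in \{0,1,2\}^N} \|x\|_1 \quad \text{s.t.} \quad \left\| \frac{1}{2} M x - \frac{\frac{1}{r} z - e_r}{1 - 2 e_r} \right\|_2 \le \epsilon$

where $\epsilon$ bounds the residual noise.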
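The noise model can be simulated in a few lines. The sketch below draws the number of B reads for one pool and then undoes the read-error bias with the correction (z/r - e_r) / (1 - 2 e_r). It rests on two assumptions that are ours, not the slides': r is held fixed instead of being random, and a read error flips A to B or B to A with the same probability e_r. The parameter values and function names are illustrative only.

```python
import random


def observed_b_probability(y, e_r):
    """Probability that a read shows B when the true B fraction is y and
    each read is flipped (A<->B) with probability e_r (assumed symmetric)."""
    return y * (1 - e_r) + (1 - y) * e_r


def correct_for_read_error(freq, e_r):
    """Invert observed_b_probability: (freq - e_r) / (1 - 2 * e_r)."""
    return (freq - e_r) / (1 - 2 * e_r)


def simulate_pool(y, r, e_r, rng):
    """Draw z ~ Binomial(r, q), with q the error-adjusted B probability."""
    q = observed_b_probability(y, e_r)
    return sum(1 for _ in range(r) if rng.random() < q)


rng = random.Random(1)
y_true, r, e_r = 0.1, 2000, 0.01  # illustrative values only
z = simulate_pool(y_true, r, e_r, rng)
y_hat = correct_for_read_error(z / r, e_r)
print(f"z = {z} B reads out of {r}; corrected estimate = {y_hat:.3f}")
```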

Unique properties of this application

1. Measurement noise is pool dependent: $y_i = \frac{1}{2}\, m_i \cdot x$, $z_i \sim \mathrm{Binomial}(r, y_i)$

2. The matrix $M$ is known up to noise: DNA preparation errors

3. Potential constraints on the matrix $M$ - sparseness:

M =
1 0 0 1 0 1 0 0 1 0
0 1 0 1 1 0 1 1 0 0
1 1 1 0 1 1 0 1 0 1

   potential technical problems
   total amount of DNA
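A short worked step shows why the measurement noise depends on the pool. Conditional on the number of reads r, the binomial model gives the mean and variance of the estimated frequency directly; the numerical values below (r = 1000, y_i = 0.1) are illustrative assumptions, and read errors are ignored.

```latex
\[
  z_i \sim \mathrm{Binomial}(r, y_i)
  \;\Longrightarrow\;
  \mathbb{E}\!\left[\frac{z_i}{r}\right] = y_i ,
  \qquad
  \operatorname{Var}\!\left[\frac{z_i}{r}\right] = \frac{y_i\,(1 - y_i)}{r} .
\]
\[
  \text{Example: } r = 1000,\; y_i = 0.1
  \;\Longrightarrow\;
  \sqrt{\frac{0.1 \cdot 0.9}{1000}} \approx 0.0095 ,
  \qquad
  y_i = 0 \;\Longrightarrow\; \operatorname{Var} = 0 .
\]
```

Since y_i = (1/2) m_i · x, the variance changes with the number of rare alleles that happen to fall in pool i, unlike the fixed-variance noise usually assumed in CS.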


Matlab package available at

www.broadinstitute.org/mpg/comseq/

Results: 'pure' in silico simulations

In the paper:

• 3-70 fold more efficient than the naïve approach
• dependence on the number of targeted loci
• homozygous rare allele case (BB)
• combination with barcodes
• coverage vs. number of pools trade-off
• dependence on the three noise factors
• dependence on the number of samples per pool


Results: simulations based on experimental data

Objective – validate our model.

In the paper:

• Carrier identification based on pooled experimental data (Out et al., Hum. Mutat., 2009)
• Decoding mixtures based on the 1000 Genomes Pilot 3 project.

Prior work

"Pooled DNA" experiments (e.g. GWAS): allele frequency – not carriers

Group testing in general (D. Du and F.K. Hwang; A.C. Gilbert and M.J. Strauss)

• Drug screening
• Streaming algorithms
• Communications

Group testing for rare allele detection

Genome Research, July 2009:

• S. Prabhu and I. Pe'er, "Overlapping pools for high-throughput targeted resequencing" – single carrier
• Y. Erlich, G.J. Hannon et al., "DNA Sudoku-harnessing high-throughput sequencing for multiplexed specimen" - pooling based on the Chinese Remainder Theorem and barcoding

Y. Erlich et al., "Compressed Genotyping", IEEE Information Theory, 2010

Y. Erlich, NS, Amnon Amir, Or Zuk, "Compressed Sensing Approach for High Throughput Carrier Screen", Allerton 2009

Current work – Dor Yeshorim

In collaboration with Y. Erlich, Whitehead Institute, MIT.

• 3000 DNA samples
• 41 samples with known genotype
• Pool these 41 samples and sequence

BLOOM
CF1152
CF3849L
CF508
CN285
FD696R
GS
TS1277

Current work – Sorghum bicolor

Submitted proposal: "Efficient allele mining in a model sequenced crop via the compressed sequencing approach"

Submitted with:
Eyal Fridman, Faculty of Agriculture, HUJI
Zhanguo Xin, USDA-ARS
Yaniv Erlich, Whitehead Institute, MIT

Collaborators:
Rivka Elbaum, Faculty of Agriculture, HUJI
Noam Shomron, TAU
Or Zuk, Broad Institute

Validation phase: mutant detection - the COMT as a target gene:

• 800 samples
• 2 known loci in the COMT gene; each in a single sample

Expected resources:

• A single Illumina lane
• # of pools

[Figure: # pools needed to reach 95% zero error simulations]


Exploration phase: de novo SNP identification.

• Candidate genes involved in silica and water homeostasis

[Rivka Elbaum, HUJI]

• Target 120nt in 7 aquaporin NIP-like genes.

• 6400 samples

Pooling design - The same pools may be used to seek variation in any other gene

Expected resources: 2 Illumina lanes


Suggested workflow

1. Samples -> create pools, once and for all!
2. Target gene #1 -> order primers for gene #1 -> amplify the pools
3. Target gene #2 -> order primers for gene #2 -> amplify the pools

Costs are proportional to the number of pools!

• sample preparation
• direct sequencing costs

Conclusions

• Generic approach that puts together sequencing and CS for identifying rare alleles and their carriers.
• Much higher efficiency than the naïve approach - may significantly improve cost effectiveness in future association studies, and in screening large DNA cohorts for specific risk alleles.
• The method naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles.
• Seamlessly combined with barcodes.
• In our approach experimental costs (both sample preparation and direct sequencing costs) are proportional to the number of pools and NOT to the number of samples.

Acknowledgements

Amnon Amir

Department of Physics of Complex Systems, Weizmann Institute of Science

Or Zuk

Broad Institute of MIT and Harvard

Yaniv Erlich

Whitehead Institute, MIT
