![Page 1: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/1.jpg)
Information Theory of DNA Sequencing
David Tse, Dept. of EECS, U.C. Berkeley
LIDS Student Conference, MIT
Feb. 2, 2012
Research supported by NSF Center for Science of Information.
Guy Bresler, Abolfazl Motahari
![Page 2: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/2.jpg)
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
courtesy: Batzoglou
![Page 3: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/3.jpg)
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished
3 billion basepairs
courtesy: Batzoglou
![Page 4: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/4.jpg)
Sequencing Gets Cheaper and Faster
Cost of one human genome
• HGP: $3 billion
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
courtesy: Batzoglou
![Page 5: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/5.jpg)
But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou
![Page 6: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/6.jpg)
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
![Page 7: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/7.jpg)
Sequencing Technologies
• HGP era: single technology (Sanger)
• Current: multiple “next generation” technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent, etc.)
• All provide massively parallel sequencing.
• Each technology has different read lengths, noise profiles, etc.
![Page 8: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/8.jpg)
Assembly Algorithms
• Many proposed algorithms.
• Different algorithms tailored to different technologies.
• Each algorithm deals with the full complexity of the problem while trying to scale well with the massive amount of data.
• Lots of heuristics used in the design.
![Page 9: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/9.jpg)
A Basic Question
• What is the minimum number of reads needed to reconstruct with a given reliability?
• A benchmark for comparing different algorithms.
• An algorithm-independent basis for comparing different technologies and designing new ones.
![Page 10: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/10.jpg)
Coverage Analysis
• Pioneered by Lander-Waterman
• What is the minimum number of reads needed to ensure, with a desired probability, that there is no gap between the reads?
• Only provides a lower bound.
• Can one get a tight lower bound?
![Page 11: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/11.jpg)
Communication and Sequencing: An Analogy
Communication:
Sequencing:
![Page 12: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/12.jpg)
Communication: Fundamental Limits
Given statistical models for source and channel (Shannon 48):
Asymptotically reliable communication at rate R source symbols per channel output symbol is possible if and only if:
$R < \dfrac{C_{\text{channel}}}{H_{\text{source}}}$
where $C_{\text{channel}}$ = channel capacity and $H_{\text{source}}$ = source entropy rate.
![Page 13: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/13.jpg)
DNA Sequencing: Fundamental Limits?
DNA sequence $S_1, S_2, \ldots, S_G$ → reads $R_1, R_2, \ldots, R_N$
• Define: sequencing rate R = G/N basepairs per read
• Question: can one define a sequencing capacity C such that:
asymptotically reliable reconstruction is possible if and only if R < C?
![Page 14: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/14.jpg)
A Simple Model
• DNA sequence: i.i.d. with distribution p.
• Starting positions of reads are i.i.d. uniform on the DNA sequence.
• Read process is noiseless.
Will extend to more complex source model and noisy read process later.
![Page 15: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/15.jpg)
The read channel
• Capacity depends on
– read length: L
– DNA length: G
• $L \uparrow \Rightarrow C \uparrow$, $G \uparrow \Rightarrow C \downarrow$
• Normalized read length: $\bar{L} := \dfrac{L}{\log G}$
• E.g. $L = 100$, $G = 3 \times 10^9$: $\bar{L} = 4.6$
read channel: AGCTTATAGGTCCGCATTACC → AGGTCC
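As a quick check of the normalization, the sketch below computes the normalized read length L / log G for the slide's example. It assumes the natural logarithm, which reproduces the quoted value of 4.6; the function name is illustrative and not from the talk.

```python
import math

def normalized_read_length(read_length: int, genome_length: float) -> float:
    """Return L / ln(G): the read length normalized by the log of the DNA length."""
    return read_length / math.log(genome_length)

# Slide example: L = 100 basepairs, G = 3 x 10^9 basepairs.
l_bar = normalized_read_length(100, 3e9)
print(round(l_bar, 1))  # 4.6
```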
![Page 16: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/16.jpg)
Result: Sequencing Capacity
$H_2(p) = -\log \sum_{i=1}^{4} p_i^2$ (Renyi entropy of order 2)
$C = 0$ if $\bar{L} < 2/H_2(p)$
$C = \bar{L}$ if $\bar{L} > 2/H_2(p)$
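The capacity result can be evaluated numerically. The sketch below is a minimal illustration for the i.i.d., noiseless model of the talk: it computes the Renyi entropy of order 2 in nats and returns a capacity of 0 below the threshold 2/H2(p) (the no-duplication condition derived later in the slides) and the normalized read length above it. The function names are hypothetical, and the behavior exactly at the threshold is an arbitrary choice, since the slides do not specify it.

```python
import math

def renyi_entropy_2(p):
    """Renyi entropy of order 2, in nats: -ln(sum_i p_i^2)."""
    return -math.log(sum(q * q for q in p))

def sequencing_capacity(l_bar, p):
    """Capacity in the i.i.d., noiseless model: 0 below 2/H2(p), l_bar above it."""
    threshold = 2.0 / renyi_entropy_2(p)
    # At exact equality we return 0.0; this boundary choice is an assumption.
    return l_bar if l_bar > threshold else 0.0

uniform = [0.25, 0.25, 0.25, 0.25]
print(renyi_entropy_2(uniform))           # ln 4, about 1.386
print(sequencing_capacity(4.6, uniform))  # 4.6, since 4.6 > 2 / ln 4 (about 1.44)
print(sequencing_capacity(1.0, uniform))  # 0.0
```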
![Page 17: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/17.jpg)
Coverage Constraint
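Starting positions of reads ~ Poisson process with rate 1/R, where $R = G/N$ ($N$ reads on a DNA sequence of length $G$)
$T$ = spacing between the starting positions of successive reads; a gap occurs when $T > L$
$E[\#\text{ of gaps}] = N \cdot P[T > L] = N e^{-L/R}$
$E[\#\text{ of gaps}] \to 0 \iff R < \bar{L}$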
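To see why the coverage condition is a threshold on R, the sketch below evaluates the slide's approximation E[# of gaps] = N exp(-L/R), with R = G/N, for the earlier example values (L = 100, G = 3 x 10^9, natural logarithm assumed). The function and variable names are illustrative.

```python
import math

def expected_gaps(genome_length, read_length, num_reads):
    """Approximate E[# of gaps] = N * exp(-L / R), with R = G / N."""
    rate = genome_length / num_reads  # R, basepairs per read
    return num_reads * math.exp(-read_length / rate)

G, L = 3e9, 100
l_bar = L / math.log(G)

# At R = l_bar the expected number of gaps equals N / G = 1 / l_bar (about 0.22),
# so it does not yet vanish.
n_critical = G / l_bar
print(expected_gaps(G, L, n_critical))

# Doubling the number of reads (R = l_bar / 2) drives it to essentially zero.
print(expected_gaps(G, L, 2 * n_critical))
```

In this approximation, once R drops below the normalized read length, the exponential term shrinks faster than N grows.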
![Page 18: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/18.jpg)
No-Duplication Constraint
$E[\#\text{ of duplicated pairs}] \approx N^2 \cdot \left( \sum_{i=1}^{4} p_i^2 \right)^{L} = N^2 e^{-L H_2(p)}$
$E[\#\text{ of duplicated pairs}] \to 0 \iff \bar{L} > \dfrac{2}{H_2(p)}$
The two possibilities have the same set of length L subsequences.
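A toy example makes the ambiguity concrete. The sketch below builds two different sequences in which two repeated (L-1)-mers are interleaved, and checks that they share exactly the same multiset of length-L substrings, so no algorithm could tell them apart from those substrings alone. The specific strings and the interleaved a..b..a..b pattern are assumptions chosen for illustration, not taken from the slide's figure.

```python
from collections import Counter

def l_spectrum(seq, L):
    """Multiset of all length-L substrings of seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

L = 3
a, b = "AC", "GT"  # two repeated (L-1)-mers, interleaved as a..b..a..b
s1 = a + "T" + b + "A" + a + "C" + b  # ACTGTAACCGT
s2 = a + "C" + b + "A" + a + "T" + b  # ACCGTAACTGT (segments after each 'a' swapped)

print(s1 != s2)                                # True
print(l_spectrum(s1, L) == l_spectrum(s2, L))  # True
```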
![Page 19: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/19.jpg)
Achievability
coverage constraint
no-duplication constraint
achievable?
![Page 20: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/20.jpg)
Greedy Algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge them into a new contig.
3. Repeat step 2 until only one contig remains or no more merging can be done.
Algorithm progresses in stages: at stage $\ell$, merge reads at overlap $\ell$, for $\ell = L-1, L-2, \ldots, 0$.
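The three steps of the greedy algorithm can be sketched as follows. This is a minimal quadratic-time illustration for noiseless reads, not the speaker's implementation: it searches all contig pairs for the largest suffix-prefix overlap on every iteration, which visits overlaps in decreasing order much like the stage-by-stage description. The helper names and the example sequence are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # step 1: contigs start as the distinct reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # step 3: no more merging can be done
        merged = contigs[best_i] + contigs[best_j][best_k:]  # step 2: merge the pair
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

seq = "ACGTTGCATCAGGCTA"  # toy sequence with no repeated 5-mers
reads = [seq[i:i + 6] for i in range(len(seq) - 5)]  # noiseless reads, L = 6
print(greedy_assemble(reads[::-1]) == [seq])  # True
```

The toy sequence is chosen so that every length-5 substring is unique, which is the small-scale analogue of the no-duplication constraint; with repeats, the same code can merge the wrong pair.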
![Page 21: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/21.jpg)
Greedy algorithm: the beginning
gap
Most reads have large overlap with neighbors
Expected # of errors in stage L-1 is governed by the probability that two disjoint reads are equal.
Very small since no-duplication constraint is satisfied.
![Page 22: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/22.jpg)
Greedy algorithm: stage $\ell$
Expected # of errors at stage $\ell$ is governed by the probability that two disjoint reads appear to overlap.
This may get larger, but no larger than $\left( N e^{-L/R} \right)^2$ when $\ell = 0$.
Very small since coverage constraint is satisfied.
![Page 23: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/23.jpg)
Summary: Two Regimes
coverage-limited regime
duplication-limited regime
![Page 24: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/24.jpg)
Relation to Earlier Works
• Coverage constraint: Lander-Waterman 88
• No-duplication constraint: Arratia et al. 96
• Arratia et al. focused on a model where all length L subsequences are given (sequencing by hybridization)
• Our result: the two constraints together are necessary and sufficient for shotgun sequencing.
![Page 25: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/25.jpg)
Rest of Talk
• Impact of read noise.
• Impact of repeats in the DNA sequence.
![Page 26: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/26.jpg)
Read Noise
Model:
discrete memoryless channel defined by transition probabilities
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA
![Page 27: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/27.jpg)
Modified Greedy algorithm
Do we merge the two reads at overlap $\ell$?
This is a hypothesis testing problem!
We observe two strings: X and Y.
Are they noisy versions of the same DNA subsequence? (merge)
Or from two different locations? (do not merge)
![Page 28: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/28.jpg)
Impact on Sequencing Rate
• Hypothesis test:
H0: noisy versions of the same DNA subsequence (merge)
H1: from disjoint DNA subsequences (do not merge)
MAP rule: declare H0 if the likelihood ratio of the observed pair (X, Y) exceeds a threshold.
Two types of error:
• missed detection (new type of error)
• false positive (same as before)
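The merge decision can be sketched as a log-likelihood ratio test. The noise model here is an assumption made only for illustration, since the slides do not give the transition probabilities: the DNA is i.i.d. uniform, and each read symbol is flipped to one of the three other bases with total probability `eps` (0 < eps < 3/4), independently per read. The function names and the per-symbol threshold are likewise hypothetical.

```python
import math

def log_likelihood_ratio(x, y, eps):
    """ln P(x, y | H0) - ln P(x, y | H1) for two aligned strings of equal length."""
    # H0: both symbols are noisy copies of one uniformly distributed base.
    p_match = ((1 - eps) ** 2 + eps ** 2 / 3) / 4
    p_mismatch = (2 * (1 - eps) * (eps / 3) + 2 * (eps / 3) ** 2) / 4
    # H1: the two symbols are independent and uniform.
    p_indep = 1 / 16
    return sum(
        math.log((p_match if a == b else p_mismatch) / p_indep)
        for a, b in zip(x, y)
    )

def should_merge(x, y, eps, threshold=0.0):
    """Declare H0 (merge) when the per-symbol log-likelihood ratio reaches the threshold."""
    return log_likelihood_ratio(x, y, eps) >= threshold * len(x)

print(should_merge("ACGTACGTAC", "ACGTACGAAC", eps=0.05))  # True: one mismatch in ten
print(should_merge("ACGTACGTAC", "TGCATGCATG", eps=0.05))  # False: no position matches
```

Raising the threshold reduces false positives at the cost of more missed detections, which is the trade-off between the two error types listed above.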
![Page 29: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/29.jpg)
Impact on Sequencing Rate
coverage constraint
no-duplication constraint
obtained by optimizing MAP threshold
![Page 30: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/30.jpg)
More Complex DNA Statistics
• i.i.d. is not a very good model for the DNA sequence.
• More generally, we may want to model it as a correlated random process.
• For short-scale correlation, H2(p) can be replaced by the Renyi entropy rate of the process.
• But for higher mammals, DNA contains long repeats, with repeat lengths comparable to or longer than the read length.
• This is handled by paired-end reads in practice.
$H_2 = \lim_{\ell \to \infty} -\dfrac{1}{\ell} \log P(x^{\ell} = y^{\ell})$
![Page 31: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/31.jpg)
A Simple Model for Repeats
Model: M repeats of length K placed uniformly into DNA sequence
If repeat length K ≫ read length L, how to reconstruct the sequence?
Use paired-end reads:
Reads come in pairs with known separation J.
These reads can bridge the repeats.
![Page 32: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/32.jpg)
Impact on Sequencing Rate
coverage constraint
no-duplication constraint
coverage of repeats constraint
If J > 2d + K, then capacity is the same as without repeats (d: constant independent of K)
K = repeat length, J = paired-end separation
![Page 33: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/33.jpg)
Conclusion
• DNA sequencing is an important problem.
• Many new technologies and new applications.
• An analogy between sequencing and communication is drawn.
• A notion of sequencing capacity is formulated.
• A principled design framework?