Information Theory of DNA Sequencing
DESCRIPTION
Information Theory of DNA Sequencing. David Tse, Dept. of EECS, U.C. Berkeley. LIDS Student Conference, MIT, Feb. 2, 2012. Research supported by NSF Center for Science of Information. Abolfazl Motahari. Guy Bresler.
TRANSCRIPT
![Page 1: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/1.jpg)
Information Theory of DNA Sequencing
David Tse, Dept. of EECS, U.C. Berkeley
LIDS Student Conference, MIT
Feb. 2, 2012
Research supported by NSF Center for Science of Information.
Guy Bresler Abolfazl Motahari
![Page 2: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/2.jpg)
DNA sequencing
DNA: the blueprint of life
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
courtesy: Batzoglou
![Page 3: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/3.jpg)
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished
3 billion basepairs
courtesy: Batzoglou
![Page 4: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/4.jpg)
Sequencing Gets Cheaper and Faster
Cost of one human genome
• HGP: $3 billion
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
courtesy: Batzoglou
![Page 5: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/5.jpg)
But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
$10^{13}$ cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou
![Page 6: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/6.jpg)
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
![Page 7: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/7.jpg)
Sequencing Technologies
• HGP era: single technology (Sanger)
• Current: multiple “next generation” technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent, etc.)
• All provide massively parallel sequencing.
• Each technology has different read lengths, noise profiles, etc.
![Page 8: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/8.jpg)
Assembly Algorithms
• Many proposed algorithms.
• Different algorithms tailored to different technologies.
• Each algorithm deals with the full complexity of the problem while trying to scale well with the massive amount of data.
• Lots of heuristics used in the design.
![Page 9: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/9.jpg)
A Basic Question
• What is the minimum number of reads needed to reconstruct with a given reliability?
• A benchmark for comparing different algorithms.
• An algorithm-independent basis for comparing different technologies and designing new ones.
![Page 10: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/10.jpg)
Coverage Analysis
• Pioneered by Lander-Waterman
• What is the minimum number of reads needed to ensure, with a desired probability, that there is no gap between the reads?
• Only provides a lower bound.
• Can one get a tight lower bound?
![Page 11: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/11.jpg)
Communication and Sequencing: An Analogy
Communication:
Sequencing:
![Page 12: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/12.jpg)
Communication: Fundamental Limits
Given statistical models for source and channel:
$C_{\text{channel}}$ = channel capacity
$H_{\text{source}}$ = source entropy rate
Shannon 48
Asymptotically reliable communication at rate $R$ source symbols per channel output symbol is possible if and only if:
$$R < \frac{C_{\text{channel}}}{H_{\text{source}}}$$
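As a small numerical illustration of the condition $R < C_{\text{channel}}/H_{\text{source}}$, the sketch below assumes a Bernoulli source sent over a binary symmetric channel. Both models and the function names are illustrative assumptions, not taken from the talk.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) random variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def max_reliable_rate(source_bias, crossover):
    """Threshold C_channel / H_source, in source symbols per channel use.

    Assumed models (illustrative only): Bernoulli(source_bias) source with
    0 < source_bias < 1, and a binary symmetric channel with the given
    crossover probability.
    """
    h_source = binary_entropy(source_bias)
    c_channel = 1.0 - binary_entropy(crossover)
    return c_channel / h_source

# A fair-coin source over a BSC with 11% crossover: rates below roughly
# one half source symbol per channel use are achievable.
print(max_reliable_rate(0.5, 0.11))
```

A less random source (smaller entropy rate) raises the threshold, which mirrors how the sequencing capacity later in the talk depends on the statistics of the DNA sequence.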
![Page 13: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/13.jpg)
DNA Sequencing: Fundamental Limits?
DNA sequence: $S_1, S_2, \ldots, S_G$; reads: $R_1, R_2, \ldots, R_N$
• Define: sequencing rate R = G/N basepairs per read
• Question: can one define a sequencing capacity C such that:
asymptotically reliable reconstruction is possible if and only if R < C?
![Page 14: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/14.jpg)
A Simple Model
• DNA sequence: i.i.d. with distribution p.
• Starting positions of reads are i.i.d. uniform on the DNA sequence.
• Read process is noiseless.
Will extend to more complex source model and noisy read process later.
![Page 15: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/15.jpg)
The read channel
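• Capacity depends on
– read length: $L$
– DNA length: $G$
• $L \uparrow \Rightarrow C \uparrow$ and $G \uparrow \Rightarrow C \downarrow$
• Normalized read length: $\bar{L} := \dfrac{L}{\log G}$
• E.g. $L = 100$, $G = 3 \times 10^9$: $\bar{L} = 4.6$
read channel: AGCTTATAGGTCCGCATTACC → AGGTCC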
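The normalized read length in this example can be checked directly. The sketch below assumes the logarithm is natural, which is the choice that reproduces the value 4.6 quoted on the slide; the function name is illustrative.

```python
import math

def normalized_read_length(read_len, genome_len):
    """L-bar = L / log G, using the natural logarithm (an assumption that
    matches the slide's numerical example)."""
    return read_len / math.log(genome_len)

# L = 100 and G = 3e9, as in the slide's example.
print(round(normalized_read_length(100, 3e9), 1))  # 4.6
```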
![Page 16: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/16.jpg)
Result: Sequencing Capacity
$$H_2(p) = -\log \sum_{i=1}^{4} p_i^2$$
Renyi entropy of order 2
$C = 0$ if $\bar{L} < 2/H_2(p)$
$C = \bar{L}$ if $\bar{L} > 2/H_2(p)$
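A minimal sketch of how this result could be evaluated numerically. It assumes natural logarithms and takes the threshold $2/H_2(p)$ from the no-duplication constraint derived later in the talk; the function names are illustrative.

```python
import math

def renyi2(p):
    """Renyi entropy of order 2 (in nats) of a distribution p over A, C, G, T."""
    return -math.log(sum(q * q for q in p))

def sequencing_capacity(read_len, genome_len, p):
    """Capacity under the simple i.i.d., noiseless model.

    Returns L-bar when L-bar > 2 / H2(p) and 0 otherwise, where
    L-bar = L / log G (natural log assumed).
    """
    l_bar = read_len / math.log(genome_len)
    return l_bar if l_bar > 2.0 / renyi2(p) else 0.0

uniform = [0.25, 0.25, 0.25, 0.25]
print(renyi2(uniform))                         # log 4, about 1.386
print(sequencing_capacity(100, 3e9, uniform))  # about 4.58
print(sequencing_capacity(20, 3e9, uniform))   # 0.0: reads too short
```

A skewed base distribution lowers $H_2(p)$ and therefore raises the minimum normalized read length needed for a nonzero capacity.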
![Page 17: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/17.jpg)
Coverage Constraint
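[Figure: $N$ reads of length $L$ on a DNA sequence of length $G$; $T$ is the spacing between the starting positions of successive reads]
Starting positions of reads ~ Poisson(1/R), with $R = G/N$
$$E[\#\text{ of gaps}] = N \cdot P[T > L] = N e^{-L/R}$$
$$E[\#\text{ of gaps}] \to 0 \iff R < \bar{L}$$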
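The coverage calculation can be turned into a small calculator. The sketch below uses the approximation $E[\#\text{ of gaps}] = N e^{-L/R}$ with $R = G/N$; the target of 0.01 expected gaps and the function names are illustrative assumptions.

```python
import math

def expected_gaps(num_reads, read_len, genome_len):
    """Approximate expected number of gaps: N * exp(-L / R), with R = G / N."""
    rate = genome_len / num_reads
    return num_reads * math.exp(-read_len / rate)

def min_reads_for_coverage(read_len, genome_len, max_gaps=0.01):
    """Smallest N at or beyond the peak N = G / L with expected gaps <= max_gaps.

    N * exp(-N * L / G) rises until N = G / L and falls afterwards, so the
    search starts at the peak, doubles to bracket the answer, then bisects.
    """
    lo = max(1, math.ceil(genome_len / read_len))
    if expected_gaps(lo, read_len, genome_len) <= max_gaps:
        return lo
    hi = 2 * lo
    while expected_gaps(hi, read_len, genome_len) > max_gaps:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_gaps(mid, read_len, genome_len) <= max_gaps:
            hi = mid
        else:
            lo = mid
    return hi

n = min_reads_for_coverage(100, 10_000)
print(n, expected_gaps(n, 100, 10_000))
```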
![Page 18: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/18.jpg)
No-Duplication Constraint
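$$E[\#\text{ of duplicated pairs}] \approx N^2 \cdot \left( \sum_{i=1}^{4} p_i^2 \right)^{L} = N^2 e^{-L H_2(p)}$$
$$E[\#\text{ of duplicated pairs}] \to 0 \iff \bar{L} > \frac{2}{H_2(p)}$$
[Figure: segments of length $L$]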
The two possibilities have the same set of length L subsequences.
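To get a feel for the no-duplication constraint, the sketch below evaluates the approximation $N^2 (\sum_i p_i^2)^L$ and the smallest read length satisfying $L > 2 \log G / H_2(p)$. Natural logarithms and the function names are assumptions made for illustration.

```python
import math

def collision_probability(p):
    """Probability that two independent bases drawn from p are equal."""
    return sum(q * q for q in p)

def expected_duplicated_pairs(num_reads, read_len, p):
    """Approximation N^2 * (sum_i p_i^2)^L for disjoint read pairs."""
    return num_reads ** 2 * collision_probability(p) ** read_len

def min_read_length_no_duplication(genome_len, p):
    """Smallest integer L with L > 2 * log(G) / H2(p) (natural log assumed)."""
    h2 = -math.log(collision_probability(p))
    return math.floor(2.0 * math.log(genome_len) / h2) + 1

uniform = [0.25, 0.25, 0.25, 0.25]
print(min_read_length_no_duplication(3e9, uniform))  # 32
```

Under this simple model, reads of about 32 bases would already clear the duplication threshold for a uniform i.i.d. sequence of human-genome length; real genomes contain repeats, which the talk addresses later.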
![Page 19: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/19.jpg)
Achievability
coverage constraint
no-duplication constraint
achievable?
![Page 20: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/20.jpg)
Greedy Algorithm
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads.
2. Find two contigs with largest overlap and merge them into a new contig.
3. Repeat step 2 until only one contig remains or no more merging can be done.
Algorithm progresses in stages: at stage $\ell$, merge reads at overlap $\ell$, for $\ell = L-1, L-2, \ldots, 0$.
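The three steps above can be sketched as follows. This is a quadratic-time illustration for noiseless reads, not the implementation analyzed in the talk: it removes exact duplicate reads first, breaks ties arbitrarily, and keeps merging down to overlap 0, so it always ends with a single contig. The function names and the toy sequence are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = sorted(set(reads))  # step 1: the initial contigs are the reads
    while len(contigs) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # step 2: merge the best pair, keeping the overlapping part once
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs[0]

# Toy example: every length-6 read of a short sequence with no repeated 5-mer.
genome = "ATGGCGTACCTAGATTC"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
print(greedy_assemble(reads) == genome)  # True
```

Picking the globally largest overlap at each step is equivalent to the staged description: all merges at overlap $L-1$ happen before any merge at overlap $L-2$, and so on.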
![Page 21: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/21.jpg)
Greedy algorithm: the beginning
gap
Most reads have large overlap with neighbors
Expected # of errors in stage $L-1$ is governed by the probability that two disjoint reads are equal.
Very small since no-duplication constraint is satisfied.
![Page 22: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/22.jpg)
Greedy algorithm: stage $\ell$
Expected # of errors at stage $\ell$ is governed by the probability that two disjoint reads appear to overlap.
This may get larger, but no larger than its value when $\ell = 0$: $\left( N e^{-L/R} \right)^2$
Very small since coverage constraint is satisfied.
![Page 23: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/23.jpg)
Summary: Two Regimes
coverage-limited regime
duplication-limited regime
![Page 24: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/24.jpg)
Relation to Earlier Works
• Coverage constraint: Lander-Waterman 88
• No-duplication constraint: Arratia et al. 96
• Arratia et al. focused on a model where all length $L$ subsequences are given (sequencing by hybridization)
• Our result: the two constraints together are necessary and sufficient for shotgun sequencing.
![Page 25: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/25.jpg)
Rest of Talk
• Impact of read noise.
• Impact of repeats in DNA sequence.
![Page 26: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/26.jpg)
Read Noise
Model:
discrete memoryless channel defined by transition probabilities
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA
![Page 27: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/27.jpg)
Modified Greedy algorithm
We observe two strings: X and Y.
Do we merge the two reads at overlap $\ell$?
This is a hypothesis testing problem!
Are they noisy versions of the same DNA subsequence? (merge)
Or from two different locations? (do not merge)
![Page 28: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/28.jpg)
Impact on Sequencing Rate
• Hypothesis test:
H0: X and Y are noisy versions of the same DNA subsequence (merge)
H1: X and Y are from disjoint DNA subsequences (do not merge)
MAP rule: declare H0 if the likelihood ratio of H0 to H1 exceeds a threshold.
Two types of error:
• missed detection (new type of error)
• false positive (same as before)
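A minimal sketch of such a merge test, under assumptions that are not from the talk: the DNA sequence is i.i.d. uniform over A, C, G, T, and the read channel is a symmetric substitution channel that replaces a base with each of the three other bases with probability `eps / 3`. The threshold of 0 is a placeholder; the talk obtains the threshold by optimization.

```python
import math

def log_likelihood_ratio(x, y, eps):
    """log P(x, y | H0) - log P(x, y | H1) for equal-length strings x and y.

    H0: x and y are independent noisy copies of the same subsequence.
    H1: x and y come from disjoint (independent) subsequences.
    Assumes 0 < eps < 3/4, a uniform i.i.d. source, and a symmetric
    substitution channel (illustrative modeling choices).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Probability that the two noisy copies agree at one position under H0.
    agree = (1 - eps) ** 2 + eps ** 2 / 3
    p_equal_h0 = agree / 4            # joint probability of one matching pair
    p_differ_h0 = (1 - agree) / 12    # joint probability of one mismatching pair
    p_h1 = 1 / 16                     # independent uniform bases under H1
    llr = 0.0
    for a, b in zip(x, y):
        llr += math.log((p_equal_h0 if a == b else p_differ_h0) / p_h1)
    return llr

def should_merge(x, y, eps, threshold=0.0):
    """Declare H0 (merge) when the log-likelihood ratio reaches the threshold."""
    return log_likelihood_ratio(x, y, eps) >= threshold

print(should_merge("ACGTACGTAC", "ACGTACGTAG", eps=0.05))  # one mismatch
print(should_merge("ACGTACGTAC", "TTGCAGTCCA", eps=0.05))  # mostly mismatches
```

Raising the threshold trades false positives for missed detections, which are the two error types listed above.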
![Page 29: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/29.jpg)
Impact on Sequencing Rate
coverage constraint
no-duplication constraint
obtained by optimizing MAP threshold
![Page 30: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/30.jpg)
More Complex DNA Statistics
• i.i.d. is not a very good model for the DNA sequence.
• More generally, we may want to model it as a correlated random process.
• For short-scale correlation, H2(p) can be replaced by the Renyi entropy rate of the process.
• But for higher mammals, DNA contains long repeats, with repeat lengths comparable to or longer than the read length.
• This is handled by paired-end reads in practice.
Renyi entropy rate:
$$H_2 = \lim_{\ell \to \infty} -\frac{1}{\ell} \log P\left(x^{\ell} = y^{\ell}\right)$$
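As one concrete instance of a correlated source, the sketch below computes the Renyi entropy rate of a first-order Markov chain. It interprets $x^{\ell}$ and $y^{\ell}$ as independent length-$\ell$ realizations of the process and uses the fact that $P(x^{\ell} = y^{\ell})$ then decays like $\lambda^{\ell}$, where $\lambda$ is the largest eigenvalue of the matrix of squared transition probabilities. The Markov model, the requirement that all transition probabilities are positive, and the function name are assumptions for illustration.

```python
import math

def renyi2_rate_markov(P, iters=1000):
    """Renyi entropy rate of order 2 (nats per symbol) of a Markov chain.

    P is a row-stochastic transition matrix with strictly positive entries.
    The rate equals -log of the largest eigenvalue of Q, where
    Q[i][j] = P[i][j] ** 2; the eigenvalue is found by power iteration.
    """
    n = len(P)
    Q = [[P[i][j] ** 2 for j in range(n)] for i in range(n)]
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)                # converges to the largest eigenvalue
        v = [x / lam for x in w]
    return -math.log(lam)

# Sanity check: an i.i.d. source is a Markov chain with identical rows.
p = [0.4, 0.3, 0.2, 0.1]
print(renyi2_rate_markov([p] * 4), -math.log(sum(q * q for q in p)))

# A "sticky" chain that tends to repeat the previous base has a lower rate.
sticky = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
print(renyi2_rate_markov(sticky))
```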
![Page 31: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/31.jpg)
A Simple Model for Repeats
K
Model: M repeats of length K placed uniformly into DNA sequence
If repeat length $K \gg$ read length $L$, how to reconstruct the sequence?
Use paired-end reads:
Reads come in pairs with known separation $J$. These reads can bridge the repeats.
![Page 32: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/32.jpg)
Impact on Sequencing Rate
coverage constraint
no-duplication constraint
coverage of repeats constraint
If $J > 2d + K$ then capacity is the same as without repeats
constant independent of $K$
$K$ = repeat length, $J$ = paired-end separation
![Page 33: Information Theory of DNA Sequencing](https://reader036.vdocuments.net/reader036/viewer/2022062521/5681664c550346895dd9c6b3/html5/thumbnails/33.jpg)
Conclusion
• DNA sequencing is an important problem.
• Many new technologies and new applications.
• An analogy between sequencing and communication is drawn.
• A notion of sequencing capacity is formulated.
• A principled design framework?