DNA Sequencing
Lecture 9, Tuesday April 29, 2003
Reading
Basic:
ARACHNE: A Whole-Genome Shotgun Assembler
Euler: A shotgun assembler based on finding Eulerian paths
Optional:
Transposons; Genome Sizes; ARACHNE 2: Assembly of the mouse genome
Skim through the following 2 free Nature issues: Mouse (December 2002); 50-year anniversary (last week!)
DNA sequencing – vectors
Shake DNA to obtain DNA fragments
Vector: circular genome (bacterium, plasmid)
DNA fragment + vector = fragment inserted at a known location (restriction site)
DNA sequencing – gel electrophoresis
Start at primer (restriction site)
Grow DNA chain
Include dideoxynucleosides (modified a, c, g, t)
Stops reaction at all possible points
Separate products by length, using gel electrophoresis
Output of PHRED: a read
A read: 500-700 nucleotides
Bases: A C G A A T C A G … A
Quality: 16 18 21 23 25 15 28 30 32 21
Quality scores: -10 × log10 Prob(Error)
Reads can be obtained from leftmost, rightmost ends of the insert
Double-barreled sequencing: both leftmost & rightmost ends are sequenced
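The quality-score formula on this slide can be made concrete with a small sketch. The function names below are illustrative, not part of any base-calling tool; the only thing taken from the slide is the relation q = -10 × log10 Prob(Error).

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert q = -10 * log10(p): return the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Quality values copied from the example read on the slide.
    for q in [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]:
        print(q, f"{phred_to_error_prob(q):.4f}")
```

Under this relation a score of 20 means a 1% chance that the base is wrong, and a score of 30 means 0.1%.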
Method to sequence segments longer than 500 bp
cut many times at random (Shotgun)
genomic segment
Get one or two reads from each segment
~500 bp ~500 bp
Reconstructing the Sequence (Fragment Assembly)
Cover region with ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region
reads
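As a rough illustration of what 7X redundancy means, coverage can be computed as (number of reads × read length) / region length. The sketch below uses hypothetical function names; the e^(-coverage) estimate of the uncovered fraction is the standard Poisson approximation for uniformly random reads, which is an assumption added here and not something stated on the slide.

```python
import math


def coverage(num_reads: int, read_len: int, region_len: int) -> float:
    """Average number of reads covering each base of the region."""
    return num_reads * read_len / region_len


def reads_needed(target_cov: float, read_len: int, region_len: int) -> int:
    """Number of reads required to reach the target coverage."""
    return math.ceil(target_cov * region_len / read_len)


def expected_uncovered_fraction(cov: float) -> float:
    """Poisson approximation (assumption): fraction of bases hit by no read."""
    return math.exp(-cov)


if __name__ == "__main__":
    n = reads_needed(7, 500, 100_000)   # 7X over a 100 Kb region, 500 bp reads
    print(n, coverage(n, 500, 100_000), expected_uncovered_fraction(7))
```

With 500 bp reads, a 100 Kb region needs 1,400 reads for 7X, and under the Poisson assumption less than 0.1% of bases remain uncovered.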
Challenges with Fragment Assembly
• Sequencing errors: ~1-2% of bases are wrong
• Repeats: false overlap due to repeat
• Computation: ~ O(N^2), where N = # reads
Strategies for sequencing a whole genome
1. Hierarchical – Clone-by-clone
   i. Break genome into many long pieces
   ii. Map each long piece onto the genome
   iii. Sequence each piece with shotgun
   Example: Yeast, Worm, Human, Rat
2. Online version of (1) – Walking
   i. Break genome into many long pieces
   ii. Start sequencing each piece with shotgun
   iii. Construct map as you go
   Example: Rice genome
3. Whole genome shotgun
   One large shotgun pass on the whole genome
   Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu
Hierarchical Sequencing Strategy
1. Obtain a large collection of BAC clones
2. Map them onto the genome (Physical Mapping)
3. Select a minimum tiling path
4. Sequence each clone in the path with shotgun
5. Assemble
6. Put everything together
[Figure labels: a BAC clone; map; genome]
2. Digestion
Restriction enzymes cut DNA where specific words appear
1. Cut each clone separately with an enzyme
2. Run fragments on a gel and measure length
3. If clones Ca and Cb both have fragments of lengths { li, lj, lk }, they overlap
Double digestion: cut with enzyme A, enzyme B, then enzymes A + B
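The fingerprint comparison in step 3 can be sketched as follows. This is a minimal illustration, not a real physical-mapping tool: the function names, the 2% length tolerance (gel measurements are imprecise), and the threshold of three shared fragments are all assumptions.

```python
def shared_fragments(frags_a, frags_b, tol=0.02):
    """Greedily count fragments of clone A that match a not-yet-used
    fragment of clone B in length, within a relative tolerance."""
    remaining = sorted(frags_b)
    shared = 0
    for la in sorted(frags_a):
        for i, lb in enumerate(remaining):
            if abs(la - lb) <= tol * max(la, lb):
                shared += 1
                del remaining[i]   # each fragment of B is matched at most once
                break
    return shared


def likely_overlap(frags_a, frags_b, min_shared=3, tol=0.02):
    """Call two clones overlapping if they share enough fragment lengths."""
    return shared_fragments(frags_a, frags_b, tol) >= min_shared


if __name__ == "__main__":
    clone_a = [1200, 3400, 560, 8000, 2100]   # hypothetical fragment lengths (bp)
    clone_b = [3410, 555, 2100, 9500, 700]
    print(shared_fragments(clone_a, clone_b), likely_overlap(clone_a, clone_b))
```

Sharing a few fragment lengths can also happen by chance, which is one reason a second enzyme (double digestion) helps: it gives an independent fingerprint of the same clones.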
Online Clone-by-clone: The Walking Method
The Walking Method
1. Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
2. Sequence some “seed” clones
3. “Walk” from seeds using clone-ends to pick library clones that extend left & right
Walking: An Example
Advantages & Disadvantages of Hierarchical Sequencing
Hierarchical Sequencing
– ADV. Easy assembly
– DIS. Build library & physical map; redundant sequencing
Whole Genome Shotgun (WGS)
– ADV. No mapping, no redundant sequencing
– DIS. Difficult to assemble and resolve repeats
The Walking method – motivation
Sequence the genome clone-by-clone without a physical map
The only costs involved are:
– Library of end-sequenced clones (CHEAP)
– Sequencing
Walking off a Single Seed
• Little redundant sequencing
• Many sequential steps
Walking off a single clone is impractical
Cycle time to process one clone: 1-2 months
1. Grow clone
2. Prepare & shear DNA
3. Prepare shotgun library & perform shotgun
4. Assemble in a computer
5. Close remaining gaps
A mammalian genome would need 15,000 walking steps!
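The 15,000-step figure is consistent with simple arithmetic under two assumptions that are not on the slide: a genome of about 3 Gb and clones of about 200 Kb, walked end to end with no overlap. A minimal sketch with hypothetical names:

```python
GENOME_BP = 3_000_000_000   # assumed mammalian genome size
CLONE_BP = 200_000          # assumed clone (BAC) insert size


def walking_steps(genome_bp: int, clone_bp: int) -> int:
    """Steps needed if each step adds one non-overlapping clone."""
    return genome_bp // clone_bp


def serial_years(steps: int, months_per_step: float) -> float:
    """Total time if the steps cannot be run in parallel."""
    return steps * months_per_step / 12


if __name__ == "__main__":
    steps = walking_steps(GENOME_BP, CLONE_BP)
    print(steps, serial_years(steps, 1), serial_years(steps, 2))
```

At the slide's 1-2 months per clone, a strictly sequential walk would take well over a thousand years, which is why walking off a single seed is impractical.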
Walking off Several Seeds in Parallel
• Few sequential steps
• Additional redundant sequencing
In general, can sequence a genome in ~5 walking steps, with <20% redundant sequencing
[Figure labels: Efficient; Inefficient]
Using Two Libraries
Most inefficiency comes from closing a small ocean (an unsequenced gap between sequenced islands) with a much larger clone
Solution: Use a second library of small clones
Whole-Genome Shotgun Sequencing
cut genome many times at random
forward-reverse linked reads (~500 bp each), at a known distance apart:
– plasmids (2 – 10 Kbp)
– cosmids (40 Kbp)
ARACHNE: Steps to Assemble a Genome
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
1. Find Overlapping Reads
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment; throw away if not >95% similar
[Figure: two reads share the k-mer TAGATTACACAGATTAC exactly; the match is then extended in both directions, where mismatches appear]
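The three bullets on this slide can be sketched in a few lines. This is a toy illustration, not ARACHNE's code: the function names are hypothetical, the "full alignment" is replaced by an ungapped comparison along the diagonal implied by the shared k-mer, reverse complements are ignored, and the example uses k = 8 on short strings where the slide suggests k ~ 24.

```python
from itertools import combinations, groupby


def ungapped_identity(a: str, b: str, offset: int) -> float:
    """Fraction of matching positions when b[j] is placed under a[j + offset]."""
    start = max(0, offset)
    end = min(len(a), len(b) + offset)
    if end <= start:
        return 0.0
    matches = sum(a[i] == b[i - offset] for i in range(start, end))
    return matches / (end - start)


def find_overlaps(reads, k, min_identity=0.95):
    """Return sorted (read_a, read_b, offset) triples with read_a < read_b."""
    # Step 1: list every k-mer with its read id and position, then sort.
    occurrences = sorted(
        (seq[i:i + k], rid, i)
        for rid, seq in enumerate(reads)
        for i in range(len(seq) - k + 1)
    )
    # Step 2: equal k-mers are now adjacent; pair up hits from different reads.
    seeds = set()
    for _, group in groupby(occurrences, key=lambda t: t[0]):
        for (_, ra, pa), (_, rb, pb) in combinations(list(group), 2):
            if ra != rb:            # sorted order guarantees ra < rb here
                seeds.add((ra, rb, pa - pb))
    # Step 3: extend each seed (ungapped, as a simplification) and filter.
    return sorted(
        s for s in seeds
        if ungapped_identity(reads[s[0]], reads[s[1]], s[2]) > min_identity
    )


if __name__ == "__main__":
    reads = ["TAGATTACACAGATTAC", "ACACAGATTACTGATT", "GGGGGGGGCCCCCCCC"]
    print(find_overlaps(reads, k=8))
```

Sorting makes identical k-mers adjacent, so candidate pairs are found without comparing every read against every other read.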
1. Find Overlapping Reads
One caveat: repeats
A k-mer that appears N times initiates N^2 comparisons
ALU: 1,000,000 times
Solution:
Discard all k-mers that appear more than c × Coverage times (c ~ 10)
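The filtering rule can be sketched as below. The function names are illustrative, and the reading of the rule as "keep a k-mer only if its count is at most c × coverage" is an interpretation of the slide, not a quotation of ARACHNE's implementation.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def usable_kmers(reads, k, coverage, c=10):
    """Keep only k-mers seen at most c * coverage times.

    A unique genomic k-mer should appear about `coverage` times in the reads,
    so a much higher count suggests a repeat.
    """
    limit = c * coverage
    return {kmer for kmer, n in count_kmers(reads, k).items() if n <= limit}


if __name__ == "__main__":
    reads = ["AAAAAA", "ACGT"]            # toy data: "AA" is heavily repeated
    print(sorted(usable_kmers(reads, k=2, coverage=1, c=3)))
```

Dropping these k-mers removes the seeds that would otherwise trigger a quadratic number of read comparisons inside high-copy repeats.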
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
1. Find Overlapping Reads (cont’d)
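• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Substitution column (base: quality), before correction: C: 20, C: 35, T: 30, C: 35, C: 40
Substitution column, after correction: C: 20, C: 35, C: 0, C: 35, C: 40
Deletion column, before correction: A: 15, A: 25, A: 40, A: 25, -
Deletion column, after correction: A: 15, A: 25, A: 40, A: 25, A: 0
• Score alignments
• Accept alignments with good scores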
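The two corrected columns on this slide can be reproduced with a small sketch. It assumes a quality-weighted vote per column and assigns quality 0 to every corrected base, as in the slide's example; the function name and the choice of weighting are assumptions, not ARACHNE's documented rule.

```python
from collections import Counter


def correct_column(column):
    """column: list of (base, quality) pairs from one alignment column.

    The base with the largest total quality wins; any read that disagrees
    is changed to that base and given quality 0. A gap is written as "-"
    with quality 0.
    """
    weights = Counter()
    for base, quality in column:
        weights[base] += quality
    winner = max(weights, key=weights.get)
    return [(b, q) if b == winner else (winner, 0) for b, q in column]


if __name__ == "__main__":
    substitution = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
    deletion = [("A", 15), ("A", 25), ("A", 40), ("A", 25), ("-", 0)]
    print(correct_column(substitution))
    print(correct_column(deletion))
```

Setting the corrected base's quality to 0 records that the base was inferred from the other reads, so it carries no weight in later voting.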
Basic principle of assembly
Repeats confuse us
Ability to merge two reads is related to our ability to detect repeats
We can dismiss as repeat any overlap of < t% similarity
Role of error correction:
Discards ~90% of single-letter sequencing errors
As a result, the threshold t% can be increased
2. Merge Reads into Contigs (cont’d)
Merge reads up to potential repeat boundaries (Myers, 1995)
repeat region
2. Merge Reads into Contigs (cont’d)
• Ignore non-maximal reads
• Merge only maximal reads into contigs
repeat region
2. Merge Reads into Contigs (cont’d)
• Ignore “hanging” reads, when detecting repeat boundaries
[Figure labels: sequencing error; repeat boundary???; reads a, b]
2. Merge Reads into Contigs (cont’d)
• Insert non-maximal reads whenever unambiguous
[Figure labels: ?????; Unambiguous]
3. Link Contigs into Supercontigs
Too dense: Overcollapsed?
(Myers et al. 2000)
Inconsistent links: Overcollapsed?
Normal density
3. Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
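A minimal sketch of this linking rule: count the forward-reverse read pairs that connect each pair of contigs, and join two contigs only when at least `min_links` pairs support it (2 here, following the slide). The names are hypothetical, and the sketch ignores link orientation and distance consistency, which a real scaffolder must check.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """Group contigs joined by at least min_links read-pair links.

    contigs: iterable of contig names
    links:   iterable of (contig_x, contig_y) pairs, one per read pair
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find root lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    support = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in support.items():
        if a != b and n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"),
             ("c3", "c4"), ("c4", "c3"), ("c3", "c4")]
    print(build_supercontigs(contigs, links))
```

Requiring more than one link guards against a single chimeric or misplaced read pair joining unrelated contigs.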
3. Link Contigs into Supercontigs (cont’d)
Fill gaps in supercontigs with paths of overcollapsed contigs
3. Link Contigs into Supercontigs (cont’d)
Define G = ( V, E )
V := contigs
E := ( A, B ) such that d( A, B ) < C
Reason to do so: Efficiency; full shortest paths cannot be computed
[Figure: d( A, B ) is the distance between Contig A and Contig B]
3. Link Contigs into Supercontigs (cont’d)
Define T: contigs linked to either A or B
Fill gap between A and B if there is a path in G passing only through contigs in T
[Figure labels: Contig A; Contig B]
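The graph definition and the gap-filling test can be sketched together. This is an illustration under stated assumptions: the graph is treated as undirected, distances are given as a plain dictionary, the path search is a breadth-first search, and all function names are hypothetical.

```python
from collections import deque


def build_graph(distances, max_dist):
    """G = (V, E): contigs are vertices; (A, B) is an edge if d(A, B) < max_dist.

    distances: dict mapping (contig_x, contig_y) to an estimated distance.
    """
    graph = {}
    for (a, b), d in distances.items():
        if d < max_dist:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def gap_path(graph, a, b, allowed):
    """Return a path from a to b whose interior contigs all lie in `allowed`
    (the set T), or None if no such path exists."""
    prev = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in prev and (nxt == b or nxt in allowed):
                prev[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    distances = {("A", "r1"): 100, ("r1", "r2"): 200, ("r2", "B"): 150,
                 ("A", "x"): 50, ("x", "B"): 60, ("A", "B"): 5000}
    g = build_graph(distances, max_dist=1000)
    print(gap_path(g, "A", "B", allowed={"r1", "r2"}))
```

Capping edges at distance C keeps the graph sparse, and restricting the search to T keeps each gap-filling query local.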
4. Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive each consensus base by weighted voting
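Weighted voting can be sketched on the five aligned reads from this slide, writing `-` for the blank (gap) positions, which is an assumption about what the blanks mean. The function name is hypothetical, and because the slide gives no per-base weights for these reads, the sketch defaults to a weight of 1 per base; real weights would come from the quality scores.

```python
from collections import defaultdict

# Aligned reads from the slide; "-" marks a gap.
ROWS = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]


def weighted_consensus(rows, weights=None):
    """Pick, for every column, the symbol with the largest total weight.

    rows:    equal-length aligned strings
    weights: optional per-row lists of per-column weights (default: all 1)
    """
    if weights is None:
        weights = [[1] * len(r) for r in rows]
    consensus = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, w in zip(rows, weights):
            votes[row[col]] += w[col]
        consensus.append(max(votes, key=votes.get))
    return "".join(consensus)


if __name__ == "__main__":
    print(weighted_consensus(ROWS))
    # One confident read can outvote two weak ones.
    print(weighted_consensus(["A", "C", "C"], weights=[[50], [10], [10]]))
```

With equal weights the vote reduces to a simple majority; with quality-based weights a single high-quality base can outvote several low-quality ones.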
Simulated Whole Genome Shotgun
• Known genomes: Flu, yeast, fly, Human chromosomes 21, 22
• Make “realistic” shotgun reads
• Run ARACHNE
• Align output with genome and compare
Making a Simulated Read
Simulated reads have error patterns taken from random real reads
[Figure: ERRORIZER takes an artificial shotgun read and a real read and outputs a simulated read]
Human 22, Results of Simulations
| Plasmid / Cosmid coverage | 10X / 0.5X | 5X / 0.5X | 3X / 0X |
|---|---|---|---|
| N50 contig | 353 Kb | 15 Kb | 2.7 Kb |
| Mean contig | 142 Kb | 10.6 Kb | 2.0 Kb |
| N50 scaffold | 3 Mb | 3 Mb | 4.1 Kb |
| Avg base qual | 41 | 32 | 26 |
| % > 2 kb | 97.3 | 91.1 | 67 |
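The N50 statistic used in these results is the length L such that contigs (or scaffolds) of length at least L contain at least half of all assembled bases. A minimal sketch, with a hypothetical function name and made-up lengths:

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold >= 50% of the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    print(n50([1, 1, 1, 1, 10]))
```

Unlike the mean contig length, N50 is barely affected by a large number of very short contigs, which is why both are reported.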
Neurospora crassa Genome (Real Data)
• 40 Mb genome, shotgun sequencing complete (WI-CGR)
• Evaluated assembly using 1.5 Mb of finished BACs
Coverage:
• 1705 contigs, 368 supercontigs
• 1% uncovered (of finished BACs)
Efficiency:
• Time: 20 hr
• Memory: 9 Gb
Accuracy:
• < 3 misassemblies compared with 1 Gb of finished sequence
• Errors per 10^6 letters: Subst. 260; Indel 164
Mouse Genome
Improved version of ARACHNE assembled the mouse genome
Several heuristics for iteratively:
• Breaking supercontigs that are suspicious
• Rejoining supercontigs
Size of problem: 32,000,000 reads
Time: 15 days, 1 processor
Memory: 28 Gb
N50 contig size: 16.3 Kb / 24.8 Kb
N50 supercontig size: 0.265 Mb / 16.9 Mb